CN113065551B - Method for performing image segmentation using deep neural network model - Google Patents

Method for performing image segmentation using deep neural network model

Info

Publication number
CN113065551B
CN113065551B (application number CN202110294862.XA)
Authority
CN
China
Prior art keywords: semantic features, scale, decoding, layer, result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110294862.XA
Other languages
Chinese (zh)
Other versions
CN113065551A (en)
Inventor
杨林
亢宇鑫
李涵生
崔磊
费达
付士军
徐黎
杨海英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Diyingjia Technology Co ltd
AstraZeneca Investment China Co Ltd
Original Assignee
Hangzhou Diyingjia Technology Co ltd
AstraZeneca Investment China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Diyingjia Technology Co ltd, AstraZeneca Investment China Co Ltd filed Critical Hangzhou Diyingjia Technology Co ltd
Priority to CN202110294862.XA priority Critical patent/CN113065551B/en
Publication of CN113065551A publication Critical patent/CN113065551A/en
Application granted granted Critical
Publication of CN113065551B publication Critical patent/CN113065551B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure provides a method, apparatus, device, and medium for performing image segmentation using a deep neural network model. The deep neural network model includes an encoder and a decoder, and the method comprises: acquiring an input image containing a region of interest; extracting semantic features of the input image at multiple different scales using a plurality of encoding layers of the encoder; decoding the semantic features of the multiple scales using a plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale; integrating the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and outputting, according to the final prediction result, an output image in which the region of interest is segmented from the input image. The image segmentation method provided by the disclosure improves the overall continuity of the segmented region of interest in the image, thereby improving the accuracy of the segmentation result.

Description

Method for performing image segmentation using deep neural network model
Technical Field
The present disclosure relates to the field of image segmentation, and more particularly, to a method of performing image segmentation using a deep neural network model.
Background
Image segmentation, as an image processing technique, plays an important role in many fields; in the medical field, for example, it is often necessary to segment a region of interest (for example, a tumor region). Tumor region segmentation has traditionally relied on manual identification and delineation, but manual segmentation faces a number of obstacles: (1) the slide volume handled by pathology departments grows by more than 15% per year, greatly increasing the daily diagnostic workload of doctors; (2) the morphological differences between tumor regions and normal regions are small, making interpretation difficult even for pathologists; and (3) the accuracy of clinical pathological diagnosis depends directly on a pathologist's training, experience, and the number of slides read, and experiments show that different pathologists interpreting the same sections often disagree subjectively. For these reasons, an automated tumor region segmentation method is needed to assist pathologists in their daily analysis of pathological sections.
For example, in conventional interpretation of immunohistochemical membrane-stained sections, accurate quantitative statistics of tumor cells are crucial for the graded diagnosis of cancers. Tumor region segmentation helps the pathologist first lock onto a region of interest, then carry out accurate quantitative statistics within that region, and finally reach a pathological diagnosis. Accurate and reliable segmentation of tumor regions is therefore critical for pathological diagnosis.
Although some automated image segmentation methods already exist, the accuracy of their region-of-interest segmentation still falls short of practical requirements. The core idea of conventional automated segmentation methods is to classify every pixel of an image independently, output a mask, and thereby segment the whole image. A purely pixel-based segmentation method, however, loses the continuity of whole objects in the image, which lowers the accuracy of the segmentation result.
Therefore, an image segmentation method that can more accurately segment a region of interest is required.
Disclosure of Invention
In view of the above problems, the present disclosure provides a method for performing image segmentation using a deep neural network model, which can obtain accurate region of interest segmentation results by extracting multi-scale semantic features of an image and integrating prediction results corresponding to the multi-scale semantic features.
Embodiments of the present disclosure provide a method of performing image segmentation using a deep neural network model, wherein the deep neural network model includes an encoder and a decoder, the method comprising: acquiring an input image containing a region of interest; extracting semantic features of the input image at multiple different scales using a plurality of encoding layers of the encoder; decoding the semantic features of the multiple scales using a plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale; integrating the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and outputting an output image in which the region of interest is segmented from the input image according to the final prediction result.
According to an embodiment of the disclosure, the plurality of encoding layers includes N encoding layers, wherein the size of the semantic features of the n-th scale extracted by the n-th encoding layer is smaller than the size of the semantic features of the (n-1)-th scale extracted by the (n-1)-th encoding layer, the N-th scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, and n is less than or equal to N and greater than or equal to 2.
According to an embodiment of the present disclosure, the coding layer includes a convolution layer, a pooling layer, a batch normalization layer, and an activation layer.
According to an embodiment of the present disclosure, N is 4, and extracting semantic features of the input image at multiple different scales using the plurality of encoding layers of the encoder includes: extracting semantic features of a first scale of the input image using a first encoding layer of the plurality of encoding layers; extracting semantic features of a second scale of the input image using a second encoding layer of the plurality of encoding layers based on the semantic features of the first scale; extracting semantic features of a third scale of the input image using a third encoding layer of the plurality of encoding layers based on the semantic features of the second scale; and extracting semantic features of a fourth scale of the input image using a fourth encoding layer of the plurality of encoding layers based on the semantic features of the third scale.
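As a non-limiting sketch only (not part of the original disclosure), one such encoding layer and a cascade of four layers might look as follows. PyTorch, the 3×3 kernels, the use of max pooling, and the channel counts (which follow the example shapes given later with reference to fig. 2) are all assumptions for illustration.

```python
import torch
from torch import nn

class EncodingLayer(nn.Module):
    """One encoding layer: convolution + batch normalization + activation + pooling.
    Kernel size and pooling choice are illustrative assumptions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),  # halves the spatial size at each scale
        )

    def forward(self, x):
        return self.block(x)

# Four cascaded encoding layers (N = 4), each consuming the previous scale's features.
encoder = nn.ModuleList([
    EncodingLayer(1, 64),     # scale 1: 512x512x1   -> 256x256x64
    EncodingLayer(64, 128),   # scale 2: 256x256x64  -> 128x128x128
    EncodingLayer(128, 256),  # scale 3: 128x128x128 -> 64x64x256
    EncodingLayer(256, 512),  # scale 4: 64x64x256   -> 32x32x512
])

x = torch.randn(1, 1, 512, 512)
features = []
for layer in encoder:
    x = layer(x)
    features.append(x)  # multi-scale semantic features, from the largest to the smallest scale
```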
According to an embodiment of the present disclosure, after the semantic features of the Nth scale are extracted by the Nth encoding layer, the method further includes: inputting the semantic features of the Nth scale into a spatial pyramid structure for further semantic feature extraction.
According to an embodiment of the present disclosure, the spatial pyramid structure includes: a plurality of first convolution layers having different dilation rates, configured to perform dilated (atrous) convolution on the semantic features of the Nth scale, respectively, so as to further capture multi-scale information; a second convolution layer configured to further convolve the semantic features of the Nth scale to enhance the coupling between the semantic feature channels; and a pooling layer configured to pool the input image to obtain image-level semantic features.
According to an embodiment of the present disclosure, the plurality of decoding layers includes N decoding layers, wherein the decoding the semantic features of the multiple scales with the plurality of decoding layers of the decoder includes: based on the semantic features of the nth scale extracted by the nth encoding layer, decoding is performed by an nth decoding layer corresponding to the nth encoding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale.
According to an embodiment of the present disclosure, N is 4, the decoding with an nth decoding layer corresponding to an nth encoding layer based on the semantic features of an nth scale extracted by the nth encoding layer to obtain an nth prediction result includes: decoding with a first decoding layer corresponding to the first encoding layer based on semantic features of a first scale extracted by the first encoding layer to obtain a first prediction result; decoding with a second decoding layer corresponding to the second encoding layer based on semantic features of a second scale extracted by the second encoding layer to obtain a second prediction result; decoding with a third decoding layer corresponding to the third encoding layer based on semantic features of a third scale extracted by the third encoding layer to obtain a third prediction result; based on the semantic features of the fourth scale extracted by the fourth coding layer, decoding is performed by a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result.
According to an embodiment of the present disclosure, decoding the semantic features of the multiple scales with the plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale includes: fusing the semantic features of the n-th scale with the semantic features of the (n-1)-th scale, and decoding with the (n-1)-th decoding layer based on the fused semantic features to obtain the (n-1)-th prediction result.
According to an embodiment of the disclosure, fusing the semantic features of the n-th scale with the semantic features of the (n-1)-th scale and decoding with the (n-1)-th decoding layer based on the fused semantic features to obtain the (n-1)-th prediction result includes: upsampling the semantic features of the n-th scale, splicing the upsampled result with the semantic features of the (n-1)-th scale, upsampling the spliced result, and decoding with the (n-1)-th decoding layer based on the upsampled semantic features to obtain the (n-1)-th prediction result.
According to an embodiment of the present disclosure, N is 4, the decoding the semantic features of the multiple scales with multiple decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale includes: upsampling the fourth scale semantic features and decoding with a fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result; upsampling the semantic features of the fourth scale, splicing the upsampled result with the semantic features of the third scale to obtain a first spliced result, upsampling the first spliced result, and decoding by using a third decoding layer based on the upsampled semantic features to obtain a third prediction result; upsampling the first splicing result, splicing the upsampled result with semantic features of a second scale to obtain a second splicing result, upsampling the second splicing result, and decoding by a second decoding layer based on the upsampled semantic features to obtain a second prediction result; and upsampling the second splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a third splicing result, upsampling the third splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
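A minimal sketch of this successive fuse-then-decode procedure is given below for illustration. PyTorch is assumed, each decoding layer is reduced to a single 1×1 convolution producing a two-channel prediction map, and the channel counts in the usage example follow the shapes described later with reference to fig. 2; none of these choices is fixed by the disclosure.

```python
import torch
import torch.nn.functional as F
from torch import nn

def decode_multiscale(features, heads, out_size):
    """features: [f1, f2, f3, f4], largest scale first; heads: one decoding layer per scale."""
    preds = [None] * len(features)
    # n = N: upsample the smallest-scale features and decode them directly.
    top = F.interpolate(features[-1], size=out_size, mode='bilinear', align_corners=False)
    preds[-1] = heads[-1](top)
    fused = features[-1]
    # n = N .. 2: upsample, splice with the (n-1)-th scale, upsample again, decode with layer n-1.
    for n in range(len(features), 1, -1):
        up = F.interpolate(fused, size=features[n - 2].shape[-2:],
                           mode='bilinear', align_corners=False)
        fused = torch.cat([up, features[n - 2]], dim=1)      # splice along the channel dimension
        dec_in = F.interpolate(fused, size=out_size, mode='bilinear', align_corners=False)
        preds[n - 2] = heads[n - 2](dec_in)
    return preds

# Small illustrative spatial sizes (the embodiments use 256/128/64/32 with a 512x512 output).
features = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16), (512, 8)]]
heads = nn.ModuleList([nn.Conv2d(c, 2, kernel_size=1) for c in (960, 896, 768, 512)])
preds = decode_multiscale(features, heads, out_size=(128, 128))  # four 1x2x128x128 maps
```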
According to an embodiment of the present disclosure, decoding the semantic features of the multiple scales with the plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale includes: fusing the semantic features of the m-th scale with the semantic features of one of the 1st to (m-2)-th scales, and decoding with the decoding layer corresponding to that scale based on the fused semantic features to obtain the prediction result corresponding to that scale, where m is less than or equal to N and greater than 2.
According to an embodiment of the present disclosure, fusing the semantic features of the m-th scale with the semantic features of one of the 1st to (m-2)-th scales and decoding with the decoding layer corresponding to that scale based on the fused semantic features to obtain the prediction result corresponding to that scale includes: upsampling the semantic features of the m-th scale, splicing the upsampled result with the semantic features of that scale, upsampling the spliced result, and decoding with the decoding layer corresponding to that scale based on the upsampled semantic features to obtain the prediction result corresponding to that scale.
According to an embodiment of the present disclosure, N is 4, the decoding the semantic features of the multiple scales with multiple decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale includes: upsampling the fourth scale semantic features and decoding with a fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result; upsampling the semantic features of the fourth scale, splicing the upsampled result with the semantic features of the second scale to obtain a first spliced result, upsampling the first spliced result, and decoding by using a second decoding layer based on the upsampled semantic features to obtain a second prediction result; upsampling the semantic features of the third scale and decoding with a third decoding layer based on the upsampled semantic features to obtain a third prediction result; and upsampling the first splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a second splicing result, upsampling the second splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
According to an embodiment of the present disclosure, decoding the semantic features of the multiple scales with the plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale further includes: splicing the further extracted semantic features, then upsampling them, and obtaining the Nth prediction result corresponding to the semantic features of the Nth scale with the Nth decoding layer based on the upsampled semantic features.
According to an embodiment of the present disclosure, the integrating the prediction results corresponding to the semantic features of each scale includes: and integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer with the prediction results of the adjacent upper layer to obtain a final prediction result.
According to an embodiment of the disclosure, integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, sequentially integrating the prediction result of a lower layer with the prediction result of the adjacent upper layer to obtain a final prediction result, includes: in the case where n is greater than 2, integrating the n-th prediction result with the (n-1)-th prediction result, and integrating the integrated result with the (n-2)-th prediction result; in the case where n is equal to 2, integrating the n-th prediction result with the (n-1)-th prediction result.
According to an embodiment of the present disclosure, where N is 4, the integrating the prediction results corresponding to the semantic features of each scale includes: integrating the fourth predicted result with the third predicted result, further integrating the integrated result with the second predicted result, integrating the further integrated result with the first predicted result, and taking the integrated result as a final predicted result.
According to an embodiment of the present disclosure, the integrating includes comparing the two objects to be integrated at the pixel level and taking the maximum prediction value corresponding to each pixel as the final prediction result.
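For illustration, the pixel-level maximum integration described above might be implemented as follows; PyTorch is assumed, and `integrate_predictions` is a hypothetical helper name.

```python
import torch

def integrate_predictions(preds):
    """preds[0] .. preds[N-1] are the 1st .. N-th prediction maps, all of identical shape.
    Starting from the highest (N-th) result, take the element-wise maximum with each
    adjacent lower result in turn; the final tensor is the integrated prediction."""
    result = preds[-1]
    for lower in reversed(preds[:-1]):
        result = torch.maximum(result, lower)  # pixel-level comparison, keep the larger value
    return result

preds = [torch.rand(1, 2, 512, 512) for _ in range(4)]  # 1st .. 4th prediction results
final = integrate_predictions(preds)                     # final prediction map, 1x2x512x512
```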
According to an embodiment of the disclosure, the deep neural network model is trained using deep supervision, wherein during training a back-propagated gradient is obtained by calculating the prediction loss of each of the plurality of decoding layers of the decoder, so as to update the parameters of each encoding layer in the encoder.
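A minimal sketch of such deep supervision is shown below. PyTorch, a cross-entropy loss, and equal weighting of the per-layer losses are assumptions made only for illustration.

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss()

def deep_supervision_loss(preds, target):
    """preds: list of per-decoding-layer prediction maps (B x 2 x H x W);
    target: ground-truth mask (B x H x W) with integer class indices.
    Every decoding layer contributes its own prediction loss, so the back-propagated
    gradient updates the parameters of every encoding layer it depends on."""
    return sum(criterion(p, target) for p in preds)

preds = [torch.randn(1, 2, 64, 64, requires_grad=True) for _ in range(4)]
masks = torch.randint(0, 2, (1, 64, 64))
loss = deep_supervision_loss(preds, masks)
loss.backward()  # gradients flow back through every decoding-layer prediction
```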
An embodiment of the present disclosure provides an apparatus for performing image segmentation using a deep neural network model, wherein the deep neural network model includes an encoder and a decoder, the apparatus comprising: an acquisition module configured to acquire an input image containing a region of interest; an extraction module configured to extract semantic features of the input image at multiple different scales using a plurality of encoding layers of the encoder; a decoding module configured to decode the semantic features of the multiple scales using a plurality of decoding layers of the decoder, respectively, to obtain a prediction result corresponding to the semantic features of each scale; an integration module configured to integrate the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and an output module configured to output an output image in which the region of interest is segmented from the input image according to the final prediction result.
According to an embodiment of the disclosure, the plurality of encoding layers includes N encoding layers, wherein the size of the semantic features of the n-th scale extracted by the n-th encoding layer is smaller than the size of the semantic features of the (n-1)-th scale extracted by the (n-1)-th encoding layer, the N-th scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, and n is less than or equal to N and greater than or equal to 2.
According to an embodiment of the present disclosure, the coding layer includes a convolution layer, a pooling layer, a batch normalization layer, and an activation layer.
According to an embodiment of the disclosure, N is 4, and the extraction module is configured to: extract semantic features of a first scale of the input image using a first encoding layer of the plurality of encoding layers; extract semantic features of a second scale of the input image using a second encoding layer of the plurality of encoding layers based on the semantic features of the first scale; extract semantic features of a third scale of the input image using a third encoding layer of the plurality of encoding layers based on the semantic features of the second scale; and extract semantic features of a fourth scale of the input image using a fourth encoding layer of the plurality of encoding layers based on the semantic features of the third scale.
According to an embodiment of the present disclosure, after the semantic features of the Nth scale are extracted by the Nth encoding layer, the semantic features of the Nth scale are further input into a spatial pyramid structure for further semantic feature extraction.
According to an embodiment of the present disclosure, the spatial pyramid structure includes: a plurality of first convolution layers having different dilation rates, configured to perform dilated (atrous) convolution on the semantic features of the Nth scale, respectively, so as to further capture multi-scale information; a second convolution layer configured to further convolve the semantic features of the Nth scale to enhance the coupling between the semantic feature channels; and a pooling layer configured to pool the input image to obtain image-level semantic features.
According to an embodiment of the present disclosure, the plurality of decoding layers includes N decoding layers, wherein the decoding the semantic features of the multiple scales with the plurality of decoding layers of the decoder includes: based on the semantic features of the nth scale extracted by the nth encoding layer, decoding is performed by an nth decoding layer corresponding to the nth encoding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale.
According to an embodiment of the present disclosure, N is 4, the decoding with an nth decoding layer corresponding to an nth encoding layer based on the semantic features of an nth scale extracted by the nth encoding layer to obtain an nth prediction result includes: decoding with a first decoding layer corresponding to the first encoding layer based on semantic features of a first scale extracted by the first encoding layer to obtain a first prediction result; decoding with a second decoding layer corresponding to the second encoding layer based on semantic features of a second scale extracted by the second encoding layer to obtain a second prediction result; decoding with a third decoding layer corresponding to the third encoding layer based on semantic features of a third scale extracted by the third encoding layer to obtain a third prediction result; based on the semantic features of the fourth scale extracted by the fourth coding layer, decoding is performed by a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result.
According to an embodiment of the present disclosure, the decoding module is configured to: fuse the semantic features of the n-th scale with the semantic features of the (n-1)-th scale, and decode with the (n-1)-th decoding layer based on the fused semantic features to obtain the (n-1)-th prediction result.
According to an embodiment of the disclosure, fusing the semantic features of the n-th scale with the semantic features of the (n-1)-th scale and decoding with the (n-1)-th decoding layer based on the fused semantic features to obtain the (n-1)-th prediction result includes: upsampling the semantic features of the n-th scale, splicing the upsampled result with the semantic features of the (n-1)-th scale, upsampling the spliced result, and decoding with the (n-1)-th decoding layer based on the upsampled semantic features to obtain the (n-1)-th prediction result.
According to an embodiment of the disclosure, N is 4, and the decoding module is configured to: upsample the fourth-scale semantic features and decode with the fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result; upsample the semantic features of the fourth scale, splice the upsampled result with the semantic features of the third scale to obtain a first spliced result, upsample the first spliced result, and decode with the third decoding layer based on the upsampled semantic features to obtain a third prediction result; upsample the first spliced result, splice the upsampled result with the semantic features of the second scale to obtain a second spliced result, upsample the second spliced result, and decode with the second decoding layer based on the upsampled semantic features to obtain a second prediction result; and upsample the second spliced result, splice the upsampled result with the semantic features of the first scale to obtain a third spliced result, upsample the third spliced result, and decode with the first decoding layer based on the upsampled semantic features to obtain a first prediction result.
According to an embodiment of the present disclosure, the decoding module is configured to: fuse the semantic features of the m-th scale with the semantic features of one of the 1st to (m-2)-th scales, and decode with the decoding layer corresponding to that scale based on the fused semantic features to obtain the prediction result corresponding to that scale, where m is less than or equal to N and greater than 2.
According to an embodiment of the present disclosure, fusing the semantic features of the m-th scale with the semantic features of one of the 1st to (m-2)-th scales and decoding with the decoding layer corresponding to that scale based on the fused semantic features to obtain the prediction result corresponding to that scale includes: upsampling the semantic features of the m-th scale, splicing the upsampled result with the semantic features of that scale, upsampling the spliced result, and decoding with the decoding layer corresponding to that scale based on the upsampled semantic features to obtain the prediction result corresponding to that scale.
According to an embodiment of the disclosure, N is 4, and the decoding module is configured to: upsample the fourth-scale semantic features and decode with the fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result; upsample the semantic features of the fourth scale, splice the upsampled result with the semantic features of the second scale to obtain a first spliced result, upsample the first spliced result, and decode with the second decoding layer based on the upsampled semantic features to obtain a second prediction result; upsample the semantic features of the third scale and decode with the third decoding layer based on the upsampled semantic features to obtain a third prediction result; and upsample the first spliced result, splice the upsampled result with the semantic features of the first scale to obtain a second spliced result, upsample the second spliced result, and decode with the first decoding layer based on the upsampled semantic features to obtain a first prediction result.
According to an embodiment of the disclosure, the decoding module is further configured to: splice the further extracted semantic features, then upsample them, and obtain the Nth prediction result corresponding to the semantic features of the Nth scale with the Nth decoding layer based on the upsampled semantic features.
According to an embodiment of the present disclosure, the integrating the prediction results corresponding to the semantic features of each scale includes: and integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer with the prediction results of the adjacent upper layer to obtain a final prediction result.
According to an embodiment of the disclosure, integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, sequentially integrating the prediction result of a lower layer with the prediction result of the adjacent upper layer to obtain a final prediction result, includes: in the case where n is greater than 2, integrating the n-th prediction result with the (n-1)-th prediction result, and integrating the integrated result with the (n-2)-th prediction result; in the case where n is equal to 2, integrating the n-th prediction result with the (n-1)-th prediction result.
According to an embodiment of the present disclosure, where N is 4, the integrating the prediction results corresponding to the semantic features of each scale includes: integrating the fourth predicted result with the third predicted result, further integrating the integrated result with the second predicted result, integrating the further integrated result with the first predicted result, and taking the integrated result as a final predicted result.
According to an embodiment of the present disclosure, the integrating includes comparing the two objects to be integrated at the pixel level and taking the maximum prediction value corresponding to each pixel as the final prediction result.
According to an embodiment of the disclosure, the deep neural network model is trained using deep supervision, wherein during training a back-propagated gradient is obtained by calculating the prediction loss of each of the plurality of decoding layers of the decoder, so as to update the parameters of each encoding layer in the encoder.
The embodiment of the disclosure provides a device for performing image segmentation by a deep neural network model, which comprises: a processor, and a memory storing computer-executable instructions that, when executed by the processor, cause the processor to perform the above-described method.
The disclosed embodiments provide a computer-readable recording medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, cause the processor to perform the above-described method.
Embodiments of the present disclosure provide a method, apparatus, device, and medium for performing image segmentation using a deep neural network model. The method extracts multi-scale semantic features of an image and integrates the prediction results corresponding to those features, thereby improving the overall continuity of the segmented region of interest in the image and hence the accuracy of the segmentation result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. It should be apparent that the drawings in the following description are only some exemplary embodiments of the present disclosure, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without undue effort.
Fig. 1 is a flowchart illustrating an image segmentation method according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram illustrating an image segmentation method according to a first embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating an image segmentation method according to a second embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating an image segmentation method according to a third embodiment of the present disclosure;
fig. 5 shows an example of a segmented image obtained using an image segmentation method according to an embodiment of the present disclosure;
fig. 6 is a block diagram illustrating an image segmentation apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram illustrating an image segmentation apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In the present specification and drawings, substantially the same or similar steps and elements are denoted by the same or similar reference numerals, and repeated descriptions of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first," "second," and the like are used merely to distinguish the descriptions, and are not to be construed as indicating or implying relative importance or order.
In the present specification and drawings, elements are described in the singular or plural form according to an embodiment. However, the singular and plural forms are properly selected for the proposed case only for convenience of explanation and are not intended to limit the present disclosure thereto. Accordingly, the singular may include the plural and the plural may include the singular unless the context clearly indicates otherwise.
Currently, image segmentation may be performed using image semantic segmentation algorithms. Image semantic segmentation methods fall mainly into methods based on hand-crafted semantic features and methods based on deep semantic features. The former mainly extract semantic features such as texture, gray scale, and edges from the image manually, and then classify the image at the pixel level using thresholding, pixel-clustering-based segmentation, or segmentation based on image partitioning. The latter mainly use deep learning to automatically extract semantic features from the image, decode the deep semantic features by deconvolution, upsampling, activation, and similar operations, and finally classify the pixels to obtain a segmentation result.
However, in the prior art, the pixel-based segmentation method will lose the continuity of the whole object in the image, resulting in low accuracy of the segmentation result.
In order to solve the above-described problems, the present disclosure provides a method of performing image segmentation using a deep neural network model. According to the method for executing image segmentation by using the deep neural network model, provided by the invention, the multi-scale semantic features of the image can be extracted, and the prediction results corresponding to the multi-scale semantic features are integrated, so that the continuity of the whole segmented region of interest in the image is improved, and the accuracy of the segmentation results is improved.
The image segmentation method provided in the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating an image segmentation method according to an embodiment of the present disclosure.
Referring to fig. 1, in step S110, an input image including a region of interest is acquired.
Here, the region of interest may be any target region of interest to the user; for example, it may be a tumor region, but is not limited thereto. It should be noted that the input image and the region of interest may vary with the field to which the image segmentation method of the present disclosure is applied. Furthermore, the present disclosure does not limit the form of the input image; for example, the input image may be either a grayscale image or a color image. In addition, the present disclosure does not limit the way in which the input image is acquired; for example, an input image containing a region of interest may be captured in response to a user request, or a pre-acquired input image may be obtained directly from an external device. The input image may be of various resolutions, for example, 512×512.
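Purely as an illustration (the disclosure does not prescribe any acquisition method or library), a 512×512 grayscale patch might be loaded and converted to a tensor as follows; Pillow, NumPy, and PyTorch are assumed, and `load_patch` is a hypothetical helper.

```python
import numpy as np
import torch
from PIL import Image

def load_patch(path, size=(512, 512)):
    """Load an input image containing the region of interest as a 1 x 1 x H x W tensor.
    Grayscale is assumed here; a color image would simply yield 3 channels instead."""
    img = Image.open(path).convert('L').resize(size)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return torch.from_numpy(arr).unsqueeze(0).unsqueeze(0)  # shape: 1 x 1 x 512 x 512
```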
After the input image is acquired, image segmentation may be performed using the deep neural network model. Here, the deep neural network model may include an encoder and a decoder. The encoder may include a plurality of encoding layers and the decoder may include a plurality of decoding layers.
Deep learning techniques based on convolutional neural networks are difficult to apply in practice because they lack interpretability. Existing approaches to improving interpretability mainly visualize the image areas the network focuses on by means of feature heat maps; however, such methods are still far removed from human cognition and are not readily accepted by actual users. The present disclosure finds that, while improving image segmentation accuracy, the interpretability of deep-learning-based image segmentation can be further improved by combining deep learning with cognitive psychology theory. Feature integration theory addresses early visual processing and can be divided into two stages.
(1) Feature registration stage: feature registration helps a person conduct a guided search of the surrounding environment. This stage involves two processes: a feature extraction process and a feature encoding process. First, the visual system extracts features from the light stimulus, a process that is parallel and automatic. At this stage only individual features can be detected, including color, size, orientation, contrast, line slope, curvature, and line endpoints; differences in motion and in distance may also be detected. These features are in a free state (not bound to the object to which they belong, and their positions are subjectively uncertain). The perceptual system then encodes the features of each dimension independently, and the result is called a feature map. From the perspective of deep learning, the feature registration stage of feature integration theory is analogous to shallow convolution operations, which extract low-level features.
(2) Feature integration stage: the perceptual system correctly links the different features (feature representations) together, thereby obtaining a representation that describes the object. At this stage, feature localization is required first, i.e. the boundaries of the feature space are determined. Second, focused attention concentrates on the features of a particular location. Finally, the original features are integrated. In deep learning, the decoding process of features corresponds to this feature integration stage.
In cognitive psychology, psychologists have found that human visual cognition is a bottom-up process, while neuroscience has found that information also flows from the higher cortex back to the lower cortex, where the higher-level information directs the lower cortex to focus on specific areas. What feature integration theory, the visual pathway model of neuroscience, and cognitive psychology have in common is that they summarize and demonstrate the effectiveness and legitimacy of both bottom-up and top-down processes. Inspired by these theories, the present disclosure considers that combining the human visual cognitive process with deep learning can effectively achieve accurate segmentation of the region of interest.
Specifically, at step S120, semantic features of the input image at multiple different scales may be extracted using the multiple encoding layers of the encoder, respectively.
For example, the plurality of encoding layers may include N encoding layers, wherein the size of the semantic features of the n-th scale extracted by the n-th encoding layer is smaller than the size of the semantic features of the (n-1)-th scale extracted by the (n-1)-th encoding layer, the N-th scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, and n is less than or equal to N and greater than or equal to 2.
According to embodiments of the present disclosure, the coding layer may include a convolution layer, a pooling layer, a batch normalization layer, and an activation layer. Details of the convolution layer, the pooling layer, the batch normalization layer, and the activation layer are not described herein, as those skilled in the art will appreciate the convolution layer, the pooling layer, the batch normalization layer, and the activation layer.
As an example, N may be 4. In this case, as shown in fig. 2, semantic features of a first scale of the input image may be first extracted using a first encoding layer (denoted as "encoding layer 1" in fig. 2) of the plurality of encoding layers; secondly, extracting semantic features of a second scale of the input image using a second encoding layer (denoted as "encoding layer 2" in fig. 2) of the plurality of encoding layers based on the semantic features of the first scale; next, extracting semantic features of a third scale of the input image with a third encoding layer (denoted as "encoding layer 3" in fig. 2) of the plurality of encoding layers based on the semantic features of the second scale; finally, semantic features of a fourth scale of the input image are extracted using a fourth encoding layer (denoted as "encoding layer 4" in fig. 2) of the plurality of encoding layers based on the semantic features of the third scale.
As shown in fig. 2, for example, in the case where the input image is "512×512×1" (where 512×512 is the resolution (i.e., size) of the input image and 1 is the number of its channels), the semantic features of the first scale may be 256×256×64 semantic features, where 256×256 is the resolution (i.e., size) of the semantic features and 64 is the number of semantic feature channels; the semantic features of the second scale may be 128×128×128 semantic features, where 128×128 is the resolution and 128 is the number of channels; the semantic features of the third scale may be 64×64×256 semantic features, where 64×64 is the resolution and 256 is the number of channels; and the semantic features of the fourth scale may be 32×32×512 semantic features, where 32×32 is the resolution and 512 is the number of channels. It can be seen that the size of the semantic features extracted by the encoding layers decreases progressively from top to bottom, so that semantic features of the input image are extracted at multiple scales.
According to an embodiment of the present disclosure, after the semantic features of the Nth scale are extracted by the Nth encoding layer, the method may further include: inputting the semantic features of the Nth scale into a spatial pyramid structure for further semantic feature extraction.
As an example, when N is 4, referring to fig. 2, after the fourth-scale semantic features are extracted by the fourth encoding layer, the fourth-scale semantic features (32×32×512) and the input image (512×512×1) may be input into the spatial pyramid structure for further semantic feature extraction.
According to an embodiment of the present disclosure, the spatial pyramid structure may include a plurality of first convolution layers having different dilation rates, a second convolution layer, and a pooling layer. Specifically, the plurality of first convolution layers with different dilation rates may be used to perform dilated (atrous) convolution on the semantic features of the Nth scale, respectively, to further capture multi-scale information; the second convolution layer may be configured to further convolve the semantic features of the Nth scale to enhance the coupling between the semantic feature channels; and the pooling layer may be configured to pool the input image to obtain image-level semantic features. The spatial pyramid structure will be described in detail with reference to fig. 4 and is not elaborated here.
The above top-down multi-scale feature extraction process can simulate the visual cognitive process: lower-level visual neurons extract low-level features and continuously pass information to higher levels, which re-extract and recombine the lower-level features. This process resembles a commonly used convolutional neural network (CNN); therefore, as an example, a residual network (ResNet) may be employed to simulate it.
The residual network may be composed of four residual blocks, and each residual block may in turn be composed of a convolution layer, a batch normalization layer, an activation function layer, a dropout layer, and a fully connected layer. Since those skilled in the art know how to extract semantic features of an image using a residual network, the details of semantic feature extraction with the residual network are not described here.
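A simplified sketch of one such residual block is given below; PyTorch is assumed, the fully connected component mentioned above is omitted, and all hyper-parameters are illustrative choices rather than the disclosed structure.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Convolution + batch normalization + activation + dropout with an identity skip,
    a simplified, illustrative version of the residual blocks described above."""
    def __init__(self, channels, p_drop=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))  # residual (skip) connection

y = ResidualBlock(64)(torch.randn(1, 64, 128, 128))  # shape preserved: 1x64x128x128
```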
Referring back to fig. 1, after the semantic features of the input image at multiple different scales are obtained, in step S130 the semantic features of the multiple scales may be decoded using the multiple decoding layers of the decoder, respectively, to obtain a prediction result corresponding to the semantic features of each scale.
According to an embodiment of the present disclosure, the plurality of decoding layers may include N decoding layers. Specifically, the nth prediction result corresponding to the semantic feature of the nth scale may be obtained by decoding with the nth decoding layer corresponding to the nth encoding layer based on the semantic feature of the nth scale extracted by the nth encoding layer at step S130.
As an example, N may be 4. In this case, referring again to fig. 2, a first prediction result may be obtained by first decoding with a first decoding layer (denoted as "decoding layer 1" in fig. 2) corresponding to a first encoding layer based on semantic features of a first scale extracted by the first encoding layer; next, a second prediction result may be obtained by decoding with a second decoding layer (denoted as "decoding layer 2" in fig. 2) corresponding to the second encoding layer based on semantic features of a second scale extracted by the second encoding layer; next, a third prediction result may be obtained by decoding with a third decoding layer (denoted as "decoding layer 3" in fig. 2) corresponding to the third encoding layer based on semantic features of a third scale extracted by the third encoding layer; finally, a fourth prediction result may be obtained by decoding with a fourth decoding layer (denoted as "decoding layer 4" in fig. 2) corresponding to the fourth encoding layer based on the semantic features of the fourth scale extracted by the fourth encoding layer.
Specifically, the prediction result corresponding to the semantic features of each scale may be obtained by decoding the semantic features of the multiple scales in the following manner: fusing the semantic features of the n-th scale with the semantic features of the (n-1)-th scale, and decoding with the (n-1)-th decoding layer based on the fused semantic features to obtain the (n-1)-th prediction result. For example, the (n-1)-th prediction result may be obtained by upsampling the n-th-scale semantic features, splicing the upsampled result with the (n-1)-th-scale semantic features, upsampling the spliced result, and decoding with the (n-1)-th decoding layer based on the upsampled semantic features.
As an example, the splicing may be concatenation along the channel dimension.
As an example, low-resolution semantic features (i.e., small-scale semantic features) may be upsampled to high-resolution semantic features (i.e., large-scale semantic features) using linear interpolation (e.g., bilinear interpolation) and then spliced with the semantic features of the same scale. The above fusion process may be expressed as follows:
X = f(X1) + Bilinear(f(X2))
where f(·) is the convolution operation performed by a coding layer (the number of convolution kernels here may be, for example, 256), X1 and X2 denote coding layers, Bilinear(·) is bilinear interpolation, f(X1) denotes the semantic features extracted by performing the convolution operation on coding layer X1, and f(X2) denotes the semantic features extracted by performing the convolution operation on coding layer X2.
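As an illustration, the fusion formula above might be written in code as follows. The 256 output channels follow the text; PyTorch, the 3×3 kernel size, and the input channel counts are assumptions, and the '+' of the formula is taken literally as element-wise addition.

```python
import torch
import torch.nn.functional as F
from torch import nn

# f(.): the convolution applied to each coding layer's output, with 256 kernels as stated above.
f1 = nn.Conv2d(64, 256, kernel_size=3, padding=1)   # for the larger-scale features X1 (64 ch assumed)
f2 = nn.Conv2d(512, 256, kernel_size=3, padding=1)  # for the smaller-scale features X2 (512 ch assumed)

def fuse(x1, x2):
    """X = f(X1) + Bilinear(f(X2)): convolve both inputs, bilinearly upsample the
    smaller-scale result to the larger scale, then add element-wise."""
    up = F.interpolate(f2(x2), size=x1.shape[-2:], mode='bilinear', align_corners=False)
    return f1(x1) + up

x1 = torch.randn(1, 64, 256, 256)   # example larger-scale semantic features
x2 = torch.randn(1, 512, 32, 32)    # example smaller-scale semantic features
fused = fuse(x1, x2)                # 1 x 256 x 256 x 256
```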
In the above fusion mode, the small-scale semantic features and the large-scale semantic features are fused in sequence according to the size of the semantic feature scale, and the fused semantic features are utilized to obtain the corresponding prediction result. For example, N may be 4, in which case, referring to fig. 2, first, the semantic features of the fourth scale may be upsampled and decoded using a fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result (denoted as "512×512×2" in fig. 2);
next, the fourth-scale semantic features may be upsampled, the upsampled result may be spliced with the third-scale semantic features to obtain a first spliced result (denoted as "64×64×768" in fig. 2), the first spliced result may be upsampled, and decoded using a third decoding layer based on the upsampled semantic features to obtain a third prediction result (denoted as "512×512×2" in fig. 2);
The first stitched result may then be upsampled, the upsampled result may be stitched with semantic features of a second scale to obtain a second stitched result (denoted as "128×128×896" in fig. 2), the second stitched result may be upsampled, and decoded using a second decoding layer based on the upsampled semantic features to obtain a second predicted result (denoted as "512×512×2" in fig. 2);
finally, the second stitching result may be upsampled, the upsampled result may be stitched with the semantic features of the first scale to obtain a third stitching result (denoted as "256×256×960" in fig. 2), the third stitching result may be upsampled, and decoded using the first decoding layer based on the upsampled semantic features to obtain a first prediction result (denoted as "512×512×2" in fig. 2).
Alternatively, according to another embodiment of the present disclosure, the prediction result corresponding to the semantic features of each scale may also be obtained by decoding the semantic features of the multiple scales by: fusing the semantic features of the m-th scale with the semantic features of one of the 1 st to m-2 th scales and decoding the semantic features based on the fused semantic features by using a decoding layer corresponding to the one scale to obtain a prediction result corresponding to the one scale, wherein m is less than or equal to N and greater than 2.
For example, the semantic features of the m-th scale may be up-sampled first, then the up-sampled result may be spliced with the semantic features of the one scale, then the spliced result may be up-sampled, and the prediction result corresponding to the one scale may be obtained by decoding with a decoding layer corresponding to the one scale based on the up-sampled semantic features.
In the above alternative manner, the small-scale semantic features and the large-scale semantic features may not be fused in sequence according to the size of the semantic feature scale, but may be connected in a jumping manner, so that the time complexity is reduced, the computing resources are saved, and further the prediction result is facilitated to be obtained quickly.
Specifically, for example, in the case where N may be 4, referring to fig. 3, for example, in the case where the input image is "512×512×1" (where 512×512 is the resolution (i.e., size) of the semantic features of the input image, and 1 is the number of semantic feature channels of the input image), first, the semantic features of the fourth scale may be upsampled and decoded using the fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result (denoted as "512×512×2" in fig. 3);
Secondly, the fourth scale semantic features may be upsampled, the upsampled result is spliced with the second scale semantic features to obtain a first spliced result (denoted as "128×128×640" in fig. 3), the first spliced result is upsampled, and decoding is performed using a second decoding layer based on the upsampled semantic features to obtain a second prediction result (denoted as "512×512×2" in fig. 3);
then, the third-scale semantic features may be upsampled and decoded using a third decoding layer based on the upsampled semantic features to obtain a third prediction result (denoted as "512×512×2" in fig. 3);
finally, the first splicing result may be upsampled, the upsampled result may be spliced with the semantic features of the first scale to obtain a second splicing result (denoted as "256×256×704" in fig. 3), the second splicing result may be upsampled, and decoding may be performed using the first decoding layer based on the upsampled semantic features to obtain a first prediction result (denoted as "512×512×2" in fig. 3).
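As with fig. 2, a purely illustrative, non-limiting sketch of the skip (jump) fusion order of fig. 3 is given below; the same assumed channel widths apply, and f1 to f4 denote the first- to fourth-scale semantic features:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionDecoder(nn.Module):
    # Hypothetical decoder following the Fig. 3 connectivity: the fourth scale is spliced
    # directly with the second scale, and the third scale is decoded independently.
    def __init__(self, ch=(64, 128, 256, 512), num_classes=2):
        super().__init__()
        c1, c2, c3, c4 = ch
        self.dec4 = nn.Conv2d(c4, num_classes, 1)
        self.dec3 = nn.Conv2d(c3, num_classes, 1)
        self.dec2 = nn.Conv2d(c4 + c2, num_classes, 1)       # 640 spliced channels (Fig. 3)
        self.dec1 = nn.Conv2d(c4 + c2 + c1, num_classes, 1)  # 704 spliced channels (Fig. 3)

    def forward(self, f1, f2, f3, f4, out_size=(512, 512)):
        up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:], mode='bilinear', align_corners=False)
        full = lambda x: F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        p4 = self.dec4(full(f4))                    # fourth prediction result
        s1 = torch.cat([up(f4, f2), f2], dim=1)     # first splicing result, 128x128x640
        p2 = self.dec2(full(s1))                    # second prediction result
        p3 = self.dec3(full(f3))                    # third prediction result (no fusion)
        s2 = torch.cat([up(s1, f1), f1], dim=1)     # second splicing result, 256x256x704
        p1 = self.dec1(full(s2))                    # first prediction result
        return p1, p2, p3, p4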
In both of the above ways of obtaining the prediction results, the semantic features of the N-th scale (i.e., the semantic features of the smallest scale) are directly upsampled, and the N-th prediction result (i.e., the prediction result of the highest layer) is obtained by decoding with the N-th decoding layer based on the upsampled semantic features; however, the present disclosure is not limited thereto.
Alternatively, as described above, the semantic features of the N-th scale and the input image may be input into the spatial pyramid structure for further semantic feature extraction; the further extracted semantic features are then spliced and upsampled, and the N-th prediction result corresponding to the semantic features of the N-th scale is obtained by using the N-th decoding layer based on the upsampled semantic features.
For example, as shown in fig. 4, in the case where the input image is "512×512×1" (where 512×512 is the resolution (i.e., size) of the semantic features of the input image and 1 is the number of semantic feature channels of the input image), the semantic features of the fourth scale (denoted as "32×32×512" in fig. 4) extracted by the fourth coding layer (denoted as "coding layer 4" in fig. 4) may be input into the spatial pyramid structure together with the input image. In the spatial pyramid structure, the input image may be pooled, e.g., average pooled, weighted average pooled, etc., using a pooling layer to obtain image-level semantic features (denoted as "32×32×1" in fig. 4). In addition, the fourth-scale semantic features may be convolved with a second convolution layer (e.g., a 1×1 convolution) to enhance the coupling of the semantic feature channels, the resulting semantic features being denoted as "32×32×512" in fig. 4, and the fourth-scale semantic features may also be convolved with three first convolution layers (e.g., 3×3 convolutions) at different expansion rates, respectively, to further capture multi-scale information. Here, as an example, the expansion rates may be 6, 12, and 18, respectively. In fig. 4, the semantic features obtained after the processing of each first convolution layer are "32×32×512"; the semantic features obtained from the three first convolution layers, the second convolution layer, and the pooling layer are then spliced and reduced in dimension to obtain the semantic features "32×32×512" shown in fig. 4.
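By way of a non-limiting illustration, the spatial pyramid structure of fig. 4 may be sketched as an ASPP-like block as follows; the 512-channel branch width mirrors fig. 4, while the exact block composition (e.g., the absence of normalization layers) is an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramid(nn.Module):
    def __init__(self, feat_ch=512, img_ch=1, out_ch=512, rates=(6, 12, 18)):
        super().__init__()
        # First convolution layers: dilated ("hole") 3x3 convolutions at different expansion rates.
        self.branches = nn.ModuleList([
            nn.Conv2d(feat_ch, feat_ch, 3, padding=r, dilation=r) for r in rates
        ])
        # Second convolution layer: 1x1 convolution strengthening the channel coupling.
        self.conv1x1 = nn.Conv2d(feat_ch, feat_ch, 1)
        # Dimension reduction applied after splicing all branches with the pooled image features.
        self.reduce = nn.Conv2d(feat_ch * 4 + img_ch, out_ch, 1)

    def forward(self, f4, image):
        h, w = f4.shape[-2:]
        img_level = F.adaptive_avg_pool2d(image, (h, w))           # pooled input image, e.g. 32x32x1
        feats = [b(f4) for b in self.branches] + [self.conv1x1(f4), img_level]
        return self.reduce(torch.cat(feats, dim=1))                # spliced and reduced, e.g. 32x32x512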
Referring back to fig. 1, after obtaining the prediction results corresponding to the semantic features of each scale, the prediction results corresponding to the semantic features of each scale may be integrated to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest in step S140.
Specifically, according to an exemplary embodiment, in step S140, the prediction results of the plurality of decoding layers may be integrated from bottom to top according to the scale of the semantic feature corresponding to the prediction results, and the prediction results of the lower layer and the prediction results of the adjacent upper layer may be sequentially integrated to obtain the final prediction result.
For example, in the case where n is greater than 2, the nth predicted result may be integrated with the nth-1 predicted result, and the integrated result may be integrated with the nth-2 predicted result; in the case where n is equal to 2, the nth predictor may be integrated with the n-1 th predictor.
As shown in fig. 2 or 3 or 4, in case N may be 4, the above integration process may include: integrating the fourth predicted result with the third predicted result, further integrating the integrated result with the second predicted result, integrating the further integrated result with the first predicted result, and taking the integrated result as a final predicted result.
As described above, via the top-down path, the semantic features of each scale extracted by the respective encoding layers are already semantically strong, so the extracted semantic features of each scale can be used for prediction independently, and the predictions made from semantic features of different scales serve different purposes: the semantic features of the smallest scales (the deepest layers) have a large receptive field covering the whole original image and can therefore give a globally consistent segmentation result, but because their resolution is small, more spatial detail is lost, the result is coarse, and the segmentation of edges is not ideal; conversely, the semantic features of the larger scales (the shallower layers) have a smaller receptive field and can give a better edge segmentation result, but the continuity of their segmentation result is not as strong as that of the high-level result. Therefore, the independent prediction results based on the semantic features of the various scales complement one another; by comprehensively utilizing the plurality of prediction results corresponding to the semantic features of the various scales through the semantic feature integration operation, a finer segmentation result can be obtained.
The core idea of the semantic feature integration operation is as follows: starting from the prediction result corresponding to the semantic features of the minimum scale, the prediction results corresponding to the semantic features of progressively larger scales are integrated in turn, so that the prediction is continually corrected; here, the prediction result corresponding to the semantic features of the minimum scale is a coarse-granularity prediction result, and the prediction result corresponding to the semantic features of a larger scale is a fine-granularity prediction result finer than the coarse-granularity one.
According to an exemplary embodiment, the above integration process may be as follows: the two objects to be integrated are compared at the pixel level, and the maximum predicted value corresponding to each pixel is taken as the integration result. For example, the two objects to be integrated may be compared at the pixel level using a non-maximum suppression method to find the maximum predicted value corresponding to each pixel. Non-maximum suppression, i.e., suppressing non-maximum elements, can be understood as a local maximum search.
For example, in the examples of fig. 2, 3, or 4, the four independent prediction results of different scales may be fed simultaneously into a layer of the decoder that performs the integration operation (e.g., a Look Inside operation layer). In this layer, the obtained n-th prediction result is compared with the (n-1)-th prediction result at the pixel level using non-maximum suppression to obtain a local optimum value, and this step is repeated until the first prediction result is obtained, which is the final prediction result.
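A minimal, purely illustrative sketch of this pixel-level integration (assuming the per-pixel comparison reduces to keeping the larger predicted value) is:

import torch

def integrate_predictions(p1, p2, p3, p4):
    # p1..p4: prediction maps of shape (B, 2, H, W); p4 comes from the smallest-scale features.
    fused = p4
    for finer in (p3, p2, p1):                 # integrate from bottom (coarse) to top (fine)
        fused = torch.maximum(fused, finer)    # keep the larger predicted value at every pixel
    return fused                               # final prediction result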
Referring back to fig. 1, after the final prediction result is obtained, an output image that partitions the region of interest from the input image according to the final prediction result may be output at step S150.
As an example, the decoder may further include an output layer, and the final output image may be obtained after the final prediction result passes through the output layer. As shown in fig. 2, 3 and 4, the region of interest in the input image can be effectively segmented by using the image segmentation method described above, wherein the dark region in the output image is the region of interest.
The image segmentation method according to the exemplary embodiments of the present disclosure may be applied to various fields related to image processing. For example, in the field of medical image processing, the above-described image segmentation method of the present disclosure may be used to segment tumor regions in immunohistochemical membrane-stained slice images. Fig. 5 shows an example of a segmented image obtained using the image segmentation method according to an embodiment of the present disclosure. As shown in fig. 5, the tumor regions in immunohistochemical membrane-stained sections can be effectively segmented, where dark regions represent tumor regions and light regions represent non-tumor regions.
As described above, the image segmentation method according to various exemplary embodiments of the present disclosure performs image segmentation using a deep neural network model, which requires training to be performed in advance before being used for image segmentation.
As an example, the deep neural network model may be trained using a deep supervised approach, and further, during the deep neural network model training process, a counter-propagating gradient may be derived by calculating a predictive loss for each of the plurality of decoding layers of the decoder to update parameters for each encoding layer in the encoder.
Deep supervision refers to supervising the backbone network by adding auxiliary classifiers that make independent predictions at some intermediate layers of the deep neural network. As the network deepens, a deep neural network is prone to vanishing gradients and overly slow convergence during training; the deep supervision method effectively accelerates the parameter updates of each encoding layer in the encoder during back propagation, so that more useful semantic and spatial features can be extracted. Independent prediction results are obtained by predicting separately from the extracted semantic features of multiple scales, and deep supervision is used to compute a loss for each prediction result and obtain the back-propagated gradient, so that the parameters of each encoding layer can be updated quickly.
As an example, when the number of coding layers and the number of decoding layers (which may also be referred to as prediction layers) are both four, the above deep supervision process may be as follows: the semantic features of four different scales obtained after encoding by coding layer 1, coding layer 2, coding layer 3, and coding layer 4 (and, where applicable, after the spatial pyramid structure) are converted, by convolution, upsampling, and softmax activation, into four prediction results of size 512×512×2; a cross-entropy loss is then computed for each of the four prediction results obtained by the four decoding layers, and the losses of the four decoding layers and the loss of the final output layer are added to obtain a final total loss value L, which is used for gradient calculation and back propagation. The formula is as follows:
$L=\sum_{b=1}^{B}\left(\sum_{i=1}^{N} L_{ce}\left(\hat{Y}_{b}^{i}, Y_{b}\right)+L_{ce}\left(\hat{Y}_{b}^{final}, Y_{b}\right)\right)$

In the above formula, B represents the number of images in a batch, b represents the b-th image, N represents the number of deeply supervised prediction layers, i represents the i-th prediction layer, $\hat{Y}_{b}^{i}$ represents the i-th predictive probability map obtained for the b-th image, $\hat{Y}_{b}^{final}$ represents the predictive probability map output by the final output layer, $Y_{b}$ represents the manual annotation of the b-th image, and $L_{ce}$ denotes the cross-entropy loss.
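A non-limiting sketch of this deeply supervised total loss (assuming an unweighted sum of the cross-entropy terms, averaged over the batch by the loss function itself) is:

import torch
import torch.nn.functional as F

def deep_supervision_loss(layer_preds, final_pred, target):
    # layer_preds: list of N tensors of shape (B, 2, H, W), one per deeply supervised prediction layer.
    # final_pred: prediction of the final output layer, shape (B, 2, H, W).
    # target: manual annotation as integer class indices, shape (B, H, W).
    loss = F.cross_entropy(final_pred, target)       # loss of the final output layer
    for pred in layer_preds:                         # losses of the N decoding (prediction) layers
        loss = loss + F.cross_entropy(pred, target)
    return loss                                      # total loss L used for back propagation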
In the training process, the model updates its parameters through continuous iteration. The model and its parameters at each iteration are verified on the divided validation set, and appropriate evaluation indices are selected to evaluate model performance. For example, when the region of interest is a tumor region, the evaluation indices may include the tumor region segmentation intersection-over-union (IoU) and the average pixel classification accuracy; the closer these two indices are to 1, the better the current model and parameters perform. Finally, the model with the highest tumor region segmentation IoU and the highest average pixel classification accuracy is taken as the optimal model.
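For illustration only, the two evaluation indices may be computed from binary prediction and annotation masks roughly as follows (here the average pixel classification accuracy is simplified to an overall pixel accuracy, which is an assumption):

import torch

def tumor_iou(pred_mask, gt_mask, eps=1e-7):
    # pred_mask, gt_mask: boolean tensors of shape (H, W), True for tumor pixels.
    inter = (pred_mask & gt_mask).sum().float()
    union = (pred_mask | gt_mask).sum().float()
    return (inter / (union + eps)).item()            # closer to 1 means better segmentation

def mean_pixel_accuracy(pred_mask, gt_mask):
    # Fraction of pixels whose predicted class matches the manual annotation.
    return (pred_mask == gt_mask).float().mean().item()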
The image segmentation method according to the embodiments of the present disclosure has been described above with reference to fig. 1 to 5. According to this image segmentation method, multi-scale semantic features of the image can be extracted and the prediction results corresponding to the multi-scale semantic features can be integrated, which improves the continuity of the segmented region of interest as a whole and thus the accuracy of the image segmentation result. In addition, according to the embodiments of the present disclosure, by combining the human visual cognition process with deep learning technology, accurate segmentation of the region of interest is effectively achieved, and the interpretability of the deep-learning-based image segmentation method is further improved.
The present disclosure provides a corresponding image segmentation apparatus and device in addition to the above image segmentation method, which will be described with reference to fig. 6 and 7.
Fig. 6 is a block diagram illustrating an image segmentation apparatus 600 according to an embodiment of the disclosure.
Referring to fig. 6, the illustrated image segmentation apparatus may include: the system comprises an acquisition module 610, an extraction module 620, a decoding module 630, an integration module 640 and an output module 650.
The acquisition module 610 may be configured to acquire an input image containing a region of interest.
Here, the region of interest may be any target region of interest to the user; for example, the region of interest may be a tumor region, but is not limited thereto. It should be noted that, depending on the field to which the image segmentation method of the present disclosure is applied, the input image and the region of interest may change accordingly. Furthermore, the present disclosure does not limit the form of the input image; for example, the input image may be either a gray-scale image or a color image. In addition, the present disclosure does not limit the manner of acquiring the input image; for example, an input image containing the region of interest may be captured in response to a user request, or a pre-acquired input image may be obtained directly from an external device, and so on. Moreover, the input image may be an image of various resolutions.
After the input image is acquired, image segmentation may be performed using the deep neural network model. Here, the deep neural network model may include an encoder and a decoder. The encoder may include a plurality of encoding layers and the decoder may include a plurality of decoding layers.
The extraction module 620 may be configured to extract semantic features of different multiple scales of the input image using multiple encoding layers of an encoder, respectively.
According to an embodiment of the present disclosure, the plurality of encoding layers may include N encoding layers, wherein a size of the semantic feature of the N-th scale extracted by the N-th encoding layer is smaller than a size of the semantic feature of the N-1-th scale extracted by the N-1-th encoding layer, the N-th scale is a smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, and N is less than or equal to N and greater than or equal to 2.
According to embodiments of the present disclosure, the coding layer may include a convolution layer, a pooling layer, a batch normalization layer, and an activation layer. Details of the convolution layer, the pooling layer, the batch normalization layer, and the activation layer are not described herein, as those skilled in the art will appreciate the convolution layer, the pooling layer, the batch normalization layer, and the activation layer.
As an example, N may be 4. In this case, as shown in fig. 2 above, semantic features of the first scale of the input image may be first extracted using a first encoding layer (denoted as "encoding layer 1" in fig. 2) of the plurality of encoding layers; secondly, extracting semantic features of a second scale of the input image using a second encoding layer (denoted as "encoding layer 2" in fig. 2) of the plurality of encoding layers based on the semantic features of the first scale; next, extracting semantic features of a third scale of the input image with a third encoding layer (denoted as "encoding layer 3" in fig. 2) of the plurality of encoding layers based on the semantic features of the second scale; finally, semantic features of a fourth scale of the input image are extracted using a fourth encoding layer (denoted as "encoding layer 4" in fig. 2) of the plurality of encoding layers based on the semantic features of the third scale.
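As a non-limiting sketch, four such encoding layers may be stacked as follows; the block composition and channel widths are illustrative assumptions consistent with the convolution, pooling, batch normalization, and activation layers mentioned above:

import torch.nn as nn

def encoding_layer(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                 # halves the resolution, e.g. 512 -> 256 -> 128 -> 64 -> 32
    )

class Encoder(nn.Module):
    def __init__(self, in_ch=1, ch=(64, 128, 256, 512)):
        super().__init__()
        self.enc1 = encoding_layer(in_ch, ch[0])   # first-scale semantic features, e.g. 256x256x64
        self.enc2 = encoding_layer(ch[0], ch[1])   # second scale, e.g. 128x128x128
        self.enc3 = encoding_layer(ch[1], ch[2])   # third scale, e.g. 64x64x256
        self.enc4 = encoding_layer(ch[2], ch[3])   # fourth (smallest) scale, e.g. 32x32x512

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        f4 = self.enc4(f3)
        return f1, f2, f3, f4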
According to an embodiment of the present disclosure, the method may further include, after the semantic features of the N-th scale are extracted by the N-th encoding layer (i.e., encoding layer 4): inputting the semantic features of the N-th scale into a spatial pyramid structure for further semantic feature extraction.
According to an embodiment of the present disclosure, the spatial pyramid structure may include a plurality of first convolution layers, second convolution layers, and pooling layers having different expansion rates. Specifically, a plurality of first convolution layers with different expansion rates may be used to perform hole convolution on the semantic features of the nth scale to further capture multi-scale information, respectively; the second convolution layer may be configured to further convolve the semantic features of the nth scale to enhance the coupling of the semantic feature channels; the pooling layer may be configured to pool the input image to obtain semantic features at an image level.
The decoding module 630 may be configured to decode the semantic features of the multiple scales with multiple decoding layers of the decoder to obtain prediction results corresponding to the semantic features of each scale, respectively.
According to an embodiment of the present disclosure, the plurality of decoding layers may include N decoding layers. Specifically, the nth prediction result corresponding to the semantic feature of the nth scale may be obtained by decoding with the nth decoding layer corresponding to the nth encoding layer based on the semantic feature of the nth scale extracted by the nth encoding layer in the decoding module 630.
As an example, N may be 4. In this case, referring again to fig. 2, a first prediction result may be obtained by first decoding with a first decoding layer (denoted as "decoding layer 1" in fig. 2) corresponding to a first encoding layer based on semantic features of a first scale extracted by the first encoding layer; next, a second prediction result may be obtained by decoding with a second decoding layer (denoted as "decoding layer 2" in fig. 2) corresponding to the second encoding layer based on semantic features of a second scale extracted by the second encoding layer; next, a third prediction result may be obtained by decoding with a third decoding layer (denoted as "decoding layer 3" in fig. 2) corresponding to the third encoding layer based on semantic features of a third scale extracted by the third encoding layer; finally, a fourth prediction result may be obtained by decoding with a fourth decoding layer (denoted as "decoding layer 4" in fig. 2) corresponding to the fourth encoding layer based on the semantic features of the fourth scale extracted by the fourth encoding layer.
Specifically, the prediction result corresponding to the semantic features of each scale may be obtained by decoding the semantic features of the plurality of scales in the following manner: fusing the semantic features of the nth scale with the semantic features of the nth-1 scale, and decoding by using an nth-1 decoding layer based on the fused semantic features to obtain an nth-1 prediction result. For example, the semantic features of the nth scale may be first upsampled, then the upsampled result may be spliced with the semantic features of the nth-1 scale, then the spliced result may be upsampled, and decoded using the nth-1 decoding layer based on the upsampled semantic features to obtain the nth-1 prediction result, where the splicing may be splicing along the channel.
The integration module 640 may be configured to integrate the predictions corresponding to the semantic features of each scale to obtain a final prediction as to whether individual pixels in the input image belong to the region of interest.
According to an exemplary embodiment, in the integration module 640, the prediction results of the plurality of decoding layers may be integrated from bottom to top according to the scale of the semantic feature corresponding to the prediction results, and the prediction results of the lower layer and the prediction results of the adjacent upper layer may be sequentially integrated to obtain the final prediction result.
For example, in the case where n is greater than 2, the nth predicted result may be integrated with the nth-1 predicted result, and the integrated result may be integrated with the nth-2 predicted result; in the case where n is equal to 2, the nth predictor may be integrated with the n-1 th predictor.
As shown in fig. 2 or 3 or 4 above, in the case where N may be 4, the above integration process may include: integrating the fourth predicted result with the third predicted result, further integrating the integrated result with the second predicted result, integrating the further integrated result with the first predicted result, and taking the integrated result as a final predicted result.
The output module 650 may be configured to output an output image in which the region of interest is segmented from the input image according to the final prediction result.
Since details of the above operations are described in the course of describing the image segmentation method according to the present disclosure, details thereof will not be repeated herein for brevity, and reference may be made to the above description with respect to fig. 1 to 5.
Fig. 7 is a block diagram illustrating an image segmentation apparatus 700 according to an embodiment of the present disclosure.
Referring to fig. 7, an image segmentation apparatus 700 may include a processor 701 and a memory 702. The processor 701 and the memory 702 may be connected by a bus 703.
The processor 701 may perform various actions and processes according to programs stored in the memory 702. In particular, the processor 701 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
The memory 702 stores computer instructions that, when executed by the processor 701, implement the image segmentation method described above. The memory 702 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The present disclosure also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, can implement the above-described method. Similarly, the computer readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the computer-readable storage media described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The exemplary embodiments of the invention described in detail above are illustrative only and are not limiting. It will be appreciated by those skilled in the art that various modifications and combinations of the embodiments or features thereof can be made without departing from the principles and spirit of the invention, and such modifications are intended to be within the scope of the invention.

Claims (26)

1. A method of performing image segmentation using a deep neural network model, wherein the deep neural network model includes an encoder and a decoder, the method comprising:
Acquiring an input image containing a region of interest;
extracting semantic features of different multiple scales of the input image by using a plurality of coding layers of an encoder respectively;
decoding the semantic features of the multiple scales by utilizing a plurality of decoding layers of a decoder to obtain prediction results corresponding to the semantic features of each scale;
integrating the prediction results corresponding to the semantic features of each scale to obtain a final prediction result as to whether each pixel in the input image belongs to the region of interest; and
outputting an output image in which the region of interest is segmented from the input image according to the final prediction result,
wherein the plurality of encoding layers includes N encoding layers, wherein the size of the semantic features of the N-th scale extracted by the N-th encoding layer is smaller than the size of the semantic features of the N-1 th scale extracted by the N-1-th encoding layer, the N-th scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, N is less than or equal to N and greater than or equal to 2,
wherein decoding the semantic features of the multiple scales with multiple decoding layers of the decoder to obtain prediction results corresponding to the semantic features of each scale includes:
Fusing the semantic features of the m-th scale with the semantic features of one of the 1st to (m-2)-th scales and decoding by using a decoding layer corresponding to the one scale based on the fused semantic features to obtain a prediction result corresponding to the one scale, wherein m is less than or equal to N and greater than 2;
the method for obtaining the prediction result corresponding to the one scale by fusing the semantic features of the m-th scale and the semantic features of the one scale from the 1st to (m-2)-th scales and decoding based on the fused semantic features by using a decoding layer corresponding to the one scale comprises the following steps:
upsampling the semantic features of the m-th scale, splicing the upsampled result with the semantic features of the one scale, upsampling the spliced result, and decoding by using a decoding layer corresponding to the one scale based on the upsampled semantic features to obtain a prediction result corresponding to the one scale;
wherein the method further comprises the following, after the semantic features of the Nth scale are extracted by the Nth coding layer:
inputting the semantic features of the Nth scale into a spatial pyramid structure for further semantic feature extraction, wherein the spatial pyramid structure comprises:
The first convolution layers are provided with different expansion rates and are used for carrying out cavity convolution on the semantic features of the Nth scale respectively so as to further capture multi-scale information;
the second convolution layer is used for further convolving the semantic features of the Nth scale to enhance the coupling of the semantic feature channels;
and the pooling layer is used for pooling the input image to obtain semantic features of the image level.
2. The method of claim 1, wherein the encoding layer comprises a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer.
3. The method of claim 1, wherein N is 4, the extracting semantic features of different multiple scales of the input image with multiple encoding layers of an encoder, respectively, comprises:
extracting semantic features of a first scale of the input image using a first encoding layer of the plurality of encoding layers;
extracting semantic features of a second scale of the input image using a second encoding layer of the plurality of encoding layers based on the semantic features of the first scale;
extracting semantic features of a third scale of the input image with a third encoding layer of the plurality of encoding layers based on the semantic features of the second scale;
And extracting semantic features of a fourth scale of the input image by using a fourth coding layer in the plurality of coding layers based on the semantic features of the third scale.
4. The method of claim 1, wherein the plurality of decoding layers comprises N decoding layers, wherein the decoding the semantic features of the multiple scales with the plurality of decoding layers of the decoder, respectively, comprises:
based on the semantic features of the nth scale extracted by the nth encoding layer, decoding is performed by an nth decoding layer corresponding to the nth encoding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale.
5. The method of claim 4, wherein N is 4, the decoding with an nth decoding layer corresponding to the nth encoding layer based on the semantic features of the nth scale extracted by the nth encoding layer to obtain the nth prediction result, comprising:
decoding with a first decoding layer corresponding to the first encoding layer based on semantic features of a first scale extracted by the first encoding layer to obtain a first prediction result;
decoding with a second decoding layer corresponding to the second encoding layer based on semantic features of a second scale extracted by the second encoding layer to obtain a second prediction result;
Decoding with a third decoding layer corresponding to the third encoding layer based on semantic features of a third scale extracted by the third encoding layer to obtain a third prediction result;
based on the semantic features of the fourth scale extracted by the fourth coding layer, decoding is performed by a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result.
6. The method of claim 1, wherein N is 4, the decoding the semantic features of the multiple scales with multiple decoding layers of a decoder to obtain a prediction result corresponding to the semantic features of each scale, comprising:
upsampling the fourth scale semantic features and decoding with a fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result;
upsampling the semantic features of the fourth scale, splicing the upsampled result with the semantic features of the second scale to obtain a first spliced result, upsampling the first spliced result, and decoding by using a second decoding layer based on the upsampled semantic features to obtain a second prediction result;
upsampling the semantic features of the third scale and decoding with a third decoding layer based on the upsampled semantic features to obtain a third prediction result;
And upsampling the first splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a second splicing result, upsampling the second splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
7. The method of claim 1, wherein the decoding the semantic features of the multiple scales with multiple decoding layers of a decoder to obtain a prediction result corresponding to the semantic features of each scale further comprises:
and splicing the further extracted semantic features, then upsampling, and obtaining an Nth prediction result corresponding to the semantic features of the Nth scale by utilizing an Nth decoding layer based on the upsampled semantic features.
8. The method of claim 4, wherein the integrating the predictions corresponding to semantic features of each scale comprises:
and integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer with the prediction results of the adjacent upper layer to obtain a final prediction result.
9. The method of claim 8, wherein integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic feature corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer with the prediction results of the adjacent upper layer to obtain the final prediction results, comprises:
Under the condition that n is larger than 2, integrating the n-th predicted result with the n-1-th predicted result, and integrating the integrated result with the n-2-th predicted result;
in the case where n is equal to 2, the n-th predictor is integrated with the n-1-th predictor.
10. The method of claim 9, wherein N is 4, said integrating the predictions corresponding to semantic features of each scale comprising:
integrating the fourth predicted result with the third predicted result, further integrating the integrated result with the second predicted result, integrating the further integrated result with the first predicted result, and taking the integrated result as a final predicted result.
11. The method of claim 9, wherein the integrating includes comparing two objects to be integrated at a pixel level, looking for a maximum prediction value corresponding to each pixel as a final prediction result.
12. The method of claim 1, wherein the deep neural network model is trained using a deep supervised approach,
wherein the counter-propagating gradient is derived by calculating a predictive loss for each of the plurality of decoding layers of the decoder during the deep neural network model training process to update parameters for each encoding layer in the encoder.
13. An apparatus for performing image segmentation using a deep neural network model, wherein the deep neural network model includes an encoder and a decoder, the apparatus comprising:
an acquisition module configured to acquire an input image containing a region of interest;
an extraction module configured to extract semantic features of different multiple scales of the input image, respectively, using multiple encoding layers of an encoder;
the decoding module is configured to respectively decode the semantic features of the multiple scales by utilizing a plurality of decoding layers of the decoder to obtain a prediction result corresponding to the semantic features of each scale;
an integration module configured to integrate prediction results corresponding to semantic features of each scale to obtain a final prediction result as to whether individual pixels in the input image belong to the region of interest; and
an output module configured to output an output image in which the region of interest is segmented from an input image according to the final prediction result,
wherein the plurality of encoding layers includes N encoding layers, wherein the size of the semantic features of the N-th scale extracted by the N-th encoding layer is smaller than the size of the semantic features of the N-1 th scale extracted by the N-1-th encoding layer, the N-th scale is the smallest scale among the plurality of scales, N is a positive integer greater than or equal to 2, N is less than or equal to N and greater than or equal to 2,
Wherein the decoding module comprises:
fusing the semantic features of the m-th scale with the semantic features of one of the 1st to (m-2)-th scales and decoding by using a decoding layer corresponding to the one scale based on the fused semantic features to obtain a prediction result corresponding to the one scale, wherein m is less than or equal to N and greater than 2;
the method for obtaining the prediction result corresponding to the one scale by fusing the semantic features of the m-th scale and the semantic features of the one scale from the 1st to (m-2)-th scales and decoding based on the fused semantic features by using a decoding layer corresponding to the one scale comprises the following steps:
upsampling the semantic features of the m-th scale, splicing the upsampled result with the semantic features of the one scale, upsampling the spliced result, and decoding by using a decoding layer corresponding to the one scale based on the upsampled semantic features to obtain a prediction result corresponding to the one scale;
wherein the following is further included, after the semantic features of the Nth scale are extracted by the Nth coding layer:
inputting the semantic features of the Nth scale into a spatial pyramid structure for further semantic feature extraction, wherein the spatial pyramid structure comprises:
The first convolution layers are provided with different expansion rates and are used for carrying out cavity convolution on the semantic features of the Nth scale respectively so as to further capture multi-scale information;
the second convolution layer is used for further convolving the semantic features of the Nth scale to enhance the coupling of the semantic feature channels;
and the pooling layer is used for pooling the input image to obtain semantic features of the image level.
14. The apparatus of claim 13, wherein the encoding layer comprises a convolutional layer, a pooling layer, a batch normalization layer, and an activation layer.
15. The apparatus of claim 13, wherein N is 4, the extraction module comprising:
extracting semantic features of a first scale of the input image using a first encoding layer of the plurality of encoding layers;
extracting semantic features of a second scale of the input image using a second encoding layer of the plurality of encoding layers based on the semantic features of the first scale;
extracting semantic features of a third scale of the input image with a third encoding layer of the plurality of encoding layers based on the semantic features of the second scale;
and extracting semantic features of a fourth scale of the input image by using a fourth coding layer in the plurality of coding layers based on the semantic features of the third scale.
16. The apparatus of claim 13, wherein the plurality of decoding layers comprises N decoding layers, wherein the decoding the semantic features of the multiple scales with the plurality of decoding layers of a decoder, respectively, comprises:
based on the semantic features of the nth scale extracted by the nth encoding layer, decoding is performed by an nth decoding layer corresponding to the nth encoding layer to obtain an nth prediction result corresponding to the semantic features of the nth scale.
17. The apparatus of claim 16, wherein N is 4, the decoding with an nth decoding layer corresponding to the nth encoding layer based on the semantic features of the nth scale extracted by the nth encoding layer to obtain the nth prediction result comprises:
decoding with a first decoding layer corresponding to the first encoding layer based on semantic features of a first scale extracted by the first encoding layer to obtain a first prediction result;
decoding with a second decoding layer corresponding to the second encoding layer based on semantic features of a second scale extracted by the second encoding layer to obtain a second prediction result;
decoding with a third decoding layer corresponding to the third encoding layer based on semantic features of a third scale extracted by the third encoding layer to obtain a third prediction result;
Based on the semantic features of the fourth scale extracted by the fourth coding layer, decoding is performed by a fourth decoding layer corresponding to the fourth coding layer to obtain a fourth prediction result.
18. The apparatus of claim 13, wherein N is 4, the decoding module comprising:
upsampling the fourth scale semantic features and decoding with a fourth decoding layer based on the upsampled semantic features to obtain a fourth prediction result;
upsampling the semantic features of the fourth scale, splicing the upsampled result with the semantic features of the second scale to obtain a first spliced result, upsampling the first spliced result, and decoding by using a second decoding layer based on the upsampled semantic features to obtain a second prediction result;
upsampling the semantic features of the third scale and decoding with a third decoding layer based on the upsampled semantic features to obtain a third prediction result;
and upsampling the first splicing result, splicing the upsampled result with the semantic features of the first scale to obtain a second splicing result, upsampling the second splicing result, and decoding by using a first decoding layer based on the upsampled semantic features to obtain a first prediction result.
19. The apparatus of claim 13, wherein the decoding module further comprises:
and splicing the further extracted semantic features, then upsampling, and obtaining an Nth prediction result corresponding to the semantic features of the Nth scale by utilizing an Nth decoding layer based on the upsampled semantic features.
20. The apparatus of claim 16, wherein the integrating the predictions corresponding to semantic features of each scale comprises:
and integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic features corresponding to the prediction results, and sequentially integrating the prediction results of the lower layer with the prediction results of the adjacent upper layer to obtain a final prediction result.
21. The apparatus of claim 20, wherein integrating the prediction results of the plurality of decoding layers from bottom to top according to the scale of the semantic feature corresponding to the prediction results, sequentially integrating the prediction results of the lower layer with the prediction results of the adjacent upper layer to obtain the final prediction results, comprises:
under the condition that n is larger than 2, integrating the n-th predicted result with the n-1-th predicted result, and integrating the integrated result with the n-2-th predicted result;
In the case where n is equal to 2, the n-th predictor is integrated with the n-1-th predictor.
22. The apparatus of claim 21, wherein N is 4, the integrating the predictions corresponding to semantic features for each scale comprising:
integrating the fourth predicted result with the third predicted result, further integrating the integrated result with the second predicted result, integrating the further integrated result with the first predicted result, and taking the integrated result as a final predicted result.
23. The apparatus of claim 21, wherein the integrating comprises comparing two objects to be integrated at a pixel level, looking for a maximum prediction value corresponding to each pixel as a final prediction result.
24. The apparatus of claim 13, wherein the deep neural network model is trained using a deep supervised approach,
wherein the counter-propagating gradient is derived by calculating a predictive loss for each of the plurality of decoding layers of the decoder during the deep neural network model training process to update parameters for each encoding layer in the encoder.
25. An apparatus for performing image segmentation using a deep neural network model, comprising:
A processor, and
a memory storing computer-executable instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-12.
26. A computer readable recording medium storing computer executable instructions, wherein the computer executable instructions when executed by a processor cause the processor to perform the method of any one of claims 1-12.
CN202110294862.XA 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model Active CN113065551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110294862.XA CN113065551B (en) 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110294862.XA CN113065551B (en) 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model

Publications (2)

Publication Number Publication Date
CN113065551A CN113065551A (en) 2021-07-02
CN113065551B true CN113065551B (en) 2023-08-08

Family

ID=76562263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110294862.XA Active CN113065551B (en) 2021-03-19 2021-03-19 Method for performing image segmentation using deep neural network model

Country Status (1)

Country Link
CN (1) CN113065551B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114240938B (en) * 2022-02-24 2022-05-27 浙江大学 Segmentation method and device for abnormal tissues in H & E stained section image
CN115294655A (en) * 2022-08-18 2022-11-04 中科天网(广东)科技有限公司 Method, device and equipment for countermeasures generation pedestrian re-recognition based on multilevel module features of non-local mechanism
CN116681668A (en) * 2023-06-01 2023-09-01 北京远舢智能科技有限公司 Appearance defect detection method based on four-layer gradient fusion neural network
CN117292331B (en) * 2023-11-27 2024-02-02 四川发展环境科学技术研究院有限公司 Complex foreign matter detection system and method based on deep learning


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018091486A1 (en) * 2016-11-16 2018-05-24 Ventana Medical Systems, Inc. Convolutional neural networks for locating objects of interest in images of biological samples
US11188799B2 (en) * 2018-11-12 2021-11-30 Sony Corporation Semantic segmentation with soft cross-entropy loss

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018200493A1 (en) * 2017-04-25 2018-11-01 The Board Of Trustees Of The Leland Stanford Junior University Dose reduction for medical imaging using deep convolutional neural networks
CN109034162A (en) * 2018-07-13 2018-12-18 南京邮电大学 A kind of image, semantic dividing method
WO2020069964A1 (en) * 2018-10-05 2020-04-09 Robert Bosch Gmbh Method, artificial neural network, device, computer program, and machine-readable storage medium for semantically segmenting image data
CN111091524A (en) * 2018-10-08 2020-05-01 天津工业大学 Prostate transrectal ultrasound image segmentation method based on deep convolutional neural network
WO2020093042A1 (en) * 2018-11-02 2020-05-07 Deep Lens, Inc. Neural networks for biomedical image analysis
CN109493350A (en) * 2018-11-09 2019-03-19 重庆中科云丛科技有限公司 Portrait dividing method and device
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110070091A (en) * 2019-04-30 2019-07-30 福州大学 The semantic segmentation method and system rebuild based on dynamic interpolation understood for streetscape
CN110059768A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system of the merging point and provincial characteristics that understand for streetscape
CN111047551A (en) * 2019-11-06 2020-04-21 北京科技大学 Remote sensing image change detection method and system based on U-net improved algorithm
CN111292330A (en) * 2020-02-07 2020-06-16 北京工业大学 Image semantic segmentation method and device based on coder and decoder
CN111429473A (en) * 2020-02-27 2020-07-17 西北大学 Chest film lung field segmentation model establishment and segmentation method based on multi-scale feature fusion
CN112017191A (en) * 2020-08-12 2020-12-01 西北大学 Method for establishing and segmenting liver pathology image segmentation model based on attention mechanism
CN112150428A (en) * 2020-09-18 2020-12-29 青岛大学 Medical image segmentation method based on deep learning
AU2020103905A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Unsupervised cross-domain self-adaptive medical image segmentation method based on deep adversarial learning
CN112465828A (en) * 2020-12-15 2021-03-09 首都师范大学 Image semantic segmentation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xueying Chen et al. "Feature Fusion Encoder Decoder Network for Automatic Liver Lesion Segmentation." arXiv:1903.11834v1 [cs.CV], 2019, pp. 1-4. *

Also Published As

Publication number Publication date
CN113065551A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN113065551B (en) Method for performing image segmentation using deep neural network model
CN112017189B (en) Image segmentation method and device, computer equipment and storage medium
CN110428432B (en) Deep neural network algorithm for automatically segmenting colon gland image
US20220277549A1 (en) Generative Adversarial Networks for Image Segmentation
CN111784671B (en) Pathological image focus region detection method based on multi-scale deep learning
CN111260055A (en) Model training method based on three-dimensional image recognition, storage medium and equipment
CN111369581A (en) Image processing method, device, equipment and storage medium
CN112446892A (en) Cell nucleus segmentation method based on attention learning
CN112651979A (en) Lung X-ray image segmentation method, system, computer equipment and storage medium
CN112053363B (en) Retina blood vessel segmentation method, retina blood vessel segmentation device and model construction method
CN103810473A (en) Hidden Markov model based human body object target identification method
CN113569724B (en) Road extraction method and system based on attention mechanism and dilation convolution
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN114549462A (en) Focus detection method, device, equipment and medium based on visual angle decoupling Transformer model
CN114549470A (en) Method for acquiring critical region of hand bone based on convolutional neural network and multi-granularity attention
CN113762396A (en) Two-dimensional image semantic segmentation method
WO2021147055A1 (en) Systems and methods for video anomaly detection using multi-scale image frame prediction network
CN116433654A (en) Improved U-Net network spine integral segmentation method
CN113392726B (en) Method, system, terminal and medium for identifying and detecting head of person in outdoor monitoring scene
KR102561214B1 (en) A method and apparatus for image segmentation using global attention
CN112446292A (en) 2D image salient target detection method and system
CN117422787B (en) Remote sensing image map conversion method integrating discriminant and generative model
CN116563538B (en) Image segmentation method and system
US20230154009A1 (en) Systems and methods for automatic segmentation of organs from head and neck tomographic images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant