WO2020093042A1 - Neural networks for biomedical image analysis

Neural networks for biomedical image analysis

Info

Publication number
WO2020093042A1
Authority
WO
WIPO (PCT)
Prior art keywords
recurrent
segmentation
net
convolutional
unit
Prior art date
Application number
PCT/US2019/059653
Other languages
French (fr)
Inventor
Md Zahangir ALOM
Original Assignee
Deep Lens, Inc.
Priority date
Filing date
Publication date
Application filed by Deep Lens, Inc. filed Critical Deep Lens, Inc.
Publication of WO2020093042A1 publication Critical patent/WO2020093042A1/en

Classifications

    • G06T 7/11 Region-based segmentation
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G16H 30/40 ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • G16H 50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining for computer-aided diagnosis, e.g. based on medical expert systems
    • G06N 3/048 Activation functions
    • G06T 2207/10081 Computed x-ray tomography [CT]
    • G06T 2207/10088 Magnetic resonance imaging [MRI]
    • G06T 2207/10116 X-ray image
    • G06T 2207/10132 Ultrasound image
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30024 Cell structures in vitro; Tissue sections in vitro
    • G06T 2207/30041 Eye; Retina; Ophthalmic
    • G06T 2207/30088 Skin; Dermal
    • G06T 2207/30096 Tumor; Lesion
    • G06T 2207/30101 Blood vessel; Artery; Vein; Vascular

Definitions

  • DPIA Digital Pathology Image Analysis
  • the present disclosure describes techniques for using artificial neural networks and other machine-learning models for image processing tasks, including digital pathology image analysis.
  • aspects of the subject matter described in this specification can be implemented as a computer-based method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer-implemented system comprising a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method/the instructions stored on the non-transitory, computer-readable medium.
  • a residual unit can help when training deep architectures.
  • feature accumulation with recurrent residual convolutional layers can assure better feature representation for segmentation tasks.
  • feature accumulation can facilitate designing better U-Net architectures with the same or fewer number of network parameters and with better performance for medical image segmentation.
  • the image segmentation neural network subsystem can include a plurality of encoding units arranged in succession so that each encoding unit after a first encoding unit is configured to process an input set of feature maps from a preceding encoding unit to generate an output set of feature maps having a lower dimensionality than the input set of feature maps, wherein the first encoding unit is configured to process a neural network input representing a data map to generate a first output feature map, and each encoding unit comprises a recurrent convolutional block or a recurrent-residual convolutional unit; and a plurality of decoding units arranged in succession so that each decoding unit after a first decoding unit is configured to process a first input set of feature maps from a preceding decoding unit and a second input set of feature maps from a corresponding encoding unit to generate an output set of feature maps having a higher dimensionality than the first input set of feature maps.
  • Each encoding unit of the plurality of encoding units can include a recurrent convolutional block.
  • the recurrent convolutional block can include a plurality of forward recurrent convolutional layers.
  • Each encoding unit of the plurality of encoding units can include a recurrent-residual convolutional unit.
  • the recurrent-residual convolutional unit can include a plurality of recurrent convolution layers having residual connectivity.
  • the data map can be or include an input image.
  • the final feature map can be or include a segmentation map for the data map.
  • the system can further include a segmentation engine on the one or more data processing apparatuses, the segmentation engine configured to segment the data map using the segmentation map.
  • the final feature map can be or include a density heat map for the data map.
  • the data map can be an input image that depicts a slide of cells, and the neural network subsystem can be configured for use in a nuclei segmentation task to identify nuclei in the slide of cells.
  • the data map can be an input image that depicts a slide of cells, and the neural network subsystem can be configured for use in an epithelium segmentation task to identify epithelium in the slide of cells.
  • the data map can be an input image that depicts a slide of cells, and the neural network subsystem can be configured for use in a tubule segmentation task to identify tubules in the slide of cells.
  • Some implementations of the subject matter disclosed herein include methods for processing a data map with a neural network subsystem having a plurality of encoder units and a plurality of decoder units, each decoder unit corresponding to a different encoder unit.
  • the method can include actions of processing successive representations of the data map with the plurality of encoder units to generate a set of feature maps for the data map, each feature map having a lower dimensionality than the data map, each encoder unit comprising a recurrent convolutional block or a recurrent-residual convolutional unit; and upsampling the set of feature maps with the plurality of decoder units to generate a final feature map for the data map that has a higher dimensionality than feature maps in the set of feature maps.
  • Each encoding unit of the plurality of encoding units can include a recurrent convolutional block.
  • the recurrent convolutional block can include a plurality of forward recurrent convolutional layers.
  • Each encoding unit of the plurality of encoding units can include a recurrent-residual convolutional unit.
  • the recurrent-residual convolutional unit can include a plurality of recurrent convolution layers having residual connectivity.
  • the data map can be or include a medical image.
  • Some implementations of the subject matter disclosed herein include a system having one or more data processing apparatuses and an image segmentation neural network subsystem implemented on the data processing apparatus(es).
  • the image segmentation neural network subsystem can include one or more first convolutional layers, one or more inception recurrent residual convolutional neural network (IRRCNN) blocks, and one or more transition blocks.
  • IRRCNN inception recurrent residual convolutional neural network
  • Each IRRCNN block can include an inception unit and a residual unit, the inception unit including recurrent convolutional layers that are merged by concatenation, the residual unit configured to sum input features to the IRRCNN block with an output of the inception unit.
  • the neural network subsystem can be configured to process a data map to perform a classification task based on the data map.
  • the neural network subsystem can further include a softmax layer.
  • Some implementations of the subject matter disclosed herein include methods that include actions of obtaining a neural network input, the neural network input representing a data map; processing the neural network input with a neural network system to generate a classification for one or more items shown in the data map, the neural network system including one or more first convolutional layers, one or more inception recurrent residual convolutional (IRRCNN) blocks, and one or more transition blocks; and providing the classification for storage, processing, or presentation.
  • IRRCNN inception recurrent residual convolutional
  • FIG. 1 is a pictorial representation of an example of a densely connected recurrent convolutional (DCRC) block.
  • DCRC densely connected recurrent convolutional
  • FIGS. 3A-3C are images showing examples of three different types of cancer cells, including chronic lymphocytic leukemia (CLL) cells, follicular lymphoma (FL) cells, and mantle cell lymphoma (MCL) cells respectively.
  • CLL chronic lymphocytic leukemia
  • FL follicular lymphoma
  • MCL mantle cell lymphoma
  • FIGS. 4A-4C are images showing examples of non-overlapping patches from original samples.
  • FIG. 5 is a graph showing example values for training and validation accuracy for lymphoma classification for 40 epochs.
  • FIG. 6A is a graph showing examples of area under receiver operating characteristics (ROC) curve values for an image-based method.
  • FIG. 6B is a graph showing examples of area under ROC curve values for a patch-based method.
  • FIG. 7A shows images of example samples of tissue without invasive ductal carcinoma (IDC).
  • FIG. 7B shows images of example samples of tissue with IDC.
  • FIGS. 8A and 8B are diagrams showing examples of images of randomly-selected samples for first class samples and second class samples, respectively.
  • FIG. 9 is a graph showing examples of training and accuracy data for IDC classification.
  • FIG. 10 is a graph showing examples of area under ROC curve values for invasive ductal classification.
  • FIGS. 11A-11B are diagrams showing examples of images for randomly-selected samples from a nuclei segmentation dataset from ISMI-2017.
  • FIG. 12 is a graph showing example values for training and validation accuracy for nuclei segmentation.
  • FIGS. 13A-13C are diagrams showing examples of images for quantitative results for nuclei segmentation.
  • FIGS. 14A-14B are diagrams showing examples of images of database samples from Epithelium segmentation.
  • FIGS. 15A-15B are diagrams showing examples of images of database samples for epithelium segmentation.
  • FIGS. 16A-16C are diagrams showing examples of experimental outputs for epithelium segmentation.
  • FIG. 17 is a graph showing an example of a plot of an area under ROC curve for epithelium segmentation.
  • FIGS. 18A-18B are diagrams showing examples of images of database samples for Tubule segmentation.
  • FIGS. 19A-19D are diagrams showing examples of patches from an input sample used for training and testing.
  • FIGS. 20A-20D are diagrams of examples of images for quantitative results for tubule segmentation.
  • FIGS. 21A-21D are diagrams showing examples of images for quantitative results for tubule segmentation.
  • FIG. 22 is a graph showing an example of an ROC curve for Tubule segmentation.
  • FIGS. 23A and 23B are drawings of examples of input samples and label masks with single pixel annotation, respectively.
  • FIG. 24 is a graph showing examples of training and validation accuracy for lymphocyte detection.
  • FIGS. 25A-25D are diagrams showing examples of qualitative results for lymphocyte detection with UD-Net.
  • FIG. 26 is a diagram showing an example of an image from the dataset.
  • FIGS. 27-28 are drawings of examples of patches for non-mitosis and mitosis cells, respectively.
  • FIG. 29 shows a graph of example training and validation accuracy values for mitosis detection.
  • FIG. 31 is an overall layer flow diagram of a presently-disclosed IRRCNN.
  • FIG. 32 is a diagram of an example of an architecture for the Inception Recurrent Residual Convolutional Neural Network (IRRCNN) block.
  • FIG. 34 is a graph showing example training and validation accuracy values for IRRCNN, IRCNN, EIN, and EIRN on CIFAR-100.
  • FIG. 36 is a diagram showing examples of sample images from the TinyImageNet-200 dataset.
  • FIG. 40 is a diagram showing examples of images from the CU3D-100 dataset.
  • FIG. 46 is a diagram showing an example of an RU-Net architecture with convolutional encoding and decoding units using recurrent convolutional layers (RCL) which is based on a U-Net architecture.
  • RCL recurrent convolutional layers
  • FIGS. 50A and 50B are diagrams showing example patches and corresponding outputs, respectively.
  • FIGS. 51A and 51B are graphs showing examples of values for training and validation accuracy of the presently-disclosed RU-Net and R2U-Net models compared to the ResU-Net and U-Net models for 150 epochs.
  • FIGS. 52A-52C are diagrams showing examples of experimental outputs for three different datasets for retina blood vessel segmentation using R2UNet.
  • FIG. 53 is a diagram showing examples of AUC values for retina blood vessel segmentation for the best performance achieved with R2U-Net on three different datasets.
  • FIG. 55 is a diagram illustrating a qualitative assessment of the presently-disclosed R2U-Net for the skin cancer segmentation task.
  • FIG. 56 is a diagram showing experimental results for lung segmentation.
  • FIG. 59 is a diagram showing examples of testing errors of the R2U-Net
  • FIG. 62 is a diagram showing an Inception-Recurrent Convolutional Neural Network (IRCNN).
  • FIG. 65 is a graph showing examples of values for the training and validation loss of the IRCNN for both experiments using the CIFAR-100 dataset and data augmentation (with and without initialization and optimization).
  • FIG. 66 is a graph showing example values for the training and testing accuracy of the IRCNN with LSUV and EVE.
  • FIGS. 67 and 68 are graphs showing the model loss and accuracy for both training and validation phases, respectively.
  • FIG. 69 is a graph showing example values for the testing accuracy of IRCNN, EIN, and EIRN on CIFAR-100 dataset.
  • FIG. 70 is a diagram showing examples of images.
  • FIG. 71 is a graph showing example values for validation accuracy of IRCNN, EIRN, EIN, and RCNN.
  • the present disclosure applies advanced deep convolutional neural network (DCNN) techniques, including IRRCNN-, DCRN-, R2U-Net-, and R2U-Net-based regression models, for solving different DPIA problems that are evaluated on different publicly available benchmark datasets related to seven unique tasks of DPIA. These tasks include: invasive ductal carcinoma detection, lymphoma classification, nuclei segmentation, epithelium segmentation, tubule segmentation, lymphocyte detection, and mitosis detection. Details of these various networks (e.g., IRRCNN, R2U-Net, RU-Net) are described further below in this specification.
  • DCNN advanced deep convolutional neural network
  • DCRN Densely Connected Recurrent Convolutional Network
  • DCNs Densely Connected Networks
  • This architecture ensures the reuse of the features inside the model, providing better performance on different computer vision tasks, as empirically investigated on different datasets.
  • FIG. 1 is a pictorial representation of an example of a densely connected recurrent convolutional (DCRC) block.
  • x_l = H_l([x_0, x_1, x_2, …, x_(l-1)]) (1), where [x_0, x_1, x_2, …, x_(l-1)] includes the concatenated features from layers 0, …, l-1, treated as a single tensor.
  • x_l^f(i,j)(t) and x_l^r(i,j)(t-1) are the inputs to the standard convolution layers and the l-th recurrent convolution layers, respectively.
  • the w_k^f and w_k^r values are the weights of the standard convolutional layer and the recurrent convolutional layers of the l-th layer and k-th feature map, respectively, and b_k is the bias.
  • the recurrent convolution operations are performed with respect to t.
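  • A minimal Keras sketch of a recurrent convolutional layer of this kind is given below. The kernel size, filter count, and number of time steps are illustrative assumptions rather than values prescribed by the disclosure.

      from tensorflow.keras import layers

      def recurrent_conv_layer(x, filters, kernel_size=3, t=2):
          # Feed-forward convolution, applied once to the layer input.
          feed_forward = layers.Conv2D(filters, kernel_size, padding="same")(x)
          # Recurrent convolution whose weights are shared across time steps.
          recurrent = layers.Conv2D(filters, kernel_size, padding="same")
          state = layers.ReLU()(feed_forward)
          for _ in range(t):
              # Output at step t: feed-forward response plus the recurrent
              # convolution of the previous step's output, through ReLU.
              state = layers.ReLU()(layers.add([feed_forward, recurrent(state)]))
          return state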
  • the ground truth is created with a single pixel annotation, where the individual dot represents a cell.
  • Datasets that can be used in various implementations contain approximately five to five hundred nuclei, each annotated with a center pixel of the cell, in the input samples.
  • each dot is represented with a Gaussian density.
  • the R2U-Net model was applied to estimate the Gaussian densities from the input samples instead of computing the class or pixel-level probabilities that are considered for DL-based classification and segmentation models, respectively.
  • This model is named University of Dayton Network (UD-Net). For each input sample, a density surface D(x) is generated with superposition of the Gaussian values.
  • the objective is to regress this density surface for the corresponding input cell image I (x) .
  • the goal is achieved with the R2U-Net model by minimizing the mean squared error between the output heat maps and the target Gaussian density surface, which serves as the loss function for the regression problem.
  • the R2U-Net model computes the density heat maps D(x), as illustrated in the sketch below.
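  • As a rough illustration of this regression setup, the sketch below builds a Gaussian density surface from single-pixel annotations and compiles a heat-map regression model with a mean-squared-error loss. The sigma value is an assumed parameter, and build_r2unet is a hypothetical helper standing in for an R2U-Net-style encoder/decoder with a single-channel output.

      import numpy as np
      from scipy.ndimage import gaussian_filter

      def density_surface(dot_mask, sigma=3.0):
          # Superpose a Gaussian at every annotated center pixel to form the
          # target density surface D(x) for an input cell image I(x).
          return gaussian_filter(dot_mask.astype(np.float32), sigma=sigma)

      # `build_r2unet` is a hypothetical helper returning an R2U-Net-style
      # encoder/decoder with a single-channel linear output head.
      model = build_r2unet(input_shape=(256, 256, 3), out_channels=1)
      model.compile(optimizer="adam", loss="mse")  # regress heat maps onto D(x)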
  • Implementations can advance DL approaches such as IRRCNN, DCRCN, and R2U-Net.
  • the tasks can include, for example, lymphoma classification, invasive ductal carcinoma (IDC) detection, epithelium segmentation, tubule segmentation, nuclei segmentation, lymphocyte detection, and mitosis detection.
  • IDC Invasive ductal carcinoma
  • Keras and TensorFlow frameworks were used on a single GPU machine with 56 GB of RAM and an NVIDIA GeForce GTX-980 Ti.
  • FIGS. 4A-4C are images showing examples of non-overlapping patches from original samples.
  • the actual database sample and first five non-overlapping patches from the original images are shown in FIGS. 4A-4C.
  • the statistics of the original dataset and the number of samples after extracting non-overlapping patches are shown in Table 2.
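  • One simple way to extract non-overlapping patches of the kind shown in FIGS. 4A-4C is sketched below; the 64×64 patch size is only an illustrative assumption.

      import numpy as np

      def extract_nonoverlapping_patches(image, patch_h=64, patch_w=64):
          # Tile the image into non-overlapping patches, discarding partial
          # patches at the right and bottom borders.
          h, w = image.shape[:2]
          patches = []
          for top in range(0, h - patch_h + 1, patch_h):
              for left in range(0, w - patch_w + 1, patch_w):
                  patches.append(image[top:top + patch_h, left:left + patch_w])
          return np.stack(patches)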
  • the performance of IRRCNN model can be evaluated with two different approaches: an entire image-based approach, and a patch-based approach.
  • in the image-based approach, the original sample is resized to 256×256.
  • 8 and 32 samples per batch can be considered for the image-based and patch-based methods, respectively.
  • the Stochastic Gradient Descent (SGD) optimization method is used with an initial learning rate of 0.01. The model was trained for only 40 epochs, and after 20 epochs the learning rate was decreased by a factor of 10, as sketched below.
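  • A hedged Keras sketch of this training schedule (SGD with an initial learning rate of 0.01, trained for 40 epochs with the learning rate reduced by a factor of 10 after epoch 20) might look as follows; the momentum value is an assumption, and the model and data objects are placeholders.

      from tensorflow.keras.optimizers import SGD
      from tensorflow.keras.callbacks import LearningRateScheduler

      def step_decay(epoch, lr):
          # Drop the initial learning rate of 0.01 by a factor of 10 after epoch 20.
          return 0.01 if epoch < 20 else 0.001

      # `model`, `train_x`, `train_y`, `val_x`, `val_y` are placeholders.
      model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),  # momentum assumed
                    loss="categorical_crossentropy", metrics=["accuracy"])
      model.fit(train_x, train_y,
                batch_size=8,   # 8 per batch for image-based, 32 for patch-based
                epochs=40,
                validation_data=(val_x, val_y),
                callbacks=[LearningRateScheduler(step_decay)])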
  • FIG. 5 is a graph showing example values for training and validation accuracy for lymphoma classification for 40 epochs.
  • testing accuracy is computed with a testing dataset, which consists of samples entirely separate from the training samples. Testing accuracies of around 92.12% and 99.8% were achieved for the entire-image-based and patch-based methods, as shown in Table 3. From this evaluation, it can be concluded that, as the number of samples increases, the performance of the DL approach increases significantly. The highest accuracy is achieved with the patch-based method, which shows around 3.22% better performance compared to existing deep learning-based approaches for lymphoma classification.
  • FIG. 6A is a graph showing examples of area under ROC curve values for an image-based method.
  • FIG. 6B is a graph showing examples of area under ROC curve values for a patch-based method.
  • IDC Invasive Ductal Carcinoma
  • FIG. 7A shows images of example samples of tissue without IDC.
  • FIG. 7B shows images of example samples of tissue with IDC.
  • the IRRCNN model with four IRRCNN and transition blocks was used for implementations of the present disclosure.
  • FIG. 9 is a graph showing examples of training and accuracy data for invasive ductal carcinoma (IDC) classification.
  • FIG. 10 is a graph showing examples of area under ROC curve values for invasive ductal classification.
  • the testing results show an area under the ROC curve of around 0.9573.
  • the total testing time for 31,508 samples is 109.585 seconds. Therefore, the testing time per sample is approximately 0.0035 seconds.
  • nuclei segmentation is a very important problem in the field of digital pathology for several reasons.
  • an R2U-Net architecture (1 ⇒ 32 ⇒ 64 ⇒ 128 ⇒ 256 ⇒ 128 ⇒ 64 ⇒ 32 ⇒ 1) was used with 4M network parameters.
  • FIGS. 11A-11B are diagrams showing examples of images for randomly-selected samples from a nuclei segmentation dataset from ISMI-2017.
  • An Adam optimizer with a learning rate of 2e-4 can be applied with a cross entropy loss, a batch size of two, and 1000 epochs.
  • FIG. 12 is a graph showing example values for training and validation accuracy for nuclei segmentation. From FIG. 12, it can be observed that the model shows very high accuracy for training; however, around 98% accuracy was achieved during the validation phase.
  • the method shows around 97.70% testing accuracy on the testing dataset, which comprises 20% of the total samples.
  • the experimental results are shown in Table 7. From Table 7, it can be seen that around 3.31% better performance can be achieved compared to existing deep learning-based approaches for nuclei segmentation on the same dataset.
  • FIGS. 14A-14B are diagrams showing examples of images of database samples from Epithelium segmentation.
  • FIG. 14A shows the input samples
  • FIG. 14B shows the corresponding binary masks for input samples.
  • FIGS. 15A-15B are diagrams showing examples of images of database samples for epithelium segmentation.
  • FIG. 15A shows an input sample and the ground truth of the corresponding samples.
  • FIG. 15B shows the extracted non-overlapping patches for the input images and output masks.
  • FIGS. 16A-16C are diagrams showing examples of experimental outputs for Epithelium segmentation.
  • FIG. 16A shows input samples
  • FIG. 16B shows the ground truth
  • FIG. 16C shows the model outputs.
  • FIG. 17 is a graph showing an example of a plot of an area under ROC curve for epithelium segmentation. Analysis of FIG. 17 shows that the results achieved a 92.02% area under the ROC curve for epithelium segmentation.
  • the R2U-Net is applied for Epithelium segmentation from whole slide images (WSI).
  • the experiment has been done with an epithelium segmentation dataset, and 90.50% and 92.54% were achieved for the F1-score and accuracy, respectively.
  • a conclusion can be made that the qualitative results demonstrate very accurate segmentation compared to the ground truth.
  • the R2U-Net model was used, which is an end-to-end model consisting of encoding and decoding units.
  • the total number of network parameters can be, for example, 1.107 million.
  • FIGS. 18A-18B are diagrams showing examples of patches from an input sample used for training and testing. 80% of the patches were used for training, and the remaining 20% were used for testing.
  • Database samples for tubule segmentation include an input sample (FIG. 19A) and the ground truth (FIG. 19B) of the corresponding samples. Extracted non-overlapping patches for the input images and output masks are shown in FIGS. 19C and 19D, respectively.
  • the Adam optimizer was applied with a learning rate of 2e-4 and a cross entropy loss.
  • a batch size of 16 and a number of epochs of 500 were used during training for tubule segmentation.
  • FIGS. 20A-20D are diagrams of examples of images for quantitative results for tubule segmentation.
  • FIG. 20A shows the input samples.
  • FIG. 20B shows the label masks.
  • FIG. 20C shows the model outputs.
  • FIG. 20D shows only the tubule part from benign images.
  • FIGS. 21A-21D are diagrams showing examples of images for quantitative results for tubule segmentation.
  • FIG. 21A shows the input samples.
  • FIG. 21B shows the label masks.
  • FIG. 21C shows the model outputs.
  • FIG. 21D shows only the tubule part from benign images.
  • FIG. 22 is a graph showing an example of an ROC curve for tubule segmentation. Analysis shows that 90.45% can be achieved for the area under the ROC curve, as shown in FIG. 22.
  • the R2U-Net is applied for tubule segmentation from whole slide images (WSI). The performance of R2U-Net is analyzed on a publicly available dataset for tubule segmentation. 90.13% and 90.31% were achieved for the F1-score and the accuracy, respectively. Qualitative results demonstrate very accurate segmentation compared to the ground truth.
  • Lymphocytes are a very important part of the human immune system and a subtype of white blood cell (WBC). This type of cell is used in determining different types of cancer, such as breast cancer and ovarian cancer.
  • WBC white blood cell
  • the dataset can be taken from published papers.
  • the total number of samples is 100, with 100 center-pixel-annotated masks.
  • the size of each image is 100×100. 90% of the patches were used for training, and the remaining 10% were used for testing.
  • An Adam optimizer was applied, with a learning rate of 2e-4 and cross entropy loss.
  • a batch size of 32 and a number of epochs of 1000 were used.
  • FIG. 24 is a graph showing examples of training and validation accuracy for lymphocyte detection.
  • FIGS. 25A-25D are diagrams showing examples of qualitative results for lymphocyte detection with UD-Net.
  • FIG. 25A shows input samples
  • FIG. 25B shows ground truth
  • FIG. 25C shows the model outputs
  • FIG. 25D shows the final outputs, where the blue dots are for the ground truth and the green dots are for the model outputs.
  • the R2U-Net-based model is applied to whole slide images (WSI) for lymphocyte detection. 90.23% accuracy for lymphocyte detection was achieved. A conclusion can be made that the qualitative results demonstrate very accurate detection compared to the ground truth.
  • the cell growth rate can be determined by counting mitotic events in pathological images, which is an important aspect of determining the aggressiveness of a breast cancer diagnosis.
  • the manual counting process applied in pathological practice is extremely difficult and time consuming. Therefore, an automatic mitosis detection approach has application in pathological practice.
  • FIG. 26 is a diagram showing an example of an image from the dataset.
  • the images in FIG. 26 include images for an input, an actual mask, a dilated mask, and a mask with target mitosis. As the number of mitosis cells is very small, different augmentation approaches were applied with rotations of {0, 45, 90, 135, 180, 215, 270} degrees, as sketched below. In one study, 32×32 patches were extracted from the input images, with the total number of patches being 728,073. From the patches, 100,000 patches were randomly selected, with 80,000 patches being used for training and the remaining 20,000 patches used for testing.
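  • The rotation-based augmentation mentioned above can be sketched roughly as below, using SciPy's generic image rotation; this is illustrative only, and the `patches` array is a placeholder for the extracted 32×32 patches.

      import numpy as np
      from scipy.ndimage import rotate

      ANGLES = [0, 45, 90, 135, 180, 215, 270]  # rotation angles listed in the text

      def augment_patch(patch):
          # Return one rotated copy of the patch per angle.
          return [rotate(patch, angle, reshape=False, mode="reflect") for angle in ANGLES]

      augmented = [aug for patch in patches for aug in augment_patch(patch)]  # `patches` is a placeholder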
  • FIGS. 27-28 are drawings of examples of patches for non-mitosis and mitosis cells, respectively.
  • the testing phase achieved 99.54% testing accuracy for mitosis detection.
  • the experimental results show 99.68% area under ROC curve. It takes 138.94 seconds for 20,000 samples.
  • the experimental results show higher recognition accuracy against most of the popular DCNN models including the RCNN.
  • the performance of the IRRCNN approach was also investigated against the Equivalent Inception Network (EIN) and the Equivalent Inception Residual Network (EIRN) counterpart on the CIFAR-100 dataset. Improvement in classification accuracy of around 4.53%, 4.49% and 3.56% were reported as compared with the RCNN, EIN, and EIRN on the CIFAR-100 dataset respectively.
  • the experiment has been conducted on the TinyImageNet-200 and CU3D-100 datasets where the IRRCNN provides better testing accuracy compared to the Inception Recurrent CNN (IRCNN), the EIN, and the EIRN.
  • FIG. 30 is a diagram of a visual information processing pipeline of the human brain.
  • v1 through v4 represent the visual cortex areas.
  • the visual cortex areas v1 through v4 process information using recurrent techniques.
  • FIG. 31 is an overall layer flow diagram of a presently-disclosed IRRCNN.
  • the IRRCNN includes the IRRCNN-Block, the IRRCNN-Transition block, and the Softmax layer at the end.
  • the present disclosure provides an improved DCNN architecture based on Inception, Residual networks and the RCNN architecture. Therefore, the model can be called the Inception Recurrent Residual Convolutional Neural Network (IRRCNN).
  • IRRCNN Inception Recurrent Residual Convolutional Neural Network
  • An objective of this model is to improve recognition performance using the same number or fewer computational parameters when compared to alternative equivalent deep learning approaches.
  • the inception-residual units utilized are based on Inception-v4.
  • the Inception-v4 network is a deep learning model that concatenates the outputs of the convolution operations with different sized convolution kernels in the inception block.
  • Inception-v4 is a simplified structure of Inception-v3, containing more inception modules using lower rank filters.
  • Inception-v4 includes a residual concept in the inception network called the Inception-v4 Residual Network, which improves the overall accuracy of recognition tasks.
  • the outputs of the inception units are added to the inputs of the respective units.
  • the overall structure of the presently-disclosed IRRCNN model is shown in FIG. 31. From FIG. 31, it can be seen that the overall model consists of several convolution layers, IRRCNN blocks, transition blocks, and a Softmax at the output layer.
  • FIG. 32 is a diagram of an example of an architecture for the Inception Recurrent Residual Convolutional Neural Network (IRRCNN) block.
  • the block consists of the inception unit at the top which contains recurrent convolutional layers that are merged by concatenation, and the residual units. A summation of the input features with the outputs of the inception unit can be seen at the end of the block.
  • IRRCNN Inception Recurrent Residual Convolutional Neural Network
  • a part of this presently-disclosed architecture is the IRRCNN block that includes RCLs, inception units, and residual units (shown in detail in FIG. 32).
  • the inputs are fed into the input layer, then passed through inception units where RCLs are applied, and finally the outputs of the inception units are added to the inputs of the IRRCNN-block.
  • the recurrent convolution operations perform with respect to the different sized kernels in the inception unit. Due to the recurrent structure within the convolution layer, the outputs at the present time step are added with the outputs of previous time step. The outputs at the present time step are then used as inputs for the next time step. The same operations are performed with respect to the time steps that are considered.
  • f is the standard Rectified Linear Unit (ReLU) activation function.
  • the performance of this model was also explored with the Exponential Linear Unit (ELU) activation function in the following experiments.
  • the outputs y of the inception units for the different sized kernels and the average pooling layer are defined as y_1×1(x), y_3×3(x), and y_1×1^p(x), respectively.
  • the final outputs of the Inception Recurrent Convolutional Neural Network (IRCNN) unit are defined as T(x_l, w_l), which can be expressed as T(x_l, w_l) = y_1×1(x) ⊙ y_3×3(x) ⊙ y_1×1^p(x).
  • ⊙ represents the concatenation operation with respect to the channel or feature map axis.
  • the outputs of the IRCNN-unit are then added with the inputs of the IRRCNN-block.
  • the residual operation of the IRRCNN-block can be expressed by the following equation: x_(l+1) = x_l + T(x_l, w_l), where
  • x_(l+1) refers to the inputs for the immediate next transition block,
  • x_l represents the input samples of the IRRCNN-block,
  • w_l represents the kernel weights of the l-th IRRCNN-block, and
  • T(x_l, w_l) represents the outputs of the l-th layer of the IRCNN-unit.
  • the number of feature maps and the dimensions of the feature maps for the residual units are the same as in the IRRCNN-block shown in FIG. 32. Batch normalization is applied to the outputs of the IRRCNN-block. Eventually, the outputs of this IRRCNN-block are fed to the inputs of the immediate next transition block.
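  • A minimal Keras sketch of an IRRCNN-style block consistent with this description is given below: recurrent convolutional layers with different kernel sizes and an average-pooling branch are merged by concatenation (the inception unit), the block input is added back to that output (the residual unit), and batch normalization is applied to the result. The filter counts, the number of time steps, and the 1×1 projection used to match channel counts before the residual addition are illustrative assumptions.

      from tensorflow.keras import layers

      def _rcl(x, filters, kernel_size, t=2):
          # Recurrent convolutional layer: add the recurrent convolution of the
          # previous step to the feed-forward response, for t time steps.
          ff = layers.Conv2D(filters, kernel_size, padding="same")(x)
          rec = layers.Conv2D(filters, kernel_size, padding="same")
          state = ff
          for _ in range(t):
              state = layers.ReLU()(layers.add([ff, rec(state)]))
          return state

      def irrcnn_block(x, filters=64, t=2):
          # Inception unit: RCL branches with 1x1 and 3x3 kernels plus an
          # average-pooling branch, merged by concatenation.
          branch_1x1 = _rcl(x, filters, 1, t)
          branch_3x3 = _rcl(x, filters, 3, t)
          branch_pool = layers.Conv2D(filters, 1, padding="same")(
              layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(x))
          inception_out = layers.concatenate([branch_1x1, branch_3x3, branch_pool])
          # Project back to the input channel count so the residual sum is valid
          # (an assumption of this sketch), then add the block input.
          inception_out = layers.Conv2D(x.shape[-1], 1, padding="same")(inception_out)
          out = layers.add([x, inception_out])      # residual unit
          return layers.BatchNormalization()(out)   # BN on the block output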
  • transition block different operations are performed including convolution, pooling, and dropout, depending upon the placement of the transition block in the network.
  • Inception units were not included in the transition block in the small-scale implementation for CIFAR-10 and CIFAR-100. However, inception units were applied to the transition block during the experiment using the TinyImageNet-200 dataset and for the large-scale model, which is the equivalent model of Inception-v3.
  • the down-sampling operations are performed in the transition block where max-pooling operations are performed with a 3x3 patch and a 2x2 stride.
  • the non-overlapping max-pooling operation has a negative impact on model regularization. Therefore, overlapped max-pooling was used for regularizing the network, which is very important when training a deep network architecture.
  • Late use of a pooling layer helps to increase the non-linearity of the features in the network, as this results in higher dimensional feature maps being passed through the convolution layers in the network.
  • Two special pooling layers were applied in the model with three IRRCNN-blocks and a transition-block for the experiments that use the CIFAR-10 or CIFAR-100 dataset.
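  • A short sketch of a transition block along these lines (convolution, overlapped max-pooling with a 3×3 window and 2×2 stride, and dropout) is shown below; the dropout rate and filter count are assumptions.

      from tensorflow.keras import layers

      def transition_block(x, filters, dropout_rate=0.5):
          # Convolution, overlapped max-pooling (3x3 window, 2x2 stride), dropout.
          x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
          x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
          return layers.Dropout(dropout_rate)(x)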
  • the presently-disclosed IRRCNN model has been evaluated using four different benchmark datasets: CIFAR-10, CIFAR-100, TinyImageNet-200, and CU3D-100.
  • the dataset statistics are provided in Table 12. Different validation and testing samples were used for the TinyImageNet-200 dataset. The entire experiment was conducted in a Linux environment running on a single GPU machine with an NVIDIA GTX-980Ti.
  • the values β_1, β_2 ∈ [0,1] are exponential decay rates for moment estimation in Adam.
  • the value β_3 ∈ [0,1) is an exponential decay rate for computing relative changes.
  • the IRRCNN-block uses the L2-norm for weight regularization with a coefficient of 0.002.
  • the ReLU activation function was used in the first experiment, and the ELU activation was used in the second experiment. In both experiments, the networks were trained for 350 epochs with a batch size of 128 for CIFAR-10 and CIFAR-100.
  • the CIFAR-10 dataset is a benchmark dataset for object classification.
  • the dataset consists of 32x32 color images split into 50,000 samples for training, and the remaining 10,000 samples are used for testing (classification into one of 10 classes).
  • the experiment was conducted with and without data augmentation. When using data augmentation, only random horizontal flipping was applied. This approach achieved around 8.41% testing error without data augmentation and 7.37% testing error with augmented data (only horizontal flipping) using SGD techniques.
  • Table 13 Testing error (%) of the IRRCNN on CIFAR-10 object classification dataset without and with data augmentation.
  • the dataset contains 50,000 samples for training and 10,000 samples for validation and testing. Each sample is a 32x32x3 image, and the dataset has 100 classes.
  • the presently-disclosed IRRCNN model was studied with and without data augmentation. During the experiment with augmented data, the SGD and LSUV initialization approaches and the EVE optimization function were used. In both cases, the presently-disclosed technique shows better recognition accuracy compared with different DCNN models including the IRCNN. Examples of values for the validation accuracy of the IRRCNN model for both experiments on CIFAR-100 with data augmentation are shown in FIG. 34.
  • the presently-disclosed IRRCNN model shows better performance in the both experiments when compared to the IRCNN, EIN, and EIRN models.
  • the experimental results when using CIFAR-100 are shown in Table 14.
  • the IRRCNN model provides better testing accuracy compared to many recently developed methods. A recognition accuracy of 72.78% was achieved with LSUV+EVE, which is around a 4.49% improvement compared to one of the baseline RCNN methods with almost the same number of parameters (~3.5M).
  • Table 14 shows a testing error (%) of the IRRCNN on the CIFAR-100 object classification dataset without and with data augmentation (DA). For unbiased comparison, the accuracy provided by recent studies is listed in a similar experimental setting.
  • Cl 00 refers to without data augmentation
  • Cl 00+ refers to with data augmentation.
  • the LSUV initialization approach was applied to the FitNet4 DCNN architecture.
  • FitNet4 achieved 70.04% classification accuracy on augmented data with mirroring and random shifts for CIFAR-100.
  • here, only random horizontal flipping was applied for data augmentation, and around 1.76% better recognition accuracy was achieved against FitNet4.
  • FIG. 34 is a graph showing example training and validation accuracy values for IRRCNN, IRCNN, EIN, and EIRN on CIFAR-100.
  • the vertical and horizontal axes represent accuracy and epochs, respectively.
  • the presently-disclosed model shows the best recognition accuracy in all cases.
  • The model accuracy for both training and validation is shown in FIG. 34. From the figures, it is observed that the presently-disclosed model shows lower loss and higher recognition accuracy compared to EIN and EIRN, which demonstrates the benefit of the presently-disclosed models.
  • FIG. 35 is a graph showing examples of values for testing accuracy of the IRRCNN model against IRCNN, EIN, and EIRN on the augmented CIFAR-100 dataset. It can be summarized that the presently-disclosed IRRCNN provides around 1.02%, 4.49%, and 3.56% improved testing accuracy compared to IRCNN, EIN, and EIRN, respectively.
  • FIG. 36 is a diagram showing examples of sample images from the TinyImageNet-200 dataset.
  • in the IRRCNN model, two general convolution layers with 3×3 kernels were used at the beginning of the network, followed by a sub-sampling layer with a 3×3 convolution using a stride of 2×2.
  • four IRRCNN blocks are used followed by four transition blocks.
  • a global average pooling layer is used followed by a Softmax layer.
  • FIGS. 37 A and 37B are graphs showing examples of accuracy values during training and validation, respectively, for the TinyImageNet-200 dataset.
  • the graphs can result from experimentation with the IRRCNN, IRCNN, equivalent RCNN, EIN, and EIRN using the TinyImageNet-200 dataset.
  • the presently-disclosed IRRCNN model provides better recognition accuracy during training compared to equivalent models including IRCNN, EIN, and EIRN with almost the same number of network parameters (~15M).
  • 15M network parameters
  • DCNN takes a lot of time and power when training a reasonably large model.
  • the Inception-Residual networks with RCLs significantly reduce training time with faster convergence and better recognition accuracy.
  • FIG. 38 is a graph showing examples of values for validation accuracy for various models on the Tiny-ImageNet dataset.
  • FIG. 39 is a graph showing examples of values for the top-1% and top-5% testing accuracy on the TinyImageNet-200 dataset. From the bar graph, the impact of recurrent connectivity is clearly observed, and a top-1% testing accuracy of 52.23% was achieved, whereas the EIRN and EIN show 51.14% and 45.63% top-1% testing accuracy. The same behavior is observed for top-5% accuracy as well.
  • the IRRCNN provides better testing accuracy when compared against all other models in both cases, which clearly demonstrates the robustness of the presently-disclosed deep learning architecture.
  • FIG. 40 is a diagram showing examples of images from the CU3D-100 dataset.
  • the images in this dataset are three-dimensional views of real-world objects normalized for different positions, orientations, and scales.
  • the rendered images have a 40° depth rotation about the y-axis (plus a horizontal flip), a 20° tilt rotation about the x-axis, and an 80° overhead lighting rotation. 75% of the images were used for training, and the remaining 25% of images were used for testing, which were selected randomly from the whole dataset.
  • IRRCNN, Inception-v3, and WRN, respectively.
  • the IRRCNN model shows 0.68% and 1.30% higher testing accuracy compared to Inception-v3 and WRN, respectively.
  • a recently published paper with sparse neural networks with recurrent layers reported about 94.6% testing accuracy on the CU3D dataset, which is around 5.24% less testing accuracy compared to this presently-disclosed IRRCNN model.
  • the pre-trained ImageNet weights are used as initial weights for IRRCNN and Inception-v3 models, and only a few layers from the top of the models were trained. The pre-trained weights were taken from GitHub.
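  • The transfer-learning setup just described (pre-trained ImageNet weights with only a few layers from the top of the model trained) can be sketched roughly as follows for the Inception-v3 baseline; the number of frozen layers, the input size, and the classifier head are illustrative assumptions.

      from tensorflow.keras.applications import InceptionV3
      from tensorflow.keras import layers, models

      base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
      for layer in base.layers[:-30]:          # freeze all but a few top layers (count assumed)
          layer.trainable = False

      head = layers.GlobalAveragePooling2D()(base.output)
      outputs = layers.Dense(100, activation="softmax")(head)  # 100 classes for CU3D-100
      model = models.Model(base.input, outputs)
      model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])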
  • split ratio 0.9, 0.7, 0.5, 0.3, and 0.1
  • the numbers of training and validation samples are taken according to the split ratio, where the number of training samples increases and the number of validation samples decreases across the respective trials.
  • a split ratio of 0.9 refers to only 10% of samples (1423) being used for training, and the remaining 90% of samples (12815) are used for validation for the first trial.
  • a split ratio of 0.7 means that 30% of the samples are used for training and the remaining 70% samples are used for validation for second trial, and so on.
  • FIGS. 43A and 43B are graphs showing examples of errors versus split ratio for five different trials on the CU3D-100 dataset for training and validation, respectively. It can also be observed from FIGS. 42A-42B that the models have converged after 22 epochs. Therefore, in each trial, 25 epochs were considered, and the errors here match the average training and validation errors of the last five epochs.
  • FIGS. 43A-43B show that the presently- disclosed IRRCNN model shows fewer training and validation errors for five different trials in both cases.
  • FIG. 44 is a graph showing examples of values for testing accuracy for different trials on the CU3D-100 dataset. In each trial, the models have been tested with the remaining 25% of the samples, and the testing errors are shown in FIG. 44. From FIG. 44, it is seen that R2U-Net shows the lowest error for almost all trials compared to Inception-v3 and WRN.
  • III - Recurrent Residual U-Net for Medical Image Segmentation
  • the present disclosure describes using a Recurrent U-Net as well as a Recurrent Residual U-Net model, which are named RU-Net and R2U-Net respectively.
  • the presently- disclosed models utilize the power of U-Net, Residual Networks, and Recurrent Convolutional Neural Networks (RCNNs).
  • RCNNs Recurrent Convolutional Neural Networks
  • the presently-disclosed models are tested on three benchmark datasets such as blood vessel segmentation in retina images, skin cancer segmentation, and lung lesion segmentation.
  • the experimental results show superior performance on segmentation tasks compared to equivalent models including a variant of a fully connected convolutional neural network (FCN) called SegNet, U-Net, and the residual U-Net (ResU-Net).
  • FCN fully connected convolutional neural network
  • FIGS. 45A-45C are diagrams showing medical image segmentation examples.
  • FIG. 45A shows retina blood vessel segmentation.
  • FIG. 45B shows skin cancer lesion segmentation.
  • FIG. 45C shows lung segmentation.
  • the present disclosure can be applied in different modalities of medical imaging including segmentation, classification, detection, registration, and medical information processing.
  • the medical imaging comes from different imaging techniques such as Computed Tomography (CT), ultrasound, X-ray, and Magnetic Resonance Imaging (MRI).
  • CT Computed Tomography
  • MRI Magnetic Resonance Imaging
  • CAD Computer-Aided Diagnosis
  • This specification discloses two modified and improved segmentation models, one using recurrent convolution networks, and another using recurrent residual convolutional networks.
  • the presently-disclosed models can be evaluated on different modalities of medical imaging as shown in FIG. 45.
  • the present disclosure provides at least two deep-learning models, including RU-Net and R2U-Net.
  • experiments are conducted on three different modalities of medical imaging including retina blood vessel segmentation, skin cancer segmentation, and lung segmentation.
  • performance evaluation of the presently-disclosed models is conducted with the patch-based method for retina blood vessel segmentation tasks and the end-to-end image-based approach for skin lesion and lung segmentation tasks.
  • the network consists of two main parts: the convolutional encoding and decoding units.
  • the basic convolution operations are performed followed by ReLU activation in both parts of the network.
  • 2x2 max-pooling operations are performed.
  • the convolution transpose (representing up-convolution, or de-convolution) operations are performed to up-sample the feature maps.
  • the U-Net model provides several advantages for segmentation tasks. First, this model allows for the use of global location and context at the same time. Second, it works with very few training samples and provides better performance for segmentation tasks. Third, an end-to-end pipeline processes the entire image in the forward pass and directly produces segmentation maps. This ensures that U-Net preserves the full context of the input images, which is a major advantage when compared to patch-based segmentation approaches.
  • x_l^f(i,j)(t) and x_l^r(i,j)(t-1) are the inputs to the standard convolution layers and the l-th RCL, respectively.
  • the w_k^f and w_k^r values are the weights of the standard convolutional layer and the RCL of the k-th feature map, respectively, and b_k is the bias.
  • the outputs of the RCL are fed to the standard ReLU activation function f and are expressed as F(x_l, w_l) = f(O_(ijk)^l(t)), the RCL output at pixel (i, j) of the k-th feature map at time step t.
  • T(x_l, w_l) represents the outputs of the l-th layer of the RCNN unit.
  • the output of T(x_l, w_l) is used for the downsampling and upsampling layers in the convolutional encoding and decoding units of the RU-Net model, respectively.
  • the final outputs of the RCNN unit are passed through the residual unit that is shown in FIG. 47D. Considering the output of the RRCNN-block to be x_(l+1), it can be calculated as x_(l+1) = x_l + T(x_l, w_l).
  • x_l represents the input samples of the RRCNN-block.
  • the x_(l+1) sample is the input for the immediately succeeding subsampling or up-sampling layers in the encoding and decoding convolutional units of the R2U-Net model.
  • the number of feature maps and the dimensions of the feature maps for the residual units are the same as in the RRCNN-block shown in FIG. 47D.
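  • Consistent with the relation x_(l+1) = x_l + T(x_l, w_l), a compact Keras sketch of an RRCNN-style block is given below; the number of stacked RCLs, the time steps, and the 1×1 convolution used to match channel counts for the residual sum are assumptions of this sketch.

      from tensorflow.keras import layers

      def rrcnn_block(x, filters, t=2):
          # Match the channel count of the block input for the residual sum.
          skip = layers.Conv2D(filters, 1, padding="same")(x)
          out = skip
          for _ in range(2):                   # two stacked RCLs form T(x_l, w_l)
              ff = layers.Conv2D(filters, 3, padding="same")(out)
              rec = layers.Conv2D(filters, 3, padding="same")
              state = ff
              for _ in range(t):
                  state = layers.ReLU()(layers.add([ff, rec(state)]))
              out = state
          return layers.add([skip, out])       # x_(l+1) = x_l + T(x_l, w_l)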
  • the presently-disclosed deep learning models are the building blocks of the stacked convolutional units shown in FIGS. 47B and 47D.
  • the basic convolutional unit of this model is shown in FIG. 47A.
  • the U-Net model with forward convolutional layers with residual connectivity, which is often called a residual U-Net (or ResU-Net), is shown in FIG. 47C.
  • the third architecture is the U-Net model with forward recurrent convolutional layers as shown in FIG. 47B, which is named RU-Net.
  • the last architecture is the U-Net model with recurrent convolution layers with residual connectivity as shown in FIG. 47D, which is named R2U-Net.
  • the pictorial representation of the unfolded RCL layers with respect to time step is shown in FIGS. 48A-48B.
  • concatenation to the feature maps from the encoding unit to the decoding unit can be applied for the RU-Net and R2U-Net models.
  • the U-Net model only shows the benefit during the training process in the form of better convergence.
  • the presently-disclosed models show benefits for both training and testing phases due to the feature accumulation inside the model.
  • the feature accumulation with respect to different time steps ensures better and stronger feature representation.
  • the cropping and copying unit can be removed from the basic U-Net model, using only concatenation operations. Therefore, with all the above-mentioned changes, the presently-disclosed models are much better compared to equivalent SegNet, U-Net, and ResU-Net models, which ensures better performance with the same or a smaller number of network parameters.
  • several advantages are provided by the presently-disclosed architectures when compared to U-Net.
  • the first is the efficiency in terms of the number of network parameters.
  • the presently-disclosed RU-Net and R2U-Net architectures are designed to have the same number of network parameters when compared to U-Net and ResU-Net, and the RU-Net and R2U-Net models show better performance on segmentation tasks.
  • the recurrent and residual operations do not increase the number of network parameters. However, they do have a significant impact on training and testing performance which is shown through empirical evaluation with a set of experiments. This approach is also generalizable, as it can easily be applied to deep learning models based on SegNet, 3D-UNet, and V-Net with improved performance for segmentation tasks.
  • FIGS. 49A-49C are diagrams showing example images from training datasets.
  • the image in FIG. 49A was taken from the DRIVE dataset.
  • the image in FIG. 49B was taken from the STARE dataset.
  • the image from FIG. 49C was taken from the CHASE-DB1 dataset.
  • FIG. 49A shows the original images.
  • FIG. 49B shows the fields of view (FOV).
  • FIG. 49C shows the target outputs.
  • each convolutional block consists of two or three RCLs, where 3×3 convolutional kernels are applied, followed by ReLU activation layers and a batch normalization layer. For down-sampling, a 2×2 max-pooling layer followed by a 1×1 convolutional layer is used between the convolutional blocks.
  • each block consists of a convolutional transpose layer followed by two convolutional layers and a concatenation layer.
  • the concatenation operations are performed between the features in the encoding and decoding units in the network.
  • the features are then mapped to a single output feature map where 1×1 convolutional kernels are used with a sigmoid activation function.
  • the segmentation region is generated with a threshold (T) which is empirically set to 0.5 in the experiment, as sketched below.
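  • In code, the final 1×1 convolution with a sigmoid activation and the threshold T = 0.5 can be expressed as in the sketch below; `features`, `model`, and `test_images` are placeholders.

      import numpy as np
      from tensorflow.keras import layers

      # Map the last decoder feature maps to a one-channel probability map.
      prob_map = layers.Conv2D(1, kernel_size=1, activation="sigmoid")(features)

      # At test time, binarize the predicted probabilities with threshold T = 0.5.
      segmentation = (model.predict(test_images)[..., 0] > 0.5).astype(np.uint8)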
  • the architecture shown in the fourth row is used for retina blood vessel segmentation on the DRIVE dataset, as well as skin cancer segmentation. Also, the SegNet model was implemented with a similar architecture and a similar number of feature maps for impartial comparison in the cases of skin cancer lesions and lung segmentation. The architecture can be written as
  • each convolutional block contains three convolutional layers and a batch normalization layer which requires a total of 14.94M network parameters.
  • the architecture can be written as 1 ⇒ 32(3) ⇒ 64(3) ⇒ 128(3) ⇒ 256(3) ⇒ 128(3) ⇒ 64(3) ⇒ 32(3) ⇒ 1 for the SegNet model (three convolutional layers and a batch normalization layer are used in each block), which requires a total of 1.7M network parameters.
  • DRIVE retina blood vessel segmentation
  • STARE retina blood vessel segmentation
  • CHASE_DB1 retina blood vessel segmentation
  • the DRIVE dataset consists of 40 color retina images in total, of which 20 samples are used for training and the remaining 20 samples are used for testing.
  • the size of each original image is 565x584 pixels.
  • the images were cropped to only contain the data from columns 9 through 574, which then makes each image 565x565 pixels.
  • 190,000 randomly selected patches from 20 of the images in the DRIVE dataset were considered, where 171,000 patches were used for training, and the remaining 19,000 patches were used for validation.
  • FIGS. 50A and 50B are diagrams showing example patches and corresponding outputs, respectively.
  • the size of each patch is 48x48 for all three datasets shown in FIGS. 50A-50B.
  • the second dataset, STARE contains 20 color images, and each image has a size of 700x605 pixels. Due to the small number of samples in the STARE dataset, two approaches are often applied for training and testing when using this dataset. First, training is sometimes performed with randomly selected samples from all 20 images.
  • Another approach is the "leave-one-out" method, where in each trial one image is selected for testing and training is conducted on the remaining 19 samples. Therefore, there is no overlap between training and testing samples.
  • the "leave-one-out" approach can be used for the STARE dataset.
  • the CHASE DBl dataset contains 28 color retina images, and the size of each image is 999x960 pixels. The images in this dataset were collected from both left and right eyes of 14 school children. The dataset is divided into two sets where samples are selected randomly. A 20-sample set is used for training and the remaining 8 samples are used for testing.
  • the Skin Cancer Segmentation dataset is taken from the Kaggle competition on skin lesion segmentation that occurred in 2016. This dataset contains 900 images along with associated ground truth samples for training. Another set of 379 images is provided for testing. The original size of each sample was 700x900, which was rescaled to 128x128 for this implementation.
  • the training samples include the original images, as well as corresponding target binary images containing cancer or non-cancer lesions. The target pixels are set to a value of either 255 or 0, denoting pixels inside or outside the target lesion respectively.
  • DC and JA are calculated using the following Equations 18 and 19, respectively.
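Equations 18 and 19 are not reproduced in this extracted text. The standard definitions of the Dice coefficient (DC) and Jaccard index (JA) for a ground-truth region GT and a segmented region SR, which the cited equations are presumed to follow, are:

```latex
\mathrm{DC} = \frac{2\,\lvert GT \cap SR \rvert}{\lvert GT \rvert + \lvert SR \rvert},
\qquad
\mathrm{JA} = \frac{\lvert GT \cap SR \rvert}{\lvert GT \cup SR \rvert}
```

The two metrics are related by DC = 2·JA / (1 + JA).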
  • FIGS. 51A and 51B are graphs showing examples of values for training and validation accuracy of the presently-disclosed RU-Net and R2U-Net models compared to the ResU-Net and U-Net models for 150 epochs.
  • in FIGS. 51A-51B, the training and validation accuracy values shown were obtained using the DRIVE dataset.
  • the presently-disclosed R2U-Net and RU-Net models provide better performance during both the training and validation phase when compared to the U-Net and ResU-Net models.
  • Quantitative results are achieved with the four different models using the DRIVE dataset, and the results are shown in Table 17.
  • the overall accuracy and AUC are considered when comparing the performance of the presently-disclosed methods in most cases.
  • the results achieved with the presently-disclosed models with 0.841M network parameters (Table 16, "RBVS+LS" row) are higher than those obtained when using the state-of-the-art approaches in most cases.
  • FIGS. 52A-52C are diagrams showing examples of experimental outputs for three different datasets for retina blood vessel segmentation using R2UNet.
  • FIG. 52A shows input images in gray scale.
  • FIG. 52B shows the ground truth.
  • FIG. 52C shows the experimental outputs.
  • the images correspond to the DRIVE, STARE, and CHASE DB1 datasets, respectively.
  • Table 17 shows experimental results of presently-disclosed approaches for retina blood vessel segmentation and comparison against other traditional and deep learning- based approaches.
  • FIG. 53 is a diagram showing examples of AUC values for retina blood vessel segmentation for the best performance achieved with R2U-Net on three different datasets.
  • the ROC curves for the highest AUCs achieved with the R2U-Net model (with 1.07M network parameters) on each of the three retina blood vessel segmentation datasets are shown in FIG. 53.
  • this dataset is preprocessed with mean subtraction, and was normalized according to the standard deviation.
  • the ADAM optimization technique with a learning rate of 2×10⁻⁴ and a binary cross-entropy loss was used.
  • MSE was calculated during the training and validation phase. In this case, 10% of the samples are used for validation during training with a batch size of 32 and 150 epochs.
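A hedged Keras sketch of the training configuration just described (Adam with a learning rate of 2×10⁻⁴, binary cross-entropy loss, MSE monitoring, a 10% validation split, a batch size of 32, and 150 epochs) is shown below; the toy model and random data are placeholders, not part of the disclosure.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder model and data, used only to make the configuration runnable;
# the real model is the RU-Net/R2U-Net subsystem and the real data are image patches.
model = tf.keras.Sequential([
    layers.Conv2D(8, 3, padding="same", activation="relu", input_shape=(48, 48, 1)),
    layers.Conv2D(1, 1, activation="sigmoid"),
])
x_train = np.random.rand(64, 48, 48, 1).astype("float32")
y_train = (np.random.rand(64, 48, 48, 1) > 0.5).astype("float32")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),  # learning rate 2x10^-4
    loss="binary_crossentropy",
    metrics=["accuracy", "mse"],                             # MSE tracked during training/validation
)
model.fit(x_train, y_train, validation_split=0.1, batch_size=32, epochs=150)
```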
  • the training accuracy of the presently-disclosed R2U-Net and RU-Net models was compared with that of the ResU-Net and U-Net models for an end-to-end image based segmentation approach. The training and the validation accuracy for all four models are shown in FIGS. 54A-54B.
  • FIGS. 54A and 54B are diagrams showing example values for training and validation accuracy, respectively, of R2U-Net, RU-Net, ResU-Net, and U-Net for skin lesion segmentation.
  • the presently-disclosed RU-Net and R2U-Net models show better performance when compared with the equivalent U-Net and ResU-Net models.
  • the quantitative results of this experiment were compared against existing methods as shown in Table 18.
  • Table 18 shows experimental results of the presently-disclosed approaches for skin cancer lesion segmentation and comparison against other traditional and deep learning-based approaches.
  • the presently-disclosed approaches were compared against recently published results using performance metrics including sensitivity, specificity, accuracy, AUC, and DC.
  • the presently-disclosed R2U-Net model provides a testing accuracy of 0.9472 with a higher AUC of 0.9430.
  • the JA and DC are calculated for all models, and the R2U-Net model provides 0.9278 for JA, and 0.9627 for the DC for skin lesion segmentation.
  • the RU-Net and R2U-Net models show higher accuracy and AUC compared to the VGG-16 and GoogLeNet models. In most cases, the RU-Net and R2U-Net models show better performance against equivalent SegNet, U-Net, and ResU-Net models for skin lesion segmentation. Some qualitative outputs of the SegNet, U-Net, and R2U-Net models for skin cancer lesion segmentation are shown for visual comparison in FIG. 55.
  • FIG. 55 is a diagram illustrating a qualitative assessment of the presently- disclosed R2U-Net for the skin cancer segmentation task.
  • the first column shows the input sample
  • the second column shows ground truth
  • the third column shows the outputs from the SegNet model
  • the fourth column shows the outputs from the U-Net model
  • the fifth column shows the results of the presently-disclosed R2U-Net model.
  • the target lesions are segmented accurately with a shape similar to the ground truth. However, if one closely observes the outputs in the first, second, and fourth rows of images in FIG. 55, the presently-disclosed R2U-Net model provides an output shape much closer to the ground truth than the outputs of the SegNet and U-Net models.
  • if one observes the third row of images in FIG. 55, it can be clearly seen that the input image contains three lesions. One is a target lesion, and the other, brighter lesions are not targets.
  • the R2U-Net model segments the desired part of the image more accurately when compared to the SegNet and U-Net models.
  • the fifth row clearly demonstrates that the R2U-Net model provides a very similar shape to the ground truth, which is a much better representation than those obtained from the SegNet and U-Net models. Thus, it can be stated that the R2U-Net model is more capable and robust for skin cancer lesion segmentation.
  • Lung segmentation is very important for analyzing lung related diseases, and it can be applied to lung cancer segmentation and lung pattern classification for identifying other problems.
  • the ADAM optimizer is used with a learning rate of 2×10⁻⁴.
  • the DI loss function according to Equation 20 was used. In this case, 10% of the samples were used for validation, with a batch size of 16 for 150 epochs.
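Equation 20 (the DI loss) is not reproduced in this extracted text. A common soft-Dice formulation consistent with the 1−DI error reported later, offered here only as an assumed sketch, is:

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Soft Dice coefficient (DI); `smooth` is an assumed constant that avoids division by zero."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def dice_loss(y_true, y_pred):
    """Loss used for training: 1 - DI, so minimizing the loss maximizes overlap."""
    return 1.0 - dice_coefficient(y_true, y_pred)
```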
  • the presently-disclosed R2U-Net model showed 0.26 and 0.55 percent better testing accuracy compared to the equivalent SegNet and U-Net models respectively.
  • the R2U-Net model provided 0.18 percent better accuracy against the ResU-Net model with the same number of network parameters.
  • FIG. 56 is a diagram showing experimental results for lung segmentation.
  • the first column shows the inputs
  • the second column shows the ground truth
  • the third column shows the outputs of SegNet
  • the fourth column shows the outputs of U-Net
  • the fifth column shows the outputs of R2U-Net. It can be seen that the R2U-Net produces better segmentation results, with internal details that are very similar to those displayed in the ground truth data. If one observes the input, ground truth, and outputs of the different approaches in the first and second rows, the outputs of the presently-disclosed approaches show better segmentation with more accurate internal details.
  • the R2U-Net model clearly defines the inside hole in the left lung, whereas the SegNet and U-Net models do not capture this detail.
  • the last row of images in FIG. 56 shows that the SegNet and U-Net models provide outputs that incorrectly capture parts of the image that are outside of the lesion.
  • the R2U-Net model provides a much more accurate segmentation result.
  • the outputs in FIG. 56 are provided as heat maps, which show the sharpness of the segmentation borders. These outputs show that the ground truth tends to have a sharper boundary when compared to the model outputs.
  • a split ratio of 0.9 means that only 10 percent of the samples are used for training and the remaining 90% of the samples are used for validation.
  • a split ratio of 0.7 means that only 30% of the samples are used for training and the remaining 70% of the samples are used for validation (this convention is illustrated in the short sketch below).
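A short illustration of the split-ratio convention described in the two items above, where the split ratio is the fraction of samples held out for validation; the sample count used here is arbitrary.

```python
def train_val_counts(num_samples: int, split_ratio: float) -> tuple:
    """With this convention, split_ratio is the validation fraction."""
    num_val = int(round(num_samples * split_ratio))
    return num_samples - num_val, num_val

print(train_val_counts(1000, 0.9))  # (100, 900): 10% training, 90% validation
print(train_val_counts(1000, 0.7))  # (300, 700): 30% training, 70% validation
```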
  • FIGS. 58A-58B are graphs showing examples of values for the performance of three different models (SegNet, U-Net and R2U-Net) for different numbers of training and validation samples.
  • FIG. 58A shows the training DI coefficient errors (1-DI).
  • FIG. 58B shows validation DI coefficient errors for five different trials.
  • FIGS. 58A-58B show the training and validation DI coefficient errors (1-DI) with respect to the number of training and validation samples. In each trial, 150 epochs were considered, and the errors presented are the average training and validation errors of the last twenty epochs.
  • FIGS. 58A-58B show that the presently-disclosed R2U-Net model has the lowest training and validation error for all of the tested split ratios, except for the result where the split ratio is equal to 0.5 for the validation case.
  • the error for the R2U-Net model is only slightly greater than that of the U-Net model.
  • FIG. 59 is a diagram showing examples of testing errors of the R2U-Net, SegNet, and U-Net models for different split ratios for the lung segmentation application. In each trial, the models were tested with the remaining 20% of the samples. The R2U-Net model shows the lowest error for almost all trials relative to the error obtained from the SegNet and U-Net models.
  • the presently-disclosed models show better performance against SegNet when using both 1.07M and 13.34M network parameters, which is around 0.7M and 2.66M fewer when compared to SegNet.
  • the model provides better performance with the same or a smaller number of network parameters compared to the SegNet, U-Net, and ResU-Net models.
  • the model possesses significant advantages in terms of memory and processing time.
  • the computational time to segment per sample in testing phase is shown in Table 20 for all three datasets.
  • the processing times during the testing phase for the STARE, CHASE DB, and DRIVE datasets were 6.42, 8.66, and 2.84 seconds per sample respectively. It can take around 90 seconds on average to segment an entire image (which is equivalent to a few thousand image patches).
  • the presently-disclosed R2U-Net approach takes around 6 seconds per sample, which is an acceptable rate in a clinical use scenario.
  • when executing skin cancer segmentation and lung segmentation, entire images could be segmented in 0.32 and 1.145 seconds respectively.
  • Table 20 Computational time for processing an entire image during the testing phase
  • the present disclosure includes an extension of the U-Net architecture using Recurrent Convolutional Neural Networks and Recurrent Residual Convolutional Neural Networks.
  • the presently-disclosed models are called "RU-Net" and "R2U-Net" respectively. These models were evaluated using three different applications in the field of medical imaging including retina blood vessel segmentation, skin cancer lesion segmentation, and lung segmentation.
  • the experimental results demonstrate that the presently- disclosed RU-Net and R2U-Net models show better performance in most of the cases for segmentation tasks with the same number of network parameters when compared to existing methods including the SegNet, U-Net, and residual U-Net (or ResU-Net) models on all three datasets.
  • FIG 60 is a diagram showing examples of Recurrent Multilayer Perceptron (RMLP), Convolutional Neural Network (CNN), and Recurrent Convolutional Neural Network (RCNN) models.
  • RMLP Recurrent Multilayer Perceptron
  • CNN Convolutional Neural Network
  • RCNN Recurrent Convolutional Neural Network
  • the present disclosure provides a deep learning architecture that combines two recently developed models: a revised version of the Inception network and the RCNN.
  • the recurrent convolutional layers are incorporated within the inception block, and the convolution operations are performed considering different time steps.
  • FIG. 61 is a diagram showing an overall operational flow diagram of the presently-disclosed Inception Recurrent Convolutional Neural Network (IRCNN).
  • the IRCNN includes an IRCNN block, a transition block, and a softmax layer.
  • FIG. 62 is a diagram showing Inception-Recurrent Convolutional Neural Network (IRCNN) block with different convolutional layers with respect to different size of kernels.
  • IRCNN Inception-Recurrent Convolutional Neural Network
  • the presently-disclosed inception block with recurrent convolution layers is shown in FIG. 62.
  • a goal of the DCNN architecture of the Inception and Residual networks is to implement large scale deep networks. As the model becomes larger and deeper, the computational parameters of the architecture are increased dramatically. Thus, the model becomes more complex to train and computationally expensive. In this scenario, the recurrent property of the present disclosure ensures better training and testing accuracy with less or equal computational parameters.
  • a deep learning model called IRCNN is used, combining the recently developed Inception-v4 and RCNN.
  • the presently-disclosed architecture is based on several recently developed deep learning architectures, including Inception Nets and RCNNs. It tries to reduce the number of computational parameters, while providing better recognition accuracy.
  • the IRCNN architecture consists of general convolution layers, IRCNN blocks, transition blocks, and a softmax layer at the end.
  • the presently-disclosed architecture provides recurrence in the Inception module, as shown in the IRCNN block in FIG. 62.
  • a feature of Inception-v4 is that it concatenates the outputs of multiple differently sized convolutional kernels in the inception block.
  • Inception-v4 is a simplified version of Inception- v3, using lower rank filters and pooling layers.
  • Inception-v4 however combines Residual concepts with Inception networks to improve the overall accuracy over Inception-v3.
  • the outputs of inception layers are added with the inputs to the Inception Residual module.
  • the present disclosure utilizes the inception concepts from Inception-v4.
  • the IRCNN block performs recurrent convolution operations with different sized kernels (see FIG. 61).
  • the inputs to the next time step are the sum of the convolutional outputs of the present time step and previous time steps. The same operations are repeated based on the number of time steps that are considered. As the input and output dimensions do not change, this amounts to an accumulation of feature maps with respect to the time steps considered, which helps to strengthen the extraction of the target features.
  • one of the paths of the inception block contains an average pooling operation that is applied before the recurrent convolution layer.
  • each Recurrent Convolution Layer (RCL) in the IRCNN block is similar to operations used by others in the field. To describe these operations, consider a vectorized patch centered at (i, j) of an input sample on the k-th feature map in the RCL unit.
  • O^l_{ijk}(t) refers to the output of the l-th layer at time step t.
  • the output can be expressed as:
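The expression itself is not reproduced in this extracted text. A reconstruction consistent with the variable descriptions here and with the recurrent-convolution formulation given elsewhere in this disclosure, in which w^f denotes the feed-forward weights, w^r the recurrent weights, and b_k the bias (notation assumed), would be:

```latex
O^{l}_{ijk}(t) = \big(w^{f}_{k}\big)^{T} x^{f(i,j)}(t)
              + \big(w^{r}_{k}\big)^{T} x^{r(i,j)}(t-1) + b_{k},
\qquad
y^{l}_{ijk}(t) = \max\!\big(0,\; O^{l}_{ijk}(t)\big)
```

The rectified output is then passed through Local Response Normalization (LRN), as noted in the following item.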
  • LRN Local Response Normalization
  • the outputs of the IRCNN block with respect to the different kernel sizes 1x1 and 3x3, and the average pooling operation followed by a 1x1 convolution, are defined as z_{1x1}(x), z_{3x3}(x), and z̃_{1x1}(x) respectively.
  • the final output z_out of the IRCNN-block can be expressed as:
  • Q represents the concatenation operation with respect to the channel-axis on the output samples of inception layers.
  • t = 3 is used, which indicates that one forward and three recurrent convolutional layers have been used in each IRCNN-block (individual path), as is clearly demonstrated in FIG. 62 and sketched below.
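The sketch below illustrates one IRCNN block in Keras under the description above: each path applies one forward convolution followed by three recurrent convolutions, the average-pooling path applies pooling before its recurrent 1x1 convolution, and the path outputs are concatenated along the channel axis. Weight sharing across time steps and the layer widths are assumptions of this sketch, not the disclosed configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def recurrent_conv(x, filters, kernel_size, t=3):
    """One forward convolution followed by t recurrent convolutions (shared recurrent weights)."""
    forward = layers.Conv2D(filters, kernel_size, padding="same")
    recurrent = layers.Conv2D(filters, kernel_size, padding="same")
    z = forward(x)
    y = layers.ReLU()(z)
    for _ in range(t):
        y = layers.ReLU()(layers.add([z, recurrent(y)]))
    return y

def ircnn_block(x, filters, t=3):
    """Inception-style block whose paths use recurrent convolutions of different kernel sizes."""
    z_1x1 = recurrent_conv(x, filters, kernel_size=1, t=t)
    z_3x3 = recurrent_conv(x, filters, kernel_size=3, t=t)
    pooled = layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(x)
    z_pool = recurrent_conv(pooled, filters, kernel_size=1, t=t)
    return layers.Concatenate(axis=-1)([z_1x1, z_3x3, z_pool])  # channel-axis concatenation

# Example: one IRCNN block applied to a 32x32 RGB input
inputs = tf.keras.Input(shape=(32, 32, 3))
features = ircnn_block(inputs, filters=32)
```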
  • the outputs of the IRCNN-block become the inputs that are fed into the transition block.
  • in the transition block, three operations (convolution, pooling, and dropout) are performed, depending upon the placement of the block in the network. According to FIG. 61, all of these operations are applied in the very first transition block. In the second transition block, only convolution and dropout operations are used.
  • the third transition block consists of convolution, global-average pooling, and drop-out layers.
  • the global-average pooling layer is used as an alternative to a fully connected layer. There are several advantages of a global-average pooling layer. Firstly, it is very close in operation to convolution, hence enforcing correspondence between feature maps and categories. The feature maps can be easily interpreted as class confidence. Secondly, it does not need computational parameters, thus helping to avoid over-fitting of the network.
  • Late use of the pooling layer is advantageous because it increases the number of non-linear hidden layers in the network. Therefore, only two special pooling layers have been applied, in the first and third transition blocks of this architecture. Special pooling is carried out with the max-pooling layer in this network (not all transition blocks have a pooling layer).
  • the max-pooling layers perform operations with a 3x3 patch and a 2x2 stride over the input samples. Since non-overlapping max-pooling has a negative impact on model regularization, overlapping max-pooling is used to regularize the network. This can facilitate training a deep network architecture. Eventually, a global-average pooling layer is used as an alternative to fully connected layers. Finally, a softmax logistic regression layer is used at the end of the IRCNN architecture (a sketch of these transition-block operations is given below).
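A hedged Keras sketch of the transition-block operations and the final classification head described above; the layer width, dropout rate, and placement choices are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transition_block(x, filters, pool=True, dropout=0.5):
    """Convolution, optional overlapping max pooling (3x3 patch, 2x2 stride), and dropout."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    if pool:
        x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
    return layers.Dropout(dropout)(x)

def classification_head(x, num_classes):
    """Global average pooling used in place of a fully connected layer, then softmax."""
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(num_classes, activation="softmax")(x)
```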
  • a model was used with four IRCNN blocks followed by transition layers, a fully connected layer, and softmax layer for the experiment on TinyImageNet-200 dataset.
  • the number of feature maps in each of the forward convolution layers and RCLs in the IRCNN blocks has almost doubled compared to the model that is used for CIFAR-100, which significantly increases the number of network parameters to approximately 9.3M.
  • the EIN and EIRN models are implemented with the same structure as the IRCNN model, with inception and inception-residual modules respectively.
  • Batch Normalization (BN) is used instead of LRN in the IRCNN, RCNN, EIN, and EIRN models.
  • Equation 27 has been skipped, and the concatenation operation is performed directly on the output of Equation 26.
  • BN is applied at the end of IRCNN block on z out .
  • the impact of RCLs on the DenseNet model has been empirically investigated.
  • A BN layer is used in the dense block with RCLs. Only four dense blocks have been used, with four layers in each block and a growth rate of 6. The experimental results show a significant improvement in training, validation, and testing accuracies for DenseNet with RCLs against the original DenseNet model.
  • the first experiment trained the presently-disclosed IRCNN architecture using the stochastic gradient descent (SGD) optimization function with default initialization for deep networks found in Keras.
  • the Nesterov momentum is set to 0.9 and the decay to 9.99×10⁻⁷.
  • LSUV Layer-sequential unit-variance
  • An improved version of the optimization function based on Adam known as EVE was also used.
  • the value of the learning rate is 1×10⁻⁴
  • the decay is 1×10⁻⁴
  • the (β1, β2) ∈ [0, 1) values are exponential decay rates for moment estimation in Adam.
  • the β3 ∈ [0, 1) value is an exponential decay rate for computing relative changes.
  • the k and K values are lower and upper thresholds for relative change, and ε is a fuzz factor.
  • the ℓ2 norm was used with a value of 0.002 for weight regularization on each convolutional layer in the IRCNN block.
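Assuming the regularization referenced above is the usual L2 (weight-decay) penalty, the corresponding Keras usage would be as sketched below; the layer width shown is illustrative.

```python
from tensorflow.keras import layers, regularizers

conv = layers.Conv2D(
    64, 3, padding="same",
    kernel_regularizer=regularizers.l2(0.002),  # L2 weight penalty of 0.002 per convolutional layer
)
```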
  • MNIST is one of the most popular datasets for handwritten digits from 0-9. The dataset contains 28x28 pixel grayscale images, with 60,000 training examples and 10,000 testing examples.
  • the presently-disclosed model was trained with two IRCNN blocks (IRCNN-block 1 and IRCNN-block 2) and the ReLU activation function was used. The model was trained with 60,000 samples, and 10,000 samples were used for the validation set. Eventually the trained network was tested with the 10,000 testing examples. A test error of 0.32% was attained with the IRCNN and SGD, and about 0.29% error was achieved for the IRCNN when initializing with LSUV and using the EVE optimization function. A summary of the classification accuracies is given in Table 21.
  • Table 21 Testing errors (%) on MNIST, CIFAR- 10(00), CIFAR- 100(000), and SVHN
  • CIFAR-10 is an object classification benchmark consisting of 32x32 color images representing 10 classes. It is split into 50,000 samples for training and 10,000 samples for testing. The experiment was conducted with and without data augmentation. The entire experiment was conducted on models similar to the one shown in FIG. 61. Using the presently-disclosed approach, about 8.41% error was achieved without data augmentation, and 7.37% error was achieved with data augmentation using the SGD technique. These results are better than those of most of the recognized DCNN models stated in Table 21. Better performance is observed from the IRCNN with LSUV as the initialization approach and EVE as the optimization technique. The results show around 8.17% and 7.11% error without and with data augmentation respectively.
  • FIG. 63 is a graph showing example values for training and validation loss.
  • the training and validation loss shown apply to the experiment with the model on CIFAR-10.
  • FIG. 64 is a graph showing examples of values for training and validation accuracy of IRCNN with SGD and LSUV+EVE.
  • FIG. 65 is a graph showing examples of values for the training and validation loss of the IRCNN for both experiments using the CIFAR-100 dataset and data augmentation (with and without initialization and optimization).
  • the IRCNN was used with a LSUV initialization approach and the EVE optimization function.
  • the default initialization approach of Keras and the SGD optimization method are used in the second experiment. It is clearly shown that the presently-disclosed model has lower error rates in the both experiments, showing the effectiveness of the presently-disclosed IRCNN learning model.
  • FIG. 66 is a graph showing example values for the training and testing accuracy of the IRCNN with LSUV and EVE.
  • SVHN Street View House Numbers
  • This dataset contains color images representing house numbers from Google Street View. This experiment considered the second version, which consists of 32x32 color examples. There are 73,257 samples in the training set and 26,032 samples in the testing set. In addition, this dataset has 531,131 extra samples that are used for training purposes. As single input samples of this dataset contain multiple digits, the main goal is to classify the central digit. Due to the huge variation of color and brightness, this dataset is much more difficult to classify compared to the MNIST dataset. In this case, experimentation occurred with the same model as is used for CIFAR-10 and CIFAR-100.
  • the presently-disclosed architecture also performs well when compared to other recently proposed optimized architectures.
  • a DCNN architecture called FitNet4 has conducted experiments with the LSUV initialization approach, and it achieved only 70.04% classification accuracy with data augmentation using mirroring and random shifts for CIFAR-100.
  • only random horizontal flipping was applied for data augmentation in this implementation, which achieved about 1.72% better recognition accuracy than FitNet4.
  • the Inception network has been implemented with the same number of layers and parameters in the transition and Inception blocks.
  • FIGS. 67 and 68 are graphs showing the model loss and accuracy for both training and validation phases, respectively. From both figures, it can be clearly observed that the presently-disclosed model shows lower loss and the highest recognition accuracy during validation phase compared with EIN and EIRN, proving the effectiveness of the presently- disclosed model. It also demonstrates the advantage of recurrent layers in Inception networks.
  • FIG. 69 is a graph showing example values for the testing accuracy of IRCNN, EIN, and EIRN on CIFAR-100 dataset. It can be summarized that the presently-disclosed model of IRCNN shows around 3.47% and 2.54% better testing accuracy compared to EIN and EIRN respectively.
  • FIG. 70 is a diagram showing examples of images.
  • FIG. 71 is a graph showing example values for validation accuracy of IRCNN, EIRN, EIN, and RCNN. The impact of transfer learning is observed based on FIG. 71.
  • FIG. 72 is a graph showing example values for validation accuracy of DenseNet and DenseNet with a Recurrent Convolutional Layer (RCL).
  • Table 22 shows the testing accuracy for all the models including RCNN and DenseNet.
  • the IRCNN provides better performance compared to EIN, EIRN, and RCNN with almost same number of parameters for object recognition task on the TinyImageNet-200 dataset.
  • Experiments were also conducted with DenseNet and DenseNet with RCL on the TinyImageNet-200 dataset.
  • the experimental results show that DenseNet with RCLs provides about 0.38% improvement in Top-1% accuracy compared to DenseNet, with only 1M network parameters.
  • the experimental results show that DenseNet with RCLs provides higher testing accuracy in both Top-1% and Top-5% compared against the DenseNet model.
  • Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Software implementations of the described subject matter can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory, computer-readable computer-storage medium for execution by, or to control the operation of, data processing apparatus.
  • the program instructions can be encoded in/on an artificially generated propagated signal, for example, a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • the computer-storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of computer-storage mediums.
  • Configuring one or more computers means that the one or more computers have installed hardware, firmware, or software (or combinations of hardware, firmware, and software) so that when the software is executed by the one or more computers, particular computing operations are performed.
  • the terms "real-time," "real time," "realtime," "real (fast) time (RFT)," "near(ly) real-time (NRT)," "quasi real-time," or similar terms mean that an action and a response are temporally proximate such that an individual perceives the action and the response occurring substantially simultaneously.
  • the time difference for a response to display (or for an initiation of a display) of data following the individual’s action to access the data can be less than 1 ms, less than 1 sec., or less than 5 secs.
  • the terms“data processing apparatus,”“computer,” or“electronic computer device” refer to data processing hardware and encompass all kinds of apparatus, devices, and machines for processing data, including by way of example, a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can also be, or further include special purpose logic circuitry, for example, a central processing unit (CPU), an FPGA (field programmable gate array), or an ASIC (application-specific integrated circuit).
  • the data processing apparatus or special purpose logic circuitry (or a combination of the data processing apparatus or special purpose logic circuitry) can be hardware- or software-based (or a combination of both hardware- and software-based).
  • the apparatus can optionally include code that creates an execution environment for computer programs, for example, code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of execution environments.
  • the present disclosure contemplates the use of data processing apparatuses with or without conventional operating systems, for example LINUX, UNIX, WINDOWS, MAC OS, ANDROID, IOS, or any other suitable conventional operating system.
  • a computer program which can also be referred to or described as a program, software, a software application, a module, a software module, a script, or code can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program can, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, for example, one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, for example, files that store one or more modules, sub-programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the methods, processes, or logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the methods, processes, or logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, for example, a CPU, an FPGA, or an ASIC.
  • Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors, both, or any other kind of CPU.
  • a CPU will receive instructions and data from and write to a memory.
  • the essential elements of a computer are a CPU, for performing or executing instructions, and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to, receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example, magnetic, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, for example, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device, for example, a universal serial bus (USB) flash drive, to name just a few.
  • PDA personal digital assistant
  • GPS global positioning system
  • USB universal serial bus
  • Computer-readable media suitable for storing computer program instructions and data include all forms of permanent/non-permanent or volatile/non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, for example, random access memory (RAM), read-only memory (ROM), phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic devices, for example, tape, cartridges, cassettes, internal/removable disks; magneto-optical disks; and optical memory devices, for example, digital video disc (DVD), CD-ROM, DVD+/-R, DVD-RAM, DVD-ROM, HD-DVD, and BLURAY, and other optical memory technologies.
  • RAM random access memory
  • ROM read-only memory
  • PRAM phase change memory
  • SRAM static random access memory
  • DRAM dynamic random access memory
  • EPROM erasable programmable read-only memory
  • the memory can store various objects or data, including caches, classes, frameworks, applications, modules, backup data, jobs, web pages, web page templates, data structures, database tables, repositories storing dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto. Additionally, the memory can include any other appropriate data, such as logs, policies, security or access data, reporting files, as well as others.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • implementations of the subject matter described in this specification can be implemented on a computer having a display device, for example, a CRT (cathode ray tube), LCD (liquid crystal display), LED (Light Emitting Diode), or plasma monitor, for displaying information to the user and a keyboard and a pointing device, for example, a mouse, trackball, or trackpad by which the user can provide input to the computer.
  • Input can also be provided to the computer using a touchscreen, such as a tablet computer surface with pressure sensitivity, a multi-touch screen using capacitive or electric sensing, or other type of touchscreen.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s client device in response to requests received from the web browser.
  • GUI graphical user interface
  • a GUI can be used in the singular or the plural to describe one or more graphical user interfaces and each of the displays of a particular graphical user interface. Therefore, a GUI can represent any graphical user interface, including but not limited to, a web browser, a touch screen, or a command line interface (CLI) that processes information and efficiently presents the information results to the user.
  • a GUI can include a plurality of user interface (UI) elements, some or all associated with a web browser, such as interactive fields, pull-down lists, and buttons. These and other UI elements can be related to or represent the functions of the web browser.
  • UI user interface
  • Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, for example, as a data server, or that includes a middleware component, for example, an application server, or that includes a front-end component, for example, a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of wireline or wireless digital data communication (or a combination of data communication), for example, a communication network.
  • Examples of communication networks include a local area network (LAN), a radio access network (RAN), a metropolitan area network (MAN), a wide area network (WAN), Worldwide Interoperability for Microwave Access (WIMAX), a wireless local area network (WLAN) using, for example, 802.11 a/b/g/n or 802.20 (or a combination of 802.11x and 802.20 or other protocols consistent with this disclosure), all or a portion of the Internet, or any other communication system or systems at one or more locations (or a combination of communication networks).
  • the network can communicate with, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, or other suitable information (or a combination of communication types) between network addresses.
  • IP Internet Protocol
  • ATM Asynchronous Transfer Mode
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • any claimed implementation is considered to be applicable to at least a computer-implemented method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer system comprising a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method or the instructions stored on the non- transitory, computer-readable medium.

Abstract

Aspects of the subject matter disclosed herein include techniques for processing a data map with a neural network subsystem (NNS). The NNS can include a plurality of encoder units and a plurality of decoder units, each decoder unit corresponding to a different encoder unit. Processing the data map can include processing successive representations of the data map with the plurality of encoder units to generate a set of feature maps for the data map, each feature map having a lower dimensionality than the data map, each encoder unit comprising a recurrent convolutional block or a recurrent-residual convolutional unit; and upsampling the set of feature maps with the plurality of decoder units to generate a final feature map for the data map that has a higher dimensionality than feature maps in the set of feature maps.

Description

NEURAL NETWORKS FOR BIOMEDICAL IMAGE ANALYSIS
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of priority to U.S. Provisional Application
No. 62/755,097, filed on November 2, 2018, the disclosure of which is incorporated by reference in its entirety.
BACKGROUND
[0002] Deep Learning (DL) approaches have been applied in the field of bio-medical imaging including Digital Pathology Image Analysis (DPIA) for purposes such as classification, segmentation, and detection tasks.
SUMMARY
[0003] The present disclosure describes techniques for using artificial neural networks and other machine-learning models for image processing tasks, including digital pathology image analysis.
[0004] Aspects of the subject matter described in this specification can be implemented as a computer-based method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer-implemented system comprising a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method/the instructions stored on the non-transitory, computer-readable medium.
[0005] The subject matter described in this specification can be implemented in certain examples so as to realize one or more of the following advantages. First, a residual unit can help when training deep architectures. Second, feature accumulation with recurrent residual convolutional layers can assure better feature representation for segmentation tasks. Third, feature accumulation can facilitate designing better U-Net architectures with the same or fewer number of network parameters and with better performance for medical image segmentation.
[0006] Some implementations of the subject matter disclosed herein include a system having one or more data processing apparatuses and an image segmentation neural network subsystem implemented on the data processing apparatus(es). The image segmentation neural network subsystem can include a plurality of encoding units arranged in succession so that each encoding unit after a first encoding unit is configured to process an input set of feature maps from a preceding encoding unit to generate an output set of feature maps having a lower dimensionality than the input set of feature maps, wherein the first encoding unit is configured to process a neural network input representing a data map to generate a first output feature map, and each encoding unit comprises a recurrent convolutional block or a recurrent-residual convolutional unit; and a plurality of decoding units arranged in succession so that each decoding unit after a first decoding unit is configured to process a first input set of feature maps from a preceding decoding unit and a second input set of feature maps from a corresponding encoding unit to generate an output set of feature maps having a higher dimensionality than the input set of feature maps, wherein the first decoding unit is configured to process as input the output set of feature maps from a last of the encoding units in the succession of encoding units, and a last of the decoding units in the succession of decoding units is configured to generate a final feature map for the data map.
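To make the encoder/decoder structure of paragraph [0006] concrete, the following is a minimal, hedged Keras sketch of a two-level subsystem whose encoding units use recurrent convolutional blocks; the depth, feature widths, and use of plain (non-residual) recurrent blocks are illustrative assumptions rather than the disclosed configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def rcl(x, filters, t=2):
    """Recurrent convolutional block: one forward conv plus t recurrent convs (shared weights)."""
    forward = layers.Conv2D(filters, 3, padding="same")
    recurrent = layers.Conv2D(filters, 3, padding="same")
    z = forward(x)
    y = layers.ReLU()(z)
    for _ in range(t):
        y = layers.ReLU()(layers.add([z, recurrent(y)]))
    return layers.BatchNormalization()(y)

def build_segmentation_subsystem(input_shape=(128, 128, 1), base_filters=16):
    inputs = tf.keras.Input(shape=input_shape)

    # Encoding units: each reduces the dimensionality of its input feature maps.
    e1 = rcl(inputs, base_filters)
    p1 = layers.MaxPooling2D(2)(e1)
    e2 = rcl(p1, base_filters * 2)
    p2 = layers.MaxPooling2D(2)(e2)
    bottom = rcl(p2, base_filters * 4)

    # Decoding units: each upsamples and concatenates with its corresponding encoding unit.
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2, padding="same")(bottom)
    d2 = rcl(layers.Concatenate()([u2, e2]), base_filters * 2)
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2, padding="same")(d2)
    d1 = rcl(layers.Concatenate()([u1, e1]), base_filters)

    outputs = layers.Conv2D(1, 1, activation="sigmoid")(d1)  # final feature (segmentation) map
    return tf.keras.Model(inputs, outputs)

model = build_segmentation_subsystem()
```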
[0007] These and other implementations can further include one or more of the following features.
[0008] Each encoding unit of the plurality of encoding units can include a recurrent convolutional block.
[0009] The recurrent convolutional block can include a plurality of forward recurrent convolutional layers.
[0010] Each encoding unit of the plurality of encoding units can include a recurrent- residual convolutional unit.
[0011] The recurrent-residual convolutional unit can include a plurality of recurrent convolution layers having residual connectivity.
[0012] The data map can be or include an input image.
[0013] The final feature map can be or include a segmentation map for the data map.
[0014] The system can further include a segmentation engine on the one or more data processing apparatuses, the segmentation engine configured to segment the data map using the segmentation map.
[0015] The final feature map can be or include a density heat map for the data map.
[0016] The data map can be an input image that depicts a slide of cells, and the neural network subsystem can be configured for use in a nuclei segmentation task to identify nuclei in the slide of cells.
[0017] The data map can be an input image that depicts a slide of cells, and the neural network subsystem can be configured for use in an epithelium segmentation task to identify epithelium in the slide of cells.
[0018] The data map can be an input image that depicts a slide of cells, and the neural network subsystem can be configured for use in a tubule segmentation task to identify tubules in the slide of cells.
[0019] Some implementations of the subject matter disclosed herein include methods for processing a data map with a neural network subsystem having a plurality of encoder units and a plurality of decoder units, each decoder unit corresponding to a different encoder unit. The method can include actions of processing successive representations of the data map with the plurality of encoder units to generate a set of feature maps for the data map, each feature map having a lower dimensionality than the data map, each encoder unit comprising a recurrent convolutional block or a recurrent-residual convolutional unit; and upsampling the set of feature maps with the plurality of decoder units to generate a final feature map for the data map that has a higher dimensionality than feature maps in the set of feature maps.
[0020] These and other implementations can further include one or more of the following features.
[0021] Each encoding unit of the plurality of encoding units can include a recurrent convolutional block.
[0022] The recurrent convolutional block can include a plurality of forward recurrent convolutional layers.
[0023] Each encoding unit of the plurality of encoding units can include a recurrent- residual convolutional unit.
[0024] The recurrent-residual convolutional unit can include a plurality of recurrent convolution layers having residual connectivity.
[0025] The data map can be or include a medical image.
[0026] Some implementations of the subject matter disclosed herein include a system having one or more data processing apparatuses and an image segmentation neural network subsystem implemented on the data processing apparatus(es). The image segmentation neural network subsystem can include one or more first convolutional layers, one or more inception recurrent residual convolutional neural network (IRRCNN) blocks, and one or more transition blocks.
[0027] These and other implementations can further include one or more of the following features.
[0028] Each IRRCNN block can include an inception unit and a residual unit, the inception unit including recurrent convolutional layers that are merged by concatenation, the residual unit configured to sum input features to the IRRCNN block with an output of the inception unit.
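A hedged Keras sketch of an IRRCNN block as characterized in paragraph [0028]: recurrent convolutional paths are merged by concatenation (the inception unit), and the block input is summed with the inception output (the residual unit). The 1x1 projection used to match channel counts before the sum, and the two-path layout, are assumptions of this sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

def recurrent_conv(x, filters, kernel_size, t=2):
    """One forward convolution followed by t recurrent convolutions (shared recurrent weights)."""
    forward = layers.Conv2D(filters, kernel_size, padding="same")
    recurrent = layers.Conv2D(filters, kernel_size, padding="same")
    z = forward(x)
    y = layers.ReLU()(z)
    for _ in range(t):
        y = layers.ReLU()(layers.add([z, recurrent(y)]))
    return y

def irrcnn_block(x, filters, t=2):
    # Inception unit: recurrent convolutional paths merged by concatenation.
    path_1x1 = recurrent_conv(x, filters, kernel_size=1, t=t)
    path_3x3 = recurrent_conv(x, filters, kernel_size=3, t=t)
    inception = layers.Concatenate(axis=-1)([path_1x1, path_3x3])

    # Residual unit: sum the input features of the block with the output of the inception unit.
    shortcut = layers.Conv2D(2 * filters, 1, padding="same")(x)  # channel-matching projection (assumed)
    return layers.add([shortcut, inception])

inputs = tf.keras.Input(shape=(64, 64, 3))
features = irrcnn_block(inputs, filters=32)
```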
[0029] The neural network subsystem can be configured to process a data map to perform a classification task based on the data map.
[0030] The neural network subsystem can further include a softmax layer.
[0031] Some implementations of the subject matter disclosed herein include methods that include actions of obtaining a neural network input, the neural network input representing a data map; processing the neural network input with a neural network system to generate a classification for one or more items shown in the data map, the neural network system including one or more first convolutional layers, one or more inception recurrent residual convolutional (IRRCNN) blocks, and one or more transition blocks; and providing the classification for storage, processing, or presentation.
[0032] The details of one or more implementations of the subject matter of this specification are set forth in the Detailed Description, the Claims, and the accompanying drawings. Other features, aspects, and advantages of the subject matter will become apparent from the Detailed Description, the Claims, and the accompanying drawings.
DESCRIPTION OF DRAWINGS
[0033] FIG. 1 is a pictorial representation of an example of a densely connected recurrent convolutional (DCRC) block.
[0034] FIG. 2 is a diagram showing examples of unfolded recurrent convolutional units for t = 2.
[0035] FIGS. 3A-3C are images showing examples of three different types of cancer cells, including chronic lymphocytic leukemia (CLL) cells, follicular lymphoma (FL) cells, and mantle cell lymphoma (MCL) cells respectively.
[0036] FIGS. 4A-4C are images showing examples of non-overlapping patches from original samples. [0037] FIG. 5 is a graph showing example values for training and validation accuracy for lymphoma classification for 40 epochs.
[0038] FIG. 6A is a graph showing examples of area under receiver operating characteristics (ROC) curve values for an image-based method.
[0039] FIG. 6B is a graph showing examples of area under ROC curve values for a patch-based method.
[0040] FIG. 7A shows images of example samples of tissue without invasive ductal carcinoma (IDC).
[0041] FIG. 7B shows images of example samples of tissue with IDC.
[0042] FIGS. 8A and 8B are diagrams showing examples of images of randomly- selected samples for first class samples and second class samples, respectively.
[0043] FIG. 9 is a graph showing examples of training and accuracy data for IDC classification.
[0044] FIG. 10 is a graph showing examples of area under ROC curve values for invasive ductal classification.
[0045] FIGS. 11A-11B are diagrams showing examples of images for randomly- selected samples from nuclei segmentation dataset from ISMI-2017.
[0046] FIG. 12 is a graph showing example values for training and validation accuracy for nuclei segmentation.
[0047] FIGS. 13A-13C are diagrams showing examples of images for quantitative results for nuclei segmentation.
[0048] FIGS. 14A-14B are diagrams showing examples of images of database samples from Epithelium segmentation.
[0049] FIGS. 15A-15B are diagrams showing examples of images of database samples for epithelium segmentation.
[0050] FIGS. 16A-16C are diagrams showing examples of experimental outputs for
Epithelium segmentation.
[0051] FIG. 17 is a graph showing an example of a plot of an under ROC curve for
Epithelium segmentation.
[0052] FIGS. 18A-18B are diagrams showing examples of images of database samples for Tubule segmentation. [0053] FIGS. 19A-19D are diagrams showing examples of patches from an input sample used for training and testing.
[0054] FIGS. 20A-20D are diagrams of examples of images for quantitative results for tubule segmentation.
[0055] FIGS. 21A-21D are diagrams showing examples of images for quantitative results for tubule segmentation.
[0056] FIG. 22 is a graph showing an example of an ROC curve for Tubule segmentation.
[0057] FIGS. 23A and 23B are drawings of examples of input samples and label masks with single pixel annotation, respectively.
[0058] FIG. 24 is a graph showing examples of training and validation accuracy for lymphocyte detection.
[0059] FIGS. 25A-25D are diagrams showing examples of qualitative results for lymphocyte detection with UD-Net.
[0060] FIG. 26 is a diagram showing examples of an image from the dataset.
[0061] FIGS. 27-28 are drawings of examples of patches for non-mitosis and mitosis cells, respectively.
[0062] FIG. 29 shows a graph of example training and validation accuracy values for mitosis detection.
[0063] FIG. 30 is a diagram of a visual information processing pipeline of the human brain.
[0064] FIG. 31 is an overall layer flow diagram of a presently-disclosed IRRCNN.
[0065] FIG. 32 is a diagram of an example of an architecture for the Inception
Recurrent Residual Convolutional Neural Network (IRRCNN) block.
[0066] FIG. 33 is a diagram showing example images.
[0067] FIG. 34 is a graph showing example training and validation accuracy values for
IRRCNN, IRCNN, EIN, and EIRN on CIFAR-100.
[0068] FIG. 35 is a graph showing examples of values for testing accuracy of the
IRRCNN model against IRCNN, EIN, and EIRN on the augmented CIFAR-100 dataset.
[0069] FIG. 36 is a diagram showing examples of sample images from the
TinyImageNet-200 dataset. [0070] FIGS. 37A and 37B are graphs showing examples of accuracy values during training and validation, respectively, for the TinyImageNet-200 dataset.
[0071] FIG. 38 is a graph showing examples of values for validation accuracy for various models on the Tiny-ImageNet dataset.
[0072] FIG. 39 is a graph showing examples of values for the top 1% and top 5% testing accuracy on TinyImageNet-200 dataset.
[0073] FIG. 40 is a diagram showing examples of images from the CU3D-100 dataset.
[0074] FIGS. 41A-41C are diagrams showing sample images in the fish category with different lighting conditions and affine transformations.
[0075] FIGS. 42A and 42B are graphs showing examples of values for training and validation accuracy with respect to epoch for the CU3D-100 dataset.
[0076] FIGS. 43A and 43B are graphs showing examples of errors versus split ratio for five different trials on CU3D-100 dataset for training and validation, respectively.
[0077] FIG. 44 is a graph showing examples of values for testing accuracy for different trials on CU3D-100 dataset.
[0078] FIGS. 45A-45C are diagrams showing medical image segmentation examples.
[0079] FIG. 46 is a diagram showing an example of an RU-Net architecture with convolutional encoding and decoding units using recurrent convolutional layers (RCL) which is based on a U-Net architecture.
[0080] FIGS. 47A-47D are diagrams showing examples of different variants of the convolutional and recurrent convolutional units.
[0081] FIGS. 48A and 48B are diagrams showing examples of unfolded recurrent convolutional units for t = 2 and t = 3, respectively.
[0082] FIGS. 49A-49C are diagrams showing example images from training datasets.
[0083] FIGS. 50A and 50B are diagrams showing example patches and corresponding outputs, respectively.
[0084] FIGS. 51A and 51B are graphs showing examples of values for training and validation accuracy of the presently-disclosed RU-Net and R2U-Net models compared to the ResU-Net and U-Net models for 150 epochs.
[0085] FIGS. 52A-52C are diagrams showing examples of experimental outputs for three different datasets for retina blood vessel segmentation using R2UNet. [0086] FIG. 53 is a diagram showing examples of AUC values for retina blood vessel segmentation for the best performance achieved with R2U-Net on three different datasets.
[0087] FIGS. 54A and 54B are diagrams showing example values for training and validation accuracy, respectively, of R2U-Net, RU-Net, ResU-Net, and U-Net for skin lesion segmentation.
[0088] FIG. 55 is a diagram illustrating a qualitative assessment of the presently- disclosed R2U-Net for the skin cancer segmentation task.
[0089] FIG. 56 is a diagram showing experimental results for lung segmentation.
[0090] FIG. 57 is a graph showing example values on an ROC curve for lung segmentation for four different models where t = 3.
[0091] FIGS. 58A-58B are graphs showing examples of values for the performance of three different models (SegNet, U-Net and R2U-Net) for different numbers of training and validation samples.
[0092] FIG. 59 is a diagram showing examples of testing errors of the R2U-Net,
SegNet, and U-Net models for different split ratios for the lung segmentation application.
[0093] FIG 60 is a diagram showing examples of Recurrent Multilayer Perceptron
(RMLP), Convolutional Neural Network (CNN), and Recurrent Convolutional Neural Network (RCNN) models.
[0094] FIG. 61 is a diagram showing an overall operational flow diagram of the presently-disclosed Inception Recurrent Convolutional Neural Network (IRCNN).
[0095] FIG. 62 is a diagram showing Inception-Recurrent Convolutional Neural
Network (IRCNN) block with different convolutional layers with respect to different size of kernels.
[0096] FIG. 63 is a graph showing example values for training and validation loss.
[0097] FIG. 64 is a graph showing examples of values for training and validation accuracy of IRCNN with SGD and LSUV+EVE.
[0098] FIG. 65 is a graph showing examples of values for the training and validation loss of the IRCNN for both experiments using the CIFAR-100 dataset and data augmentation (with and without initialization and optimization).
[0099] FIG. 66 is a graph showing example values for the training and testing accuracy of the IRCNN with LSUV and EVE. [00100] FIGS. 67 and 68 are graphs showing the model loss and accuracy for both training and validation phases, respectively.
[00101] FIG. 69 is a graph showing example values for the testing accuracy of IRCNN, EIN, and EIRN on CIFAR-100 dataset.
[00102] FIG. 70 is a diagram showing examples of images.
[00103] FIG. 71 is a graph showing example values for validation accuracy of IRCNN, EIRN, EIN, and RCNN.
[00104] FIG. 72 is a graph showing example values for validation accuracy of DenseNet and DenseNet with a Recurrent Convolutional Layer (RCL).
[00105] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
I - Advanced Deep Convolutional Neural Network Approaches for Digital Pathology Image Analysis: A Comprehensive Evaluation With Different Use Cases
[00106] The present disclosure applies advanced deep convolutional neural network (DCNN) techniques, including IRRCNN, DCRN, R2U-Net, and R2U-Net-based regression models, to solve different DPIA problems, evaluated on publicly available benchmark datasets covering seven unique DPIA tasks. These tasks include: invasive ductal carcinoma detection, lymphoma classification, nuclei segmentation, epithelium segmentation, tubule segmentation, lymphocyte detection, and mitosis detection. Details of these various networks (e.g., IRRCNN, R2U-Net, and RU-Net) are described further below in this specification.
[00107] Densely Connected Recurrent Convolutional Network (DCRN)
[00108] According to the basic structure of Densely Connected Networks (DCNs), the outputs from prior layers are used as inputs for subsequent layers. This architecture ensures the reuse of features inside the model, providing better performance on different computer vision tasks, as has been empirically investigated on different datasets. In this implementation, however, the present disclosure provides an improved version of the DCN, named DCRN for short, which is used for nuclei classification. The DCRN is built from several Densely Connected Recurrent Convolutional (DCRC) blocks and transition blocks. FIG. 1 is a pictorial representation of an example of a densely connected recurrent convolutional (DCRC) block.
[00109] According to the basic mathematical model of DenseNet, the l-th layer receives the feature maps (x_0, x_1, x_2, ..., x_{l-1}) of all previous layers as input:

x_l = H_l([x_0, x_1, x_2, \ldots, x_{l-1}])    (1)

where [x_0, x_1, x_2, ..., x_{l-1}] denotes the concatenated features from layers 0, ..., l-1 and H_l(\cdot) is a single tensor. Consider an input sample H_l(\cdot) from the l-th DCRN block that contains F feature maps (indexed 0, ..., F-1), which are fed into the recurrent convolutional layers. This convolutional layer performs three consecutive operations: Batch Normalization (BN), followed by ReLU and a 3x3 convolution (conv). Consider a center pixel of a patch located at (i, j) in an input sample on the k-th feature map of H_{(l,k)}(\cdot). Additionally, assume the output of the network is H_{lk}(t) for the l-th layer and k-th feature map at time step t. The output can be expressed as follows:

H_{lk}^{ij}(t) = (w_{(l,k)}^{f})^T x_l^{f(i,j)}(t) + (w_{(l,k)}^{r})^T x_l^{r(i,j)}(t-1) + b_{(l,k)}    (2)

[00110] Here, x_l^{f(i,j)}(t) and x_l^{r(i,j)}(t-1) are the inputs to the standard convolution layers and the l-th recurrent convolution layers, respectively. The w_{(l,k)}^{f} and w_{(l,k)}^{r} values are the weights of the standard convolutional layer and the recurrent convolutional layers of the l-th layer and k-th feature map, respectively, and b_{(l,k)} is the bias. The recurrent convolution operations are performed with respect to t.
[00111] FIG. 2 is a diagram showing an example of an unfolded recurrent convolutional unit for t = 2, i.e., a pictorial representation of the recurrent convolutional operation for t = 2.
[00112] In the transition block, 1x1 convolutional operations are performed with BN, followed by a 2x2 average pooling layer. The DenseNet model includes several dense blocks with feedforward convolutional layers and transition blocks, whereas the DCRN uses the same number of DCRC units and transition blocks. For both models, four blocks, three layers per block, and a growth rate of five can be used in this implementation.
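To make the DCRC and transition block structure concrete, the following is a minimal, hypothetical Keras/TensorFlow sketch (the frameworks named later in this disclosure). The recurrent layer uses separate feed-forward and recurrent convolution weights with BN-ReLU-3x3 convolution operations, and the block densely concatenates prior feature maps; the function names, input size, and the two time steps are illustrative assumptions, while the growth rate of five and three layers per block mirror the values stated above.

```python
# Hypothetical sketch of a DCRC block and transition block (Keras / TensorFlow).
from tensorflow.keras import layers

def recurrent_conv(x, filters, t=2):
    """BN -> ReLU -> 3x3 conv with separate feed-forward (w^f) and recurrent (w^r)
    convolutions; the recurrent response is accumulated over t time steps."""
    conv_f = layers.Conv2D(filters, 3, padding="same")
    conv_r = layers.Conv2D(filters, 3, padding="same")
    ff = conv_f(layers.ReLU()(layers.BatchNormalization()(x)))
    state = ff
    for _ in range(t):
        state = layers.add(
            [ff, conv_r(layers.ReLU()(layers.BatchNormalization()(state)))])
    return state

def dcrc_block(x, num_layers=3, growth_rate=5, t=2):
    """Densely connected recurrent convolutional block: each recurrent convolution
    receives the concatenation of all previously produced feature maps."""
    features = [x]
    for _ in range(num_layers):
        inp = features[0] if len(features) == 1 else layers.concatenate(features)
        features.append(recurrent_conv(inp, growth_rate, t=t))
    return layers.concatenate(features)

def transition_block(x, filters):
    """Transition block: 1x1 convolution with BN followed by 2x2 average pooling."""
    x = layers.Conv2D(filters, 1, padding="same")(layers.BatchNormalization()(x))
    return layers.AveragePooling2D(pool_size=2)(x)

# Illustrative usage on a small input tensor.
inputs = layers.Input((64, 64, 3))
features = transition_block(dcrc_block(inputs), filters=16)
```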
[00113] Regression Model with R2U-Net
[00114] In general, for cell detection and counting problems, the ground truth is created with single-pixel annotations, where each individual dot represents a cell. Datasets that can be used in various implementations contain roughly five to five hundred nuclei per input sample, each annotated at the center pixel of the cell. For training the regression model, each dot is represented with a Gaussian density. In the case of the regression model, the R2U-Net model was applied to estimate the Gaussian densities from the input samples instead of computing the class or pixel-level probability that is considered for DL-based classification and segmentation models, respectively. This model is named the University of Dayton Network (UD-Net). For each input sample, a density surface D(x) is generated by superposition of the Gaussian values. The objective is to regress this density surface for the corresponding input cell image I(x). The goal is achieved with the R2U-Net model using the mean squared error loss between the output heat maps and the target Gaussian density surface, which is the loss function for the regression problem. In the inference phase, for a given input cell image I(x), the R2U-Net model computes the density heat map D(x).
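The following is a minimal sketch of how single-pixel (dot) annotations might be converted into the Gaussian density surface D(x) used as the regression target; the Gaussian width (sigma), array sizes, and dot positions are illustrative assumptions rather than values from the disclosure.

```python
# Hypothetical sketch: build a Gaussian density surface D(x) from single-pixel
# (dot) cell annotations, for use as the regression target with an MSE loss.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_from_dots(dot_mask, sigma=3.0):
    """dot_mask: 2-D array with 1 at each annotated cell center, 0 elsewhere.
    Returns a density surface formed by superposing a Gaussian on each dot."""
    return gaussian_filter(dot_mask.astype(np.float32), sigma=sigma)

# Example: a 100x100 sample with three annotated nuclei (positions are made up).
dots = np.zeros((100, 100), dtype=np.float32)
dots[[20, 50, 75], [30, 60, 10]] = 1.0
density = density_from_dots(dots, sigma=3.0)

# A regression network (such as the R2U-Net-based UD-Net described above) would
# then be trained to map the input image to `density` with a mean squared error loss.
```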
Table 1: Statistics for the dataset
[00115] Results
[00116] Implementations can advance DL approaches such as IRRCNN, DCRCN, and
R2U-Net for solving different tasks in digital pathology images. These tasks can include, for example, lymphoma classification, invasive ductal carcinoma (IDC) detection, epithelium segmentation, tubule segmentation, nuclei segmentation, lymphocyte detection, and mitosis detection.
[00117] For this implementation, the Keras and TensorFlow frameworks were used on a single-GPU machine with 56 GB of RAM and an NVIDIA GeForce GTX-980 Ti.
[00118] Experimental Results
[00119] The performance of the IRRCNN, DCRCN, and R2U-Net models was evaluated with different performance metrics: precision, recall, accuracy, F1-score, area under the Receiver Operating Characteristic (ROC) curve, dice coefficient (DC), and Mean Squared Error (MSE). The equations for accuracy, F1-score, precision, and recall are as follows:
Accuracy = \frac{TP + TN}{TP + FP + TN + FN}    (3)

F1\text{-}score = \frac{2\,TP}{2\,TP + FP + FN}    (4)

Precision = \frac{TP}{TP + FP}    (5)

Recall = \frac{TP}{TP + FN}    (6)
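As an illustration, equations (3) through (6) can be computed directly from confusion-matrix counts; the counts used in the example call below are hypothetical.

```python
# Illustrative helpers implementing equations (3)-(6) from TP, FP, TN, FN counts.
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)       # equation (3)
    f1_score = 2 * tp / (2 * tp + fp + fn)           # equation (4)
    precision = tp / (tp + fp)                       # equation (5)
    recall = tp / (tp + fn)                          # equation (6)
    return {"accuracy": accuracy, "f1_score": f1_score,
            "precision": precision, "recall": recall}

# Example with hypothetical counts:
print(classification_metrics(tp=90, fp=10, tn=85, fn=15))
```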
[00120] Lymphoma Classification
[00121] Even expert pathologists sometimes face difficulties differentiating sub-types in H&E-stained images. Consistent diagnosis of different disease sub-types from H&E classification is very important in the field of digital pathology. In this implementation, three different lymphoma sub-types are classified from pathological images. FIGS. 3A-3B are images showing examples of three different types of cancer cells: Chronic lymphocytic leukemia (CLL) cells, Follicular lymphoma (FL) cells, and Mantle cell lymphoma (MCL) cells, respectively.
[00122] Lymphoma Classification Dataset
[00123] The original image size is 1338x1040 pixels; the images are down-sampled to 1344x1024 so that non-overlapping, sequential patches of size 64x64 can be cropped. FIGS. 4A-4C are images showing examples of non-overlapping patches from original samples. The actual database sample and the first five non-overlapping patches from the original images are shown in FIGS. 4A-4C. The statistics of the original dataset and the number of samples after extracting non-overlapping patches are shown in Table 2.
Table 2: Statistics of the original dataset and the number of samples after extracting non-overlapping patches
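A minimal sketch of the non-overlapping, sequential patch extraction described above is given below; the helper name and the zero-valued example image are illustrative assumptions.

```python
# Hypothetical sketch of non-overlapping, sequential patch extraction as described
# above (resize to 1344x1024, then crop 64x64 patches).
import numpy as np

def extract_patches(image, patch=64):
    """image: H x W x C array whose H and W are multiples of `patch`.
    Returns an array of non-overlapping, sequential patches."""
    h, w = image.shape[:2]
    patches = [image[r:r + patch, c:c + patch]
               for r in range(0, h - patch + 1, patch)
               for c in range(0, w - patch + 1, patch)]
    return np.stack(patches)

# Example: a resized 1024x1344 RGB sample yields (1024/64) * (1344/64) = 336 patches.
sample = np.zeros((1024, 1344, 3), dtype=np.uint8)
print(extract_patches(sample).shape)  # (336, 64, 64, 3)
```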
[00124] In this implementation, the performance of the IRRCNN model can be evaluated with two different approaches: an entire-image-based approach and a patch-based approach. In the image-based approach, the original sample is resized to 256x256. During training of the IRRCNN model, 8 and 32 samples per batch can be used for the image-based and patch-based methods, respectively. The Stochastic Gradient Descent (SGD) optimization method is used with an initial learning rate of 0.01. The model was trained for only 40 epochs, and after 20 epochs the learning rate is decreased by a factor of 10. FIG. 5 is a graph showing example values for training and validation accuracy for lymphoma classification for 40 epochs.
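The training schedule described above (SGD, initial learning rate 0.01, reduced by a factor of 10 after 20 of the 40 epochs) could be expressed as in the following sketch; the tiny placeholder network and randomly generated data are stand-ins, not the IRRCNN model or the lymphoma patches themselves.

```python
# Hypothetical sketch of the schedule described above: SGD with an initial learning
# rate of 0.01, reduced by a factor of 10 after 20 of the 40 epochs.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import LearningRateScheduler

# Placeholder classifier standing in for the IRRCNN model (assumption for illustration).
model = models.Sequential([
    layers.Input((64, 64, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(3, activation="softmax"),   # three lymphoma sub-types
])

def step_decay(epoch, lr):
    return 0.01 if epoch < 20 else 0.001     # drop by a factor of 10 after 20 epochs

model.compile(optimizer=SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Dummy data standing in for the 64x64 patches (3 classes).
x_train = np.random.rand(32, 64, 64, 3).astype("float32")
y_train = np.eye(3)[np.random.randint(0, 3, 32)]
model.fit(x_train, y_train, batch_size=32, epochs=40,
          callbacks=[LearningRateScheduler(step_decay)], verbose=0)
```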
[00125] Results for Lymphoma Classification
[00126] After successfully training the model, the testing accuracy is computed on a testing dataset whose samples are entirely separate from the training samples. Testing accuracies of around 92.12% and 99.8% were achieved for the entire-image-based and patch-based methods, respectively, as shown in Table 3. From this evaluation, it can be concluded that, as the number of samples increases, the performance of the DL approach increases significantly. The highest accuracy is achieved with the patch-based method, which performs around 3.22% better compared to existing deep learning-based approaches for lymphoma classification.
Table 3: Testing accuracy for Lymphoma classification for image and patch-based methods
[00127] FIG. 6A is a graph showing examples of area under ROC curve values for an image-based method. FIG. 6B is a graph showing examples of area under ROC curve values for a patch-based method.
[00128] The confusion matrix with respect to the input patches of the three different classes of lymphoma is given in Table 4. The total number of testing patches is 13,500.
Table 4: Confusion matrix for lymphoma classification with respect to number of patches
[00129] Invasive Ductal Carcinoma (IDC) Detection
[00130] One of the most common types of breast cancer is IDC, and most of the time pathologists focus on specific regions to identify IDC. A common pre-processing step, called the automatic aggressiveness grading method, is used to define the exact region of IDC in whole slide images. FIG. 7A shows images of example samples of tissue without IDC. FIG. 7B shows images of example samples of tissue with IDC. The IRRCNN model with four IRRCNN and transition blocks was used for implementations of the present disclosure.
[00131] The database that was used is from a recently published paper. The samples in the database were down-sampled from their original x40 magnification by a factor of 16:1 for an apparent magnification of x2.5; the pre-processed samples are 50x50 pixels. Therefore, entire images were considered, with the input samples resized to 48x48 pixels. The total number of samples in the database is 275,223, where 196,455 samples belong to the first class and the remaining 78,768 samples belong to the second class. To resolve the class imbalance problem, 78,000 samples were randomly selected from each class. FIGS. 8A and 8B are diagrams showing examples of images of randomly-selected samples for the first class and second class, respectively.
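A minimal sketch of the class-balancing step (randomly selecting an equal number of samples per class) is shown below; the function name, random seed, and the small example counts are illustrative assumptions.

```python
# Hypothetical sketch of the class-balancing step described above: randomly
# selecting an equal number of samples (e.g., 78,000) from each class.
import numpy as np

def balance_classes(samples, labels, per_class, seed=0):
    rng = np.random.default_rng(seed)
    keep = [rng.choice(np.where(labels == cls)[0], size=per_class, replace=False)
            for cls in np.unique(labels)]
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return samples[keep], labels[keep]

# Small example with made-up data (10 positive and 30 negative samples).
labels = np.array([0] * 30 + [1] * 10)
samples = np.arange(40)
balanced_x, balanced_y = balance_classes(samples, labels, per_class=10)
```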
[00132] Experimental Results
[00133] Stochastic Gradient Descent (SGD) is used with a learning rate starting at 0.01. The training is performed for 60 epochs, and after 20 epochs the learning rate is decreased by a factor of 10. FIG. 9 is a graph showing examples of training and accuracy data for invasive ductal carcinoma (IDC) classification.
[00134] IDC Classification Testing Accuracy
[00135] The performance was evaluated using a testing database containing 31,508 samples. An F1-score of approximately 89.06% and a testing accuracy of approximately 89.07% were achieved. The testing results are shown in Table 5. From this table, it can be clearly observed that around a 4.39% better accuracy is achieved compared to existing recently published deep learning-based approaches for invasive ductal carcinoma (IDC) detection. The previous method provides results for 32x32 pixels, with resizing, center cropping, and center cropping with different rotations. However, data augmentation techniques were not applied.
Table 5: Testing accuracy for invasive ductal carcinoma classification
[00136] FIG. 10 is a graph showing examples of area under ROC curve values for invasive ductal carcinoma classification. In the testing phase, the results show an area under the ROC curve of around 0.9573. The total testing time for 31,508 samples is 109.585 seconds; therefore, the testing time per sample is 0.0035 seconds.
[00137] The confusion matrix for testing samples is as follows in Table 6:
Table 6: Confusion matrix for IDC classification testing samples
[00138] Nuclei segmentation is a very important problem in the field of digital pathology for several reasons. First, nuclei morphology is an important component in most cancer grading processes. Second, efficient nuclei segmentation techniques can significantly reduce the human effort needed for cell-level analysis and can therefore drastically reduce the cost of cell analysis. However, there are several challenges in segmenting the nuclei region: finding an accurate bounding box and segmenting overlapping nuclei. In this implementation, an R2U-Net (1-32-64-128-256-128-64-32-1) was used with 4M network parameters.
[00139] In this implementation, a database from ISMI-2017, published in 2017, can be used. The total number of samples is 100 images with 100 annotated masks. The size of the samples is 512x512. These samples are collected from 11 patients, where each patient has a different number of samples, varying from 3 to 8 images. FIGS. 11A-11B are diagrams showing examples of images for randomly-selected samples from the nuclei segmentation dataset from ISMI-2017. During training, a one-patient-out method is used, where one patient is randomly selected for testing and training is conducted on the remaining ten patients. An Adam optimizer with a learning rate of 2e-4 can be applied with a cross entropy loss, a batch size of two, and 1000 epochs.
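The one-patient-out protocol described above could be implemented along the lines of the following sketch; the patient identifiers generated in the example are hypothetical.

```python
# Hypothetical sketch of the one-patient-out protocol described above: one patient
# is held out for testing and training uses the remaining patients.
import numpy as np

def one_patient_out_split(patient_ids, seed=0):
    """patient_ids: array of per-sample patient identifiers.
    Returns boolean train/test masks with one randomly chosen patient held out."""
    rng = np.random.default_rng(seed)
    held_out = rng.choice(np.unique(patient_ids))
    test_mask = patient_ids == held_out
    return ~test_mask, test_mask

# Example with 100 samples drawn from 11 hypothetical patients.
pids = np.random.default_rng(1).integers(0, 11, size=100)
train_mask, test_mask = one_patient_out_split(pids)
```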
[00140] The entire model was trained for 1000 epochs, and transfer learning approaches were used after 200 epochs. FIG. 12 is a graph showing example values for training and validation accuracy for nuclei segmentation. From FIG. 12, it can be observed that the model shows very high accuracy during training, while around 98% accuracy was achieved during the validation phase.
[00141] In the testing phase, the method shows around a 97.70% testing accuracy on the testing dataset, which is 20% of the total samples. The experimental results are shown in Table 7. From Table 7, it can be seen that around a 3.31% better performance can be achieved compared to existing deep learning-based approaches for nuclei segmentation on the same dataset.
Table 7: One-patient-out testing accuracy for nuclei segmentation
[00142] FIGS. 13A-13D are diagrams showing examples of images for qualitative results for nuclei segmentation. FIG. 13A shows the input images, FIG. 13B shows the ground truth, FIG. 13C shows the model outputs, and FIG. 13D shows only the nuclei on the input samples.
[00143] The R2U-Net is applied for nuclei segmentation from whole slide images (WSI) and is evaluated on the ISMI-2017 dataset published in 2017. A one-patient-out approach is used for analysis of the accuracy. Experimentation has achieved a 97.7% testing accuracy for nuclei segmentation, which is around 3.31% better testing accuracy compared to existing DL-based approaches. A conclusion can be made that the qualitative results demonstrate very accurate segmentation compared to the ground truth.
[00144] Epithelium Segmentation
[00145] An R2U-Net model was used, which consists of encoding and decoding units. The total number of network parameters can be, for example, 1.107 million.
[00146] In most cases, the regions of cancer are manifested in the epithelium area. Therefore, the epithelium and stroma regions are very important for identifying cancerous cells. It is very difficult to predict the overall survival and outcome of breast cancer patients based on the histological pattern within the stroma region. Epithelium segmentation can help to identify the cancerous region where most of the cancer cells manifest. FIGS. 14A-14B are diagrams showing examples of images of database samples for epithelium segmentation. FIG. 14A shows the input samples, and FIG. 14B shows the corresponding binary masks for the input samples.
[00147] For epithelium segmentation, only 42 images in total were used. The size of each sample is 1000x1000 pixels. In this implementation, non-overlapping patches with a size of 128x128 were cropped, giving a total of 11,780 patches.
[00148] FIGS. 15A-15B are diagrams showing examples of images of database samples for epithelium segmentation. FIG. 15A shows an input sample and the ground truth of the corresponding sample. FIG. 15B shows the extracted non-overlapping patches for the input images and output masks.
[00149] Testing accuracies are shown in Table 8. 80% of the patches (9,424) were used for training, and the remaining 20% (2,356) were used for testing. The Adam optimizer was used with a learning rate of 2e-4 and a cross entropy loss. The experiment was conducted with a batch size of 16 and 150 epochs.
Table 8: Testing accuracy for epithelium segmentation
[00150] Through the experimental results, the performance was evaluated with different testing metrics. Accuracies of around 92.54% for testing and 90.5% for the F1-score were achieved. The method shows around 6.5% better performance compared to existing deep learning-based approaches for epithelium segmentation.
[00151] FIGS. 16A-16C are diagrams showing examples of experimental outputs for epithelium segmentation. FIG. 16A shows the input samples, FIG. 16B shows the ground truth, and FIG. 16C shows the model outputs.
[00152] FIG. 17 is a graph showing an example of a plot of the ROC curve for epithelium segmentation. As shown in FIG. 17, the results achieved a 92.02% area under the ROC curve for epithelium segmentation.
[00153] The R2U-Net is applied for epithelium segmentation from whole slide images (WSI). The experiment was conducted with an epithelium segmentation dataset, and 90.50% and 92.54% were achieved for the F1-score and accuracy, respectively. A conclusion can be made that the qualitative results demonstrate very accurate segmentation compared to the ground truth.
[00154] Tubule segmentation
[00155] The aggressiveness of cancer can be determined based on the morphology of the tubules in pathological images. The tubule region becomes massively disorganized in the later stages of cancer. FIGS. 18A-18B are diagrams showing examples of images of database samples for tubule segmentation.
[00156] The R2U-Net model was used, which is an end-to-end model consisting of encoding and decoding units. The total number of network parameters can be, for example, 1.107 million.
[00157] The total number of samples in the database was 42, and the size of the samples was 775x522 pixels. This number of samples is too low to train a deep learning approach. Some example samples are shown in FIGS. 18A-18B. Therefore, 256x256 non-overlapping patches were considered, for a total of 970 patches. Of these, 402 patches were from benign samples and the remaining 568 patches were from malignant samples. FIGS. 19A-19D are diagrams showing examples of patches from an input sample used for training and testing. 80% of the patches were used for training, and the remaining 20% were used for testing.
[00158] Database samples for tubule segmentation include an input sample (FIG. 19A) and the ground truth (FIG. 19B) of the corresponding sample. The extracted non-overlapping patches for the input images and output masks are shown in FIGS. 19C and 19D, respectively.
[00159] Experimental Results
[00160] The Adam optimizer was applied with a learning rate of 2e-4 and a cross entropy loss. A batch size of 16 and a number of epochs of 500 were used during training for tubule segmentation.
Table 9: Testing results for tubule segmentation
[00161] Testing Accuracy
[00162] The testing accuracies and the comparison against existing approaches are shown in Table 9. Around 90.31% testing accuracy and around 90.13% F1-score can be achieved, which is around 4.13% better performance compared to the existing deep learning-based approach for tubule segmentation.
[00163] Qualitative Results With Transparency
[00164] FIGS. 20A-20D are diagrams of examples of images for qualitative results for tubule segmentation. FIG. 20A shows the input samples. FIG. 20B shows the label masks. FIG. 20C shows the model outputs. FIG. 20D shows only the tubule regions from benign images.
[00165] FIGS. 21A-21D are diagrams showing examples of images for qualitative results for tubule segmentation. FIG. 21A shows the input samples. FIG. 21B shows the label masks. FIG. 21C shows the model outputs. FIG. 21D shows only the tubule regions from benign images.
[00166] FIG. 22 is a graph showing an example of an ROC curve for tubule segmentation. Analysis shows that it is possible to achieve a 90.45% area under the ROC curve, as shown in FIG. 22.
[00167] The R2U-Net is applied for tubule segmentation from whole slide images (WSI). The performance of R2U-Net is analyzed on a publicly available dataset for tubule segmentation. 90.13% and 90.31% were achieved for the F1-score and the accuracy, respectively. The qualitative results demonstrate very accurate segmentation compared to the ground truth.
[00168] Lymphocyte Detection
[00169] Lymphocytes are a very important part of the human immune system and a subtype of white blood cell (WBC). This type of cell is used in assessing different types of cancer such as breast cancer and ovarian cancer. A few main challenges and applications exist for lymphocyte detection. First, lymphocytes generally appear with a blue tint due to the absorption of hematoxylin. Second, their appearance and morphology are very similar in hue to nuclei. An application of lymphocyte detection is to identify cancer patients to place on immunotherapy and other therapies.
[00170] Lymphocyte Detection Dataset
[00171] FIGS. 23A and 23B are drawings of examples of input samples and label masks with single-pixel annotations, respectively. FIG. 23A shows the input images, and FIG. 23B shows the label masks with a single-pixel annotation for each lymphocyte, indicated with green dot pixels. The boundaries of the masks are indicated with a black border in FIG. 23B.
[00172] Experimental Results
[00173] The dataset can be taken from published papers. The total number of samples is 100, with 100 center-pixel-annotated masks. The size of each image is 100x100. 90% of the patches were used for training, and the remaining 10% were used for testing. An Adam optimizer was applied with a learning rate of 2e-4 and a cross entropy loss. In this implementation, a batch size of 32 and 1000 epochs were used. FIG. 24 is a graph showing examples of training and validation accuracy for lymphocyte detection.
Table 10: Testing results for lymphocyte detection
[00174] Testing Accuracy
[00175] The testing accuracy and comparison with other existing approaches are shown in Table 10. A testing accuracy of around 90.92% and an F1-score of around 0.82 were achieved, which is better performance compared to existing deep learning-based approaches for lymphocyte detection. FIGS. 25A-25D are diagrams showing examples of qualitative results for lymphocyte detection with UD-Net.
[00176] FIG. 25A shows the input samples, FIG. 25B shows the ground truth, FIG. 25C shows the model outputs, and FIG. 25D shows the final outputs, where the blue dots are the ground truth and the green dots are the model outputs.
[00177] The R2U-Net is applied to whole slide images (WSI) for lymphocyte detection. A 90.23% accuracy for lymphocyte detection was achieved. A conclusion can be made that the qualitative results demonstrate very accurate detection compared to the ground truth.
[00178] Mitosis Detection
[00179] The cell growth rate can be determined by counting mitotic events in pathological images, which is an important aspect of determining the aggressiveness of breast cancer. Presently, a manual counting process is applied in pathological practice, which is extremely difficult and time consuming. Therefore, an automatic mitosis detection approach has application in pathological practice.
[00180] Studies used a publicly available dataset. The total number of images was 302, collected from 12 patients. The actual size of the input sample was 2000x2000 pixels. FIG. 26 is a diagram showing examples of an image from the dataset.
[00181] The images in FIG. 26 include images for an input, an actual mask, a dilated mask, and a mask with target mitosis. As the number of mitotic cells is very small, different augmentation approaches were applied with rotations of {0, 45, 90, 135, 180, 215, 270} degrees. In one study, 32x32 patches were extracted from the input images, with the total number of patches being 728,073. From these patches, 100,000 patches were randomly selected, with 80,000 patches used for training and the remaining 20,000 patches used for testing. FIGS. 27-28 are drawings of examples of patches for non-mitosis and mitosis cells, respectively.
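The rotation-based augmentation described above might be sketched as follows; scipy's rotate function is used here as one possible implementation for the non-right-angle rotations, and the zero-valued example patch is a placeholder.

```python
# Hypothetical sketch of the rotation-based augmentation described above.
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(patch, angles=(0, 45, 90, 135, 180, 215, 270)):
    """patch: H x W x C mitosis patch. Returns one rotated copy per angle."""
    return [rotate(patch, angle, reshape=False, mode="reflect") for angle in angles]

patch = np.zeros((32, 32, 3), dtype=np.float32)   # placeholder 32x32 patch
augmented = augment_with_rotations(patch)
print(len(augmented))  # 7 rotated copies per input patch
```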
[00182] Training Approach.
[00183] The SGD optimization approach was used with an initial learning rate of 0.01. Training ran for 30 epochs, and after every 10 epochs the learning rate was decreased by a factor of 10. FIG. 29 shows a graph of example training and validation accuracy values for mitosis detection.
[00184] The performance of the IRRCNN model for image-level and patient-level mitosis detection was analyzed. Results are shown in Table 11.
Table 11: Testing results for mitosis detection
[00185] The testing phase achieved 99.54% testing accuracy for mitosis detection. In addition, the experimental results show 99.68% area under ROC curve. It takes 138.94 seconds for 20,000 samples.
[00186] Evaluation occurred for different advanced DCNN approaches, including IRRCNN, DCRCN, and R2U-Net, for solving classification, segmentation, and detection problems in digital pathology image analysis. For the classification tasks, 99.8% and 89.07% testing accuracies were achieved, which are 3.22% and 4.39% better for lymphoma classification and invasive ductal carcinoma (IDC) detection, respectively. For the segmentation tasks, the experimental results show 3.31%, 6.5%, and 4.13% superior performance for nuclei, epithelium, and tubule segmentation compared to existing Deep Learning (DL) based approaches. For lymphocyte detection, a 0.82% improvement in testing accuracy was achieved, and testing accuracies of 97.32% and around 60% were achieved for mitosis detection at the image level and patient level, respectively, which is significantly higher compared to existing methods. A conclusion can be made that the experimental results demonstrate the robustness and efficiency of the presently-disclosed DCNN methods for different use cases of computational pathology.
II - Improved Inception-Residual Convolutional Neural Network for Object Recognition
[00187] The present disclosure provides a DCNN model called the Inception Recurrent Residual Convolutional Neural Network (IRRCNN), which utilizes the power of the Recurrent Convolutional Neural Network (RCNN), the Inception network, and the Residual network. This approach improves the recognition accuracy of the Inception-Residual network with the same number of network parameters. In addition, the presently-disclosed architecture generalizes the Inception network, the RCNN, and the Residual network with significantly improved training accuracy. The performance of the IRRCNN model was empirically evaluated on different benchmarks, including CIFAR-10, CIFAR-100, TinyImageNet-200, and CU3D-100. The experimental results show higher recognition accuracy against most of the popular DCNN models, including the RCNN. The performance of the IRRCNN approach was also investigated against the Equivalent Inception Network (EIN) and the Equivalent Inception Residual Network (EIRN) counterparts on the CIFAR-100 dataset. Improvements in classification accuracy of around 4.53%, 4.49%, and 3.56% were reported compared with the RCNN, EIN, and EIRN on the CIFAR-100 dataset, respectively. Furthermore, experiments have been conducted on the TinyImageNet-200 and CU3D-100 datasets, where the IRRCNN provides better testing accuracy compared to the Inception Recurrent CNN (IRCNN), the EIN, and the EIRN.
[00188] FIG. 30 is a diagram of the visual information processing pipeline of the human brain. For example, v1 through v4 represent the visual cortex areas. The visual cortex areas v1 through v4 process information using recurrent techniques.
[00189] FIG. 31 is an overall layer flow diagram of a presently-disclosed IRRCNN. The IRRCNN includes the IRRCNN-Block, the IRRCNN-Transition block, and the Softmax layer at the end.
[00190] IRRCNN Architecture
[00191] The present disclosure provides an improved DCNN architecture based on Inception, Residual networks and the RCNN architecture. Therefore, the model can be called the Inception Recurrent Residual Convolutional Neural Network (IRRCNN).
[00192] An objective of this model is to improve recognition performance using the same number of, or fewer, computational parameters when compared to alternative equivalent deep learning approaches. In this model, the inception-residual units utilized are based on Inception-v4. The Inception-v4 network is a deep learning model that concatenates the outputs of convolution operations with different-sized convolution kernels in the inception block. Inception-v4 is a simplified structure of Inception-v3, containing more inception modules and using lower-rank filters. Furthermore, Inception-v4 includes a residual concept in the inception network, called the Inception-v4 Residual Network, which improves the overall accuracy of recognition tasks. In the Inception-Residual network, the outputs of the inception units are added to the inputs of the respective units. The overall structure of the presently-disclosed IRRCNN model is shown in FIG. 31. From FIG. 31, it can be seen that the overall model consists of several convolution layers, IRRCNN blocks, transition blocks, and a Softmax at the output layer.
[00193] FIG. 32 is a diagram of an example of an architecture for the Inception Recurrent Residual Convolutional Neural Network (IRRCNN) block. The block consists of the inception unit at the top which contains recurrent convolutional layers that are merged by concatenation, and the residual units. A summation of the input features with the outputs of the inception unit can be seen at the end of the block.
[00194] A part of this presently-disclosed architecture is the IRRCNN block, which includes RCLs, inception units, and residual units (shown in detail in FIG. 32). The inputs are fed into the input layer, then passed through inception units where RCLs are applied, and finally the outputs of the inception units are added to the inputs of the IRRCNN-block. The recurrent convolution operations are performed with respect to the different-sized kernels in the inception unit. Due to the recurrent structure within the convolution layer, the outputs at the present time step are added to the outputs of the previous time step. The outputs at the present time step are then used as inputs for the next time step. The same operations are performed with respect to the number of time steps that are considered. For example, here k=2 means that 3 RCLs are included in the IRRCNN-block. In the IRRCNN-block, the input and output dimensions do not change, as this is simply an accumulation of feature maps with respect to the time steps. As a result, the healthier features ensure that better recognition accuracy is achieved with the same number of network parameters.
[00195] The operations of the RCL are performed with respect to the discrete time steps that are expressed according to the RCNN. Consider the x_l input sample in the l-th layer of the IRRCNN-block and a pixel located at (i, j) in an input sample on the k-th feature map in the RCL. Additionally, assume that the output of the network O_{ijk}^{l}(t) is at the time step t. The output can be expressed as follows:

O_{ijk}^{l}(t) = (w_k^f)^T x_l^{f(i,j)}(t) + (w_k^r)^T x_l^{r(i,j)}(t-1) + b_k    (7)

[00196] Here, x_l^{f(i,j)}(t) and x_l^{r(i,j)}(t-1) are the inputs for the standard convolution layers and for the l-th RCL, respectively. The w_k^f and w_k^r values are the weights for the standard convolutional layer and the RCL of the k-th feature map, respectively, and b_k is the bias.
y = f(O_{ijk}^{l}(t)) = \max(0, O_{ijk}^{l}(t))    (8)
[00197] Here, f is the standard Rectified Linear Unit (ReLU) activation function. The performance of this model was also explored with the Exponential Linear Unit (ELU) activation function in the following experiments. The outputs y of the inception units for the different-sized kernels and the average pooling layer are defined as y_{1\times1}(x), y_{3\times3}(x), and y_{1\times1}^{p}(x), respectively. The final outputs of the Inception Recurrent Convolutional Neural Network (IRCNN) unit are defined as T(x_l, w_l), which can be expressed as

T(x_l, w_l) = y_{1\times1}(x) \oplus y_{3\times3}(x) \oplus y_{1\times1}^{p}(x)    (9)

[00198] Here, \oplus represents the concatenation operation with respect to the channel or feature-map axis. The outputs of the IRCNN-unit are then added to the inputs of the IRRCNN-block. The residual operation of the IRRCNN-block can be expressed by the following equation:

x_{l+1} = x_l + T(x_l, w_l)    (10)

where x_{l+1} refers to the inputs for the immediate next transition block, x_l represents the input samples of the IRRCNN-block, w_l represents the kernel weights of the l-th IRRCNN-block, and T(x_l, w_l) represents the outputs of the l-th layer of the IRCNN-unit. However, the number of feature maps and the dimensions of the feature maps for the residual units are the same as in the IRRCNN-block shown in FIG. 32. Batch normalization is applied to the outputs of the IRRCNN-block. Eventually, the outputs of this IRRCNN-block are fed to the inputs of the immediate next transition block.
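To make the structure described by equations (7) through (10) concrete, the following is a minimal, hypothetical Keras sketch of an IRRCNN block: recurrent convolutional layers in 1x1 and 3x3 inception branches plus a pooled branch, channel-wise concatenation, and residual addition of the block input. The 1x1 projection used to match the input channel count before the residual addition, as well as the filter counts and time steps, are illustrative choices rather than the exact disclosed configuration.

```python
# Hypothetical Keras sketch of an IRRCNN block: recurrent convolutional layers in
# 1x1 and 3x3 inception branches plus a pooled branch, channel concatenation
# (eq. 9), and residual addition of the block input (eq. 10).
from tensorflow.keras import layers

def rcl(x, filters, kernel, t=2):
    """Recurrent convolutional layer with separate feed-forward/recurrent weights."""
    conv_f = layers.Conv2D(filters, kernel, padding="same", activation="relu")
    conv_r = layers.Conv2D(filters, kernel, padding="same", activation="relu")
    ff = conv_f(x)
    state = ff
    for _ in range(t):
        state = layers.add([ff, conv_r(state)])   # accumulation over time steps
    return state

def irrcnn_block(x, filters, t=2):
    b1 = rcl(x, filters, 1, t)                    # 1x1 recurrent branch
    b3 = rcl(x, filters, 3, t)                    # 3x3 recurrent branch
    pooled = layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(x)
    bp = layers.Conv2D(filters, 1, padding="same", activation="relu")(pooled)
    y = layers.concatenate([b1, b3, bp])          # inception-style concatenation
    y = layers.Conv2D(x.shape[-1], 1, padding="same")(y)  # project back to input depth
    y = layers.add([x, y])                        # residual addition of the block input
    return layers.BatchNormalization()(y)

# Illustrative usage.
inputs = layers.Input((32, 32, 64))
outputs = irrcnn_block(inputs, filters=32, t=2)
```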
[00199] In the transition block, different operations are performed, including convolution, pooling, and dropout, depending upon the placement of the transition block in the network. Inception units were not included in the transition block in the small-scale implementation for CIFAR-10 and CIFAR-100. However, inception units were applied in the transition block during the experiment using the TinyImageNet-200 dataset and for the large-scale model, which is the equivalent of Inception-v3. The down-sampling operations are performed in the transition block, where max-pooling operations are performed with a 3x3 patch and a 2x2 stride. The non-overlapping max-pooling operation has a negative impact on model regularization. Therefore, overlapped max-pooling was used to regularize the network, which is very important when training a deep network architecture. Late use of a pooling layer helps to increase the non-linearity of the features in the network, as this results in higher-dimensional feature maps being passed through the convolution layers in the network. Two special pooling layers were applied in the model with three IRRCNN-blocks and a transition block for the experiments that use the CIFAR-10 or CIFAR-100 dataset.
[00200] Only 1x1 and 3x3 convolution filters were used in this implementation, as inspired by the NiN and SqueezeNet models. This also helps to keep the number of network parameters at a minimum. The benefit of adding a 1x1 filter is that it helps to increase the non-linearity of the decision function without having any impact on the convolution layer. Since the size of the input and output features does not change in the IRRCNN blocks, it is just a linear projection on the same dimension, and non-linearity is added through the ReLU and ELU activation functions. A dropout of 0.5 was used after each convolution layer in the transition block. Finally, a Softmax, or normalized exponential function, layer was used at the end of the architecture. For an input sample x, a weight vector W, and K distinct linear functions, the Softmax operation can be defined for the i-th class as follows:
P(y = i \mid x) = \frac{e^{x^T w_i}}{\sum_{k=1}^{K} e^{x^T w_k}}    (11)
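For illustration, equation (11) corresponds to the following small numpy sketch; the max-subtraction is a standard numerical-stability detail, not something stated above.

```python
# Illustrative numpy version of the softmax operation in equation (11).
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))  # class probabilities that sum to 1
```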
[00201] The presently-disclosed IRRCNN model has been investigated through a set of experiments on different benchmark datasets and compared across different models.
[00202] Experiments
[00203] The presently-disclosed IRRCNN model has been evaluated using four different benchmark datasets: CIFAR-10, CIFAR-100, TinyImageNet-200, and CU3D-100. The dataset statistics are provided in Table 12. Different validation and testing samples were used for the TinyImageNet-200 dataset. The entire experiment was conducted in a Linux environment running on a single-GPU machine with an NVIDIA GTX-980Ti.
Table 12: Statistics for the datasets studied in these experiments.
[00204] Experiments on CIFAR-10 and 100 Datasets
[00205] In this experiment, two convolution layers were used at the beginning of the architecture, followed by three IRRCNN blocks with three transition blocks, and one global average pooling and Softmax layer at the end. First, the IRRCNN model was evaluated using the stochastic gradient descent (SGD) technique and a default initialization technique. A momentum equal to 0.9 and a decay equal to 9.99e-07 were used in this experiment. Second, the same model was evaluated with the Layer-sequential unit-variance (LSUV) initialization method and the latest improved version of the optimization function called EVE. The hyper-parameters for the EVE optimization function are as follows: the learning rate (λ) is 1e-4, the decay (γ) is 1e-4, β1 = 0.9, β2 = 0.999, β3 = 0.999, k = 0.1, K = 10, and ε = 1e-08. The values β1, β2 ∈ [0,1] are exponential decay rates for moment estimation in Adam. The value β3 ∈ [0,1) is an exponential decay rate for computing relative changes. The IRRCNN-block uses the l2-norm with a weight regularization of 0.002. The ReLU activation function was used in the first experiment, and the ELU activation is used in the second experiment. In both experiments, the networks were trained for 350 epochs with a batch size of 128 for CIFAR-10 and CIFAR-100.
[00206] CIFAR-10
[00207] The CIFAR-10 dataset is a benchmark dataset for object classification. The dataset consists of 32x32 color images split into 50,000 samples for training, with the remaining 10,000 samples used for testing (classification into one of 10 classes). The experiment was conducted with and without data augmentation. When using data augmentation, only random horizontal flipping was applied. This approach achieved around 8.41% testing error without data augmentation and 7.37% testing error with augmented data (only horizontal flipping) using SGD techniques.
[00208] FIG. 33 is a diagram showing example images from the CIFAR-10 dataset. The presently-disclosed model shows better recognition than most of the DCNN models displayed in Table 13. Furthermore, improved performance is observed for the IRRCNN when using LSUV initialization and the EVE optimization function. The results show testing errors of around 8.17% and 7.11% without and with data augmentation, respectively. It is also observed that the IRRCNN shows better performance when compared to the equivalent IRCNN model. Table 13 shows the testing error (%) of the IRRCNN on the CIFAR-10 object classification dataset without and with data augmentation. For unbiased comparison, the accuracy stated in recent studies using a similar experimental setting is listed.
Table 13: Testing error (%) of the IRRCNN on CIFAR-10 object classification dataset without and with data augmentation.
[00209] CIFAR-100
[00210] Another similar benchmark for object classification was developed in 2009. The dataset contains 50,000 samples for training and 10,000 samples for validation and testing. Each sample is a 32x32x3 image, and the dataset has 100 classes. The presently-disclosed IRRCNN model was studied with and without data augmentation. During the experiment with augmented data, the SGD and LSUV initialization approaches and the EVE optimization function were used. In both cases, the presently-disclosed technique shows better recognition accuracy compared with different DCNN models, including the IRCNN. Examples of values for the validation accuracy of the IRRCNN model for both experiments on CIFAR-100 with data augmentation are shown in FIG. 34. The presently-disclosed IRRCNN model shows better performance in both experiments when compared to the IRCNN, EIN, and EIRN models. The experimental results for CIFAR-100 are shown in Table 14. The IRRCNN model provides better testing accuracy compared to many recently developed methods. A recognition accuracy of 72.78% was achieved with LSUV+EVE, which is around a 4.49% improvement compared to one of the baseline RCNN methods with almost the same number of parameters (~3.5M). Table 14 shows the testing error (%) of the IRRCNN on the CIFAR-100 object classification dataset without and with data augmentation (DA). For unbiased comparison, the accuracy provided by recent studies in a similar experimental setting is listed. Here, C100 refers to without data augmentation and C100+ refers to with data augmentation.
Table 14: Testing error (%) of the IRRCNN on the CIFAR-100 object classification dataset without and with data augmentation (DA).
[00211] Impact of Recurrent Convolution Layers
[00212] A question that may arise here is whether there is any advantage of the IRRCNN model over the EIRN and EIN architectures. The EIN and EIRN models are implemented with a similar architecture and the same number of network parameters (~3.5M). Sequential convolution layers with the same number of time-steps and the same-sized kernels were used instead of RCLs to implement the EIN and EIRN models. In addition, in the case of EIRN, the residual concept is incorporated with an Inception block as in Inception-v4. Furthermore, the performance of the IRRCNN model against the RCNN has been investigated with the same number of parameters on the TinyImageNet-200 dataset.
[00213] Another question that may arise here is whether the IRRCNN model provides better performance merely due to the use of advanced deep learning techniques. It is noted that the LSUV initialization approach applied to the DCNN architecture called FitNet4 achieved 70.04% classification accuracy on data augmented with mirroring and random shifts for CIFAR-100. In contrast, only random horizontal flipping was applied here for data augmentation, and around 1.76% better recognition accuracy was achieved compared to FitNet4.
[00214] FIG. 34 is a graph showing example training and validation accuracy values for IRRCNN, IRCNN, EIN, and EIRN on CIFAR-100. The vertical and horizontal axes represent accuracy and epochs, respectively. The presently-disclosed model shows the best recognition accuracy in all cases.
[00215] The model accuracy for both training and validation is shown in FIG. 34. From the figures, it is observed that the presently-disclosed model shows lower loss and the highest recognition accuracy compared to EIN and EIRN, which demonstrates the benefit of the presently-disclosed models. FIG. 35 is a graph showing examples of values for the testing accuracy of the IRRCNN model against IRCNN, EIN, and EIRN on the augmented CIFAR-100 dataset. In summary, the presently-disclosed IRRCNN provides around 1.02%, 4.49%, and 3.56% improved testing accuracy compared to IRCNN, EIN, and EIRN, respectively.
[00216] Experiment on TinyImageNet-200
[00217] The presently-disclosed approach was also analyzed on the TinyImageNet-200 dataset. This dataset contains 100,000 samples for training, 10,000 samples for validation, and 10,000 samples for testing. These images are sourced from 200 different classes of objects. The main difference between the main ImageNet dataset and TinyImageNet is that the images are down-sampled from 224x224 to 64x64. There are some negative impacts of down-sampling, such as loss of detail. Down-sampling the images therefore leads to ambiguity, which makes this problem even more difficult and affects overall model accuracy.
[00218] FIG. 36 is a diagram showing examples of sample images from the TinyImageNet-200 dataset. For this experiment, the IRRCNN model with two general convolution layers with a 3x3 kernel was used at the beginning of the network, followed by a sub-sampling layer with a 3x3 convolution using a stride of 2x2. After that, four IRRCNN blocks are used followed by four transition blocks. Finally, a global average pooling layer is used followed by a Softmax layer.
[00219] FIGS. 37A and 37B are graphs showing examples of accuracy values during training and validation, respectively, for the TinyImageNet-200 dataset. For example, the graphs can result from experimentation with the IRRCNN, IRCNN, equivalent RCNN, EIN, and EIRN using the TinyImageNet-200 dataset. The presently-disclosed IRRCNN model provides better recognition accuracy during training compared to the equivalent models, including IRCNN, EIN, and EIRN, with almost the same number of network parameters (~15M). Generally, a DCNN takes a lot of time and power when training a reasonably large model. The Inception-Residual networks with RCLs significantly reduce training time, with faster convergence and better recognition accuracy. FIG. 38 is a graph showing examples of values for validation accuracy for various models on the Tiny-ImageNet dataset.
[00220] FIG. 39 is a graph showing examples of values for the top-1% and top-5% testing accuracy on the TinyImageNet-200 dataset. From the bar graph, the impact of recurrent connectivity is clearly observed: a top-1% testing accuracy of 52.23% was achieved, whereas the EIRN and EIN show 51.14% and 45.63% top-1% testing accuracy, respectively. The same behavior is observed for the top-5% accuracy as well. The IRRCNN provides better testing accuracy when compared against all other models in both cases, which demonstrates the robustness of the presently-disclosed deep learning architecture.
[00221] Inception-v3, WRN Versus Equivalent IRRCNN Model
[00222] The IRRCNN model was evaluated with a large-scale implementation against Inception-v3 and WRN. The IRRCNN model is implemented with a similar structure to Inception-v3 for impartial comparison. The default implementation of Keras version 2.0 was used, and the RCLs were incorporated with k = 2, which means 2 RCLs are used in the Inception units and a residual layer is added at the end of the block. The network was trained with the SGD method with momentum. The concept of transfer learning is used, where training was performed for 100 epochs in total. After successfully completing the initial training process for 50 epochs with a learning rate of 0.001, the learned weights were used as initial weights for the next 50 epochs of fine-tuning of the network with a learning rate of 0.0001.
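The two-stage schedule described above (initial training at a learning rate of 0.001, then fine-tuning from the learned weights at 0.0001 with SGD and momentum) might look like the following sketch; the tiny placeholder network, random data, and shortened epoch counts are stand-ins so the example stays self-contained, and they do not reproduce the disclosed IRRCNN configuration.

```python
# Hypothetical sketch of the two-stage schedule described above.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import SGD

model = models.Sequential([
    layers.Input((64, 64, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(100, activation="softmax"),
])
x = np.random.rand(16, 64, 64, 3).astype("float32")
y = np.eye(100)[np.random.randint(0, 100, 16)]

# Stage 1: initial training (50 epochs in the text; shortened here).
model.compile(optimizer=SGD(learning_rate=0.001, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=2, verbose=0)

# Stage 2: recompiling keeps the learned weights, which become the initial
# weights for fine-tuning at the lower learning rate.
model.compile(optimizer=SGD(learning_rate=0.0001, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=2, verbose=0)
```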
[00223] CU3D-100 dataset
[00224] Another very high-quality visual object recognition dataset with well-controlled images (e.g., object invariance, feature complexity) is CU3D-100, which is suitable for the evaluation of deep learning algorithms. This dataset contains 18,840 color images in total, each with a dimension of 64x64x3, with 20 samples per exemplar. FIG. 40 is a diagram showing examples of images from the CU3D-100 dataset.
[00225] The images in this dataset are three-dimensional views of real-world objects normalized for different positions, orientations, and scales. The rendered images have a 40° depth rotation about the y-axis (plus a horizontal flip), a 20° tilt rotation about the x-axis, and an 80° overhead lighting rotation. 75% of the images were used for training, and the remaining 25% were used for testing, selected randomly from the whole dataset.
[00226] FIGS. 41A-41C are diagrams showing sample images in the fish category with different lighting conditions and affine transformations. FIG. 41A shows nine examples from the fish category. FIG. 41B shows nine depth, tilt, and lighting variations of the fish category. FIG. 41C shows nine affine transformation images.
[00227] Experimental Results on CU3D
[00228] Two different experiments were conducted. In the first case, the models were trained from scratch. A transfer learning approach with pre-trained weights from the ImageNet dataset was used for the second experiment on the CU3D-100 dataset. The IRRCNN model and an equivalent Inception-v3 model, which contain 21.25M and 19.74M network parameters respectively, were considered. The WRN model consists of depth and width factors (x and y, respectively) and contains 31.25M network parameters. At the beginning, the entire dataset is divided into two sets, where 75% (14,130) of the samples are used for training and validation, and the remaining 25% (4,710) of the samples are used for testing. For the first experiment, 10% of the 14,130 samples are used for validation during training.
[00229] FIGS. 42A and 42B are graphs showing examples of values for training and validation accuracy with respect to epoch for the CU3D-100 dataset. The training and validation accuracy are shown for 25 epochs. FIGS. 42A-42B show that the IRRCNN model exhibits lower error during training and validation when compared to the Inception-v3 and WRN models.
[00230] In the testing phase, 99.81%, 99.13%, and 98.51% testing accuracies were achieved with IRRCNN, Inception-v3, and WRN, respectively. The IRRCNN model shows 0.68% and 1.30% higher testing accuracy compared to Inception-v3 and WRN. A recently published paper with sparse neural networks with recurrent layers reported about 94.6% testing accuracy on the CU3D dataset, which is around 5.24% less testing accuracy compared to the presently-disclosed IRRCNN model. Second, the pre-trained ImageNet weights are used as initial weights for the IRRCNN and Inception-v3 models, and only a few layers at the top of the models were trained. The pre-trained weights were taken from GitHub. The presently-disclosed model gives about 98.84% testing accuracy, while the Inception-v3 model gives 92.16% testing accuracy on the CU3D-100 dataset. The IRRCNN model shows around 6.68% better testing accuracy compared to the similar Inception-v3 model. This experiment also demonstrates the robustness of the IRRCNN model when dealing with scale invariance, position and rotation invariance, and input samples with different lighting conditions.
[00231] Trade-off Between Split Ratio and Accuracy
[00232] To further investigate the performance of the presently-disclosed IRRCNN model, the trade-off between the split ratio and performance is investigated against Inception-v3 and WRN. During this experiment, different split ratios [0.9, 0.7, 0.5, 0.3, and 0.1] were used. The numbers of training and validation samples are taken according to the split ratio, where the number of training samples is increased and the number of validation samples is decreased across the trials. For example, a split ratio of 0.9 refers to only 10% of the samples (1,423) being used for training, with the remaining 90% of the samples (12,815) used for validation in the first trial. A split ratio of 0.7 means that 30% of the samples are used for training and the remaining 70% of the samples are used for validation in the second trial, and so on.
[00233] FIGS. 43A and 43B are graphs showing examples of errors versus split ratio for five different trials on the CU3D-100 dataset for training and validation, respectively. It can also be observed from FIGS. 42A-42B that the models converged after 22 epochs. Therefore, in each trial, 25 epochs were considered, and the errors here are the average training and validation errors of the last five epochs. FIGS. 43A-43B show that the presently-disclosed IRRCNN model shows lower training and validation errors for the five different trials in both cases. These results clearly demonstrate that the IRRCNN is more capable of extracting, representing, and learning features during the training phase, which ultimately helps to ensure better testing performance. FIG. 44 is a graph showing examples of values for testing accuracy for different trials on the CU3D-100 dataset. In each trial, the models have been tested with the remaining 25% of the samples, and the testing errors are shown in FIG. 44. From FIG. 44, it is seen that the IRRCNN shows the lowest error for almost all trials compared to Inception-v3 and WRN.
[00234] Computational time
[00235] The computational time of the presently-disclosed models and other equivalent models for different datasets is shown in Table 15.
Table 15: Computational time per epoch for different models using different datasets.
[00236] Introspection
[00237] From this investigation, it can be observed that the presently-disclosed IRRCNN model converges faster when compared to the RCNN, EIN, EIRN, and IRCNN models, as clearly evaluated using a set of experiments. The presently-disclosed techniques provide promising recognition accuracy during the testing phase with the same number of network parameters compared with other models. In this implementation, input samples were augmented by applying only random horizontal flipping. From observation, the presently-disclosed model is expected to provide even better recognition accuracy with more augmentations, including translation, central cropping, and ZCA whitening.
III - Recurrent Residual U-Net (R2U-Net) for Medical Image Segmentation
[00238] The present disclosure describes using a Recurrent U-Net as well as a Recurrent Residual U-Net model, which are named RU-Net and R2U-Net, respectively. The presently-disclosed models utilize the power of U-Net, Residual Networks, and Recurrent Convolutional Neural Networks (RCNNs). There are several advantages of these presently-disclosed architectures for segmentation tasks. First, a residual unit helps when training deep architectures. Second, feature accumulation with recurrent residual convolutional layers ensures better feature representation for segmentation tasks. Third, better U-Net architectures can be designed using the same number of network parameters with better performance for medical image segmentation. The presently-disclosed models are tested on three benchmark datasets: blood vessel segmentation in retina images, skin cancer segmentation, and lung lesion segmentation. The experimental results show superior performance on segmentation tasks compared to equivalent models, including a variant of a fully convolutional network (FCN) called SegNet, U-Net, and the residual U-Net (ResU-Net).
[00239] FIGS. 45A-45C are diagrams showing medical image segmentation examples. FIG. 45A shows retina blood vessel segmentation. FIG. 45B shows skin cancer lesion segmentation. FIG. 45C shows lung segmentation.
[00240] The present disclosure can be applied to different modalities of medical imaging, including segmentation, classification, detection, registration, and medical information processing. Medical imaging comes from different imaging techniques such as Computed Tomography (CT), ultrasound, X-ray, and Magnetic Resonance Imaging (MRI). The goal of Computer-Aided Diagnosis (CAD) is to obtain a faster and better diagnosis to ensure better treatment of a large number of people at the same time. Additionally, efficient automatic processing reduces human error and also reduces overall time and cost.
[00241] FIG. 46 is a diagram showing an example of an RU-Net architecture with convolutional encoding and decoding units using recurrent convolutional layers (RCL) which is based on a U-Net architecture. The residual units are used with the RCL and R2U-Net architectures. Since the number of feature maps increases in the deeper layers, the number of network parameters also increases. Eventually, the softmax operations are applied at the end of the network to compute the probability of the target classes.
[00242] As opposed to classification tasks, the architecture for segmentation tasks requires both convolutional encoding and decoding units. The encoding unit is used to encode input images into a larger number of feature maps with lower dimensionality. The decoding unit is used to perform up-convolution (transpose convolution, or what is occasionally called de-convolution) operations to produce segmentation maps with the same dimensionality as the original input image. Therefore, the architecture for segmentation tasks generally requires almost double the number of network parameters when compared to the architecture for classification tasks. Thus, it is important to design efficient DCNN architectures for segmentation tasks which can ensure better performance with fewer network parameters.
[00243] This specification discloses two modified and improved segmentation models, one using recurrent convolutional networks and another using recurrent residual convolutional networks. The presently-disclosed models can be evaluated on different modalities of medical imaging as shown in FIGS. 45A-45C. First, the present disclosure provides at least two deep-learning models, including RU-Net and R2U-Net. Second, experiments are conducted on three different modalities of medical imaging, including retina blood vessel segmentation, skin cancer segmentation, and lung segmentation. Third, performance evaluation of the presently-disclosed models is conducted for the patch-based method for retina blood vessel segmentation tasks and the end-to-end image-based approach for skin lesion and lung segmentation tasks. Fourth, comparison against recently used state-of-the-art methods shows superior performance against equivalent models with the same number of network parameters. Fifth, empirical evaluation of the robustness of the presently-disclosed R2U-Net model against SegNet and U-Net is conducted based on the trade-off between the number of training samples and performance during the training, validation, and testing phases.
[00244] Related Work
[00245] According to the U-Net architecture, the network consists of two main parts: the convolutional encoding and decoding units. The basic convolution operations are performed followed by ReLU activation in both parts of the network. For down-sampling in the encoding unit, 2x2 max-pooling operations are performed. In the decoding phase, the convolution transpose (representing up-convolution, or de-convolution) operations are performed to up-sample the feature maps. The U-Net model provides several advantages for segmentation tasks. First, this model allows for the use of global location and context at the same time. Second, it works with very few training samples and provides better performance for segmentation tasks. Third, an end-to-end pipeline processes the entire image in the forward pass and directly produces segmentation maps. This ensures that U-Net preserves the full context of the input images, which is a major advantage when compared to patch-based segmentation approaches.
[00246] RU-NET and R2U-NET Architectures
[00247] Inspired by the deep residual model, the RCNN, and the U-Net model, two models for segmentation tasks, which are named RU-Net and R2U-Net, can be used. These two approaches utilize the strengths of all three recently developed deep learning models. The RCNN and its variants have already shown superior performance on object recognition tasks using different benchmarks. The recurrent residual convolutional operations can be demonstrated mathematically according to the improved residual networks. The operations of the Recurrent Convolutional Layers (RCLs) are performed with respect to the discrete time steps that are expressed according to the RCNN. Let's consider an input sample x_l in the l-th layer of the residual RCNN (RRCNN) block and a pixel located at (i, j) at the center of a patch on the k-th feature map in the RCL. Additionally, let's assume the output of the network O^l_{ijk}(t) is at time step t. The output can be expressed as follows:
$$O^{l}_{ijk}(t) = \big(w^{f}_{k}\big)^{T} * x^{f(i,j)}_{l}(t) + \big(w^{r}_{k}\big)^{T} * x^{r(i,j)}_{l}(t-1) + b_{k} \qquad (12)$$
[00248] Here, x^{f(i,j)}_l(t) and x^{r(i,j)}_l(t-1) are the inputs to the standard convolution layer and the l-th RCL respectively. The w^f_k and w^r_k values are the weights of the standard convolutional layer and the RCL of the k-th feature map respectively, and b_k is the bias. The outputs of the RCL are fed to the standard ReLU activation function f and are expressed as:
$$\mathcal{F}(x_{l}, w_{l}) = f\big(O^{l}_{ijk}(t)\big) = \max\big(0,\; O^{l}_{ijk}(t)\big) \qquad (13)$$
[00249] Here, F(x_l, w_l) represents the output of the l-th layer of the RCNN unit. The output of F(x_l, w_l) is used for the downsampling and upsampling layers in the convolutional encoding and decoding units of the RU-Net model respectively. In the case of R2U-Net, the final outputs of the RCNN unit are passed through the residual unit that is shown in FIG. 47D. Let's consider the output of the RRCNN-block to be x_{l+1}, which can be calculated as follows:
$$x_{l+1} = x_{l} + \mathcal{F}(x_{l}, w_{l}) \qquad (14)$$
[00250] Here, x_l represents the input samples of the RRCNN-block. The x_{l+1} sample is the input for the immediately succeeding subsampling or up-sampling layers in the encoding and decoding convolutional units of the R2U-Net model. However, the number of feature maps and the dimensions of the feature maps for the residual units are the same as in the RRCNN-block shown in FIG. 47D.
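The recurrence of Equations 12-14 can be illustrated with a minimal Keras sketch of an RCL unit and an RRCNN block. The helper names, the fixed 3x3 kernel, and the 1x1 shortcut convolution (added only so the residual addition of Equation 14 has matching feature-map counts) are illustrative assumptions, not the exact implementation used in the experiments.

```python
from tensorflow.keras import layers

def recurrent_conv_layer(x, filters, t=2):
    """Unrolled recurrent convolutional layer (RCL), Equations 12-13.

    One forward 3x3 convolution produces the forward term; a second convolution,
    whose weights are shared across time steps, is applied to the accumulated
    response, added to the forward term, and passed through ReLU.
    """
    forward_conv = layers.Conv2D(filters, 3, padding="same")
    recurrent_conv = layers.Conv2D(filters, 3, padding="same")  # weights shared over time steps
    forward = forward_conv(x)
    state = forward
    for _ in range(t):
        state = layers.Activation("relu")(layers.Add()([forward, recurrent_conv(state)]))
    return state

def rrcnn_block(x, filters, t=2):
    """Recurrent residual block: x_{l+1} = x_l + F(x_l, w_l), per Equation 14."""
    # 1x1 convolution so the residual shortcut matches the feature-map count.
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    out = recurrent_conv_layer(shortcut, filters, t=t)
    out = recurrent_conv_layer(out, filters, t=t)
    return layers.Add()([shortcut, out])
```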
[00251] The presently-disclosed deep learning models are the building blocks of the stacked convolutional units shown in FIGS. 47B and 47D.
[00252] There are four different architectures evaluated in this work. First, the U-Net with forward convolution layers and feature concatenation is applied as an alternative to the crop-and-copy method found in the primary version of U-Net. The basic convolutional unit of this model is shown in FIG. 47A. Second, the U-Net model with forward convolutional layers with residual connectivity is used, which is often called a residual U-Net (or ResU-Net) and is shown in FIG. 47C. The third architecture is the U-Net model with forward recurrent convolutional layers as shown in FIG. 47B, which is named RU-Net. Finally, the last architecture is the U-Net model with recurrent convolutional layers with residual connectivity as shown in FIG. 47D, which is named R2U-Net. FIGS. 48A and 48B are diagrams showing examples of unfolded recurrent convolutional units for t = 2 and t = 3, respectively. The pictorial representation of the unfolded RCLs with respect to the time step is shown in FIGS. 48A-48B. Here t = 2 (0~2) refers to the recurrent convolutional operation that includes one single convolution layer followed by two subsequent recurrent convolutional layers. In some implementations, concatenation of the feature maps from the encoding unit to the decoding unit can be applied for the RU-Net and R2U-Net models.
[00253] The differences between the presently-disclosed models and the U-Net model are three-fold. First, this architecture consists of convolutional encoding and decoding units that are the same as those used in the U-Net model, but RCLs (and RCLs with residual units) are used instead of regular forward convolutional layers in both the encoding and decoding units. The residual unit with RCLs helps to develop a more efficient, deeper model. Second, an efficient feature accumulation method is included in the RCL units of both of the presently-disclosed models. The effectiveness of feature accumulation from one part of the network to the other has been shown in CNN-based segmentation approaches for medical imaging, where the element-wise feature summation is performed outside of the U-Net model; that approach only shows a benefit during the training process in the form of better convergence. The presently-disclosed models, however, show benefits for both the training and testing phases due to the feature accumulation inside the model. The feature accumulation with respect to different time steps ensures better and stronger feature representation. Thus, it helps extract very low-level features which are essential for segmentation tasks in different modalities of medical imaging (such as blood vessel segmentation). Third, the cropping and copying unit can be removed from the basic U-Net model, and only concatenation operations are used. With all the above-mentioned changes, the presently-disclosed models outperform equivalent SegNet, U-Net, and ResU-Net models, which ensures better performance with the same or a fewer number of network parameters.
[00254] There are several advantages of using the presently-disclosed architectures when compared to U-Net. The first is the efficiency in terms of the number of network parameters. The presently-disclosed RU-Net and R2U-Net architectures are designed to have the same number of network parameters when compared to U-Net and ResU-Net, and the RU-Net and R2U-Net models show better performance on segmentation tasks. The recurrent and residual operations do not increase the number of network parameters. However, they do have a significant impact on training and testing performance, which is shown through empirical evaluation with a set of experiments. This approach is also generalizable, as it can easily be applied to deep learning models based on SegNet, 3D-UNet, and V-Net with improved performance for segmentation tasks.
[00255] FIGS. 49A-49C are diagrams showing example images from training datasets. The image in FIG. 49A was taken from the DRIVE dataset. The image in FIG. 49B was taken from the STARE dataset. The image in FIG. 49C was taken from the CHASE-DB1 dataset. FIG. 49A shows the original images. FIG. 49B shows the fields of view (FOV). FIG. 49C shows the target outputs.
[00256] Experiments were conducted using several different models including SegNet, U-Net, ResU-Net, RU-Net, and R2U-Net. These models are evaluated with different numbers of convolutional layers in the convolutional blocks, and the numbers of layers are determined with respect to time step t. Dataset details and network architectures along with the corresponding numbers of feature maps in different convolutional blocks are shown in Table 16 for Retina Blood Vessel Segmentation (RBVS), Skin Lesion Segmentation (SLS), and Lung Segmentation (LS).
Table 16: Dataset details
[00257] From the table, it can be seen in rows 2 and 4 that the numbers of feature maps in the convolutional blocks remain the same. However, as a convolutional layer is added in the convolutional block when t = 3, the number of network parameters increases. Feature fusion is performed with an element-wise addition operation in the different residual, recurrent, and recurrent residual units. In the encoding unit of the network, each convolutional block consists of two or three RCLs, where 3x3 convolutional kernels are applied, followed by ReLU activation layers and a batch normalization layer. For down-sampling, a 2x2 max-pooling layer followed by a 1x1 convolutional layer is used between the convolutional blocks. In the decoding unit, each block consists of a convolutional transpose layer followed by two convolutional layers and a concatenation layer. The concatenation operations are performed between the features in the encoding and decoding units of the network. The features are then mapped to a single output feature map where 1x1 convolutional kernels are used with a sigmoid activation function. Finally, the segmentation region is generated with a threshold (T) which is empirically set to 0.5 in the experiments.
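As a concrete illustration of the encoding/decoding wiring just described, the following sketch assembles a small R2U-Net-style model with the Keras functional API, reusing the rrcnn_block helper sketched earlier. The channel widths, patch size, and single-channel sigmoid output are assumptions chosen to match the 48x48 patches and the 1x1 convolution plus sigmoid described above; the exact filter counts of Table 16 are not reproduced.

```python
from tensorflow.keras import layers, Model, Input

def build_r2unet(input_shape=(48, 48, 1), depths=(16, 32, 64, 128), t=2):
    """Minimal R2U-Net-style encoder/decoder wiring (illustrative only)."""
    inputs = Input(shape=input_shape)
    skips, x = [], inputs
    # Encoding path: RRCNN block, then 2x2 max-pooling followed by a 1x1 convolution.
    for filters in depths[:-1]:
        x = rrcnn_block(x, filters, t=t)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(filters, 1, padding="same")(x)
    x = rrcnn_block(x, depths[-1], t=t)  # bottleneck block
    # Decoding path: transpose convolution, concatenation with encoder features, RRCNN block.
    for filters, skip in zip(reversed(depths[:-1]), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])
        x = rrcnn_block(x, filters, t=t)
    # 1x1 convolution with a sigmoid; the final mask is obtained by thresholding at T = 0.5.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```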
[00258] The architecture shown in the fourth row is used for retina blood vessel segmentation on the DRIVE dataset, as well as for skin cancer segmentation. Also, the SegNet model was implemented with a similar architecture and a similar number of feature maps for an impartial comparison in the cases of skin cancer lesion and lung segmentation. The architecture can be written as
1→32(3)→64(3)→128(3)→256(3)→512(3)→256(3)→128(3)→64(3)→32(3)→1 in the SegNet model for skin cancer lesion segmentation, where each convolutional block contains three convolutional layers and a batch normalization layer, which requires a total of 14.94M network parameters. For lung segmentation, the architecture can be written as 1→32(3)→64(3)→128(3)→256(3)→128(3)→64(3)→32(3)→1 for the SegNet model (three convolutional layers and a batch normalization layer are used in each block), which requires a total of 1.7M network parameters.
[00259] Experimental Setup and Results
[00260] To demonstrate the performance of the RU-Net and R2U-Net models, the models were tested on three different medical imaging datasets. These include blood vessel segmentations from retina images (DRIVE, STARE, and CHASE_DB1, shown in FIGS. 49A-49C), skin cancer lesion segmentations, and lung segmentations from 2D images. For this implementation, the Keras and TensorFlow frameworks are used on a single-GPU machine with 56 GB of RAM and an NVIDIA GeForce GTX-980 Ti with 6GB of memory.
[00261] Experimentation occurred on three different popular datasets for retina blood vessel segmentation: DRIVE, STARE, and CHASE_DB1. The DRIVE dataset consists of 40 color retina images in total, of which 20 samples are used for training and the remaining 20 samples are used for testing. The size of each original image is 565x584 pixels. To develop a square dataset, the images were cropped to only contain the data from columns 9 through 574, which makes each image 565x565 pixels. In this implementation, 190,000 randomly selected patches from 20 of the images in the DRIVE dataset were considered, where 171,000 patches were used for training and the remaining 19,000 patches were used for validation.
[00262] FIGS. 50A and 50B are diagrams showing example patches and corresponding outputs, respectively. The size of each patch is 48x48 for all three datasets shown in FIGS. 50A-50B. The second dataset, STARE, contains 20 color images, and each image has a size of 700x605 pixels. Due to the small number of samples in the STARE dataset, two approaches are often applied for training and testing when using this dataset. First, training is sometimes performed with randomly selected samples from all 20 images.
[00263] Another approach is the "leave-one-out" method, where in each trial one image is selected for testing, and training is conducted on the remaining 19 samples. Therefore, there is no overlap between training and testing samples. In this implementation, the "leave-one-out" approach can be used for the STARE dataset. The CHASE_DB1 dataset contains 28 color retina images, and the size of each image is 999x960 pixels. The images in this dataset were collected from both the left and right eyes of 14 school children. The dataset is divided into two sets where samples are selected randomly: a 20-sample set is used for training and the remaining 8 samples are used for testing. [00264] As the dimensionality of the input data in the STARE and CHASE_DB1 datasets is larger than that of the DRIVE dataset, 250,000 patches in total from 20 images were considered for both the STARE and CHASE_DB1 datasets. In this case, 225,000 patches are used for training and the remaining 25,000 patches are used for validation. Since the binary FOV (which is shown in FIG. 49B) is not available for the STARE and CHASE_DB1 datasets, FOV masks were generated using a technique similar to a previously published technique. One advantage of the patch-based approach is that the patches give the network access to local information about the pixels, which has an impact on the overall prediction. Furthermore, it ensures that the classes of the input data are balanced. The input patches are randomly sampled over an entire image, which also includes the region outside of the FOV.
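The random patch sampling described above might be sketched as follows. The function name, the default patch count, and the uniform sampling strategy (which, as stated, includes regions outside the FOV) are assumptions used for illustration rather than the exact patch generator used in the experiments.

```python
import numpy as np

def sample_patches(images, masks, n_patches=190_000, patch=48, rng=None):
    """Randomly sample square patches (and matching label patches) from full images.

    Images are assumed to be arrays of shape (N, H, W) or (N, H, W, C); patches are
    drawn uniformly over each image, including the region outside the FOV.
    """
    rng = rng or np.random.default_rng(0)
    n, h, w = images.shape[:3]
    xs, ys = [], []
    for _ in range(n_patches):
        i = rng.integers(n)
        r = rng.integers(h - patch + 1)
        c = rng.integers(w - patch + 1)
        xs.append(images[i, r:r + patch, c:c + patch])
        ys.append(masks[i, r:r + patch, c:c + patch])
    return np.stack(xs), np.stack(ys)

# Example split used for DRIVE: 171,000 patches for training, 19,000 for validation.
# x, y = sample_patches(train_images, train_masks)
# x_train, y_train = x[:171_000], y[:171_000]
# x_val, y_val = x[171_000:], y[171_000:]
```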
[00265] The Skin Cancer Segmentation dataset is taken from the Kaggle competition on skin lesion segmentation that occurred in 2016. This dataset contains 900 images along with associated ground truth samples for training. Another set of 379 images is provided for testing. The original size of each sample was 700x900, which was rescaled to 128x128 for this implementation. The training samples include the original images, as well as corresponding target binary images containing cancer or non-cancer lesions. The target pixels are set to a value of either 255 or 0, denoting pixels inside or outside the target lesion respectively.
[00266] Lung Segmentation
[00267] The Lung Nodule Analysis (LUNA-16) competition at the Kaggle Data Science Bowl in 2017 was held to find lung lesions in 2D and 3D CT images. This dataset consists of 267 2D samples in total, each containing a sample image and a label image displaying the correct lung segmentation [52]. For this study, 80% of the images were used for training, and the remaining 20% were used for testing. The original image size was 512x512; however, the images were resized to 256x256 pixels in this implementation.
[00268] Evaluation Metrics
[00269] For quantitative analysis of the experimental results, several performance metrics were considered, including accuracy (AC), sensitivity (SE), specificity (SP), F1-score, Dice coefficient (DC), and Jaccard index (JA). To do this, the variables True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) can be used. The overall accuracy AC is calculated using Equation 15, and sensitivity SE and specificity SP are calculated using Equations 16 and 17, respectively.
$$AC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (15)$$
$$SE = \frac{TP}{TP + FN} \qquad (16)$$
$$SP = \frac{TN}{TN + FP} \qquad (17)$$
[00270] Furthermore, DC and JA are calculated using the following Equations 18 and 19, respectively.
$$DC = \frac{2 \cdot TP}{2 \cdot TP + FN + FP} \qquad (18)$$
$$JA = \frac{TP}{TP + FN + FP} \qquad (19)$$
[00271] In addition, the Dice Index (DI) loss function and the Jaccard similarity score (JS) are represented using Equations 20 and 21. Here GT refers to the ground truth and SR refers to the segmentation result.
$$DI(GT, SR) = \frac{2\,|GT \cap SR|}{|GT| + |SR|} \qquad (20)$$
$$JS(GT, SR) = \frac{|GT \cap SR|}{|GT \cup SR|} \qquad (21)$$
[00272] The F1-score is calculated according to the following equation:
$$F1\text{-}score = 2 \times \frac{precision \times recall}{precision + recall} \qquad (22)$$
where the precision and recall are expressed as:
$$precision = \frac{TP}{TP + FP} \qquad (23)$$
$$recall = \frac{TP}{TP + FN} \qquad (24)$$
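Equations 15-19 and 22-24 reduce to simple arithmetic over the confusion counts, as in the following illustrative sketch (the dictionary return format is an arbitrary choice, not part of the disclosed method):

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Compute the metrics of Equations 15-19 and 22-24 from confusion counts."""
    ac = (tp + tn) / (tp + tn + fp + fn)          # accuracy, Eq. 15
    se = tp / (tp + fn)                           # sensitivity (recall), Eq. 16 / 24
    sp = tn / (tn + fp)                           # specificity, Eq. 17
    dc = 2 * tp / (2 * tp + fn + fp)              # Dice coefficient, Eq. 18
    ja = tp / (tp + fn + fp)                      # Jaccard index, Eq. 19
    precision = tp / (tp + fp)                    # Eq. 23
    f1 = 2 * precision * se / (precision + se)    # F1-score, Eq. 22
    return {"AC": ac, "SE": se, "SP": sp, "DC": dc, "JA": ja, "F1": f1}

# Example: counts obtained by comparing binarized predictions with ground truth
# print(segmentation_metrics(tp=900, tn=8000, fp=120, fn=80))
```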
[00273] The area under the curve (AUC) and the receiver operating characteristic (ROC) curve are common evaluation measures for medical image segmentation tasks. In this experiment, both analytical methods were utilized to evaluate the performance of the presently-disclosed approaches, and the results were compared to existing state-of-the-art techniques. FIGS. 51A and 51B are graphs showing examples of values for training and validation accuracy of the presently-disclosed RU-Net and R2U-Net models compared to the ResU-Net and U-Net models for 150 epochs.
[00274] Experimental results - Retina Blood Vessel Segmentation [00275] Due to the data scarcity of retina blood vessel segmentation datasets, the patch-based approach is used during the training and testing phases. A random initialization method, a Stochastic Gradient Descent (SGD) optimization approach with categorical cross entropy loss, a batch size of 32, and 150 epochs can be used in this implementation.
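A hypothetical Keras compile/fit call reflecting these settings is shown below; `model` is assumed to be a patch-based segmentation network (for example, the build_r2unet sketch above) and `x_train`/`y_train` the patch arrays. The categorical cross entropy loss stated above presumes a two-class (softmax) patch output; with a single-channel sigmoid output, binary cross entropy would be the drop-in equivalent.

```python
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(), loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=32, epochs=150)
```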
[00276] Results on DRIVE dataset
[00277] FIGS. 51A-51B show the training and validation accuracy values obtained when using the DRIVE dataset. The presently-disclosed R2U-Net and RU-Net models provide better performance during both the training and validation phases when compared to the U-Net and ResU-Net models. Quantitative results are achieved with the four different models using the DRIVE dataset, and the results are shown in Table 17. The overall accuracy and AUC are considered when comparing the performance of the presently-disclosed methods in most cases. The results achieved with the presently-disclosed models with 0.841M network parameters (Table 16, "RBVS+LS" row) are higher than those obtained when using the state-of-the-art approaches in most cases. However, to compare against the most recently disclosed method, a deeper R2U-Net is evaluated with 13.34M network parameters (Table 16, "SLS+RBVS" row), which shows the highest accuracy (0.9613) and a better AUC of 0.979. Most importantly, it can be observed that the presently-disclosed RU-Net and R2U-Net models provide better performance in terms of accuracy and AUC compared to the U-Net and ResU-Net models. The precise segmentation results achieved with the presently-disclosed R2U-Net model are shown in FIG. 52A.
[00278] STARE dataset
[00279] The quantitative results when using the STARE dataset, along with a comparison to existing methods, are shown in Table 17. In 2016, a cross-modality learning approach was used that reported an accuracy of approximately 0.9628 for the STARE dataset, which was previously the highest recorded result. More recently, in 2018, a method using a Weighted Symmetry Filter (WSF) showed an accuracy of 0.9570. The "leave-one-out" method can be used, and the average results of five different trials can be reported. An accuracy of 0.9712 was achieved with the R2U-Net model for the STARE dataset, which is 0.0084 and 0.0142 better than the results obtained when using publicly available methods. In addition, the RU-Net and R2U-Net models outperformed the U-Net and ResU-Net models in this experiment. The qualitative results of R2U-Net when using the STARE dataset are shown in FIG. 52B. [00280] FIGS. 52A-52C are diagrams showing examples of experimental outputs for three different datasets for retina blood vessel segmentation using R2U-Net. FIG. 52A shows input images in gray scale. FIG. 52B shows the ground truth. FIG. 52C shows the experimental outputs. The images correspond to the DRIVE, STARE, and CHASE_DB1 datasets, respectively.
[00281] Table 17 shows experimental results of the presently-disclosed approaches for retina blood vessel segmentation and a comparison against other traditional and deep learning-based approaches.
Table 17: Experimental results of the presently-disclosed approaches for retina blood vessel segmentation
[00282] CHASE_DB1 dataset
[00283] For quantitative analysis, the results are given in Table 17. From Table 17, it can be seen that the RU-Net and R2U-Net models provide better performance than the U-Net and ResU-Net models when applying the CHASE_DB1 dataset. In addition, the presently-disclosed methods are compared against recently used approaches for blood vessel segmentation using the CHASE_DB1 dataset. In 2016, an approach using cross-modality learning achieved an accuracy of 0.9581. However, an accuracy of approximately 0.9634 was achieved with the R2U-Net model, which is about a 0.0053 improvement compared to other approaches. The precise segmentation results with the presently-disclosed R2U-Net model on the CHASE_DB1 dataset are shown in FIG. 52C.
[00284] FIG. 53 is a diagram showing examples of AUC values for retina blood vessel segmentation for the best performance achieved with R2U-Net on three different datasets. The ROC curves for the highest AUCs for the R2U-Net model (with 1.07M network parameters) on each of the three retina blood vessel segmentation datasets are shown in FIG. 53.
[00285] Skin Cancer Lesion Segmentation.
[00286] In this implementation, this dataset is preprocessed with mean subtraction and normalized according to the standard deviation. The ADAM optimization technique with a learning rate of 2x10^-4 and binary cross entropy loss was used. In addition, MSE was calculated during the training and validation phases. In this case, 10% of the samples are used for validation during training, with a batch size of 32 and 150 epochs. The training accuracy of the presently-disclosed R2U-Net and RU-Net models was compared with that of the ResU-Net and U-Net models for an end-to-end image-based segmentation approach. The training and validation accuracy for all four models are shown in FIGS. 54A-54B. [00287] FIGS. 54A and 54B are diagrams showing example values for training and validation accuracy, respectively, of R2U-Net, RU-Net, ResU-Net, and U-Net for skin lesion segmentation. In both cases, the presently-disclosed RU-Net and R2U-Net models show better performance when compared with the equivalent U-Net and ResU-Net models. This clearly demonstrates the robustness of the learning phase of the presently-disclosed models for end-to-end image-based segmentation tasks. The quantitative results of this experiment were compared against existing methods as shown in Table 18. Table 18 shows experimental results of the presently-disclosed approaches for skin cancer lesion segmentation and a comparison against other traditional and deep learning-based approaches.
Table 18: Experimental results of the presently-disclosed approaches for skin cancer lesion segmentation
[00288] The presently-disclosed RU-Net and R2U-Net models were evaluated with respect to the time step t = 2 in the RCL unit. The time step value t = 2 means that the RCL unit consists of one forward convolution followed by two RCLs. The presently-disclosed approaches were compared against recently published results using performance metrics including sensitivity, specificity, accuracy, AUC, and DC. The presently-disclosed R2U-Net model provides a testing accuracy of 0.9472 with a higher AUC, which is 0.9430. Furthermore, the JA and DC are calculated for all models, and the R2U-Net model provides 0.9278 for the JA and 0.9627 for the DC for skin lesion segmentation. Although the results are in the third position in terms of accuracy compared to ISIC-2016 (highest) and FCRN-50 (second highest), the presently-disclosed R2U-Net models show better performance in terms of the DC and JA. These results were achieved with an R2U-Net model with 34 layers that contains approximately 13.34M network parameters. The architecture details are shown in Table 16. However, the accuracy of the presently-disclosed RU-Net and R2U-Net models is still higher when compared to the FCRN-38 networks. In addition, the work presented was evaluated against the VGG-16 and Inception-V3 models for skin cancer lesion segmentation. These models contain approximately 138M and 23M network parameters respectively. Furthermore, the RU-Net and R2U-Net models show higher accuracy and AUC compared to the VGG-16 and GoogleNet models. In most cases, the RU-Net and R2U-Net models show better performance against equivalent SegNet, U-Net, and ResU-Net models for skin lesion segmentation. Some qualitative outputs of the SegNet, U-Net, and R2U-Net models for skin cancer lesion segmentation are shown for visual comparison in FIG. 55.
[00289] FIG. 55 is a diagram illustrating a qualitative assessment of the presently-disclosed R2U-Net for the skin cancer segmentation task. The first column shows the input sample, the second column shows the ground truth, the third column shows the outputs from the SegNet model, the fourth column shows the outputs from the U-Net model, and the fifth column shows the results of the presently-disclosed R2U-Net model. In most cases, the target lesions are segmented accurately with a shape similar to the ground truth. However, if one closely observes the outputs in the first, second, and fourth rows of images in FIG. 55, it can be clearly seen that the presently-disclosed R2U-Net model provides an output shape closer to the ground truth when compared to the outputs of the SegNet and U-Net models. If one observes the third row of images in FIG. 55, it can be clearly seen that the input image contains three lesions. One is a target lesion, and the other, brighter lesions are not targets. The R2U-Net model segments the desired part of the image more accurately when compared to the SegNet and U-Net models. Finally, the fifth row clearly demonstrates that the R2U-Net model provides a shape very similar to the ground truth, which is a much better representation than those obtained from the SegNet and U-Net models. Thus, it can be stated that the R2U-Net model is more capable and robust for skin cancer lesion segmentation. [00290] Lung Segmentation
[00291] Lung segmentation is very important for analyzing lung-related diseases, and it can be applied to lung cancer segmentation and lung pattern classification for identifying other problems. In this experiment, the ADAM optimizer is used with a learning rate of 2x10^-4. The DI loss function according to Equation 20 was used. In this case, 10% of the samples were used for validation, with a batch size of 16 for 150 epochs.
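A minimal TensorFlow/Keras sketch of the DI loss of Equation 20 is shown below. The smoothing constant is an assumption added for numerical stability and is not stated in the original description.

```python
import tensorflow as tf

def dice_coef(y_true, y_pred, smooth=1.0):
    """Soft Dice index of Equation 20 computed on flattened masks."""
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    """DI loss used for lung segmentation: 1 - DI(GT, SR)."""
    return 1.0 - dice_coef(y_true, y_pred)

# Example compile call with the stated ADAM settings:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4), loss=dice_loss)
```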
[00292] Table 19 shows a summary of how well the presently-disclosed models performed against the equivalent SegNet, U-Net, and ResU-Net models. Specifically, Table 19 shows experimental results of the presently-disclosed RU-Net and R2U-Net approaches for lung segmentation and a comparison against the SegNet, U-Net, and ResU-Net models for t = 2 and t = 3.
Table 19: Experimental results of the presently-disclosed RU-Net and R2U-Net approaches
[00293] In terms of accuracy, the presently-disclosed R2U-Net model showed 0.26 and 0.55 percent better testing accuracy compared to the equivalent SegNet and U-Net models respectively. In addition, the R2U-Net model provided 0.18 percent better accuracy against the ResU-Net model with the same number of network parameters.
[00294] Qualitative results of the outputs of the SegNet, U-Net, and R2U-Net models are shown in FIG. 56. FIG. 56 is a diagram showing experimental results for lung segmentation. The first column shows the inputs, the second column shows the ground truth, the third column shows the outputs of SegNet, the fourth column shows the outputs of U-Net, and the fifth column shows the outputs of R2U-Net. It can be seen that the R2U-Net shows better segmentation results with internal details that are very similar to those displayed in the ground truth. If one observes the input, ground truth, and output of the different approaches in the first and second rows, the outputs of the presently-disclosed approaches show better segmentation with more accurate internal details. In the third row, the R2U-Net model clearly defines the inside hole in the left lung, whereas the SegNet and U-Net models do not capture this detail. The last row of images in FIG. 56 shows that the SegNet and U-Net models provide outputs that incorrectly capture parts of the image that are outside of the lesion. On the contrary, the R2U-Net model provides a much more accurate segmentation result.
[00295] Many models struggle to define the class boundary properly during segmentation tasks. The outputs in FIG. 56 are provided as heat maps which show the sharpness of the segmentation borders. These outputs show that the ground truth tends to have a sharper boundary when compared to the model outputs. FIG. 57 is a graph showing example values on an ROC curve for lung segmentation for four different models where t = 3. The highest AUC achieved is that of the presently-disclosed R2U-Net model.
[00296] In this implementation, evaluation occurred for both of the presently-disclosed models for patch-based modeling of retina blood vessel segmentation, as well as end-to-end image-based methods for skin and lung lesion segmentation. In both cases, the presently- disclosed models outperform existing state-of-the-art methods including SegNet, U-Net, ResU- Net and FCRN-38 in terms of AUC and accuracy on all three datasets. Thus, the quantitative and qualitative results clearly demonstrate the effectiveness of the presently-disclosed approach for segmentation tasks.
[00297] Analysis of Trade-off Between Number of Training Samples and Accuracy
[00298] To further investigate the performance of the presently-disclosed R2U-Net model, the trade-off between the number of training samples versus performance is investigated for the lung segmentation dataset. The U-Net and R2U-Net models with t = 3 were considered, and these models contained 1.07M network parameters. In the case of SegNet, a similar architecture with 1.7M network parameters was considered. At the beginning of the experiment, the entire dataset was divided into two sets where 80% of the samples were used for training and validation, and the remaining 20% of the samples were used for testing during each trial. During this experiment, different split ratios of [0.9, 0.7, 0.5, 0.3, and 0.1] were used, where the number of training samples was increased, and the number of validation samples was decreased, for each successive trial. For example, a split ratio of 0.9 means that only 10 percent of the samples are used for training and the remaining 90% of the samples are used for validation. Likewise, a split ratio of 0.7 means that only 30% of the samples are used for training and the remaining 70% of the samples are used for validation.
[00299] FIGS. 58A-58B are graphs showing examples of values for the performance of three different models (SegNet, U-Net, and R2U-Net) for different numbers of training and validation samples. FIG. 58A shows the training DI coefficient errors (1-DI). FIG. 58B shows validation DI coefficient errors for five different trials. FIGS. 58A-58B show the training and validation DI coefficient errors (1-DI) with respect to the number of training and validation samples. In each trial, 150 epochs were considered, and the errors presented are the average training and validation errors of the last twenty epochs.
[00300] FIGS. 58A-58B show that the presently-disclosed R2U-Net model shows the lowest training and validation error for all of the tested split ratios, except for the result where the split ratio is equal to 0.5 for the validation case. In this case, the error for the R2U-Net model is only slightly greater than that of the U-Net model. These results clearly demonstrate that the R2U-Net model is a more capable tool when used to extract, represent, and learn features during the training phase, which ultimately helps to ensure better performance. FIG. 59 is a diagram showing examples of testing errors of the R2U-Net, SegNet, and U-Net models for different split ratios for the lung segmentation application. In each trial, the models were tested with the remaining 20% of the samples. The R2U-Net model shows the lowest error for almost all trials relative to the error obtained from the SegNet and U-Net models.
[00301] Network Parameters Versus Accuracy
[00302] In experiments, the U-Net, ResU-Net, RU-Net, and R2U-Net models were utilized with the following architecture: 1→16→32→64→128→64→32→16→1 for retina blood vessel segmentation and lung segmentation. In the case of retina blood vessel segmentation, a time step of t = 2 was used. This same architecture was tested for lung lesion segmentation for both t = 2 and t = 3. Even though the number of network parameters slightly increased with respect to the time step in the recurrent convolution layer, improved performance was still observed, as seen in the last rows of Table 19. Furthermore, equivalent SegNet models were implemented which required 1.73M and 14.94M network parameters respectively. For skin cancer lesion and lung segmentation, the presently-disclosed models show better performance against SegNet when using both 1.07M and 13.34M network parameters, which is around 0.7M and 2.66M fewer when compared to SegNet. Thus, it can be stated that the model provides better performance with the same or a fewer number of network parameters compared to the SegNet, U-Net, and ResU-Net models. Thus, the model possesses significant advantages in terms of memory and processing time.
[00303] Computational time
[00304] The computational time to segment each sample in the testing phase is shown in Table 20 for all three datasets. The processing times during the testing phase for the STARE, CHASE_DB1, and DRIVE datasets were 6.42, 8.66, and 2.84 seconds per sample respectively. It can take around 90 seconds on average to segment an entire image (which is equivalent to a few thousand image patches). Alternatively, the presently-disclosed R2U-Net approach takes around 6 seconds per sample, which is an acceptable rate in a clinical use scenario. In addition, when executing skin cancer segmentation and lung segmentation, entire images could be segmented in 0.32 and 1.145 seconds respectively.
Table 20: Computational time for processing an entire image during the testing phase
[00305] In summary, the present disclosure includes an extension of the U-Net architecture using Recurrent Convolutional Neural Networks and Recurrent Residual Convolutional Neural Networks. The presently-disclosed models are called "RU-Net" and "R2U-Net" respectively. These models were evaluated using three different applications in the field of medical imaging including retina blood vessel segmentation, skin cancer lesion segmentation, and lung segmentation. The experimental results demonstrate that the presently-disclosed RU-Net and R2U-Net models show better performance in most cases for segmentation tasks with the same number of network parameters when compared to existing methods, including the SegNet, U-Net, and residual U-Net (or ResU-Net) models, on all three datasets. The quantitative and qualitative results, as well as the trade-off between the number of training samples versus performance, show that the presently-disclosed RU-Net and R2U-Net models are more capable of learning during training, which ultimately shows better testing performance. In some implementations, the same architecture can be used with a different feature fusion strategy in the encoding and decoding units.
IV - Computer Vision and Image Understanding
[00306] FIG. 60 is a diagram showing examples of Recurrent Multilayer Perceptron (RMLP), Convolutional Neural Network (CNN), and Recurrent Convolutional Neural Network (RCNN) models. In FIG. 60, the RMLP is on the left side, the CNN model is in the middle, and the RCNN is on the right side.
[00307] The present disclosure provides a deep learning architecture which combines two recently developed models: a revised version of the Inception network and the RCNN. In the presently-disclosed IRCNN model, the recurrent convolutional layers are incorporated within the inception block, and the convolution operations are performed considering different time steps.
[00308] FIG. 61 is a diagram showing an overall operational flow diagram of the presently-disclosed Inception Recurrent Convolutional Neural Network (IRCNN). The IRCNN includes an IRCNN block, a transition block, and a softmax layer.
[00309] FIG. 62 is a diagram showing an Inception-Recurrent Convolutional Neural Network (IRCNN) block with different convolutional layers with respect to different kernel sizes. The upper part of FIG. 62 represents the Inception layers for different kernels and the pooling operation with recurrent layers, and the lower part shows the internal details of the recurrent convolutional operation with respect to time (t = 3). Finally, the outputs are concatenated for the following layer.
[00310] The presently-disclosed inception block with recurrent convolution layers is shown in FIG. 62. A goal of the DCNN architectures of the Inception and Residual networks is to implement large-scale deep networks. As the model becomes larger and deeper, the computational parameters of the architecture increase dramatically. Thus, the model becomes more complex to train and computationally expensive. In this scenario, the recurrent property of the present disclosure ensures better training and testing accuracy with fewer or equal computational parameters. [00311] According to the present disclosure, a deep learning model called IRCNN is used with a combination of the recently developed Inception-v4 and RCNN. The present disclosure provides: experimental analysis of the presently-disclosed learning model's performance against different DCNN architectures on different benchmark datasets such as MNIST, CIFAR-10, CIFAR-100, SVHN, and TinyImageNet-200; empirical evaluation of the impact of the recurrent convolutional layer (RCL) in the Inception network; and empirical investigation of the impact of RCLs on densely connected neural networks, namely DenseNet.
[00312] Inception-Recurrent Convolutional Neural Networks
[00313] The presently-disclosed architecture (IRCNN) is based on several recently developed deep learning architectures, including Inception networks and RCNNs. It tries to reduce the number of computational parameters while providing better recognition accuracy. As shown in FIG. 61, the IRCNN architecture consists of general convolution layers, IRCNN blocks, transition blocks, and a softmax layer at the end. The presently-disclosed architecture provides recurrence in the Inception module, as shown in the IRCNN block in FIG. 62. A feature of Inception-v4 is that it concatenates the outputs of multiple differently sized convolutional kernels in the inception block. Inception-v4 is a simplified version of Inception-v3, using lower-rank filters and pooling layers. Inception-v4, however, combines residual concepts with Inception networks to improve the overall accuracy over Inception-v3. The outputs of the inception layers are added with the inputs to the Inception Residual module. The present disclosure utilizes the inception concepts from Inception-v4.
[00314] IRCNN block
[00315] The IRCNN block performs recurrent convolution operations with different sized kernels (see FIG. 61). In the recurrent structure, the inputs to the next time step are the sum of the convolutional outputs of the present time step and previous time steps. The same operations are repeated based on the number of time steps that are considered. As the input and output dimensions do not change, this results in an accumulation of feature maps with respect to the time steps considered. This helps to strengthen the extraction of the target features. As shown in FIG. 62, one of the paths of the inception block contains an average pooling operation that is applied before the recurrent convolution layer. In this particular pooling layer, a 3x3 average pooling with stride 1x1 is applied while keeping the border size the same, resulting in output samples with the same dimensions as the inputs. The overlapping average pooling technique helps in the regularization of the network. The operations of each Recurrent Convolution Layer (RCL) in the IRCNN block are similar to operations by others in the field. To describe these operations, consider a vectorized patch centered at (i, j) of an input sample on the k-th feature map in the RCL unit. O^l_{ijk}(t) refers to the output of the l-th layer at time step t. The output can be expressed as:
$$O^{l}_{ijk}(t) = \big(w^{f}_{k}\big)^{T} * x^{f(i,j)}_{l}(t) + \big(w^{r}_{k}\big)^{T} * x^{r(i,j)}_{l}(t-1) + b_{k} \qquad (25)$$
[00316] Here, x^{f(i,j)}_l(t) and x^{r(i,j)}_l(t-1) are the inputs for a standard convolutional layer and an RCL respectively. w^f_k and w^r_k are the weights for the standard convolutional layer and the RCL respectively, and b_k is the bias. The output of the RCL then passes through an activation function. Therefore, the equation for the l-th layer at time step t is:
$$y^{l}_{ijk}(t) = f\big(O^{l}_{ijk}(t)\big) = \max\big(0,\; O^{l}_{ijk}(t)\big) \qquad (26)$$
where f is the standard Rectified Linear Unit (ReLU) activation function. The Local Response Normalization (LRN) function is applied to the outputs of the IRCNN-block:
$$z = \mathrm{norm}\big(y^{l}_{ijk}(t)\big) \qquad (27)$$
[00317] The outputs of the IRCNN block with respect to the different kernel sizes 1x1 and 3x3, and the average pooling operation followed by a 1x1 convolution, are defined as z_{1x1}(x), z_{3x3}(x), and z^p_{1x1}(x) respectively. The final output z_out of the IRCNN-block can be expressed as:
$$z_{out} = \odot\big(z_{1\times1}(x),\; z_{3\times3}(x),\; z^{p}_{1\times1}(x)\big) \qquad (28)$$
[00318] Here ⊙ represents the concatenation operation with respect to the channel axis on the output samples of the inception layers. In this implementation, t = 3 is used, which indicates that one forward and three recurrent convolutional layers have been used in each IRCNN-block (individual path), which is clearly demonstrated in FIG. 62. The outputs of the IRCNN-block become the inputs that are fed into the transition block.
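The IRCNN block of FIG. 62 and Equations 25-28 can be sketched with Keras as follows. The path widths, the dropout placement, and the helper names are assumptions made for illustration, and local response normalization (Equation 27) is omitted from this sketch.

```python
from tensorflow.keras import layers

def recurrent_conv_path(x, filters, kernel, t=3):
    """One forward convolution followed by t recurrent steps (Equations 25-26)."""
    forward_conv = layers.Conv2D(filters, kernel, padding="same")
    recurrent_conv = layers.Conv2D(filters, kernel, padding="same")  # weights shared over time
    forward = forward_conv(x)
    state = forward
    for _ in range(t):
        state = layers.Activation("relu")(layers.Add()([forward, recurrent_conv(state)]))
        state = layers.Dropout(0.5)(state)  # dropout after each convolutional layer in the block
    return state

def ircnn_block(x, filters, t=3):
    """Inception-style block whose paths use recurrent convolutions (Equation 28)."""
    p1 = recurrent_conv_path(x, filters, kernel=1, t=t)            # 1x1 path
    p3 = recurrent_conv_path(x, filters, kernel=3, t=t)            # 3x3 path
    pooled = layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(x)
    pp = recurrent_conv_path(pooled, filters, kernel=1, t=t)       # average-pool then 1x1 path
    return layers.concatenate([p1, p3, pp])                        # channel-wise concatenation
```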
[00319] Transition block
[00320] In the transition block, three operations (convolution, pooling, and dropout) are performed depending upon the placement of the block in the network. According to FIG. 61, all of the operations have been applied in the very first transition block. In the second transition block, only convolution with dropout operations has been used. The third transition block consists of convolution, global average pooling, and dropout layers. The global average pooling layer is used as an alternative to a fully connected layer. There are several advantages of a global average pooling layer. Firstly, it is very close in operation to convolution, hence enforcing correspondence between feature maps and categories. The feature maps can be easily interpreted as class confidence. Secondly, it does not need computational parameters, thus helping to avoid over-fitting of the network. Late use of the pooling layer is advantageous because it increases the number of non-linear hidden layers in the network. Therefore, only two special pooling layers have been applied, in the first and third transition blocks of this architecture. Special pooling is carried out with the max-pooling layer in this network (not all transition blocks have a pooling layer). The max-pooling layers perform operations with a 3x3 patch and a 2x2 stride over the input samples. Since the non-overlapping max-pooling operation has a negative impact on model regularization, overlapping max-pooling is used for regularizing the network. This can facilitate training a deep network architecture. Eventually, a global average pooling layer is used as an alternative to fully connected layers. Finally, a softmax logistic regression layer is used at the end of the IRCNN architecture.
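A sketch of the transition block is shown below. The filter count, the dropout rate, and the placement of the activation are assumptions; the `pool`/`last` flags reflect the description that only the first transition block max-pools and only the last one uses global average pooling.

```python
from tensorflow.keras import layers

def transition_block(x, filters, pool=False, last=False, dropout=0.5):
    """Transition block: convolution, optional pooling, and dropout."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    if last:
        # Global average pooling as an alternative to a fully connected layer.
        x = layers.GlobalAveragePooling2D()(x)
    elif pool:
        # Overlapping max-pooling: 3x3 window with a 2x2 stride.
        x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
    return layers.Dropout(dropout)(x)
```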
[00321] Optimization of network parameters
[00322] To keep the number of computational parameters low compared to other traditional DCNN approaches, only 1x1 and 3x3 convolutional filters have been used in this implementation. There are benefits to using smaller sized kernels, such as the ability to incorporate more non-linearity in the network. For example, a stack of two 3x3 receptive fields (without placing any pooling layer in between) can be used as a replacement for one 5x5, and a stack of three 3x3 receptive fields can be used instead of a 7x7 kernel size. The benefit of adding a 1x1 filter is that it helps to increase the non-linearity of the decision function without having any impact on the receptive field of the convolution layer. Since the size of the input and output features does not change in the IRCNN blocks, it is just a linear projection on the same dimension with non-linearity added using a ReLU. A dropout of 0.5 has been used after each convolutional layer in the IRCNN-block.
[00323] Finally, a softmax, or normalized exponential function, layer has been used at the end of the architecture. For an input sample x, a weight vector W, and K distinct linear functions, the softmax operation can be defined for the i-th class as follows:
$$P(y = i \mid x) = \frac{e^{x^{T} w_{i}}}{\sum_{k=1}^{K} e^{x^{T} w_{k}}} \qquad (29)$$
[00324] Network Architectures
[00325] Experiments were conducted with different models, including the RCNN, EIN (which has the same structure as the Inception-v3 network), EIRN (which is a small-scale implementation of the Inception-v4 model), and the presently-disclosed IRCNN. These models are evaluated with different numbers of convolutional layers in the convolutional blocks, while the number of layers is determined with respect to the time step t. In these implementations, t = 2 has been used, indicating that the RCL block contains one forward convolution followed by two RCLs, and t = 3 is for one forward convolution with three RCLs. To experiment with the MNIST dataset, a model has been used that consists of two forward convolutional layers at the beginning, two IRCNN blocks each followed by a transition block, and a softmax layer at the end. In the case of the CIFAR-10, CIFAR-100, and SVHN datasets, an architecture has been used with two convolutional layers at the beginning, three IRCNN blocks each followed by a transition block, a dense layer, and a softmax layer at the end of the model. For this model, 16 and 32 feature maps have been considered for the first convolutional layers, and 64, 128, and 256 feature maps are used in the first, second, and third IRCNN blocks respectively. The overall model diagram is shown in FIG. 61. This model contains around 3.2 million (M) network parameters.
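Combining the ircnn_block and transition_block sketches gives an illustrative assembly of the CIFAR-style model described above. The dense-layer width and the choice of which transition blocks pool follow the description where it is explicit and are otherwise assumptions.

```python
from tensorflow.keras import layers, Model, Input

def build_ircnn(input_shape=(32, 32, 3), num_classes=10, t=3):
    """Illustrative assembly of the CIFAR-style IRCNN from the sketches above."""
    inputs = Input(shape=input_shape)
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    # First transition block pools, second does not, third uses global average pooling.
    for filters, pool, last in [(64, True, False), (128, False, False), (256, False, True)]:
        x = ircnn_block(x, filters, t=t)
        x = transition_block(x, filters, pool=pool, last=last)
    x = layers.Dense(512, activation="relu")(x)   # dense layer before the classifier (width assumed)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)
```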
[00326] A model was used with four IRCNN blocks followed by transition layers, a fully connected layer, and a softmax layer for the experiment on the TinyImageNet-200 dataset. In addition, the number of feature maps in each of the forward convolution layers and RCLs in the IRCNN blocks has almost doubled compared to the model that is used for CIFAR-100, which significantly increases the number of network parameters to approximately 9.3M. The EIN and EIRN models are implemented with the same structure as the IRCNN model, with inception and inception-residual modules respectively. For conducting the experiment on the TinyImageNet-200 dataset, Batch Normalization (BN) is used instead of LRN in the IRCNN, RCNN, EIN, and EIRN models. Further, Equation 27 has been skipped, and the concatenation operation is performed directly on the output of Equation 26. In this case, BN is applied at the end of the IRCNN block on z_out. Furthermore, the impact of RCLs on the DenseNet model has been empirically investigated. In this implementation, the RCL layers (where t = 2 refers to one forward convolution and two RCLs) have been incorporated as a replacement for the forward convolutional layers within the dense block, as sketched below. A BN layer is used in the dense block with RCLs. Only 4 dense blocks have been used, with 4 layers in each block and a growth rate of 6. The experimental results show significant improvement in training, validation, and testing accuracies for DenseNet with RCLs compared to the original DenseNet model.
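The substitution of RCLs into a dense block might look roughly as follows, reusing the recurrent_conv_path helper from the IRCNN sketch (its internal dropout is an assumption carried over from that sketch rather than part of the DenseNet description):

```python
from tensorflow.keras import layers

def dense_block_with_rcl(x, num_layers=4, growth_rate=6, t=2):
    """Dense block in which each forward convolution is replaced by an RCL.

    Mirrors the description above: 4 layers per block, growth rate 6, t = 2
    (one forward convolution and two recurrent steps), with batch
    normalization inside the block.
    """
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = recurrent_conv_path(y, growth_rate, kernel=3, t=t)  # RCL instead of a plain Conv2D
        x = layers.concatenate([x, y])                          # dense connectivity
    return x
```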
[00327] Experiments
[00328] A set of experiments was used to evaluate the presently-disclosed IRCNN method on different benchmark datasets: MNIST, CIFAR-10, CIFAR-100, SVHN, and TinyImageNet-200, and the method was compared against different equivalent models. The entire experiment has been conducted in a Linux environment with Keras and Theano as the backend, running on a single-GPU machine with an NVIDIA GeForce GTX 980 Ti.
[00329] Training Methodology
[00330] The first experiment trained the presently-disclosed IRCNN architecture using the stochastic gradient descent (SGD) optimization function with the default initialization for deep networks found in Keras. The Nesterov momentum is set to 0.9 and the decay to 9.99e-7. Second, an experiment was performed with the presently-disclosed approach using the Layer-sequential unit-variance (LSUV) technique, which is a simple method for weight initialization in a deep neural network. An improved version of the optimization function based on Adam, known as EVE, was also used. The following parameters are used for the EVE optimization function: the value of the learning rate is 1e-4, the decay is 1e-4, β1 = 0.9, β2 = 0.9, β3 = 0.9, k = 0.1, K = 10, and ε = 1e-08. The β1, β2 ∈ [0, 1) values are exponential decay rates for moment estimation in Adam. The β3 ∈ [0, 1) value is an exponential decay rate for computing relative changes. The k and K values are lower and upper thresholds for relative change, and ε is a fuzz factor. It should be noted that the l2-norm was used with a value of 0.002 for weight regularization on each convolutional layer in the IRCNN block.
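The first training setup can be sketched with the Keras SGD optimizer as shown below. The initial learning rate is the Keras default and is an assumption; the 9.99e-7 decay is noted only in a comment because the way decay is passed depends on the Keras version, and the EVE optimizer is not part of standard Keras.

```python
from tensorflow.keras.optimizers import SGD

# Nesterov-momentum SGD as described above (the decay of 9.99e-7 is applied
# via the optimizer's decay setting in the Keras version used; omitted here).
optimizer = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=128, epochs=350)
```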
[00331] In both experiments, ReLU activation functions were used. The network was generalized with a dropout (0.5). Only the horizontal flipping technique was applied when performing data augmentation. During the training of MNIST and SVHN, 200 epochs were used with a mini-batch size of 128. The models were trained for 350 epochs with a batch size of 128 for CIFAR-10 and CIFAR-100. For an impartial comparison, training and testing were performed against equivalent Inception networks and Inception residual networks (meaning the networks contained the same number of layers and computational parameters). [00332] Results
[00333] MNIST
[00334] MNIST is one of the most popular datasets for handwritten digits from 0-9; the dataset contains 28x28 pixel grayscale images with 60,000 training examples and 10,000 testing examples. For this experiment, the presently-disclosed model was trained with two IRCNN blocks (IRCNN-block 1 and IRCNN-block 2) and the ReLU activation function was used. The model was trained with 60,000 samples and 10,000 samples were used for the validation set. Eventually the trained network was tested with 10,000 testing examples. A test error of 0.32% was attained with the IRCNN and SGD, and about 0.29% error is achieved for the IRCNN when initializing with LSUV and using the EVE optimization function. The summary of the classification accuracies is given in Table 21. In Table 21, the "+" notation indicates standard data augmentation using random horizontal flipping. IRCNN achieves lower testing errors in most of the cases, indicated in bold. No data augmentation techniques have been applied in this experiment on MNIST. On the contrary, global contrast normalization and ZCA whitening were applied in the experiments using most of the mentioned models in Table 21.
Table 21: Testing errors (%) on MNIST, CIFAR-10, CIFAR-100, and SVHN
[00335] CIFAR-10
[00336] CIFAR-10 is an object classification benchmark consisting of 32x32 color images representing 10 classes. It is split into 50,000 samples for training and 10,000 samples for testing. The experiment was conducted with and without data augmentation. The entire experiment was conducted on models similar to the one shown in FIG. 61. Using the presently-disclosed approach, about 8.41% error was achieved without data augmentation, and 7.37% error was achieved with data augmentation using the SGD technique. These results are better than those of most of the recognized DCNN models stated in Table 21. Better performance is observed from the IRCNN with LSUV as the initialization approach and EVE as the optimization technique. The results show around 8.17% and 7.11% error without and with data augmentation respectively. When comparing these results to those of the different models in Table 21, it can be observed that the presently-disclosed approach provides better accuracy compared to various advanced and hybrid models. FIG. 63 is a graph showing example values for training and validation loss for the experiment on CIFAR-10. FIG. 64 is a graph showing examples of values for training and validation accuracy of IRCNN with SGD and LSUV+EVE.
[00337] CIFAR-100
[00338] This is another benchmark for object classification from the same group. The dataset contains 60,000 (50,000 for training and 10,000 for testing) color 32x32 images, and it has 100 classes. SGD and LSUV were used as the initialization approaches with the EVE optimization technique in this experiment. The experimental results are shown in Table 21. In both cases, the presently-disclosed technique shows state-of-the-art accuracy compared with different DCNN models. IRCNN+SGD shows about 34.13% classification error without data augmentation and 31.22% classification error with data augmentation. In addition, this model achieved around 30.87% and only 28.24% errors with SGD and LSUV+EVE on the augmented dataset. This is the highest accuracy achieved in any of the deep learning models summarized in Table 21. For augmented datasets, a 71.76% recognition accuracy was achieved with LSUV+EVE, which is about a 3.51% improvement compared to RCNN.
[00339] FIG. 65 is a graph showing examples of values for the training and validation loss of the IRCNN for both experiments using the CIFAR-100 dataset and data augmentation (with and without the initialization and optimization techniques). In the first experiment, the IRCNN was used with the LSUV initialization approach and the EVE optimization function. The default initialization approach of Keras and the SGD optimization method are used in the second experiment. It is clearly shown that the presently-disclosed model has lower error rates in both experiments, showing the effectiveness of the presently-disclosed IRCNN learning model. FIG. 66 is a graph showing example values for the training and testing accuracy of the IRCNN with LSUV and EVE.
[00340] Street View House Numbers (SVHN) [00341] SVHN is one of the most challenging datasets for street view house number recognition. This dataset contains color images representing house numbers from Google Street View. This experiment considered the second version, which consists of 32x32 color examples. There are 73,257 samples in the training set and 26,032 samples in the testing set. In addition, this dataset has 531,131 extra samples that are used for training purposes. As single input samples of this dataset contain multiple digits, the main goal is to classify the central digit. Due to the huge variation of color and brightness, this dataset is much more difficult to classify compared to the MNIST dataset. In this case, experimentation occurred with the same model as is used for CIFAR-10 and CIFAR-100. The same preprocessing steps applied in the experiments of RCNN were used. The experimental results show better recognition accuracy, as shown in Table 21. Around 1.89% testing error with IRCNN+SGD and 1.74% error with IRCNN+LSUV+EVE were obtained. It is noted that Local Contrast Normalization (LCN) is applied during the experiments of MaxOut, NiN, DSN, and DropConnect. The reported results of the CNN with drop connection are based on the average performance of five networks.
[00342] Impact of Recurrent Layers
[00343] The presently-disclosed architecture also performs well when compared to other recently proposed optimized architectures. A DCNN architecture called FitNet4 conducted experiments with the LSUV initialization approach, and it only achieved 70.04% classification accuracy with data augmentation using mirroring and random shifts for CIFAR-100. On the other hand, only random horizontal flipping was applied for data augmentation in this implementation, and about 1.72% better recognition accuracy was achieved against FitNet4. For an impartial comparison with the EIN and EIRN models (the same architecture but smaller versions of the Inception-v3 and Inception-v4 models), the Inception network has been implemented with the same number of layers and parameters in the transition and Inception blocks. Instead of using recurrent connectivity in the convolutional layers, sequential convolutional layers were used for the same time step with the same kernels. During the implementation of EIRN, a residual connection was only added in the Inception-Residual block, where the inputs of the Inception-Residual block are added with the outputs of that particular block. Batch Normalization (BN) is used in the IRCNN, EIN, and EIRN models. In this case, all of the experiments have been conducted on the augmented CIFAR-100 dataset.
[00344] FIGS. 67 and 68 are graphs showing the model loss and accuracy for both the training and validation phases, respectively. From both figures, it can be clearly observed that the presently-disclosed model shows lower loss and the highest recognition accuracy during the validation phase compared with EIN and EIRN, confirming the effectiveness of the presently-disclosed model. It also demonstrates the advantage of recurrent layers in Inception networks.
[00345] FIG. 69 is a graph showing example values for the testing accuracy of IRCNN, EIN, and EIRN on the CIFAR-100 dataset. In summary, the presently-disclosed IRCNN model shows around 3.47% and 2.54% better testing accuracy compared to EIN and EIRN, respectively.
[00346] Experiment on TinyImageNet-200 Dataset
[00347] In this experiment, the presently-disclosed technique was evaluated on the TinyImageNet-200 dataset. This dataset contains 100,000 samples for training, 10,000 samples for validation, and 10,000 samples for testing. These images are sourced from 200 different classes of objects. A difference between the main ImageNet dataset and TinyImageNet is that the images are downsampled from 224 × 224 to 64 × 64. The main impact of downsampling is a loss of detail. Therefore, downsampling the images might introduce ambiguity, which may affect overall model accuracy. The original ImageNet image size is 482 × 418 pixels, where the average object scale is 17%. The size of the images in this experiment is 64 × 64, which makes the TinyImageNet problem even harder. FIG. 70 is a diagram showing examples of images from the dataset.
[00348] Experimentation was conducted with the IRCNN, EIN, EIRN, and RCNN models with almost the same number of parameters, as shown in Table 22. SGD with a starting learning rate of 0.001, a batch size of 64, and a total of 75 epochs was used. In this case, a transfer learning approach was used in which weights were stored after every 25 epochs and then reused as the initial weights for the next 25 epochs. The learning rate was decreased by a factor of 10, and the weight decay was also decreased, after each 25-epoch stage. FIG. 71 is a graph showing example values for the validation accuracy of IRCNN, EIRN, EIN, and RCNN. The impact of transfer learning can be observed in FIG. 71. FIG. 72 is a graph showing example values for the validation accuracy of DenseNet and DenseNet with a Recurrent Convolutional Layer (RCL).
[Table 22 is provided as images in the original filing (pages 63-64) and is not rendered in this text extraction.]
[00349] In the testing phase, the presently-disclosed approaches were evaluated for Top-1% and Top-5% testing accuracy. Table 22 shows the testing accuracy for all of the models, including RCNN and DenseNet. According to Table 22, the IRCNN provides better performance compared to EIN, EIRN, and RCNN with almost the same number of parameters for the object recognition task on the TinyImageNet-200 dataset. Experiments were also conducted with DenseNet and DenseNet with RCLs on the TinyImageNet-200 dataset. The experimental results show that DenseNet with RCLs provides about a 0.38% improvement in Top-1% accuracy compared to DenseNet, with only 1M network parameters. The experimental results also show that DenseNet with RCLs provides higher testing accuracy in both Top-1% and Top-5% compared to the DenseNet model.
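As a concrete illustration of the staged training protocol described in paragraph [00348] (weights stored every 25 epochs and reloaded, with the learning rate reduced by a factor of 10 per stage), the following is a minimal Python/Keras-style sketch. The train_in_stages and build_model names, the checkpoint file name, and the categorical cross-entropy loss are assumptions made for illustration and may differ from the presently-disclosed implementation.

```python
from tensorflow.keras.optimizers import SGD

def train_in_stages(build_model, train_x, train_y, val_x, val_y,
                    stages=3, epochs_per_stage=25, base_lr=0.001):
    """Staged training sketch: save weights after each stage, reload them as the
    starting point of the next stage, and drop the learning rate by 10x per stage."""
    weights_path = "stage_weights.h5"            # hypothetical checkpoint file
    model = None
    for stage in range(stages):
        model = build_model()                    # fresh model graph for each stage
        if stage > 0:
            model.load_weights(weights_path)     # transfer weights from the previous stage
        lr = base_lr / (10 ** stage)             # 0.001, 0.0001, 0.00001, ...
        model.compile(optimizer=SGD(learning_rate=lr),
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(train_x, train_y,
                  validation_data=(val_x, val_y),
                  epochs=epochs_per_stage,
                  batch_size=64)
        model.save_weights(weights_path)
    return model
```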
[00350] Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Software implementations of the described subject matter can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory, computer-readable computer-storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively, or additionally, the program instructions can be encoded in/on an artificially generated propagated signal, for example, a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer-storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of computer-storage mediums. Configuring one or more computers means that the one or more computers have installed hardware, firmware, or software (or combinations of hardware, firmware, and software) so that when the software is executed by the one or more computers, particular computing operations are performed.
[00351] The term “real-time,” “real time,” “realtime,” “real (fast) time (RFT),” “near(ly) real-time (NRT),” “quasi real-time,” or similar terms (as understood by one of ordinary skill in the art), means that an action and a response are temporally proximate such that an individual perceives the action and the response occurring substantially simultaneously. For example, the time difference for a response to display (or for an initiation of a display) of data following the individual’s action to access the data can be less than 1 ms, less than 1 sec., or less than 5 secs. While the requested data need not be displayed (or initiated for display) instantaneously, it is displayed (or initiated for display) without any intentional delay, taking into account processing limitations of a described computing system and time required to, for example, gather, accurately measure, analyze, process, store, or transmit the data.
[00352] The terms “data processing apparatus,” “computer,” or “electronic computer device” (or equivalent as understood by one of ordinary skill in the art) refer to data processing hardware and encompass all kinds of apparatus, devices, and machines for processing data, including by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, for example, a central processing unit (CPU), an FPGA (field programmable gate array), or an ASIC (application-specific integrated circuit). In some implementations, the data processing apparatus or special purpose logic circuitry (or a combination of the data processing apparatus or special purpose logic circuitry) can be hardware- or software-based (or a combination of both hardware- and software-based). The apparatus can optionally include code that creates an execution environment for computer programs, for example, code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of execution environments. The present disclosure contemplates the use of data processing apparatuses with or without conventional operating systems, for example LINUX, UNIX, WINDOWS, MAC OS, ANDROID, IOS, or any other suitable conventional operating system.
[00353] A computer program, which can also be referred to or described as a program, software, a software application, a module, a software module, a script, or code can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, for example, one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, for example, files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
[00354] While portions of the programs illustrated in the various figures are shown as individual modules that implement the various features and functionality through various objects, methods, or other processes, the programs can instead include a number of sub-modules, third-party services, components, libraries, and such, as appropriate. Conversely, the features and functionality of various components can be combined into single components, as appropriate. Thresholds used to make computational determinations can be statically, dynamically, or both statically and dynamically determined.
[00355] The methods, processes, or logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The methods, processes, or logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, for example, a CPU, an FPGA, or an ASIC.
[00356] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors, both, or any other kind of CPU. Generally, a CPU will receive instructions and data from and write to a memory. The essential elements of a computer are a CPU, for performing or executing instructions, and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example, magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, for example, a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device, for example, a universal serial bus (USB) flash drive, to name just a few.
[00357] Computer-readable media (transitory or non-transitory, as appropriate) suitable for storing computer program instructions and data includes all forms of permanent/non-permanent or volatile/non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, for example, random access memory (RAM), read-only memory (ROM), phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic devices, for example, tape, cartridges, cassettes, internal/removable disks; magneto-optical disks; and optical memory devices, for example, digital video disc (DVD), CD-ROM, DVD+/-R, DVD-RAM, DVD-ROM, HD-DVD, and BLURAY, and other optical memory technologies. The memory can store various objects or data, including caches, classes, frameworks, applications, modules, backup data, jobs, web pages, web page templates, data structures, database tables, repositories storing dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto. Additionally, the memory can include any other appropriate data, such as logs, policies, security or access data, reporting files, as well as others. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[00358] To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on a computer having a display device, for example, a CRT (cathode ray tube), LCD (liquid crystal display), LED (Light Emitting Diode), or plasma monitor, for displaying information to the user and a keyboard and a pointing device, for example, a mouse, trackball, or trackpad by which the user can provide input to the computer. Input can also be provided to the computer using a touchscreen, such as a tablet computer surface with pressure sensitivity, a multi-touch screen using capacitive or electric sensing, or other type of touchscreen. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, for example, visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s client device in response to requests received from the web browser.
[00359] The term “graphical user interface,” or “GUI,” can be used in the singular or the plural to describe one or more graphical user interfaces and each of the displays of a particular graphical user interface. Therefore, a GUI can represent any graphical user interface, including but not limited to, a web browser, a touch screen, or a command line interface (CLI) that processes information and efficiently presents the information results to the user. In general, a GUI can include a plurality of user interface (UI) elements, some or all associated with a web browser, such as interactive fields, pull-down lists, and buttons. These and other UI elements can be related to or represent the functions of the web browser.
[00360] Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, for example, as a data server, or that includes a middleware component, for example, an application server, or that includes a front-end component, for example, a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of wireline or wireless digital data communication (or a combination of data communication), for example, a communication network. Examples of communication networks include a local area network (LAN), a radio access network (RAN), a metropolitan area network (MAN), a wide area network (WAN), Worldwide Interoperability for Microwave Access (WIMAX), a wireless local area network (WLAN) using, for example, 802.11 a/b/g/n or 802.20 (or a combination of 802.11x and 802.20 or other protocols consistent with this disclosure), all or a portion of the Internet, or any other communication system or systems at one or more locations (or a combination of communication networks). The network can communicate with, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, or other suitable information (or a combination of communication types) between network addresses.
[00361] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
[00362] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what can be claimed, but rather as descriptions of features that can be specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented, in combination, in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations, separately, or in any suitable sub-combination. Moreover, although previously described features can be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can, in some cases, be excised from the combination, and the claimed combination can be directed to a sub-combination or variation of a sub-combination.
[00363] Particular implementations of the subject matter have been described. Other implementations, alterations, and permutations of the described implementations are within the scope of the following claims as will be apparent to those skilled in the art. While operations are depicted in the drawings or claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed (some operations can be considered optional), to achieve desirable results. In certain circumstances, multitasking or parallel processing (or a combination of multitasking and parallel processing) can be advantageous and performed as deemed appropriate.
[00364] Moreover, the separation or integration of various system modules and components in the previously described implementations should not be understood as requiring such separation or integration in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[00365] Accordingly, the previously described example implementations do not define or constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure.
[00366] Furthermore, any claimed implementation is considered to be applicable to at least a computer-implemented method; a non-transitory, computer-readable medium storing computer-readable instructions to perform the computer-implemented method; and a computer system comprising a computer memory interoperably coupled with a hardware processor configured to perform the computer-implemented method or the instructions stored on the non- transitory, computer-readable medium.


CLAIMS

What is claimed is:
1. A system, comprising:
one or more data processing apparatuses; and
an image segmentation neural network subsystem implemented on the one or more data processing apparatuses, the image segmentation neural network subsystem comprising:
a plurality of encoding units arranged in succession so that each encoding unit after a first encoding unit is configured to process an input set of feature maps from a preceding encoding unit to generate an output set of feature maps having a lower dimensionality than the input set of feature maps, wherein the first encoding unit is configured to process a neural network input representing a data map to generate a first output feature map, and each encoding unit comprises a recurrent convolutional block or a recurrent-residual convolutional unit; and
a plurality of decoding units arranged in succession so that each decoding unit after a first decoding unit is configured to process a first input set of feature maps from a preceding decoding unit and a second input set of feature maps from a corresponding encoding unit to generate an output set of feature maps having a higher dimensionality than the input set of feature maps, wherein the first decoding unit is configured to process as input the output set of feature maps from a last of the encoding units in the succession of encoding units, and a last of the decoding units in the succession of decoding units is configured to generate a final feature map for the data map.
2. The system of claim 1, wherein each encoding unit of the plurality of encoding units comprises a recurrent convolutional block.
3. The system of claim 2, wherein the recurrent convolutional block comprises a plurality of forward recurrent convolutional layers.
4. The system of any of claims 1-3, wherein each encoding unit of the plurality of encoding units comprises a recurrent-residual convolutional unit.
5. The system of claim 4, wherein the recurrent-residual convolutional unit comprises a plurality of recurrent convolution layers having residual connectivity.
6. The system of any of claims 1-5, wherein the data map is an input image.
7. The system of any of claims 1-6, wherein the final feature map is a segmentation map for the data map.
8. The system of claim 7, further comprising a segmentation engine on the one or more data processing apparatuses, the segmentation engine configured to segment the data map using the segmentation map.
9. The system of any of claims 1-8, wherein the final feature map is a density heat map for the data map.
10. The system of any of claims 1-9, wherein the data map is an input image that depicts a slide of cells, and the neural network subsystem is configured for use in a nuclei segmentation task to identify nuclei in the slide of cells.
11. The system of any of claims 1-10, wherein the data map is an input image that depicts a slide of cells, and the neural network subsystem is configured for use in an epithelium segmentation task to identify epithelium in the slide of cells.
12. The system of any of claims 1-11, wherein the data map is an input image that depicts a slide of cells, and the neural network subsystem is configured for use in a tubule segmentation task to identify tubules in the slide of cells.
13. A method for processing a data map with a neural network subsystem having a plurality of encoder units and a plurality of decoder units, each decoder unit corresponding to a different encoder unit, the method comprising:
processing successive representations of the data map with the plurality of encoder units to generate a set of feature maps for the data map, each feature map having a lower dimensionality than the data map, each encoder unit comprising a recurrent convolutional block or a recurrent-residual convolutional unit; and
upsampling the set of feature maps with the plurality of decoder units to generate a final feature map for the data map that has a higher dimensionality than feature maps in the set of feature maps.
14. The method of claim 13, wherein each encoding unit of the plurality of encoding units comprises a recurrent convolutional block.
15. The method of any of claims 13-14, wherein the recurrent convolutional block comprises a plurality of forward recurrent convolutional layers.
16. The method of any of claims 13-15, wherein each encoding unit of the plurality of encoding units comprises a recurrent-residual convolutional unit.
17. The method of claim 16, wherein the recurrent-residual convolutional unit comprises a plurality of recurrent convolution layers having residual connectivity.
18. The method of any of claims 13-17, wherein the data map is a medical image.
19. A system comprising:
one or more data processing apparatuses; and
one or more computer-readable media having instructions stored thereon that, when executed by the one or more data processing apparatuses, cause the data processing apparatuses to perform the methods of any of claims 13-18.
20. One or more computer-readable media having instructions stored thereon that, when executed by one or more data processing apparatuses, cause the data processing apparatuses to perform the methods of any of claims 13-18.
21. A system, comprising:
one or more data processing apparatuses; and an image segmentation neural network subsystem implemented on the one or more data processing apparatuses, the neural network subsystem comprising:
one or more first convolutional layers;
one or more inception recurrent residual convolutional neural network (IRRCNN) blocks; and
one or more transition blocks.
22. The system of claim 21, wherein each IRRCNN block includes an inception unit and a residual unit, the inception unit including recurrent convolutional layers that are merged by concatenation, the residual unit configured to sum input features to the IRRCNN block with an output of the inception unit.
23. The system of any of claims 21-22, wherein the neural network subsystem is configured to process a data map to perform a classification task based on the data map.
24. The system of any of claims 21-23, further comprising a softmax layer.
25. A method, comprising:
obtaining a neural network input, the neural network input representing a data map; processing the neural network input with a neural network system to generate a classification for one or more items shown in the data map, the neural network system including one or more first convolutional layers, one or more inception recurrent residual convolutional (IRRCNN) blocks, and one or more transition blocks; and
providing the classification for storage, processing, or presentation.
26. A system comprising:
one or more data processing apparatuses; and
one or more computer-readable media having instructions stored thereon that, when executed by the one or more data processing apparatuses, cause the data processing apparatuses to perform the method of claim 25.
27. One or more computer-readable media having instructions stored thereon that, when executed by one or more data processing apparatuses, cause the data processing apparatuses to perform the method of claim 25.
PCT/US2019/059653 (published as WO2020093042A1), priority date 2018-11-02, filed 2019-11-04: Neural networks for biomedical image analysis

Applications Claiming Priority (2)

- US201862755097P: priority date 2018-11-02, filing date 2018-11-02
- US62/755,097: priority date 2018-11-02

Publications (1)

- WO2020093042A1, published 2020-05-07

Family

ID=70463392

Family Applications (1)

- PCT/US2019/059653 (WO2020093042A1): priority date 2018-11-02, filing date 2019-11-04, title: Neural networks for biomedical image analysis

Legal Events

- Code 121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19877914; Country of ref document: EP; Kind code of ref document: A1)
- Code NENP: Non-entry into the national phase (Ref country code: DE)
- Code 122: Ep: pct application non-entry in european phase (Ref document number: 19877914; Country of ref document: EP; Kind code of ref document: A1)
- Code 32PN: Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 081121))