WO2020014477A1 - Methods, systems, and computer readable media for image analysis with deep learning to predict breast cancer classes - Google Patents


Info

Publication number
WO2020014477A1
Authority
WO
WIPO (PCT)
Application number
PCT/US2019/041395
Other languages
French (fr)
Inventor
Charles Maurice PEROU
Heather Dunlop COUTURE
Lindsay Almquist WILLIAMS
Sarah NYANTE
James Stephen Marron
Melissa TROESTER
Marc Niethammer
Joseph Geradts
Ebonee BUTLER
Original Assignee
The University Of North Carolina At Chapel Hill
City Of Hope
Application filed by The University Of North Carolina At Chapel Hill, City Of Hope filed Critical The University Of North Carolina At Chapel Hill
Publication of WO2020014477A1

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H — HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/40 — ICT specially adapted for the handling or processing of medical images; for processing medical images, e.g. editing
    • G16H50/20 — ICT specially adapted for medical diagnosis, medical simulation or medical data mining; for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/70 — ICT specially adapted for medical diagnosis, medical simulation or medical data mining; for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • the classifier had overall accuracy of 77%, but accuracy of 85% among low-intermediate grade tumors and 70% among high grade tumors.
  • ROR-PT vs. low-medium risk of recurrence score
  • Figure 2 shows four cores from a single patient, along with the class predictions over different regions of the image. While three cores are predicted ER negative and Basal-like intrinsic subtype, the fourth is predicted mostly ER negative and non-Basal-like, indicating that some intra-tumoral heterogeneity might be present between cores.
  • histologic subtype and molecular marker status could be predicted using image analysis. While we did perform grade-weighting within ER classification, there may be other image features of ER positive tumors that are not readily discernible and are driving the higher accuracy of ER positive images over ER negative. Agreement between true ER status (by IHC) vs.
  • the high accuracy may be due to the arrangement of epithelial and stromal cells characteristic of ductal and lobular tumors, whereby lobular tumors are characterized by non-cohesive single file lines of epithelial cells infiltrating the stroma and ductal tumors are characterized by sheets or nests of epithelial cells embedded in the surrounding stroma [24,25].
  • RNA-based assessment and 77% agreement for classification of Luminal A subtype [26]. Our estimates are similar, suggesting that image analysis, even without the use of special IHC stains, could be a viable option for classification of molecular breast tumor subtype and ROR-PT from H&E stained images.
  • CBCS: Carolina Breast Cancer Study, Phase 3 (2008 through September 2013)
  • Methods for CBCS have been described elsewhere [27]. Briefly, CBCS recruited participants from 44 of the 100 North Carolina counties using rapid case ascertainment via the North Carolina Central Cancer Registry. After giving informed consent, patients were enrolled under an Institutional Review Board protocol that maintains approval at the University of North Carolina. CBCS eligibility criteria included being female, a first diagnosis of invasive breast cancer, aged 20-74 years at diagnosis, and residence in specified counties. Patients provided written informed consent to access tumor tissue blocks/slides and medical records from treatment centers.
  • the training and test sets were formed by a random partition of the data.
  • Patients in the final training and test sets had information for tumor grade and histologic subtype, determined via centralized breast pathologist review within CBCS, along with biomarker data for ER status, PAM50 intrinsic breast cancer subtype, and risk of recurrence (ROR-PT) where noted.
  • the H&E images were taken from tissue microarrays constructed with 1-4 1-mm cores for each patient, resulting in 932 core images for the test set analysis presented here.
  • ER status for each TMA core was determined using a digital algorithm as described by Allott et al. (2016) [28] and was defined using a ≥ 10% positivity cut point for immunohistochemistry staining.
  • Tumor Tissue Microarray Construction. As has been described in detail by Allott et al. (2016), tumor tissue microarrays were constructed for CBCS3 participants with available paraffin-embedded tumor blocks [26]. The CBCS study pathologist marked areas of invasive breast cancer within a tumor on H&E stained whole slide images. The marked areas were selected for coring, and 1-4 tumor tissue cores per participant were used in the TMA construction at the Translational Pathology Laboratory at UNC. TMA slides were H&E stained and images were generated at 20x magnification. Cores with insufficient tumor cellularity were eliminated from the analysis.
  • Nanostring assays were carried out on a randomly sampled subset of available formalin fixed paraffin embedded (FFPE) tumor tissue cores. RNA was isolated from 1.0-mm cores from the same FFPE block using the Qiagen RNeasy FFPE kit (catalogue #73504). Nanostring assays, which use RNA counting as a measure of gene expression, were conducted. RNA-based intrinsic subtype was determined using the PAM50 gene signature described by Parker et al.
  • FFPE formalin fixed paraffin embedded
  • each tumor was categorized into one of five intrinsic subtypes (Luminal A, Luminal B, HER2, Basal-like, Normal-like), using the 50-gene PAM50 signature [27]. Categorizations were based on a previously validated risk of recurrence score, generated using PAM50 subtype, tumor proliferation, and tumor size (ROR-PT), with a cutoff for high of 64.7 from the continuous ROR-PT score [2].
  • the low level filters detect small structures such as edges and blobs. Intermediate layers capture increasingly complex properties like shape and texture.
  • the top layers of the network are able to represent object parts like faces or bicycle tires.
  • the convolution filters are learned from data, creating discriminating features at multiple levels of abstraction. There is no need to hand craft features.
  • We used the VGG16 architecture, configuration D [34], pre-trained on the ImageNet data set, which consists of 1.2 million images from 1000 categories of objects and scenes.
  • Although ImageNet contains a vastly different type of image, CNNs trained on this data set have been shown to transfer well to other data sets [35, 57-58], including those from biomedical applications [14, 59].
  • the lower layers of a CNN are fairly generic, while the upper layers are much more specialized.
  • the lower layers only capture smaller-scale features, which do not provide enough discriminating ability, while the upper layers are so specific to ImageNet that they do not generalize well to histology.
  • Intermediate layers are both generalizable and discriminative for other tasks.
  • In transferring to histology we search for the layer that transfers best to our task. Output from each set of convolutional layers, before max pooling, was extracted over each image at full resolution to form a set of features for the image.
  • Output from the fourth set of convolutional layers was chosen because it performed better than the outputs from other layers.
  • the fourth set of convolutional layers outputs features of dimension 512.
  • These lower CNN layers are convolutional, meaning that they can be run on any image size. For an image size of 2500x2500, they produce a grid of 284x284x512 features.
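  • As an illustration only (the patent does not tie the implementation to a particular framework), the sketch below shows how intermediate VGG16 features can be extracted fully convolutionally from a histology region with PyTorch/torchvision; the layer index used to stop before the fourth max-pooling layer is an assumption about torchvision's layer ordering, not part of the original disclosure.

```python
import torch
import torchvision.models as models

# VGG16 (configuration D) pre-trained on ImageNet.
vgg = models.vgg16(pretrained=True).features.eval()

# Keep the convolutional layers up to the end of the fourth conv block,
# stopping before its max-pooling layer (assumed index 23 in torchvision's
# `features` sequence), so a 512-channel feature grid is produced.
conv4 = torch.nn.Sequential(*list(vgg.children())[:23])

with torch.no_grad():
    region = torch.rand(1, 3, 800, 800)   # toy stand-in for an 800x800 H&E region
    feats = conv4(region)                 # shape: (1, 512, H/8, W/8)
print(feats.shape)
```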
  • Model Training and Training Data Sets. In training a model to predict the class or characteristic group of a tumor, such as high or low grade, we utilize patient-level labels.
  • the TMA images may be much larger than the required input to the VGG16 CNN (e.g., typically 2500x2500 pixels for TMA spots vs. 224x224 for VGG16). Further, applying the original CNN fully convolutionally would produce features that are not generalizable to histology. Thus some modifications to the VGG16 approach are necessary.
  • a new classifier may be trained to operate on the intermediate level features from VGG16. Simply taking the mean of each feature over the image would limit our insight into which parts of the image contributed to the classification.
  • the patient-level labels are weak compared to detailed patch- or pixel-level annotations used in most prior work, necessitating a different classification framework called multiple instance learning.
  • Ground truth for each class: tumor grade (pathologist determined), ER status (IHC-based), PAM50 intrinsic subtype (50-gene expression-based), ROR-PT (gene expression-based), and histologic subtype (pathologist determined).
  • a probabilistic model was formed for how likely each image region is to belong to each class, with these probabilities aggregated across all image regions to form a prediction for the tumor as a whole.
  • Image regions were generated as 800x800 pixel regions in the training images, with the mean of each CNN feature computed over the region.
  • a linear Support Vector Machine (SVM) [39] calibrated with isotonic regression [40] was used to predict the probability for each region. Isotonic regression fits a piecewise-constant non-decreasing function, transforming the distance from the separating hyperplane learned by the SVM to a probability that an image region belongs to each class.
  • each image region was labeled with the class of the tumor from which it came.
  • the data for model fitting and calibration may be disjoint, so cross-validation was used to split the training instances into five equal-sized groups, where four were used for training and the remaining one for calibration/validation (the test set remains untouched).
  • an SVM was learned on the training set and calibration was learned on the calibration set with isotonic regression, thus forming an ensemble.
  • An ensemble of size five was selected to balance the desirability of a large training set, a reasonably sized validation set, and the simultaneous desirability of limiting the computation time.
  • Predictions on the test set were made by averaging probabilities from the five models. This ensemble method also helped to soften any noise in the predictions caused by incorrect image region labels due to heterogeneity.
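  • A minimal sketch of this instance-level classifier in scikit-learn is shown below; the feature matrix and labels are synthetic placeholders, and scikit-learn's CalibratedClassifierCV with cv=5 is used as a stand-in for the five-member fit/calibrate ensemble described above.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X_regions = rng.normal(size=(1000, 512))    # mean CNN features per 800x800 region (toy)
y_regions = rng.integers(0, 2, size=1000)   # region labels inherited from the tumor (toy)

# Linear SVM whose decision values are mapped to probabilities with isotonic
# regression; cv=5 fits five (SVM, calibrator) pairs on 4/5 of the data each
# and averages their probabilities at prediction time.
instance_clf = CalibratedClassifierCV(LinearSVC(dual=False), method="isotonic", cv=5)
instance_clf.fit(X_regions, y_regions)
region_probs = instance_clf.predict_proba(X_regions)[:, 1]   # P(class | region)
```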
  • Predictions for tumors were made by first forming a quantile function (inverse cumulative distribution) of the calibrated SVM ensemble predictions for the image regions using 16 equally spaced quantiles from images in the training set. The quantiles of the training images were used to train another linear SVM to predict the class label for the whole tumor, with sigmoid calibration transforming the SVM output into probabilities. This method allowed predictions to be made for individual image regions, while also aggregating to overall tumor predictions.
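  • The quantile-based aggregation step might look like the following sketch (again with synthetic data; the 16 equally spaced quantiles and sigmoid calibration follow the text, while everything else is illustrative).

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def quantile_features(region_probs, n_quantiles=16):
    """Summarize a tumor's region probabilities with 16 equally spaced quantiles."""
    return np.quantile(region_probs, np.linspace(0.0, 1.0, n_quantiles))

rng = np.random.default_rng(1)
# toy data: 200 tumors, each with a variable number of region probabilities
probs_per_tumor = [rng.random(int(rng.integers(10, 40))) for _ in range(200)]
y_tumor = rng.integers(0, 2, size=200)

X_tumor = np.stack([quantile_features(p) for p in probs_per_tumor])
# Second linear SVM on the quantile features, with sigmoid (Platt) calibration.
tumor_clf = CalibratedClassifierCV(LinearSVC(dual=False), method="sigmoid", cv=5)
tumor_clf.fit(X_tumor, y_tumor)
p_class = tumor_clf.predict_proba(X_tumor)[:, 1]   # tumor-level class probability
```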
  • Sample weighting was applied so that each class, including tumor grade, ER status, and Basal-like vs. non-Basal-like intrinsic subtype, contributed equally during training.
  • Weights inversely proportional to the number of samples in each group were used, e.g., low grade/class 1, low grade/class 2, high grade/class 1, and high grade/class 2 were each weighted equally, where the classes are ER status, histologic subtype, or intrinsic subtype.
  • Prediction in Test Sets. Overlapping 800x800-pixel regions with a stride of 400 pixels were used as image regions from each TMA spot, which is typically 2500 pixels in diameter. Only image regions containing at least 50% tissue within the core image field of view (i.e., 50% tissue, 50% glass) were used.
  • the calibrated SVM ensemble predicted the class of each image region by assigning a probability of belonging to one of two classes (tumor grade 1 or 3, ER+ or ER-, Basal-like or non-Basal-like subtype, ductal or lobular histologic subtype, and low-med or high ROR-PT).
  • the probabilities computed on the image regions from all cores were aggregated into a quantile function and the second SVM was used to predict the class for the whole tumor.
  • Cut points were determined for each tumor characteristic based on the achievement of optimal sensitivity, specificity, and accuracy of each core being correctly classified relative to the pathology or biomarker data.
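  • One plausible way to pick such a cut point, sketched below with scikit-learn, is to scan the ROC curve and keep the threshold that maximizes sensitivity plus specificity (Youden's J); the exact criterion used in the study is not spelled out here, so this is only an illustration on toy data.

```python
import numpy as np
from sklearn.metrics import roc_curve

def choose_cut_point(y_true, scores):
    """Return the score threshold maximizing sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return thresholds[np.argmax(tpr - fpr)]

# toy example: noisy scores that are higher for the positive class
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=500)
scores = 0.6 * y + 0.4 * rng.random(500)
print(choose_cut_point(y, scores))
```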
  • image analysis assigned a probability score of being a high grade vs. low grade tumor for each image.
  • a cut point of greater than 0.80 was used for high grade tumors (Figure 1a).
  • traditional pathologist scoring methods were used to classify tumors as a combined grade of low, intermediate, or high.
  • two independent pathologists’ classifications of tumor grade for the same tissue sample were assessed to compare the agreement between two pathologists to that observed for image analysis versus pathologist classification.
  • Accurate classification was defined as identical classification based on histologic image analysis and biomarker data for the same core. To determine whether any clinical characteristics were associated with an inaccurate image-based call for ER status, we estimated odds ratios (ORs) and 95% confidence intervals (95% CI) for the association between patient characteristics and the accuracy of ER status (i.e., concordant with clinical status vs. discordant with clinical status) (Supplemental Table 1). All statistical analyses were done in SAS version 9.4 (SAS Institute, Cary, NC). P-values were two-sided with an alpha of 0.05.
  • Figure 3 is a block diagram of a computing platform with a CNN-based image classifier for predicting breast cancer classes, where the classes include tumor grade, ER status, histologic subtype, and intrinsic subtype through image analysis.
  • Figure 4 is a flow chart of using the CNN-based image classifier in Figure 3 to predict breast cancer classes.
  • a test image 300 is provided as input to a CNN-based image classifier 302.
  • Test image 300 may be an H&E stained histologic image as described above.
  • CNN-based image classifier 302 may include a convolutional neural network 304, such as the VGG16 network, trained using the steps described above.
  • CNN-based image classifier 302 may be implemented on a computing platform 306 including at least one processor 308 and memory 310.
  • generating instances may include dividing test image 300 into regions of a predetermined size.
  • the predetermined size is 800x800 pixels.
  • Extracting features may include providing each image instance as input to CNN 304, which generates features as output for each image instance.
  • the CNN may be trained using multiple instance (MI) learning and/or MI aggregation.
  • MI multiple instance
  • the CNN may be trained end-to-end using MI learning.
  • the CNN may be trained using an MI aggregation technique that aggregates predictions from smaller regions of an image into an image-level classification by using the quantile function. Additional details regarding MI learning, MI aggregation, and/or related aspects are discussed below.
  • the CNN may be trained using genomic data sets and a task-based canonical correlation analysis (CCA) method.
  • CCA canonical correlation analysis
  • a CNN component may be configured to use imaging data and a related genomic data set along with a task-based CCA method (e.g., a task-optimal CCA method) to project data from the two sources to a shared space that is also discriminative.
  • the CNN component may be trained for use in improving accuracy when identifying or extracting features from image instances during run-time. Additional details regarding a task-based CCA method and/or related aspects are discussed below.
  • In step 404, image classes are predicted for each instance.
  • the CNN-generated features for each instance are input into an instance support vector machine 312, which outputs a probability as to whether the instance is a member of each class.
  • instance SVM 312 may output probabilities as to whether the instance is in one or the other of the binary classes for tumor grade (high grade or low grade), ER status (ER positive or ER negative), histologic subtype (ductal or lobular), risk of recurrence (high versus low), and intrinsic subtype (basal-like or non-basal-like).
  • In step 406, the instance predictions are aggregated to produce a prediction for the entire image.
  • the probabilities computed for each image instance by instance SVM 312 are aggregated into a quantile function 314, and an aggregation SVM 316 that operates on quantile function 314 generates probabilities of each class for the entire image.
  • Finally, the tumor class is predicted: using the probabilities generated in step 406, CNN-based image classifier 302 generates output as to whether the tumor is in one or the other of the above-described binary classes.
  • cut points or threshold values may be assigned for each class.
  • the probability computed for each class may then be compared to the cut point or threshold to determine whether or not the tumor is a member of the class. For example, as described above, the cut point for classifying a tumor as high grade may be 0.8. Accordingly, if the probability computed for tumor grade for a given image is 80% or higher, the tumor may be classified as high grade.
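  • Applying the cut points at prediction time is then a simple comparison, as in the toy example below (only the 0.80 grade cut point comes from the text; the other names and values are placeholders).

```python
# Hypothetical per-class cut points; only the high-grade value (0.80) is from the text.
cut_points = {"high_grade": 0.80, "er_positive": 0.50, "basal_like": 0.50}

def assign_classes(image_probs, cut_points):
    """Compare each predicted class probability to its cut point."""
    return {name: image_probs[name] >= cut for name, cut in cut_points.items()}

print(assign_classes({"high_grade": 0.86, "er_positive": 0.31, "basal_like": 0.72}, cut_points))
# {'high_grade': True, 'er_positive': False, 'basal_like': True}
```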
  • tumors can be classified from image analysis alone to identify the need for further diagnostic tests (such as genomic tests) and/or treatment.
  • MI learning with a convolutional neural network enables end-to-end training in the presence of weak image-level labels.
  • the quantile function provides a more complete description of the heterogeneity within each image, improving image-level classification.
  • Deep learning has become the standard solution for classification when a large set of images with detailed annotations is available for training.
  • When annotations are weaker, such as with large, heterogeneous images, we turn to MI learning.
  • the image (called a bag) is broken into smaller regions (called instances).
  • We are given a label for each bag, but the instance labels are unknown.
  • Some form of pooling aggregates instances into a bag-level classification.
  • CNN convolutional neural network
  • Figure 5 depicts MI augmentation.
  • In MI learning, each bag contains one or more instances. Labels are given for the bag, but not the instances.
  • MI augmentation is a technique to provide additional training samples by randomly selecting a cropped image region and the instances within it. When the bag label is applied to a small number of instances, it is weak because this small region may not be representative of the bag class. Applying the bag label to larger cropped regions provides a stronger label, while still providing benefit from image augmentation. Training with the whole image maximizes the opportunity for MI learning, but restricts the benefits of image augmentation. At test time, the whole image is processed and the predictions from all instances are aggregated into a bag prediction.
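  • A sketch of this augmentation step is shown below with NumPy; the crop, instance, and stride sizes mirror values mentioned elsewhere in this description but are otherwise arbitrary choices for illustration.

```python
import numpy as np

def mi_augment(image, crop=2000, instance=800, stride=400, rng=None):
    """Randomly crop a `crop` x `crop` region and return the instances inside it.
    During training, the bag (image) label is applied to this cropped sub-bag."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = image.shape[:2]
    top = rng.integers(0, H - crop + 1)
    left = rng.integers(0, W - crop + 1)
    region = image[top:top + crop, left:left + crop]
    instances = [
        region[r:r + instance, c:c + instance]
        for r in range(0, crop - instance + 1, stride)
        for c in range(0, crop - instance + 1, stride)
    ]
    return np.stack(instances)

# toy 2500 x 2500 three-channel "core"
bag = mi_augment(np.zeros((2500, 2500, 3), dtype=np.float32))
print(bag.shape)   # (16, 800, 800, 3)
```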
  • A permutation-invariant pooling of instances is needed to accommodate images of different sizes, which a fully connected neural network cannot do.
  • Existing pooling approaches are very aggressive; they compute a single number rather than looking at the distribution of instance predictions.
  • Most MI applications use the maximum, which works well for problems such as cancer diagnosis where, if there is a small amount of tumor, the sample is labeled as cancerous [46, 56].
  • A smooth approximation, such as the generalized mean or noisy-OR, provides better convergence in a CNN [47, 52, 45].
  • For heterogeneous images, a majority vote, median, or mean is more appropriate. We include more of the distribution by pooling with the QF and learning a mapping to the bag class prediction, improving the classification accuracy.
  • the QF is a new general type of feature pooling that could provide an alternative to max pooling in a CNN.
  • image augmentation may be applied in training a CNN by randomly cropping large portions of each image during each epoch. At test time, the whole image is used.
  • We use MI augmentation, in which a subset of instances is randomly selected from each bag during each epoch. Instances may be the same size, but we choose how many instances to aggregate over. In selecting the number of instances, there may be two extremes: a single instance vs. the whole bag. In the former, the bag label is assigned to each instance, and this is often called single instance learning.
  • MI aggregation is incorporated while training the bag classifier as in other MI methods [41, 44]. Comparison studies have found little or no improvement from these MI methods on some data sets [54, 55]. We found MI learning to be very beneficial and show that it can be useful in dealing with heterogeneous data.
  • MI learning can be implemented with many different types of classifiers [41, 46, 54].
  • a fully convolutional network forms the instance classifier f_inst, followed by a global MI layer for instance aggregation f_agg.
  • the FCN consists of convolutional and pooling layers that downsize the representation, followed by a softmax operation to predict the probability for each class.
  • the FCN output is w_d x w_d x C.
  • An instance is defined as the receptive field from the original image used in creating a point in this w_d x w_d grid; the instances are overlapping.
  • the MI aggregation layer takes the instance probabilities and the foreground mask for the input image (downscaled to w_d x w_d), thereby aggregating over only the foreground instances.
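  • The fragment below sketches what such a fully convolutional instance classifier can look like in PyTorch; it is a toy network (the experiments described here fine-tuned AlexNet), shown only to make the instance-grid idea concrete.

```python
import torch
import torch.nn as nn

class InstanceFCN(nn.Module):
    """Toy fully convolutional instance classifier: strided convolutions downsize
    the input, and a 1x1 convolution with softmax yields a w_d x w_d grid of
    per-instance class probabilities."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(128, n_classes, kernel_size=1)

    def forward(self, x):
        logits = self.classifier(self.features(x))   # (B, C, w_d, w_d)
        return logits.softmax(dim=1)

grid = InstanceFCN()(torch.rand(1, 3, 512, 512))      # -> (1, 2, 64, 64)
print(grid.shape)
```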
  • Figure 6 illustrates an overview for an example MI learning technique.
  • a cropped region of a given size is randomly selected.
  • An FCN is applied to predict the class, producing a grid of instance predictions.
  • the instance predictions are aggregated over the foreground of the image (as indicated by the foreground mask) using quantile aggregation to predict the class of the cropped image region.
  • backpropagation learns the FCN and aggregation function weights.
  • At test time, the whole image is used.
  • Instance predictions can be used to form a bag prediction in different ways.
  • the bag prediction function should be invariant to the number and spatial arrangement of instances, so some pooling of predictions is needed.
  • Mean aggregation is well suited for global pooling as it is permutation invariant and can incorporate a foreground mask for the input image. Denoting the mask as M and its value for each instance as m_n ∈ {0, 1}, the mean aggregation function over instance predictions p_n is f_mean(P, M) = (Σ_n m_n p_n) / (Σ_n m_n).
  • Mean pooling incorporates predictions from all instances, but a lot of information may be lost in compressing to a single number.
  • a histogram is a more complete description of the probability distribution, but is dependent upon a suitable bin width.
  • QF: quantile function (inverse cumulative distribution)
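  • The two pooling choices can be written compactly in PyTorch as below; both operate on a flattened list of instance probabilities with a foreground mask, and the 16-quantile summary would then feed a small fully connected layer to predict the bag class (all shapes and values are toy placeholders).

```python
import torch

def masked_mean_pool(inst_probs, mask):
    """Mean aggregation over foreground instances only.
    inst_probs: (N, C) instance class probabilities; mask: (N,) with values in {0, 1}."""
    m = mask.unsqueeze(1)
    return (inst_probs * m).sum(dim=0) / m.sum().clamp(min=1.0)

def quantile_pool(inst_probs, mask, n_quantiles=16):
    """Quantile-function pooling: summarize the distribution of foreground
    instance probabilities with equally spaced quantiles."""
    fg = inst_probs[mask.bool()]
    qs = torch.linspace(0.0, 1.0, n_quantiles, device=inst_probs.device)
    return torch.quantile(fg, qs, dim=0).flatten()

probs = torch.rand(25, 2)                  # toy 5x5 grid of instance probabilities, 2 classes
mask = torch.ones(25)
bag_mean = masked_mean_pool(probs, mask)   # shape (2,)
bag_qf = quantile_pool(probs, mask)        # shape (32,)
```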
  • Image augmentation by random cropping is an important technique for creating extra training samples that helps to reduce over-fitting.
  • the FCN is applied to the cropped image at full resolution.
  • MI augmentation is a strategy used during training. As the MI aggregation layer is invariant to input size, the entire image and all its instances are always used at test time.
  • Classification accuracy is measured for five different tasks, some of them multiclass: 1) histologic subtype (ductal or lobular), 2) estrogen receptor (ER) status (positive or negative), 3) grade (1, 2, or 3), 4) risk of recurrence score (ROR) (low, intermediate, or high), and 5) genetic subtype (basal, luminal A, luminal B, HER2, or normal-like). Ground truth for histologic subtype and grade are from a pathologist looking at the original whole slide. ER status is determined from immunohistochemistry, genetic subtype from the PAM50 array [51], and ROR from the ROR-PT score-based method [51].
  • TMA images are intensity normalized to standardize the appearance across slides [50].
  • the hematoxylin, eosin, and residual channels are extracted from the normalization process and used as the three channel input for the rest of our algorithm.
  • a binary mask distinguishing tissue from background is also provided as input.
  • the smallest possible size is 227 × 227 (the input size for AlexNet), consisting of a single instance.
  • When the bag label is applied to each instance during training, this is called single instance learning.
  • a larger cropped region of size w x w can be selected; we test multiples of 500 up to 3500 and use mean aggregation in this experiment.
  • By assigning the bag label to this larger cropped region during training and keeping the instance size constant, we perform MI learning. Multiple random crops are obtained from each training image such that roughly the same number of pixels is sampled for each crop size; for the largest crop size, the whole image may be used without MI augmentation. Random mirroring and rotations are used for augmentation at all crop sizes. At test time, the whole image is always used, with the bag prediction formed by aggregating across all instances.
  • Figure 7 shows classification accuracy using mean aggregation as the number of instances (cropped image size) used for training is increased, while keeping instance size constant.
  • the benefits level off for larger crops. As GPU memory requirements increase for larger crop sizes, selecting an intermediate crop size provides most of the benefits of MI augmentation.
  • Table 5: Average classification accuracy for different types of MI aggregation.
  • FIG 8 shows a visualization for a sample image where the instance predictions are colored for each class.
  • the w = 2000 crop size was used for this example.
  • Figure 9 shows a plotting of the results for grade 1 vs. 3 and genetic subtype basal vs. luminal A.
  • Heterogeneity is expected for grade, as the three tumor grades are not discrete, but a continuous spectrum from low to high.
  • the level of heterogeneity to expect for genetic subtype is unknown because no studies have yet assessed genetic subtype from multiple samples within the same tumor.
  • the graph shows a continuous spectrum from basal to luminal A.
  • the luminal B, HER2, and normal samples lie mostly on the luminal A side, but with some mixing into the basal side.
  • Figures 8a-8e depict visualizations of instance predictions for a sample with ground truth labels of ductal (Figure 8a), ER positive (Figure 8b), grade 1 (Figure 8c), low ROR (Figure 8d), and luminal A (Figure 8e).
  • Figures 9a-9b depict visualizations of predicted heterogeneity for grade 1 vs. 3 ( Figure 9a) and genetic subtype basal vs luminal A ( Figure 9b).
  • the predicted proportion for each class is calculated as the proportion of instances in the sample predicted to be from each class. Test samples for all classes are plotted.
  • MI learning while training a CNN may be very useful in achieving high classification accuracy on large, heterogeneous images. Even with a small number of labeled samples, our model was successful in fine-tuning the AlexNet CNN because of the large size of the images providing plenty of opportunity for MI augmentation. The impact of MI learning indicates that accommodating image heterogeneity is essential. While aggregating instance predictions with the mean is sufficient for some tasks, quantile aggregation produces a significant improvement for others. Instance-level predictions will enable future work studying tumor heterogeneity, perhaps leading to biological insights of tumor progression.
  • CCA is a popular data analysis technique that projects two data sources into a space in which they are maximally correlated [61, 62]. It was initially used for unsupervised data analysis to gain insights into components shared by the two sources [63-65]. CCA is also used to compute a shared latent space for cross-view classification [66, 64, 67, 68], for representation learning on multiple views that are then joined for prediction [69, 70], and for classification from a single view when a second view is available during training [71]. While some of the correlated CCA features are useful for discriminative tasks, many represent properties that are of no use for classification and obscure correlated information that is beneficial.
  • Given two mean-centered data views X_1 and X_2, the first canonical directions are found via (w_1*, w_2*) = argmax_{w_1, w_2} corr(X_1 w_1, X_2 w_2), subject to the unit-variance constraints w_1^T Σ_11 w_1 = w_2^T Σ_22 w_2 = 1, where Σ_11 and Σ_22 are the within-view covariance matrices.
  • DCCA: Deep CCA adds non-linear projections to CCA by non-linearly mapping the input via a multilayer perceptron (MLP).
  • MLP multilayer perceptron
  • A_1 and A_2 are the output activations of the final layer with d_o features.
  • In Figure 10, diagram (a) shows the network structure. DCCA optimizes the same objective as CCA (see the equation above).
  • diagram (a) depicts a DCCA-based architecture that maximizes the sum correlation in projection space by optimizing an equivalent loss, the trace norm objective (TNO) [63], and diagram (b) depicts a SoftCCA-based architecture that relaxes the orthogonality constraints by regularizing with soft decorrelation (Decorr) and optimizes the L2 distance in the projection space (equivalent to sum correlation with activations normalized to unit variance) [68].
  • Our TOCCA methods add a task loss and apply CCA orthogonality constraints by regularizing in two ways: diagram (c) TOCCA-W uses whitening and diagram (d) TOCCA-SD uses Decorr. The third method that we propose, TOCCA-ND, simply removes the Decorr components of TOCCA-SD.
  • Supervised CCA methods.
  • CCA, DCCA, and SoftCCA are all unsupervised methods to learn a projection to a shared space in which the data is maximally correlated. Although these methods have shown utility for discriminative tasks, a CCA decomposition may not be optimal for classification because features that are correlated may not be discriminative. Our experiments will show that maximizing the correlation objective too much can degrade performance on discriminative tasks.
  • CCA has previously been extended to supervised settings by maximizing the total correlation between each view and the training labels in addition to each pair of views [72, 73], and by maximizing the separation of classes [66, 70]. Although these methods incorporate the class labels, they do not directly optimize for classification.
  • Dorfer et al.'s CCA Layer (CCAL) is the closest to our method. It optimizes a task loss operating on a CCA projection; however, the CCA objective itself is only optimized during pre-training, not in an end-to-end manner [75].
  • Other supervised CCA methods are linear only [73, 72, 66, 74]. Instead of computing the CCA projection within the network, as in CCAL, we optimize the non-linear mapping into the shared space together with the CCA part.
  • TOCCA: task-optimal CCA
  • DCCA optimizes the sum correlation through an equivalent loss function (TNO)
  • TNO: trace norm objective, an equivalent loss function
  • the CCA projection itself is computed only after optimization. Hence, the projections may not be usable to optimize another task simultaneously.
  • the main challenge in developing a task-optimal form of deep CCA that discriminates based on the CCA projection is in computing this projection within the network - a necessary step to enable simultaneous training of both objectives.
  • We tackle this by focusing on the two components of DCCA: maximizing the sum correlation between activations A_1 and A_2, and enforcing orthonormality constraints within A_1 and A_2.
  • We combine the CCA and task-driven objectives as a weighted sum with a hyperparameter for tuning; a sketch of such a combined loss is shown below. This model is flexible, in that the task-driven goal can be used for classification [87, 88], regression [89], clustering [90], or any other task. See Table 6 below for an overview.
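  • The sketch below illustrates one such weighted combination in PyTorch, roughly in the spirit of the soft-decorrelation variant (TOCCA-SD): a cross-entropy task loss plus an L2 correlation term between the two projections and a soft decorrelation penalty within each view. The weighting constants, layer shapes, and the use of mean squared error are assumptions for illustration, not the patent's prescribed values.

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(A):
    """Soft decorrelation: penalize off-diagonal entries of the batch covariance."""
    A = A - A.mean(dim=0, keepdim=True)
    cov = (A.t() @ A) / (A.shape[0] - 1)
    return (cov - torch.diag(torch.diag(cov))).abs().sum()

def task_optimal_cca_loss(A1, A2, logits, labels, alpha=1.0, beta=0.1):
    """Weighted sum of a task loss, an L2 correlation loss between the two
    projections, and soft decorrelation within each view."""
    task = F.cross_entropy(logits, labels)
    corr = F.mse_loss(A1, A2)            # L2 distance in the shared space
    decorr = decorrelation_loss(A1) + decorrelation_loss(A2)
    return task + alpha * corr + beta * decorr

# toy projections from two view networks and logits from a classifier head
A1, A2 = torch.randn(32, 50), torch.randn(32, 50)
logits, labels = torch.randn(32, 2), torch.randint(0, 2, (32,))
print(task_optimal_cca_loss(A1, A2, logits, labels))
```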
  • Orthogonality constraints The remaining complications for mini-batch optimization are the orthogonality constraints, for which we propose three solutions, each handling the orthogonality constraints of CCA in a different way: whitening, soft decorrelation, and no decorrelation.
  • TOCCA-W Whitening
  • CCA applies orthogonality constraints to A_1 and A_2.
  • Decorrelated Batch Normalization (DBN) has previously been used to regularize deep models by decorrelating features [81] and inspired our solution.
  • ZCA Zero-phase Component Analysis
  • B_1 and B_2 are whitened outputs of A_1 and A_2, respectively.
  • TOCCA-W has a complexity of O(d_o^3), compared to O(d_o^2) for TOCCA-SD, with respect to the output dimension d_o.
  • d_o is typically small (≤ 100) and this extra computation is only performed once per batch.
  • the difference in runtime is less than 6.5% for a batch size of 100 or 9.4% for a batch size of 30 (see Table 7 below).
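  • A per-batch ZCA whitening step can be sketched as follows in PyTorch (in the method as described, the covariance is tracked as a running mean over batches rather than recomputed from a single toy batch as here).

```python
import torch

def zca_whiten(A, eps=1e-5):
    """ZCA whitening of activations A (batch x d_o): the output has approximately
    identity covariance, enforcing the CCA orthogonality constraints."""
    A = A - A.mean(dim=0, keepdim=True)
    cov = (A.t() @ A) / (A.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    W = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.t()
    return A @ W

B1 = zca_whiten(torch.randn(100, 50))          # toy batch of 100 projections, d_o = 50
print((B1.t() @ B1 / 99).diagonal().mean())    # close to 1
```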
  • TOCCA-ND is the most relaxed and directly attempts to obtain identical latent representations.
  • TOCCA-W whitening
  • TOCCA-SD soft decorrelation
  • Table 6 provides a comparison of our proposed task-optimal deep CCA methods with other related ones from the literature: DCCA [63], SoftCCA [68], and CCAL-L_rank [75].
  • CCAL-L_rank uses a pairwise ranking loss with cosine similarity to identify matching and non-matching samples for image retrieval - not classification.
  • A_1 and A_2 are mean-centered outputs from two feed-forward networks.
  • The covariance Σ is computed from a single (large) batch in DCCA, and as a running mean over batches for all other methods.
  • Each layer of our network consists of a fully connected layer, followed by a ReLU activation and batch normalization [86].
  • Our implementations of DCCA, SoftCCA, and Joint DCCA/DeepLDA [70] also use ReLU activation and batch normalization.
  • We modified CCAL-L_rank [75] to use a softmax function and cross-entropy loss for classification, instead of a pairwise ranking loss for retrieval, referring to this modification as CCAL-L_ce.
  • Hidden layer sizes were 500, 200, and 1,000, and output layer sizes were 50, 50, and 112, respectively, across the data sets.
  • Figure 11 depicts classification accuracy for different CCA methods.
  • sum correlation vs. cross-view classification accuracy on MNIST is depicted across different hyperparameter settings on a training set size of 10,000 for DCCA [63], SoftCCA [68], TOCCA-W, and TOCCA-SD.
  • For DCCA and SoftCCA, large correlations do not necessarily imply good accuracy.
  • the effect of batch size on classification accuracy is depicted for each TOCCA method on MNIST (training set size of 10,000), and the effect of training set size on classification accuracy for each method.
  • Our TOCCA variants out-performed all others across all training set sizes.
  • Figure 11 (right) plots the batch size vs. classification accuracy for a training set size of 10,000.
  • SoftCCA shows similar behavior ([68] did not test such small training set sizes).
  • Figure 12 shows the CCA projection of the left view for each method. As expected, the task-driven variant produced more clearly separated classes.
  • Figure 12 shows t-SNE plots for CCA methods on an example variation of MNIST. Each method was used to compute projections for the two views (left and right sides of the images) using 10,000 training examples. The plots show a visualization of the projection for the left view with each digit colored differently.
  • TOCCA-SD and TOCCA-ND (not shown) produced similar results to TOCCA-W.
  • Table 9 shows classification accuracy for different methods of predicting Basal genomic subtype from images or grade from gene expression. Linear SVM and DNN were trained on a single view, while all other methods were trained with both views. By regularizing with the second view during training, all TOCCA variants improved classification accuracy. The standard error is in parentheses.
  • Two tasks were assessed: (1) predicting Basal vs. non-Basal genomic subtype using images, which is typically done from GE, and (2) predicting grade 1 vs. 3 from GE, typically done from images. This is not a multi-task classification setup; it is a means for one view to stabilize the representation of the other.
  • this experiment used a static set of pre-trained VGG16 image features in order to assess the utility of the method.
  • the network itself could be fine-tuned end-to-end with our TOCCA model, providing an easy opportunity for data augmentation and likely further improvements in classification accuracy.
  • Table 10 lists the classification results with a baseline that used only the original input features for LDA.
  • The deep methods (e.g., DCCA and SoftCCA) followed by a softmax classifier consistently beat LDA by a large margin.
  • TOCCA-SD and TOCCA-ND produced equivalent results as a weight of 0 on the decorrelation term performed best.
  • TOCCA-W showed the best result with an improvement of 15% over the best alternative method.
  • TOCCA can also be used in a semi-supervised manner when labels are available for only some samples.
  • Table 11 lists the results for TOCCA-W in this setting. With 0% labeled data, the result would be similar to DCCA. Notably, a large improvement over the unsupervised results in Table 10 is seen even with labels for only 10% of the training samples.
  • TOCCA-W or TOCCA-SD performed the best, depending on the data set; both include some means of decorrelation, which provides an extra regularizing effect to the model, thereby outperforming TOCCA-ND.
  • TOCCA showed large improvements over state-of-the-art in cross-view classification accuracy on MNIST and significantly increased robustness when the training set size was small.
  • TOCCA provided a regularizing effect when both views were available for training but only one at test time.
  • TOCCA also produced a large increase over state-of-the-art for multi-view representation learning on a much larger data set, XRMB.
  • XRMB data set
  • sMVCCA Supervised multi-view canonical correlation analysis


Abstract

A method for image analysis using deep learning to predict breast cancer classes, includes receiving a test image. The method further includes generating test image instances from the test image. The method further includes inputting the test image instances into a convolutional neural network (CNN), which extracts features from each of the test image instances. The method further includes applying an instance support vector machine (SVM) to the features to predict breast cancer classes for each of the instances. The method further includes aggregating outputs from the instance SVM to predict breast cancer classes for the test image.

Description

DESCRIPTION
METHODS, SYSTEMS, AND COMPUTER READABLE MEDIA FOR IMAGE ANALYSIS WITH DEEP LEARNING TO PREDICT BREAST CANCER
CLASSES
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Patent Application Serial No. 62/696,464, filed July 11, 2018, and U.S. Provisional Patent Application Serial No. 62/757,746, filed November 8, 2018; the disclosures of each of which are incorporated herein by reference in their entireties.
STATEMENT OF FEDERAL SUPPORT
This invention was made with government support under Grant Numbers CA058223, CA148761, and CA179715 awarded by the National Institutes of Health. The government has certain rights in the invention.
TECHNICAL FIELD
The subject matter described herein relates to image analysis with deep learning. More particularly, the subject matter described herein relates to methods, systems, and computer readable media for image analysis with deep learning to predict breast cancer classes, where the classes include tumor grade, estrogen receptor (ER) status, histologic subtype, risk of recurrence, and intrinsic subtype.
BACKGROUND
Predicting breast cancer grade, ER status, histologic subtype, risk of recurrence, and intrinsic subtype can involve costly immunohistochemistry, genomic tests, and/or review by a pathologist that may not be available to all patients. Using image analysis to predict breast cancer classes, such as tumor grade, ER status, histologic subtype, risk of recurrence, and intrinsic subtype, is difficult because the image features needed to perform these predictions may not be apparent in the images. Accordingly, there exists the need for improved methods for image analysis to predict breast cancer classes.
SUMMARY
RNA-based, multi-gene molecular assays are available and widely used for patients with ER-positive/HER2-negative breast cancers. However, RNA-based genomic tests can be costly and are not available in many countries. Methods for inferring molecular subtype from histologic images may identify patients most likely to benefit from further genomic testing. To identify patients who could benefit from molecular testing based on haematoxylin and eosin (H&E) stained histologic images, we developed an image analysis approach using deep learning. A training set of 571 breast tumors was used to create image-based classifiers for tumor grade, ER status, PAM50 intrinsic subtype, histologic subtype, and risk of recurrence score (ROR-PT). The resulting classifiers were applied to an independent test set (n=288), and the accuracy, sensitivity, and specificity of each were assessed on the test set. Histologic image analysis with deep learning distinguished low-intermediate vs. high tumor grade (82% accuracy), ER status (84% accuracy), Basal-like vs. non-Basal-like (77% accuracy), Ductal vs. Lobular (94% accuracy), and high vs. low-medium ROR-PT score (75% accuracy).
Sampling considerations in the training set minimized bias in the test set. Incorrect classification of ER status was significantly more common for Luminal B tumors. These data provide proof of principle that molecular marker status, including a meaningful clinical biomarker (e.g., ER status), can be predicted with accuracy >75% based on H&E features. Image-based methods could be promising for identifying patients with a greater need for further genomic testing, or in place of classically scored variables typically accomplished using human-based scoring.
According to one aspect of the subject matter described herein, a method for image analysis using deep learning to predict breast cancer classes is provided. The method includes receiving a test image. The method further includes generating test image instances from the test image. The method further includes inputting the test image instances into a convolutional neural network (CNN), which extracts features from each of the test image instances. The method further includes applying an instance support vector machine (SVM) to the features to predict breast cancer classes for each of the instances. The method further includes aggregating outputs from the instance SVM to predict breast cancer classes for the test image.
The subject matter described herein can be implemented in software in combination with hardware and/or firmware. For example, the subject matter described herein can be implemented in software executed by a processor. In one exemplary implementation, the subject matter described herein can be implemented using a non-transitory computer readable medium having stored thereon computer executable instructions that when executed by a processor of a computer control the computer to perform steps. Exemplary computer readable media suitable for implementing the subject matter described herein include non-transitory computer-readable media, such as disk memory devices, chip memory devices, programmable logic devices, and application specific integrated circuits. In addition, a computer readable medium that implements the subject matter described herein may be located on a single device or computing platform or may be distributed across multiple devices or computing platforms.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1a is a histogram for probability of high grade tumor by image analysis according to proportion of pathologist-classified low-intermediate (black) or high grade (red) in the test set. The cut point of >0.80 was selected.
Figure 1b is a bee swarm plot displaying pathologist classification of tumor grade as a function of the image grade score in the test set. Points within each grade group are adjusted horizontally to avoid overlap. The black dots indicate image analysis classified low-intermediate tumor grade and the red dots indicate image analysis classified high grade tumors.
Figure 2 includes images of four H&E cores from a single patient and heat maps indicating the class predictions over different regions of the image. Class probabilities are indicated by the intensity of red/blue color with greater intensity for higher probabilities. Uncertainty in the prediction is indicated by white. This patient was labeled as high grade, ER negative, Basal-like intrinsic subtype, ductal histologic subtype, and high ROR.
Figure 3 is a block diagram illustrating a computing platform or computing cloud that includes a convolutional neural network (CNN)-based image classifier for analyzing images to predict breast cancer classes.
Figure 4 is a flow chart illustrating an exemplary process for image analysis with deep learning to predict breast cancer classes.
Figure 5 illustrates a multiple instance (MI) augmentation technique for training MI learning methods with a convolutional neural network.
Figure 6 illustrates an overview for an example MI learning technique.
Figure 7 shows classification accuracy using mean aggregation as the number of instances increase.
Figures 8a-8e depict visualizations of instance predictions for a sample with various ground truth labels.
Figures 9a-9b depict visualizations of predicted heterogeneity for various classes.
Figure 10 depicts various deep canonical correlation analysis (CCA) based architectures.
Figure 11 depicts classification accuracy for different CCA methods.
Figure 12 shows t-SNE plots for CCA methods.
DETAILED DESCRIPTION
Image-based features of breast cancers have an important role in clinical prognostics. For example, tumor grade is strongly associated with survivorship, even among tumors with other favorable prognostic features such as estrogen receptor positivity [1]. However, major advances in prognostication over the past decade have relied predominantly on molecular methods [2-4]. These methods are costly and are not routinely performed on all clinical patients who could benefit from advanced molecular tests. Methods for identifying patients who are likely to benefit from further molecular testing are needed.
Image analysis of H&E stained images could identify patients most likely to benefit from genomic testing. Several previous studies have utilized automated processing of H&E stained breast tumors to identify image features associated with survival. These approaches have largely focused on hand-crafted, user-designed features, such as statistics of shape and color, to capture cell by cell morphology, which are difficult to adapt to new data sets [5,6]. Prior work on automated grading addresses mitotic count [7], nuclear atypia [8], and tubule formation [9] individually; however, the latter two may require a time-consuming nuclear segmentation that is also difficult to adapt to new data sets. Feature learning on small image patches to identify novel features associated with survival has shown the utility of somewhat more complex features for breast [10] and other cancers [11, 12], but the focus of that work still remains on smaller-scale properties due to their use of small image patches. None of these approaches is able to capture larger scale features, such as tissue architecture, or properties that are too complex for humans to capture. These abstract features could provide unforeseen insights into prognostics.
Deep learning is a method of learning a hierarchy of features where the higher level concepts are built on the lower level ones. Automatically learning these abstract features enables the system to learn complex functions mapping an input to an output without the need for hand-crafted features. Significant advances in this area have begun to show promise for tumor detection [13], metastatic cancer detection in lymph nodes [14], mitosis detection [7, 15], tissue segmentation [16], and segmentation and detection of a number of tissue structures [17]. However, all of the previous successes of deep learning from H&Es have focused on detecting image-based properties that pathologists can routinely assess visually. Using deep learning to predict complex properties that are not visually apparent to pathologists, such as receptor status, intrinsic subtype or even risk of recurrence, has not been previously described.
We hypothesized that a deep learning method for image analysis could be applied to classify H&E-stained breast tumor tissue microarray (TMA) images with respect to histologic and molecular features. We used TMA images from the population-based Carolina Breast Cancer Study Phase 3 (2008-2013) to perform deep learning-based image analysis aimed at capturing larger scale and more complex properties including tumor grade, histologic subtype, ER status, intrinsic breast cancer subtype and a risk of recurrence score (ROR-PT) [2].
RESULTS
Training and test sets were established from a random division of the data using TMA cores from 2/3 (n=571 ) and 1/3 (n=288) of the eligible CBCS3 patients, respectively. There were no significant differences between the training and the test sets concerning patient or tumor characteristics (Table 1 ).
Across multiple 1.0-mm cores per patient, the probability of a tumor being classified as high grade by image analysis was calculated, and Figure 1 shows that a bimodal distribution of probabilities was observed. By establishing a cut point at >0.80, high grade tumors were detected with accuracy of 82% in the test set (kappa 0.64) (Figures 1a and 1b, Table 2).
Considering low/intermediate as a group, the percent agreement with pathologist-classified tumor grade was slightly lower than the percent agreement between two breast pathologists who independently reviewed these same patients (overall 89%, kappa 0.78). Tumors with pathologist-defined intermediate grade were more likely to be misclassified as high grade tumors by image analysis (37%), while only 7% of low-grade tumors were misclassified (results not shown). When comparing the misclassification of intermediate grade and low-grade tumors as high grade between two pathologists in a subset of CBCS tumors, errors in classification of intermediate grade tumors as high grade tumors occurred <10% of the time and never occurred for low grade tumors (results not shown).
Image analysis accuracy for predicting molecular characteristics was also high. Accuracy for ER status was 84% (kappa 0.64) and both sensitivity (88%) and specificity (76%) were high (Table 3).
However, tumor grade is strongly associated with ER status in most patient populations, and we were interested in increasing accuracy among patients with low-to-intermediate grade tumors where genomic testing is most likely to influence patient care. Thus, we also employed a training strategy that weighted samples to ensure that low and intermediate grade distributions were similar between ER-positive and ER-negative tumors. This reduced accuracy among high grade tumors (from 77% to 75%) and among low-intermediate grade tumors (from 91% to 84%). Using the same weighting strategy, we trained a classifier to predict Basal-like vs. non-Basal-like (Luminal A, Luminal B, HER2, Normal-like combined) PAM50 subtype (Table 4).
The classifier had overall accuracy of 77%, but accuracy of 85% among low-intermediate grade tumors and 70% among high grade tumors.
To examine the potential clinical relevance of using this image analysis technique, we determined the sensitivity and specificity of image analysis for predicting whether a tumor is classified as having a high vs. low-medium risk of recurrence score (ROR-PT) (Table 4). ROR-PT is determined using a combination of tumor information including PAM50 subtype, tumor proliferation, and tumor size [2]. Overall, the accuracy of image analysis for ROR-PT was high at 76% (kappa 0.47). In grade-stratified analyses, accuracy for ROR-PT was higher among low-intermediate grade tumors (86%) than high grade tumors (67%).
In addition to using image analysis to predict tumor grade, we also tested this approach using histologic subtype, another visual feature of the tumor (Table 4). Image analysis was able to predict a lobular compared to ductal tumor with 94% accuracy (kappa 0.66). The accuracy was slightly lower when restricted to low grade tumors (89%), but was non-estimable among high grade tumors as there were no high grade lobular tumors in the test set.
To evaluate which clinical factors were associated with the accuracy of the image-based metrics, we evaluated predictors of accurate/inaccurate ER status calls (Supplemental Table 1 ) among patients in the test set (n=288).
Supplemental Table 1. Patient and tumor characteristics associated with inaccuracy of predicted ER status from the test set (n=288)

Variable                  Inaccurate N (%¹)   Accurate N (%¹)   OR (95% CI)         Chi-squared p-value
Age                                                                                 0.72
  <50 years               23 (30.4)           110 (27.5)        Ref
  >50 years               20 (69.7)           135 (72.4)        0.71 (0.37-1.36)
Race                                                                                0.22
  White                   25 (84.3)           125 (77.6)        Ref
  Black                   18 (15.7)           120 (22.4)        0.75 (0.39-1.45)
Grade                                                                               0.17
  Low-Intermediate        19 (55.4)           143 (68.7)        Ref
  High                    24 (44.6)           101 (31.3)        1.79 (0.93-3.44)
  Missing                 0                   1
Stage                                                                               0.49
  I, II                   39 (85.8)           220 (91.1)        Ref
  III, IV                 4 (14.2)            25 (8.9)          0.90 (0.30-2.74)
Node Status                                                                         0.53
  Negative                32 (73.8)           159 (68.2)        Ref
  Positive                11 (26.2)           86 (31.8)         0.64 (0.31-1.32)
Tumor Size                                                                          0.38
  <2 cm                   25 (60.0)           149 (68.6)        Ref
  >2 cm                   18 (40.0)           96 (31.4)         1.12 (0.58-2.16)
IHC-based ER Status                                                                 0.07
  Negative                19 (37.8)           72 (20.2)         Ref
  Positive                24 (62.2)           173 (79.8)        0.53 (0.27-1.02)
IHC-based Ki67 Status                                                               0.24
  <10%                    22 (56.1)           154 (67.8)        Ref
  >10%                    21 (43.9)           91 (32.2)         1.61 (0.84-3.10)
Mitotic Grade                                                                       0.20
  1                       14 (40.1)           116 (58.3)        Ref
  2                       9 (20.9)            33 (12.4)         2.26 (0.90-5.68)
  3                       20 (39.0)           95 (29.3)         1.74 (0.84-3.64)
  Missing                 0                   1
Intrinsic Subtype                                                                   0.41
  Luminal A               5 (25.1)            69 (50.6)         Ref
  Luminal B               8 (26.5)            25 (20.0)         4.42 (1.32-14.77)
  Basal-like              9 (32.4)            40 (19.8)         3.10 (0.97-9.91)
  HER2                    2 (12.4)            13 (4.8)          2.12 (0.37-12.14)
  Normal-like             2 (3.6)             7 (4.7)           3.94 (0.64-24.2)
  Missing                 17                  91

¹ All percentages weighted for sampling design.
Considering age, race, grade, stage, lymph node status, ER status, Ki67 status, and mitotic tumor grade, no significant differences in accuracy of image-based ER assignment were observed. However, we found that image analysis tended to inaccurately predict ER status when tumors were Luminal B [OR (95% CI): 4.42 (1.32-14.77)].
We gained further insight into the performance of our method by examining the class predictions across cores from the same patient and within each core. Figure 2 shows four cores from a single patient, along with the class predictions over different regions of the image. While three cores are predicted ER negative and Basal-like intrinsic subtype, the fourth is predicted mostly ER negative and non-Basal-like, indicating that some intra-tumoral heterogeneity might be present between cores.
DISCUSSION
In this study, we used a deep learning approach to conduct image analysis on H&E stained breast tumor tissue microarray samples from the population-based Carolina Breast Cancer Study, Phase 3 (2008-2013). Further details on the image analysis techniques are given in the Methods section. First, we found that the agreement between image analysis and the pathologist-classified grade was only slightly lower than that observed for two study pathologists, and we obtained high agreement and kappa values. Second, we found that ER status, RNA-based molecular subtype (Basal-like vs. non-Basal like), and risk of recurrence score (ROR-PT) could be predicted with approximately 75-80% accuracy. Further, we found the image analysis accuracy to be 94% for ductal vs. lobular histologic subtype.
Previous literature based on comparing two pathologists shows that image assessment is subject to some disagreement [18], particularly among the intermediate-grade tumors, as we observed between the image analysis and pathologist classification in our study. Other groups have reported inter-rater kappa statistics of 0.6-0.7 for tumor grade [18,19], in line with both our inter-pathologist agreement and image analysis vs. pathologist agreement for grade. Elsewhere in the literature lower kappa values around 0.5 have been reported between pathologists for histologic grade [20]. In light of this inherent variability in image assessment, deep learning-based image analysis performed well at predicting tumor grade as low-intermediate vs. high using H&E images.
It is particularly promising that histologic subtype and molecular marker status could be predicted using image analysis. While we did perform grade-weighting within ER classification, there may be other image features of ER positive tumors that are not readily discernible and are driving the higher accuracy of ER positive images over ER negative. Agreement between true ER status (by IHC) vs. image analysis (kappa 0.64) was slightly lower than that observed for centralized pathology and SEER classifications for ER status (kappa 0.70) [21] and is similar to reports of agreement between different IHC antibodies for ER that show substantial agreement (kappa 0.6-0.8) [22]. Previous work with CBCS phase 1 samples found that agreement between medical records and staining of tissues was also similar (kappa of 0.62) [23]. Overall, the agreement between IHC-based ER status and image analysis predictions based on H&E stained images is similar to estimates for comparing ER status classification in the literature. The high rate of agreement between pathologist-scored and image analysis based histologic subtype was also compelling (kappa 0.64). Together these results suggest that some latent features indicative of underlying tumor biology are present in H&E images and can be identified through deep learning-based approaches.
We observed high accuracy of image analysis to predict ductal versus lobular histologic subtype. The high accuracy may be due to the arrangement of epithelial and stromal cells characteristic of ductal and lobular tumors, whereby lobular tumors are characterized by non-cohesive single file lines of epithelial cells infiltrating the stroma and ductal tumors are characterized by sheets or nests of epithelial cells embedded in the surrounding stroma [24,25]. We speculate that the high contrast staining between the epithelial and stromal components resulting from H&E staining strengthens the ability of image analysis to predict this biologic feature of the tumor.
With respect to intrinsic PAM50 subtype based solely upon gene expression values, previous studies have not evaluated image-based analysis for predicting intrinsic subtype or the risk of recurrence using a score-based method, ROR-PT [2]. A few previous studies have evaluated the clinical record or a central immunohistochemistry laboratory vs. RNA-based subtyping for Basal-like vs. non-Basal-like. Even considering two molecular comparisons, agreements do not exceed 90%. For example, Allott et al. (2016) found approximately 90% agreement between Basal-like status for IHC-based vs. RNA-based assessment and 77% agreement for classification of Luminal A subtype [26]. Our estimates are similar, suggesting that image analysis, even without the use of special IHC stains, could be a viable option for classification of molecular breast tumor subtype and ROR-PT from H&E stained images.
As with other studies, our work should be viewed in light of some limitations. Our sample size was limited in our testing set to 288 patients, but this resulted in nearly 1,000 TMA cores available for use in our image analysis. Using a larger set of samples with data on RNA-based subtype to balance training for each predictor could be useful. For example, the fact that Luminal B patients had a higher error rate might suggest there are some features of Luminal B breast cancers that are distinct and image-detectable, and a larger sample size would be helpful in identifying these. Deep learning may be utilizing these features, but in our small sample set, we are unable to tune our data to specifically identify those features or to clarify what they are in intuitive language. Additionally, the use of binary classification systems for training our digital algorithms (e.g., Basal-like vs. non-Basal-like) does not allow us to differentiate among all five RNA-based intrinsic subtypes. Currently, U.S. based genomic tests provide continuous risk scores, but also suggest relevant cut points that in essence make these assays almost a binary classification; thus, binary classification may have some utility in the current clinical context. However, future work should extend these approaches to multi-class classification. Furthermore, improved results may be obtained by fine-tuning the convolutional neural network for breast cancer H&E image classification.
Image-based risk prediction has potential clinical value. Gene expression data on tumor tissue samples is not uniformly available for all patients and is costly to obtain in both a clinical and epidemiologic setting. These results suggest that tumor histology and molecular subtype along with the risk of recurrence (ROR-PT) can be predicted from H&E images alone in a high-throughput, objective, and accurate manner. These results could be used to identify patients who would benefit from further genomic testing. Furthermore, even ER testing is not routinely performed in countries with limited laboratory testing resources, and predicting ER status by morphologic features may have utility for guiding endocrine therapy in low-resource settings.
METHODS
Sample Set. The training and test sets were both comprised of participants from the Carolina Breast Cancer Study (CBCS), Phase 3 (2008-2013). Methods for CBCS have been described elsewhere [27]. Briefly, CBCS recruited participants from 44 of the 100 North Carolina counties using rapid case ascertainment via the North Carolina Central Cancer Registry. After giving informed consent, patients were enrolled under an Institutional Review Board protocol that maintains approval at the University of North Carolina. CBCS eligibility criteria included being female, a first diagnosis of invasive breast cancer, aged 20-74 years at diagnosis, and residence in specified counties. Patients provided written informed consent to access tumor tissue blocks/slides and medical records from treatment centers.
The training and test sets were formed by a random partition of the data. The total number of patients available for the training and test set from CBCS3 was 1,203. These patients were divided into a group of 2/3 (n=802) for the training set and 1/3 (n=401) for the test set. Of the 802 patients available for the training set, 571 had H&E images and biomarker data available for contribution to the training set. Of the 401 patients eligible for the test set, 288 had H&E images and biomarker data available. Patients in the final training and test sets had information for tumor grade and histologic subtype, determined via centralized breast pathologist review within CBCS, along with biomarker data for ER status, PAM50 intrinsic breast cancer subtype, and risk of recurrence (ROR-PT) where noted. The H&E images were taken from tissue microarrays constructed with 1-4 1.0-mm cores for each patient, resulting in 932 core images for the test set analysis presented here. ER status for each TMA core was determined using a digital algorithm as described by Allott et al. (2016) [28] and was defined using a ≥10% positivity cut point for immunohistochemistry staining.
Tumor Tissue Microarray Construction. As has been described in detail by Allott et al. (2016), tumor tissue microarrays were constructed for CBCS3 participants with available paraffin-embedded tumor blocks [26]. The CBCS study pathologist marked areas of invasive breast cancer within a tumor on H&E stained whole slide images. The marked areas were selected for coring and 1-4 tumor tissue cores per participant were used in the TMA construction at the Translational Pathology Laboratory at UNC. TMA slides were H&E stained and images were generated at 20x magnification. Cores with insufficient tumor cellularity were eliminated from the analysis.
Molecular Marker Data. In CBCS3, Nanostring assays were carried out on a randomly sampled subset of available formalin fixed paraffin embedded (FFPE) tumor tissue cores. RNA was isolated from two 1.0-mm cores from the same FFPE block using the Qiagen RNeasy FFPE kit (catalogue # 73504). Nanostring assays, which use RNA counting as a measure of gene expression, were conducted. RNA-based intrinsic subtype was determined using the PAM50 gene signature described by Parker et al. (2009) [2]. Based on the highest Pearson correlation with a subtype-defined centroid, each tumor was categorized into one of five intrinsic subtypes (Luminal A, Luminal B, HER2, Basal-like, Normal-like), using the 50 gene PAM50 signature [27]. Categorizations were based on a previously validated risk of recurrence score, generated using PAM50 subtype, tumor proliferation, and tumor size (ROR-PT), with a cutoff for high of 64.7 from the continuous ROR-PT score [2].
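To make the subtype assignment step concrete, the following is a minimal sketch of nearest-centroid classification by Pearson correlation, as described above; the `centroids` mapping and the function name are illustrative placeholders and not the published PAM50 implementation.

```python
# Sketch: assign an intrinsic subtype by the highest Pearson correlation between a
# tumor's 50-gene expression profile and each subtype centroid.
# `centroids` is an assumed mapping from subtype name to a 50-gene centroid vector.
import numpy as np

def assign_subtype(expression, centroids):
    corr = {name: np.corrcoef(expression, c)[0, 1] for name, c in centroids.items()}
    return max(corr, key=corr.get)   # subtype with the largest correlation
```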
Image Analysis Pre-processing and Feature Extraction. Color and intensity normalization was first applied to standardize the appearance across core images, countering effects due to different stain amounts and protocols, as well as slide fading [29]. The resulting stain intensity channels were then used as input to the rest of our algorithm. Most automated analyses of histology images use features that describe the properties of cells such as statistics of shape and color [5], [30-32]. Such features are focused on cell-by-cell morphology and do not adapt well to new data sets. We instead captured tissue properties with a Convolutional Neural Network (CNN), which has been shown to be more successful for classification tasks on histology [16,33]. These multi-layered networks consist of convolution filters applied to small patches of the image, followed by data reduction or pooling layers. Similar to human visual processing, the low level filters detect small structures such as edges and blobs. Intermediate layers capture increasingly complex properties like shape and texture. The top layers of the network are able to represent object parts like faces or bicycle tires. The convolution filters are learned from data, creating discriminating features at multiple levels of abstraction. There is no need to hand craft features. We used the VGG16 architecture (configuration D) [34] that was pre-trained on the ImageNet data set, which consists of 1.2 million images from 1000 categories of objects and scenes. Although ImageNet contains a vastly different type of image, CNNs trained on this data set have been shown to transfer well to other data sets [35, 57-58], including those from biomedical applications [14, 59]. The lower layers of a CNN are fairly generic, while the upper layers are much more specialized. The lower layers only capture smaller-scale features, which do not provide enough discriminating ability, while the upper layers are so specific to ImageNet that they do not generalize well to histology. Intermediate layers are both generalizable and discriminative for other tasks. In transferring to histology, we search for the layer that transfers best to our task. Output from each set of convolutional layers, before max pooling, was extracted over each image at full resolution to form a set of features for the image. Output from the fourth set of convolutional layers was chosen because it performed better than the outputs from other layers. The fourth set of convolutional layers outputs features of dimension 512. These lower CNN layers are convolutional, meaning that they can be run on any image size. For an image size of 2500x2500, they produce a grid of 284x284x512 features.
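As an illustration of this feature extraction step, the following sketch pulls intermediate features from an ImageNet-pretrained VGG16 using torchvision; the layer index chosen for the fourth convolutional block and the weights enum are assumptions tied to recent torchvision releases, not the study's original code.

```python
# Sketch: extract intermediate VGG16 features from a full-resolution, stain-normalized
# core image. features[:23] covers convolutional blocks 1-4, stopping before the
# fourth max-pooling layer; its output has 512 channels.
import torch
import torchvision

vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
conv4 = torch.nn.Sequential(*list(vgg.features.children())[:23]).eval()

@torch.no_grad()
def extract_features(image):              # image: (3, H, W) float tensor
    fmap = conv4(image.unsqueeze(0))      # (1, 512, ~H/8, ~W/8) feature grid
    return fmap.squeeze(0)                # one 512-d feature vector per grid location
```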
Model Training and Training Data Sets. In training a model to predict the class or characteristic group of a tumor, such as high or low grade, we utilize patient-level labels. The TMA images may be much larger than the required input to the VGG16 CNN (e.g., typically 2500x2500 pixels for TMA spots vs. 224x224 for VGG16). Further, applying the original CNN fully convolutionally would produce features that are not generalizable to histology. Thus some modifications to the VGG16 approach are necessary. A new classifier may be trained to operate on the intermediate level features from VGG16. Simply taking the mean of each feature over the image would limit our insight into which parts of the image contributed to the classification. The patient-level labels are weak compared to detailed patch- or pixel-level annotations used in most prior work, necessitating a different classification framework called multiple instance learning. In this setting, we were given a set of tumors, each containing one or more image regions. We were given a label for each tumor: tumor grade (pathologist determined), ER status (IHC-based), PAM50 intrinsic subtype (50 gene expression-based), ROR-PT (gene expression-based), or histologic subtype (pathologist determined). Due to the diverse appearance of tissue in a single image, learning the model with the patient label applied to every image region did not perform well in initial experiments. Heterogeneity of image region labels in each image is instead accounted for while training the model.
In order to account for intra-tumor heterogeneity, a probabilistic model was formed for how likely each image region is to belong to each class, with these probabilities aggregated across all image regions to form a prediction for the tumor as a whole. Image regions were generated as 800x800 pixel regions in the training images, with the mean of each CNN feature computed over the region. A linear Support Vector Machine (SVM) [39] calibrated with isotonic regression [40] was used to predict the probability for each region. Isotonic regression fits a piecewise-constant non-decreasing function, transforming the distance from the separating hyperplane learned by the SVM to a probability that an image region belongs to each class. This assumes that the SVM can rank image regions accurately and only needs the distances converted to probabilities. Each image region was labeled with the class of the tumor from which it was taken. The data for model fitting and calibration may be disjoint, so cross-validation was used to split the training instances into five equal-sized groups, where four were used for training and the remaining one for calibration/validation (the test set remains untouched). For each fold, an SVM was learned on the training set and calibration was learned on the calibration set with isotonic regression, thus forming an ensemble. An ensemble of size five was selected to balance the desirability of a large training set, a reasonably sized validation set, and the simultaneous desirability of limiting the computation time. Predictions on the test set were made by averaging probabilities from the five models. This ensemble method also helped to soften any noise in the predictions caused by incorrect image region labels due to heterogeneity. Predictions for tumors were made by first forming a quantile function (inverse cumulative distribution) of the calibrated SVM ensemble predictions for the image regions using 16 equally spaced quantiles from images in the training set. The quantiles of the training images were used to train another linear SVM to predict the class label for the whole tumor, with sigmoid calibration transforming the SVM output into probabilities. This method allowed predictions to be made for individual image regions, while also aggregating to overall tumor predictions.
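The two-stage classifier described above can be approximated with scikit-learn, as in the sketch below; the variable names (X_regions, y_regions, groups) and hyperparameters are illustrative, and CalibratedClassifierCV is used here as a stand-in for the five-fold SVM/isotonic-regression ensemble rather than the exact study code.

```python
# X_regions: (n_regions, 512) mean CNN features per 800x800 region
# y_regions: label of the tumor each region came from; groups: tumor id per region
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Stage 1: instance-level linear SVM calibrated with isotonic regression via 5-fold CV,
# which yields an ensemble of five (SVM, calibration) pairs whose probabilities are
# averaged at prediction time.
instance_clf = CalibratedClassifierCV(LinearSVC(C=1.0), method="isotonic", cv=5)
instance_clf.fit(X_regions, y_regions)

def quantile_features(probs, n_quantiles=16):
    """Summarize a tumor's region probabilities with 16 equally spaced quantiles."""
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.quantile(probs, qs)

# Stage 2: tumor-level linear SVM on the quantile functions, with sigmoid calibration.
tumor_ids = np.unique(groups)
Q = np.stack([quantile_features(instance_clf.predict_proba(X_regions[groups == t])[:, 1])
              for t in tumor_ids])
y_tumor = np.array([y_regions[groups == t][0] for t in tumor_ids])
tumor_clf = CalibratedClassifierCV(LinearSVC(C=1.0), method="sigmoid", cv=5)
tumor_clf.fit(Q, y_tumor)
```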
When training the previously described SVM classifiers, we initially weighted each class, including tumor grade, ER status, and Basal-like vs. non-Basal-like intrinsic subtype, equally. To reduce the leverage of grade in predicting ER status and intrinsic subtype, sample weighting was applied using weights inversely proportional to the number of samples in the group, e.g., low grade class 1 , low grade class 2, high grade class 1 , and high grade class 2 were each weighted equally, where the classes are the ER status, histologic subtype, or intrinsic subtype.
Prediction in Test Sets. At test time, 800x800 pixel overlapping regions with a stride of 400 pixels were used as image regions from each TMA spot that is typically 2500 pixels in diameter. Only image regions containing at least 50% tissue within the core image field of view (i.e. 50% tissue, 50% glass) were used. The calibrated SVM ensemble predicted the class of each image region by assigning a probability of belonging to one of two classes (tumor grade 1 or 3, ER+ or ER-, Basal-like or non-Basal-like subtype, ductal or lobular histologic subtype, and low-med or high ROR-PT). The probabilities computed on the image regions from all cores were aggregated into a quantile function and the second SVM was used to predict the class for the whole tumor.
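A minimal sketch of this test-time tiling step follows, assuming a binary tissue mask of the same size as the core image; the function and parameter names are illustrative.

```python
# Sketch: tile a TMA core into overlapping 800x800 regions with stride 400 and keep
# only regions that are at least 50% tissue according to the foreground mask.
import numpy as np

def tile_regions(image, tissue_mask, size=800, stride=400, min_tissue=0.5):
    regions = []
    h, w = tissue_mask.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            window = tissue_mask[top:top + size, left:left + size]
            if window.mean() >= min_tissue:            # at least 50% tissue, at most 50% glass
                regions.append(image[top:top + size, left:left + size])
    return regions
```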
Image-based Classification. Cut points were determined for each tumor characteristic based on the achievement of optimal sensitivity, specificity, and accuracy of each core being correctly classified relative to the pathology or biomarker data. To classify tumor grade, image analysis assigned a probability score of being a high grade vs. low grade tumor for each image. A cut point of greater than 0.80 was used for high grade tumors (Figure 1 a). Independently, traditional pathologist scoring methods were used to classify tumors as a combined grade of low, intermediate, or high. Also, two independent pathologists’ classifications of tumor grade for the same tissue sample were assessed to compare the agreement between two pathologists to that observed for image analysis versus pathologist classification. To classify patients as ER positive based on image analysis, the same principles were used as those described for tumor grade where each core was assigned a probability of ER-positivity. A probability of greater than 0.50 was classified as ER-positive by image analysis. To classify patients as ER positive based on biomarker data, samples had to have 10% or more of nuclei stained positive for ER by immunohistochemistry. For Basal-like vs. non-Basal-like RNA-based subtype, image analysis assigned a probability of each image being Basal-like and a probability cut point of >0.60 was used to classify Basal-like vs. non-Basal-like tumors. These results were compared against the PAM50-based intrinsic subtype classification methods using gene expression described previously [2] Similarly, we used image analysis to predict whether a tumor had a high or low-medium risk of recurrence. Image analysis predicted ROR-PT was based on a cut point of 0.20 for the probability of each TMA spot being classified as high ROR-PT. Histologic subtype was restricted to ductal and lobular tumors and was based on a cut point of 0.1 for the probability of each TMA spot being classified as lobular.
Prediction Accuracy and Associations with Clinical Characteristics. For core-level comparisons, image region probabilities of being a high grade tumor, ER positive, Basal-like subtype, lobular subtype, or high ROR-PT were aggregated to the core level. For each variable, sensitivity, specificity, accuracy and kappa statistics (95% confidence interval [95% CI]) were determined comparing the image analysis classification to tumor grade for the tumor tissue as a whole, IHC-based ER status for each corresponding TMA core (ER positivity is available for each core rather than just for the whole tumor tissue), PAM50 subtype for the tumor tissue as a whole, histologic subtype for the tumor tissue as a whole, and ROR-PT for the tumor tissue as a whole. Accurate classification was defined as identical classification based on histologic image analysis and biomarker data for the same core. To determine whether any clinical characteristics were associated with an inaccurate image-based call for ER status, we estimated odds ratios (ORs) and 95% confidence intervals (95% CI) for the association between patient characteristics and the accuracy of ER status (i.e., concordant with clinical status vs. discordant with clinical status) (Supplemental Table 1). All statistical analyses were done in SAS version 9.4 (SAS Institute, Cary, NC). P-values were two-sided with an alpha of 0.05.
Figure 3 is a block diagram of a computing platform with a CNN-based image classifier for predicting breast cancer classes, where the classes include tumor grade, ER status, histologic subtype, and intrinsic subtype through image analysis. Figure 4 is a flow chart of using the CNN-based image classifier in Figure 3 to predict breast cancer classes. Referring to Figure 3, a test image 300 is provided as input to a CNN-based image classifier 302. Test image 300 may be an H&E stained histologic image as described above. CNN-based image classifier 302 may include a convolutional neural network 304, such as the VGG16 network, trained using the steps described above. CNN-based image classifier 302 may be implemented on a computing platform 306 including at least one processor 308 and memory 310.
Referring to Figures 3 and 4, prior to inputting image data into CNN 304, in step 400, instances are generated from test image 300. In one example, generating instances may include dividing test image 300 into regions of a predetermined size. In the example described above, the predetermined size is 800x800 pixels.
In step 402, features are extracted from each instance. Extracting features may include providing each image instance as input to CNN 304, which generates features as output for each image instance.
In some embodiments the CNN may be trained using multiple instance (MI) learning and/or MI aggregation. For example, the CNN may be trained end-to-end using MI learning. In this example, the CNN may be trained using an MI aggregation technique that aggregates predictions from smaller regions of an image into an image-level classification by using the quantile function. Additional details regarding MI learning, MI aggregation, and/or related aspects are discussed below. In some embodiments, the CNN may be trained using genomic data sets and a task-based canonical correlation analysis (CCA) method. For example, a CNN component may be configured to use imaging data and a related genomic data set along with a task-based CCA method (e.g., a task-optimal CCA method) to project data from the two sources to a shared space that is also discriminative. In this example, the CNN component may be trained for use in improving accuracy when identifying or extracting features from image instances during run-time. Additional details regarding a task-based CCA method and/or related aspects are discussed below.
In step 404, image classes are predicted for each instance. To predict the image classes for each instance, the CNN-generated features for each instance are input into an instance support vector machine 312, which outputs a probability as to whether the instance is a member of each class. For example, for each of the image instances, instance SVM 312 may output probabilities as to whether the instance belongs to one or the other of the binary classes for tumor grade (high grade or low grade), ER status (ER positive or ER negative), histologic subtype (ductal or lobular), risk of recurrence (high versus low), and intrinsic subtype (Basal-like or non-Basal-like).
In step 406, the instance predictions are aggregated to produce predictions for the entire image. The probabilities computed for each image instance by instance SVM 312 are aggregated into a quantile function 314, and an aggregation SVM 316 that operates on quantile function 314 is used to generate probabilities of each class for the entire image.
In step 408, tumor class is predicted. Using the probabilities generated in step 406, CNN-based image classifier 302 generates output as to whether the tumor is in one or the other of the above-described binary classes. To assign the tumor to one or the other of each of the binary classes, cut points or threshold values may be assigned for each class. The probability computed for each class may then be compared to the cut point or threshold to determine whether or not the tumor is a member of the class. For example, as described above, the cut point for classifying a tumor as high grade may be 0.8. Accordingly, if the probability computed for tumor grade for a given image is 80% or higher, the tumor may be classified as high grade. Thus, using deep learning for class predictions and instance aggregations, tumors can be classified from image analysis alone to identify the need for further diagnostic tests (such as genomic tests) and/or treatment.
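For illustration, the cut points described in the Image-based Classification subsection can be collected and applied to the aggregated tumor-level probabilities as in the following sketch; the dictionary keys are hypothetical names, and the threshold values follow those reported above.

```python
# Sketch: apply per-class probability cut points to tumor-level probabilities.
CUT_POINTS = {"high_grade": 0.80, "er_positive": 0.50, "basal_like": 0.60,
              "high_ror_pt": 0.20, "lobular": 0.10}

def classify_tumor(probabilities):
    """probabilities: dict mapping class name -> aggregated tumor-level probability."""
    return {name: probabilities[name] >= cut for name, cut in CUT_POINTS.items()}
```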
MULTIPLE INSTANCE (MI) LEARNING INTRODUCTION
MI learning with a convolutional neural network enables end-to-end training in the presence of weak image-level labels. We propose a new method for aggregating predictions from smaller regions of the image into an image-level classification by using the quantile function. The quantile function provides a more complete description of the heterogeneity within each image, improving image-level classification. We also adapt image augmentation to the MI framework by randomly selecting cropped regions on which to apply MI aggregation during each epoch of training. This provides a mechanism to study the importance of MI learning. We validate our method on five different classification tasks for breast tumor histology and provide a visualization method for interpreting local image classifications that could lead to future insights into tumor heterogeneity.
Deep learning has become the standard solution for classification when a large set of images with detailed annotations is available for training. When the annotations are weaker, such as with large, heterogeneous images, we turn to MI learning. The image (called a bag) is broken into smaller regions (called instances). We are given a label for each bag, but the instance labels are unknown. Some form of pooling aggregates instances into a bag-level classification. By integrating MI learning into a convolutional neural network (CNN), we can learn an instance classifier and aggregate the predictions so the entire system is trained end-to-end [47, 52, 45].
We propose a more general approach for aggregating instance predictions that looks at the full distribution by pooling with the quantile function (QF) and learning how much heterogeneity to expect for each class. As data augmentation is especially useful in training large CNNs, we also created an augmentation technique for training MI methods with a CNN (see Figure 5). Through MI augmentation, we study the importance of the MI formulation during training. Using MI learning to make class predictions over smaller regions of the image provides insight into how different parts of the image contribute to the classification. Visualizing the instance predictions provides a method of interpretability that we demonstrate on a data set of breast tumor tissue microarray (TMA) images stained with hematoxylin and eosin (H&E) by predicting grade, receptor status, and subtype. Some of these tasks are not previously known to be achievable from H&E alone. Our quantitative results conclude that the MI component may be very useful for successful classification, demonstrating the importance of accounting for heterogeneity. This method could provide future insights into tumor heterogeneity and its connection with cancer progression [43, 49].
Figure 5 depicts MI augmentation. In MI learning, each bag contains one or more instances. Labels are given for the bag, but not the instances. MI augmentation is a technique to provide additional training samples by randomly selecting a cropped image region and the instances within it. When the bag label is applied to a small number of instances, it is weak because this small region may not be representative of the bag class. Applying the bag label to larger cropped regions provides a stronger label, while still providing benefit from image augmentation. Training with the whole image maximizes the opportunity for MI learning, but restricts the benefits of image augmentation. At test time, the whole image is processed and the predictions from all instances are aggregated into a bag prediction.
Contributions. 1) A more general MI aggregation method that uses the quantile function for pooling and learns how to aggregate instance predictions. 2) An MI augmentation technique for training MI methods. 3) Exploration of single instance and MI learning on a continuous spectrum, demonstrating the importance of MI learning on heterogeneous images. 4) Evaluation on a large data set of 1713 patient samples (5970 images), showing significant gains in classifying breast cancer TMAs. 5) A method for visualizing the predictions of each instance, providing interpretability to the method.
Aggregating Instance Predictions. A permutation invariant pooling of instances is needed to accommodate images of different sizes, which a fully connected neural network cannot. Existing pooling approaches are very aggressive; they compute a single number rather than looking at the distribution of instance predictions. Most MI applications use the maximum, which works well for problems such as cancer diagnosis where, if there is a small amount of tumor, the sample is labeled as cancerous [46, 56]. A smooth approximation, such as the generalized mean or noisy-OR, provides better convergence in a CNN [47, 52, 45]. For other tasks, a majority vote, median, or mean is more appropriate. We include more of the distribution by pooling with the QF and learning a mapping to the bag class prediction, improving the classification accuracy. Our proposed method of quantile aggregation learns how to predict the bag class from instance predictions and so could provide a solution when the most suitable aggregator is unknown. The QF is a new general type of feature pooling that could provide an alternative to max pooling in a CNN.
Training MI Methods with a CNN. In some embodiments, image augmentation may be applied in training a CNN by randomly cropping large portions of each image during each epoch. At test time, the whole image is used. We propose MI augmentation, in which a subset of instances is randomly selected from each bag during each epoch. Instances may be the same size, but we choose how many instances to aggregate over. In selecting the number of instances, there may be two extremes: a single instance vs. the whole bag. In the former, the bag label is assigned to each instance and is often called single instance learning. In the latter, MI aggregation is incorporated while training the bag classifier as in other MI methods [41, 44]. Comparison studies have found little or no improvement from these MI methods on some data sets [54, 55]. We found MI learning to be very beneficial and show that it can be useful in dealing with heterogeneous data.
MULTIPLE INSTANCE LEARNING WITH A CNN
We denote a bag by X, its label by Y ∈ {1, 2, ..., C}, and the instances it contains by x_n for n = 1, ..., N. The instance labels y_n are unknown. On a novel sample, an instance classifier f_inst predicts the probability of each class c and a function f_agg aggregates these instance probabilities into a bag probability:

P(Y = c | X) = f_agg(f_inst(x_1), ..., f_inst(x_N)).
MI learning can be implemented with many different types of classifiers [41, 46, 54]. When implemented as a CNN, a fully convolutional network (FCN) forms the instance classifier f_inst, followed by a global MI layer for instance aggregation f_agg. The FCN consists of convolutional and pooling layers that downsize the representation, followed by a softmax operation to predict the probability for each class. For an input image of size w × w × 3, the FCN output is w_d × w_d × C. An instance is defined as the receptive field from the original image used in creating a point in this w_d × w_d grid; the instances are overlapping. The MI aggregation layer takes the instance probabilities and the foreground mask for the input image (downscaled to w_d × w_d), thereby aggregating over only the foreground instances.
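A minimal PyTorch sketch of this architecture is given below, using mean aggregation over foreground instances as defined in the aggregation subsection; the backbone, class names, and tensor shapes are illustrative rather than the exact network used in the experiments.

```python
# Sketch: FCN instance classifier producing a grid of per-class probabilities,
# followed by global MI aggregation over foreground instances only.
import torch
import torch.nn as nn

class MIClassifier(nn.Module):
    def __init__(self, backbone, n_features, n_classes):
        super().__init__()
        self.backbone = backbone                               # FCN trunk (conv + pooling)
        self.instance_head = nn.Conv2d(n_features, n_classes, kernel_size=1)

    def forward(self, image, fg_mask):
        # image: (B, 3, w, w); fg_mask: (B, 1, w, w) float foreground mask
        fmap = self.backbone(image)                            # (B, F, wd, wd)
        inst_probs = self.instance_head(fmap).softmax(dim=1)   # (B, C, wd, wd)
        mask = nn.functional.interpolate(fg_mask, size=fmap.shape[-2:])  # downscale mask
        # mean aggregation over foreground instances
        bag_probs = (inst_probs * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1)
        return inst_probs, bag_probs
```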
Figure 6 illustrates an overview for an example MI learning technique. During training, a cropped region of a given size is randomly selected. An FCN is applied to predict the class, producing a grid of instance predictions. The instance predictions are aggregated over the foreground of the image (as indicated by the foreground mask) using quantile aggregation to predict the class of the cropped image region. With a cross entropy loss applied, backpropagation then learns the FCN and aggregation function weights. At test time, the whole image is used.
MULTIPLE INSTANCE AGGREGATION
Instance predictions can be used to form a bag prediction in different ways. The bag prediction function should be invariant to the number and spatial arrangement of instances, so some pooling of predictions is needed. Mean aggregation is well suited for global pooling as it is permutation invariant and can incorporate a foreground mask for the input image. Denoting the mask as M and its value for each instance as m_n ∈ {0, 1}, the mean aggregation function is

P(Y = c | X) = ( Σ_{n=1}^{N} m_n f_inst(x_n)_c ) / ( Σ_{n=1}^{N} m_n ).
Mean pooling incorporates predictions from all instances, but a lot of information may be lost in compressing to a single number. A histogram is a more complete description of the probability distribution, but is dependent upon a suitable bin width. Alternatively, the QF (inverse cumulative distribution) represents the boundary points between fractions of the population, providing a better discretization [42]. We propose quantile aggregation to provide a more complete description of the instance predictions in a bag. If the instance predictions for class c are represented by S_c = {s_{1,c}, ..., s_{N,c}}, then the q-th Q-quantile is the value z such that Pr(S_c < z) = (q − 0.5)/Q. To pool with the QF, we first sort S_c and exclude instances not in the foreground, leaving the set S̃_c = {s̃_{1,c}, ..., s̃_{Ñ,c}}. The sorted values in S̃_c are used to extract the QF vector for each class c as z_c = [z_{1,c}, ..., z_{Q,c}], where

z_{q,c} = s̃_{j,c}  with  j = ⌈ Ñ (q − 0.5)/Q ⌉.
The QF vectors for all classes are concatenated as Z = [z_1, ..., z_C]. We then use a softmax function operating on Z to predict the bag class. The QF from all classes is used in order to learn the interaction of different subtypes in a bag. Backpropagation through the QF operates in a similar manner to max pooling by passing the gradient back to the instance that achieved each quantile.
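The following sketch implements quantile aggregation as a differentiable layer; reading the final softmax as a linear layer followed by softmax is an assumption, and the class and variable names are illustrative.

```python
# Sketch: QF pooling over foreground instance predictions. Sorting and indexing pass
# gradients back to the instances that achieved each quantile, analogous to max pooling.
import torch
import torch.nn as nn

class QuantileAggregation(nn.Module):
    def __init__(self, n_classes, n_quantiles=15):
        super().__init__()
        self.n_quantiles = n_quantiles
        self.bag_head = nn.Linear(n_classes * n_quantiles, n_classes)

    def forward(self, inst_probs):
        # inst_probs: (N, C) predictions for the N foreground instances of one bag
        n = inst_probs.shape[0]
        sorted_probs, _ = torch.sort(inst_probs, dim=0)                       # ascending
        q = (torch.arange(self.n_quantiles, dtype=torch.float32) + 0.5) / self.n_quantiles
        idx = (q * n).long().clamp(max=n - 1)                                 # quantile indices
        z = sorted_probs[idx]                                                 # (Q, C) QF vectors
        return self.bag_head(z.t().reshape(1, -1)).softmax(dim=1)            # bag class probabilities
```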
TRAINING WITH MULTIPLE INSTANCE AUGMENTATION
Image augmentation by random cropping is an important technique for creating extra training samples that helps to reduce over-fitting. We propose an augmentation strategy for MI methods to increase the number of training samples by randomly selecting a different subset of instances for each epoch. We randomly crop the image to select the set of instances, such that each crop contains at least 75% foreground according to the foreground mask. It is important to note that the image is never resized and the instance size remains constant. For each crop size chosen, the FCN is applied to the cropped image at full resolution. MI augmentation is a strategy used during training. As the MI aggregation layer is invariant to input size, the entire image and all its instances are always used at test time.
MULTIPLE INSTANCE EXPERIMENTS
Data Set. Our data set consists of 1713 patient samples from the Carolina Breast Cancer Study, Phase 3 [53]. There are typically four 1.0 mm cores per patient in the TMA, with a total of 5970 cores. Each core is selected from the H&E-stained whole slide by a pathologist such that it contains a substantial amount of tumor tissue. Each image has a diameter of around 2400 pixels and a maximum of 3500 pixels. One sample core is shown in Figure 6. We use a random subset of half the patients for training and the other half for testing. Classification accuracy is measured for five different tasks, some of them multiclass: 1) histologic subtype (ductal or lobular), 2) estrogen receptor (ER) status (positive or negative), 3) grade (1, 2, or 3), 4) risk of recurrence score (ROR) (low, intermediate, or high), 5) genetic subtype (basal, luminal A, luminal B, HER2, or normal-like). Ground truth for histologic subtype and grade are from a pathologist looking at the original whole slide. ER status is determined from immunohistochemistry, genetic subtype from the PAM50 array [51], and ROR from the ROR-PT score-based method [51].
Implementation Details. The TMA images are intensity normalized to standardize the appearance across slides [50]. The hematoxylin, eosin, and residual channels are extracted from the normalization process and used as the three channel input for the rest of our algorithm. A binary mask distinguishing tissue from background is also provided as input.
We use the pre-trained CNN AlexNet [48] and fine-tune with the MI architecture shown in Figure 6. All five tasks are equally weighted in a multi-task CNN as shared features help to reduce over-fitting. For each patient, ground truth labels are available for most tasks. The cross entropy loss is adjusted to ignore patients missing a label for a particular task.
In addition to MI augmentation, we can randomly mirror and rotate each training image. To accommodate the larger cropped image sizes in GPU memory, we can reduce the batch size. A typical image with tissue of diameter 2400 pixels produces a 68 × 68 grid of instances. After applying the foreground mask, there are roughly 3600 instances. Q = 15 quantiles are used in all experiments. There are typically four core images per patient; we assign the patient label to each during training and, at test time, take the mean prediction across the images. Further MI learning could be done to address the multiple core images per patient; however, our current focus is only on MI learning within each image.
MI Augmentation and the Importance of MI Learning. We study the effect of MI learning on large images by selecting the cropped image size for training. The smallest possible size is 227 × 227 (the input size for AlexNet), consisting of a single instance. When the bag label is applied to each instance during training, this is called single instance learning. Alternatively, a larger cropped region of size w × w can be selected; we test multiples of 500 up to 3500 and use mean aggregation in this experiment. By assigning the bag label to this larger cropped region during training and keeping the instance size constant, we perform MI learning. Multiple random crops are obtained from each training image such that roughly the same number of pixels is sampled for each crop size (e.g., the whole image for the largest crop size of 3500, and roughly 3500²/w² random crops for a training crop of size w). For the largest crop size, the whole image may be used without MI augmentation. Random mirroring and rotations are used for augmentation at all crop sizes. At test time, the whole image is always used, with the bag prediction formed by aggregating across all instances.
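A minimal sketch of this crop-sampling scheme follows, assuming a binary foreground mask; the retry logic is an implementation convenience not specified in the text, and the function name is illustrative.

```python
# Sketch: MI augmentation by sampling random w x w crops that are at least 75%
# foreground, drawing roughly (max_size / w)^2 crops per image so a similar number
# of pixels is seen at every crop size.
import random

def sample_mi_crops(image, fg_mask, w, max_size=3500, min_fg=0.75, max_tries=50):
    n_crops = max(1, round((max_size / w) ** 2))
    h_img, w_img = fg_mask.shape
    crops = []
    while len(crops) < n_crops:
        for _ in range(max_tries):
            top = random.randint(0, h_img - w)
            left = random.randint(0, w_img - w)
            if fg_mask[top:top + w, left:left + w].mean() >= min_fg:
                crops.append(image[top:top + w, left:left + w])
                break
        else:
            break   # give up if no sufficiently foreground crop is found
    return crops
```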
Figure 7 shows classification accuracy using mean aggregation as the number of instances (cropped image size) used for training is increased, while keeping instance size constant. Figure 7 shows that larger crop sizes for training significantly increase classification accuracy (p < 10^-3 with McNemar's test for w=500 vs. w=1500 on all tasks). The benefits level off for larger crops. As GPU memory requirements increase for larger crop sizes, selecting an intermediate crop size provides most of the benefits of MI augmentation.
Although it should not be surprising that a larger crop size at training works better, the magnitude of improvement is very significant. If the images were homogeneous (at the scale of a single instance, w = 227), then applying the bag label to each instance should produce a classification accuracy similar to when MI aggregation over the whole image is used during training. This is clearly not the case in Figure 7. For example, ER status accuracy increases from 68.6% to 85.6% when applying MI learning over the whole image. This demonstrates the importance of MI learning and the effect of heterogeneity. Our data set consists of cores selected from a whole slide by a pathologist. MI learning may be even more crucial when classifying larger and more heterogeneous images like whole slides.
Table 5: Average classification accuracy for different types of MI aggregation.
The standard error is in brackets.
MI Aggregation. We compared aggregation methods by training our model on a crop size w = 2000 and taking the average classification accuracy over four runs. Table 5 shows that mean and quantile aggregation both significantly outperform max (p < 10^-8 with McNemar's test). While quantile aggregation performance is similar to mean for some tasks, a significant increase in performance (93.1% to 95.2%) is observed for predicting the histologic subtype of ductal vs. lobular (p < 10^-10 with McNemar's test). This improvement is due to quantile aggregation predicting the bag class from a more complete view of the instance predictions using QF pooling, thereby capturing the heterogeneity.
Heterogeneity. By computing the class predictions for each instance, we get an idea of each region's contribution to the classification. Figure 8 shows a visualization for a sample image where the instance predictions are colored for each class. The w = 2000 crop size was used for this example. With the same computation performed over the whole test set, we calculated the proportion of instances predicted to belong to each class. Figure 9 plots the results for grade 1 vs. 3 and genetic subtype basal vs. luminal A. Heterogeneity is expected for grade, as the three tumor grades are not discrete, but a continuous spectrum from low to high. On the other hand, the level of heterogeneity to expect for genetic subtype is unknown because no studies have yet assessed genetic subtype from multiple samples within the same tumor. The graph shows a continuous spectrum from basal to luminal A. The luminal B, HER2, and normal samples lie mostly on the luminal A side, but with some mixing into the basal side.
Figures 8a-8e depict visualizations of instance predictions for a sample with ground truth labels of ductal (Figure 8a), ER positive (Figure 8b), grade 1 (Figure 8c), low ROR (Figure 8d), and luminal A (Figure 8e).
Figures 9a-9b depict visualizations of predicted heterogeneity for grade 1 vs. 3 (Figure 9a) and genetic subtype basal vs luminal A (Figure 9b). The predicted proportion for each class is calculated as the proportion of instances in the sample predicted to be from each class. Test samples for all classes are plotted.
TRAINING CNN USING MI LEARNING DISCUSSION
We have shown that MI learning while training a CNN may be very useful in achieving high classification accuracy on large, heterogeneous images. Even with a small number of labeled samples, our model was successful in fine-tuning the AlexNet CNN because of the large size of the images providing plenty of opportunity for MI augmentation. The impact of MI learning indicates that accommodating image heterogeneity is essential. While aggregating instance predictions with the mean is sufficient for some tasks, quantile aggregation produces a significant improvement for others. Instance-level predictions will enable future work studying tumor heterogeneity, perhaps leading to biological insights of tumor progression.
CANONICAL CORRELATION ANALYSIS (CCA) INTRODUCTION
CCA is a popular data analysis technique that projects two data sources into a space in which they are maximally correlated [61, 62]. It was initially used for unsupervised data analysis to gain insights into components shared by the two sources [63-65]. CCA is also used to compute a shared latent space for cross-view classification [66, 64, 67, 68], for representation learning on multiple views that are then joined for prediction [69, 70], and for classification from a single view when a second view is available during training [71]. While some of the correlated CCA features are useful for discriminative tasks, many represent properties that are of no use for classification and obscure correlated information that is beneficial. This problem is magnified with recent non-linear extensions of CCA, implemented via neural networks (NNs), that make significant strides in improving correlation [63-65, 68], but often at the expense of discriminative capability (see section CCA Experiments). Therefore, we present a new deep learning technique to project the data from two views to a shared space that is also discriminative.
Some prior work that boosts the discriminative capability of CCA is linear only [72-74]. More recent work using NNs still remains limited in that it optimizes discriminative capability for an intermediate representation rather than the final CCA projection [70], or optimizes the CCA objective only during pre-training, not while training the task objective [75]. We advocate jointly optimizing CCA and a discriminative objective by computing the CCA projection within a network layer while applying a task-driven operation such as classification. Experimental results show that our method significantly improves upon previous work [70, 75] due to its focus on both the shared latent space and a task-driven objective. The latter is particularly important on small training set sizes.
While alternative approaches to multi-view learning via CCA exist, they typically focus on a reconstruction objective. That is, they transform the input into a shared space such that the input could be reconstructed - either individually, or reconstructing one view from the other. Variations of coupled dictionary learning [76-79] and autoencoders [64, 80] have been used in this context. CCA-based objectives, such as the model used in this work, instead learn a transformation to a shared space without the need for reconstructing the input. This task may be easier and sufficient in producing a representation for multi-view classification [64]. We show that the CCA objective can equivalently be expressed as an ℓ2 distance minimization in the shared space plus an orthogonality constraint. Orthogonality constraints help regularize NNs [81]; we present three techniques to accomplish this. While our method is derived from CCA, by manipulating the orthogonality constraints, we obtain deep CCA approaches that compute a shared latent space that is also discriminative.
Overall, our method enables end-to-end training via mini-batches, and we demonstrate the effectiveness of our model for three different tasks: 1) cross-view classification on a variation of MNIST [82], showing significant improvements in accuracy, 2) regularization when two views are available for training but only one at test time, on a cancer imaging and genomic data set with only 1,000 samples, and 3) semi-supervised representation learning to improve speech recognition. In addition, our approach is more robust in the small sample size regime than alternative methods. Our experiments on real data show the effectiveness of our method in learning a shared space that is more discriminative than current state-of-the-art methods for a variety of tasks.
CCA DETAILS
We present our task-driven CCA approach in section TASK-OPTIMAL CCA (TOCCA). Linear and non-linear CCA are unsupervised and find the shared signal between a pair of data sources by maximizing the sum correlation between corresponding projections. Let X1 ∈ R^(d1 × n) and X2 ∈ R^(d2 × n) be mean-centered input data from two different views with n samples and d1, d2 features, respectively.

CCA. The objective is to maximize the correlation between a1 = w1ᵀ X1 and a2 = w2ᵀ X2, where w1 and w2 are projection vectors [61]. The first canonical directions are found via

(w1*, w2*) = argmax_{w1, w2} corr(w1ᵀ X1, w2ᵀ X2)

and subsequent projections are found by maximizing the same correlation but in orthogonal directions. Combining the projection vectors into matrices W1 = [w1⁽¹⁾, ..., w1⁽ᵏ⁾] and W2 = [w2⁽¹⁾, ..., w2⁽ᵏ⁾] (with k ≤ min(d1, d2)), CCA can be reformulated as a trace maximization under orthonormality constraints on the projections, i.e.,

max_{W1, W2} tr(W1ᵀ Σ12 W2)   s.t.   W1ᵀ Σ1 W1 = W2ᵀ Σ2 W2 = I        (1)

for covariance matrices Σ1 = X1 X1ᵀ, Σ2 = X2 X2ᵀ, and cross-covariance matrix Σ12 = X1 X2ᵀ. Let T = Σ1^(-1/2) Σ12 Σ2^(-1/2) and let its singular value decomposition (SVD) be T = U1 diag(σ) U2ᵀ with singular values σ1 ≥ ... ≥ σ_min(d1,d2) in descending order. W1 and W2 are computed from the top k singular vectors of T as W1 = Σ1^(-1/2) U1^(1:k) and W2 = Σ2^(-1/2) U2^(1:k), where U^(1:k) denotes the first k columns of matrix U. The sum correlation in the projection space is equivalent to

sum_{i=1..k} corr((w1⁽ⁱ⁾)ᵀ X1, (w2⁽ⁱ⁾)ᵀ X2) = sum_{i=1..k} σ_i,        (2)

i.e., the sum of the top k singular values. A regularized variation of CCA (RCCA) ensures that the covariance matrices are positive definite by computing them as Σ1 = (1/n) X1 X1ᵀ + rI and Σ2 = (1/n) X2 X2ᵀ + rI, for regularization parameter r > 0 and identity matrix I [83].
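For illustration, a minimal NumPy sketch of the regularized linear CCA solution described above follows; the function and variable names (linear_cca, X1, X2, k, r) are illustrative and are not part of the patent disclosure.

import numpy as np

def linear_cca(X1, X2, k, r=1e-4):
    # X1: d1 x n and X2: d2 x n mean-centered data matrices.
    # Returns projection matrices W1 (d1 x k), W2 (d2 x k) and the sum correlation.
    n = X1.shape[1]
    S1 = X1 @ X1.T / n + r * np.eye(X1.shape[0])   # regularized covariance of view 1
    S2 = X2 @ X2.T / n + r * np.eye(X2.shape[0])   # regularized covariance of view 2
    S12 = X1 @ X2.T / n                            # cross-covariance

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    S1_isq, S2_isq = inv_sqrt(S1), inv_sqrt(S2)
    T = S1_isq @ S12 @ S2_isq
    U1, sigma, U2T = np.linalg.svd(T)
    W1 = S1_isq @ U1[:, :k]                        # top-k canonical directions, view 1
    W2 = S2_isq @ U2T.T[:, :k]                     # top-k canonical directions, view 2
    return W1, W2, sigma[:k].sum()                 # Eq. (2): sum of top k singular values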
DCCA. Deep CCA adds non-linear projections to CCA by non-linearly mapping the input via a multilayer perceptron (MLP). In particular, inputs X1 and X2 are mapped via non-linear functions f1 and f2, parameterized by θ1 and θ2, resulting in activations A1 = f1(X1; θ1) and A2 = f2(X2; θ2) (assumed to be mean centered) [63]. When implemented by a NN, A1 and A2 are the output activations of the final layer with d_o features. In Figure 10, diagram (a) shows the network structure. DCCA optimizes the same objective as CCA, see Eq. (1), but using activations A1 and A2. Regularized covariance matrices are computed accordingly, and the solution for W1 and W2 can be computed using the SVD just as with linear CCA. When k = d_o (i.e., the number of CCA components is equal to the number of features in A1 and A2), optimizing the sum correlation in the projection space, as in Eq. (2), is equivalent to optimizing the following matrix trace norm objective (TNO):

L_TNO(A1, A2) = -||T||_tr,   where T = Σ1^(-1/2) Σ12 Σ2^(-1/2) is computed from A1 and A2.

DCCA optimizes this objective directly, without a need to compute the CCA projection within the network. The TNO is optimized first, followed by a linear CCA operation before downstream tasks like classification are performed. Figure 10 depicts various deep CCA architectures. In Figure 10, diagram (a) depicts a DCCA-based architecture that maximizes the sum correlation in projection space by optimizing an equivalent loss, the trace norm objective (TNO) [63], and diagram (b) depicts a SoftCCA-based architecture that relaxes the orthogonality constraints by regularizing with soft decorrelation (Decorr) and optimizes the ℓ2 distance in the projection space (equivalent to sum correlation with activations normalized to unit variance) [68]. Our TOCCA methods add a task loss and apply CCA orthogonality constraints by regularizing in two ways: diagram (c) TOCCA-W uses whitening and diagram (d) TOCCA-SD uses Decorr. The third method that we propose, TOCCA-ND, simply removes the Decorr components of TOCCA-SD.
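The following sketch shows how the TNO could be evaluated from a batch of activations; it mirrors the linear computation above but operates on the network outputs A1 and A2, and the regularization constant r is an assumed detail.

import numpy as np

def trace_norm_objective(A1, A2, r=1e-4):
    # A1, A2: d0 x n mean-centered activations of the two views.
    n = A1.shape[1]
    S1 = A1 @ A1.T / n + r * np.eye(A1.shape[0])
    S2 = A2 @ A2.T / n + r * np.eye(A2.shape[0])
    S12 = A1 @ A2.T / n

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(S1) @ S12 @ inv_sqrt(S2)
    # The trace norm of T is the sum of its singular values; DCCA minimizes its negative.
    return -np.linalg.svd(T, compute_uv=False).sum()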
SoftCCA. While DCCA enforces orthogonality constraints on the projections W1ᵀ A1 and W2ᵀ A2, SoftCCA relaxes them using regularization [68]. The final projection matrices W1 and W2 are integrated into f1 and f2 as the top network layer. The trace objective for DCCA in Eq. (1) can be rewritten as minimizing the ℓ2 distance between the projections when each feature in A1 and A2 is normalized to unit variance [84], leading to

L_ℓ2dist(A1, A2) = ||A1 - A2||²_F.

Regularization in SoftCCA penalizes the off-diagonal elements of the covariance matrix Σ, using a running average Σ̂ computed over batches and a loss of

L_Decorr(A) = sum_{i≠j} |Σ̂_{i,j}|.

Overall, the SoftCCA loss takes the form:

L_ℓ2dist(A1, A2) + λ (L_Decorr(A1) + L_Decorr(A2)).
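A sketch of the two SoftCCA loss terms follows, with a running covariance estimate maintained across batches; the momentum value and names are illustrative, and a framework implementation would compute these on tensors so that gradients flow through them.

import numpy as np

class SoftDecorrelation:
    # Penalizes off-diagonal entries of a running covariance estimate.
    def __init__(self, dim, momentum=0.99):
        self.cov = np.zeros((dim, dim))
        self.momentum = momentum

    def __call__(self, A):
        # A: batch_size x dim activations, assumed zero-mean and unit-variance per feature.
        batch_cov = A.T @ A / A.shape[0]
        self.cov = self.momentum * self.cov + (1 - self.momentum) * batch_cov
        off_diag = self.cov - np.diag(np.diag(self.cov))
        return np.abs(off_diag).sum()

def l2_dist_loss(A1, A2):
    # Squared L2 distance between paired projections of the two views.
    return ((A1 - A2) ** 2).sum(axis=1).mean()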
Supervised CCA methods. CCA, DCCA, and SoftCCA are all unsupervised methods to learn a projection to a shared space in which the data is maximally correlated. Although these methods have shown utility for discriminative tasks, a CCA decomposition may not be optimal for classification because features that are correlated may not be discriminative. Our experiments will show that maximizing the correlation objective too much can degrade performance on discriminative tasks.
CCA has previously been extended to supervised settings by maximizing the total correlation between each view and the training labels in addition to each pair of views [72, 73], and by maximizing the separation of classes [66, 70]. Although these methods incorporate the class labels, they do not directly optimize for classification. Dorfer et al.'s CCA Layer (CCAL) is the closest to our method. It optimizes a task loss operating on a CCA projection; however, the CCA objective itself is only optimized during pre-training, not in an end-to-end manner [75]. Other supervised CCA methods are linear only [73, 72, 66, 74]. Instead of computing the CCA projection within the network, as in CCAL, we optimize the non-linear mapping into the shared space together with the CCA part.
TASK-OPTIMAL CCA (TOCCA)
To compute a shared latent space that is also discriminative, we start with the DCCA formulation and add a task-driven term to the optimization objective. The CCA component finds features that are correlated between views, while the task component ensures that they are also discriminative. This model can be used for representation learning on multiple views before joining representations for prediction [69, 70] and for classification when two views are available for training but only one at test time [71]. In the CCA EXPERIMENTS section, we demonstrate both use cases on real data. Our methods and related NN models from the literature are summarized in Table 6 below. Figure 10 shows schematic diagrams.
While DCCA optimizes the sum correlation through an equivalent loss function (TNO), the CCA projection itself is computed only after optimization. Hence, the projections may not be usable to optimize another task simultaneously. The main challenge in developing a task-optimal form of deep CCA that discriminates based on the CCA projection is in computing this projection within the network, a necessary step to enable simultaneous training of both objectives. We tackle this by focusing on the two components of DCCA: maximizing the sum correlation between activations A1 and A2 and enforcing orthonormality constraints within A1 and A2. We achieve both by transforming the CCA objective, and we present three methods that progressively relax the orthogonality constraints.
We further improve upon DCCA by enabling mini-batch computations for improved flexibility and test performance. DCCA was developed for large batches because correlation is not separable across batches. While large batch implementations of stochastic gradient optimization can increase computational efficiency via parallelism, small batch training provides more up-to-date gradient calculations, allowing a wider range of learning rates and improving test accuracy [85]. We reformulate the correlation objective as the ℓ2 distance (following SoftCCA), enabling separability across batches. We ensure normalization to unit variance via batch normalization without the scale and shift parameters [86].
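As one possible Keras realization of a single view's branch under this scheme (layer sizes are illustrative and are not the tuned settings from Table 8), batch normalization without the learned scale and shift keeps each output feature approximately zero-mean and unit-variance, which the ℓ2-distance form of the correlation objective assumes:

from keras.layers import BatchNormalization, Dense
from keras.models import Sequential

view_branch = Sequential([
    Dense(500, activation='relu', input_shape=(392,)),  # fully connected + ReLU
    BatchNormalization(center=False, scale=False),       # normalization without scale/shift
    Dense(50),                                            # projection to the shared space
    BatchNormalization(center=False, scale=False),
])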
Task-driven objective. First, we apply non-linear functions f1 and f2 (via MLPs) to each view X1 and X2, i.e., A1 = f1(X1; θ1) and A2 = f2(X2; θ2). Second, a task-specific function f_task(A; θ_task) operates on the outputs A1 and A2. In particular, f1 and f2 are optimized so that the ℓ2 distance between A1 and A2 is minimized; therefore, f_task can be trained to operate on both inputs A1 and A2. We combine the CCA and task-driven objectives as a weighted sum with a hyperparameter for tuning. This model is flexible, in that the task-driven goal can be used for classification [87, 88], regression [89], clustering [90], or any other task. See Table 6 below for an overview.
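A minimal sketch of the weighted combination is shown below; task_loss_fn, f_task, and lam are illustrative stand-ins for the task loss, the task-specific function, and the tuning hyperparameter.

def tocca_objective(task_loss_fn, f_task, A1, A2, Y, lam=1.0):
    # Task losses applied to both views' outputs plus the weighted L2 distance between them.
    task_term = task_loss_fn(f_task(A1), Y) + task_loss_fn(f_task(A2), Y)
    cca_term = ((A1 - A2) ** 2).sum(axis=1).mean()
    return task_term + lam * cca_term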
Orthogonality constraints. The remaining complications for mini-batch optimization are the orthogonality constraints, for which we propose three solutions, each handling the orthogonality constraints of CCA in a different way: whitening, soft decorrelation, and no decorrelation.
Whitening (TOCCA-W). CCA applies orthogonality constraints to A1 and A2. We accomplish this with a linear whitening transformation that transforms the activations such that their covariance becomes the identity matrix, i.e., the features are uncorrelated. Decorrelated Batch Normalization (DBN) has previously been used to regularize deep models by decorrelating features [81] and inspired our solution. In particular, we apply a transformation B = UA to make B orthonormal, i.e., BBᵀ = I.
We use a Zero-phase Component Analysis (ZCA) whitening transform composed of three steps: rotate the data to decorrelate it, rescale each axis, and rotate back to the original space. Each of these transformations is learned from the data. Any matrix U ∈ R^(d_o × d_o) satisfying UᵀU = Σ⁻¹ whitens the data, where Σ denotes the covariance matrix of A. As U is only defined up to a rotation, it is not unique. PCA whitening follows the first two steps and uses the eigendecomposition of Σ: U_PCA = Λ^(-1/2) Vᵀ for Λ = diag(λ1, ..., λ_{d_o}) and V = [v1, ..., v_{d_o}], where (λi, vi) are the eigenvalue, eigenvector pairs of Σ. As PCA whitening suffers from stochastic axis swapping, neurons are not stable between batches [81]. ZCA whitening uses the transformation U_ZCA = V Λ^(-1/2) Vᵀ, in which PCA whitening is first applied, followed by a rotation back to the original space. Adding the rotation V brings the whitened data B as close as possible to the original data A [91].
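The ZCA transform just described can be computed from a covariance matrix as in the following sketch; eps is an assumed numerical-stability constant.

import numpy as np

def zca_whitening_matrix(cov, eps=1e-5):
    # U_ZCA = V diag(lambda)^(-1/2) V^T from the eigendecomposition of the covariance.
    lam, V = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    return V @ np.diag(1.0 / np.sqrt(lam)) @ V.T

# For mean-centered activations A (d0 x m) with cov = A @ A.T / m,
# B = zca_whitening_matrix(cov) @ A satisfies B @ B.T / m ≈ I.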
Computation of U_ZCA is clearly dependent on Σ. While Huang et al. [81] used a running average of U_ZCA over batches, we apply this stochastic approximation to Σ for each view using the update Σ⁽ᵏ⁾ = α Σ⁽ᵏ⁻¹⁾ + (1 - α) Σ_b for batch k, where Σ_b is the covariance matrix of the current batch and α ∈ (0, 1) is the momentum. We then compute the ZCA transformation from Σ⁽ᵏ⁾ to do whitening as B = f_ZCA(A) = U_ZCA⁽ᵏ⁾ A. At test time, the U_ZCA from the last training batch is used. Algorithm 1 below describes ZCA whitening in greater detail. In summary, TOCCA-W integrates both the correlation and task-driven objectives, with decorrelation performed by whitening, into:

L_task(f_task(B1), Y) + L_task(f_task(B2), Y) + λ L_ℓ2dist(B1, B2),

where B1 and B2 are the whitened outputs of A1 and A2, respectively.
Algorithm 1: Whitening layer for orthogonality.
Input: activations A ∈ R^(d_o × m)
Hyperparameters: batch size m, momentum α, regularizer ε
Parameters of layer: mean μ, covariance Σ
if training then
    μ ← α μ + (1 - α) (1/m) A 1        {Update mean}
    A ← A - μ                          {Mean center data}
    Σ ← α Σ + (1 - α) (1/m) A Aᵀ       {Update covariance}
    Σ ← Σ + ε I                        {Add ε I for numerical stability}
    Λ, V ← eig(Σ)                      {Compute eigendecomposition}
    U ← V Λ^(-1/2) Vᵀ                  {Compute ZCA transformation matrix}
else
    A ← A - μ                          {Mean center data}
end if
B ← U A                                {Apply ZCA whitening transform}
return B

Soft decorrelation (TOCCA-SD). While fully independent components may be beneficial in regularizing NNs on some data sets, a softer decorrelation may be more suitable on others. In this second formulation we relax the orthogonality constraints using regularization, following the Decorr loss of SoftCCA [68]. The loss function for this formulation is:

L_task(f_task(A1), Y) + L_task(f_task(A2), Y) + λ1 L_ℓ2dist(A1, A2) + λ2 (L_Decorr(A1) + L_Decorr(A2)).
No decorrelation (TOCCA-ND). When CCA is used in an unsupervised manner, some form of orthogonality constraint or decorrelation is necessary to ensure that f1 and f2 do not simply produce multiple copies of the same feature. While this result could maximize the sum correlation, it is not helpful in capturing useful projections. In the task-driven setting, the discriminative term ensures that the features in f1 and f2 are not replicates of the same information. TOCCA-ND therefore removes the decorrelation term entirely, forming the simpler objective:

L_task(f_task(A1), Y) + L_task(f_task(A2), Y) + λ L_ℓ2dist(A1, A2).
These three models allow testing whether whitening or soft decorrelation benefits a task-driven model.
Computational complexity. Due to the eigendecomposition, TOCCA-W has a complexity of O(d_o³), compared to O(d_o²) for TOCCA-SD, with respect to the output dimension d_o. However, d_o is typically small (< 100) and this extra computation is only performed once per batch. The difference in runtime is less than 6.5% for a batch size of 100 or 9.4% for a batch size of 30 (see Table 7 below).
In summary, all three variants are motivated by adding a task-driven component to deep CCA. TOCCA-ND is the most relaxed and directly attempts to obtain identical latent representations. Experiments will show that whitening (TOCCA-W) and soft decorrelation (TOCCA-SD) provide a beneficial regularization. Further, since the ℓ2 distance that we optimize was shown to be equivalent to the sum correlation (see section CCA DETAILS, SoftCCA paragraph), all three TOCCA models maintain the goals of CCA, just with different relaxations of the orthogonality constraints. See Table 6 for an overview.
Table 6
Table 6 provides a comparison of our proposed task-optimal deep CCA methods with related ones from the literature: DCCA [63], SoftCCA [68], and CCAL-L_rank [75]. CCAL-L_rank uses a pairwise ranking loss with cosine similarity to identify matching and non-matching samples for image retrieval, not classification. A1 and A2 are mean-centered outputs from two feed-forward networks. Σ = AᵀA is computed from a single (large) batch (used in DCCA); Σ̂ is computed as a running mean over batches (for all other methods). f_task(A; θ_task) is a task-specific function with parameters θ_task, e.g., a softmax operation for classification.
CCA EXPERIMENTS
We validated our methods on three different data sets: MNIST handwritten digits, the Carolina Breast Cancer Study (CBCS) using imaging and genomic features, and speech data from the Wisconsin X-ray Microbeam Database (XRMB). Our experiments show the utility of our methods for (1 ) cross-view classification, (2) regularization with a second view during training when only one view is available at test time, and (3) representation learning on multiple views that are joined for prediction.
Data set   Batch size   Epochs   TOCCA-W   TOCCA-SD
MNIST      100          200      488 s     418 s
MNIST      30           200      1071 s    1036 s
CBCS       100          400      KB s      104 s
XRMB       50,000       100      3056 s    3446 s

Table 7

Implementation. Each layer of our network consists of a fully connected layer, followed by a ReLU activation and batch normalization [86]. We used the Nadam optimizer and tuned hyperparameters on a validation set via random search; settings and ranges are specified in Table 8. We used Keras with the Theano backend and an Nvidia GeForce GTX 1080 Ti. Our implementations of DCCA, SoftCCA, and Joint DCCA/DeepLDA [70] also use ReLU activation and batch normalization. We modified CCAL-L_rank [75] to use a softmax function and cross-entropy loss for classification, instead of a pairwise ranking loss for retrieval, referring to this modification as CCAL-L_ce.
Hyperparameter                        MNIST            CBCS     XRMB
Hidden layers                         4
Hidden layer size                     500              200      1,000
Output layer size                     50               50       112
Loss function weight λ                [10^0, 10^-4]
Momentum α                            0.99             0.99     0.99
Whitening regularizer ε
Soft decorrelation regularizer λ2                               [10^0, 10^-4]
Batch size                            32               100      50,000
Learning rate                         [10^-2, 10^-4]
Epochs                                200              400      100

Table 8
CROSS-VIEW CLASSIFICATION ON MNIST DIGITS
We formed a multi-view data set from the MNIST handwritten digit image data set [82]. Following Andrew et al. [63], we split each 28 x 28 image in half horizontally, creating left and right views that are each 14 x 28 pixels. All images were flattened into a vector with 392 features. The full data set consists of 60k training images and 10k test images. We used a random set of up to 50k for training and the remaining training images for validation. We used the full 10k image test set.
We evaluated cross-view classification accuracy by first computing the projection for each view, then training a linear SVM on one view's projection, and finally using the other view's projection at test time. While the task-driven methods presented in this work learn a classifier within the model, this test setup enables a fair comparison with the unsupervised CCA variants and validates the discriminativity of the features learned. Notably, using the built-in softmax classifier performed similarly to the SVM (not shown), as much of the power of our methods comes from the representation learning part. We do not compare with a simple supervised NN because this setup does not learn the shared space necessary for cross-view classification. We report results averaged over five randomly selected training/validation sets; the test set always remained the same.
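The evaluation protocol can be summarized by the following scikit-learn sketch; the variable names are illustrative, and the projections would come from the respective CCA model.

from sklearn.svm import LinearSVC

def cross_view_accuracy(Z_train_view1, y_train, Z_test_view2, y_test):
    # Fit a linear SVM on one view's projection and score it on the other view's projection.
    clf = LinearSVC()
    clf.fit(Z_train_view1, y_train)
    return clf.score(Z_test_view2, y_test)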
Figure 11 depicts classification accuracy for different CCA methods. In the left diagram of Figure 11, sum correlation vs. cross-view classification accuracy (on MNIST) is depicted across different hyperparameter settings on a training set size of 10,000 for DCCA [63], SoftCCA [68], TOCCA-W, and TOCCA-SD. For unsupervised methods (DCCA and SoftCCA), large correlations do not necessarily imply good accuracy. In the right diagram of Figure 11, the effect of batch size on classification accuracy is depicted for each TOCCA method on MNIST (training set size of 10,000), along with the effect of training set size on classification accuracy for each method. Our TOCCA variants outperformed all others across all training set sizes.
Correlation vs. classification accuracy. We first demonstrate the importance of adding a task-driven component to DCCA by showing that maximizing the sum correlation between views is not sufficient. Figure 11 (left) shows the sum correlation vs. cross-view classification accuracy across many different hyperparameter settings for DCCA [63], SoftCCA [68], and TOCCA. We used 50 components for each; thus, the maximum sum correlation was 50. The sum correlation was measured after applying linear CCA to ensure that components were independent. With DCCA a larger correlation tended to produce a larger classification accuracy, but there was still a large variance in classification accuracy amongst hyperparameter settings that produced a similar sum correlation. For example, for the two farthest right points in the plot (colored red), their classification accuracy differs by 10%, and they are not even the points with the best classification accuracy (colored purple). The pattern is different for SoftCCA. There was an increase in classification accuracy as sum correlation increased, but only up to a point. For higher sum correlations, the classification accuracy varied even more, from 20% to 80%. Further experiments (not shown) have indicated that when the sole objective is correlation, some of the projection directions are simply not discriminative, particularly when there are a large number of classes. Hence, optimizing for sum correlation alone does not guarantee a discriminative model. TOCCA-W and TOCCA-SD show a much greater classification accuracy across a wide range of correlations and, overall, the best accuracy when correlation is greatest.
Effect of batch size. Figure 11 (right) plots the batch size vs. classification accuracy for a training set size of 10,000. We tested batch sizes from 10 to 10,000; a batch size of 10 or 30 was best for all three variations of TOCCA. This is in line with previous work that found the best performance with a batch size between 2 and 32 [85]. We used a batch size of 32 in the remaining experiments on MNIST.
Effect of training set size. We manipulated the training set size in order to study the robustness of our methods. In particular, Figure 11 (right) shows the cross-view classification accuracy for training set sizes from n = 300 to 50,000. While we expected that performance would decrease for smaller training set sizes, some methods were more susceptible to this degradation than others. The classification accuracy with CCA dropped significantly for n = 300 and 1,000, due to overfitting and instability issues related to the covariance and cross-covariance matrices. SoftCCA shows similar behavior ([68] did not test such small training set sizes).
Across all training set sizes, our TOCCA variations consistently exhibited good performance, e.g., increasing classification accuracy from 78.3% to 86.7% for n = 1,000 with TOCCA-SD. Increases in accuracy over TOCCA-ND were small, indicating that the different decorrelation schemes have only a small effect on this data set; the task-driven component is the main reason for the success of our method. In particular, the classification accuracy with n = 1,000 was better than that of the unsupervised DCCA method with n = 10,000. Further, TOCCA with n = 300 did better than linear methods with n = 50,000, clearly showing the benefits of the proposed formulation. We also examined the CCA projections qualitatively via a 2D t-SNE embedding [92]. Figure 12 shows the CCA projection of the left view for each method. As expected, the task-driven variant produced more clearly separated classes. Figure 12 shows t-SNE plots for CCA methods on an example variation of MNIST. Each method was used to compute projections for the two views (left and right sides of the images) using 10,000 training examples. The plots show a visualization of the projection for the left view with each digit colored differently. TOCCA-SD and TOCCA-ND (not shown) produced similar results to TOCCA-W.
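The qualitative check described above can be reproduced with a sketch along the following lines; plotting details are illustrative.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_projection_tsne(Z_left, labels):
    # Embed the left-view projection (n_samples x k) into 2D and color by digit label.
    emb = TSNE(n_components=2).fit_transform(Z_left)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap='tab10')
    plt.show()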
Method       Training data   Test data   Task    Accuracy
Linear SVM   Image only      Image       Basal   0.777 (0.003)
NN           Image only      Image       Basal   0.808 (0.006)
CCAL-L_ce    Image+GE        Image       Basal   0.807 (0.008)
TOCCA-W      Image+GE        Image       Basal   0.830 (0.006)
TOCCA-SD     Image+GE        Image       Basal   0.818 (0.006)
TOCCA-ND     Image+GE        Image       Basal   0.816 (0.004)
Linear SVM   GE only         GE          Grade   0.832 (0.012)
NN           GE only         GE          Grade   0.830 (0.012)
CCAL-L_ce    GE+image        GE          Grade   0.804 (0.033)
TOCCA-W      GE+image        GE          Grade   0.862 (0.013)
TOCCA-SD     GE+image        GE          Grade   0.856 (0.011)
TOCCA-ND     GE+image        GE          Grade   0.856 (0.011)
Table 9

Table 9 shows classification accuracy for different methods of predicting Basal genomic subtype from images or grade from gene expression. The linear SVM and NN were trained on a single view, while all other methods were trained with both views. By regularizing with the second view during training, all TOCCA variants improved classification accuracy. The standard error is in parentheses.
REGULARIZATION FOR CANCER CLASSIFICATION
In this experiment, we address the following question: Given two views available for training but only one at test time, does the additional view help to regularize the model?
We study this question using 1,003 patient samples with image and genomic data from CBCS3 [93]. Images consisted of four cores per patient from a tissue microarray that was stained with hematoxylin and eosin. Image features were extracted using a VGG16 backbone [94], pre-trained on ImageNet, by taking the mean of the 512D output of the fourth set of conv. layers across the tissue region and further averaging across all core images for the same patient. For gene expression (GE), we used the set of 50 genes in the PAM50 array [95]. The data set was randomly split into half for training and one quarter each for validation and testing; we report the mean over eight cross-validation runs. Classification tasks included (1) predicting Basal vs. non-Basal genomic subtype using images, which is typically done from GE, and (2) predicting grade 1 vs. 3 from GE, typically done from images. This is not a multi-task classification setup; it is a means for one view to stabilize the representation of the other.
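A hedged sketch of this feature-extraction step is given below; the Keras layer name ('block4_pool') and the preprocessing call are assumptions based on the standard Keras VGG16 implementation, and masking to the tissue region is omitted for brevity.

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model

base = VGG16(weights='imagenet', include_top=False)
feature_model = Model(inputs=base.input, outputs=base.get_layer('block4_pool').output)

def patient_features(core_images):
    # core_images: list of (H, W, 3) arrays for one patient's TMA cores.
    feats = []
    for img in core_images:
        x = preprocess_input(np.expand_dims(img.astype('float32'), 0))
        fmap = feature_model.predict(x)[0]      # (h, w, 512) feature map
        feats.append(fmap.mean(axis=(0, 1)))    # 512-D spatial mean
    return np.mean(feats, axis=0)               # average across core images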
We tested different classifier training methods when only one view was available at test time: a) a linear SVM trained on one view, b) a deep NN trained on one view using the same architecture as the lower layers of TOCCA, c) CCAL-L_ce trained on both views, and d) TOCCA trained on both views. Table 9 lists the classification accuracy for each method and task. When predicting the Basal genomic subtype from images, all our methods showed an improvement in classification accuracy; the best result was with TOCCA-W, which produced a 2.2% improvement. For predicting grade from GE, all our methods again improved the accuracy, by up to 3.2% with TOCCA-W. These results show that having additional information during training can boost performance at test time. Notably, this experiment used a static set of pre-trained VGG16 image features in order to assess the utility of the method. The network itself could be fine-tuned end-to-end with our TOCCA model, providing an easy opportunity for data augmentation and likely further improvements in classification accuracy.
SEMI-SUPERVISED LEARNING FOR SPEECH RECOGNITION
Some additional experiments use speech data from XRMB, consisting of simultaneously recorded acoustic and articulatory measurements. Prior work has shown that CCA-based algorithms can improve phonetic recognition [57, 64, 65, 70]. The 45 speakers were split into 35 for training, 2 for validation, and 8 for testing, for a total of 1,429,236 samples for training, 85,297 for validation, and 111,314 for testing. The acoustic features are 112D and the articulatory ones are 273D. We removed the per-speaker mean and variance for both views. Samples are annotated with one of 38 phonetic labels.
Our task on this data set was representation learning for multi-view prediction. That is, using both views of data to learn a shared discriminative representation. We trained each model using both views and their labels. To test each CCA model, we followed prior work and concatenated the original input features from both views with the projections from both views. Due to the large training set size, we used a Linear Discriminant Analysis (LDA) classifier for efficiency. The same construction was used at test time. This setup was used to assess whether a task-optimal DCCA model can improve discriminative power. We tested TOCCA with a task-driven loss of LDA [88] or softmax to demonstrate the flexibility of our model.
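The multi-view evaluation described above (concatenating the original features and the learned projections from both views, then training LDA) can be sketched as follows; variable names are illustrative.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def multiview_lda_accuracy(X1_tr, X2_tr, Z1_tr, Z2_tr, y_tr,
                           X1_te, X2_te, Z1_te, Z2_te, y_te):
    train = np.hstack([X1_tr, X2_tr, Z1_tr, Z2_tr])   # original features + projections
    test = np.hstack([X1_te, X2_te, Z1_te, Z2_te])
    lda = LinearDiscriminantAnalysis()
    lda.fit(train, y_tr)
    return lda.score(test, y_te)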
Table 10
We compared the discriminability of a variety of methods to learn a shared latent representation. Table 10 lists the classification results with a baseline that used only the original input features for LDA. Although deep methods, e.g., DCCA and SoftCCA, improved upon the linear methods, all TOCCA variations significantly outperformed previous state-of-the-art techniques. Using softmax consistently beat LDA by a large margin. TOCCA-SD and TOCCA-ND produced equivalent results as a weight of 0 on the decorrelation term performed best. However, TOCCA-W showed the best result with an improvement of 15% over the best alternative method.
TOCCA can also be used in a semi-supervised manner when labels are available for only some samples. Table 11 lists the results for TOCCA-W in this setting. With 0% labeled data, the result would be similar to DCCA. Notably, a large improvement over the unsupervised results in Table 10 is seen even with labels for only 10% of the training samples.

Labeled data   Accuracy
100%           0.795
30%            0.762
10%            0.745
3%             0.684
1%             0.637

Table 11
CCA DISCUSSION
We proposed a method to find a shared latent space that is also discriminative by adding a task-driven component to deep CCA while enabling end-to-end training. This was accomplished by replacing the CCA projection with an ℓ2 distance minimization and orthogonality constraints on the activations, implemented in three different ways. TOCCA-W or TOCCA-SD performed best, depending on the data set; both include some means of decorrelation that provides an extra regularizing effect on the model, thereby outperforming TOCCA-ND.
TOCCA showed large improvements over state-of-the-art in cross-view classification accuracy on MNIST and significantly increased robustness when the training set size was small. On CBCS, TOCCA provided a regularizing effect when both views were available for training but only one at test time. TOCCA also produced a large increase over state-of-the-art for multi-view representation learning on a much larger data set, XRMB. On this data set we also demonstrated a semi-supervised approach to get a large increase in classification accuracy with only a small proportion of the labels. Using a similar technique, our method could be applied when some samples are missing a second view.
Classification tasks using a softmax operation or LDA were explored in this work. However, the formulation presented can also be used with other tasks such as regression or clustering. Another possible avenue for future work entails extracting components shared by both views as well as individual components. This approach has been developed for dictionary learning [58-60] and may be extended to deep CCA-based methods.

The disclosure of each of the following references is hereby incorporated herein by reference in its entirety.
References
1 . Dunnwald, L. K., Rossing, M. A. & Li, C. I. Hormone receptor status, tumor characteristics, and prognosis: a prospective cohort of breast cancer patients. Breast Cancer Res. 9, R6 (2007).
2. Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27, 1 160-7 (2009).
3. Sparano, J. A. & Paik, S. Development of the 21 -gene assay and its application in clinical practice and clinical trials. J. Clin. Oncol. 26, 721- 728 (2008).
4. Carlson, J. J. & Roth, J. A. The impact of the Oncotype Dx breast cancer assay in clinical practice: a systematic review and meta-analysis. Breast Cancer Res Treat 141 , 13-22 (2013).
5. Beck, A. H. et al. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci. Transl. Med. 3, 108ra113 (2011).
6. Yuan, Y. et al. Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci. Transl. Med. 4, 157ra143 (2012).
7. Veta, M. et al. Assessment of algorithms for mitosis detection in breast cancer histopathology images. Med. Image Anal. 20, 237-48 (2015).
8. Khan, A. M., Sirinukunwattana, K. & Rajpoot, N. A Global Covariance Descriptor for Nuclear Atypia Scoring in Breast Histopathology Images. IEEE J. Biomed. Heal. Informatics 19, 1637-1647 (2015).
9. Basavanhally, A. et al. Incorporating domain knowledge for tubule detection in breast histopathology using O'Callaghan neighborhoods. Proc. SPIE 7963, 796310 (2011).
10. Popovici, V. et al. Joint analysis of histopathology image features and gene expression in breast cancer. BMC Bioinformatics 17, 209 (2016).
11. Zhou, Y., Chang, H., Barner, K., Spellman, P. & Parvin, B. Classification of Histology Sections via Multispectral Convolutional Sparse Coding in Proc. CVPR (2014).
12. Vu, T. H., Mousavi, H. S., Monga, V., Rao, A. U. & Rao, G. Histopathological Image Classification using Discriminative Feature-oriented Dictionary Learning. IEEE Trans. Med. Imaging 35, 738-751 (2015).
13. Cruz-Roa, A. A., Ovalle, J. E. A., Madabhushi, A. & Gonzalez, F. A. O. A Deep Learning Architecture for Image Representation, Visual Interpretability and Automated Basal-Cell Carcinoma Cancer Detection in Proc. MICCAI (2013).
14. Wang, D., Khosla, A., Gargeya, R., Irshad, H. & Beck, A. H. Deep Learning for Identifying Metastatic Breast Cancer (2016). Preprint at http://arxiv.Org/abs/1606.05718
15. Cireşan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. Mitosis detection in breast cancer histology images with deep neural networks. Proc. MICCAI (2013).
16. Xu, J., Luo, X., Wang, G., Gilmore, H. & Madabhushi, A. A Deep Convolutional Neural Network for Segmenting and Classifying Epithelial and Stromal Regions in Histopathological Images. Neurocomputing 191 , 214-223 (2016).
17. Janowczyk, A. & Madabhushi, A. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. J. Pathol. Inform. 7, 29 (2016).
18. Longacre, T. A. et al. Interobserver agreement and reproducibility in classification of invasive breast carcinoma: an NCI breast cancer family registry study. Mod. Pathol. 19, 195-207 (2006).
19. Salles, M., Sanches, F. & Perez AA, G. Importance of a second opinion in breast surgical pathology and therapeutic implications. Rev. Bras. Ginecol. Obstet. 30, 602-608 (2008).
20. Boiesen, P. et al. Histologic grading in breast cancer - reproducibility between seven pathologic departments. South Sweden Breast Cancer Group. Acta Oncol. (Madr). 39, 41-45 (2000).
21. Ma, H. et al. Breast cancer receptor status: do results from a centralized pathology laboratory agree with SEER registry reports? Cancer Epidemiol. Biomarkers Prev. 18, 2214-20 (2009).
22. Prat, A., Ellis, M. J. & Perou, C. M. Practical implications of gene-expression-based assays for breast oncologists. Nat. Rev. Clin. Oncol. 9, 48-57 (2011).
23. Carey, L. A. et al. Race, breast cancer subtypes, and survival in the Carolina Breast Cancer Study. JAMA 295, 2492-502 (2006).
24. Rosen, P. P. Rosen’s Breast Pathology. (2009).
25. Makki, J. Diversity of Breast Carcinoma: Histological Subtypes and Clinical Relevance. Clin. Med. Insights Pathol. 8, 23-31 (2015).
26. Allott, E. H. et al. Performance of three biomarker immunohistochemistry for intrinsic breast cancer subtyping in the AMBER consortium. Cancer Epidemiol. Biomarkers Prev. 1-28 (2015).
27. Troester, M. A. et al. Racial Differences in PAM50 Subtypes in the Carolina Breast Cancer Study. J. Natl. Cancer Inst. 110, (2018).
28. Allott, E. H. et al. Performance of Three-Biomarker Immunohistochemistry for Intrinsic Breast Cancer Subtyping in the AMBER Consortium. Cancer Epidemiol. Biomarkers Prev. 25, 470-478 (2016).
29. Niethammer, M. et al. Appearance Normalization of Histology Slides in MICCAI, International Workshop Machine Learning in Medical Imaging 58-66 (2010).
30. Miedema, J. et al. Image and statistical analysis of melanocytic histology. Histopathology 61 , 436-44 (2012).
31. Cooper, L. A. D. et al. Integrated morphologic analysis for the identification and characterization of disease subtypes. J. Am. Med. Informatics Assoc. 19, 317-23 (2012).
32. Chang, H. et al. Morphometric analysis of TCGA glioblastoma multiforme. BMC Bioinformatics 12, 484 (2011).
33. Hou, L. et al. Patch-based Convolutional Neural Network for Whole Slide Tissue Image Classification in Proc. CVPR (2016).
34. Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition in International Conference on Learning Representations (2015).
35. Oquab, M., Bottou, L, Laptev, I. & Sivic, J. Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks in Proc. CVPR (2014).
36. Razavian, A. S., Azizpour, H., Sullivan, J. & Carlsson, S. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition in Proc. CVPR (2014).
37. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? in Proc. NIPS (2014).
38. Tajbakhsh, N. et al. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med. Imaging 35, 1299-1312 (2016).
39. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 9, 1871-1874 (2008).
40. Zadrozny, B. & Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. Proc. Int. Conf. Knowl. Discov. Data Min. 694-699 (2002).
41 . Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS. pp. 561-568 (2002)
42. Broadhurst, R.E.: Compact appearance in object populations using quantile function based distribution families. Ph.D. thesis, The University of North Carolina at Chapel Hill (2008)
43. Hiley, C.T., Swanton, C.: Spatial and temporal cancer evolution: causes and consequences of tumour diversity. Clinical Medicine 14(Suppl 6), s33-s37 (Dec 2014)
44. Hou, L., Samaras, D., et al.: Patch-based Convolutional Neural Network for Whole Slide Tissue Image Classification. In: CVPR (2016) (same as reference 33)
45. Jia, Z., Huang, X., Chang, E. I.C., Xu, Y.: Constrained Deep Weak Supervision for Histopathology Image Segmentation. arXiv preprint: 1701.00794 (2017)
46. Kandemir, M., Hamprecht, F.A.F.: Computer-aided diagnosis from weak supervision: A benchmarking study. Computerized Medical Imaging and Graphics (2014)
47. Kraus, O.Z., Ba, J. L., Frey, B.J.: Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics 32(12), i52-i59 (Jun 2016)
48. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: NIPS. pp. 1106-1114 (2012)
49. McGranahan, N., Swanton, C.: Biological and Therapeutic Impact of Intratumor Heterogeneity in Cancer Evolution. Cancer Cell 27(1 ), 15-26 (Jan 2015)
50. Niethammer, M., Borland, D., Marron, J., Woolsey, J., Thomas, N.: Appearance normalization of histology slides. In: MICCAI Workshop on Machine Learning in Medical Imaging (2010) (same as reference 29)
51 . Parker, J.S., Mullins, M., et al.: Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of clinical oncology 27(8), 1 160-1 167 (2009) (same as reference 2)
52. Sun, M., Han, T.X., Liu, M.C., Khodayari-Rostamabad, A.: Multiple Instance Learning Convolutional Neural Networks for Object Recognition. In: ICPR (2016)
53. Troester, M., Sun, X., et al.: Racial differences in PAM50 subtypes in the Carolina Breast Cancer Study. Journal of the National Cancer Institute (2018) (same as reference 27)
54. Vanwinckelen, G., Tragante do O, V., Fierens, D., Blockeel, H.: Instance-level accuracy versus bag-level accuracy in multi-instance learning. Data Mining and Knowledge Discovery 30(2), 313-341 (mar 2016)
55. Wang, X., Yan, Y., Tang, P., Bai, X., Liu, W.: Revisiting Multiple Instance Neural Networks. Pattern Recognition 74, 15-24 (2018)
56. Xu, Y., Zhu, J.Y., et al.: Weakly supervised histopathology cancer image segmentation and classification. Medical Image Analysis 18(3), 591-604 (Apr 2014)
57. Weiran Wang, Raman Arora, Karen Livescu, and Jeff A. Bilmes. Unsupervised learning of acoustic features via deep canonical correlation analysis. In Proc. ICASSP, 2015.
58. Eric F Lock, Katherine A Hoadley, J S Marron, and Andrew B Nobel. Joint and Individual Variation Explained (JIVE) for Integrated Analysis of Multiple Data Types. The Annals of Applied Statistics, 7(1):523-542, Mar 2013.
59. Priyadip Ray, Lingling Zheng, Joseph Lucas, and Lawrence Carin. Bayesian joint analysis of heterogeneous genomics data. Bioinformatics, 30(10): 1370-6, may 2014.
60. Qing Feng, Meilei Jiang, Jan Hannig, and JS Marron. Angle-based joint and individual variation explained. Journal of Multivariate Analysis, 166:241-265, 2018.
61. Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321-377, Dec 1936.
62. Tijl De Bie, Nello Cristianini, and Roman Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129-167. Springer Berlin Heidelberg, 2005.
63. Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep Canonical Correlation Analysis. In Proc. ICML, 2013.
64. Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In Proc. ICML, 2015.
65. Weiran Wang, Raman Arora, Karen Livescu, and Nathan Srebro. Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In Proc. Allerton Conference on Communication, Control, and Computing, 2016.
66. Meina Kan, Shiguang Shan, Haihong Zhang, Shihong Lao, and Xilin Chen. Multi-view Discriminant Analysis. IEEE PAMI, 2015.
67. Sarath Chandar, Mitesh M. Khapra, Hugo Larochelle, and Balaraman Ravindran. Correlational Neural Networks. Neural Computation, 28(2):257-285, Feb 2016.
68. Xiaobin Chang, Tao Xiang, and Timothy M. Hospedales. Scalable and Effective Deep CCA via Soft Decorrelation. In Proc. CVPR, 2018.
69. Mehmet Emre Sargin, Yücel Yemez, Engin Erzin, and A. Murat Tekalp. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Transactions on Multimedia, 9(7):1396-1403, 2007.
70. Matthias Dorfer and Gerhard Widmer. Towards Deep and Discriminative Canonical Correlation Analysis. In Proc. ICML Workshop on Multi-view Representation Learning, 2016.
71 . Raman Arora and Karen Livescu. Kernel cca for multi-view learning of acoustic features using articulatory measurements. In Symposium on Machine Learning in Speech and Language Processing, 2012.
72. George Lee, Asha Singanamalli, Haibo Wang, Michael D Feldman, Stephen R Master, Natalie N C Shih, Elaine Spangler, Timothy Rebbeck, John E Tomaszewski, and Anant Madabhushi. Supervised multi-view canonical correlation analysis (sMVCCA): integrating histologic and proteomic features for predicting recurrent prostate cancer. IEEE Transactions on Medical Imaging, 34(1):284-97, Jan 2015.
73. Asha Singanamalli, Haibo Wang, George Lee, Natalie Shih, Mark Rosen, Stephen Master, John Tomaszewski, Michael Feldman, and Anant Madabhushi. Supervised multi-view canonical correlation analysis: fused multimodal prediction of disease diagnosis and prognosis. In Proc. SPIE Medical Imaging, 2014.
74. Kanghong Duan, Hongxin Zhang, and Jim Jing Yan Wang. Joint learning of cross-modal classifier and factor analysis for multimedia data classification. Neural Computing and Applications, 27(2):459-468, Feb 2016.
75. Matthias Dorfer, Jan Schlüter, Andreu Vall, Filip Korzeniowski, and Gerhard Widmer. End-to-end cross modality retrieval with CCA projections and pairwise ranking loss. International Journal of Multimedia Information Retrieval, 7(2):117-128, Jun 2018.
76. Sumit Shekhar, Vishal M Patel, Nasser M Nasrabadi, and Rama Chellappa. Joint sparse representation for robust multimodal biometrics recognition. IEEE PAMI, 36(1):113-26, Jan 2014.
77. Xing Xu, Atsushi Shimada, Rin-ichiro Taniguchi, and Li He. Coupled dictionary learning and feature mapping for cross-modal retrieval. In Proc. International Conference on Multimedia and Expo, 2015.
78. Miriam Cha, Youngjune Gwon, and H. T. Kung. Multimodal sparse representation learning and applications. arXiv preprint: 1511.06238, 2015.
79. Soheil Bahrampour, Nasser M. Nasrabadi, Asok Ray, and W. Kenneth Jenkins. Multimodal Task-Driven Dictionary Learning for Image Classification. arXiv preprint: 1502.01094, 2015.
80. Gaurav Bhatt, Piyush Jha, and Balasubramanian Raman. Common Representation Learning Using Step-based Correlation Multi-Modal CNN. arXiv preprint: 1711.00003, 2017.
81. Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated Batch Normalization. In Proc. CVPR, 2018.
82. Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
83. Natalia Y. Bilenko and Jack L. Gallant. Pyrcca: regularized kernel canonical correlation analysis in Python and its applications to neuroimaging. Frontiers in Neuroinformatics, 10, nov 2016.
84. Dongge Li, Nevenka Dimitrova, Mingkun Li, and Ishwar K. Sethi. Multimedia content processing through cross-modal association. In Proc. ACM International Conference on Multimedia, 2003.
85. Dominic Masters and Carlo Luschi. Revisiting Small Batch Training for Deep Neural Networks. arXiv preprint: 1804.07612, 2018.
86. Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. ICML, 2015.
87. Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1106-1114, 2012.
88. Matthias Dorfer, Rainer Kelz, and Gerhard Widmer. Deep linear discriminant analysis. In Proc. ICLR, 2016.
89. Jared Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deep Survival: A Deep Cox Proportional Hazards Network. arXiv preprint: 1606.00931, 2016.
90. Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proc. ECCV, 2018.
91 . Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. arXiv preprint: 1512.00809, 2015.
92. L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
93. MA Troester, Xuezheng Sun, Emma H. Allott, Joseph Geradts, Stephanie M Cohen, Chui Kit Tse, Erin L. Kirk, Leigh B Thorne, Michelle Matthews, Yan Li, Zhiyuan Hu, Whitney R. Robinson, Katherine A. Hoadley, Olufunmilayo I. Olopade, Katherine E. Reeder-Hayes, H. Shelton Earp, Andrew F. Olshan, LA Carey, and Charles M. Perou. Racial differences in PAM50 subtypes in the Carolina Breast Cancer Study. Journal of the National Cancer Institute, 2018.
94. Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proc. ICLR, 2015.
95. Joel S Parker, Michael Mullins, Maggie CU Cheang, Samuel Leung, David Voduc, Tammi Vickery, Sherri Davies, Christiane Fauron, Xiaping He, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology, 27(8):1160-1167, 2009.
It will be understood that various details of the presently disclosed subject matter may be changed without departing from the scope of the presently disclosed subject matter. Furthermore, the foregoing description is for the purpose of illustration only, and not for the purpose of limitation.

Claims

CLAIMS
What is claimed is:
1. A method for image analysis using deep learning to predict breast cancer classes, the method comprising:
receiving a test image;
generating test image instances from the test image;
inputting the test image instances into a convolutional neural network (CNN), which extracts features from each of the test image instances;
applying an instance support vector machine (SVM) to the features to predict breast cancer classes for each of the instances; and
aggregating outputs from the instance SVM to predict breast cancer classes for the test image.
2. The method of claim 1 wherein the CNN is trained using multiple instance (MI) learning and/or MI aggregation.
3. The method of claim 1 wherein the CNN is trained using genomic data sets and a task-optimal canonical correlation analysis (TOCCA) method.
4. The method of claim 1 wherein the test image comprises a hematoxylin and eosin (H&E) stained histologic image and wherein generating the test image instances includes dividing the test image into regions of a predetermined pixel size.
5. The method of claim 1 wherein inputting the test image instances into a CNN includes inputting the test image instances into a VGG16 CNN trained on an ImageNet data set.
6. The method of claim 1 wherein the instance SVM outputs a probability for each of the classes that an image instance belongs to the class.
7. The method of claim 6 wherein the breast cancer classes include tumor grade, estrogen receptor (ER) status, intrinsic subtype, risk of reoccurrence, and histologic subtype.
8. The method of claim 7 wherein aggregating the outputs from the instance SVM includes aggregating, for each class, probabilities computed for each test image instance into a quantile function and using an aggregation SVM that operates on the quantile function to generate probabilities for each of the breast cancer classes for the test image.
9. The method of claim 8 wherein the breast cancer classes comprise binary classes for each of tumor grade, ER status, intrinsic subtype, risk of reoccurrence, and histologic subtype, and wherein predicting the breast cancer classes includes assigning the test image to one of the binary classes for each of tumor grade, ER status, intrinsic subtype, risk of reoccurrence, and histologic subtype by comparing the probabilities to threshold values for the binary classes.
10. A system for image analysis using deep learning to predict breast cancer classes, the system comprising:
a computing platform including at least one processor;
a convolutional neural network (CNN)-based image classifier implemented on the computing platform for receiving a test image and generating test image instances from the test image, the CNN-based image classifier including:
a CNN for extracting features from each of the test image instances;
an instance support vector machine (SVM) for predicting, from the features, breast cancer classes for each of the instances; and
an aggregation SVM for aggregating outputs from the instance SVM to predict breast cancer classes for the test image.
11. The system of claim 10 wherein the CNN is trained using multiple instance (MI) learning and/or MI aggregation.
12. The system of claim 10 wherein the CNN is trained using genomic data sets and a task-optimal canonical correlation analysis (TOCCA) method.
13. The system of claim 10 wherein the test image comprises a hematoxylin and eosin (H&E) stained histologic image.
14. The system of claim 10 wherein the instances include regions of the test image that are of a predetermined pixel size.
15. The system of claim 10 wherein the CNN comprises a VGG16 CNN trained on an ImageNet data set.
16. The system of claim 10 wherein the instance SVM outputs a probability for each of the classes that an image instance belongs to the class.
17. The system of claim 16 wherein the breast cancer classes include tumor grade, estrogen receptor (ER) status, intrinsic subtype, risk of reoccurrence, and histologic subtype.
18. The system of claim 17 wherein the instance SVM inputs probabilities computed for each test image instance into a quantile function and the aggregation SVM operates on the quantile function to generate probabilities for each of the breast cancer classes for the test image.
19. The system of claim 18 wherein the breast cancer classes comprise binary classes for each of tumor grade, ER status, intrinsic subtype, risk of reoccurrence, and histologic subtype, and wherein the CNN-based image classifier assigns the test image to one of the binary classes for each of tumor grade, ER status, intrinsic subtype, risk of reoccurrence, and histologic subtype by comparing the probabilities to threshold values for the binary classes.
20. A non-transitory computer readable medium having stored thereon executable instructions that when executed by a processor of a computer control the computer to perform steps comprising:
receiving a test image;
generating test image instances from the test image;
inputting the test image instances into a convolutional neural network (CNN), which extracts features from each of the test image instances;
applying an instance support vector machine (SVM) to the features to predict breast cancer classes for each of the instances; and
aggregating outputs from the instance SVM to predict breast cancer classes for the test image.
PCT/US2019/041395 2018-07-11 2019-07-11 Methods, systems, and computer readable media for image analysis with deep learning to predict breast cancer classes WO2020014477A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201862696464P 2018-07-11 2018-07-11
US62/696,464 2018-07-11
US201862757746P 2018-11-08 2018-11-08
US62/757,746 2018-11-08

Publications (1)

Publication Number Publication Date
WO2020014477A1 true WO2020014477A1 (en) 2020-01-16

Family

ID=69142033

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/041395 WO2020014477A1 (en) 2018-07-11 2019-07-11 Methods, systems, and computer readable media for image analysis with deep learning to predict breast cancer classes

Country Status (1)

Country Link
WO (1) WO2020014477A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325282A (en) * 2020-03-05 2020-06-23 北京深睿博联科技有限责任公司 Mammary gland X-ray image identification method and device suitable for multiple models
CN111626989A (en) * 2020-05-06 2020-09-04 杭州迪英加科技有限公司 High-precision detection network training method for lack-of-label pathological image
CN111653355A (en) * 2020-04-16 2020-09-11 中山大学附属第六医院 Artificial intelligent prediction model for intestinal cancer peritoneal metastasis and construction method of model
EP3754611A1 (en) * 2019-06-17 2020-12-23 NVIDIA Corporation Cell image synthesis using one or more neural networks
CN112419324A (en) * 2020-11-24 2021-02-26 山西三友和智慧信息技术股份有限公司 Medical image data expansion method based on semi-supervised task driving
CN112801939A (en) * 2020-12-31 2021-05-14 杭州迪英加科技有限公司 Method for improving index accuracy of pathological image KI67
CN112820071A (en) * 2021-02-25 2021-05-18 泰康保险集团股份有限公司 Behavior identification method and device
CN112991263A (en) * 2021-02-06 2021-06-18 杭州迪英加科技有限公司 Method and equipment for improving calculation accuracy of TPS (acute respiratory syndrome) of PD-L1 immunohistochemical pathological section
CN113096079A (en) * 2021-03-30 2021-07-09 四川大学华西第二医院 Image analysis system and construction method thereof
CN113808735A (en) * 2021-09-08 2021-12-17 山西大学 Mental disease assessment method based on brain image
CN114027794A (en) * 2021-11-09 2022-02-11 新乡医学院 Pathological image breast cancer region detection method and system based on DenseNet network
WO2022061083A1 (en) * 2020-09-18 2022-03-24 Proscia Inc. Training end-to-end weakly supervised networks at the specimen (supra-image) level
US20220148178A1 (en) * 2020-11-11 2022-05-12 Agendia NV Methods of assessing diseases using image classifiers
CN114648509A (en) * 2022-03-25 2022-06-21 中国医学科学院肿瘤医院 Thyroid cancer detection system based on multi-classification task
US11423678B2 (en) 2019-09-23 2022-08-23 Proscia Inc. Automated whole-slide image classification using deep learning
WO2023052367A1 (en) * 2021-09-28 2023-04-06 Stratipath Ab System for cancer progression risk determination
CN115984622A (en) * 2023-01-10 2023-04-18 深圳大学 Classification method based on multi-mode and multi-example learning, prediction method and related device
CN116405100A (en) * 2023-05-29 2023-07-07 武汉能钠智能装备技术股份有限公司 Distortion signal restoration method based on priori knowledge
CN116881725A (en) * 2023-09-07 2023-10-13 之江实验室 Cancer prognosis prediction model training device, medium and electronic equipment
US11861881B2 (en) 2020-09-23 2024-01-02 Proscia Inc. Critical component detection using deep learning and attention
US12039720B2 (en) 2021-06-15 2024-07-16 Sony Group Corporation Automatic estimation of tumor cellularity using a DPI AI platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767946A (en) * 2017-09-26 2018-03-06 浙江工业大学 Breast cancer diagnosis system based on PCA (principal component analysis) and PSO-KE (particle swarm optimization-Key) L M (model-based regression) models
US20180157940A1 (en) * 2016-10-10 2018-06-07 Gyrfalcon Technology Inc. Convolution Layers Used Directly For Feature Extraction With A CNN Based Integrated Circuit

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180157940A1 (en) * 2016-10-10 2018-06-07 Gyrfalcon Technology Inc. Convolution Layers Used Directly For Feature Extraction With A CNN Based Integrated Circuit
CN107767946A (en) * 2017-09-26 2018-03-06 浙江工业大学 Breast cancer diagnosis system based on PCA (principal component analysis) and PSO-KE (particle swarm optimization-Key) L M (model-based regression) models

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ILIAS MAGLOGIANNIS ET AL.: "An intelligent system for automated breast cancer diagnosis and prognosis using SVM based classifiers", APPLIED INTELLIGENCE, vol. 30, no. 1, 12 July 2007 (2007-07-12), pages 24 - 36, XP019639005 *
VAISHNAVI SUBRAMANIAN ET AL.: "Correlating cellular features with gene expression using CCA", 2018 IEEE 15TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2018), 24 May 2018 (2018-05-24), pages 805 - 808, XP081218485 *
Y.IREANEUS ANNA REJANI ET AL.: "Early Detection of Breast Cancer using SVM Classifier Technique", INTERNATIONAL JOURNAL ON COMPUTER SCIENCE AND ENGINEERING, vol. 1, no. 3, 2009, pages 127 - 130, XP055676413, ISSN: 0975-3397 *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3754611A1 (en) * 2019-06-17 2020-12-23 NVIDIA Corporation Cell image synthesis using one or more neural networks
US11462032B2 (en) 2019-09-23 2022-10-04 Proscia Inc. Stain normalization for automated whole-slide image classification
US11423678B2 (en) 2019-09-23 2022-08-23 Proscia Inc. Automated whole-slide image classification using deep learning
CN111325282B (en) * 2020-03-05 2023-10-27 北京深睿博联科技有限责任公司 Mammary gland X-ray image identification method and device adapting to multiple models
CN111325282A (en) * 2020-03-05 2020-06-23 北京深睿博联科技有限责任公司 Mammary gland X-ray image identification method and device suitable for multiple models
CN111653355A (en) * 2020-04-16 2020-09-11 中山大学附属第六医院 Artificial intelligent prediction model for intestinal cancer peritoneal metastasis and construction method of model
CN111653355B (en) * 2020-04-16 2023-12-26 中山大学附属第六医院 Intestinal cancer peritoneal metastasis artificial intelligent prediction model and construction method thereof
CN111626989A (en) * 2020-05-06 2020-09-04 杭州迪英加科技有限公司 High-precision detection network training method for lack-of-label pathological image
CN111626989B (en) * 2020-05-06 2022-07-22 杭州迪英加科技有限公司 High-precision detection network training method for lack-of-label pathological image
WO2022061083A1 (en) * 2020-09-18 2022-03-24 Proscia Inc. Training end-to-end weakly supervised networks at the specimen (supra-image) level
US11861881B2 (en) 2020-09-23 2024-01-02 Proscia Inc. Critical component detection using deep learning and attention
US11954859B2 (en) * 2020-11-11 2024-04-09 Agendia NV Methods of assessing diseases using image classifiers
WO2022101672A3 (en) * 2020-11-11 2022-08-11 Agendia NV Method of assessing diseases using image classifiers
US20220148178A1 (en) * 2020-11-11 2022-05-12 Agendia NV Methods of assessing diseases using image classifiers
CN112419324A (en) * 2020-11-24 2021-02-26 山西三友和智慧信息技术股份有限公司 Medical image data expansion method based on semi-supervised task driving
CN112419324B (en) * 2020-11-24 2022-04-19 山西三友和智慧信息技术股份有限公司 Medical image data expansion method based on semi-supervised task driving
CN112801939A (en) * 2020-12-31 2021-05-14 杭州迪英加科技有限公司 Method for improving index accuracy of pathological image KI67
CN112801939B (en) * 2020-12-31 2022-07-22 杭州迪英加科技有限公司 Method for improving index accuracy of pathological image KI67
CN112991263B (en) * 2021-02-06 2022-07-22 杭州迪英加科技有限公司 Method and equipment for improving TPS (tumor proportion score) calculation accuracy of PD-L1 immunohistochemical pathological section
CN112991263A (en) * 2021-02-06 2021-06-18 杭州迪英加科技有限公司 Method and equipment for improving calculation accuracy of TPS (tumor proportion score) of PD-L1 immunohistochemical pathological section
CN112820071A (en) * 2021-02-25 2021-05-18 泰康保险集团股份有限公司 Behavior identification method and device
CN113096079A (en) * 2021-03-30 2021-07-09 四川大学华西第二医院 Image analysis system and construction method thereof
CN113096079B (en) * 2021-03-30 2023-12-29 四川大学华西第二医院 Image analysis system and construction method thereof
US12039720B2 (en) 2021-06-15 2024-07-16 Sony Group Corporation Automatic estimation of tumor cellularity using a DPI AI platform
CN113808735A (en) * 2021-09-08 2021-12-17 山西大学 Mental disease assessment method based on brain image
CN113808735B (en) * 2021-09-08 2024-03-12 山西大学 Mental disease assessment method based on brain image
WO2023052367A1 (en) * 2021-09-28 2023-04-06 Stratipath Ab System for cancer progression risk determination
CN114027794A (en) * 2021-11-09 2022-02-11 新乡医学院 Pathological image breast cancer region detection method and system based on DenseNet network
CN114648509A (en) * 2022-03-25 2022-06-21 中国医学科学院肿瘤医院 Thyroid cancer detection system based on multi-classification task
CN115984622A (en) * 2023-01-10 2023-04-18 深圳大学 Classification method based on multimodal multi-instance learning, prediction method and related device
CN115984622B (en) * 2023-01-10 2023-12-29 深圳大学 Multimodal multi-instance learning classification method, prediction method and related device
CN116405100A (en) * 2023-05-29 2023-07-07 武汉能钠智能装备技术股份有限公司 Distortion signal restoration method based on priori knowledge
CN116405100B (en) * 2023-05-29 2023-08-22 武汉能钠智能装备技术股份有限公司 Distortion signal restoration method based on priori knowledge
CN116881725A (en) * 2023-09-07 2023-10-13 之江实验室 Cancer prognosis prediction model training device, medium and electronic equipment
CN116881725B (en) * 2023-09-07 2024-01-09 之江实验室 Cancer prognosis prediction model training device, medium and electronic equipment

Similar Documents

Publication Publication Date Title
WO2020014477A1 (en) Methods, systems, and computer readable media for image analysis with deep learning to predict breast cancer classes
Couture et al. Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype
US20240069026A1 (en) Machine learning for digital pathology
Cui et al. A deep learning algorithm for one-step contour aware nuclei segmentation of histopathology images
US11449985B2 (en) Computer vision for cancerous tissue recognition
Kumar et al. Convolutional neural networks for prostate cancer recurrence prediction
Song et al. Low dimensional representation of fisher vectors for microscopy image classification
Bai et al. NHL Pathological Image Classification Based on Hierarchical Local Information and GoogLeNet‐Based Representations
Yin et al. Histopathological distinction of non-invasive and invasive bladder cancers using machine learning approaches
Popovici et al. Joint analysis of histopathology image features and gene expression in breast cancer
US11544851B2 (en) Systems and methods for mesothelioma feature detection and enhanced prognosis or response to treatment
JP2023543044A (en) Method of processing images of tissue and system for processing images of tissue
Khan et al. Gene transformer: Transformers for the gene expression-based classification of lung cancer subtypes
US20230070874A1 (en) Learning representations of nuclei in histopathology images with contrastive loss
US20240054639A1 (en) Quantification of conditions on biomedical images across staining modalities using a multi-task deep learning framework
Prezja et al. Improved accuracy in colorectal cancer tissue decomposition through refinement of established deep learning solutions
Singh et al. STRAMPN: Histopathological image dataset for ovarian cancer detection incorporating AI-based methods
Lin et al. SGCL: Spatial guided contrastive learning on whole-slide pathological images
Zamanitajeddin et al. Social network analysis of cell networks improves deep learning for prediction of molecular pathways and key mutations in colorectal cancer
Zhang et al. Multi‐feature fusion of deep networks for mitosis segmentation in histological images
Thapa et al. Deep learning for breast cancer classification: Enhanced tangent function
WO2021041342A1 (en) Semantic image retrieval for whole slide images
Nimitha et al. An improved deep convolutional neural network architecture for chromosome abnormality detection using hybrid optimization model
Song et al. Visual feature representation in microscopy image classification
Graziani et al. Attention-based interpretable regression of gene expression in histology

Legal Events

Code Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19834643; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in the European phase (Ref document number: 19834643; Country of ref document: EP; Kind code of ref document: A1)