CN117999586A

CN117999586A - Systems and methods for identifying pancreatic ductal adenocarcinoma molecular subtypes

Info

Publication number: CN117999586A
Application number: CN202280044722.XA
Authority: CN
Inventors: C·塞拉德; B·施莫奇; V·奥伯特; K·奥雷利; M·拉克鲁瓦-缇基; I·加尔贝里斯; D·德鲁贝; F·安德烈; J·克罗斯
Original assignee: Aojin Co ltd
Current assignee: Aojin Co ltd
Priority date: 2021-05-07
Filing date: 2022-05-06
Publication date: 2024-05-07

Abstract

A deep learning model for predicting one or more features of pancreatic ductal adenocarcinoma from histopathological section images is provided.

Description

Systems and methods for identifying pancreatic ductal adenocarcinoma molecular subtypes

Cross Reference to Related Applications

The present application claims priority to European patent application No. EP21305595.7, filed May 7, 2021; EP21305656.7, filed May 19, 2021; and EP21306599.8, filed November 17, 2021. The entire contents of the above-mentioned priority applications are incorporated herein by reference.

Technical Field

The present invention relates generally to machine learning and computer vision, and more particularly to image preprocessing and classification.

Background

Histopathological image analysis (HIA) is a key element of diagnosis in many medical fields, especially oncology. Pancreatic ductal adenocarcinoma (abbreviated as "PAC" or "PDA") is expected to be the second leading cause of cancer death by 2030, and its prognosis has barely improved over the past few decades. PAC is a highly heterogeneous tumor with a prominent stroma and multiple histological aspects. Genomic and proteomic studies have demonstrated the molecular heterogeneity of PAC, which may be one of the factors explaining the failure of most clinical trials. Transcriptomic subtypes of PAC have been described as having significant prognostic and predictive value. For example, Rashid et al., Clinical Cancer Research (2020); 26:82-92, describe a single-sample classifier for PAC subtyping named Purity Independent Subtyping of Tumors (PurIST), which is based on gene expression data from RNAseq, NanoString, or microarrays. Both PurIST subtypes (classical and basal-like) correlate meaningfully with patient prognosis and therapeutic response. Among tumor cells, the basal-like subtype is defined by a poorer prognosis, associated with early metastasis and FOLFIRINOX resistance, compared to the classical subtype, which is characterized by an ancestral epithelial phenotype. Within the stroma, the activated stroma is enriched in non-organized, pro-tumor cancer-associated fibroblasts with little extracellular matrix, whereas the inactive stroma is characterized by abundant, dense collagen secreted by more quiescent myofibroblasts. Furthermore, Puleo et al., Gastroenterology (2018), 155:1999-2013, describe a PDA classification system based on gene expression analysis of formalin-fixed PDA samples. This classification system is based on molecular components extracted by independent component analysis of transcriptome data and includes, in particular, 4 components named "classical", "basal", "interstitial activity" (activated stroma) and "interstitial inactivity" (inactive stroma) based on their correlation with biological signals. The PurIST subtypes of Rashid et al. and the molecular components of Puleo et al. are both defined by gene expression, e.g., by RNA profiling. These methods are limited by the quantity and quality of samples (formalin-fixed and of low cellularity) and by assay turnaround time, which can limit their use in routine care. Furthermore, tumors may mix several subtypes, which complicates their interpretation by bulk transcriptomic methods and limits the clinical value of those methods. A recent study has shown that tumor cell architecture (i.e., gland formation) can partially predict the tumor cell transcriptomic subtype in primary resected tumors. This approach, while of great interest, requires a trained pathologist and analysis of the entire tumor.

Disclosure of Invention

Methods, systems, and devices for image classification and uses thereof are described herein.

In one aspect, disclosed herein is a computer-implemented method for processing a digital image of a pancreatic ductal adenocarcinoma (PDA) sample, the method comprising receiving a digital image of a PDA sample derived from a subject, applying a machine learning model to the digital image, and determining a PDA subtype of the image using the machine learning model, wherein the machine learning model has been trained to predict PDA subtypes by processing a plurality of training images, and wherein the training images include global labels indicating known PDA subtypes.

In some embodiments, the digital image is an image of an H & E stained section of a PDA sample.

In some embodiments, known PDA subtypes are assigned based on gene expression profiling (e.g., RNAseq or NanoString) of PDA samples obtained from the same source as the training images. In some embodiments, the known PDA subtypes are classified according to the PurIST classification scheme. In some embodiments, the known PDA subtypes are classical and/or basal-like. In some embodiments, the known PDA subtypes are classified according to a molecular subtype classification scheme. In some embodiments, the known PDA subtypes are classified according to a molecular component profiling scheme. In some embodiments, the known PDA subtype is classical, basal-like, interstitial active or interstitial inactive. In other embodiments, the known PDA subtypes include a continuous score assigned to each training image, which corresponds to one or both of the following classifications: classical and basal-like. In other embodiments, the known PDA subtypes include a continuous score assigned to each training image, which corresponds to one or more, two or more, three or more, or four of the following classifications: classical, basal-like, interstitial active or interstitial inactive.

In some embodiments, the step of determining the PDA subtype of the image includes determining a continuous score that represents a likelihood that the tissue represented in the image belongs to one of two PDA subtypes. For example, the model may generate a score between a first value and a second value for the image (or for each tile derived from the image), wherein a score closer to the first value indicates a higher likelihood that the tissue represented in the image belongs to the first subtype and a score closer to the second value indicates a higher likelihood that the tissue represented in the image belongs to the second subtype. In some embodiments described herein, the model generates a score for the image between 0 and 1. As the score approaches 0, the model predicts a higher likelihood that the tissue represented in the image belongs to a first subtype (e.g., classical), and as the score approaches 1, the model predicts a higher likelihood that the tissue represented in the image belongs to a second subtype (e.g., basal-like). In this example, an image assigned a score of 0.9 is more likely to contain tissue of the basal-like subtype than an image assigned a score of 0.7. Similarly, an image assigned a score of 0.2 is more likely to contain tissue of the classical subtype than an image assigned a score of 0.4. An image assigned a score of 0.5 is approximately equally likely to contain tissue of the classical subtype and of the basal-like subtype.
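The 0-to-1 scoring convention described above can be illustrated with a minimal Python sketch. The function name `interpret_subtype_score`, the label "indeterminate" for a score of exactly 0.5, and the returned "strength" value are illustrative assumptions and are not part of the disclosed model.

```python
def interpret_subtype_score(score, first="classical", second="basal-like"):
    """Map a score in [0, 1] to the more likely subtype.

    Assumption for illustration: 0 corresponds to the first subtype and
    1 to the second subtype, as in the example in the text.
    Returns (label, strength), where strength is the distance-based
    support for the returned label (hypothetical convenience value).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score == 0.5:
        # Both subtypes are approximately equally likely.
        return "indeterminate", 0.5
    if score > 0.5:
        return second, score
    return first, 1.0 - score


if __name__ == "__main__":
    for s in (0.9, 0.7, 0.5, 0.4, 0.2):
        print(s, interpret_subtype_score(s))
```

Under these assumptions, a score of 0.9 maps to the basal-like label with more support than a score of 0.7, mirroring the comparison made in the paragraph above.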

In other embodiments, the step of determining the PDA subtype for the image includes determining one or more scores representing one or more PDA subtype features of the tissue sample represented in the image. For example, the model may assign to the image (or to each tile derived from the image) a value quantifying how strongly a particular subtype feature is represented. In this way, by compiling the scores assigned to the individual PDA subtype features within the image, the molecular profile of the PDA sample can be determined. For example, in some embodiments, the model may determine a score for each of the following PDA subtype features within the image (or tile derived therefrom): classical, basal, interstitial activity, interstitial inactivity. Thus, in this embodiment, the model will generate four scores, each score indicating the extent to which the tissue sample represented in the image (or tile) contains classical, basal, interstitial activity or interstitial inactivity features. For example, the model may assign a classical feature score between 0 and 1 to each image or tile derived therefrom, where a score near 0 indicates that the tissue sample represented in the image contains very few classical features, and a score near 1 indicates that the tissue sample represented in the image contains many classical features. Furthermore, the model may assign a basal feature score between 0 and 1, an interstitial activity feature score between 0 and 1, and an interstitial inactivity feature score between 0 and 1 to the image or tiles derived therefrom. In this way, the molecular profile of the sample may be determined by evaluating the scores and determining that the PDA sample has, for example, a high representation of basal and interstitial activity features and a low representation of classical and interstitial inactivity features. The foregoing description presents a general framework for the models provided herein. The range of possible scores assigned by the model to each PDA subtype feature may be adjusted by, for example, the labels assigned to the training images.
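As a hedged illustration of how four component scores could be compiled into a molecular profile, the following Python sketch ranks the components and splits them into high and low representation. The dictionary keys, the function name `summarize_profile`, and the 0.5 cut-off are assumptions made for this example only; the text does not specify a threshold.

```python
# Component names follow the four features listed in the text; the exact
# identifiers are hypothetical.
COMPONENTS = ("classical", "basal", "interstitial_active", "interstitial_inactive")


def summarize_profile(scores, high=0.5):
    """Rank the four component scores and split them at an assumed cut-off."""
    missing = [c for c in COMPONENTS if c not in scores]
    if missing:
        raise KeyError(f"missing component scores: {missing}")
    for name in COMPONENTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must lie in [0, 1]")
    # Highest-scoring component first.
    ranked = sorted(COMPONENTS, key=lambda c: scores[c], reverse=True)
    return {
        "ranked": ranked,
        "high": [c for c in ranked if scores[c] >= high],
        "low": [c for c in ranked if scores[c] < high],
    }


if __name__ == "__main__":
    example = {
        "classical": 0.1,
        "basal": 0.8,
        "interstitial_active": 0.7,
        "interstitial_inactive": 0.2,
    }
    print(summarize_profile(example))
```

With the example scores shown, the sketch reports basal and interstitial activity as highly represented and classical and interstitial inactivity as weakly represented, which corresponds to the profile used as an example in the text.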

The foregoing method may further comprise one or more additional image processing or preprocessing steps as provided herein, including, but not limited to, (i) selecting one or more tumor tissue fragments present in the image; (ii) tiling tumor tissue fragments into a collection of tiles; and/or (iii) performing feature extraction on the set of tiles to extract the set of features.

In some aspects, provided herein are computer-implemented methods for processing digital images of pancreatic ductal adenocarcinoma (PAC) samples, the methods comprising receiving digital images of PAC samples obtained from a subject, applying a machine learning model to the digital images, and determining PAC classifications of the images using the machine learning model, wherein the machine learning model has been trained by processing a plurality of training images.

In some embodiments, the digital image is a whole slide image (WSI).

In some embodiments, the method further comprises one or more image preprocessing steps.

For example, in some embodiments, the image preprocessing step includes one or more (i.e., one, two, or three) of: (a) removing a background segment from the image; (b) tiling the digital image into a set of tiles; and (c) performing feature extraction on the set of tiles, or a subset thereof, to extract a set of features from the set of tiles. In some embodiments, the image preprocessing step includes (a), (b), and (c).
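The three preprocessing operations can be sketched in plain Python on a toy grayscale image. This is a simplified stand-in under stated assumptions: the disclosure describes background removal with a convolutional neural network and feature extraction with a learned encoder, whereas this sketch substitutes a brightness threshold and two summary statistics, and it drops background at the tile level after tiling. All function names and the threshold of 220 are hypothetical.

```python
def tile_image(image, tile_size):
    """Split a 2D grayscale image (list of rows) into non-overlapping tiles."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tile = [row[left:left + tile_size] for row in image[top:top + tile_size]]
            tiles.append(((top, left), tile))
    return tiles


def is_background(tile, threshold=220):
    """Stand-in background detector: a tile is background if it is bright."""
    pixels = [p for row in tile for p in row]
    return sum(pixels) / len(pixels) >= threshold


def extract_features(tile):
    """Stand-in feature extractor: mean and variance of pixel intensity."""
    pixels = [p for row in tile for p in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, variance]


def preprocess(image, tile_size, threshold=220):
    """Tile the image, drop background tiles, and extract features per tile."""
    kept = [
        (position, tile)
        for position, tile in tile_image(image, tile_size)
        if not is_background(tile, threshold)
    ]
    return [(position, extract_features(tile)) for position, tile in kept]


if __name__ == "__main__":
    # Left half is "tissue" (intensity 100); right half is white background.
    toy_image = [[100, 100, 255, 255] for _ in range(4)]
    print(preprocess(toy_image, tile_size=2))
```

In practice the tile coordinates returned alongside each feature vector would let later steps map tile-level predictions back onto the image.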

In some embodiments, PAC classification is done at the slice level. In other embodiments, PAC classification is at the tile level.

In some embodiments, PAC classification classifies a tile as containing a neoplastic region or not containing a neoplastic region, where the neoplastic region may include tumor cells and/or interstitial cells associated with a tumor. In some embodiments, the PAC classification is a continuous score representing a likelihood that the PAC sample represented in the tile contains a neoplastic region.

In some embodiments, PAC classification classifies a tile as comprising tumor cells or comprising an interstitial region. In some embodiments, the PAC classification is a continuous score representing a likelihood that the PAC sample represented in the tile contains tumor cells or contains an interstitial region.

In some embodiments, the PAC classification includes a continuous score representing a likelihood that the PAC sample represented in the image belongs to one of two PDA subtypes.

In some embodiments, the continuous score represents a likelihood that the PAC sample represented in the image belongs to a PurIST PAC subtype selected from classical and basal-like.

In some embodiments, the PAC classification includes one or more continuous scores representing a prevalence of one or more PDA subtype features in the tissue samples represented in the tiles.

In some embodiments, the PAC classification includes one or more continuous scores reflecting the extent to which the tiles belong to one or more molecular PAC subtypes selected from classical, basal, interstitial activity, and interstitial inactivity.

In some embodiments, the PAC classification includes four continuous scores, each score representing a prevalence of a PDA subtype feature selected from classical, basal, interstitial active, and interstitial inactive in the tissue samples represented in the tiles.

In an exemplary embodiment, the present disclosure relates to a computer-implemented method of determining a Pancreatic Ductal Adenocarcinoma (PDA) subtype corresponding to a PDA classification scheme for a subject with PDA. The method may include receiving a digital image of a histological section of a PDA sample from the subject, and preprocessing the image to select one or more tumor tissue fragments present in the image. The tumor tissue fragments may include epithelial tumor cells and interstitial regions. The method may also include tiling the tumor tissue segments into a set of tiles, and performing feature extraction on the set of tiles to extract a set of features from the set of tiles. The method may further include determining PDA subtypes for each tumor tissue segment from the set of features using a machine learning model. The machine learning model may be trained on PDA classification schemes, and each PDA subtype of one or more tumor tissue segments may be a PDA subtype of a PDA classification scheme. In some embodiments, the method may further include calculating one or more PDA sub-component scores for each tile in the set of tiles using a machine learning model. The machine learning model may also be trained to calculate a score for each PDA molecular component, including: classical, basal, interstitial activity, interstitial inactivity.

In some embodiments, the PDA classification scheme is one of a plurality of PDA classification schemes. In some embodiments, each PDA classification scheme comprises a plurality of possible PDA subtypes.

In some embodiments, histological sections of PDA samples have been stained with a dye. In some embodiments, the dye is hematoxylin and eosin (H & E).

In some embodiments, the digital image is a whole slide image (WSI).

In some embodiments, the PDA sample is a primary pancreatic ductal adenocarcinoma or a portion thereof. In some embodiments, the PDA sample is metastatic pancreatic ductal adenocarcinoma or a portion thereof. In some embodiments, the metastatic pancreatic ductal adenocarcinoma or portion thereof is obtained from the liver of a subject.

In some embodiments, the preprocessing step comprises: (i) removing background segments from the image, and/or (ii) removing non-tumor tissue segments from the image. In some embodiments, removing background segments from the image is performed using a convolutional neural network. In some embodiments, removing non-tumor tissue segments from the image is performed by a model trained to distinguish between neoplastic and normal areas in the PDA. In some embodiments, the preprocessing step comprises (i) and (ii).

In some embodiments, feature extraction is performed using Momentum Contrast (MoCo) or Momentum Contrast v2 (MoCo v2), as taught in Dehaene et al., Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology, December 7, 2020.

In some embodiments, the PDA classification scheme is one of the PurIST classification scheme and the molecular component profiling scheme. The PurIST classification may include classical (also referred to as "classic") and basal-like (also referred to as "basal") subtypes. The molecular component profiling scheme may include classical, basal, interstitial active and interstitial inactive components. In some embodiments, the PDA classification scheme assigns one of the foregoing categories to a PDA image. In some embodiments, the PDA classification scheme assigns to the PDA image a continuous score representing each subtype and/or each component. In other embodiments, the PDA classification scheme assigns a continuous score representing each subtype and/or each component to each tile within the PDA image.

In some embodiments, determining the PDA subtype for each tumor tissue segment includes analyzing the set of features extracted from the set of tiles using the machine learning model to generate a subtype score corresponding to each tile in the set of tiles. In some embodiments, determining the PDA subtype includes calculating a slice-level PurIST score based on an analysis of the set of features extracted from the set of tiles. In some embodiments, the machine learning model has been trained using a plurality of training images, including digital images of histological sections of PDA samples from subjects whose PDA subtype under the PDA classification scheme is known. In some embodiments, the training images each include a global label indicating a known PDA subtype.

In some embodiments, the PDA classification scheme is PurIST and the global label is one of the classical and basal-like PurIST PDA subtypes. In some embodiments, the machine learning model is a deep multiple instance learning model. In some embodiments, the PDA classification scheme is the molecular component profiling scheme and the global label is a value for each of the classical, basal, interstitial active, and interstitial inactive molecular components. In some embodiments, the machine learning model is a WELDON model.

In some embodiments, the gene expression profile of a PDA sample is used to identify known PDA subtypes. In some embodiments, the gene expression profile comprises RNAseq data or NanoString data. In some embodiments, the method may further include aggregating the component scores corresponding to each of the plurality of tiles to generate a component score corresponding to the digital image, wherein the component score corresponding to the digital image indicates the molecular component having the highest prediction score.
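The aggregation of tile-level component scores into an image-level score can be sketched as follows. The text does not state which aggregation function is used, so the arithmetic mean here is an assumption, as are the function name `aggregate_component_scores` and the dictionary layout.

```python
from statistics import mean


def aggregate_component_scores(tile_scores):
    """Aggregate per-tile component scores into image-level scores.

    tile_scores: list of dicts mapping component name -> score for one tile.
    Returns (image_scores, top_component). The mean is an assumed
    aggregation; other choices (median, top-k pooling) are equally possible.
    """
    if not tile_scores:
        raise ValueError("at least one tile is required")
    components = list(tile_scores[0].keys())
    image_scores = {
        name: mean(tile[name] for tile in tile_scores) for name in components
    }
    # The image-level result indicates the component with the highest score.
    top_component = max(image_scores, key=image_scores.get)
    return image_scores, top_component


if __name__ == "__main__":
    tiles = [
        {"classical": 0.2, "basal": 0.8},
        {"classical": 0.4, "basal": 0.6},
    ]
    print(aggregate_component_scores(tiles))
```

Returning both the full score dictionary and the top component keeps the continuous information available while still providing the single label described in the text.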

In some embodiments, the method further includes superimposing the digital image with information representative of the component score of each tile in the set of tiles to generate a digital image labeled with information representative of the component score of each tile in the set of tiles. In some embodiments, the information representative of the component score of each tile includes a label indicating the molecular component with the highest predicted score of one or more tumor tissue fragments contained in the tile.

In some embodiments, the method further comprises selecting one or more PDA classification schemes.

Also provided herein are digital images of histological sections of PDA samples, wherein tumor tissue fragments within the images contain tags associating one or more tumor tissue fragments with one or more of the plurality of PDA subtypes, wherein the digital images are generated according to the computer-implemented methods set forth herein.

In some embodiments, the method further comprises: analyzing all PDA molecular component scores corresponding to an individual tumor of a patient; determining the proportion of sections of the individual tumor corresponding to the different PDA molecular component scores; and generating a tumor-level PDA molecular component score based on the proportion of sections of the individual tumor corresponding to the different PDA molecular component scores.
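One possible reading of this tumor-level step is sketched below: each section of a tumor is assumed to already carry the label of its highest-scoring component, and the tumor-level result is the proportion of sections per label. The function name `tumor_level_proportions` and the use of a single label per section are assumptions for illustration, not the disclosed procedure.

```python
from collections import Counter


def tumor_level_proportions(section_labels):
    """Return the fraction of a tumor's sections assigned to each component.

    section_labels: list with one (assumed) dominant-component label per
    section of the same tumor.
    """
    if not section_labels:
        raise ValueError("at least one section is required")
    counts = Counter(section_labels)
    total = len(section_labels)
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    labels = ["classical", "classical", "basal", "classical"]
    proportions = tumor_level_proportions(labels)
    print(proportions)
    # The dominant tumor-level component is the one with the largest share.
    print(max(proportions, key=proportions.get))
```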

In some aspects, provided herein is a machine-readable medium having executable instructions to cause one or more processing units to perform a method for processing digital images of pancreatic ductal adenocarcinoma (PAC) samples, the method comprising receiving digital images of PAC samples obtained from a subject, applying a machine learning model to the digital images, and determining PAC classifications of the images using the machine learning model, wherein the machine learning model has been trained by processing a plurality of training images.

In some embodiments, the digital image is a whole slide image (WSI).

In some embodiments, PAC classification classifies a tile as containing a neoplastic region or not, wherein the neoplastic region may include tumor cells and/or interstitial cells associated with a tumor. In some embodiments, the PAC classification is a continuous score representing a likelihood that PAC samples represented in the tiles contain a neoplastic region.

In some aspects, provided herein is a machine-readable medium having executable instructions to cause one or more processing units to perform a method for determining Pancreatic Ductal Adenocarcinoma (PDA) subtypes corresponding to PDA classification schemes for subjects with PDA, the method comprising: receiving a digital image of a histological section of a PDA sample obtained from a subject; preprocessing the image to extract a set of features, wherein the preprocessing comprises: tiling the digital image into a set of tiles and performing feature extraction on the set of tiles to extract a set of features from the set of tiles; selecting a subset of tiles representing one or more tumor tissue fragments, wherein the subset of tiles comprises a subset of features and the one or more tumor tissue fragments may comprise epithelial tumor cells and interstitial regions; and determining PDA subtypes of the digital image from at least a subset of the features using a machine learning model, wherein the machine learning model is trained for a PDA classification scheme and each PDA subtype for one or more tumor tissue segments is a PDA subtype of the PDA classification scheme.

Drawings

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.

FIG. 1 illustrates an example flow chart of a method of applying a trained machine learning model to predict PDA molecular subtype scores in accordance with an embodiment of the present disclosure.

Fig. 2 illustrates an example flowchart of a method of training and validating a DL model according to an embodiment of the present disclosure.

Fig. 3 illustrates an example flowchart of a method of determining Pancreatic Ductal Adenocarcinoma (PDA) subtypes corresponding to PDA classification schemes for subjects with PDA, according to an embodiment of the present disclosure.

Fig. 4 illustrates an example flowchart of a method of generating tumor-level PDA subtype scores according to an embodiment of the present disclosure.

Figs. 5A-5C are graphs illustrating validation results of a trained DL model according to some embodiments of the present disclosure.

Fig. 6 is a graph illustrating univariate/binary overall survival in the BJN cohort according to an embodiment of the present disclosure.

FIG. 7 is a graph illustrating multivariate overall survival in the TCGA-PAAD cohort according to an embodiment of the invention.

FIG. 8A illustrates an example set of basal-like tiles according to some embodiments of the present disclosure.

Fig. 8B illustrates an example set of classical tiles according to some embodiments of the present disclosure.

FIG. 9 illustrates an example of a computer system that may be used in connection with the embodiments described herein.

Fig. 10 presents a schematic view of the workflow detailed in Example 1.

Figs. 11A-11C provide flow charts illustrating the study design described in Example 1. Fig. 11A provides a description of the cohorts. The discovery cohort consisted of 202 patients (surgical specimens) from 3 centers. Tissue cores (600 µm diameter) were removed from the blocks for RNA profiling. HES sections (at least 2 per tumor) were digitized for PACpAInt analysis. In most cases, the tissue cores and the HES sections are not from the same block. The workflow in the first validation cohort, BJN-Unmatched (surgical specimens), is similar. For the next 2 validation cohorts (BJN-Matched (surgical specimens) and EUS-FNB (liver metastases, fine needle biopsies)), RNA extraction was performed after microdissection to select neoplastic areas, using the same blocks that were used to generate the HES sections digitized and analyzed with PACpAInt. In addition, in the BJN-Matched cohort, all remaining tumor sections were also digitized for PACpAInt analysis. Finally, in the TCGA-PAAD validation cohort (surgical specimens), in contrast to all other cohorts, RNA was extracted from frozen material rather than formalin-fixed, paraffin-embedded material. As in the discovery cohort, the tissue analyzed by RNAseq is spatially unmatched to the digitized slice. Fig. 11B provides a flow chart of slice-level prediction. At the whole-slice level (global classification of the whole slice), the multi-step PACpAInt model first identifies the neoplastic region (PACpAInt-Neo module) and then evaluates the basal-like/classical status (PACpAInt-B/C module) or the molecular components (PACpAInt-Comp module). Fig. 11C provides a flow chart of tile-level prediction. In this setup, all tiles (small squares, 112 µm wide) are analyzed and reported individually. The multi-step PACpAInt model first identifies neoplastic tiles (PACpAInt-Neo module), then distinguishes tumor cells from stroma (PACpAInt-Cell Type module), and then evaluates the molecular components (PACpAInt-Comp module), allowing an in-depth study of intratumoral heterogeneity.

FIGS. 12A-12C depict the identification of neoplastic regions by PACpAInt-Neo. Fig. 12A graphically depicts the performance of PACpAInt in identifying neoplastic regions in the BJN and TCGA-PAAD validation cohorts. Fig. 12B provides 2 example cases (top and bottom) showing the H & E image (left), the PACpAInt-Neo segmentation (right), and magnified views (center) of neoplastic (red, upper panels) and non-neoplastic (green, lower panels) regions. Fig. 12C presents representative tiles identified by PACpAInt-Neo as neoplastic and non-neoplastic in the TCGA-PAAD validation cohort.

Fig. 13A presents representative tiles identified by PACpAInt as classical or basal-like in the BJN validation cohort. FIG. 13B graphically depicts the performance of PACpAInt in identifying molecular subtypes in the "BJN-Unmatched" validation cohort (surgical specimens) (i.e., the section being analyzed is spatially unmatched to the tissue used for RNAseq).

FIG. 14A presents representative tiles identified by PACpAInt-B/C as classical or basal-like in the TCGA-PAAD validation cohort. FIG. 14B graphically depicts the performance of PACpAInt-B/C in identifying molecular subtypes at the whole-slice level in the TCGA-PAAD validation cohort.

Figure 15 graphically depicts the performance of PACpAInt in identifying molecular subtypes in the "BJN-Matched" validation cohort (surgical specimens) (i.e., the section being analyzed is spatially matched to the tissue used for RNAseq).

Figure 16 graphically depicts PACpAInt performance in identifying molecular subtypes in liver Fine Needle Biopsies (FNB).

Fig. 17A presents the results of the multivariate analysis of clinical/pathological factors and PACpAInt, demonstrating the independent prognostic value of the latter for overall survival. FIG. 17B presents the results of a multivariate analysis of clinical/pathological factors and PACpAInt-B/C for disease-free survival in the BJN validation cohort. ***: p < 0.001; **: p < 0.01; *: p < 0.05; p < 0.1; p > 0.1.

FIG. 18A presents the results of the multivariate analysis of RNA-defined molecular subtypes (PurIST-RNA) for overall survival (BJN validation cohort). FIG. 18B presents the results of the multivariate analysis of RNA-defined molecular subtypes (PurIST-RNA) for disease-free survival (BJN validation cohort). ***: p < 0.001; **: p < 0.01; *: p < 0.05; p < 0.1; p > 0.1.

Fig. 19 depicts the application of PACpAInt to all tumor sections (n=660) of 77 cases defined as classical by RNAseq on a single sample. Upper graph: the Y-axis represents the PACpAInt score assessing the "basal-likeness" of each slice, while the patients (1 to 77) are aligned along the X-axis. Each point represents a slice. Cases in which all sections showed a low PACpAInt score (< 0.2) are referred to as "pure" classical, whereas more heterogeneous tumors are referred to as "mixed" classical. Lower graph: Kaplan-Meier overall survival analysis comparing "pure" and "mixed" classical tumors. * P <0.01.

Fig. 20A presents the performance of PACpAInt in identifying tumor and stromal cells in the BJN (top) and TCGA-PAAD (bottom) validation cohorts. Fig. 20B presents representative tiles identified by PACpAInt as tumor cells or stroma in the TCGA-PAAD validation cohort. Fig. 20C presents the correlation between the tumor cell/stroma ratio calculated from PACpAInt (y-axis) and from pan-cytokeratin immunohistochemistry (x-axis).

Fig. 21 presents a multivariate analysis of clinical/pathological factors and the tumor/interstitial ratio calculated by PACpAInt-Cell Type for disease-free survival (left) and overall survival (right) in the BJN validation cohort. ***: p < 0.001; **: p < 0.01; *: p < 0.05; p < 0.1; p > 0.1.

Fig. 22 presents the slice-level correlation between the tumor and interstitial components defined by RNAseq or by PACpAInt in the BJN-Unmatched (left panel) and BJN-Matched (right panel) validation series.

Figure 23 presents the PACpAInt tumor and interstitial scores in tiles identified as classical, basal-like, interstitial active or interstitial inactive (analysis of 100K tiles).

FIG. 24 presents the correlation between median interstitial and epithelial scores per slice for PACpAInt-Comp.

Fig. 25 presents a new classification into four subtypes based on the PACpAInt tile scores of the classical and basal components: predominantly classical (classical), intermediate, mixed and predominantly basal-like (basal). For each listed patient, first the 99th percentile of the basal and classical scores is shown, and second the proportions of basal and classical differentiated tumor tiles for the different classes are shown.

Fig. 26 presents Kaplan-Meier analyses comparing overall survival of predominantly classical, intermediate, mixed and predominantly basal-like tumors (left panel), and Kaplan-Meier analyses comparing disease-free survival of the same four groups (right panel).

Figure 27 presents Kaplan-Meier analyses of overall survival (left panel) and disease-free survival (right panel), comparing tumors in which less than 5%, 5 to 20%, or more than 20% of tumor tiles were identified as basal-like.

FIG. 28 presents the multivariate analysis of clinical/pathological factors and the proportion of basal-like tiles calculated by PACpAInt-Comp for overall (top) and disease-free (bottom) survival.

Detailed Description

Methods and apparatus for identifying pancreatic ductal adenocarcinoma (interchangeably abbreviated as "PAC" or "PDA") characteristics (e.g., subtypes) are described. In the following description, numerous specific details are set forth in order to provide a thorough explanation of embodiments of the invention. It will be apparent, however, to one skilled in the art that embodiments of the invention may be practiced without these specific details. In other instances, well-known components, structures, and techniques have not been shown in detail in order not to obscure an understanding of this description.

Reference in the specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment. The term "exemplary" is used herein in the sense of "example" rather than "ideal". It should be understood from this disclosure that the invention is not limited to the examples described herein.

For any method described herein, the order of steps presented, whether in the text or in the attached flow diagrams, should not be construed as to imply that those steps must be performed in the order presented unless the context dictates otherwise. Rather, the order of the steps presents one embodiment of the provided methods, and generally such steps may alternatively be performed in a different order or simultaneously. The processes depicted in the figures below are performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both. Although these processes are described below in terms of some sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.

Histology is a field of research related to the microscopic features of biological specimens. Histopathology refers to microscopic examination of a specimen (e.g., tissue) obtained or otherwise derived from a subject (e.g., a patient) in order to assess a disease state. Histopathological specimens are typically produced by processing a specimen (e.g., tissue) in a manner that secures the specimen or a portion thereof to a microscope slide. For example, a microtome or other suitable device may be used to obtain a thin section of a tissue specimen, and the section may be mounted on the slide. To aid in visualization of the specimen, the specimen may optionally be further processed, for example, by applying a stain. A number of stains have been developed for visualizing cells and tissues. These include, but are not limited to, hematoxylin and eosin (H & E), methylene blue, Masson's trichrome, Congo red, Oil Red O, and safranin. Pathologists often use H & E to help view cells within tissue specimens. Hematoxylin stains the nucleus blue, and eosin stains the cytoplasm and extracellular matrix pink. Pathologists visually inspecting H & E stained sections can use this information to assess morphological features of the tissue. However, H & E stained sections generally contain insufficient information to assess the presence or absence of a particular biomarker by visual inspection. Visualization of specific biomarkers (e.g., protein or RNA biomarkers) can be accomplished with additional staining techniques that depend on the use of labeled detection reagents that specifically bind to the marker of interest, e.g., immunofluorescence, immunohistochemistry, in situ hybridization, and the like. Such techniques are useful for determining expression of a single gene or protein, but are not practical for assessing complex expression patterns involving a large number of biomarkers.
Global expression profiling can be achieved by genomic and proteomic methods using separate samples derived from the same tissue source as the sample used for histopathological analysis. Such methods are expensive and time consuming, require the use of specialized equipment and reagents, and do not provide any information relating biomarker expression to specific areas within the tissue specimen (e.g., specific areas within the H & E staining image).

Pancreatic ductal adenocarcinoma ("PAC", "PDA" or "PDAC") has a high level of molecular heterogeneity. Genomic tools such as RNAseq and NanoString have been used to identify and classify clinically relevant PDA subtypes (see, e.g., Rashid et al., Clinical Cancer Research (2020); 26:82-92 and Puleo et al., Gastroenterology (2018); 155:1999-2013).

Digital images of histological sections, e.g., H & E stained sections, allow for computational evaluation of tissue specimens in addition to or as an alternative to visual inspection by pathologists. Provided herein are computer-implemented methods and associated systems and computer-readable media for determining PDA subtypes based on digital images of PDA tissue sections without the need for genomic or proteomic analysis.

Computational methods for implementing the methods provided herein may include, for example, machine learning, artificial intelligence (AI), deep learning (DL), neural networks, classification and/or clustering algorithms, and regression algorithms.

As used herein, the term "digital image" refers to an electronic image represented by a collection of pixels that can be viewed, processed, and/or analyzed by a computer. In some embodiments, the digital image may be acquired by means of a digital camera or other optical device capable of capturing a digital image from a slide or portion thereof. In other embodiments, the digital image may be acquired by means of scanning a non-electronic image of the slide or portion thereof. In some embodiments, the digital image used in the applications provided herein is a whole slide image. As used herein, the term "whole slide image (WSI)" refers to an image that includes all or substantially all portions of a tissue section, e.g., a tissue section that is present on a histological slide. In some embodiments, the WSI includes an image of the entire slide. In other embodiments, the digital image used in the applications provided herein is an image of a selected portion of a tissue section, e.g., a tissue section present on a histological slide. In some embodiments, the digital image is acquired after treatment of the tissue section with a stain (e.g., H & E).

As used herein, a "region of interest" of an image may be any region that is semantically related to the task to be performed, particularly a region corresponding to a tissue, organ, bone, cell, body fluid, etc. when in the context of histopathology.

As used herein, a "PDA classification scheme" refers to a classification framework for determining one or more PDA subtypes. Exemplary PDA classification schemes provided herein include PurIST classification and molecular component classification. According to the PurIST classification scheme, a PurIST subtype score (e.g., between 0 and 1) may be generated that represents the likelihood that the PDA sample represented in the digital image belongs to the basal or classical subtype. In some embodiments, the PurIST subtype scores are determined at the slide level, with a single score assigned to the digital image. According to a molecular component classification scheme, the molecular component subtype score may include a vector corresponding to a value of each molecular component selected from the group consisting of: classical, basal, interstitial active, and interstitial inactive. In some embodiments, molecular component subtype scores are determined at a tile level, where the molecular component subtype scores are assigned to individual tiles from a digital image.

As used herein, classifying an image describes associating a tag from a predetermined tag list with a particular image. In the context of histopathology, the classification may be a diagnostic classification. In one embodiment, the classification may be binary, e.g., the tag is simply "healthy"/"unhealthy", or "basal"/"classical". In other embodiments, there may be more than two tags, e.g., tags corresponding to different diseases, tags corresponding to different stages of the disease, tags corresponding to different kinds of diseased tissue, etc. For example, in some embodiments, there are more than two tags indicating PDA subtypes. In some embodiments, the tags may include the molecular component classifications classical, basal, interstitial active, and interstitial inactive.

"PDA subtype" or "PAC subtype" describes a subset of pancreatic ductal adenocarcinomas sharing some common features. For example, the PurIST PAC subtypes "classical" and "basal" (or "basal-like") were initially defined in terms of gene expression features that are shared by cancers of the same subtype and that differ from those of cancers of other subtypes, as described in Rashid et al., Clinical Cancer Research (2020); 26:82-92. These commonalities and differences in gene expression may be determined, for example, by RNA expression profiling (e.g., RNAseq). Furthermore, as described herein, it is surprising that each subtype of cancer has common morphological features that can be identified using a deep learning model. Thus, the model provided herein can determine whether a subject's pancreatic ductal adenocarcinoma is of the "classical" or "basal" subtype by analyzing images of PDA (e.g., digital images from H & E stained histological sections of PDA) as described herein, without performing RNA expression profiling on a sample of PDA tissue.

The deep learning method described herein also allows for identifying and characterizing additional PAC subtype features based on the morphology of PAC samples. In one embodiment employing the methods set forth herein, four PAC subtype signatures are identified and given the classifications "classical", "basal", "interstitial active" and "interstitial inactive" based on the PDA transcriptome components described by Puleo et al., Gastroenterology (2018), 155:1999-2013 (see Puleo et al., fig. 2; transcriptome components described as "basal-like tumors" are designated as "basal" in the studies described herein; transcriptome components described as "classical tumors" are designated as "classical" in the studies described herein; transcriptome components described as "activated interstitium" are designated as "interstitial active" in the studies described herein; transcriptome components described as "inflammatory interstitium" are designated as "interstitial inactive" in the studies described herein).

As used herein, predicting a global score of an image describes calculating a value representing a characteristic of the image as a whole. For example, the slide-level score is a score applied to an entire image (e.g., a whole slide image). This is in contrast to tile-level scores, which are applied to the individual tiles derived from the image by tiling, as described herein. In the context of the systems and methods provided herein, the global score may be a score indicative of a PDA subtype. In other embodiments, the global score may include a risk score associated with prognosis (e.g., survival, expected survival, etc.), a risk score associated with response to treatment (e.g., probability of treatment being effective, expected change, etc.), or any important diagnostic parameter.

The terms "server," "client," and "device" are intended to refer generally to data processing systems rather than to any particular form factor of the server, client, and/or device.

A multi-step method is provided that uses a deep learning (DL) model to predict tumor components and molecular subtypes thereof in routine histological preparations. In one particular embodiment, 424 WSIs with clinical and transcriptome data corresponding to 202 resected PACs from three centers are assembled and used as a discovery set (i.e., training set). An independent cohort of 250 cases, PACs from TCGA (n=134), and an independent cohort of 25 liver biopsies were used as validation sets. The tumor regions in the slides of the discovery set are annotated to train a multi-step DL model that first identifies tumor tissue and then predicts molecular subtypes.

The technology disclosed herein demonstrates the value of the histology-based DL model for complex tumor transcriptome subtyping in PAC. The DL model can predict the neoplastic region (i.e., tumor region) of the whole slide image, determine the molecular PAC subtype at the whole slide level in routine histological preparations, and differentiate the tumor cell and stromal compartments and predict their corresponding molecular subtypes at the tile level in order to dissect intratumoral heterogeneity on a large scale. The present disclosure provides a first AI-based PAC subtyping tool, opening the possibility of molecular stratification of patients in routine care and clinical trials. Additional benefits include the use of external case cohorts with sections of tumor to assess intratumoral heterogeneity, and validation of models in cohorts from prospective clinical trials. Another benefit of the present disclosure is the ability to identify the location of different molecular components within a WSI.

Thus, provided herein is a computer-implemented method for processing digital images of Pancreatic Ductal Adenocarcinoma (PDA) samples. The method may include (i) receiving a digital image of a PDA sample from a subject, (ii) applying a machine learning model to the digital image, and (iii) determining a PDA subtype of the image using the machine learning model, wherein the machine learning model is a model that has been trained in advance to predict PDA subtypes by processing a plurality of training images. In some embodiments, the digital image is an entire slice image of a PDA tissue slice. In some embodiments, PDA tissue sections have been stained with staining agents (such as hematoxylin and eosin). In some embodiments, the plurality of training images comprises a plurality of full slice images of training PDA tissue slices, wherein the training PDA tissue slices are derived from tumors of known PDA subtypes. In some embodiments, each of the plurality of training images is stained with a stain (e.g., hematoxylin and eosin). In some embodiments, each of the plurality of training images includes a global label indicating a known PDA subtype. In some embodiments, the plurality of training images lack local annotations of PDA features. In some embodiments, the machine learning model provides a score that indicates the likelihood that a PDA sample obtained from the subject has a predicted PDA subtype.

In some embodiments, the foregoing methods may further comprise an additional preprocessing step to select one or more tumor segments present in the digital image. In some embodiments, the method may further include tiling the image or tumor segments within the image into a set of tiles. In some embodiments, the method may further include performing feature extraction on the set of tiles to extract the set of features therefrom.

FIG. 1 illustrates an example flow chart of a method of applying a trained machine learning model to predict PDA molecular subtype scores in accordance with an embodiment of the present disclosure. At operation 101, a histological image is received. In some embodiments, the histological image may include digital WSI.

In one embodiment, the input histological image is obtained from a patient tissue sample that may be known or suspected to contain PDA tumors. In some embodiments, the input image may include a tissue section that has been stained (e.g., with hematoxylin and eosin (H & E)) to visualize underlying tissue structures. Other common stains that may be used to visualize tissue structures in an input image include, for example, Masson's trichrome stain, periodic acid-Schiff stain, Prussian blue stain, Gomori trichrome stain, Alcian blue stain, or Ziehl-Neelsen stain.

At operation 103, image processing is performed. In some embodiments, image processing may include removing background portions of the image, tiling the image, feature extraction, and/or identifying tumor regions within the image. Applying a deep learning algorithm to histological data is a challenging problem, particularly due to the high dimensionality of the data and the small size of the available data sets. Thus, a preprocessing pipeline consisting of multiple steps may be used to reduce the dimensionality and clean up the data.

In some embodiments, preprocessing includes detecting tissue regions of the image. The identification of tissue regions within the WSI may be performed before or after a subsequent additional image processing function, such as tiling. In one embodiment, a neural network (e.g., U-Net) can be used to segment portions of the image that contain material and discard artifacts such as blurring, handwriting, etc., as well as background portions of the image where no tissue is present.

Tiling an image (or background-subtracted image) can include dividing the original image (or background-subtracted image) into smaller, more manageable images called tiles. In one embodiment, the tiling operation is performed by applying a fixed grid to the whole slide image, using a segmentation mask generated by the segmentation method, and selecting tiles containing tissue or any other region of interest for the subsequent classification process. To further reduce the number of tiles to be processed, additional or alternative methods may be used, such as random subsampling to preserve only a given number of tiles.

In some embodiments, the original image may be downsampled to make the image segmentation step less computationally expensive. Because some of the image analysis is performed at the tile level (a tile being a sub-portion of the image), performing semantic segmentation on a downsampled version of the image does not degrade the quality of the segmentation. To obtain a segmentation mask for the original full-resolution image, the segmentation mask generated by the neural network then only needs to be upscaled.
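As an illustration of the upscaling step, the following minimal sketch enlarges a low-resolution binary mask by an integer factor using nearest-neighbor repetition. The function name `upscale_mask`, the factor of 4, and the toy mask are assumptions made for the example; the source does not specify how the mask is enlarged.

```python
import numpy as np


def upscale_mask(mask: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor upscaling of a 2-D mask by an integer factor.

    Each low-resolution pixel becomes a factor x factor block, so the
    result lines up with an image that is `factor` times larger per side.
    """
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)


# Toy 2 x 2 mask standing in for the output of a segmentation network
# (1 = tissue, 0 = background); values are illustrative only.
low_res_mask = np.array([[0, 1],
                         [1, 1]], dtype=np.uint8)
full_res_mask = upscale_mask(low_res_mask, factor=4)
print(full_res_mask.shape)  # (8, 8)
```

Nearest-neighbor repetition keeps the mask strictly binary, which an interpolating resize would not guarantee.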

For example, and in one embodiment, the image (or background-subtracted image) may be divided into tiles of fixed size (e.g., each tile having a size of 224 x 224 pixels). Alternatively, the tile size may be smaller or larger. In this example, the number of tiles generated depends on the size of the detected tissue, and may vary from a few hundred tiles to 50,000 or more tiles. In one embodiment, the number of tiles is limited to a fixed number (e.g., 10,000 tiles) that may be set based at least on the computation time and memory requirements. In some embodiments, at least 5%, 10%, 15%, 20%, 30% or more of a tile must have been detected as foreground by the U-Net model discussed above for the tile to be considered a tissue tile. Once the WSI is divided into tiles, image processing may extract features from each tile.
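The grid tiling, foreground filtering, and tile-count cap described above can be sketched as follows. This is a simplified illustration operating on a boolean tissue mask rather than on a real slide; the function name `select_tissue_tiles`, the 10% foreground threshold (one of the example values listed above), and the fixed random seed are assumptions made for the example.

```python
import numpy as np


def select_tissue_tiles(mask, tile_size=224, min_foreground=0.10,
                        max_tiles=10_000, seed=0):
    """Return the (row, col) pixel origins of grid tiles that contain tissue.

    mask           : 2-D boolean array, True where tissue was detected.
    min_foreground : minimum fraction of tissue pixels for a tile to be kept.
    max_tiles      : cap on the number of tiles, enforced by random subsampling.
    """
    n_rows = mask.shape[0] // tile_size
    n_cols = mask.shape[1] // tile_size
    origins = []
    for i in range(n_rows):
        for j in range(n_cols):
            r, c = i * tile_size, j * tile_size
            foreground = mask[r:r + tile_size, c:c + tile_size].mean()
            if foreground >= min_foreground:
                origins.append((r, c))
    origins = np.array(origins, dtype=int).reshape(-1, 2)
    if len(origins) > max_tiles:
        # Random subsampling keeps only a given number of tiles.
        rng = np.random.default_rng(seed)
        keep = rng.choice(len(origins), size=max_tiles, replace=False)
        origins = origins[np.sort(keep)]
    return origins


# Toy mask covering a 2 x 3 grid of 224-pixel tiles.
mask = np.zeros((448, 672), dtype=bool)
mask[:224, :224] = True         # one tile fully covered by tissue
mask[224:236, 224:448] = True   # one tile with about 5.4% tissue
tiles = select_tissue_tiles(mask)
print(tiles)  # only the fully covered tile passes the 10% threshold
```

A real pipeline would read tile pixels from the slide file at the returned origins; that step is omitted here because it depends on the slide-reading library in use.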

In some embodiments, feature extraction is performed using a self-supervised model. For example, the self-supervised model MoCo v2 may be used to extract features from tiles. In some embodiments, feature extraction may be performed generally according to the method described in Dehaene et al., "Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology" (2020), arXiv (https://arxiv.org/pdf/2012.03583). In some embodiments, about 1,000 to 5,000, about 1,000 to 3,000, or about 2,000 relevant features may be extracted from each tile. In some embodiments, 2,048 features are extracted from each tile, such that at the end of the preprocessing pipeline, the slide is represented by a matrix of dimensions (n_tiles, 2048).

Once features have been extracted from the tiles, the trained DL model may be used to predict the neoplastic regions of the tiles. In one embodiment, the tumor detection model may be trained at the tile level based on tumor annotations provided by expert pathologists. The TumNet model described herein (also referred to as PACpAInt-Neo) includes a multi-layer perceptron with a single layer of 128 hidden neurons followed by ReLU activation. In some embodiments, the tumor detection model classifies the tiles by assigning a score to each tile that indicates a likelihood that the tile contains a neoplastic region. Such neoplastic regions may contain tumor cells and/or tumor-associated stromal cells. In some embodiments, the tile score may be a value between 0 and 1, with one endpoint (e.g., 0) indicating a very high likelihood that the tile does not contain a neoplastic region and the other endpoint (e.g., 1) indicating a very high likelihood that the tile does contain a neoplastic region. In some embodiments, if a tile has a tumor prediction score greater than a threshold (e.g., 0.5), the tile may be classified as a tumor tile. In some embodiments, the tumor tiles may include epithelial tumor cells and a stromal region.
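A minimal sketch of a tile-level tumor detector of this shape is given below: one hidden layer of 128 neurons with ReLU activation, followed by an output that is mapped to a score between 0 and 1 and thresholded at 0.5. The sigmoid output, the random weights, and the function name `tumor_scores` are assumptions made for illustration; in practice the weights would be learned from pathologist annotations.

```python
import numpy as np


def tumor_scores(features, w1, b1, w2, b2):
    """Score each tile with a one-hidden-layer perceptron.

    features : (n_tiles, n_features) matrix of tile features.
    Returns one score per tile in the open interval (0, 1).
    """
    hidden = np.maximum(features @ w1 + b1, 0.0)  # 128 hidden neurons + ReLU
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid (assumed output layer)


rng = np.random.default_rng(0)
n_tiles, n_features, n_hidden = 5, 2048, 128

# Random stand-ins for extracted features and trained weights (hypothetical).
features = rng.normal(size=(n_tiles, n_features))
w1 = rng.normal(scale=0.01, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.01, size=n_hidden)
b2 = 0.0

scores = tumor_scores(features, w1, b1, w2, b2)
is_tumor_tile = scores > 0.5  # threshold from the example in the text
```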

At operation 105, the image (or image tile) may be analyzed by the trained DL model in order to predict PDA subtypes. PuriNet (also known as PACpAInt-B/C) and CompoNet (also known as PACpAInt-Comp) are two DL models that are trained on a discovery cohort (training set) to predict the PurIST classification (classical or basal) and continuous sample weights of the molecular components (classical, basal, interstitial active and interstitial inactive), respectively. In some embodiments, both PuriNet and CompoNet may use the same image (e.g., WSI) preprocessing pipeline described above, including identifying tissue regions, tiling images, and feature extraction. In some embodiments, the model may be trained using a training set containing images with one or more global labels indicative of PDA subtypes. In some embodiments, the PDA molecular subtype label of a training set image is based on a gene expression profile (e.g., generated using RNA profiling) or protein expression profile of a PDA sample obtained from the same subject as the training image. For example, in some embodiments, the PDA subtype label associated with each image in the training set may be based on an assessment of the gene expression profile of a paired sample from the same subject that has been classified as classical or basal-like according to the PurIST criteria set forth in Rashid et al., Clinical Cancer Research (2020); 26:82-92. In some embodiments, the PDA molecular subtype labels associated with each image in the training set may be based on an assessment of the gene expression profile of a paired sample from the same subject that has been classified as classical, basal, interstitial active or interstitial inactive based on the PDA transcriptome components described by Puleo et al., Gastroenterology (2018), 155:1999-2013 (see Puleo et al., fig. 2).
In some embodiments, PDA molecular subtype tags associated with each image in the training set may be based on an assessment of the gene expression profile of paired samples from the same subject that have been classified as purely classical, immune classical, purely basal-like, interstitial activated or pro-fibrotic according to the criteria set forth by Puleo et al.

Fig. 2 illustrates an example flowchart of a method 200 of training and validating a DL model according to an embodiment of the present disclosure. In order to locally verify the predicted classical and basal patterns identified by the CompoNet model, GATA6 and VIM immunohistochemistry (IHC) were performed on multiple slides. Tile scores for the basal and classical components are then compared between the regions identified as classical and basal by IHC.

At operation 201, the method may begin by receiving a training set with clinical and molecular information. For example, the training set may include a cohort of annotated H & E WSIs.

At operation 203, the WSI may be preprocessed and the DL model may be trained to predict molecular information from the preprocessed image. In some embodiments, the model may be trained using training images comprising digital images of histological sections of PDA samples obtained from subjects of known PDA subtypes. For example, each training image may include a global label indicating a known PDA subtype.

At operation 205, the trained model may be validated. In this particular example, the trained model is validated using three validation cohorts: 1) Beaujon/BJN (n=150 + n=100 with perfectly matched RNAseq), 2) TCGA-PAAD (n=134), and 3) Beaujon Biopsies/EUS-FNB (n=25).

The area under the receiver operating characteristic curve (AUC) was used to quantify the ability of the model to distinguish classical tumors from basal tumors, as assessed by the PurIST method. The same metric was also used to evaluate the performance of the tumor detection model in distinguishing normal regions from tumor regions. DeLong's method was used to calculate confidence intervals at the 95% confidence level. The Pearson correlation was used to evaluate the performance of the CompoNet model in predicting molecular components. Survival analysis was performed with univariate and multivariate Cox proportional hazards models implemented in the Python lifelines package. Log-rank tests were used to compare survival distributions between population subgroups. Tests were two-tailed, and P-values <0.05 were considered statistically significant. The clinical variables considered for multivariate analysis are common variables known to be associated with PAC prognosis, including, for example: pN stage, differentiation, perineural invasion, resection status, tumor size, vascular invasion, and/or adjuvant therapy (yes/no).
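The two headline metrics can be sketched with NumPy alone. The example below computes the AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case (ties counted as one half), and the Pearson correlation via `np.corrcoef`. The function names and the toy labels and scores are illustrative assumptions; DeLong confidence intervals, Cox models, and log-rank tests are not shown.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    labels : 0/1 ground truth (e.g., 1 = basal, 0 = classical).
    scores : model scores, higher meaning more likely positive.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def pearson_r(x, y):
    """Pearson correlation between predicted and reference component values."""
    return float(np.corrcoef(x, y)[0, 1])


# Toy data, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)                   # 3 of 4 pairs ranked correctly
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(auc, r)
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for cohorts of a few hundred cases; a rank-based implementation would scale better.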

Fig. 3 illustrates an example flowchart of a method 300 of determining Pancreatic Ductal Adenocarcinoma (PDA) subtypes corresponding to PDA classification schemes for subjects with PDA, according to an embodiment of this disclosure.

At operation 301, a digital image of a histological section of a PDA sample obtained from a subject is received. As discussed above, the digital image may be WSI of a tissue slice of a PDA. PDA tissue sections may have been stained with staining agents such as hematoxylin and eosin.

At operation 303, the image is tiled to separate the WSI into smaller images that are easier to manage. In one embodiment, the tiling operation is performed by applying a fixed grid to the whole slide image, using a segmentation mask generated by the segmentation method, and selecting tiles that contain tissue or any other region of interest for the subsequent classification process. To further reduce the number of tiles to be processed, additional or alternative methods may be used, such as random subsampling to preserve only a given number of tiles.

Once the WSI has been divided into tiles, the method 300 may continue at operation 305 with extracting features from each tile. In some embodiments, feature extraction is performed using a self-supervised model. For example, the self-supervised model MoCo v2 may be used to extract features from tiles. In one embodiment, features may be extracted by applying a trained feature extractor that is trained with a contrastive loss DL algorithm using a training set of images. In one embodiment, the training set of images may include a set of annotated images with PDA molecular subtype labels generated using RNA profiling.
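To make the notion of a contrastive loss concrete, the sketch below implements an InfoNCE-style loss, the family of objective used by MoCo-type methods: each query embedding should be more similar to its positive key (another view of the same tile) than to a bank of negative keys. This is a generic illustration and not the training code of the feature extractor described here; the function name `info_nce_loss`, the temperature of 0.2, and the array shapes are assumptions.

```python
import numpy as np


def _l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce_loss(queries, keys, negatives, temperature=0.2):
    """InfoNCE-style contrastive loss.

    queries   : (batch, dim) embeddings of one augmented view of each tile.
    keys      : (batch, dim) embeddings of a second view of the same tiles.
    negatives : (n_neg, dim) embeddings of other tiles.
    """
    q = _l2_normalize(queries)
    k = _l2_normalize(keys)
    n = _l2_normalize(negatives)
    l_pos = np.sum(q * k, axis=1, keepdims=True)   # (batch, 1) positive similarity
    l_neg = q @ n.T                                # (batch, n_neg) negative similarities
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())


rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
negatives = rng.normal(size=(32, 16))
# Keys close to the queries mimic two views of the same tile.
keys = queries + 0.05 * rng.normal(size=queries.shape)
loss = info_nce_loss(queries, keys, negatives)
```

Minimizing this loss pulls the two views of a tile together in feature space while pushing other tiles away, which is what allows the extractor to be trained without tile-level labels.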

Once features have been extracted from the tiles, the method 300 may continue at operation 307 with selecting tumor tiles using the trained DL model. In one embodiment, a tumor detection model, such as TumNet, may be trained at the tile level based on tumor annotations provided by expert pathologists. The TumNet model includes a multi-layer perceptron with a single layer of 128 hidden neurons followed by ReLU activation. In some embodiments, the tumor detection model classifies the tiles by assigning a score to each tile that indicates a likelihood that the tile contains a neoplastic region. Such neoplastic regions may contain tumor cells and/or tumor-associated stromal cells. In some embodiments, if a tile has a tumor prediction score greater than, for example, 0.5, the tile may be classified as a tumor tile. In some embodiments, the tumor tiles may include epithelial tumor cells and a stromal region.

At operation 309, the DL model is used to determine PDA subtypes by applying the model to the tiles comprising the one or more tumor tiles identified at operation 307. In some embodiments, the determination of the PDA subtype may be made at the tile level, while in other embodiments, the determination of the PDA subtype may be made at the slide level. PuriNet and CompoNet (also discussed above with reference to fig. 1) are two DL models that are trained on a training set to predict the PurIST classification (classical or basal) and continuous sample weights of the molecular components (classical, basal, interstitial active and interstitial inactive), respectively. In one embodiment, a machine learning model is trained for a PDA classification scheme, and each PDA subtype to be assigned to an image or a tile derived from an image is a PDA subtype of that PDA classification scheme. In some embodiments, the PDA classification scheme is PurIST and the PDA subtypes comprise classical and/or basal. In other embodiments, the PDA classification scheme is the molecular component scheme, and the PDA subtypes include classical, basal, interstitial active, and interstitial inactive.

The PuriNet model may be used to obtain a slide-level score that represents the probability that the tissue represented in the image of the slide is basal or classical. In one embodiment, PuriNet is trained with binary cross entropy as a loss function. In some embodiments, the score is between 0 and 1, where a score at one end of the range (e.g., 0) represents a very high likelihood that the tissue contained on the slide belongs to the classical subtype, and a score at the other end of the range (e.g., 1) represents a very high likelihood that the tissue contained on the slide belongs to the basal subtype.
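For reference, binary cross entropy over a batch of slide-level scores can be written in a few lines. The sketch below follows the example convention given above (0 = classical, 1 = basal); the function name, the clipping constant `eps`, and the toy values are assumptions made for illustration.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy between labels and predicted scores.

    y_true : 0/1 labels (here 0 = classical, 1 = basal).
    y_pred : predicted scores in [0, 1]; clipped to avoid log(0).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-losses.mean())


labels = [1, 0, 1]                      # toy slide-level PurIST labels
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
uncertain = binary_cross_entropy(labels, [0.5, 0.5, 0.5])
print(confident < uncertain)            # confident, correct scores give a lower loss
```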

Inspired by the WELDON algorithm, CompoNet may compute a set of one-dimensional embeddings for the tile features using a multi-layer perceptron (MLP). For each output channel of the MLP, the r=100 top and bottom scores can be averaged, so that the output of the model is a vector corresponding to the value of each molecular component: classical, basal, interstitial active and interstitial inactive.
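The top/bottom pooling step can be sketched as follows. Given per-tile scores with one channel per molecular component, the r highest and r lowest scores of each channel are averaged to give one value per component. How the two groups are combined (here, a single mean over all 2r selected scores) is an assumption, as are the function name `top_bottom_pool` and the random tile scores standing in for MLP outputs.

```python
import numpy as np

COMPONENTS = ("classical", "basal", "interstitial active", "interstitial inactive")


def top_bottom_pool(tile_scores, r=100):
    """Average the r highest and r lowest tile scores of each channel.

    tile_scores : (n_tiles, n_channels) per-tile outputs of the MLP.
    Returns a vector of length n_channels (one value per component).
    """
    n_tiles = tile_scores.shape[0]
    r = min(r, n_tiles)                    # guard for slides with few tiles
    ordered = np.sort(tile_scores, axis=0)  # ascending within each channel
    extremes = np.concatenate([ordered[:r], ordered[-r:]], axis=0)
    return extremes.mean(axis=0)


rng = np.random.default_rng(0)
tile_scores = rng.normal(size=(1000, len(COMPONENTS)))  # stand-in MLP outputs
slide_vector = top_bottom_pool(tile_scores, r=100)
component_values = dict(zip(COMPONENTS, slide_vector))
```

Pooling only the extreme tiles lets a small number of strongly basal or strongly classical regions drive the slide-level output instead of being diluted by the many uninformative tiles.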

At optional operation 311, one or more PDA molecular component scores are calculated for each tile in the set of tiles using a machine learning model. The machine learning model may also be trained to calculate a score for each PDA molecular component, including: classical, basal, interstitial active, and interstitial inactive.

In some embodiments, the DL model may predict PurIST classifications and molecular component sample weights at the tile level based on features extracted from the tile. Based on this tile-level knowledge, PurIST scores and molecular component scores may also be generated at the slide level.

Fig. 4 illustrates an example flowchart of a method 400 of generating tumor-level PDA subtype scores according to an embodiment of this disclosure. According to one embodiment, the techniques described herein may be applied to multiple images of the same tumor. For example, a tumor biopsy may be performed and multiple images of individual sections of the tumor may be generated. These images can be analyzed to determine PDA subtype scores and provide a more complete assessment at the tumor level.

At operations 401a and 401b, a plurality of PDA subtype scores corresponding to a single tumor of a patient are analyzed. In one embodiment, a number of slices of a single tumor are imaged and analyzed in accordance with the techniques described herein to generate PDA subtype scores from images of different slices of the same tumor.

At operation 403a, the proportion of sections of a single tumor corresponding to different PDA subtype scores is determined. For example, if 100 sections of a single tumor are analyzed, the method can determine what proportion of those sections corresponds to a given PDA subtype. This aggregation may also be accomplished at the tile level by determining what proportion of tiles derived from one or more images corresponds to a given subtype, as shown in operation 403b.

At operation 405, a tumor-level PDA subtype score is generated based on the aggregate proportions of slides (405a) or tiles (405b) of an individual tumor corresponding to different PDA subtypes. In this way, PDA subtype scores can be generated at the tumor level. For example, if after analyzing 100 sections from a single tumor, it is determined that 85% of the sections are predominantly of the basal subtype, then a basal tumor-level label can be assigned. In another example, a tumor-level label of 85% basal may be assigned to the tumor, with the remaining proportion being assigned based on the proportions of other subtypes identified in the tumor (e.g., 85% basal/15% classical, or 85% basal/5% classical/8% interstitial active/2% interstitial inactive). As will be clear to those skilled in the art, aggregation of tumor-level PDA subtype scores may be achieved in a variety of ways. For example, in embodiments where a PDA subtype score is assigned to each image at the slide level, the number of slides for each subtype may be determined and used to assign a tumor-level subtype label. In another example, where PDA subtype scores are assigned at the tile level for each image, the number of tiles for each subtype may be determined and used to assign slide-level labels. The number of slides per subtype can then be determined and used to assign a tumor-level subtype label. In another example, where a PDA subtype score is assigned to each image at the tile level, the number of tiles for each subtype identified in a tumor (e.g., across multiple slides) may be determined and used to assign a tumor-level subtype label.
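One of the aggregation strategies described above, counting slide-level (or tile-level) labels and reporting their proportions together with the dominant subtype, can be sketched as follows. The function name `tumor_level_summary` and the toy label list reproducing the 85%/15% example are assumptions made for illustration.

```python
from collections import Counter


def tumor_level_summary(labels):
    """Aggregate per-slide (or per-tile) subtype labels for one tumor.

    labels : iterable of subtype names, one per slide or per tile.
    Returns (dominant_subtype, {subtype: proportion}).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one label is required")
    proportions = {name: n / total for name, n in counts.items()}
    dominant = max(proportions, key=proportions.get)
    return dominant, proportions


# 100 sections of a single tumor: 85 called basal, 15 called classical.
slide_labels = ["basal"] * 85 + ["classical"] * 15
dominant, proportions = tumor_level_summary(slide_labels)
print(dominant, proportions)
```

The same function applies unchanged to tile-level labels pooled across all slides of a tumor, which corresponds to the tile-based aggregation of operation 403b.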

In some embodiments, a tumor-level subtype score is determined for the classical subtype, the basal subtype, or both. In other embodiments, a tumor-level subtype score is determined for the classical, basal, interstitial activity, and/or interstitial inactivity subtypes.

The foregoing operations may be used to identify various subtypes of PAC present in a population of subjects using histopathological images (e.g., H & E stained images), and/or may be used to determine specific PAC subtypes for individual patients.

For example, in some aspects, provided herein are computer-implemented methods for processing digital images of pancreatic ductal adenocarcinoma (PAC) samples, the methods comprising receiving digital images of PAC samples obtained from a subject, applying a machine learning model to the digital images, and determining PAC classifications of the images using the machine learning model, wherein the machine learning model has been trained by processing a plurality of training images.

In some embodiments, the digital image may be a Whole Slice Image (WSI) that encompasses all tissue on a histological slice. In some embodiments, the tissue may be stained to visualize morphological features of the PAC sample. For example, the sample may be stained with H & E and/or other suitable dyes (such as those described herein).

In one embodiment, provided herein is a computer-implemented method for identifying a neoplastic region within a digital image of a pancreatic ductal adenocarcinoma (PAC) sample, the method comprising:

A digital image of a PAC sample obtained from a subject is received,

Preprocessing an image, wherein the preprocessing includes removing background segments from the image, tiling the image into a set of tiles, and performing feature extraction on the set of tiles or a subset thereof to extract a set of features from the set of tiles, and

A machine learning model is applied to the digital image, wherein the machine learning model assigns a tile score to each tile or a subset thereof in a set of tiles, the tile score representing a likelihood that the tile contains a tumorous region.

In some embodiments, the machine learning model has been trained by processing a plurality of training images, wherein the training images comprise digital images of a plurality of PAC samples, and wherein the digital images contain local annotations defining the neoplastic region. In some embodiments, the neoplastic region contains a neoplastic cell. In other embodiments, the neoplastic region contains both tumor cells and associated stromal cells.

Once the tile score has been assigned to the tile, in some embodiments, the tile may be filtered to select tiles above or below a threshold. In this way, tiles containing neoplastic regions and/or tiles containing non-neoplastic regions may be selected for further analysis. In some embodiments, tiles that have been assigned tile scores may be mapped back or superimposed to their locations in the digital image. This allows the digital image to be marked to indicate portions that contain a neoplastic region and portions that do not contain a neoplastic region.
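
As a minimal sketch of the filtering and mapping-back step, the snippet below assumes tile scores are stored in a dictionary keyed by the tile's (row, column) position in the tiling grid. The helper names `select_tiles` and `score_map`, and the default threshold of 0.5, are illustrative assumptions.

```python
def select_tiles(tile_scores, threshold=0.5, keep_above=True):
    """Keep tiles whose score is above (or not above) the threshold.

    tile_scores: dict mapping (row, col) grid positions to a score in [0, 1].
    """
    if keep_above:
        return {pos: s for pos, s in tile_scores.items() if s > threshold}
    return {pos: s for pos, s in tile_scores.items() if s <= threshold}


def score_map(tile_scores, n_rows, n_cols, fill=None):
    """Place each tile score back at its grid position.

    Positions without a scored tile (e.g., background) keep the fill value.
    """
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for (row, col), score in tile_scores.items():
        grid[row][col] = score
    return grid


scores = {(0, 0): 0.9, (0, 1): 0.2, (1, 1): 0.7}
tumor_tiles = select_tiles(scores)      # tiles likely to contain tumor
heatmap = score_map(scores, n_rows=2, n_cols=2)
```

The resulting grid can be rendered over a downsampled version of the digital image to mark neoplastic and non-neoplastic portions.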

In one embodiment, provided herein is a computer-implemented method for identifying a subtype of pancreatic ductal adenocarcinoma (PAC) sample obtained from a subject known or suspected to have PAC, the method comprising:

a digital image of the PAC sample is received,

Preprocessing an image, wherein preprocessing comprises removing background segments from the image, tiling the image into a set of tiles, and performing feature extraction on the set of tiles or a subset thereof to extract a set of features from the set of tiles,

Selecting a subset of tiles containing a tumorous region, and

A machine learning model is applied to a subset of tiles containing the tumorous region, wherein the machine learning model assigns a slice score to the subset of tiles, the slice score representing a likelihood that the PAC sample belongs to the PAC subtype.

In some embodiments, a machine learning model has been trained by processing a plurality of training images, wherein the training images comprise digital images of a plurality of PAC samples of known subtypes, wherein the training images comprise global (slice level) labels indicative of the known subtypes. In some embodiments, the training image lacks local annotations.

In some embodiments, the foregoing method is a method of identifying a PurIST subtype of a PAC sample, wherein the PurIST subtype is selected from classical and basal. In some embodiments, the training images each include a global label indicating the PurIST subtype (e.g., classical or basal) of the PAC sample of the known subtype represented therein.

a digital image of the PAC sample is received,

Selecting a subset of tiles containing a tumorous region, and

Applying a machine learning model to a subset of tiles containing the tumorous region, wherein the machine learning model assigns a tile score to each tile or subset thereof in the set of tiles, the tile score representing a likelihood that the PAC tissue represented in the tile belongs to a PAC subtype. In some embodiments, the tile score is a plurality of scores, each score representing a likelihood that the PAC tissue represented in the tile belongs to a PAC subtype.

In some embodiments, the machine learning model assigns one, two, three, four, five, six, seven, eight, nine, ten, or more tile scores to each tile or subset thereof in the set of tiles, each tile score representing a likelihood that the PAC tissue represented in the tile belongs to a PAC subtype.

In some embodiments, the foregoing method is a method of identifying a PurIST subtype of a PAC sample, wherein the PurIST subtype is selected from classical and basal. In some embodiments, the training images each include a global label indicating the PurIST subtype (e.g., classical or basal) of the PAC sample of the known subtype represented therein. In some embodiments, the PAC subtype is selected from classical, basal, interstitial activity, and interstitial inactivity.

Once the tile score has been assigned to the tile, in some embodiments, the tile may be filtered to select tiles above or below a threshold. In this way, tiles containing one or more regions representing a particular PAC subtype may be selected for further analysis. In some embodiments, tiles that have been assigned tile scores may be mapped back or superimposed to their locations in the digital image. This allows the digital image to be marked to indicate the portion containing the particular PAC subtype.

In some embodiments, the techniques described herein may be used to identify a primary PAC subtype of a PAC sample obtained from a patient known or suspected of having PAC.

In some embodiments, the techniques described herein may be used to assess the level of subtype heterogeneity of PAC samples obtained from patients known or suspected to have PAC.

In some embodiments, the foregoing method may be performed on multiple digital images from the same PAC sample. For example, the method may be performed on a plurality of digital images of serial tissue slices derived from a PAC sample. In other embodiments, the method may be performed on a plurality of digital images of tissue slices obtained from various regions of a PAC sample. Assessing PAC subtypes using multiple tissue sections makes it possible to assess the homogeneity or heterogeneity of a PAC sample in three dimensions.

In some embodiments, the foregoing techniques may be used to identify a PurIST subtype (classical or basal) of a subject known or suspected to have pancreatic ductal adenocarcinoma (PAC). In some embodiments, the foregoing techniques may be used to identify molecular subtypes (classical, basal, interstitial activity, and interstitial inactivity) of a subject known or suspected to suffer from pancreatic ductal adenocarcinoma (PAC).

PAC subtypes described herein have prognostic significance for overall survival and disease-free survival (alone or in combination with other clinical features). Thus, in some embodiments, provided herein are methods of facilitating the prognosis of a subject with PAC by determining the PAC subtype according to the methods and/or systems described herein and correlating the PAC subtype with the prognosis of the subject. For example, in some embodiments, provided herein are methods of predicting the survival duration of a subject having PAC, comprising determining a PAC subtype according to the methods and/or systems described herein, and correlating the PAC subtype with the survival duration of the subject. In other embodiments, provided herein are methods of predicting the duration of disease-free survival of a subject having PAC, comprising determining a PAC subtype according to the methods and/or systems described herein, and correlating the PAC subtype with the duration of disease-free survival of the subject. In other embodiments, provided herein are methods of predicting the survival duration of a subject having PAC, comprising determining the proportion of tiles of a PAC sample having a particular PAC subtype (e.g., basal, classical) in a digital image according to the methods and/or systems described herein, and correlating the proportion of PAC tiles having the particular PAC subtype with the survival duration of the subject. In other embodiments, provided herein are methods of predicting the disease-free survival duration of a subject having PAC, comprising determining the proportion of tiles of a PAC sample having a particular PAC subtype (e.g., basal, classical) in a digital image according to the methods and/or systems described herein, and correlating the proportion of PAC tiles having the particular PAC subtype with the disease-free survival duration of the subject.

Also provided herein is a machine-readable medium having executable instructions to cause one or more processing units to perform any of the foregoing computer-implemented methods.

There is also provided a system for processing digital images of a PDA sample, the system comprising: at least one memory storing instructions, and at least one processor configured to execute the instructions to perform operations necessary to execute the computer-implemented methods provided herein.

These example embodiments are provided to illustrate representative applications of the systems and methods provided herein. These embodiments are merely exemplary. In accordance with the disclosure provided herein, additional models and methods can be used to develop and train machine learning models and/or systems for PDA diagnostic and classification purposes based on histopathological slice images.

Exemplary methods and techniques used in the exemplary embodiments are as follows:

Transcriptome profiling and molecular subtype typing. For the discovery cohort, the PurIST subtype and molecular composition were determined using microarrays, whereas 3' RNAseq was used for the validation cohort.

Preprocessing of the whole slice image. Applying a deep learning algorithm to histological data is challenging, particularly because of the high dimensionality of the data (up to 100,000 x 100,000 pixels for a single whole slice image) and the small size of the available data sets. A preprocessing pipeline consisting of multiple steps is therefore used to reduce the dimensionality and clean up the data. The first step involves detecting tissue on the WSI: the portion of the image containing tissue is segmented using a U-Net neural network, and the background and artifacts such as blurring, handwriting, and the like are discarded. The second step involves tiling the slice into smaller images, called "tiles", 112×112 μm (224×224 pixels) in size. For a tile to be considered a tissue tile, the U-Net model must detect at least 20% of the tile as foreground. Up to 8,000 such tiles are then sampled uniformly from each slice. The third step involves extracting features from each tile: 2,048 relevant features are extracted using a self-supervised model (i.e., MoCo v2), following the method of Dehaene et al., "Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology" (2020), arXiv (https://arxiv.org/pdf/2012.03583), which is incorporated herein by reference in its entirety. At the end of this preprocessing pipeline, each slice is represented by a matrix of dimensions (n_tiles, 2048).
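
The tile selection and sampling steps can be sketched as follows. This is only an illustration under stated assumptions: the tissue mask is taken to be a 2D list of 0/1 values at the same resolution as the tile grid (1 = foreground as detected by the segmentation model), and the helper names are hypothetical. The U-Net segmentation and MoCo v2 feature extraction themselves are not reproduced here.

```python
import random

TILE_SIZE_PX = 224        # 224 x 224 pixel tiles, as stated in the text
MIN_FOREGROUND = 0.20     # at least 20% of the tile must be foreground
MAX_TILES = 8000          # at most 8,000 tiles are kept per slice


def foreground_fraction(mask, top, left, size):
    """Fraction of foreground pixels in the size x size window at (top, left)."""
    covered = sum(sum(mask[r][left:left + size]) for r in range(top, top + size))
    return covered / (size * size)


def tissue_tiles(mask, size=TILE_SIZE_PX, min_fg=MIN_FOREGROUND):
    """Return (row, col) grid positions of tiles with enough foreground."""
    n_rows = len(mask) // size
    n_cols = len(mask[0]) // size
    coords = []
    for i in range(n_rows):
        for j in range(n_cols):
            if foreground_fraction(mask, i * size, j * size, size) >= min_fg:
                coords.append((i, j))
    return coords


def sample_tiles(coords, max_tiles=MAX_TILES, seed=0):
    """Uniformly sample at most max_tiles positions without replacement."""
    if len(coords) <= max_tiles:
        return list(coords)
    return random.Random(seed).sample(coords, max_tiles)


# Toy 4 x 4 mask tiled with 2 x 2 tiles instead of 224 x 224.
toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
kept = tissue_tiles(toy_mask, size=2)
```

A fixed seed is used in `sample_tiles` only so that the sketch is reproducible; the source does not state how the sampling is seeded.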

Tumor model structure. The tumor detection ML model "TumNet" was trained at the tile level on tumor annotations provided by pathologists for 745 samples, corresponding to a total of 17,292,173 tiles. The 2,048 features per tile were obtained using the WSI preprocessing technique described above ("Preprocessing of the whole slice image"). The model consists of a multi-layer perceptron with a single hidden layer of 128 neurons, followed by ReLU activation. The model outputs a tumor prediction score for each tile, which indicates the presence or absence of a tumor region. The tumor detection model is applied to the WSI to select those tiles within the digital image that contain a tumor region. Such tumor regions may include tumor cells and the interstitial components of the tumor.
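
A forward pass through a tile classifier of this shape might look like the sketch below. The source specifies one hidden layer of 128 neurons with ReLU and 2,048 input features; the sigmoid on the output, the weight initialization, and the class name `TileMLP` are assumptions made for illustration, and no training loop is shown.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z):
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class TileMLP:
    """One hidden layer (ReLU) followed by a single output score."""

    def __init__(self, n_features=2048, n_hidden=128, seed=0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(n_features)
        s2 = 1.0 / math.sqrt(n_hidden)
        # Randomly initialized weights stand in for trained parameters.
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(n_features)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-s2, s2) for _ in range(n_hidden)]
        self.b2 = 0.0

    def predict(self, features):
        """Return a score in (0, 1) for one tile feature vector."""
        hidden = relu([
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.w1, self.b1)
        ])
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return sigmoid(z)


model = TileMLP(n_features=8, n_hidden=4)   # tiny sizes for a quick demo
score = model.predict([0.1] * 8)
```

In practice such a model would be written with a tensor library; plain Python is used here only to keep the sketch dependency-free.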

Molecular model structure. Two deep learning models were trained on the discovery cohort to predict, for a WSI, either the PurIST classification ("PuriNet") or the molecular components classical, basal, interstitial activity, and interstitial inactivity ("CompoNet"). Both models use the same WSI preprocessing pipeline described above in "Preprocessing of the whole slice image". The TumNet model is further applied to the tile features to select only tumor tiles (i.e., tiles with a tumor prediction score greater than 0.5).

PuriNet uses an architecture similar to that proposed by Ilse et al., "Attention-based deep multiple instance learning", International Conference on Machine Learning, PMLR, 2018. A linear layer with 128 neurons is applied for embedding, followed by a gated attention layer with 128 hidden neurons. An MLP with 128 and 64 hidden neurons and ReLU activations is then applied to the result. A final sigmoid activation is applied to the output to obtain a score between 0 and 1, which represents the probability that the slice is basal or classical. PuriNet is trained using binary cross-entropy as the loss function.
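
The gated attention pooling at the heart of this kind of architecture can be sketched as follows, using the formulation of Ilse et al.: each tile embedding h_k receives a weight proportional to exp(w · (tanh(V h_k) ⊙ sigmoid(U h_k))), and the slice representation is the weighted sum of the embeddings. The matrices `V`, `U` and vector `w` stand in for learned parameters; the function names are hypothetical, and the embedding layer and final MLP are omitted.

```python
import math


def _sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gated_attention_pool(embeddings, V, U, w):
    """Pool tile embeddings into one slice vector with gated attention.

    embeddings: list of tile vectors (n_tiles x d)
    V, U:       attention matrices (n_hidden x d)
    w:          attention vector (n_hidden)
    Returns (pooled_vector, attention_weights).
    """
    logits = []
    for h in embeddings:
        t = [math.tanh(x) for x in _matvec(V, h)]
        g = [_sigmoid(x) for x in _matvec(U, h)]
        logits.append(sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))
    # Softmax over tiles, shifted by the maximum for numerical stability.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    attention = [e / total for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(attention, embeddings))
              for d in range(dim)]
    return pooled, attention


def binary_cross_entropy(p, y, eps=1e-7):
    """Loss for one predicted probability p and binary label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```

The attention weights sum to one over the tiles of a slice, so they can also be inspected to see which tiles drive a slice-level prediction.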

CompoNet uses an architecture similar to the WELDON algorithm (see, for example, Thibaut Durand, Nicolas Thome, Matthieu Cord, "WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks", 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), June 2016, Las Vegas, NV, USA). A set of one-dimensional embeddings of the tile features is calculated using a multi-layer perceptron (MLP) with a layer of 128 hidden neurons followed by a layer of 4 neurons, with ReLU activation. For each output channel of the MLP, the R=100 top and bottom scores are selected and averaged, such that the model output is a vector of size 4, corresponding to the value of each molecular component. CompoNet is trained with mean squared error as the loss function.
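
The top-and-bottom pooling step can be sketched as below. It assumes the per-tile MLP has already produced one score per channel for every tile; the function names are hypothetical, and the handling of slices with fewer than R tiles (using all available tiles) is an assumption not stated in the source.

```python
def top_bottom_pool(tile_scores, r=100):
    """Average the r highest and r lowest tile scores of each channel.

    tile_scores: list of per-tile score vectors (n_tiles x n_channels).
    Returns one pooled value per channel.
    """
    n_channels = len(tile_scores[0])
    k = min(r, len(tile_scores))  # assumption: clip r for small slices
    pooled = []
    for c in range(n_channels):
        column = sorted(row[c] for row in tile_scores)
        extremes = column[:k] + column[-k:]
        pooled.append(sum(extremes) / len(extremes))
    return pooled


def mean_squared_error(predicted, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Four tiles, two channels, r = 1 for readability.
scores = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 10.0]]
slice_vector = top_bottom_pool(scores, r=1)
```

With four output channels, the pooled vector would hold one value per molecular component, and the per-tile channel scores remain available for tile-level inspection.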

Spatial validation. In order to locally verify the predicted classical and basal patterns identified by CompoNet, GATA6 and VIM IHC were performed on the slices of the BJN_matched cohort. Tile scores for the basal and classical components were compared with the classical and basal regions defined by immunohistochemistry (IHC).

Performance assessment and statistical methods. The area under the receiver operating characteristic curve (AUC) was used to quantify the ability of the model to distinguish classical tumors from basal tumors, as assessed by the PurIST method. The same metric was also used to evaluate the ability of the tumor detection model to distinguish normal regions from tumor regions. The method of DeLong was used to calculate confidence intervals at the 95% confidence level. The Pearson correlation was used to evaluate the performance of the CompoNet model in predicting molecular composition. Survival analysis was performed using univariate and multivariate Cox proportional hazards models implemented in the Python lifelines package. The log-rank test was used to compare survival distributions between population subgroups. All tests were two-tailed, and P values <0.05 were considered statistically significant.
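
Two of the metrics named above can be computed directly from their definitions, as in the following sketch. The AUC is computed as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count one half), which is equivalent to the area under the ROC curve. The function names are hypothetical; DeLong confidence intervals, Cox models, and log-rank tests are not reproduced here.

```python
import math


def roc_auc(labels, scores):
    """AUC from pairwise comparisons of positive and negative scores.

    labels: iterable of 0/1 (1 = positive class, e.g., basal)
    scores: iterable of model scores, higher = more likely positive
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred cases.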

Clinical variables used in multivariate analysis. The clinical variables considered for multivariate analysis are common variables known to be associated with PDA prognosis: pN stage, differentiation, perineural invasion, resection status, tumor size, vascular invasion, and adjuvant therapy (yes/no).

The multi-step DL model was trained on a cohort of 424 digital H & E WSIs from 202 resected PDA specimens (average number of slices per case = 2) from 3 centers, with neoplastic areas annotated by two pathologists. At least two hematoxylin-eosin (H & E) sections from the surgical specimen were available for each patient, for a total of 424 sections.

The trained model was then validated on three independent cohorts (including a biopsy cohort). The characteristics of the training cohort (i.e., the training set) and the three validation cohorts are presented in Table 1 below.

Cohort	Validation?	# Slices	# Patients	Description
Training	No	424	202	Multi-center resected primary tumors
BJN_250	Yes	505	250	Resected primary tumors
TCGA-PAAD	Yes	132	132	Multi-center resected primary tumors
EUS-FNB	Yes	25	25	Fine needle liver biopsies (metastases)

TABLE 1

When applied to two different validation cohorts (BJN_250 and TCGA-PAAD), the model correctly detected the neoplastic regions. Molecular transcriptome subtype predictions were validated on two external cohorts of surgical specimens, BJN_250 (FIG. 5A) and TCGA-PAAD (FIG. 5B), with AUCs of 0.84 and 0.79, respectively. Notably, when the analysis was restricted to samples with high-confidence RNA-defined molecular subtypes, the model predicted subtypes more accurately, reaching AUCs of 0.92 and 0.89 in BJN_250 and TCGA-PAAD, respectively.

Because most PDA patients are diagnosed at the metastatic stage on liver biopsy, the model was tested on 25 fine needle liver biopsies with matched RNAseq data. The model performed well (AUC = 0.85 [0.69-1.0]) and, as with surgical specimens, its performance improved in cases with well-defined molecular subtypes (AUC = 0.92 [0.77-1.0]) (FIG. 5C).

In one embodiment, in multivariate analysis, the molecular predictions output by the trained DL model are independently associated with overall survival (OS) and disease-free survival (DFS) in BJN_250.

The graphs show the univariate/bivariate overall survival for BJN (FIG. 6) and the multivariate overall survival for TCGA-PAAD (FIG. 7), according to an embodiment of the disclosure.

FIG. 8A illustrates an example set of basal tiles, while FIG. 8B illustrates an example set of classical tiles.

In another example embodiment, the basal/classical classification performance of the cross-validated model in the training set is 0.79 (AUC). When restricted to samples with high-confidence RNA-defined molecular subtypes, the performance of the same model reached an AUC of 0.86. The subtypes defined by the model are independently associated with overall survival in multivariate analysis (HR = 2.56 [1.87-3.49], p < 0.001), and the association is stronger than for the PurIST RNA subtype (HR = 1.60 [1.17-2.19], p < 0.001). In the validation cohort, the overall AUC of the model was 0.82, and it was 0.89 in the subset of "subtype-pure" tumors. In addition to demonstrating the value of the histology-based deep learning model for PAC tumor subtype typing, these results also show the limitations of molecular-based subtype typing in highly heterogeneous samples.

As shown in FIG. 9, a computer system 900, which is one form of data processing system, includes a bus 903 coupled to microprocessor(s) 905 and ROM (read only memory) 907 and volatile RAM 909 and nonvolatile memory 913. The microprocessor 905 may include one or more CPUs, GPU(s), special purpose processors, and/or combinations thereof. The microprocessor 905 may communicate with a cache 904 and may retrieve instructions from the memories 907, 909, 913 and execute the instructions to perform the operations described above. The bus 903 interconnects these various components together and also interconnects these components 905, 907, 909, and 913 to the display controller and display device 915 and to peripheral devices such as input/output (I/O) devices 911 which may be mice, keyboards, modems, network interfaces, printers, and other devices which are well known in the art. Typically, an input/output device 911 is coupled to the system through an input/output controller 917. Volatile RAM (random access memory) 909 is typically implemented as Dynamic RAM (DRAM) which requires power continuously in order to refresh or maintain the data in the memory.

The non-volatile memory 913 may be, for example, a magnetic hard disk drive or a magneto-optical drive or an optical disk drive or DVD RAM or flash memory or other type of memory system that maintains data (e.g., large amounts of data) even after power is removed from the system. Typically, the non-volatile memory 913 will also be a random access memory, but this is not required. Although FIG. 9 shows nonvolatile memory 913 as a local device coupled directly to the remaining components in the data processing system, it should be appreciated that the present invention may utilize nonvolatile memory remote from the system, such as a network storage device coupled to the data processing system through a network interface such as a modem, ethernet interface, or wireless network. Bus 903 may include one or more buses connected to each other through various bridges, controllers and/or adapters as is well known in the art.

Portions of the above description may be implemented with logic circuitry, such as dedicated logic circuitry, or with a microcontroller or other form of processing core executing program code instructions. Thus, the processes taught by the discussion above may be performed with program code such as machine-executable instructions that cause a machine that executes the instructions to perform certain functions. In this context, a "machine" may be a machine (e.g., an abstract execution environment such as a "virtual machine" (e.g., a Java virtual machine), an interpreter, a common language runtime, a high-level language virtual machine, etc.) that converts intermediate form (or "abstract") instructions into processor-specific instructions, and/or electronic circuitry such as general-purpose processors and/or special-purpose processors that is disposed on a semiconductor chip (e.g., a "logic circuitry" implemented in transistors) designed to execute the instructions. The processes taught by the discussion above may also be performed by (as an alternative to, or in combination with, a machine) electronic circuitry designed to perform the processes (or a portion thereof) without the execution of program code.

The present invention also relates to an apparatus for performing the operations described herein. The apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), RAM, EPROM, EEPROM, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory ("ROM"); random access memory ("RAM"); a magnetic disk storage medium; an optical storage medium; a flash memory device; etc.

The article of manufacture may be used to store program code. The article of manufacture in which the program code is stored may be implemented as, but is not limited to, one or more memories (e.g., one or more flash memories, random access memories (static, dynamic or other)), optical disks, CD-ROMs, DVD ROMs, EPROMs, EEPROMs, magnetic or optical cards or other machine-readable media suitable for storing electronic instructions. The program code may also be downloaded from a remote computer (e.g., a server) by way of data signals embodied in a propagation medium (e.g., via a communication link (e.g., a network connection)).

The foregoing detailed description is presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as "segmenting," "tiling," "receiving," "computing," "extracting," "processing," "applying," "expanding," "normalizing," "pre-training," "sorting," "selecting," "aggregating," "sorting," or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the described operations. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.

Example 1: deep learning method to identify molecular subtypes of pancreatic adenocarcinoma on histological sections

Described herein is a multi-step method that uses a deep learning model to determine tumor cell types and molecular phenotypes on routine histological preparations, at a resolution that makes it possible to dissect intra-tumor heterogeneity on a large scale (FIG. 10). The model was trained and validated on a multi-center cohort using 1796 sections (602 patients) with matched transcriptomes.

The deep learning model was trained on a discovery cohort consisting of 424 whole slice histological images from 202 resected PACs (average number of slices per case = 2) from 3 centers, with neoplastic areas annotated by two pathologists (FIG. 11A). A first model was developed to predict neoplastic areas (PACpAInt-Neo); when applied to two different validation cohorts, this model correctly detected neoplastic areas with AUCs of 0.99 and 0.98, respectively (FIG. 12A).

Transcriptome data were available for all cases (from the same lesion but from spatially unmatched regions) and were used in a second step to train a model on the predicted neoplastic regions to determine the PurIST-RNA-defined basal-like/classical (B/C) subtype (PACpAInt-B/C). Histologically, although these tumors exhibited highly diverse morphologies, the patterns could be grouped into two sets corresponding to the basal-like or classical subtype (FIG. 13A). When applied to validation cohorts in which RNAseq was sampled in a manner similar to the discovery cohort (i.e., spatially unmatched), PACpAInt-B/C predicted the subtype with AUCs of 0.86 [0.79-0.94] and 0.81 [0.71-0.90] in the BJN cohort (FIG. 13B) and TCGA-PAAD (FIGS. 14A-14B), respectively. The performance of the model was comparable in the cohort with spatially matched histological and molecular regions (AUC = 0.83 [0.73-0.93]) (FIG. 15).

Given the previously described intratumoral heterogeneity, the analysis was then restricted to the 50% of cases with the clearest molecular subtype (low molecular heterogeneity), which yielded a significant improvement in performance: AUCs of 0.91 [0.84-0.98] and 0.88 [0.79-0.98] in the BJN cohort (see FIG. 13B, "transparent RNA subtype") and TCGA-PAAD (see FIG. 14B, "transparent RNA subtype"), respectively. The improvement was particularly marked in the matched validation cohort (see FIG. 15, AUC = 0.95 [0.90-1.0]), highlighting the limitations of binary classification in highly heterogeneous tumors.

Since most patients are diagnosed at a metastatic stage on liver biopsy, the PACpAInt-B/C model was tested on 25 fine needle liver biopsies (EUS-FNB) with matched RNAseq data. The model performed similarly (AUC = 0.85 [0.69-1.0]) and, as with surgical specimens, its performance improved in cases with a well-defined molecular subtype (AUC = 0.92 [0.77-1.0]) (FIG. 16).

In the BJN validation cohort, PACpAInt-B/C, along with N stage and tumor size, had strong independent prognostic value in multivariate analysis for overall survival (OS; FIG. 17A) and disease-free survival (DFS; FIG. 17B) (HR = 1.37 [1.16-1.62], p < 0.001 and HR = 1.27 [1.08-1.49], p = 0.003, respectively) (see also Table 2). In contrast, the PurIST-RNA classification had independent prognostic value for DFS but not for OS (FIG. 18A and FIG. 18D; see also Table 3).

Table 2: PACpAInt-B/C in the BJN validation cohort

TABLE 2

Table 3: PurIST-RNA in the BJN validation cohort

TABLE 3

It has been shown previously that, within a given case, tumor cells on different sections can have different morphologies. This may be of particular interest for RNA-defined classical tumors, which may contain less differentiated regions that could affect prognosis. However, this heterogeneity is cumbersome and difficult to evaluate and quantify correctly. To address this problem, the RNA-defined classical cases in the matched validation cohort were selected (n=77/97) and all of their tumor sections (average number of sections per case = 9) were analyzed using the PACpAInt-B/C model (FIG. 19). 47 (61%) cases were homogeneous, with all slices unambiguously predicted to be classical, while in the remaining cases the prediction for at least one slice differed from the others, clearly indicating substantial morphological, and possibly molecular, heterogeneity. DFS and OS were shorter in heterogeneous cases (median survival of 35 months versus 15 months, p=0.08, and 64 months versus 36 months, p=0.002, respectively), highlighting the clinical impact of tumor heterogeneity (FIG. 19).
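
The homogeneous/heterogeneous split described above can be sketched as follows. It assumes each slice of a case has a basal-like probability from the B/C model and that a slice is called basal-like above a cutoff of 0.5; the cutoff, the function names, and the toy data are illustrative assumptions, not values taken from the study.

```python
def slice_calls(slice_scores, threshold=0.5):
    """Turn per-slice basal-like probabilities into subtype calls."""
    return ["basal-like" if s > threshold else "classical"
            for s in slice_scores]


def is_homogeneous(slice_scores, threshold=0.5):
    """A case is homogeneous when every slice receives the same call."""
    if not slice_scores:
        raise ValueError("a case needs at least one slice")
    return len(set(slice_calls(slice_scores, threshold))) == 1


def homogeneous_fraction(cases, threshold=0.5):
    """Fraction of homogeneous cases in {case_id: [slice scores]}."""
    flags = [is_homogeneous(scores, threshold) for scores in cases.values()]
    return sum(flags) / len(flags)


# Toy data: case A is uniformly classical, case B has one discordant slice.
toy_cases = {"A": [0.1, 0.2, 0.3], "B": [0.1, 0.9, 0.2]}
fraction = homogeneous_fraction(toy_cases)
```

Cases flagged as heterogeneous by such a rule could then be compared with homogeneous cases in a survival analysis, as the text describes.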

While this whole-section approach, centered on tumor cells, has demonstrated its potential, it is limited to heterogeneity between sections and does not take into account the stroma (an important component of PAC biology). PACpAInt was therefore further trained to distinguish tumor cells from the stroma (PACpAInt cell type model) at a resolution of 112-micron-wide squares (called tiles). The performance of this model was good, with an AUC of 0.99 in both validation cohorts (FIGS. 20A, 20B). Stromal abundance quantified by immunohistochemistry and/or special stains has been reported to correlate with prognosis in small cohorts. The model was validated on a subset of cases (n=50) in which tumor cell/stroma ratios were calculated using pan-cytokeratin immunohistochemistry (Pearson's R = 0.72, p < 0.001) (FIG. 20C). Applying this model to the whole cohort (n=451) showed that abundant stroma was independently associated with a better prognosis (HR = 0.86 [0.76-0.96], p = 0.01 and HR = 0.87 [0.77-0.98], p = 0.02 for DFS and OS, respectively) (FIG. 21 and Table 4).

Table 4: PACpAInt-cell type prognosis value

Previously developed methods can quantify the tumor cell (classical/basal-like) and stromal (active/inactive) phenotypes from transcriptome profiles. We therefore further trained PACpAInt to predict these four transcriptomic components at the whole-slice level, using a deep learning architecture that also enables tile-level inference (PACpAInt-Comp) (fig. 11C). The correlation between the transcriptomic components and the PACpAInt-Comp predictions was marked in the unmatched validation cohort and improved further in the spatially matched validation cohort (average Pearson R was +17 for the four components) (fig. 22).

To verify the local accuracy of PACpAInt-Comp in predicting tile-level phenotypes, the agreement between the model (classical/basal-like on tumor cell tiles and active/inactive on stromal tiles) and the scoring of two PAC pathologists was first evaluated and found to be very good (agreement on tumor components = 100% and 99.2%; agreement on stromal components = 99.2% and 99.4%). Furthermore, tiles with high classical/basal-like component scores were tiles with high tumor cell scores, and tiles with high active/inactive stromal scores were tiles with high stroma scores (fig. 23). To complete the local validation of PACpAInt-Comp, sections were stained with antibodies against GATA6 and KRT17, two recognized markers of the classical and basal-like phenotypes, respectively. Stained tumor regions were selected and the tiles within them were scored for the tumor cell components. Both tumor components were well predicted in distinguishing KRT17+ (basal-like) from GATA6+ (classical) regions, with an AUC of 0.87 for the basal-like component and 0.75 for the classical component. PACpAInt-Comp also made it possible to study the relationships between components, demonstrating a strong association of active stroma with the basal-like rather than the classical component (fig. 24). In a multivariate Cox model, adding the PACpAInt-Comp local components significantly improved prognosis prediction (c-index +0.04, p=0.007 for OS and +0.03, p=0.008 for DFS) (table 5).

Table 5: Cox model with tile scores (c-index)

Covariates	Overall survival	Disease-free survival
Clinical variables	0.63	0.62
Clinical variables + PurIST	0.62	0.63
Clinical variables + pattern score	0.67	0.65

To assess the prognostic impact of intratumoral heterogeneity, a total of 6.3 million tumor tiles from 451 patients were phenotyped, i.e., each tile was assigned a score for each tumor/stromal component measuring its intensity. The distribution of basal-like and classical scores showed that only 60% of tumors had a distinguishable main subtype (classical 41%, basal-like 19%). The remainder could be divided into a rare mixed subtype, defined by the coexistence of well-differentiated basal-like and classical tumor cells (10%), and an intermediate subtype, in which most tumor cells could not be unambiguously assigned to either subtype (30%) (fig. 25). These latter two subtypes had an intermediate prognosis for both overall survival (fig. 26, left panel) and disease-free survival (fig. 26, right panel). Single-cell analyses (12 cases) have indicated that most tumors may contain basal-like tumor cells. Applying PACpAInt-Comp to the 451 cases showed that 71% of tumors exhibited detectable highly basal-like tumor cells. The overall proportion of basal-like cells had prognostic significance, with a poorer prognosis for overall survival (fig. 27, left panel) and disease-free survival (fig. 27, right panel) starting from 5% of highly basal-like tumor cells. This basal-like proportion was also an independent adverse prognostic factor for overall and disease-free survival (fig. 28 and table 6).
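
As a rough illustration of the basal-like proportion discussed above, the following Python sketch computes the fraction of a patient's tumor tiles that are highly basal-like and checks it against the 5% level mentioned in the text. The cut-off used to call a tile "highly basal-like" (`HIGH_BASAL_CUTOFF`) and the function names are assumptions made for this example only; the source does not state them.

```python
from typing import Sequence

# Hypothetical cut-off above which a tile counts as "highly basal-like".
# The source does not give this value; it is an assumption for illustration.
HIGH_BASAL_CUTOFF = 0.8
# Proportion from which a poorer prognosis is reported in the text (5%).
PROGNOSTIC_PROPORTION = 0.05


def basal_proportion(basal_scores: Sequence[float],
                     cutoff: float = HIGH_BASAL_CUTOFF) -> float:
    """Fraction of a patient's tumor tiles whose basal score exceeds `cutoff`."""
    if not basal_scores:
        raise ValueError("at least one tumor tile score is required")
    n_high = sum(1 for s in basal_scores if s > cutoff)
    return n_high / len(basal_scores)


def has_prognostic_basal_component(basal_scores: Sequence[float]) -> bool:
    """True when the highly basal-like proportion reaches the 5% level."""
    return basal_proportion(basal_scores) >= PROGNOSTIC_PROPORTION


# Toy example: 100 tumor tiles, 6 of which are highly basal-like.
toy_scores = [0.9] * 6 + [0.1] * 94
print(basal_proportion(toy_scores))
print(has_prognostic_basal_component(toy_scores))
```

In practice the tile scores would come from the tile-level output of the component model rather than from toy values.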

Table 6: base ratio AI-450

The above studies indicate that the deep learning method described herein is capable of predicting PAC molecular subtypes of tumor and stromal cells on routine pathological sections. The interpretable deep learning design allows translation of defined molecular features across a tumor to morphology-based spatially-oriented cellular phenotypes for comprehensive intratumoral heterogeneity analysis.

The training cohort included sections collected from different centers over a long period of time and stained with different protocols, thereby covering a wide variability in staining. The robustness of the model is demonstrated by the good performance achieved on the validation cohorts and on the multi-center TCGA PAAD cohort. Moreover, the model performs well on liver biopsies (the most common diagnostic sample for PAC). In addition to providing a binary classification of tumor cells (without lengthy and expensive RNAseq analysis) that can help in the immediate choice of treatment regimen, the model can be deployed in clinical trials to stratify patients. It can also detect tumor cells remaining after neoadjuvant therapy, paving the way for standardized regression scores, which could likewise be used in trials to adapt adjuvant therapy. The model described herein also makes it possible to evaluate PAC intratumoral heterogeneity at a level that has not previously been explored. These results provide the first clear picture of the intratumoral heterogeneity of PAC, indicating that almost one third of tumors may lie between the classical and basal-like subtypes. Furthermore, the data show that minor basal-like components, which a binary classification would ignore, have strong prognostic significance. This study also demonstrates that the stromal compartment can be rapidly subtyped, paving the way for patient stratification in trials of targeted drugs. Thus, artificial intelligence-based PAC subtyping may guide patient stratification on a robust molecular basis.

Methods

The following methods represent those used in the examples described herein.

Data set description. The discovery cohort used to develop our models was a multi-center cohort of 202 patients treated between September 1996 and December 2010 at university hospitals including Saint-Antoine University Hospital and Ambroise Paré University Hospital. At least two hematoxylin-eosin ± saffron (HES) sections from the surgical specimen were available per patient, for a total of 424 sections. BJN_unmatched and BJN_matched are two independent validation cohorts of patients treated at Beaujon University Hospital between September 1996 and January 2014. BJN_unmatched consisted of 304 HES sections of surgical resection specimens from 148 patients. For the BJN_matched cohort, all sections of the tumor specimens were digitized, for a total of 909 HES sections from 100 patients. EUS-FNB is the third independent validation cohort, consisting of endoscopic ultrasound-guided fine-needle biopsies from 25 patients with liver metastases (one biopsy per patient) treated at Beaujon University Hospital between 2013 and 2020. TCGA PAAD is a multi-center independent validation cohort of 134 hematoxylin-eosin (H&E) sections (126 cases) from the public TCGA database. The inclusion criteria for all cohorts were as follows: an unambiguous diagnosis of PAC; available histological sections of formalin-fixed, paraffin-embedded material; available follow-up and molecular information; and no metastasis at diagnosis. This resulted in the exclusion of 34 sections from TCGA that either contained no tumor cells or came from frozen sections.

Transcriptome profiling and molecular subtyping. The discovery cohort corresponds to the 202 resected tumors of a previous study (Puleo et al., Gastroenterology (2018), 155:1999-2013), which were profiled using Affymetrix U219 microarrays. For the BJN_unmatched cohort, RNA was extracted from 0.8 mm diameter cores sampled from tumor-rich areas; in most cases, RNA was not extracted from the same block as that used to generate the HES sections. For the BJN_matched and EUS-FNB series, RNA was extracted after manual microdissection to remove contaminating normal liver or pancreatic tissue. For these cohorts, RNA was extracted from exactly the same region as the HES section analyzed. In addition, for the BJN_matched cohort, all other tumor sections were also analyzed by PACpAInt. For the BJN cohorts, DNA/RNA was extracted using the AllPrep FFPE tissue kit (Qiagen, Venlo, Netherlands) according to the manufacturer's instructions, and RNA was sequenced using 3' RNAseq (Lexogen QuantSeq). RNAseq reads were mapped using STAR (REF STAR), and genes were quantified using featureCounts. Gene counts were upper-quartile normalized and log-transformed. PurIST was applied to the microarray and RNAseq profiles to generate a class label for each sample. The tumor and stromal component signatures were applied to the microarray and RNAseq profiles, producing a continuous score for each component in each sample. For each dataset, the difference between the scaled basal-like and classical component scores was calculated, and samples with a difference greater than the median were considered to have a clear RNA subtype.
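
The selection of samples with a clear RNA subtype described above can be sketched as follows. This is only an illustrative reading of the text: it assumes that "scaled" means z-scoring within a dataset and that the absolute difference is compared with its median, neither of which is stated explicitly in the source. The function names are hypothetical.

```python
from statistics import mean, median, pstdev
from typing import List, Sequence


def zscale(values: Sequence[float]) -> List[float]:
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot scale a constant vector")
    return [(v - mu) / sigma for v in values]


def clear_subtype_mask(basal: Sequence[float],
                       classical: Sequence[float]) -> List[bool]:
    """Flag samples whose scaled basal/classical difference exceeds the median.

    Assumption: the absolute difference is used, so that strongly basal-like
    and strongly classical samples are both retained.
    """
    diffs = [abs(b - c) for b, c in zip(zscale(basal), zscale(classical))]
    cut = median(diffs)
    return [d > cut for d in diffs]


# Toy component scores for four samples of one dataset.
toy_basal = [0.9, 0.5, 0.1, 0.55]
toy_classical = [0.1, 0.5, 0.9, 0.45]
print(clear_subtype_mask(toy_basal, toy_classical))
```

In the toy data, the first and third samples are strongly polarized and are retained, while the two balanced samples fall below the median difference.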

Preprocessing of the whole slice image. Applying deep learning algorithms to histological data is challenging, in particular because of the high dimensionality of the data (up to 100,000 x 100,000 pixels for a single whole slice image) and the small size of the available datasets. A preprocessing pipeline consisting of several steps was therefore used to reduce dimensionality and clean the data. The first step is tissue detection on the WSI: a U-Net neural network is used to segment the parts of the image that contain tissue and to discard the background and artifacts such as blur or handwriting. The second step is tiling the slice into smaller images of 112 x 112 μm (224 x 224 pixels), called "tiles". For the examples provided herein, at least 20% of a tile must have been detected as foreground by the U-Net model for it to be considered a tissue tile. Up to 8,000 such tiles are then sampled uniformly from each slice. The third step is feature extraction from each tile: 2,048 relevant features are extracted using the self-supervised model MoCo v2, following the method proposed by Dehaene et al., "Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology" (2020), arXiv (https://arxiv.org/pdf/2012.03583). At the end of this preprocessing pipeline, each slice is represented by a matrix of dimensions (n_tiles, 2048).
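
The tile selection step of this pipeline can be sketched as below. This is a minimal NumPy illustration, not the actual implementation: the boolean `mask` stands in for the output of the tissue segmentation network, the function name `select_tiles` is hypothetical, and feature extraction is not shown.

```python
import numpy as np

TILE_PX = 224          # tile side in pixels (112 um in the text)
MIN_FOREGROUND = 0.20  # minimum tissue fraction for a tile to be kept
MAX_TILES = 8000       # maximum number of tiles sampled per slice


def select_tiles(mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return (row, col) grid indices of the sampled tissue tiles.

    `mask` is a 2-D boolean array (True = tissue), standing in for the
    output of the segmentation network.
    """
    n_rows, n_cols = mask.shape[0] // TILE_PX, mask.shape[1] // TILE_PX
    # Crop to a whole number of tiles, then view the mask as a grid of blocks.
    grid = mask[: n_rows * TILE_PX, : n_cols * TILE_PX]
    grid = grid.reshape(n_rows, TILE_PX, n_cols, TILE_PX)
    tissue_fraction = grid.mean(axis=(1, 3))
    candidates = np.argwhere(tissue_fraction >= MIN_FOREGROUND)
    if len(candidates) > MAX_TILES:
        # Uniform sampling without replacement, keeping the original order.
        keep = rng.choice(len(candidates), size=MAX_TILES, replace=False)
        candidates = candidates[np.sort(keep)]
    return candidates


# Toy mask covering a 2 x 3 grid of tiles.
mask = np.zeros((2 * TILE_PX, 3 * TILE_PX), dtype=bool)
mask[:224, :224] = True        # tile (0, 0): fully tissue
mask[224:254, 224:448] = True  # tile (1, 1): about 13% tissue, rejected
mask[224:284, 448:672] = True  # tile (1, 2): about 27% tissue, kept
tiles = select_tiles(mask, np.random.default_rng(0))
print(tiles.tolist())
```

Each retained tile would then be passed through the feature extractor to build the (n_tiles, 2048) matrix described above.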

Neoplasia and cell type prediction. The PACpAInt neoplasia prediction model (PACpAInt-Neo) was trained at the tile level on annotations of neoplastic regions provided by two pathologists on 433 slices of the discovery cohort, corresponding to a total of 9,886,596 tiles. The WSI preprocessing described in the section "Preprocessing of the whole slice image" was used to obtain 2,048 features per tile. PACpAInt-Neo consists of a multi-layer perceptron with a single hidden layer of 128 neurons followed by ReLU activation. The model was validated on regions annotated by two pathologists in the BJN and TCGA PAAD cohorts (10 slices per cohort). The PACpAInt cell type segmentation model (PACpAInt-Cell type) has the same architecture as PACpAInt-Neo and was trained on 81 slices of the discovery cohort in which stroma and tumor cells had been annotated. PACpAInt-Cell type was also validated on BJN and TCGA PAAD (10 slices per cohort).
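
A tile-level classifier of the kind described above (2,048 input features, one hidden layer of 128 neurons, ReLU) can be sketched as a plain NumPy forward pass. The weights below are random and untrained, and the single sigmoid output unit is an assumption made so that each tile receives a score between 0 and 1; the source does not describe the output layer.

```python
import numpy as np

N_FEATURES, N_HIDDEN = 2048, 128


def init_params(rng: np.random.Generator) -> dict:
    """Random (untrained) weights; real weights would come from training."""
    return {
        "w1": rng.normal(0.0, 0.02, size=(N_FEATURES, N_HIDDEN)),
        "b1": np.zeros(N_HIDDEN),
        "w2": rng.normal(0.0, 0.02, size=(N_HIDDEN, 1)),
        "b2": np.zeros(1),
    }


def predict_tile_scores(features: np.ndarray, params: dict) -> np.ndarray:
    """Map (n_tiles, 2048) features to one score in (0, 1) per tile."""
    hidden = np.maximum(features @ params["w1"] + params["b1"], 0.0)  # ReLU
    logits = hidden @ params["w2"] + params["b2"]
    return (1.0 / (1.0 + np.exp(-logits))).ravel()  # sigmoid output (assumed)


rng = np.random.default_rng(0)
params = init_params(rng)
features = rng.normal(size=(10, N_FEATURES))  # stand-in tile features
scores = predict_tile_scores(features, params)
print(scores.shape)
```

The same forward pass would apply to the cell type model, which is described as sharing this architecture.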

Molecular prediction. PACpAInt-B/C and PACpAInt-Comp are two deep learning models trained on the discovery cohort to predict, respectively, the PurIST basal-like/classical classification and the four molecular components (classical, basal-like, stroma active, and stroma inactive). Both models use the WSI preprocessing pipeline described in the section "Preprocessing of the whole slice image". PACpAInt-Neo is additionally applied to the tile features to select only tiles in neoplastic areas (e.g., tiles with a neoplasia prediction score greater than 0.5). The PACpAInt-B/C architecture is similar to that proposed by Ilse et al., "Attention-based deep multiple instance learning", International Conference on Machine Learning, PMLR (2018): a linear layer with 128 neurons is applied for embedding, followed by a gated attention layer with 128 hidden neurons. An MLP with 128 and 64 hidden neurons and ReLU activations is then applied to the result. A final sigmoid activation is applied to the output to obtain a score between 0 and 1, which represents the probability that the slice is basal-like or classical. PACpAInt-B/C was trained with binary cross-entropy as the loss function. PACpAInt-Comp was inspired by the WELDON algorithm (Durand et al., "WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks", 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), June 2016, Las Vegas, NV, USA): a set of one-dimensional embeddings of the tile features is computed using a multi-layer perceptron (MLP) with 128 hidden neurons followed by 4 neurons and ReLU activation. For each output channel of the MLP, the R=100 top and bottom scores are selected and averaged, so that the output of the model is a vector of size 4 corresponding to the predicted value of each molecular component. PACpAInt-Comp was trained with mean squared error as the loss function.
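
The two pooling mechanisms named in this paragraph can be sketched in NumPy as follows. `gated_attention_pool` follows the gated-attention formulation of Ilse et al. (attention weights from a tanh branch multiplied by a sigmoid gate, normalized with a softmax), and `top_bottom_pool` is a WELDON-style pooling. Both use random stand-in weights and inputs; pooling the top and bottom scores into a single average and capping R at the number of tiles are assumptions made for this sketch.

```python
import numpy as np


def gated_attention_pool(h: np.ndarray, V: np.ndarray, U: np.ndarray,
                         w: np.ndarray):
    """Gated-attention pooling over tile embeddings.

    h: (n_tiles, d) embeddings; V, U: (d, k) projections; w: (k,) vector.
    Returns the (d,) slice embedding and the (n_tiles,) attention weights.
    """
    gate = np.tanh(h @ V) * (1.0 / (1.0 + np.exp(-(h @ U))))
    logits = gate @ w
    logits = logits - logits.max()  # numerical stability of the softmax
    attn = np.exp(logits) / np.exp(logits).sum()
    return attn @ h, attn


def top_bottom_pool(tile_scores: np.ndarray, r: int = 100) -> np.ndarray:
    """Mean of the r highest and r lowest tile scores, per component.

    tile_scores: (n_tiles, n_components). Returns (n_components,).
    Assumption: top and bottom scores are pooled into one average, and r is
    capped at the number of tiles (the two sets may then overlap).
    """
    r = min(r, tile_scores.shape[0])
    ordered = np.sort(tile_scores, axis=0)
    extremes = np.concatenate([ordered[:r], ordered[-r:]], axis=0)
    return extremes.mean(axis=0)


rng = np.random.default_rng(0)
h = rng.normal(size=(500, 128))  # stand-in embeddings of 500 tumor tiles
V = rng.normal(size=(128, 128)) * 0.1
U = rng.normal(size=(128, 128)) * 0.1
w = rng.normal(size=128) * 0.1
slice_embedding, attn = gated_attention_pool(h, V, U, w)
component_scores = top_bottom_pool(rng.normal(size=(500, 4)), r=100)
print(slice_embedding.shape, component_scores.shape)
```

A practical consequence of both designs is that per-tile quantities (attention weights or per-channel tile scores) remain available after training, which is what allows tile-level inspection of a slice-level model.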

Spatial validation. To verify locally the accuracy of PACpAInt-Comp in predicting the classical and basal-like components, GATA6 and KRT17 IHC were performed on 12 slices of BJN_matched. The tile scores for the classical and basal-like components were analyzed in the regions defined by IHC as classical or basal-like. Two pathologists, blinded to the score associated with each tile, also reviewed tiles predicted as classical or basal-like (n=500) and tiles predicted as stroma active or inactive (n=500).

Performance assessment and statistical methods. The area under the receiver operating characteristic curve (AUC) was used to quantify the ability of the model to distinguish classical from basal-like tumors as assessed by the PurIST method. The same metric was used to evaluate the performance of PACpAInt-Neo in distinguishing normal from neoplastic regions and of PACpAInt-Cell type in distinguishing stroma from epithelial tumor cells. The DeLong method was used to calculate confidence intervals at the 95% confidence level (DeLong et al., "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach", Biometrics (1988): 837-845). Pearson correlation was used to evaluate the performance of the PACpAInt-Comp model in predicting the molecular components. Survival analyses were performed with univariate and multivariate Cox proportional hazards models as implemented in the Python lifelines package (Davidson-Pilon et al., CamDavidsonPilon/lifelines: v0.22.9 (version v0.22.9), Zenodo, http://doi.org/10.5281/zenodo.3523175, October 30, 2019). The log-rank test was used to compare survival distributions between population subgroups. The survcomp R package was used to compare c-indices (MS Schroder et al., "Survcomp: an R/Bioconductor package for performance assessment and comparison of survival models", Bioinformatics (2011), 27(22), 3206-3208). All tests were two-sided, and P values <0.05 were considered statistically significant.
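
For readers unfamiliar with the two discrimination metrics used here, the sketch below gives plain-Python definitions of the AUC and of Harrell's concordance index. It is a didactic re-implementation for small inputs, not the lifelines or survcomp code used in the study, and it omits confidence intervals and statistical tests.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def concordance_index(times: Sequence[float], events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell's c-index: fraction of comparable pairs ordered correctly.

    A pair (i, j) is comparable when the subject with the shorter time had an
    event. A higher risk score is expected to mean shorter survival.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                concordant += (risks[i] > risks[j]) + 0.5 * (risks[i] == risks[j])
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(concordance_index([5, 10, 15], [1, 1, 0], [0.9, 0.5, 0.1]))
```

A c-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the Cox models in Table 5 are compared.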

Intratumoral heterogeneity subtypes. Using all available slices for each patient, the 99th percentiles of the basal-like and classical component scores defined by PACpAInt-Comp were calculated per patient. A Gaussian mixture was then fitted to the difference between the 99th-percentile basal-like and classical scores to separate the tumors into two groups: tumors in the high-difference group were considered clearly differentiated as either basal-like or classical, whereas tumors in the low-difference group had an ambiguous basal-like/classical differentiation. The maximum of the basal-like and classical component values was then used to split the low-difference group further. The group with a small difference between extremes but a high maximum, in which both classical and basal-like differentiation are high, was called mixed; the group with a small difference between extremes and a low maximum was called intermediate, since no tumor cell pattern reached a high level of either basal-like or classical differentiation.
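
The decision logic of this four-way grouping can be sketched as follows. Note that this sketch replaces the Gaussian mixture described above with two fixed thresholds (`DIFF_CUTOFF` and `MAX_CUTOFF`), which are hypothetical values chosen only to make the example short; it also assumes tile scores on a 0 to 1 scale and uses a nearest-rank percentile.

```python
import math
from typing import Sequence

# Hypothetical cut-offs. The source derives the groups from a Gaussian mixture
# rather than fixed thresholds; fixed values are used here only for brevity.
DIFF_CUTOFF = 0.3
MAX_CUTOFF = 0.7


def percentile_99(values: Sequence[float]) -> float:
    """Nearest-rank 99th percentile."""
    if not values:
        raise ValueError("empty score list")
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def heterogeneity_subtype(basal: Sequence[float],
                          classical: Sequence[float]) -> str:
    """Assign one of: basal-like, classical, mixed, intermediate."""
    p_basal, p_classical = percentile_99(basal), percentile_99(classical)
    if abs(p_basal - p_classical) >= DIFF_CUTOFF:
        # Large gap between extremes: one differentiation clearly dominates.
        return "basal-like" if p_basal > p_classical else "classical"
    if max(p_basal, p_classical) >= MAX_CUTOFF:
        # Small gap but both extremes are high: both patterns coexist.
        return "mixed"
    # Small gap and low extremes: no pattern reaches a high level.
    return "intermediate"


print(heterogeneity_subtype([0.05] * 95 + [0.9] * 5, [0.2] * 100))
print(heterogeneity_subtype([0.8] * 100, [0.85] * 100))
print(heterogeneity_subtype([0.4] * 100, [0.45] * 100))
```

Using a high percentile rather than the mean makes the grouping sensitive to small but strongly differentiated tile populations, which matches the emphasis on minor basal-like components in the results.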

Clinical variables used in multivariate analysis. The clinical variables considered for multivariate analysis were common variables known to be associated with PAC prognosis: pN stage, differentiation, perineural invasion, resection status, tumor size, vascular invasion, and adjuvant therapy (yes/no).

The foregoing discussion and examples merely describe some exemplary embodiments of the present invention. Those skilled in the art will readily recognize from such discussion, and from the accompanying drawings and claims that various modifications can be made without departing from the spirit and scope of the invention.

Claims

1. A computer-implemented method for processing a digital image of a pancreatic ductal adenocarcinoma (PAC) sample, the method comprising receiving a digital image of a PAC sample obtained from a subject, applying a machine learning model to the digital image, and determining a PAC classification of the image using the machine learning model, wherein the machine learning model has been trained by processing a plurality of training images.

2. The computer-implemented method of claim 1, wherein the digital image is a Whole Slice Image (WSI).

3. The computer-implemented method of claim 1 or claim 2, wherein the method further comprises one or more image preprocessing steps.

4. The computer-implemented method of claim 3, wherein the image preprocessing step comprises one or more of:

a. removing a background segment from the image;

b. tiling the digital image into a set of tiles;

c. performing feature extraction on the set of tiles, or a subset thereof, to extract a set of features from the set of tiles.

5. The computer-implemented method of claim 4, wherein the image preprocessing step comprises (a), (b), and (c).

6. The computer-implemented method of any of claims 1 to 5, wherein PAC classification is at a slice level.

7. The computer-implemented method of claim 4 or claim 5, wherein PAC classification is at a tile level.

8. The computer-implemented method of claim 7, wherein PAC classification classifies a tile as containing a neoplastic region or not, wherein a neoplastic region can include tumor cells and/or tumor-associated stromal cells.

9. The computer-implemented method of claim 6, wherein the PAC classification includes a continuous score representing a likelihood that the PAC sample represented in the image belongs to one of two PDA subtypes.

10. The computer-implemented method of claim 7, wherein the PAC classification includes one or more continuous scores representing a prevalence of one or more PDA subtype features in the PAC sample represented in the tile.

11. A computer-implemented method of determining a pancreatic ductal adenocarcinoma (PDA) subtype corresponding to a PDA classification scheme for a subject having PDA, the method comprising:

receiving a digital image of a histological section of a PDA sample obtained from the subject;

preprocessing the image to extract a set of features, wherein the preprocessing comprises:

tiling the digital image into a set of tiles, and

performing feature extraction on the set of tiles to extract the set of features from the set of tiles;

selecting a subset of tiles representing one or more tumor tissue segments, wherein the subset of tiles comprises a subset of features; and

determining a PDA subtype of the digital image from at least the subset of features using a machine learning model, wherein the machine learning model is trained for the PDA classification scheme and each of the PDA subtypes is a PDA subtype of the PDA classification scheme.

12. The computer-implemented method of claim 11, further comprising: calculating one or more PDA sub-component scores for each tile in the subset of tiles using the machine learning model.

13. The computer-implemented method of claim 11, wherein the machine learning model is further trained to calculate a score for each PDA subtype of the classification scheme.

14. The computer-implemented method of claim 11, wherein the PDA classification scheme is PurIST, and wherein the PDA subtypes include classical and basal-like.

15. The computer-implemented method of claim 11, wherein the PDA classification scheme is molecular component profiling, and wherein the PDA molecular components include classical, basal, interstitial active, and interstitial inactive.

16. The computer-implemented method of claim 11, wherein the PDA classification scheme is one of a plurality of PDA classification schemes.

17. The computer-implemented method of claim 11, wherein each PDA classification scheme of the plurality of PDA classification schemes comprises a plurality of possible PDA subtypes.

18. The computer-implemented method of claim 11, wherein the histological section of the PDA sample has been stained with a dye.

19. The computer-implemented method of claim 18, wherein the dye is hematoxylin and eosin (H & E).

20. The computer-implemented method of any of claims 11 to 19, wherein the digital image is a Whole Slice Image (WSI).

21. The computer-implemented method of any one of claims 11 to 20, wherein the PDA sample is primary pancreatic ductal adenocarcinoma or a portion thereof.

22. The computer-implemented method of any one of claims 11-20, wherein the PDA sample is metastatic pancreatic ductal adenocarcinoma or a portion thereof.

23. The computer-implemented method of claim 22, wherein the metastatic pancreatic ductal adenocarcinoma or portion thereof is obtained from the liver of a subject.

24. The computer-implemented method of claim 11, wherein the preprocessing step further comprises removing background segments from the image.

25. The computer-implemented method of claim 24, wherein removing background segments from the image is performed using a convolutional neural network.

26. The computer-implemented method of claim 11, wherein the selection of the subset of tiles is performed using a tumor model trained to distinguish tiles that include tumor regions from tiles that include normal regions.

27. The computer-implemented method of claim 26, wherein the tumor model comprises a multi-layer perceptron.

28. The computer-implemented method of claim 11, wherein feature extraction is performed using momentum contrast or momentum contrast v 2.

29. The method of claim 11, wherein determining a PDA subtype for each of the tumor tissue segments comprises:

analyzing a subset of features extracted from the subset of tiles using the machine learning model to generate a subtype score corresponding to each tile in the subset of tiles.

30. The method of claim 29 wherein determining PDA subtypes includes calculating PurIST scores for slice levels based on an analysis of a subset of features extracted from a subset of tiles.

31. The computer-implemented method of claim 11, wherein the machine learning model has been trained using a plurality of training images including digital images of histological sections of PDA samples obtained from subjects with known PDA subtypes of PDA classification schemes.

32. The computer-implemented method of claim 31 wherein the training images each include a global label indicating a known PDA subtype.

33. The computer-implemented method of any of claims 29 to 32, wherein the machine learning model is a deep multi-instance learning model.

34. The computer-implemented method of claim 29, wherein the PDA classification scheme is molecular component profiling and the global label is a value for each of the classical, basal, interstitial active, and interstitial inactive molecular components.

35. The computer-implemented method of any of claims 29 to 31 or claim 33, wherein the machine learning model is a Weldon model.

36. The computer-implemented method of any one of claims 30 to 32, wherein known PDA subtypes are identified using a gene expression profile of a PDA sample.

37. The computer-implemented method of claim 35, wherein the gene expression profile comprises RNAseq data or NanoString data.

38. The computer-implemented method of claim 12, further comprising aggregating PDA sub-component scores corresponding to each tile in the subset of tiles to generate a slice-level sub-component score, wherein the slice-level sub-component score indicates the sub-component with the highest prediction score.

39. The computer-implemented method of claim 12, further comprising superimposing the digital image with information representative of PDA sub-component scores corresponding to each tile in the subset of tiles to generate a digital image labeled with the information representative of PDA sub-component scores.

40. The computer-implemented method of claim 37, wherein the information representative of PDA molecular component scores for each tile includes a label indicating the molecular component having the highest predicted score for the one or more tumor tissue fragments contained in the tile.

41. The computer-implemented method of claim 11, further comprising:

selecting one of the plurality of PDA classification schemes.

42. A digital image of a histological section of a PDA sample, wherein tumor tissue segments within the image include tags associating the one or more tumor tissue segments with one or more PDA subtypes of a plurality of PDA subtypes, wherein the digital image is generated according to the computer-implemented method of claim 39.

43. The computer-implemented method of claim 12, further comprising:

analyzing all PDA molecular component scores corresponding to an individual tumor of a patient;

determining the proportion of sections of the individual tumor corresponding to the different PDA molecular component scores; and

generating a tumor-level PDA molecular component score based on the proportion of sections of the individual tumor corresponding to the different PDA molecular component scores.

44. A machine-readable medium having executable instructions to cause one or more processing units to perform a method for processing digital images of pancreatic ductal adenocarcinoma (PAC) samples, the method comprising receiving digital images of PAC samples obtained from a subject, applying a machine learning model to the digital images, and determining PAC classifications for the images using the machine learning model, wherein the machine learning model has been trained by processing a plurality of training images.

45. A machine-readable medium as defined in claim 44, wherein the digital image is a Whole Slice Image (WSI).

46. The machine-readable medium of claim 44 or claim 45, wherein the method further comprises one or more image preprocessing steps.

47. The machine-readable medium of claim 46, wherein the image preprocessing step comprises one or more of:

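a. removing a background segment from the image;

b. tiling the digital image into a set of tiles;

c. performing feature extraction on the set of tiles, or a subset thereof, to extract a set of features from the set of tiles.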

48. The machine-readable medium of claim 47, wherein the image preprocessing step comprises (a), (b), and (c).

49. The machine readable medium of any of claims 44 to 48 wherein PAC classification is at a slice level.

50. The machine readable medium of claim 47 or claim 48 wherein the PAC classification is at the tile level.

51. The machine readable medium of claim 50, wherein PAC classification classifies a tile as containing a neoplastic region or not containing a neoplastic region, wherein the neoplastic region can include tumor cells and/or tumor associated stromal cells.

52. The machine-readable medium of claim 49 wherein the PAC classification includes a continuous score that represents the likelihood that the PAC sample represented in the image belongs to one of two PDA subtypes.

53. The machine-readable medium of claim 50 wherein the PAC classification includes one or more continuous scores that represent the prevalence of one or more PDA subtype features in the PAC sample represented in the tile.

54. A machine-readable medium having executable instructions to cause one or more processing units to perform a method for determining a Pancreatic Ductal Adenocarcinoma (PDA) subtype corresponding to a PDA classification scheme for a subject with a pancreatic ductal adenocarcinoma PDA, the method comprising:

Tiling a digital image into a set of tiles, and

Selecting a subset of tiles representing one or more tumor tissue fragments, wherein the subset of tiles comprises a subset of features and the one or more tumor tissue fragments can comprise epithelial tumor cells and a interstitial region; and

Determining PDA subtypes of the digital image from at least a subset of the features using a machine learning model, wherein the machine learning model is trained for a PDA classification scheme and each of the PDA subtypes for the one or more tumor tissue segments is a PDA subtype of the PDA classification scheme.

55. The machine-readable medium of claim 54, the method further comprising: calculating one or more PDA sub-component scores for each tile in the subset of tiles using the machine learning model.

56. The machine-readable medium of claim 54 wherein the machine learning model is further trained to calculate a score for each PDA subtype of the classification scheme.

57. The machine-readable medium of claim 54, wherein the PDA classification scheme is PurIST, and wherein the PDA subtypes include classical and basal-like.

58. The machine-readable medium of claim 54, wherein the PDA classification scheme is molecular component profiling, and wherein the PDA molecular components include classical, basal, interstitial active, and interstitial inactive.

59. The machine-readable medium of claim 54, wherein the PDA classification scheme is one of a plurality of PDA classification schemes.

60. The machine-readable medium of claim 54, wherein each PDA classification scheme of the plurality of PDA classification schemes comprises a plurality of possible PDA subtypes.

61. The machine-readable medium of claim 54, wherein the histological sections of the PDA samples have been stained with a dye.

62. The machine-readable medium of claim 61 wherein the dye is hematoxylin and eosin (H & E).

63. A machine readable medium as in any of claims 54-62, wherein the digital image is a Whole Slice Image (WSI).

64. The machine readable medium of any of claims 54 to 63, wherein the PDA sample is a primary pancreatic ductal adenocarcinoma or a portion thereof.

65. The machine readable medium of any of claims 54 to 63, wherein the PDA sample is metastatic pancreatic ductal adenocarcinoma or a portion thereof.

66. The machine-readable medium of claim 65, wherein the metastatic pancreatic ductal adenocarcinoma or portion thereof is obtained from the liver of a subject.

67. The machine-readable medium of claim 54, wherein the preprocessing step further comprises removing background segments from the image.

68. The machine-readable medium of claim 67, wherein removing background segments from an image is performed using a convolutional neural network.

69. The machine readable medium of claim 54 wherein selecting the subset of tiles is performed using a tumor model trained to distinguish tiles that include tumor regions from tiles that include normal regions.

70. The machine-readable medium of claim 69, wherein the tumor model comprises a multi-layer perceptron.

71. The machine-readable medium of claim 54, wherein feature extraction is performed using momentum contrast or momentum contrast v 2.

72. The machine-readable medium of claim 54, wherein determining a PDA subtype for each of the tumor tissue segments comprises:

73. The machine-readable medium of claim 72 wherein determining the PDA subtype includes calculating a PurIST score for a slice level based on an analysis of a subset of features extracted from a subset of tiles.

74. The machine-readable medium of claim 54 wherein the machine learning model has been trained using a plurality of training images including digital images of histological sections of PDA samples obtained from subjects with known PDA subtypes of PDA classification schemes.

75. The machine-readable medium of claim 74, wherein the training images each include a global label indicating a known PDA subtype.

76. The machine readable medium of any of claims 72 to 75, wherein the machine learning model is a deep multiple instance learning model.

77. The machine-readable medium of claim 72, wherein the PDA classification scheme is molecular component profiling and the global label is a value for each of the classical, basal, interstitial active, and interstitial inactive molecular components.

78. The machine-readable medium of any of claims 72 to 74 or 76, wherein the machine learning model is a Weldon model.

79. The machine readable medium of any one of claim 73 to claim 75, wherein known PDA subtypes are identified using a gene expression profile of a PDA sample.

80. The machine-readable medium of claim 78, wherein the gene expression profile comprises RNAseq data or NanoString data.

81. The machine-readable medium of claim 55, the method further comprising: aggregating the PDA sub-component scores corresponding to each tile in the subset of tiles to generate a slice-level sub-component score, wherein the slice-level sub-component score indicates the sub-component with the highest prediction score.

82. The machine-readable medium of claim 55, the method further comprising: superimposing on the digital image information representative of the PDA sub-component scores corresponding to each tile in the subset of tiles to generate a digital image labeled with the information representative of the PDA sub-component scores.

83. The machine-readable medium of claim 80, wherein the information representative of PDA molecular component scores for each tile includes a label indicating the molecular component having the highest predicted score for the one or more tumor tissue fragments contained in the tile.

84. The machine-readable medium of claim 54, further comprising:

selecting one of the plurality of PDA classification schemes.