WO2023003993A1 - Label-free classification of cells by image analysis and machine learning - Google Patents

Label-free classification of cells by image analysis and machine learning

Info

Publication number
WO2023003993A1
WO2023003993A1 (PCT/US2022/037790)
Authority
WO
WIPO (PCT)
Prior art keywords
cell
model
images
type
cells
Prior art date
Application number
PCT/US2022/037790
Other languages
French (fr)
Inventor
Jian Huang
Yaling LIU
Original Assignee
Coriell Institute For Medical Research
Lehigh University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Coriell Institute For Medical Research, Lehigh University filed Critical Coriell Institute For Medical Research
Publication of WO2023003993A1 publication Critical patent/WO2023003993A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N 33/00 Investigating or analysing materials by specific methods not covered by groups G01N 1/00 - G01N 31/00
    • G01N 33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N 33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N 33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N 33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N 33/57484 Immunoassay; Biospecific binding assay; Materials therefor for cancer involving compounds serving as markers for tumor, cancer, neoplasia, e.g. cellular determinants, receptors, heat shock/stress proteins, A-protein, oligosaccharides, metabolites
    • G01N 33/57492 Immunoassay; Biospecific binding assay; Materials therefor for cancer involving compounds serving as markers for tumor, cancer, neoplasia, e.g. cellular determinants, receptors, heat shock/stress proteins, A-protein, oligosaccharides, metabolites involving compounds localized on the membrane of tumor or cancer cells
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/136 Segmentation; Edge detection involving thresholding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/143 Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/155 Segmentation; Edge detection involving morphological operators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10056 Microscopic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10064 Fluorescence image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20036 Morphological image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20112 Image segmentation details
    • G06T 2207/20132 Image cropping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20112 Image segmentation details
    • G06T 2207/20152 Watershed segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30024 Cell structures in vitro; Tissue sections in vitro
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30096 Tumor; Lesion

Definitions

  • HSCs Hematopoietic stem cells
  • MPPs multipotent progenitors
  • Techniques may include receiving a plurality of first images.
  • One or more, or each, of the plurality of first images may depict first cells of a first type or a second type.
  • Techniques may include, for each of the plurality of first images, receiving an indicator identifying whether the first image depicts a first cell of the first type or the second type.
  • Techniques may include inputting, into a deep-learning (DL) model, the plurality of first images and the indicator for each of the plurality of first images.
  • Techniques may include processing, by the DL model, the plurality of first images and the indicator for each of the plurality of first images.
  • DL deep-learning
  • Techniques may include inputting, into the DL model, a second image comprising a second cell of the first type or the second type. Techniques may include determining, via the DL model, whether the second cell is of the first type or the second type, based at least in part, on the processing of the plurality of first images and the indicator for each of the plurality of first images.
  • the DL model may comprise, at least in part, one or more layers.
  • Techniques may comprise determining, by the DL model from the indicator, which of each of the plurality of first images depicts the first cell as that of the first type.
  • Techniques may comprise determining, by the DL model from the indicator, which of each of the plurality of first images depicts the first cell as that of the second type.
  • Techniques may comprise associating, by the DL model, one or more characteristics of each of the plurality of first images determined to depict the first cell of the first type with one or more first image identification parameters of a cell of the first type.
  • Techniques may comprise associating, by the DL model, one or more characteristics of each of the plurality of first images determined to depict the first cell of the second type with one or more second image identification parameters of a cell of the second type.
  • Techniques may comprise tuning, by the DL model, at least some of the one or more layers of the DL model based on the first image identification parameters and/or the second image identification parameters.
  • the one or more layers may be convolutional layers.
  • the tuning may be performed, by the DL model, at one or more learning rates that may be associated with the one or more convolutional layers.
  • FIG. 1 is an example flowchart of an overview of at least one cell distinguishing technique.
  • FIG. 2 is an illustration of at least one tested model used to distinguish cells.
  • FIG. 3 is an example flow diagram of at least one technique for distinguishing among different types of cells.
  • FIG. 4 is a block diagram of a hardware configuration of an example device that may control one or more parts of one or more cell distinguishing techniques.
  • FIG. 5 is an example illustration of a demonstration of image pre-processing on raw image data with a higher density of cells.
  • FIG. 6 illustrates an example demonstration of image data from patient blood.
  • FIG. 7 illustrates an example of the architecture of the machine learning model of
  • FIG. 8 illustrates an example of Five-Fold cross-validation during training and testing experiments.
  • FIG. 9 illustrates an example of trained model evaluation.
  • FIG. 10 illustrates an example of single cell images cropped from bright field images.
  • FIG. 11 illustrates the cropping results after applying the size filter, after applying both the size filter and the uniqueness operation, and a size distribution characterization according to the crops.
  • FIG. 12 is an example illustration of the cropped single cells after normalization actions for the training purposes.
  • FIG. 13 is an example illustration of typical data samples from a batch.
  • FIG. 14 is an example illustration of a subpopulation of stem cells.
  • FIG. 15 illustrates an example of a Deep Learning model workflow.
  • FIG. 16 is an example illustration of principal components of at least three subsets of HSCs versus non-HSC: ST-HSC, LT-HSC and MPP.
  • FIG. 17 illustrates an example of a confusion matrix for the HSC 3-classes.
  • FIG. 18 is an example illustration of a t-SNE plot of four subsets of image data.
  • FIG. 19 illustrates an example confusion matrix and learning history of the training experiments.
  • FIG. 20 illustrates an example overview of one or more experiments to distinguish one or more cells.
  • FIG. 21 illustrates example FACS sorting of murine HSCs and MPPs using LSK/SLAM markers and cell imaging.
  • FIG. 22 illustrates a summary of a DL model’s performance on the holdout validation data of LSK/SLAM sorting.
  • FIG. 23 illustrates an interpretation of the one or more DL model(s).
  • FIG. 24 illustrates an example of the one or more DL model(s)-based classification of HSCs/MPPs.
  • Detection, characterization, and classification of different types of cells is important for the diagnosis and monitoring of cells.
  • the traditional way of cell classification via fluorescent images requires a series of tedious experimental procedures and often impacts the status of cells.
  • Described herein are one or more methods for label-free detection and classification of cells, by taking advantage of data analysis of bright field microscopy images.
  • the approach uses the convolutional neural network (CNN), a powerful image classification and machine learning algorithm, to perform label-free classification of cells detected in microscopic images of cell samples containing different types of cells. It requires minimal data pre-processing and has an easy experimental setup.
  • CNN convolutional neural network
  • one or more methods described herein can achieve high accuracy in the identification of cells without the need for advanced devices or expert users, thus providing a faster and simpler way of counting, identifying, and classifying cells. Details of one or more methods are described herein, along with their application to cancer cells and/or stem cells.
  • CNNs may be targeted at images/videos as the primary data flow, because the way a CNN operates on its inputs can maintain the intrinsic relations between neighboring pixels.
  • CNN models are generally over-parameterized mathematical model(s) whose parameters are found via supervised learning.
  • a Softmax function on the last (fully-connected) layer may be applied as the target to be optimized.
  • ResNet-50 as a pre-trained backbone, followed by at least four fully-connected layers, was used.
  • the parameters of the backbone and fully-connected layers were fine-tuned/calibrated, in a way that the full learning rate was used for the fully-connected layers.
  • (e.g., only) 1% of the learning rate was used in the backbone; a sketch of this scheme follows below.
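  • The sketch below is a minimal, hypothetical PyTorch realization of this differential fine-tuning scheme; the base learning rate, head widths, and layer count are assumptions, not values taken from the source.

```python
import torch
import torchvision

# Pre-trained ResNet-50 backbone exposing 2048-d features.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()

# "At least four" fully-connected layers; the widths here are assumed.
head = torch.nn.Sequential(
    torch.nn.Linear(2048, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
model = torch.nn.Sequential(backbone, head)

base_lr = 1e-4  # assumed full learning rate
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 0.01 * base_lr},  # 1% on the backbone
    {"params": head.parameters(), "lr": base_lr},             # full rate on the FC head
])
```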
  • the cropped images that clearly contain an isolated single cell were (e.g., the only ones) selected from one or more, or all, cropped images, by applying the size thresholding and uniqueness checks, and were used to train the model (as described herein).
  • LT-HSC minority class
  • the oversampling algorithm randomly sampled training images from the minority class, perhaps until the number of examples reached the number in the majority class, for one or more, or each, run.
  • the training dataset for each run may contain equivalent numbers of data examples for one or more, or all, of the classes; a sketch of this balancing step follows below.
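  • A minimal sketch of such random oversampling is given below (the function name and seed are illustrative, not from the source):

```python
import numpy as np

def oversample_to_balance(images, labels, seed=0):
    """Randomly resample each minority class (with replacement) until every
    class matches the majority-class count, then shuffle the result."""
    rng = np.random.default_rng(seed)
    images, labels = np.asarray(images), np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        idx.extend(members)                       # keep all original examples
        if members.size < target:                 # duplicate minority examples
            idx.extend(rng.choice(members, size=target - members.size))
    idx = rng.permutation(np.asarray(idx))
    return images[idx], labels[idx]
```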
  • HSCs Hematopoietic stem cells
  • MPPs multipotent progenitors
  • FACS fluorescence-activated cell sorting
  • a deep learning approach was developed that can extract minutiae from large-scale datasets of long-term HSCs (LT-HSCs), short-term HSCs (ST-HSCs), and multipotent progenitors (MPPs) to distinguish subpopulations of hematopoietic precursors based solely on their morphology.
  • the one or more deep learning model(s) can achieve predictable identification of subsets of HSCs with at least 85% accuracy. It is anticipated that the accurate and robust deep learning-based platform described herein for hematopoietic precursors will provide a basis for the development of a (e.g., next-generation) cell sorting system.
  • BM bone marrow
  • Numerous studies have defined phenotypic and functional heterogeneity within the HSC/MPP pool and have revealed the coexistence of several HSC/MPP subpopulations with distinct proliferation, self-renewal, and differentiation potentials.
  • LT long-term
  • ST short-term
  • MPP multipotent progenitors
  • LSK CD150+CD48- cells possess the capability of long-term repopulation in recipients of BM transplants.
  • ST-HSCs and MPPs can be isolated by sorting LSK CD150-CD48- and LSK CD150-CD48+ cells, respectively.
  • HSCs can also be subdivided by their characteristic CD34 and CD135 (Flt3) expression profiles. With this staining, LSK CD34-CD135- cells are defined as LT-HSCs, while LSK CD34+CD135- cells are ST-HSCs and LSK CD34+CD135+ cells are MPPs.
  • Other than surface antigen markers, several intracellular proteins, e.g., α-catulin and ecotropic viral integration site-1 (Evi1), have been found to be expressed predominantly in murine HSCs. Thus, GFP expression driven by the α-catulin or Evi1 gene promoters has been used to identify HSCs and track their “stemness” ex vivo or in vivo.
  • the detection of both membrane-bound and intracellular HSC markers relies on antibody staining, which will affect HSC activity, and/or excited fluorescence, which will cause photodamage and loss of HSC stemness. Therefore, a label-free, laser-free method to identify HSCs would be very advantageous for research and clinical applications.
  • Deep learning has become state-of-the-art for computer vision tasks in biological and biomedical studies.
  • One or more DL algorithms build a mathematical model based on training examples with ground truth labels, extracting relevant biological microscopic characteristics from massive image data.
  • the primary algorithm(s) for DL image classification is based on the convolutional neural network (CNN).
  • CNN convolutional neural network
  • CNN is mainly composed of convolutional layers that perform a convolution with “learnable” filters. The parameters of such filters can be optimized during the learning process.
  • the output is flattened into a vector for classification, which categorizes given inputs into certain classes.
  • DL has proven to be extremely effective on image classification tasks. For example, DL has been used to categorize tumor cells, red blood cells, white blood cells, predict neural stem cell differentiation, and assist cancer diagnosis.
  • a DL-based platform has been successfully developed for the automatic detection of rare circulating tumor cells in a label-free, laser-free way. However, malignant tumor cells have distinct morphology from normal cells; whether this DL-based platform can distinguish the subpopulations of HSCs, which show only very subtle differences, is a challenging question.
  • DIC Differential Interference Contrast
  • One or more DL models were trained with single-cell training datasets, with multiple rounds of parameter optimization and augmentation of training sample sizes.
  • the efficacy of one or more DL models were evaluated by feeding it with single-cell validation datasets and/or calculating the accuracy of its cell type prediction.
  • Circulating tumor cells found in peripheral blood originate from solid tumors. They are cells shed by a primary tumor into the vasculature, circulating through the bloodstream of cancer patients and colonizing distant sites, where they may form metastatic tumors.
  • CTCs are an important biomarker for early tumor diagnosis and early evaluation of disease recurrence and metastatic spread in various types of cancer. Early detection of CTCs gives patients a higher chance of survival before severe cancer growth occurs. The CTC count is also an important prognostic factor for patients with metastatic cancer. For example, a study has shown that the number of CTCs is an independent predictor of survival in patients with breast cancer and prostate cancer, and that changes in the CTC count predict survival in patients with lung cancer.
  • Locating specific target cells such as CTCs often requires tedious procedures. During these processes, CTCs need to be distinguished from a (e.g., huge/large) amount of leukocytes via immunofluorescent labeling and fluorescent microscopy, and identifying the CTCs via the fluorescently labeled images can be achieved with high throughput.
  • Epithelial markers, such as cytokeratin (CK) and epithelial cell adhesion molecule (EpCAM), are useful for detecting CTCs in patients.
  • CK cytokeratin
  • EpCAM epithelial cell adhesion molecules
  • RCC renal cell carcinoma
  • Machine learning (e.g., including Deep Learning (DL))
  • DL Deep Learning
  • Machine learning algorithms build a mathematical or statistical model based on sample “training data” with known “ground truth” annotations, to make inference or predictions.
  • machine learning models, such as random forest, can perform classification or prediction given high-quality features
  • deep learning models, such as Convolutional Neural Networks (CNNs), can learn to extract features in an automatic fashion.
  • CNNs Convolutional Neural Networks
  • CNN has been applied to the categorization of cell lines and red blood cells. Other work has integrated feature extraction and deep learning with high-throughput quantitative imaging enabled by photonic time stretch, achieving high accuracy in label-free classification of selected white blood cells (WBCs) and cancer cells.
  • WBCs white blood cells
  • their image acquisition is based on a time-stretch quantitative phase imaging system, and the representations of results can be improved by using samples from patients’ blood.
  • Techniques described herein may include the following: isolation and labeling of the blood samples, image data collection, image processing, and training and evaluating the one or more deep learning model(s).
  • a flowchart shown in FIG. 1 demonstrates the workflow after acquiring the isolated and labeled blood samples.
  • FIG. 1 is an example flowchart of an overview of at least one cell distinguishing technique.
  • FIG. 1 illustrates a deep-learning based analysis framework for microscopy images from isolated blood samples.
  • One or more phases/elements of the process/techniques may include data preparation, image pre-processing, ML, and/or testing.
  • the images collected from bright field and fluorescent microscopy are processed and cropped into images containing a single cell, which are used as the training and testing raw data for the machine learning model with a deep CNN architecture.
  • peripheral blood samples from metastatic RCC (mRCC) patients were provided by Lehigh Valley Health Network, and healthy donor whole blood samples were provided by the University of Maryland.
  • the patient's whole blood was drawn into an 8.5 mL heparin tube and processed expeditiously. For example, 2 mL of whole blood was used for each batch of enrichment with the EasySep Direct Human CTC Enrichment Kit.
  • the human colorectal cancer cell line HCT-116 (American Type Culture Collection (ATCC), USA) and healthy donor whole blood were used in this work.
  • WBCs used for the experiments were obtained from whole blood with red blood cell (RBC) lysis.
  • RBC lysing buffer ThermoFisher
  • 30-min incubation in the dark at room temperature was used.
  • CTCs were isolated from peripheral blood samples of metastatic renal cell carcinoma patients. The isolated cells enriched from 2 mL of whole blood were triple-washed using 1x PBS (pH 7.4, Thermo).
  • the enriched cells were mixed with 5 µL anti-human Carbonic Anhydrase IX and 2 µL Calcein AM (BD Biosciences, USA), and the final volume was brought to 200 µL with PBS in a 1.5 mL sterile Eppendorf tube for staining.
  • An efficient CTC immunomagnetic separation method (EasySep Direct Human CTC Enrichment Kit, Catalog #19657) was used with negative selection.
  • a manual EasySep protocol was followed where peripheral blood was mixed directly with antibody cocktails (CD2, CD14, CD16, CD19, CD45, CD61, CD66b, and Glycophorin A) that recognize hematopoietic cells and platelets.
  • the unwanted cells were labeled with the antibodies, then labeled with magnetic beads, and separated by the EasySep magnet.
  • the target CTCs will be collected from the flow-through and be available for downstream analysis immediately.
  • Live cells could be identified by staining with Calcein AM, and CTCs isolated from renal cell carcinoma patients were stained with a human carbonic anhydrase IX (Abcam, USA) PE-conjugated antibody.
  • a live cell stained with the carbonic anhydrase IX PE-conjugated antibody would be finally identified as a CTC.
  • Optical images were obtained from fluorescent microscopy. Both immunocytochemically stained and bright field images were taken of the tumor cell line mixed with WBCs from healthy donor whole blood, and of the negative depletion of peripheral blood from renal cell carcinoma patients.
  • the raw-cell microscopy images are acquired under an Olympus IX70 microscope with a 640x480 bright field camera, at 20x and 10x scope magnification.
  • the corresponding label images shown in the top left frame of FIG. 5 are for a subset of the raw cell images with the image in the top middle frame of FIG. 5 acting as ground truth.
  • High resolution and high magnification images contain more details, but acquiring them increases the total number of images to be captured and processed. Therefore, the selection of the scope magnification can be considered a trade-off. As described herein, 20x was chosen as the magnification since it provides a reasonable image resolution for each cell (500 pixels), with an acceptable number of images to acquire per testing sample.
  • the first step of image pre-processing is applying a customized toolbox to automatically segment the cells.
  • the bright field image may be processed by first filtering with edge detection based on Otsu's method.
  • a flood-fill operation may be applied on the filtered image.
  • Morphological opening operation may locate one or more, or all, cells and/or may remove one or more, or a majority, of irrelevant spots.
  • a watershed segmentation can be achieved.
  • the bright field image can be cropped into individual cell images. The segmentation of this type of nuclear image is known in the art.
  • a raw image is processed through the Otsu's filtering edge detection (top right frame of FIG. 5), flood-fill operation (second frame down on right in FIG. 5), and morphological opening operation (third frame down on right in FIG. 5), so that a watershed segmentation can be achieved (fourth frame down on right in FIG. 5); a code sketch of this pipeline follows below.
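  • One plausible scikit-image/SciPy realization of the pipeline just described (Otsu-thresholded edge detection, flood-fill, morphological opening, watershed) is sketched below; the structuring-element size and h-maxima depth are assumptions:

```python
from scipy import ndimage as ndi
from skimage import filters, morphology, segmentation, measure

def segment_cells(bright_field):
    """bright_field: 2-D grayscale array; returns an integer label image."""
    edges = filters.sobel(bright_field)                    # edge detection
    mask = edges > filters.threshold_otsu(edges)           # Otsu thresholding
    filled = ndi.binary_fill_holes(mask)                   # flood-fill interiors
    opened = morphology.binary_opening(filled, morphology.disk(3))  # drop spots
    distance = ndi.distance_transform_edt(opened)
    markers, _ = ndi.label(morphology.h_maxima(distance, 2))
    return segmentation.watershed(-distance, markers, mask=opened)

# Cropping: iterate measure.regionprops(labels) and use each region.bbox
# to cut single-cell patches out of the bright field image.
```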
  • the single-cell images were manually selected from all cropped images and (e.g., only) those were used to train the ML model.
  • the label for a selected single cell image (WBC or CTC) is (e.g., easily) obtained from the label of the cell in the corresponding fluorescent image.
  • FIG. 5 is an example illustration of a demonstration of image pre-processing on raw image data with a higher density of cells.
  • the top left and middle frames of FIG. 5 are the fluorescent labeled image and the corresponding bright field image, respectively.
  • the bright field image is then processed in the toolbox (the dashed-line rectangle region).
  • the processing may include filtering by edge detection based on Otsu's method (top right frame of FIG. 5), a flood-fill operation on the filtered image (second frame down on right in FIG. 5), a morphological opening operation that locates one or more, or all, cells and removes one or more, or all, irrelevant spots (third frame down on right in FIG. 5), and a watershed transformation for segmentation (fourth frame down on right in FIG. 5).
  • One or more, or each, individual cell is visualized with a distinct color in this figure.
  • the appearance of segmented cells in the original bright field image is illustrated in the bottom middle frame of FIG. 5.
  • the bright field image can be cropped into individual cell images
  • FIG. 6 illustrates an example of selected patient blood sample images captured from the microscope. Some examples are shown in the top left frame of FIG. 6 (WBCs) and the top right frame of FIG. 6 (isolated CTCs). Some examples of cropped single-cell images are shown in the left middle frame of FIG. 6 and the right middle frame of FIG. 6. The width and height of the cropped images are both 30 pixels. Because a cell may sit near the edge of a cell culture well, where the intensity is low and the background cloudy, a brightness and background normalization operation has been applied to all the cropped single cells. The cropped and normalized single-cell images are used as the dataset for training and testing the one or more ML model(s). The size ranges of CTCs and WBCs in patient blood samples have been collected and are shown in the bottom frame of FIG. 6. It can be observed that both types of cells are similar in size; thus, size alone cannot be used to distinguish the two types of cells.
  • FIG. 6 illustrates an example demonstration of image data from patient blood. Selected images captured by the microscope from isolated patient blood samples are shown in the top left and right frames of FIG. 6. The processed WBCs image (top left frame FIG. 6) and CTCs image (top right frame of FIG. 6) and cropped single WBC (middle left frame of FIG. 6) and CTC (middle right frame of FIG. 6), respectively.
  • the bottom frame of FIG. 6 illustrates a summary of the size distributions of CTC and WBC cells. The average diameters of CTCs and WBCs are both approximately 11.5 µm, while the CTCs have a distinguishably wider size distribution.
  • t-SNE t-distributed stochastic neighbor embedding
  • t-SNE is a non-linear dimensionality reduction technique developed to embed high-dimensional data for visualization in a low-dimensional space. It maps one or more, or each, high-dimensional data point to a low-dimensional (typically two- or three-dimensional) point in such a way that similar data points in the high-dimensional space are mapped to nearby low dimensional points and dissimilar data points are mapped to distant points, with high probability.
  • the visualizations presented by t-SNE can vary due to the selection of different parameters of the algorithm.
  • the scikit-learn (version 0.21.2) machine learning package was used to perform the t-SNE dimensionality reduction and map high-dimensional data points to a two-dimensional space.
  • the default parameters of the package were used, except that the perplexity value was changed to 50 and the learning rate to 100 (see the sketch below).
  • the data points in the two-dimensional map fall into two clearly distinct clusters, which indicates that a deep learning network capable of nonlinear functional mapping should be able to achieve highly accurate classification on the dataset.
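  • The stated settings map directly onto scikit-learn; below is a minimal sketch (the random seed and the flattening of crops into vectors are assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

# `crops` is assumed to be an (n, H, W) array of cropped single-cell images.
X = crops.reshape(len(crops), -1).astype(np.float32)

# Package defaults except perplexity=50 and learning_rate=100, per the text
# (scikit-learn 0.21.2 was used in the original work).
embedding = TSNE(n_components=2, perplexity=50, learning_rate=100,
                 random_state=0).fit_transform(X)
# embedding[:, 0] and embedding[:, 1] can be scatter-plotted, colored by label.
```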
  • FIG. 7 illustrates the architecture of the machine learning model of ResNet-50, with input images of size 34 x 34 (resized from the cropped cell images), and binary categorical output.
  • the convolutional layers are initialized with pre-trained weights (e.g., perhaps based on one or more image recognition parameters, etc.) learned from the ImageNet datasets, a method that allows faster training and reduces the requirement for training data.
  • pre-trained weights are used for feature extraction, where the features extracted by the convolutional layers usually encode multi-scale appearance and shape information.
  • the first convolutional block directly takes the image data as input, extracts features and provides a feature map as illustrated within the dotted-line box of FIG. 7.
  • Further feature extraction is applied by taking the feature map of the previous convolutional block as input for the next block.
  • the pre-trained ResNet-50 is followed by trainable layers that contain a fully-connected layer with a ReLU activation function, a dropout layer with a dropout rate of 0.6, and a softmax activation function with a cross-entropy loss implemented to generate the predicted results.
  • the model uses a learning rate of 0.0001 and is optimized by the Adam optimizer.
  • the trainings are processed in mini-batches, with a batch size of 16; a sketch of this training setup follows below.
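  • A hedged PyTorch sketch of this setup follows; the hidden width is an assumption, while the dropout rate (0.6), learning rate (0.0001), Adam optimizer, and batch size (16) come from the text (softmax plus cross-entropy is folded into CrossEntropyLoss here):

```python
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V1")  # pre-trained backbone
model.fc = torch.nn.Sequential(
    torch.nn.Linear(2048, 256),   # fully-connected layer (width assumed)
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.6),      # dropout rate from the text
    torch.nn.Linear(256, 2),      # binary CTC/WBC output
)
criterion = torch.nn.CrossEntropyLoss()                 # softmax + cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(loader):          # loader: DataLoader with batch_size=16
    model.train()
    for x, y in loader:           # x: resized 34 x 34 cell crops, y: labels
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
```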
  • FIG. 7 illustrates an example of the architecture of the deep convolutional network, ResNet-50, for transfer learning and CTC-WBC cell classification, and a demonstration of the features extracted (within the dotted-line box of FIG. 7) by the first convolutional block.
  • the network receives the input data of cell images and predicts the probability of both classes.
  • the network consists of five stages each containing convolution and identity blocks, and each of the blocks has three convolutional layers.
  • the features of a cell image are extracted by the pre-trained convolutional layers.
  • the training on cultured cell lines is based on 1,745 single-cell images (436 cultured cells, 1309 WBCs).
  • a total number of 120 cells (31 cultured cells and 89 WBCs) are tested.
  • the combined performance has shown that all WBCs were classified correctly, while 3 out of 31 cultured cells were misclassified as WBCs.
  • the overall accuracy of this learning model is 97.5%.
  • the training on patient blood samples is based on 95 single-cell images as raw input.
  • the cell images originally came from two patients: 15 CTCs from one and 17 CTCs from the other.
  • the training data was enhanced before processing the training by applying data augmentations on the original dataset.
  • the data augmentation may increase the diversity of the original dataset.
  • the most popular way to practice data augmentation is to create a selected number of new images by performing traditional affine and elastic transformations.
  • the data augmentation provides a larger dataset, which helps improve the overall robustness of the WBC-CTC classification CNN model without additional labor for the preparation of fluorescent labels.
  • the expanded dataset includes single-cell images with different types of geometric transformations: rotation, shear transformation, horizontal and vertical reflection.
  • the augmented training dataset for one or more, or each, training experiment contains 1,000 CTCs and 1,000 WBCs; one way to realize these augmentations is sketched below.
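  • The sketch below shows one way to realize the named geometric transformations with torchvision; the parameter ranges and probabilities are assumptions:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=180),        # arbitrary rotation
    T.RandomAffine(degrees=0, shear=15),  # shear transformation
    T.RandomHorizontalFlip(p=0.5),        # horizontal reflection
    T.RandomVerticalFlip(p=0.5),          # vertical reflection
])
# Applying `augment` repeatedly to the original crops can expand each class
# to the stated 1,000 CTC and 1,000 WBC training images.
```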
  • K-fold cross-validation is applied for measuring the overall performance.
  • Cross-validation helps avoid performance differences across runs of the learning algorithm caused by the random split of training and testing data.
  • Five-fold cross-validation was used in one or more experiments. The original data is shuffled and divided into five groups, with one group becoming the testing subset and the combination of the others becoming the dataset for training and validation. The training and validation data are then augmented for the training process. The final overall performance of the model is presented as the average of the five runs with different data as the testing set. More details on how the data was split and the training, validation, and testing datasets were obtained are shown in FIG. 8.
  • FIG. 8 illustrates an example of Five-Fold cross-validation during training and testing experiments.
  • the original data of single-cell images is shuffled and divided into five non-overlapping subsamples with equal numbers of images.
  • One subsample is treated as the testing set in an experiment, and training is performed on the remainder of the dataset.
  • the experiment repeats until each of the five subsamples has been used as the testing set once.
  • the data for training purposes is augmented and split into training (80%) and validation (20%) subsets, which are then used to fit the model (see the sketch below).
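  • A minimal sketch of this protocol with scikit-learn follows; `augment_dataset`, `fit_model`, and `evaluate` are hypothetical helpers standing in for the augmentation, training, and scoring steps described above:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)     # seed is an assumption
for train_idx, test_idx in kf.split(X):                  # X, y: crops and labels
    X_tr, X_val, y_tr, y_val = train_test_split(         # 80/20 train/validation
        X[train_idx], y[train_idx], test_size=0.2, random_state=0)
    model = fit_model(augment_dataset(X_tr, y_tr), (X_val, y_val))
    scores.append(evaluate(model, X[test_idx], y[test_idx]))
print("mean accuracy over the five folds:", np.mean(scores))
```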
  • the training dataset was visualized by the t-SNE algorithm.
  • the t-SNE plot (top frame of FIG. 9) shows the distribution of the first and second dimensions of the t-SNE map after performing non-linear dimensionality reduction by the algorithm on the training dataset.
  • This t-SNE plot visualizes the high dimensional image data projected into a two-dimensional space, which helps in understanding the overall distribution of the dataset. It can be seen from the t-SNE output that samples from the two classes (CTCs and WBCs) form largely separate clusters in the two-dimensional space. It is hypothesized that the separation of the two classes holds true in the high-dimensional space as well.
  • the trained deep learning model can (e.g., reliably) extract high-dimensional features and perform classification with high accuracy.
  • the results of the deep learning model for cell image classification based on cultured cells and patient blood samples are summarized in two frames of the second row in FIG. 9, respectively.
  • examples of misclassified and well-classified CTCs and WBCs from the model are shown in the two frames of the third row in FIG. 9.
  • the misclassifications could be due to noise or errors in the manual labeling process, and/or the inherent partial overlap between the distributions of the two classes (e.g., the CTCs mixed in the cluster of WBCs, and vice versa, as shown in the t-SNE plot).
  • the averaged learning history from the five cross-validation experiments of the training and validation during epochs can be seen in the bottom two frames of FIG. 9.
  • the curves indicate that the model does not overfit the problem and that the network converges near the end of the training process.
  • the testing results on cell images of patient blood samples show that the overall accuracy from the five-fold cross-validation is 88.4%, and the F-score, traditionally defined as the weighted harmonic mean of the precision and recall of the result, is 0.89.
  • the F-score provides a measure of the overall performance of the model by considering the equal importance of precision and recall.
  • deep learning networks have shown the ability to unlock the hidden information in fluorescent images.
  • the networks could classify fluorescent images of single cells including CTCs with a very high accuracy (96%).
  • Although the bright field images of CTCs described herein may yield lower classification accuracy due to the lack of fluorescent label information, the obtained results show (e.g., good) convergence of the learning curve, and the promising accuracy achieved with (e.g., only) a limited amount of data demonstrates the potential of the proposed approach.
  • the receiver operating characteristic (ROC) curve was used to show the performance of the model at one or more, or all, classification thresholds, and the corresponding area under the curve (AUC) value to indicate the performance of prediction on one or more, or each, experiment.
  • the bottom right frame of FIG. 9 shows the total ROC curve and the calculated averaged AUC, 0.923, for the classification of patient blood CTCs and WBCs.
  • the high AUC indicates that the model has been successfully trained to distinguish CTCs from WBCs.
  • the examples of misclassified and well-classified CTCs (the two frames of the third row of FIG. 9) show that the CTC images are either correctly detected or incorrectly classified as WBCs.
  • the trained model works as a binary classifier for the single-cell images without fluorescent labels. Note that the coordinates of all the cropped single cells in the bright field image are recorded during pre-processing.
  • a label-free CTC count for this bright field image can be generated when the recorded coordinates and the corresponding predicted cell types are combined, as sketched below.
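  • A minimal sketch of that combination step is given below (the function name and data layout are assumptions):

```python
import numpy as np

def ctc_report(coords, probs, threshold=0.5):
    """coords: (n, 2) crop centers recorded during pre-processing;
    probs: (n,) predicted P(CTC) for each crop from the trained model."""
    is_ctc = np.asarray(probs) >= threshold
    return {
        "ctc_count": int(is_ctc.sum()),                 # label-free CTC count
        "ctc_coordinates": np.asarray(coords)[is_ctc],  # where the CTCs are
    }
```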
  • this method can be combined with a sorting technique such as acoustic sorting, where the upstream image machine learning results can be used to trigger pulse activation of acoustic forces that sort cells into different channels for isolation and characterization.
  • a sorting technique such as acoustic sorting
  • Such combined label free image detection and label free sorting improves cell viability compared to a labelled approach and enables potential culturing of captured cells for personalized drug screening.
  • FIG. 9 illustrates an example of trained model evaluation. The top frame of FIG. 9 illustrates the t-SNE plot of the training dataset, showing the dimensionality reduction pre-processing for the training dataset.
  • Confusion matrices for the classification results of cultured samples are shown in the left frame of the second row of FIG. 9 and, in the right frame of the second row of FIG. 9, for patient blood CTCs vs. WBCs.
  • Example misclassified and well-classified CTC images and WBC images are shown in the two frames of the third row of FIG. 9.
  • the learning history of the training and validation at each epoch is shown in the left bottom frame of FIG. 9.
  • the overall ROC-AUC result for WBC and CTC classification by cross-validation are shown in the right bottom frame of FIG. 9.
  • the ROC curve and AUC are the total/average performance of the five training experiments from the cross-validation process.
  • a diagonal dashed line from the bottom left to the top right corners represents the non-discriminatory test.
  • a deep convolutional neural network may be applied to classify cell images acquired from processed patient blood samples to detect rare CTC cells from mRCC patients.
  • a software toolbox (as described herein) for pre-processing raw images acquired from the microscope was developed to apply Otsu's thresholding, segmentation, and cropping on the images.
  • a manual cropped-cell image selection process then ignores incorrect segmentations and chooses good single-cell images for training the CNN model.
  • Ninety-five images containing single cells from patients are used as the original data, which is the source for training, validation, and testing datasets.
  • HSC Hematopoietic Stem Cell
  • Image machine learning can also be used to identify the subtype of stem cells and predict their stemness. There are a few outcomes. First, image segmentation and information registration methods can be used to automatically track and record the stemness status and progression of mother and daughter cells. Second, the stemness of the cells can be predicted from the bright field image, phase contrast image, and the properties and behavior of the cells.
  • The segmentation process for the stem cells can be achieved as follows.
  • a bright field stem cell image is filtered by edge detection based on Otsu's method, and then a flood-fill operation is applied to the filtered image.
  • morphological opening operation locates one or more, or all, cells and removes one or more, or all, irrelevant spots.
  • the watershed transformation for segmentation can then be achieved, where one or more, or each, individual cell may be visualized with a distinct color. Meanwhile, one or more, or each, cell is assigned a number to indicate its identity.
  • the stemness of the HSCs will be predicted from the bright field images, phase contrast images, and the properties and behavior of the cells with an ML approach.
  • both bright field and fluorescent images of the isolated cells will be used to train the convolutional neural network.
  • the input images will be separated into two categories: stem cell and/or non-stem cell images.
  • a certain pixel intensity unit (PIU) threshold will be applied to differentiate the two image groups; a gating sketch follows this passage.
  • the stemness of the cells will also be predicted with other features, such as properties of the cells.
  • images of HSCs may be extracted for training purposes.
  • the true identity of the two daughter cells will be determined from the fluorescent images and provided to the algorithm as the answer sheet.
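  • A minimal sketch of the PIU gating mentioned above follows; the threshold value and the use of the mean crop intensity are assumptions:

```python
import numpy as np

def split_by_piu(fluorescent_crops, piu_threshold=50.0):
    """Split fluorescent cell crops into stem-cell and non-stem-cell
    training categories by comparing mean intensity to a preset PIU gate."""
    stem, non_stem = [], []
    for crop in fluorescent_crops:
        (stem if np.mean(crop) >= piu_threshold else non_stem).append(crop)
    return stem, non_stem
```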
  • differentiating/distinguishing at least four different mouse MPP (multipotent progenitor) populations, MPP1, MPP2, MPP3, and/or MPP4, is contemplated to be achievable using one or more of the techniques described herein.
  • differentiating/distinguishing cell types such as T cells, B cells, and/or NK cells is likewise contemplated to be achievable using one or more of the techniques described herein.
  • EVI1 Ecotropic viral integration site 1
  • Ecotropic viral integration site 1 (Evi1) is a member of the SET/PR domain protein family.
  • the Evi1 locus was initially discovered as a common target of retroviral integration in murine myeloid leukemia.
  • Conditional deletion of Evi1 in adult mice leads to a profound loss of HSC self-renewal activity, but does not affect HSC specification into the blood cell lineage.
  • an Evi1-GFP reporter was first used to show that it is a specific reporter marking hematopoietic precursors ex vivo.
  • mouse bone marrow cells were (e.g., first) harvested, and GFP+ cells were sorted by FACS (fluorescence-activated cell sorting) from Evi1-GFP transgenic mouse bone marrow.
  • FACS fluorescence activated cell sorting
  • a confocal microscope was used to collect images of GFP+ and GFP- cells, which correspond to HSCs and progenitors, respectively.
  • the second system uses surface marker staining to isolate HSCs and progenitors. It is a well-established approach to stain mouse bone marrow cells with fluorescence-labeled antibodies to lineage markers, Sca1, c-kit, CD150 and CD48.
  • Hematopoietic precursors can be distinguished from committed progenitors by different combinations of surface markers.
  • LSK (Lin-Sca1+c-kit+) cells are mostly hematopoietic precursors, while Lin- (Lin-Sca1-c-kit-), LS (Lin-Sca1+c-kit-) and LK (Lin-Sca1-c-kit+) cells are mostly committed progenitors.
  • long-term (LT) HSCs, short-term (ST) HSCs and multi-potent progenitors (MPPs) were defined by different combinations of CD150 and CD48.
  • LSK CD150+CD48- cells are mostly LT-HSCs
  • LSK CD150-CD48- cells are mostly ST-HSCs
  • LSK CD150-CD48+ cells are mostly progenitors.
  • At least one task is the characterization of the stemness level of each cell in the bright field.
  • An objective is to use state-of-the-art deep learning techniques to analyze the raw HSC image data for the discovery of distinctions among HSC populations.
  • Tasks can be divided into two levels. The first task focuses on the distinction between the lineage negative populations, where there are four subsets differing in the intensity of the cell markers Sca1 and c-kit: Lin- (Lin-Sca1-c-kit-), LS (Lin-Sca1+c-kit-), LK (Lin-Sca1-c-kit+), and LSK (Lin-Sca1+c-kit+).
  • the second task concerns the differences between HSC and non-HSC within the LSK population: three subsets (Long-Term (LT-HSC), Short-Term (ST-HSC), and Multi-Potent Progenitor (MPP)) that are characterized equivalently by two methods: surface markers and GFP.
  • LT-HSC Long-Term
  • ST-HSC Short-Term
  • MPP Multi-Potent Progenitor
  • One or more techniques described herein can tell the differences within the HSC population, which may contain at least three subsets: LT-HSC, ST-HSC and MPP cells. The distinction between the lineage negative populations, where there are four subsets differing in the intensity of the cell markers Sca1 and c-kit: Lin- (Lin-Sca1-c-kit-), LS (Lin-Sca1+c-kit-), LK (Lin-Sca1-c-kit+), and LSK (Lin-Sca1+c-kit+), can also be identified.
  • a toolbox is used to crop and segment individual cell images from microscopic images.
  • the cell population may be sparse in the view, where accurate detection may be more important, and/or the cell population may be dense, where wrong crops are to be avoided to reduce post-selection.
  • a computer script was established to perform (e.g., at least) two preprocessing steps. For example, cropping the single cell from DIC views may be performed.
  • characterizing the cell targets and/or removing the outliers (debris, merged cells, etc.) may be performed after applying a size filter, and/or a uniqueness check may be performed by analyzing the diameters and the coordinates of the detected cells.
  • the size filter is applied by a preset cell size threshold according to the pixel unit of the DIC view.
  • the uniqueness check may evaluate the pairwise distances between cells and may decide/determine which (e.g., several) cells may have been detected in duplicate.
  • the algorithm may eliminate these outliers such that the image crops might not contain images from the same cell, for example (see the sketch below).
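  • The sketch below illustrates one way to implement the size filter and pairwise-distance uniqueness check; the numeric limits (in pixels of the DIC view) are assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def filter_detections(centers, diameters, min_d=8, max_d=40, min_sep=10):
    """centers: (n, 2) detected cell coordinates; diameters: (n,) in pixels.
    Returns a boolean mask of detections that pass both checks."""
    centers, diameters = np.asarray(centers), np.asarray(diameters)
    keep = (diameters >= min_d) & (diameters <= max_d)   # size filter
    dist = squareform(pdist(centers))                    # pairwise distances
    np.fill_diagonal(dist, np.inf)
    for i in range(len(centers)):                        # uniqueness check:
        if keep[i]:                                      # for any close pair,
            dup = np.flatnonzero((dist[i] < min_sep) & keep)
            keep[dup[dup > i]] = False                   # keep only the first
    return keep
```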
  • Cropping the single cell from Bright field images is illustrated in FIG. 10.
  • characterizing the cell targets and/or removing the outliers may be performed after applying the size and uniqueness filters and analyzing the diameters, as illustrated in FIG. 11.
  • FIG. 10 illustrates an example of single cell images cropped from bright field images.
  • FIG. 11 illustrates the cropping results after applying the size filter (the top left frame of FIG. 11), after applying both the size filter and the uniqueness operation (the top right frame of FIG. 11), and a size distribution characterization according to the crops (the bottom frame of FIG. 11).
  • FIG. 12 illustrates the cropped single cells after normalization actions for the training purposes, among other scenarios.
  • FIG. 13 illustrates typical data samples from batch 0507 (LSK / Lin-, 176 cells), and 0517 (GFP, 522 cells).
  • the cropped single cells may be stored in the corresponding category.
  • FIG. 14 is an example illustration of a subpopulation of stem cells.
  • the cell images are cropped into 72 x 72 and 48 x 48 or other sizes from the original dataset.
  • the image dimensions are normalized to the same size, and the background brightness and contrast are normalized, before the training experiments. At least 20% of the total data are randomly sampled as the testing subset, to which the training procedure might never be exposed. The remaining 80% of each class are used as the training materials.
  • At least one Deep Learning classifier may be trained to predict the type of the single cell data, using the ResNet50 model as the pre-trained convolutional layers.
  • a pretrained network that has been trained on a large public database, ImageNet, is used for transfer learning.
  • the convolutional layers in the framework are then fine-tuned/calibrated, with the following fully-connected layers trained from scratch.
  • the whole framework becomes the architecture of the model.
  • the model then fits the training dataset to learn the distinctions between target cell images.
  • the trained model is used to predict the class of new image data and the overall performances are evaluated for validation.
  • A convolutional neural network can be used to train on and predict the image data. Pre-trained networks can be used and/or transfer learning can be applied to the image files.
  • a multi-modal image classifier can be built up by learning cell images at different locations. The information collected during the experiment can have different types of data representation: image information (multi-modal image matrices) and/or physical properties (numeric information).
  • a mixed model, such as a CNN model for handling static cell images and/or an MLP model to encode physical properties in the form of numeric data, might be built.
  • FIG. 15 illustrates an example of a Deep Learning model workflow for one or more techniques described herein.
  • the top of FIG. 15 illustrates one or more phases/elements of a convolutional neural network.
  • the bottom of FIG. 15 illustrates a generalization of one or more procedures of one or more DL techniques.
  • Table 1 illustrates an example data set. TABLE 1: Example Data Set
  • the differentiation between the most-populated HSC subset, LT-HSC, and the non-HSC cells, in this case for example MPPs, may be the subject of focus.
  • FIG. 16 illustrates the one or more (e.g., principal) components of three subsets of HSCs versus non-HSC: ST-HSC, LT-HSC and MPP.
  • FIG. 17 illustrates an example confusion matrix for the HSC 3-classes.
  • Table 2 illustrates example performances of the model(s) on the three-class model for HSC/non-HSC.
  • One or more techniques may include/use lineage negative population.
  • the variation among the four subsets might not be (e.g., substantially) clear-cut, but may include one or more preset boundaries.
  • LSK cells are the most favorable and least populated ones due to their significant stemness.
  • the cells contained in the LIN- subset are the most populated and heterogeneous ones.
  • Expectations may include the LIN- cells having the most dispersed feature clustering, while the LSK features may be more focused.
  • PCA and t-SNE analyses were conducted on the given dataset.
  • FIG. 18 is an example illustration of a t-SNE plot of the four subsets of image data.
  • FIG. 19 illustrates an example confusion matrix and learning history of the training experiments.
  • HSC/MPP subpopulations were (e.g., first) sorted from murine BM by fluorescence-activated cell sorting (FACS).
  • FACS fluorescence-activated cell sorting
  • a well-established surface marker combination including LSK and SLAM markers was used to sort out three fractions: LT-HSCs (LSK CD150+CD48-), ST-HSCs (LSK CD150-CD48-) and MPPs (LSK CD150-CD48+) (illustrated in the top frames of FIG. 21).
  • FIG. 20 illustrates an example overview of one or more experiments.
  • the frames of the top three rows of FIG. 20 illustrate an overall workflow from sample preparation to (e.g., final) results, including sample collection, image acquisition, training procedures, and model testing.
  • the frames of the bottom row of FIG. 20 illustrate a DL framework and the data flow during training experiments.
  • the DL network(s) contain two parts: convolutional layers and fully-connected (FC) layers.
  • the convolutional layers were fine-tuned/calibrated with a low learning rate and the following fully-connected layers were trained with a regular learning rate.
  • the hyper-parameters from the training experiments were tuned/calibrated with cross-validation. With the proper hyper-parameters found, the whole training dataset was used for training.
  • the model was tested on image data from different markers.
  • FIG. 21 illustrates FACS sorting of murine HSCs and MPPs using LSK/SLAM markers and cell imaging.
  • the top four frames of FIG. 21 illustrate representative FACS density dot plots showing the gating strategy employed to identify and isolate LT-HSCs, ST-HSCs, and MPPs from BM.
  • the bottom four frames of FIG. 21 illustrate cell size distributions of sorted HSCs and MPPs on FACS plots.
  • Model training began with 4,050, 7,868, and 9,676 cell inputs from LT-HSCs, ST- HSCs, and MPPs, respectively.
  • data augmentation was applied to enhance data diversity and avoid overfitting.
  • oversampling was practiced to balance the significance of the minority subset, in this case for example, LT-HSCs.
  • One or more models were trained based on CNN (illustrated in the four frames of the bottom of FIG. 20). To obviate the need for big datasets, transfer learning was practiced, reusing filters learned on the large, annotated ImageNet dataset, which can be used for a new purpose where data may be limited.
  • the convolutional layers in the framework were fine-tuned/calibrated, with the following fully-connected layers trained from scratch.
  • One or more models were designed as a multi-class classifier that fits the images in the training dataset to learn the distinctions between classes labeled as LT-HSC, ST-HSC, and MPP.
  • the trained model was applied to predict the type of new cell image data and the overall performance was evaluated (illustrated in the four frames of the bottom of FIG. 20).
  • One or more DL models predict HSCs and MPPs with high accuracy.
  • the model(s) were fed with unseen single-cell inputs from the validation datasets of LSK/SLAM sorting. For example, five-fold cross-validation was leveraged to find the best hyper-parameters for the model's training. In each run of the cross-validation, one-fifth of the dataset was withheld for validation and the rest was used for training. The model was trained iteratively to tune/calibrate the hyper-parameters over five runs, such that one or more, or each, example was used as validation at least once. The performance of the model was evaluated by the mean overall accuracy across the five runs with different data as the validation subset. The final model(s) were trained with the best hyper-parameters and validated on a 20% holdout. A minimal sketch of this five-fold procedure appears below.
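  • As a brief, non-authoritative sketch of the five-fold procedure above (the helpers build_and_train and evaluate are hypothetical stand-ins for the actual CNN training and scoring routines, and the candidate hyper-parameter values are illustrative assumptions):

    # Five-fold cross-validation to pick hyper-parameters (Python sketch).
    import numpy as np
    from sklearn.model_selection import KFold

    def cross_validate(images, labels, hyper_params, n_splits=5):
        """Return the mean validation accuracy over the folds."""
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        accuracies = []
        for train_idx, val_idx in kf.split(images):
            model = build_and_train(images[train_idx], labels[train_idx],
                                    hyper_params)         # hypothetical helper
            accuracies.append(evaluate(model, images[val_idx],
                                       labels[val_idx]))  # hypothetical helper
        return float(np.mean(accuracies))

    # Illustrative candidate settings; the best mean accuracy wins.
    candidates = [{"lr": 5e-4, "weight_decay": 0.05},
                  {"lr": 1e-4, "weight_decay": 0.01}]
    # best = max(candidates, key=lambda hp: cross_validate(X, y, hp))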
  • the confusion matrix (illustrated in the top left frame of FIG. 22) concludes with 74% overall accuracy under default thresholding, which gives the prediction class according to the highest score.
  • the top right frame of FIG. 22 illustrates the performance metrics including recall, precision, and F1-score for the three different classes.
  • a ROC-AUC curve (illustrated in the bottom left frame of FIG. 22) indicates that the model successfully learned cell features with various threshold values.
  • the decision threshold of the output probabilities was dynamically adjusted to examine the change in the prediction. With different decision thresholds applied, the total number of data points and the corresponding well-classified counts (illustrated in the bottom middle frame of FIG. 22) are shown as solid curves, and the overall accuracy, calculated accordingly, is reported as the dashed line.
  • the model may perform better with increased threshold values. When the probability gate was set to 0.6, the overall accuracy rose to 85%. A high decision gating would reduce the false positive rate, but it also reduced the number of HSCs being classified. With a threshold value of 0.8, the cell count drops to 1000, but the prediction accuracy jumps to 96%. This strategy is useful in applications where a small number of HSCs need to be selected for the transplant, as sketched below.
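  • As a minimal sketch of this gating strategy (assuming softmax outputs held in a NumPy array; all names are illustrative):

    # Probability gating: keep only cells whose top softmax score clears
    # the threshold, trading total cell count for higher accuracy.
    import numpy as np

    def gated_accuracy(probs, labels, threshold=0.6):
        """probs: (N, 3) softmax outputs; labels: (N,) ground-truth ints."""
        top_score = probs.max(axis=1)
        keep = top_score >= threshold        # confident predictions only
        if not keep.any():
            return 0.0, 0
        preds = probs[keep].argmax(axis=1)
        accuracy = float((preds == labels[keep]).mean())
        return accuracy, int(keep.sum())     # accuracy, remaining cell count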
  • the model performances were investigated at various sizes of training datasets by dynamically changing sample sizes and registering the corresponding model’s overall accuracy (illustrated in the bottom right frame of FIG. 22).
  • 80% of the total dataset was randomly sampled to serve as the full-scale training set and the remainder as the validation set.
  • the model was then trained incrementally with a fraction of the training data until the entire training set was used (10 sample fractions, from 0.1 to 1.0).
  • five (5) iterations were practiced, and the change in the model's mean performance across these iterations was plotted (illustrated in the bottom right frame of FIG. 22). Results showed that the model performance is positively correlated with training sample size. However, the correlation was not linear.
  • with somewhat more than half of the training data, the model's predictive performance was nearly as good as that after training with the entire training set. A sketch of this sample-size experiment follows.
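  • A sketch of the sample-size experiment, again assuming hypothetical build_and_train/evaluate helpers as in the cross-validation sketch above:

    # Train on growing fractions of the training set (0.1 to 1.0) and
    # record the mean and spread of accuracy over repeated iterations.
    import numpy as np

    rng = np.random.default_rng(0)

    def learning_curve(X_train, y_train, X_val, y_val, n_repeats=5):
        curve = []
        for frac in np.linspace(0.1, 1.0, 10):
            accs = []
            for _ in range(n_repeats):
                n = int(frac * len(X_train))
                idx = rng.choice(len(X_train), size=n, replace=False)
                model = build_and_train(X_train[idx], y_train[idx])
                accs.append(evaluate(model, X_val, y_val))
            curve.append((frac, float(np.mean(accs)), float(np.std(accs))))
        return curve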
  • FIG. 22 illustrates a summary of model’s performance on the holdout validation data of LSK/SLAM sorting.
  • the top left frame illustrates a confusion matrix - the prediction by the model on the validation hold-out for different classes.
  • the top right frame illustrates recall, precision, and F1-score of the trained model.
  • the bottom left frame illustrates a ROC-AUC curve. This curve reports the relationship between the true-positive rate and false-positive rate for the three classes and their micro and macro averages.
  • the bottom middle frame illustrates accuracy and the remaining data point count under different thresholds. As the threshold gating increased from the baseline (0.33), the number of cells retained and those correctly classified are shown as the "total" and "correct" curves. The dashed curve shows how the overall accuracy changes.
  • the bottom right frame illustrates how the DL model’s performance is positively correlated with the training sample size.
  • as the training sample size changes, the model's performance changes accordingly.
  • the mean and the corresponding 95% confidence interval of the best overall accuracy from the training experiments are shown.
  • the trained CNN operated on the input images, during which high-dimensional information (64 × 64) was reduced into two-dimensional space.
  • cell-type specific clusters were formed before the fully-connected layers were ready to perform prediction (illustrated in the top two frames of FIG. 23); a sketch of this feature-extraction step is shown below.
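  • As a sketch of how such penultimate-layer features could be collected and projected (model.fc_head is a hypothetical attribute name for the fully-connected head; see the architecture sketch later in this document):

    # Hook the layer before the final classifier, gather its activations,
    # and reduce them to two dimensions with PCA for cluster plots.
    import torch
    from sklearn.decomposition import PCA

    def penultimate_features(model, images):
        feats = []
        def hook(module, inputs, output):
            feats.append(output.flatten(1).detach().cpu())
        handle = model.fc_head[-2].register_forward_hook(hook)
        with torch.no_grad():
            model(images)
        handle.remove()
        return torch.cat(feats).numpy()

    # features = penultimate_features(model, batch)         # (N, D)
    # coords = PCA(n_components=2).fit_transform(features)  # (N, 2)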
  • a class activation map (CAM) was constructed from convolutional layers (illustrated in the bottom three frames of FIG. 23). CAMs are commonly used to explain how a DL model learns to classify an input image into a particular class.
  • a Score-CAM was applied.
  • a visual explanation method may utilize perturbation over an input image and may record the corresponding responses in the model(s)’ output.
  • the Score-CAM produced a heatmap of the class-discriminative features for the cell images, in which high intensity represented regions attracting strong attention from the one or more DL model(s) (illustrated in the bottom three frames of FIG. 23). Obtained results indicated that cellular morphological features necessary for classification had been extracted from the original images.
  • FIG. 23 illustrates an interpretation of the one or more DL model(s).
  • the top left and right frames illustrate data clustering generated by PCA before and after data training.
  • the bottom frames illustrate a visual explanation of the one or more DL model(s) with Score-CAM.
  • Cells from the three classes were randomly selected, and their attention heatmaps in the one or more DL model(s) were produced by Score-CAM, a visual explanation method that utilizes perturbation over an input image and records the corresponding responses in the model's output. Given an input image, perturbation occurring in the regions useful (e.g., essential) to the model's reception may lead to a significant change in the model's prediction, which may translate to a strong activation intensity.
  • the Score-CAM may produce the class-discriminative visualization for the cell images from different classes.
  • some regions may receive higher attention (e.g., the center/darker regions) by the model(s), while other regions (e.g., the peripheral/lighter regions) may receive less attention by the model(s); a simplified sketch of the Score-CAM procedure follows.
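  • A simplified, non-authoritative sketch of a Score-CAM-style computation (conv_layer would be the model's last convolutional module; details of the published Score-CAM method are reduced here):

    # Each activation map of the last conv layer is normalized, used to
    # mask the input, and weighted by the masked image's class score.
    import torch
    import torch.nn.functional as F

    def score_cam(model, image, target_class, conv_layer):
        """image: (1, C, H, W) tensor; returns an (H, W) attention map."""
        acts = []
        handle = conv_layer.register_forward_hook(
            lambda m, i, o: acts.append(o.detach()))
        with torch.no_grad():
            model(image)
        handle.remove()
        maps = F.interpolate(acts[0], size=image.shape[-2:],
                             mode="bilinear", align_corners=False)
        cam = torch.zeros(image.shape[-2:])
        for k in range(maps.shape[1]):
            m = maps[0, k]
            if m.max() == m.min():
                continue                      # skip uninformative maps
            m = (m - m.min()) / (m.max() - m.min())
            with torch.no_grad():
                score = F.softmax(model(image * m), dim=1)[0, target_class]
            cam += score * m
        return torch.relu(cam)                # high values = strong attention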
  • when training the model, the model's robustness was analyzed with different training datasets used. Score-CAM and feature distribution plots were used to interpret how the model understood/processed the images. The model's output probability was used to reconstruct the given DIC view, in which the cells were assigned the scores given/output by the model to see how they matched with the fluorescence intensities, among other reasons.
  • Model training may include a Decision Tree rule, a Random Forest rule, a Support Vector Machine rule, a Naive Bayes Classification rule, and/or a Logistic Regression rule, as sketched below.
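  • As an illustrative sketch only (not the trained CNN described herein), such rules could be fit to flattened single-cell crops or pre-extracted features using scikit-learn:

    # Classical ML rules on a feature matrix X (N, D) with labels y (N,).
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    rules = {
        "decision_tree": DecisionTreeClassifier(),
        "random_forest": RandomForestClassifier(n_estimators=200),
        "support_vector_machine": SVC(probability=True),
        "naive_bayes": GaussianNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }

    # for name, clf in rules.items():
    #     clf.fit(X_train, y_train)
    #     print(name, clf.score(X_val, y_val))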
  • One or more DL model(s) may distinguish LT-HSCs, ST-HSCs, and MPPs that were sorted by a different set of surface markers.
  • LSK/CD34/CD135 is another well-established set of surface markers to identify murine HSCs.
  • images were acquired of HSCs and MPPs that were sorted with the new markers (illustrated in the top left frame of FIG. 24). Images of these cells might not (e.g., might never) have been used in DL model training. Cell images were processed to generate single-cell validation datasets as previously described.
  • One or more DL models used herein maintained a high accuracy when predicting these new data.
  • the overall accuracy of prediction is 73%, and the precision and recall for the LT-HSC group are 0.86 and 0.73, respectively.
  • the one or more DL model(s) may distinguish LT-HSCs, ST-HSCs, and MPPs that were sorted by intracellular reporters.
  • HSCs sorted from the BM of two transgenic mouse models were used, both of which have an HSC-specific intracellular GFP reporter.
  • the first mouse model is the a-catulinGFP mouse. a-catulin is a protein with homology to a-catenin and has been found to be expressed predominantly in mouse HSCs.
  • In the a-catulinGFP mouse, the GFP coding sequence is knocked into the a-catulin gene locus; therefore, GFP expression is under the control of the a-catulin gene promoter.
  • a-catulin-GFP+ c-Kit+ cells are highly enriched HSCs and are almost exclusively CD150+CD48-, which are comparable to the LT-HSC population described above.
  • FIG. 24 illustrates an example of the one or more DL model(s)-based classification of HSCs/MPPs sorted with LSK/CD34/CD135 surface markers and a-catulin-GFP.
  • the four top left frames of FIG. 24 illustrate representative FACS density dot plots showing the gating strategy employed to identify and isolate LT-HSCs, ST-HSCs, and MPPs from BM using LSK/CD34/CD135 surface markers.
  • the top right frame of FIG. 24 illustrates the performance of the DL model in predicting cell types of HSCs/MPPs from the four top left frames of FIG. 24, as gauged by its precision, recall, and F1-score.
  • FIG. 24 illustrates example DIC and fluorescence images of LSK/a-catulin-GFP+ cells that were taken (e.g., immediately) after FACS sorting. Representative images are shown.
  • the scale bar represents 10 µm.
  • the bottom right frame of FIG. 24 illustrates a majority of LSK/a-catulin-GFP + cells that were classified as LT-HSC by the one or more DL model(s).
  • Evi1 is a transcription factor of the SET/PR domain protein family and has been shown to play a critical role in maintaining HSC stemness.
  • GFP fluorescence is mainly restricted to LT-HSCs, ST-HSCs, and MPPs.
  • One or more DL models used herein were challenged with single-cell inputs derived from the images of LSK/Evi1-GFP+ cells that were sorted from the BM of Evi1GFP mice. Being a three-way classifier, the one or more DL model(s) classified those cells into three categories (LT-HSC, ST-HSC, and MPP) as anticipated.
  • the certainty of each decision-making varies greatly as evidenced by the confidence score of each cell classification, which can be anywhere between 0.34 and 1.0 (illustrated in the four top frames and the three middle frames of FIG. 2).
  • the percentage of high-GFP cells (fluorescence unit > 3000) in HSCs is much higher than that in MPPs (LT-HSCs:ST-HSCs:MPPs, as illustrated in the bottom three frames of FIG. 2).
  • the difference in GFP fluorescence intensity became more significant if the confidence score was increased from 0.34 to 0.5 or 0.8.
  • increasing the confidence score threshold will greatly improve the accuracy of classification of the one or more DL model(s). It was found that at a confidence score of 0.8, the top 3% of cells with the highest GFP levels were exclusively LT-HSCs or ST-HSCs. This finding is consistent with FACS results.
  • FIG. 2 illustrates a model tested on Evi1-GFP+ populations.
  • the four frames of the top row of FIG. 2 illustrate the DIC image and the corresponding fluorescent label, the predicted cell type, and the predicted probability for each cell from the deep learning model.
  • the middle three frames of FIG. 2 illustrate the model’s output given the cell image crops from the original DIC image.
  • the corresponding predicted types and the model’s probability output are shown in the three frames of the bottom of FIG. 2.
  • HSCs/MPPs are the most relevant component of bone marrow transplants, which is a mainstay of life-saving therapy for treating patients with leukemia and congenital blood disorders.
  • HSC/MPP research heavily relies on the separation of HSCs and MPPs, and FACS sorting is the only technology available to do so.
  • FACS is a powerful tool that has great applications in immunology, stem cell biology, bacteriology, virology, and cancer biology as well as a clinical diagnosis.
  • the technology has made dramatic advances over the last 30 years, allowing unprecedented studies of the immune system and other areas of cell biology.
  • this technology has several key drawbacks as it requires antibody staining and lasers as light sources to produce both scattered and fluorescent light signals. It is well known that both antibody staining and laser can impair cell viability and stem cell activity.
  • a new and more gentle sorting technology may be useful to facilitate HSC research.
  • the one or more DL-based platform(s) use an antibody label-free and laser-free approach to identify distinct subpopulations of HSC/MPP with high accuracy. It provides a proof-of-principle demonstration that DL can recognize very subtle morphological differences between subpopulations of HSC/MPP from their light microscopic images. This technology might have broader applications to identify and isolate other cell populations in the hematopoiesis system. It may provide a basis for developing the next generation of label-free cell sorting approaches.
  • a high-quality training dataset is useful (e.g., essential) to successful DL model training.
  • using a processing toolbox as described herein, accurate marking and preparation of the training and validation datasets was made.
  • with the image datasets derived from LSK/SLAM sorting, one or more DL models were trained, and the resulting DL model(s) were able to classify a particular cell, which the model(s) had never seen, into one of the three categories (LT-HSC, ST-HSC, MPP) with high accuracy.
  • Results obtained herein showed that distinct morphological features of each cell population exhibited in light microscopic images were extracted by the fine-tuned/calibrated convolutional layers.
  • although the convolutional network pre-trained on ImageNet has had great success in general multi-class image classification tasks, the pre-trained parameters needed to be adjusted for the purpose of cell image classification, during which a proper selection of learning procedures for the convolutional layers was important.
  • the accuracy of classification can be further improved with more training data.
  • an investigation of the impact of training sample size enables cost-effective experiments and estimates the confidence in the generalizability of a model trained on a limited amount of data. From conducted experiments on the model performance at various sample sizes (shown in the bottom right frame of FIG. 22), it was found that when given greater than 50% of the dataset, the model started to perform reasonably. Increasing the scale of the dataset would also reduce the uncertainty of the model's performance, which indicated that larger datasets also tended to produce training experiments with low variance.
  • one or more DL systems may be trained with more image data to improve the accuracy of the prediction and identification.
  • the model(s) were also tested on the samples from a-catulinGFP mice. A recall of 74% was achieved. Single-cell image crops from Evi1-GFP+ cells were collected. The model's prediction was applied to this dataset with a gating threshold of 0.5. The results (e.g., FIG. 2) showed that the model predicted that 90% of them are HSCs. Interestingly, although the model was not trained to predict the fluorescence intensities, it was observed that the distribution of the probability output of the model for each cell subpopulation approximately matched the distribution of the fluorescence intensities (illustrated in the top four frames of FIG. 2). This indicates that it might be possible to use DL for in silico fluorescence staining of the cell from a DIC image.
  • One or more morphological features may serve as important features for DL to distinguish LT-HSC, ST-HSC, and MPP. It is well known that LT-HSC, ST-HSC, and MPP have different self-renewal capabilities. Morphologically, how LT-HSC, ST-HSC, and MPP differ is an important biological issue. One or more scenarios contemplate using perturbations of several morphological features of HSC/MPP and improved ML systems for the differentiation/distinguishing techniques described herein.
  • C57BL/6 (CD45.2+), C57Bl/6-Boy/J (CD45.1+) and a-catulinGFP mice were purchased from the Jackson Laboratory.
  • Evi1-IRES-GFP knock-in mice (Evi1GFP mice) were kindly provided by the University of Tokyo. All mice were used at 8-12 weeks of age. They were bred and maintained in the animal facility at Cooper University Health Care. All procedures and protocols followed NIH-mandated guidelines for animal welfare and were approved by the Institutional Animal Care and Use Committee (IACUC) of Cooper University Health Care.
  • Antibodies
  • Lineage cocktail-PE (components include anti-mouse CD3, clone 17A2; anti-mouse Ly-6G/Ly-6C, clone RB6-8C5; anti-mouse CD11b, clone M1/70; anti-mouse CD45R/B220, clone RA3-6B2; anti-mouse TER-119/Erythroid cells, clone Ter-119; BioLegend, cat# 78035).
  • Murine BM cells were flushed out from the long bones (tibias and femurs) and ilia with DPBS without calcium or magnesium (Corning). After lysis of red blood cells and a rinse with DPBS, BM cells were stained with antibodies on ice for 30 min.
  • For BM cells from C57BL/6 mice, the following antibodies were used: Lin-PE, c-Kit-FITC, Sca-1-APC, CD150-BV421, and CD48-PE/Cy7.
  • For BM cells from Evi1GFP or a-catulinGFP mice, the following antibodies were used: Lin-PE, Sca-1-APC, and c-Kit-PE/Cy7.
  • Cells were sorted on a Sony SH800Z automated cell sorter. Negative controls for gating were set by cells without antibody staining. The data was analyzed using the accompanying software with the cell sorter.
  • FACS-sorted cells were plated in coverglass-bottomed chambers (Cellvis) and maintained in DPBS/2% FBS throughout image acquisition.
  • An Olympus FV3000 confocal microscope was used to take DIC and fluorescence images simultaneously at a resolution of 2048x2048.
  • An image MATLAB toolbox may facilitate one or more techniques described herein.
  • the toolbox took the image from the track view that contains bright-field images. This toolbox was used to segment single cells with a pixel size of 64 × 64 from DIC images and to label the single-cell crops by cell types.
  • the toolbox had two steps: 1) detecting the single cell from bright-field views; 2) characterizing the cell targets and removing the outliers (debris and cell clusters) by applying size thresholding and a uniqueness check. A rough sketch of these steps is shown below.
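  • The toolbox itself is written in MATLAB; the following is a rough Python/scikit-image analogue of those two steps, with parameter values (the 64-pixel crop size aside) as assumptions:

    # Step 1: detect cells (Otsu edge filtering, flood-fill, opening,
    # watershed). Step 2: filter by size and crop 64x64 single cells.
    import numpy as np
    from scipy import ndimage as ndi
    from skimage import filters, morphology, segmentation, measure

    def crop_single_cells(dic_image, crop=64, min_area=80, max_area=1200):
        edges = filters.sobel(dic_image)
        mask = edges > filters.threshold_otsu(edges)
        mask = ndi.binary_fill_holes(mask)                   # flood-fill
        mask = morphology.opening(mask, morphology.disk(3))  # remove spots
        distance = ndi.distance_transform_edt(mask)
        markers, _ = ndi.label(distance > 0.5 * distance.max())
        labels = segmentation.watershed(-distance, markers, mask=mask)
        crops = []
        for region in measure.regionprops(labels):
            if not (min_area <= region.area <= max_area):
                continue                                     # size threshold
            r, c = (int(v) for v in region.centroid)
            half = crop // 2
            patch = dic_image[r - half:r + half, c - half:c + half]
            if patch.shape == (crop, crop):
                crops.append(patch)
        # The full toolbox also runs a uniqueness check to drop duplicated
        # detections; that step is omitted here.
        return crops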
  • Data augmentation was applied to the training examples by arbitrary image transformation including random rotation, horizontal flipping, and/or brightness adjustment on the original single-cell crops.
  • Oversampling was practiced on the minor classes in each run during the training experiment, such that the image data for each class could be balanced when used as training examples.
  • the oversampling algorithm randomly sampled training images from the minority until the number of the examples reached the same number in a majority class for each run.
  • the training dataset for one or more, or each, run contained equivalent numbers of data examples for one or more, or all, of the classes; a sketch of the augmentation and balancing steps follows.
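  • A sketch of these augmentation and oversampling steps using torchvision (exact transform parameters are assumptions; a weighted sampler is one way to realize the oversampling described above):

    # Random rotation, horizontal flipping, and brightness adjustment,
    # plus class balancing via sampling weights inversely proportional
    # to class frequency.
    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomRotation(degrees=180),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=0.2),
        transforms.ToTensor(),
    ])

    def balanced_loader(dataset, labels, batch_size=256):
        counts = torch.bincount(torch.as_tensor(labels))
        weights = 1.0 / counts[torch.as_tensor(labels)].float()
        sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                        replacement=True)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)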
  • Model selection was practiced with cross-validation, and ResNet-50 was selected for the convolutional layers.
  • the convolutional layers were pre-trained on ImageNet, with customized starting layers to match the size of the input single-cell images, followed by four fully-connected layers trained from scratch.
  • An ADAM optimizer was applied with a weight decay of 0.05; the learning rate was set to 5 × 10⁻⁴ for the fully-connected layers, and the convolutional layers were fine-tuned/calibrated by retraining them at 1% of that learning rate.
  • the final training outcome was reported with a training and validation split of 8:2; the model was trained with a batch size of 256 on a Tesla P100 GPU on the Google Colab platform for 20 epochs via PyTorch 1.10.0. A sketch of this optimizer configuration is shown below.
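  • As a sketch of that optimizer configuration (model.backbone and model.fc_head are hypothetical attribute names, matching the architecture sketch later in this document):

    # Adam with weight decay 0.05; full learning rate for the FC head,
    # 1% of it for the pre-trained convolutional backbone.
    import torch

    base_lr = 5e-4
    optimizer = torch.optim.Adam(
        [
            {"params": model.backbone.parameters(), "lr": base_lr * 0.01},
            {"params": model.fc_head.parameters(), "lr": base_lr},
        ],
        weight_decay=0.05,
    )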
  • a diagram 300 illustrates an example technique for distinguishing cells.
  • the process may start or restart.
  • a processing device may receive a plurality of first images. Each of the plurality of first images may depict first cells of a first type or a second type. At 306, the processing device may, perhaps for each of the plurality of first images, receive an indicator identifying whether the first image depicts a first cell of the first type or the second type.
  • the processing device may input, into a deep-learning (DL) model, the plurality of first images and the indicator for each of the plurality of first images.
  • the processing device may process, via the DL model, the plurality of first images and the indicator for each of the plurality of first images.
  • the processing device may input, into the DL model, a second image comprising a second cell of the first type or the second type.
  • the process may determine, via the DL model, whether the second cell is of the first type or the second type, based at least in part, on the processing of the plurality of first images and the indicator for each of the plurality of first images.
  • the process may stop or restart.
  • FIG. 4 is a block diagram of a hardware configuration of an example device that may function as a process control device/logic controller that may perform/host at least a part of one or more elements of the DL/ML techniques described herein, for example.
  • the hardware configuration 400 may be operable to facilitate delivery of information from an internal server of a device.
  • the hardware configuration 400 can include a processor 410, a memory 420, a storage device 430, and/or an input/output device 440.
  • One or more of the components 410, 420, 430, and 440 can, for example, be interconnected using a system bus 450.
  • the processor 410 can process instructions for execution within the hardware configuration 400.
  • the processor 410 can be a single-threaded processor or the processor 410 can be a multi-threaded processor.
  • the processor 410 can be capable of processing instructions stored in the memory 420 and/or on the storage device 430.
  • the memory 420 can store information within the hardware configuration 400.
  • the memory 420 can be a computer-readable medium (CRM), for example, a non-transitory CRM.
  • the memory 420 can be a volatile memory unit, and/or can be a non-volatile memory unit.
  • the storage device 430 can be capable of providing mass storage for the hardware configuration 400.
  • the storage device 430 can be a computer-readable medium (CRM), for example, a non-transitory CRM.
  • the storage device 430 can, for example, include a hard disk device, an optical disk device, flash memory and/or some other large capacity storage device.
  • the storage device 430 can be a device external to the hardware configuration 400.
  • the input/output device 440 (e.g., a transceiver device) may provide input/output operations for the hardware configuration 400.
  • the input/output device 440 can include one or more of a network interface device (e.g., an Ethernet card), a serial communication device (e.g., an RS-232 port), one or more universal serial bus (USB) interfaces (e.g., a USB 2.0 port) and/or a wireless interface device (e.g., an 802.11 card).
  • the input/output device can include driver devices configured to send communications to, and/or receive communications from one or more networks.
  • the input/output device 440 may be in communication with one or more input/output modules (not shown) that may be proximate to the hardware configuration 400 and/or may be remote from the hardware configuration 400.
  • the one or more input/output modules may provide input/output functionality in the digital signal form, discrete signal form, TTL form, analog signal form, serial communication protocol, fieldbus protocol communication and/or other open or proprietary communication protocol, and/or the like.
  • the camera device 460 may provide digital video input/output capability for the hardware configuration 400.
  • the camera device 460 may communicate with any of the elements of the hardware configuration 400, perhaps for example via system bus 450.
  • the camera device 460 may capture digital images and/or may scan images of various kinds, such as Universal Product Code (UPC) codes and/or Quick Response (QR) codes, for example, among other images as described herein.
  • the camera device 460 may be the same and/or substantially similar to any of the other camera devices described herein.
  • the camera device 460 may include at least one microphone device and/or at least one speaker device.
  • the input/output of the camera device 460 may include audio signals/packets/components, perhaps for example separate/separable from, or in some (e.g., separable) combination with, the video signals/packets/components of the camera device 460.
  • the camera device 460 may be in wired and/or wireless communication with the hardware configuration 400. In one or more scenarios, the camera device 460 may be external to the hardware configuration 400. In one or more scenarios, the camera device 460 may be internal to the hardware configuration 400.
  • the subject matter of this disclosure, and components thereof, can be realized by instructions that upon execution cause one or more processing devices to carry out the processes and/or functions described herein.
  • Such instructions can, for example, comprise interpreted instructions, such as script instructions, e.g., JavaScript or ECMAScript instructions, or executable code, and/or other instructions stored in a computer readable medium (e.g., on a non-transitory computer readable medium).
  • Implementations of the subject matter and/or the functional operations described in this specification and/or the accompanying figures can be provided in digital electronic circuitry, in computer software, firmware, and/or hardware, including the structures disclosed in this specification and their structural equivalents, and/or in combinations of one or more of them.
  • the subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, and/or to control the operation of, data processing apparatus.
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and/or declarative or procedural languages. It can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, and/or other unit suitable for use in a computing environment.
  • a computer program may or might not correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs and/or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, and/or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that may be located at one site or distributed across multiple sites and/or interconnected by a communication network.
  • the processes and/or logic flows described in this specification and/or in the accompanying figures may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and/or generating output, thereby tying the process to a particular machine (e.g., a machine programmed to perform the processes described herein).
  • the processes and/or logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) and/or an ASIC (application specific integrated circuit).
  • Computer readable media suitable for storing computer program instructions and/or data may include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and/or flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto optical disks; and/or CD ROM and DVD ROM disks.
  • the processor and/or the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Abstract

Technologies are disclosed for distinguishing among different types of cells. Techniques may include receiving a plurality of first images. The plurality of first images may depict first cells of a first type or a second type. Techniques may include, for each of the plurality of first images, receiving an indicator identifying whether the first image depicts a first cell of the first type or the second type. Techniques may include inputting, into a deep-learning (DL) model, the plurality of first images and the indicator for each of the plurality of first images. Techniques may include inputting, into the DL model, a second image comprising a second cell of the first type or the second type. Techniques may include determining whether the second cell is of the first type or the second type based on the plurality of first images and the indicator for each of the plurality of first images.

Description

LABEL-FREE CLASSIFICATION OF CELLS BY IMAGE ANALYSIS AND MACHINE LEARNING
CROSS-REFERENCE TO RELATED APPLICATION [0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/224,187, filed on July 21, 2021, the entire contents of which are hereby incorporated by reference herein, for all purposes.
GOVERNMENT LICENSE RIGHTS
[0002] This invention was made with government support under R00 HL107747 awarded by the National Institutes of Health. The government has certain rights in the invention.
BACKGROUND
[0003] Hematopoietic stem cells (HSCs) and multipotent progenitors (MPPs) are crucial for maintaining lifelong hematopoiesis. Several decades of successful clinical HSC transplantation have demonstrated the therapeutic importance of HSCs and MPPs. MPPs are multipotent and may contribute to lineages of erythroid progenitor cells, megakaryocyte progenitor cells, granulocyte progenitor cells, and/or monocyte progenitor cells.
[0004] Hematopoietic stem cells (HSCs) and multipotent progenitors (MPPs) are important for lifelong blood production and are uniquely defined by their capacity to durably self- renew, while still contributing to the pool of differentiating cells. Based on their self-renewal capability, these cells can be divided into long-term (LT), short- term (ST) HSCs, and multipotent progenitors (MPPs).
SUMMARY
[0005] Technologies are disclosed for distinguishing among different types of cells. Techniques may include receiving a plurality of first images. One or more, or each, of the plurality of first images may depict first cells of a first type or a second type. Techniques may include, for each of the plurality of first images, receiving an indicator identifying whether the first image depicts a first cell of the first type or the second type. Techniques may include inputting, into a deep-learning (DL) model, the plurality of first images and the indicator for each of the plurality of first images. Techniques may include processing, by the DL model, the plurality of first images and the indicator for each of the plurality of first images. Techniques may include inputting, into the DL model, a second image comprising a second cell of the first type or the second type. Techniques may include determining, via the DL model, whether the second cell is of the first type or the second type, based at least in part, on the processing of the plurality of first images and the indicator for each of the plurality of first images.
[0006] In one or more scenarios, the DL model may comprise, at least in part, one or more layers. Techniques may comprise determining, by the DL model from the indicator, which of each of the plurality of first images depicts the first cell as that of the first type. Techniques may comprise determining, by the DL model from the indicator, which of each of the plurality of first images depicts the first cell as that of the second type. Techniques may comprise associating, by the DL model, one or more characteristics of each of the plurality of first images determined to depict the first cell of the first type with one or more first image identification parameters of a cell of the first type. Techniques may comprise associating, by the DL model, one or more characteristics of each of the plurality of first images determined to depict the first cell of the second type with one or more second image identification parameters of a cell of the second type. Techniques may comprise tuning, by the DL model, at least some of the one or more layers of the DL model based on the first image identification parameters and/or the second image identification parameters.
[0007] In one or more scenarios, the one or more layers may be convolutional layers. The tuning may be performed, by the DL model, at one or more learning rates that may be associated with the one or more convolutional layers.
BRIEF DESCRIPTION OF DRAWINGS
[0008] The elements and other features, advantages and disclosures contained herein, and the manner of attaining them, will become apparent and the present disclosure will be better understood by reference to the following description of various examples of the present disclosure taken in conjunction with the accompanying drawings, wherein:
[0009] FIG. 1 is an example flowchart of an overview of at least one cell distinguishing technique.
[0010] FIG. 2 is an illustration of at least one tested model used to distinguish cells.
[0011] FIG. 3 is an example flow diagram of at least one technique for distinguishing among different types of cells.
[0012] FIG. 4 is a block diagram of a hardware configuration of an example device that may control one or more parts of one or more cell distinguishing techniques.
[0013] FIG. 5 is an example illustration of a demonstration of image pre-processing on raw image data with a higher density of cells.
[0014] FIG. 6 illustrates an example demonstration of image data from patient blood.
[0015] FIG. 7 illustrates an example of the architecture of the machine learning model of ResNet-50.
[0016] FIG. 8 illustrates an example of Five-Fold cross-validation during training and testing experiments.
[0017] FIG. 9 illustrates an example of trained model evaluation.
[0018] FIG. 10 illustrates an example of single cell images cropped from bright field images.
[0019] FIG. 11 illustrates the cropping results after applying the size filter, after applying both the size filter and the uniqueness operation, and a size distribution characterization of the crops.
[0020] FIG. 12 is an example illustration of the cropped single cells after normalization actions for training purposes.
[0021] FIG. 13 is an example illustration of typical data samples from a batch.
[0022] FIG. 14 is an example illustration of a subpopulation of stem cells.
[0023] FIG. 15 illustrates an example of a Deep Learning model workflow.
[0024] FIG. 16 is an example illustration of principal components of at least three subsets of HSCs versus non-HSC: ST-HSC, LT-HSC and MPP.
[0025] FIG. 17 illustrates an example of a confusion matrix for the HSC 3-classes.
[0026] FIG. 18 is an example illustration of a t-SNE plot of four subsets of image data.
[0027] FIG. 19 illustrates an example confusion matrix and learning history of the training experiments.
[0028] FIG. 20 illustrates an example overview of one or more experiments to distinguish one or more cells.
[0029] FIG. 21 illustrates example FACS sorting of murine HSCs and MPPs using LSK/SLAM markers and cell imaging.
[0030] FIG. 22 illustrates a summary of a DL model's performance on the holdout validation data of LSK/SLAM sorting.
[0031] FIG. 23 illustrates an interpretation of the one or more DL model(s).
[0032] FIG. 24 illustrates an example of the one or more DL model(s)-based classification of HSCs/MPPs.
DETAILED DESCRIPTION
[0033] For the purposes of promoting an understanding of the principles of the present disclosure, reference will now be made to the examples illustrated in the drawings, and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of this disclosure is thereby intended.
[0034] Detection, characterization, and classification of different types of cells is important for the diagnosis and monitoring of cells. The traditional way of cell classification via fluorescent images requires a series of tedious experimental procedures and often impacts the status of cells. Described herein are one or more methods for label-free detection and classification of cells that take advantage of data analysis of bright field microscopy images. The approach uses the convolutional neural network (CNN), a powerful image classification and machine learning algorithm, to perform label-free classification of cells detected in microscopic images of cell samples containing different types of cells. It requires minimal data pre-processing and has an easy experimental setup. Through one or more experiments, one or more methods described herein were shown to achieve high accuracy in the identification of cells without the need for advanced devices or expert users, thus providing a faster and simpler way of counting, identifying, and classifying cells. Details of one or more methods, and their application to cancer cells and/or stem cells, are described herein.
[0035] CNN may be targeted to have images/videos as the primary data flow because the way a CNN operates on its inputs can maintain the intrinsic relation between neighboring pixels. CNN models are generally over-parameterized mathematical model(s) whose parameters are found via supervised learning. A Softmax function on the last (fully-connected) layer may be applied as the target to be optimized. In one or more scenarios, a relatively complicated CNN structure, ResNet-50, was used as a pre-trained backbone, followed by at least four fully-connected layers. The parameters of the backbone and fully-connected layers were fine-tuned/calibrated, in a way that a full learning rate of the fully-connected layers was used. In one or more scenarios, (e.g., only) 1% of the learning rate was used in the backbone. A sketch of such an architecture appears below.
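As a minimal, non-authoritative sketch of such an architecture (the layer widths in the fully-connected head are assumptions; the attribute names backbone and fc_head match the optimizer sketch earlier in this document):

    # ResNet-50 backbone pre-trained on ImageNet, followed by four
    # fully-connected layers; softmax is applied in the loss/at inference.
    import torch.nn as nn
    from torchvision import models

    class HscClassifier(nn.Module):
        def __init__(self, n_classes=3):
            super().__init__()
            resnet = models.resnet50(pretrained=True)
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])
            self.fc_head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(2048, 512), nn.ReLU(),
                nn.Linear(512, 128), nn.ReLU(),
                nn.Linear(128, 32), nn.ReLU(),
                nn.Linear(32, n_classes),
            )

        def forward(self, x):
            return self.fc_head(self.backbone(x))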
[0036] In one or more scenarios, different batches of one or more, or all, the classes of HSCs and MPPs were used to provide the one or more training datasets. To prepare the data for training the DL model, the raw images were processed through Otsu's filtering edge detection, flood-fill operation(s), and morphological opening operation(s) so that a watershed segmentation could be achieved. Although many, or all, cells contained in the image are located, there are irrelevant spots and duplicated detection(s) that might have been mistakenly segmented. To help ensure the quality of training data for the CNN machine learning model, (e.g., only) the cropped images that clearly contain an isolated single cell were selected from one or more, or all, cropped images, by applying size thresholding and uniqueness checks, and these were used to train the model (as described herein).
[0037] After the cropped single-cell image were acquired, data augmentation was applied to the training dataset examples by arbitrary image transformation. Oversampling was practiced on the minor classes (LT-HSC) in one or more, or each, run during the training of the model(s), such that the image data for one or more, or each, class could be balanced when used as training examples. The oversampling algorithm randomly sampled training images from the minority, perhaps until the number of the examples reached the same number in a majority class for one or more, or each, run. In one or more scenarios, the training dataset for each run may contain equivalent numbers of data examples for one or more, or all, of the classes.
[0038] Hematopoietic stem cells (HSCs) and multipotent progenitors (MPPs) are crucial for maintaining lifelong hematopoiesis. Several decades of successful clinical HSC transplantation have demonstrated the therapeutic importance of HSCs and MPPs. For basic research, HSCs and MPPs are routinely analyzed and separated by fluorescence-activated cell sorting (FACS) based on cell surface marker staining, a procedure that requires incubation with various antibodies and flow cytometers equipped with lasers. Both antibody staining and laser impair cell viability and stem cell activity. To develop a new staining-free, laser-free method for predicting and separating subpopulations of HSCs and MPPs, a deep learning approach was developed that can extract minutiae from large-scale datasets of long-term HSCs (LT-HSCs), short-term HSCs (ST-HSCs), and multipotent progenitors (MPPs) to distinguish subpopulations of hematopoietic precursors solely based on their morphology. The one or more deep learning model(s) can achieve predictable identification of subsets of HSCs with at least 85% accuracy. It is anticipated that the accurate and robust deep learning-based platform described herein for hematopoietic precursors will provide a basis for the development of a (e.g., next-generation) cell sorting system.
[0039] Hematopoietic stem cells (HSCs) and multipotent progenitors (MPPs) are important for lifelong blood production and are uniquely defined by their capacity to durably self-renew, while still contributing to the pool of differentiating cells. As HSCs differentiate, they give rise to a series of progenitor cell intermediates that undergo a gradual fate commitment to a particular mature blood cell. HSCs reside within bone marrow (BM) and are highly heterogeneous. Numerous studies have defined phenotypic and functional heterogeneity within the HSC/MPP pool and have revealed the coexistence of several HSC/MPP subpopulations with distinct proliferation, self-renewal, and differentiation potentials. Based on their self-renewal capability, they can be divided into long-term (LT) and short-term (ST) HSCs and multipotent progenitors (MPPs). In the past few decades, a lot of effort has been made to identify surface markers permitting the prospective identification and isolation of the subsets of HSCs. In the adult mouse, all HSCs/MPPs are contained in the Lineage-/low Sca-1+ c-Kit+ (LSK) fraction of BM cells. Higher levels of HSC purity can be achieved by applying two signaling lymphocyte activation molecule (SLAM) family markers, CD150 and CD48.
[0040] It has been reported that one out of every 2.1 LSK CD150+CD48- cells possess the capability to give long-term repopulation in recipients of BM transplants. Meanwhile, ST-HSCs and MPPs can be isolated by sorting LSK CD150- CD48- and LSK CD150-CD48+ cells, respectively. In addition to the SLAM markers, HSCs can also be subdivided by their characteristic CD34 and CD135 (Flt3) expression profiles. With this staining, LSK CD34-CD135- cells are defined as LT-HSCs, while LSK CD34+CD135- as ST-HSCs and LSK CD34+CD135+ as MPPs. Although the self-renewal capabilities of LT-HSCs, ST-HSCs, and MPPs are dramatically different, there is no evidence to date that those three subpopulations are morphologically distinguishable under the microscope.
[0041] Other than surface antigen markers, several intracellular proteins, e.g., a-catulin and ecotropic viral integration site-1 (Evi1), have been found to be expressed predominantly in murine HSCs. Thus, GFP expression driven by a-catulin or Evi1 gene promoters has been used to identify HSCs and track their "stemness" ex vivo or in vivo. The detection of both membrane-bound and intracellular HSC markers relies on antibody staining, which will affect HSC activity, and/or excited fluorescence, which will cause photodamage and loss of HSC stemness. Therefore, a label-free, laser-free method to identify HSCs would be very advantageous for research and clinical applications.
[0042] Deep learning (DL) has become state-of-the-art for computer vision tasks in biological and biomedical studies. One or more DL algorithms build a mathematical model based on training examples with ground truth labels. It extracts relevant biological microscopic characteristics from massive image data. The primary algorithm(s) for DL image classification are based on the convolutional neural network (CNN). CNN is mainly composed of convolutional layers that perform a convolution with "learnable" filters. The parameters of such filters can be optimized during the learning process. At the final layers, the output is flattened into a vector for classification, which categorizes given inputs into certain classes. DL has proven to be extremely effective on image classification tasks. For example, DL has been used to categorize tumor cells, red blood cells, and white blood cells, predict neural stem cell differentiation, and assist cancer diagnosis.
[0043] A DL-based platform has been successfully developed for the automatic detection of rare circulating tumor cells in a label-free, laser-free way. Apparently, malignant tumor cells have distinct morphology from normal cells. Whether this DL-based platform can distinguish the subpopulations of HSCs, which only have very subtle differences, is a challenging question. To answer this question, Differential Interference Contrast (DIC) images were taken of different HSCs and MPPs that were sorted by flow cytometer using LSK/SLAM markers, and those images were then processed into single-cell DL training datasets and validation datasets. One or more DL models were trained with single-cell training datasets, with multiple rounds of parameter optimization and augmentation of training sample sizes. Next, the efficacy of the one or more DL models was evaluated by feeding them with single-cell validation datasets and/or calculating the accuracy of their cell type prediction.
[0044] One or more models were tested on HSCs that were separated either by LSK/CD34/CD135 membrane markers or cellular GFP expression in the BM of a-catulinGFP and Evi1GFP transgenic mice. Obtained results demonstrated for the first time that DL can extract intrinsic morphological features of distinct HSC subpopulations directly from their light microscopic images, and make reliable and accurate classifications.
[0045] Circulating tumor cells (CTCs) found in peripheral blood originate from solid tumors. They are cells shed by a primary tumor into the vasculature, circulating through the bloodstream of cancer patients, and colonizing at distant sites, where they may form metastatic tumors. CTCs are an important biomarker for early tumor diagnosis and early evaluation of disease recurrence and metastatic spread in various types of cancer. Early detection of CTCs provides high chances for patients to survive before severe cancer growth occurs. The CTC count is also an important prognostic factor for patients with metastatic cancer. For example, a study has shown that the number of CTCs is an independent predictor of survival in patients with breast cancer and prostate cancer, and changes in CTC counts predict survival in patients with lung cancer.
[0046] The identification of the CTC population is a challenging problem. Various approaches to identifying and isolating CTCs, including antibody-based methods and physical-characteristics-based methods, have been developed. This task is difficult because of the low concentration of CTCs in a patient's peripheral blood (a few CTCs out of 10 billion blood cells), as well as heterogeneity in the characteristics of CTCs. For example, the mechanism by which CTCs maintain metastatic potential during circulation is not well understood; CTCs derived from some patients allow a cell line to be established, but CTCs from some others lose the capability of proliferation within a few hours of the blood draw. Therefore, the inability to draw a large volume of blood from patients leads to the need for improved CTC isolation methods so that CTCs can be detected in small sample volumes. Further, the inconsistency in the viability of CTCs hinders further exploration of relationships between the mechanism of patient-derived CTCs and tumor dormancy.
[0047] Locating specific target cells such as CTCs often requires tedious procedures. During the processes, CTCs need to be distinguished from a (e.g., huge/large) amount of leukocytes via immunofluorescent labeling and fluorescent microscopy, and identifying the CTCs via the fluorescent labeling images can be achieved with high throughput. Epithelial markers such as cytokeratin (CK) and epithelial cell adhesion molecules (EpCAM) are useful for detecting CTCs in patients. For example, CellSearch (Menarini Silicon Biosystems), an FDA-approved platform for CTC identification, is based on the overexpression of CK and EpCAM. However, detection of some types of CTCs, such as CTCs in renal cell carcinoma (RCC), is limited by the lack of epithelial differentiation. RCC shows low expression of epithelial markers, so such types of CTCs cannot be captured by labeling methods. In addition, fluorescent labeling comes with a few disadvantages.
[0048] For example, as described herein, most fluorescence imaging needs antibody-based fluorescence probes, which rely on overexpression of certain proteins on cell membranes, and such overexpression is usually not stable and depends largely on cancer type and patient; photobleaching and phototoxicity occur in a short time after exposure under a fluorescent light source, and choosing proper fluorophores needs expert experience; fluorescence staining often influences the viability of cells, thus impacting further culturing and analysis. In contrast, intelligent cell identification and classification from low-resolution microscopic images allows a fast, cheap, and repeatable process. Techniques described herein are targeted to develop an automatic tool for accurate detection of CTCs as a promising step for diagnosis and clinical management of cancer patients.
[0049] Machine learning (ML), e.g., as a part of Deep Learning (DL), has become a superior tool for developing automated processes of classification, sorting, and detection. ML algorithms build a mathematical or statistical model based on sample "training data" with known "ground truth" annotations, to make inferences or predictions. There are traditional machine learning models, such as random forest, that can perform classification or prediction given high-quality features, and deep learning models, such as Convolutional Neural Networks (CNNs), that can learn to extract features in an automatic fashion.
[0050] For instance, CNN has been applied to the categorization of cell lines and red blood cells. Others have integrated feature extraction and deep learning with high-throughput quantitative imaging enabled by photonic time stretch, achieving high accuracy in label-free cell classifications on selected white blood cells (WBCs) and cancer cells. They developed specialized hardware for the task of locating static, captured cells on a slide or in a device via a high-speed and/or low-resolution scan. However, instead of a simple and cheap experimental setup, their image acquisition is based on a time-stretch quantitative phase imaging system, and the representation of results could be improved by using samples from patients' blood.
[0051] Technologies disclosed herein are aimed at developing a fast and accurate technique for locating and counting tumor cells among the mixed cells after enrichment from whole blood. For state-of-the-art works using deep learning methods, the model accuracy can be 85% to 95%, given a large amount of data with distinct differences between the candidate images. As described herein, because the CTCs are rare cells from patients' blood samples, the data size is relatively small. Described herein is the preprocessing of the raw dataset to mitigate challenges posed by the data size limitation. As described herein, the CTCs and WBCs may be identified directly in regular bright field microscopic images, perhaps for example without the need for advanced devices and/or time-consuming biological operations. The task may be achieved in a label-free fashion by using a CNN to classify cells detected in images of patient blood samples containing WBCs and CTCs.
[0052] Techniques described herein may include the following: isolation and labeling of the blood samples, image data collection, image processing, and training and evaluating the one or more deep learning model(s). A flowchart shown in FIG. 1 demonstrates the work after acquiring the isolated and labeled blood samples.
[0053] FIG. 1 is an example flowchart of an overview of at least one cell distinguishing technique. FIG. 1 illustrates a deep-learning based analysis framework for microscopy images from isolated blood samples. One or more phases/elements of the process/techniques may include data preparation, image pre-processing, ML, and/or testing. The images collected from bright field and fluorescent microscopy are processed and cropped into images containing a single cell, which are used as the training and testing raw data for the machine learning model with a deep CNN architecture.
[0054] The peripheral blood samples from metastatic RCC (mRCC) patients were provided by Lehigh Valley Health Network, and healthy donor whole blood samples were provided by the University of Maryland. The sample collection processes for both have followed approved institutional review board protocols with patients providing informed consent.
[0055] The patient's whole blood was drawn into an 8.5 mL heparin tube and processed expeditiously. For example, 2 mL of whole blood was used for each batch of enrichment with the EasySep Direct Human CTC Enrichment Kit. The human colorectal cancer cell line (HCT-116, American Type Culture Collection (ATCC), USA) and healthy donor whole blood were used in this work. WBCs used for the experiments were obtained from whole blood with red blood cell (RBC) lysis. In one or more experiments, 1 mL of whole blood was lysed with 18 mL of RBC lysing buffer (ThermoFisher), followed by a 30-min incubation in the dark at room temperature. The mixture was then centrifuged at 500 g for 5 min. The supernatant was discarded, and the pellet was washed two-to-three times with PBS. HCT-116 cells were pre-stained with CellTracker Red (Thermo, USA) and all WBCs were pre-stained with CellTracker Green prior to use.
[0056] CTCs were isolated from peripheral blood samples of metastatic renal cell carcinoma patients. The isolated cells enriched from 2 mL of whole blood were triple washed using 1× PBS (pH 7.4, Thermo). The enriched cells were mixed with 5 µL anti-hCarbonic Anhydrase IX and 2 µL Calcein AM (BD Biosciences, USA), and the final volume was brought to 200 µL with PBS in a 1.5 mL sterile Eppendorf tube for staining. An efficient CTC immunomagnetic separation method (EasySep direct human CTC enrichment kit, Catalog #19657) was used with negative selection. A manual EasySep protocol was followed, where peripheral blood was mixed directly with antibody cocktails (CD2, CD14, CD16, CD19, CD45, CD61, CD66b, and Glycophorin A) that recognize hematopoietic cells and platelets. The antibodies labeled the unwanted cells, which were then labeled with magnetic beads and separated by the EasySep magnet. The target CTCs were collected from the flow-through and were immediately available for downstream analysis. Live cells could be identified by being stained with Calcein AM, and CTCs isolated from renal cell carcinoma patients were stained with a human carbonic anhydrase IX (Abcam, USA) PE-conjugated antibody. A live cell stained with the carbonic anhydrase IX PE-conjugated antibody would be finally identified as a CTC.
[0057] Optical images were obtained from fluorescent microscopy. Both immunocytochemically stained and bright field images were taken from a tumor cell line mixed with WBCs from healthy donor whole blood and from the negative depletion of peripheral blood from renal cell carcinoma patients. The raw cell microscopy images were acquired under an Olympus IX70 microscope with a 640/480 microscopy bright field camera, at 20-X and 10-X scope magnification. The corresponding label images shown in the top left frame of FIG. 5 are for a subset of the raw cell images, with the image in the top middle frame of FIG. 5 acting as ground truth.
[0058] High resolution and high magnification images contain more details, but acquiring them increases the total number of images to be captured and processed. Therefore, the selection of the scope magnification can be considered a trade-off. As described herein, 20-X was chosen as the scope magnification since it provides a reasonable image resolution for each cell (500 pixels) with an acceptable number of images to acquire per testing sample.
[0059] After raw images of cultured cells have been acquired, the first step of image pre-processing is applying a customized toolbox to automatically segment the cells. In at least one preprocessing toolbox, the bright field image may be processed by first filtering via edge detection based on Otsu's method. A flood-fill operation may be applied on the filtered image. A morphological opening operation may locate one or more, or all, cells and/or may remove one or more, or a majority, of irrelevant spots. Afterward, a watershed segmentation can be achieved. The bright field image can be cropped into individual cell images. The segmentation of this type of nuclear images is known in the art.
[0060] Different thresholding algorithms were evaluated, and it was found that Otsu's filtering might behave poorly where extreme brightness from some very bright cells leads the algorithm to set a threshold between the very bright cells and the rest, instead of setting it between cells and background. The issue can be resolved by a manual adjustment of brightness, setting a maximum brightness while checking that the cap does not cause cells of normal brightness to be missed, before the raw images are input into a segmentation toolbox that was created. The toolbox is written in MATLAB and may be used for other segmentation purposes for bright field and fluorescent images. The issue can also be resolved via adaptive thresholding, in which the threshold gate is set according to relative brightness, for example.
[0061] In the toolbox, a raw image is processed through Otsu's filtering edge detection (top right frame of FIG. 5), a flood-fill operation (second frame down on right in FIG. 5), and a morphological opening operation (third frame down on right in FIG. 5), so that a watershed segmentation can be achieved (fourth frame down on right in FIG. 5). As one can see from the final segmentation result (bottom middle frame of FIG. 5), although one or more, or all, cells contained in the image are located, there are irrelevant spots that may have been mistakenly segmented as well. Therefore, after running a cropping algorithm to crop the segmented regions, it is not guaranteed that every cropped region corresponds to a single cell. To ensure the quality of training data for the CNN machine learning model, (e.g., only) the single-cell images were manually selected from all cropped images and (e.g., only) those were used to train the ML model. The label for a selected single-cell image (WBC or CTC) is (e.g., easily) obtained from the label of the cell in the corresponding fluorescent image.
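For illustration, the segmentation pipeline described above may be sketched in Python with scikit-image as follows. The toolbox itself is written in MATLAB, so this sketch is only an illustrative analogue; in particular, the use of a Sobel gradient for edge detection, the structuring-element size, and the h-maxima seeding of the watershed are assumptions rather than the toolbox's exact implementation.

```python
# A minimal sketch (assumed analogue of the MATLAB toolbox) of the described
# pipeline: Otsu-gated edge detection, flood-fill, morphological opening, and
# watershed segmentation of a bright field image.
from scipy import ndimage as ndi
from skimage import filters, io, measure, morphology, segmentation

def segment_bright_field(path, min_area=40):
    img = io.imread(path, as_gray=True)

    # Edge detection gated by Otsu's threshold (assumed Sobel gradient).
    grad = filters.sobel(img)
    edges = grad > filters.threshold_otsu(grad)

    # Flood-fill the interiors enclosed by the detected edges.
    filled = ndi.binary_fill_holes(edges)

    # Morphological opening locates cells and removes small irrelevant spots.
    opened = morphology.opening(filled, morphology.disk(3))
    opened = morphology.remove_small_objects(opened, min_size=min_area)

    # Watershed segmentation seeded from maxima of the distance transform.
    distance = ndi.distance_transform_edt(opened)
    markers = measure.label(morphology.h_maxima(distance, 2))
    return segmentation.watershed(-distance, markers, mask=opened)
    # Each integer label marks one candidate cell region to be cropped;
    # crops that do not contain exactly one cell are discarded manually.
```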
[0062] FIG. 5 is an example demonstration of image pre-processing on raw image data with a higher density of cells. The top left and middle frames of FIG. 5 are the fluorescent labeled image and the corresponding bright field image, respectively. The bright field image is then processed in the toolbox (the dashed-line rectangle region). The processing may include filtering by edge detection based on Otsu's method (top right frame of FIG. 5), a flood-fill operation on the filtered image (second frame down on right in FIG. 5), a morphological opening operation that locates one or more, or all, cells and removes one or more, or all, irrelevant spots (third frame down on right in FIG. 5), and a watershed transformation for segmentation (fourth frame down on right in FIG. 5). One or more, or each, individual cell is visualized with a distinct color in this figure. The appearance of segmented cells in the original bright field image is illustrated in the bottom middle frame of FIG. 5. The bright field image can be cropped into individual cell images.
[0063] FIG. 6 illustrates an example of selected patient blood sample images captured from the microscope. Some examples are shown in the top left frame of FIG. 6 (WBCs) and the top right frame of FIG. 6 (isolated CTCs). Some examples of cropped single-cell images are shown in the left middle frame of FIG. 6 and the right middle frame of FIG. 6. The width and height of the cropped images are both 30 pixels. Because a cell may stay near the edge of a cell culture well where there is low intensity and a cloudy background, a brightness and background normalization operation has been applied on all the cropped single cells, as sketched below. The cropped and normalized single-cell images are used as the dataset for training and testing the one or more ML model(s). The size ranges of CTCs and WBCs in patient blood samples have been collected and are shown in the bottom frame of FIG. 6. It can be observed that both types of cells are similar in size; thus size alone cannot be used to distinguish the two types of cells.
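One way the brightness and background normalization of the cropped single cells might be practiced is sketched below. The exact operation is not specified herein, so the border-based background estimate and the zero-mean, unit-variance scaling are assumptions.

```python
# A minimal sketch (assumed operation) of per-crop brightness/background
# normalization for 30 x 30 single-cell crops.
import numpy as np

def normalize_crop(crop):
    """Normalize a cropped single-cell image to a common brightness scale."""
    crop = crop.astype(np.float32)
    # Estimate the local background from the border pixels, since a cell near
    # the edge of a culture well may sit on a low-intensity, cloudy background.
    border = np.concatenate([crop[0, :], crop[-1, :], crop[:, 0], crop[:, -1]])
    crop -= border.mean()
    return crop / (crop.std() + 1e-8)
```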
[0064] FIG. 6 illustrates an example demonstration of image data from patient blood. Selected images captured by the microscope from isolated patient blood samples are shown in the top left and right frames of FIG. 6: the processed WBCs image (top left frame of FIG. 6) and CTCs image (top right frame of FIG. 6), and a cropped single WBC (middle left frame of FIG. 6) and CTC (middle right frame of FIG. 6), respectively. The bottom frame of FIG. 6 illustrates a summary of the size distributions of CTC and WBC cells. The average diameters of CTCs and WBCs are both approximately 11.5 µm, while the CTCs have a distinguishably wider size distribution.
[0065] Before performing training experiments, the t-distributed stochastic neighbor embedding (t-SNE) plot was generated for the training dataset to show the overall data distribution. As described herein, t-SNE is a non-linear dimensionality reduction technique developed to embed high-dimensional data for visualization in a low-dimensional space. It maps one or more, or each, high-dimensional data point to a low-dimensional (typically two- or three-dimensional) point in such a way that similar data points in the high-dimensional space are mapped to nearby low-dimensional points and dissimilar data points are mapped to distant points, with high probability. The visualizations presented by t-SNE can vary due to the selection of different parameters of the algorithm.
[0066] The scikit-learn (version 0.21.2) machine learning package was used to perform the t-SNE dimensionality reduction and map high-dimensional data points to a two-dimensional space. The default parameters of the package were used, except that the perplexity was set to 50 and the learning rate to 100. Under this parameter setting, the data points in the two-dimensional map fall into two clearly distinct clusters, which indicates that a deep learning network capable of non-linear functional mapping should be able to achieve highly accurate classification on the dataset.
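A minimal sketch of this t-SNE step with scikit-learn is shown below, using the stated perplexity (50) and learning rate (100). The placeholder data and the flattening of each crop into a feature vector are assumptions made only to keep the sketch self-contained.

```python
# A minimal sketch of the described t-SNE visualization with scikit-learn.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
crops = rng.random((200, 30, 30))        # placeholder for cropped cell images
labels = rng.integers(0, 2, 200)         # placeholder labels (0 = WBC, 1 = CTC)

X = crops.reshape(len(crops), -1)        # flatten each crop into a vector
emb = TSNE(n_components=2, perplexity=50, learning_rate=100,
           random_state=0).fit_transform(X)
# `emb` holds the two-dimensional map; plotting it colored by `labels`
# reproduces the kind of two-cluster visualization described above.
```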
[0067] FIG. 7 illustrates the architecture of the ResNet-50 machine learning model, with input images of size 34 x 34 (resized from the cropped cell images) and binary categorical output. The convolutional layers are initialized with pre-trained weights (e.g., perhaps based on one or more image recognition parameters, etc.) learned from the ImageNet datasets, a method that allows faster training and reduces the requirement of training data. These pre-trained weights are used for feature extraction, where the features extracted by the convolutional layers usually encode multi-scale appearance and shape information. For example, the first convolutional block directly takes the image data as input, extracts features, and provides a feature map as illustrated within the dotted-line box of FIG. 7. Further feature extraction is applied by taking the feature map of the previous convolutional block as input for the next block. After the feature extractions, the pre-trained ResNet-50 is followed by trainable layers that contain a fully connected layer with a ReLU activation function, a dropout layer with a dropout rate of 0.6, and a softmax activation function with a cross-entropy loss implemented to generate the predicted results. The model uses a learning rate of 0.0001 and is optimized by the Adam optimizer. The trainings are processed in mini-batches, with a batch size of 16.
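Because no framework is named herein, the following PyTorch sketch is one possible reconstruction of the described classifier from its stated hyper-parameters (ImageNet-pretrained ResNet-50 features, a fully connected layer with ReLU, dropout of 0.6, softmax with cross-entropy, Adam at a 0.0001 learning rate, batch size 16). The hidden width of 256 and the frozen backbone are assumptions.

```python
# A minimal PyTorch sketch (assumed reconstruction) of the transfer-learning
# classifier described above.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()              # expose the 2048-d feature vector
for p in backbone.parameters():          # use the pre-trained layers as a
    p.requires_grad = False              # fixed feature extractor here

head = nn.Sequential(                    # trainable layers described above
    nn.Linear(2048, 256), nn.ReLU(),     # hidden width 256 is an assumption
    nn.Dropout(p=0.6),
    nn.Linear(256, 2),                   # two classes: WBC vs. CTC
)
model = nn.Sequential(backbone, head)

criterion = nn.CrossEntropyLoss()        # combines softmax and cross-entropy
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

images = torch.randn(16, 3, 34, 34)      # a mini-batch of 16 resized crops
targets = torch.randint(0, 2, (16,))     # placeholder labels
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```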
[0068] FIG. 7 illustrates an example of the architecture of the deep convolutional network, ResNet-50, for transfer learning and CTC-WBC cell classification, and the demonstration of features extracted (within the dotted-line box of FIG. 7) by the first convolutional block. The network receives the input data of cell images and predicts the probability of both classes. The network consists of five stages each containing convolution and identity blocks, and each of the blocks has three convolutional layers. The features of a cell image are extracted by the pre-trained convolutional layers. [0069] As the comparison group, the training on cultured cell lines is based on 1,745 single-cell images (436 cultured cells, 1,309 WBCs). A total number of 120 cells (31 cultured cells and 89 WBCs) are tested. The combined performance has shown that all WBCs have been classified correctly, while 3 out of 31 cultured cells are misclassified as WBCs. The overall accuracy of this learning model is 97.5%. The training on patient blood samples is based on 95 single-cell images as raw input. The cell images originally came from two patients: 15 CTCs from one and 17 CTCs from the other. The training data was enhanced before training by applying data augmentation on the original dataset. The data augmentation may increase the diversity of the original dataset.
[0070] The most popular way to practice data augmentation is to create a selected number of new images by performing traditional affine and elastic transformations. The data augmentation provides a larger dataset, which helps improve the overall robustness of the WBC-CTC classification CNN model without additional labor for the preparation of fluorescent labels. The expanded dataset includes single-cell images with different types of geometric transformations: rotation, shear transformation, and horizontal and vertical reflection. The augmented training dataset for one or more, or each, training experiment contains 1,000 CTCs and 1,000 WBCs.
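A short sketch of the described geometric augmentation (rotation, shear, and horizontal/vertical reflection) is given below using torchvision transforms. The transform ranges, the placeholder crops, and the resampling loop that grows each class to 1,000 images are assumptions.

```python
# A minimal sketch (assumed parameters) of the described data augmentation.
import random
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),          # rotation
    transforms.RandomAffine(degrees=0, shear=15),    # shear transformation
    transforms.RandomHorizontalFlip(p=0.5),          # horizontal reflection
    transforms.RandomVerticalFlip(p=0.5),            # vertical reflection
])

def expand_class(images, target=1000):
    """Resample and transform (C, H, W) tensors until `target` are produced."""
    return [augment(random.choice(images)) for _ in range(target)]

ctc_crops = [torch.rand(3, 30, 30) for _ in range(32)]   # placeholder crops
wbc_crops = [torch.rand(3, 30, 30) for _ in range(63)]
augmented_ctcs = expand_class(ctc_crops)                  # 1,000 CTC images
augmented_wbcs = expand_class(wbc_crops)                  # 1,000 WBC images
```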
[0071] Due to the limited number of CTCs in patient blood samples, K-fold cross-validation is applied for measuring the overall performance. Cross-validation helps avoid performance differences between runs of the learning algorithm caused by the random split of training and testing data. Five-fold cross-validation was used in one or more experiments. The original data is shuffled and divided into five groups, with one group becoming the testing subset and the combination of the others becoming the dataset for training and validation. The training and validation data are then augmented for the training process. The final overall performance of the model is presented as the average of the five runs with different data as the testing set. More details on how the data was split and how training, validation, and testing datasets were obtained are illustrated in FIG. 8.
[0072] FIG. 8 illustrates an example of five-fold cross-validation during training and testing experiments. The original data of single-cell images is shuffled and divided into five non-overlapping subsamples with an equal number of images. One subsample is treated as the testing set in an experiment, and training is performed on the remainder of the dataset. The experiment repeats until each of the five subsamples has been used as the testing set once. In one or more, or each, experiment, the data for the training purpose is augmented and split into training (80%) and validation (20%) subsets, which are then used to fit the model.
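A minimal sketch of this five-fold protocol with scikit-learn is shown below. The placeholder data, the noise-based stand-in for geometric augmentation, and the logistic-regression stand-in for the CNN are assumptions used only to keep the sketch self-contained.

```python
# A minimal sketch of the described five-fold cross-validation protocol:
# each fold is held out for testing once, and the remaining data is augmented
# and split 80/20 into training and validation subsets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.random((95, 900))                 # placeholder: 95 flattened cells
y = rng.integers(0, 2, 95)                # placeholder class labels

def augment(Xs, ys, factor=10):           # stand-in for geometric augmentation
    Xa = np.concatenate([Xs + 0.01 * rng.standard_normal(Xs.shape)
                         for _ in range(factor)])
    return Xa, np.tile(ys, factor)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    X_aug, y_aug = augment(X[train_idx], y[train_idx])
    # The augmented data is split 80/20 into training and validation subsets.
    X_tr, X_val, y_tr, y_val = train_test_split(X_aug, y_aug, test_size=0.2,
                                                random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # stand-in model
    scores.append(clf.score(X[test_idx], y[test_idx]))        # held-out fold

print("mean five-fold accuracy:", np.mean(scores))
```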
[0073] After augmentation of the cell data, the training dataset was visualized by the t-SNE algorithm. The t-SNE plot (top frame of FIG. 9) shows the distribution of the first and second dimensions of the t-SNE map after performing non-linear dimensionality reduction by the algorithm on the training dataset. This t-SNE plot visualizes the high-dimensional image data projected into a two-dimensional space, which helps to understand the overall distribution of the dataset. It can be seen from the output of t-SNE that samples from the two classes (CTCs and WBCs) form largely separate clusters in the two-dimensional space. It is hypothesized that the separation of the two classes holds true in the high-dimensional space as well. That may explain why the trained deep learning model can (e.g., reliably) extract high-dimensional features and perform classification with high accuracy. The results of the deep learning model for cell image classification based on cultured cells and patient blood samples are summarized in the two frames of the second row in FIG. 9, respectively. Furthermore, examples of misclassified and well-classified CTCs and WBCs from the model are shown in the two frames of the third row in FIG. 9.
[0074] The misclassifications could be due to noise or errors in the manual labeling process, and/or the inherent partial overlap between the distributions of the two classes (e.g., the CTCs mixed in the cluster of WBCs, and vice versa, as shown in the t-SNE plot). The averaged learning history over epochs from the five cross-validation experiments of the training and validation can be seen in the bottom two frames of FIG. 9. The curves indicate that the model does not over-fit the problem and that the network converges near the end of the training process. The testing results on cell images of patient blood samples show that the overall accuracy from the five-fold cross-validation is 88.4%, and the F-score, traditionally defined as the weighted harmonic mean of the precision and recall of the result, is 0.89.
[0075] The F-score provides a measure of the overall performance of the model by considering the equal importance of precision and recall. As a comparison, in a recent study, deep learning networks have shown the ability to unlock the hidden information in fluorescent images. The networks could classify fluorescent images of single cells including CTCs with a very high accuracy (96%). Although the bright field images of CTCs described herein may have lower classification accuracy due to the lack of fluorescent label information, the obtained results show (e.g., nice) convergence of the learning curve, and the promising accuracy with (e.g., only) a limited amount of data demonstrates the potential of the proposed approach.
[0076] The receiver operating characteristic (ROC) curve was used to show the performance of the model at one or more, or all, classification thresholds, and the corresponding area under the curve (AUC) value to indicate the performance of prediction on one or more, or each, experiment. The bottom right frame of FIG. 9 shows the total ROC curve and the calculated averaged AUC, 0.923, for the classification of patient blood CTCs and WBCs. The high AUC indicates that the model has been successfully trained to distinguish CTCs from WBCs. The examples of misclassified and well-classified CTCs (the two frames of the third row of FIG. 9) show that the CTC images are either correctly detected or incorrectly classified as WBCs. Therefore, once a bright field image containing WBCs and CTCs is segmented and single-cell images are cropped, the trained model works as a binary classifier for the single-cell images without fluorescent labels. Note that the coordinates of all the cropped single cells in the bright field image are recorded during pre-processing.
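For illustration, the per-experiment ROC curves and the averaged AUC may be computed with scikit-learn as sketched below. The placeholder labels and scores are assumptions standing in for each fold's test labels and the model's predicted CTC probabilities.

```python
# A minimal sketch of computing a ROC curve and the averaged AUC per
# cross-validation experiment with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
fold_aucs = []
for _ in range(5):                               # one ROC per CV experiment
    y_true = rng.integers(0, 2, 19)              # placeholder fold labels
    y_score = np.clip(0.5 * y_true + 0.6 * rng.random(19), 0, 1)  # placeholder
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fold_aucs.append(auc(fpr, tpr))

print("averaged AUC over the five folds:", np.mean(fold_aucs))
```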
[0077] Therefore, after a predictive decision is made by the trained model, label-free CTC count information for the bright field image can be generated when the recorded coordinates and the corresponding predicted cell types are combined, as sketched below. For further enumeration and characterization, this method can be combined with a sorting technique such as acoustic sorting, where the upstream image machine learning results can be used to trigger pulse activation of acoustic forces that sort cells into different channels for isolation and characterization. Such combined label-free image detection and label-free sorting improves cell viability compared to a labelled approach and enables potential culturing of captured cells for personalized drug screening.
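For illustration, combining the recorded coordinates with the predicted cell types to produce a label-free CTC count for one bright field image may be sketched as follows. The record layout and the function names are hypothetical.

```python
# A minimal sketch (hypothetical record layout) of combining the coordinates
# recorded during pre-processing with the classifier's predictions.
def count_ctcs(cell_records, classify):
    """cell_records: list of (x, y, crop) saved during pre-processing;
    classify: trained binary classifier mapping a crop to 'CTC' or 'WBC'."""
    detections = [(x, y, classify(crop)) for x, y, crop in cell_records]
    ctc_positions = [(x, y) for x, y, label in detections if label == "CTC"]
    # The count plus the positions can trigger downstream (e.g., acoustic)
    # sorting of the corresponding cells into separate channels.
    return len(ctc_positions), ctc_positions
```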
[0078] For future work, the CTC count information from RCC patients can also be combined with molecular characterization for clinical applications. It has been shown that single-cell molecular characterization of CTCs from RCC patients can unravel information about clonal evolution. Characterization of CTCs that combines molecular characterization and statistical analysis by CellSearch during therapy can offer important information for the treatment selection for breast cancer patients. However, CTCs from RCC patients cannot be correctly isolated by CellSearch. Perhaps for example if combined with molecular characterization, this label-free method of statistical analysis for CTCs would provide useful information to help choose the optimal therapy for mRCC patients. [0079] FIG. 9 illustrates an example of trained model evaluation. The top frame of FIG. 9 illustrates the t-SNE plot of the training dataset, showing the dimensionality reduction pre-processing for the training dataset. Confusion matrices for classification results of samples from cultured samples are shown in the left frame of the second row of FIG. 9, and (in the right frame of the second row of FIG. 9) patient blood vs. the WBCs. Example misclassified and well-classified CTC images and WBC images are shown in the two frames of the third row of FIG. 9. The learning history of the training and validation at each epoch is shown in the left bottom frame of FIG. 9. The overall ROC-AUC result for WBC and CTC classification by cross-validation is shown in the right bottom frame of FIG. 9. The ROC curve and AUC are the total/average performance of the five training experiments from the cross-validation process. As a comparison, a diagonal dashed line from the bottom left to the top right corners represents the non-discriminatory test.
[0080] As described herein, a deep convolutional neural network may be applied to classify cell images acquired from processed patient blood samples to detect rare CTC cells from mRCC patients. A software toolbox (as described herein) for pre-processing raw images acquired from the microscope was developed to apply Otsu's thresholding, segmentation, and cropping on the images. A manual cropped-cell image selection process then ignores incorrect segmentations and chooses good single-cell images for training the CNN model. Ninety-five images containing single cells from patients are used as the original data, which is the source for training, validation, and testing datasets.
[0081] Data augmentation is applied to expand the training and validation datasets. With the augmented data from the combination of two different patient blood samples and cultured cell images, the learning model yields 88.6% and 97% overall accuracy, on patient blood and cultured cells, respectively. The higher accuracy for the cultured cells indicates the potential of achieving a better learning model with more training images. Expectations include that the proposed method(s) can work as an intelligent label-free detector for rare cells in isolated blood samples. The techniques/processes/methods described herein are data-driven and can be further improved with more data samples.
[0082] Described herein is machine learning based Hematopoietic Stem Cell (HSC) identification and stemness prediction. Hematopoietic stem cells (HSCs) are able to self-renew and differentiate into all blood cell lineages. Several decades of successful HSC transplantations have demonstrated the therapeutic importance of HSCs. Hematopoietic stem cell transplantation (HSCT) is a life-saving procedure for treatment of malignant and non-malignant disorders and is usually a last resort for those with no other treatment options available.
[0083] Image machine learning can also be used to identify the subtype of stem cells and predict their stemness. There are a few outcomes. First, image segmentation and information registration methods can be used to automatically track and record the stemness status and progression of mother and daughter cells. Second, the stemness of the cells can be predicted with the bright field image, phase contrast image, and the properties and behavior of the cells.
[0084] The segmentation process for the stem cells can be achieved by the following process. A bright field stem cell image is filtered by edge detection based on Otsu's method, and then a flood-fill operation is applied to fill the filtered image. Then a morphological opening operation locates one or more, or all, cells and removes one or more, or all, irrelevant spots. The watershed transformation for segmentation can be achieved, where one or more, or each, individual cell may be visualized with a distinct color. Meanwhile, one or more, or each, cell is assigned a number to indicate its identity.
[0085] The stemness of the HSCs will be predicted from the bright field images, phase contrast images, and the properties and behavior of the cells with a ML approach. To achieve this goal, both bright field and fluorescent images of the isolated cells will be used to train the convolutional neural network. Based on the intensity of the fluorescent signal, the input images will be separated into two categories: stem cell and/or non-stem cell images. A certain pixel intensity unit (PIU) threshold will be applied to differentiate the two image groups.
[0086] Besides image features, the stemness of the cells will also be predicted with other features such as properties of the cells. To achieve this goal, images of HSCs may be extracted for training purposes. The true identity of the two daughter cells will be determined from the fluorescent images and provided to the algorithm as the answer sheet.
[0087] In one or more scenarios, differentiation/distinguishing at least four different mouse MPP (multipotent progenitor) populations: MPP1, MPP2, MPP3, and/or MPP4 is contemplated to be achievable using one or more of the techniques described herein.
[0088] In one or more scenarios, differentiation/distinguishing at least three human cell populations in human blood: T cell, B cell, and/or NK cells is contemplated to be achievable using one or more of the techniques described herein.
[0089] To train the machine learning to recognize HSCs and progenitors, two systems were used to isolate hematopoietic precursors, which include HSCs and progenitors. The first system uses the Evi1-GFP HSC reporter system. Ecotropic viral integration site 1 (EVI1) is an oncogenic transcription factor that belongs to the SET/PR domain protein family. The Evi1 locus was initially discovered as a common target of retroviral integration in murine myeloid leukemia. Conditional deletion of Evi1 in adult mice leads to a profound loss of HSC self-renewal activity, but does not affect HSC specification into the blood cell lineage. These findings suggest that EVI1 is essential for HSC self-renewal in adult hematopoietic systems. One or more studies constructed Evi1-green fluorescent protein (GFP) reporter mice (hereafter denoted as Evi1-GFP), and demonstrated that Evi1 is expressed only in BM, where its expression marks a population of hematopoietic cells with long-term multi-lineage repopulating activity.
[0090] The Evi1-GFP reporter was first used to show that it is a specific reporter to mark hematopoietic precursors ex vivo. To isolate hematopoietic precursors from Evi1-GFP transgenic mice, mouse bone marrow cells were (e.g., first) harvested and GFP+ cells were FACS (fluorescence activated cell sorting) sorted from Evi1-GFP transgenic mouse bone marrow. A confocal microscope was used to collect images of GFP+ and GFP- cells, which correspond to HSCs and progenitors, respectively. The second system uses surface marker staining to isolate HSCs and progenitors. It is a well-established approach to stain mouse bone marrow cells with fluorescence-labeled antibodies to lineage markers and the Sca1, c-kit, CD150, and CD48 markers.
[0091] Hematopoietic precursors can be distinguished from committed progenitors by different combinations of surface markers. LSK (Lin-Sca1+c-kit+) cells are mostly hematopoietic precursors, while Lin- (Lin-Sca1-c-kit-), LS (Lin-Sca1+c-kit-), and LK (Lin-Sca1-c-kit+) cells are mostly committed progenitors. Furthermore, long-term (LT) HSC, short-term (ST) HSC, and multi-potent progenitor (MPP) were defined by different combinations of CD150 and CD48. Specifically, LSK CD150+CD48- cells are mostly LT-HSCs, LSK CD150-CD48- cells are mostly ST-HSCs, whereas LSK CD150-CD48+ cells are mostly progenitors. Using FACS sorting, one or more, or all, of the surface-marker-defined populations were sorted out.
[0092] Data processing, machine learning training, and prediction are described herein.
[0093] At least one task is the characterization of the stemness level of each cell in the bright field. An objective is to use state-of-the-art deep learning techniques to analyze the raw HSC image data for the discovery of the distinction of HSC populations. Tasks can be divided into two levels. The first task focuses on the distinction between the lineage negative populations, where there are four subsets that differ in the intensity of the cell markers Sca1 and c-kit: Lin- (Lin-Sca1-c-kit-), LS (Lin-Sca1+c-kit-), LK (Lin-Sca1-c-kit+), and LSK (Lin-Sca1+c-kit+). The second task is the differences between HSC and non-HSC within the LSK population: three subsets (Long-Term (LT-HSC), Short-Term (ST-HSC), and Multi-Potent Progenitor (MPP)) that are characterized equivalently by two methods: surface marker and GFP.
[0094] One or more techniques described herein can tell the differences within the HSC population, which may contain at least three subsets: LT-HSC, ST-HSC, and MPP cells. The distinction between the lineage negative populations, where there are four subsets that differ in the intensity of the cell markers Sca1 and c-kit (Lin- (Lin-Sca1-c-kit-), LS (Lin-Sca1+c-kit-), LK (Lin-Sca1-c-kit+), LSK (Lin-Sca1+c-kit+)), can also be identified.
[0095] A toolbox is used to crop and segment individual cell images from microscopic images. For one or more, or all, types of datasets, there may be two different scenarios: the cell population is sparse in the view, where accurate detection may be more important, and/or the cell population is dense, where wrong crops are to be avoided to reduce post selection. To cover at least both scenarios, a computer script is established to finish (e.g., at least two steps) of preprocessing. For example, cropping the single cell from DIC views may be performed. In one or more scenarios, perhaps for example when the cells are at a denser population per view, characterizing the cell targets and/or removing the outliers (debris, merged cells, etc.) after applying a size filter may be performed, and/or a uniqueness check analyzing the diameters and the coordinates of the detected cells may be performed, as sketched below. The size filter is applied by a preset cell size threshold according to the pixel unit of the DIC view. The uniqueness check may evaluate the pairwise distances between cells and may decide/determine which cells may have been detected in duplicate. The algorithm may eliminate these outliers such that the image crops might not contain images from the same cell, for example.
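For illustration, the size filter and the pairwise-distance uniqueness check may be sketched as follows. The pixel thresholds are hypothetical presets, since the actual values depend on the pixel unit of the DIC view.

```python
# A minimal sketch (hypothetical thresholds) of the described post-detection
# filters: a preset size gate and a pairwise-distance uniqueness check.
import numpy as np

def filter_detections(centers, diameters, min_d=8, max_d=40, min_sep=10):
    centers = np.asarray(centers, dtype=float)     # (N, 2) cell coordinates
    diameters = np.asarray(diameters, dtype=float) # (N,) estimated diameters

    # Size filter: discard debris (too small) and merged cells (too large).
    keep = (diameters >= min_d) & (diameters <= max_d)
    centers, diameters = centers[keep], diameters[keep]

    # Uniqueness check: if two detections are closer than min_sep pixels,
    # keep only the first, treating the second as a duplicate of the same cell.
    unique = []
    for i, c in enumerate(centers):
        if all(np.linalg.norm(c - centers[j]) >= min_sep for j in unique):
            unique.append(i)
    return centers[unique], diameters[unique]
```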
[0096] Cropping the single cell from bright field images is illustrated in FIG. 10. In one or more scenarios, when the cells are at a denser population per view, characterizing the cell targets and/or removing the outliers (debris, merged cells, etc.) may be performed after applying size and uniqueness filters with analysis of the diameters, as illustrated in FIG. 11.
[0097] FIG. 10 illustrates an example of single cell images cropped from bright field images.
[0098] FIG. 11 illustrates the cropping results after applying the size filter (the top left frame of FIG. 11), after applying both the size filter and the uniqueness operation (the top right frame of FIG. 11), and a size distribution characterization according to the crops (the bottom frame of FIG. 11).
[0099] FIG. 12 illustrates the cropped single cells after normalization actions for the training purposes, among other scenarios.
[00100] FIG. 13 illustrates typical data samples from batch 0507 (LSK / Lin-, 176 cells), and 0517 (GFP, 522 cells). The cropped single cells may be stored in the corresponding category.
[00101] FIG. 14 is an example illustration of subpopulations of stem cells. The cell images are cropped into 72 x 72 and 48 x 48 or other sizes from the original dataset. The image dimensions and the background brightness and contrast are normalized into the same size before training experiments. At least 20% of the total data are randomly sampled as the testing subset, to which a training procedure might never be exposed. The remaining 80% of each class are used as the training materials.
[00102] At least one deep learning classifier may be trained to predict the type of the single-cell data based on the ResNet50 model as the pre-trained convolutional layers. A pretrained network that has been trained on a big public database, ImageNet, is used for transfer learning. The convolutional layers in the framework are then fine-tuned/calibrated, with the following fully-connected layers trained from scratch. The whole framework becomes the architecture of the model. The model then fits the training dataset to learn the distinctions between target cell images. The trained model is used to predict the class of new image data, and the overall performances are evaluated for validation.
[00103] A convolutional neural network (CNN) can be used to train on and predict the image data. Pre-trained networks and/or transfer learning can be used on the image data. A multi-modal image classifier can be built up by learning cell images at different locations. The information collected during the experiment can have different types of data representation: image information (multi-modal image matrices), and/or physical properties (numeric information). A mixed model, such as a CNN model for handling static cell images and/or an MLP model to encode physical properties in the form of numeric data, might be built, as sketched below.
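A minimal PyTorch sketch of such a mixed model is shown below: a small CNN branch encodes the static cell image, an MLP branch encodes the numeric physical properties, and the two feature vectors are fused before classification. All layer sizes and the number of numeric properties are illustrative assumptions.

```python
# A minimal sketch (assumed layer sizes) of a mixed CNN + MLP model that
# fuses image features with numeric physical properties.
import torch
import torch.nn as nn

class MixedModel(nn.Module):
    def __init__(self, n_props=4, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(                 # image branch
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mlp = nn.Sequential(                 # numeric-properties branch
            nn.Linear(n_props, 16), nn.ReLU())
        self.head = nn.Linear(32 + 16, n_classes) # classifier on fused vector

    def forward(self, image, props):
        return self.head(torch.cat([self.cnn(image), self.mlp(props)], dim=1))

model = MixedModel()
logits = model(torch.randn(8, 1, 48, 48), torch.randn(8, 4))  # batch of 8
```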
[00104] FIG. 15 illustrates an example of a Deep Learning model workflow for one or more techniques described herein. The top of FIG. 15 illustrates one or more phases/elements of a convolutional neural network. The bottom of FIG. 15 illustrates a generalization of one or more procedures of one or more DL techniques. Table 1 illustrates an example data set. TABLE 1: Example Data Set
[00105] In one or more scenarios, for the HSC and MPP cells, the differentiation between the most-populated HSC subset (LT-HSC) and the non-HSC cells (in this case, for example, MPP) may be the subject of focus. The situation for this task was validated on the old data, where some preliminary results were obtained showing that an accuracy of around 85% can be achieved.
[00106] From the recent results, the accuracy is boosted up to 97% based on the new camera data from two batches. The PCA plot illustrated in FIG. 16 shows that there are clear boundaries in the clusters of HSC/MPP but weaker boundaries between the two types of HSC. The confusion matrix and learning curve illustrated in FIG. 17 show that the discriminative model between HSC vs. MPP has excellent results (e.g., almost perfect), and the overall model is also very promising.
[00107] FIG. 16 illustrates the one or more (e.g., principal) components of three subsets of HSCs versus non-HSC: ST-HSC, LT-HSC, and MPP. FIG. 17 illustrates an example confusion matrix for the HSC 3-classes. Table 2 illustrates example performances of the model(s) on the three-class model for HSC/non-HSC.
TABLE 2: Example Performances of the Model(s)
[00108] One or more techniques may include/use the lineage negative population. The variation of the four subsets might not be (e.g., substantially) clear cut, but may include one or more preset boundaries. Among the four categories, LSK is the most favorable and least populated one due to its significant stemness. On the contrary, the cells contained in the LIN- subset are the most populated and heterogeneous ones. Expectations may include the LIN- cells having the most separated feature clustering, while the LSK features may be more focused. PCA and t-SNE analysis was conducted for the given dataset.
[00109] From the t-SNE analysis among the 50 primary components of the image data, it can be seen that there are some boundaries that are discriminative of the four subsets. The t-SNE plot shows a weak discrimination between the four subsets, and thus they are harder to differentiate by the model. FIG. 18 is an example illustration of a t-SNE plot of the four subsets of image data.
[00110] The training experiments based on the current image data have led to an overall accuracy of 76%. More LS and LK data can be used to validate the model learning the lineage negative population when a more balanced dataset may be obtained. FIG. 19 illustrates an example confusion matrix and learning history of the training experiments.
[00111] Given the good results from the HSC vs. non-HSC task, something can be learned from the model that can differentiate the three types of cells. The features being learned by the model can be visualized by showing the final fully connected layers. The distributions of the final layers could show the features after extraction by the model. The kernels learned in between are the abstract feature extractors that work for the specific data.
[00112] Preparation of HSC subpopulations and collection of image data for DL is described herein. To explore whether DL can be used to distinguish different subsets of HSCs based on their morphology, HSC/MPP subpopulations were (e.g., first) sorted from murine BM by fluorescence-activated cell sorting (FACS). A well-established surface marker combination including LSK and SLAM markers (illustrated in the frames of the top three rows of FIG. 20) was used, and three fractions were sorted out: LT-HSCs (LSK CD150+CD48-), ST-HSCs (LSK CD150-CD48-), and MPPs (LSK CD150-CD48+) (illustrated in the top frames of FIG. 21). Those cells were seeded in culture chambers with glass bottoms, and DIC images were taken as well as confocal fluorescence images, as illustrated in the frames of the top three rows of FIG. 20 and the middle frame of FIG. 21. The distinctive fluorescence of CD150, c-Kit, and Sca-1 on LT-HSCs was found to corroborate the accuracy of FACS sorting. The accuracy rate was consistently above 96% in this study (illustrated in the middle frame of FIG. 21). In DIC images, most HSCs (~95%) have a spherical shape (illustrated in the middle frame of FIG. 21), while the remaining 5% are irregular or polymorphic.
[00113] In all three populations, the cell membranes of most cells appear rough, but some are relatively smooth. No cell population-specific morphological features can be identified by the human naked eye. In the MPP group, there tend to be larger-sized cells with diameters greater than 15 µm (the average size of whole BM cells described herein is around 6.8 µm as calculated by a Luna II cell counter). The cell size distributions of the three populations may be largely overlapped (illustrated in the bottom frame of FIG. 21), suggesting that differentiation of cell types solely based on cell sizes is not reliable.
[00114] FIG. 20 illustrates an example overview of one or more experiments. The frames of the top three rows of FIG. 20 illustrate an overall workflow from sample preparation to (e.g., final) results, including sample collection, image acquisition, training procedures, and model testing. The frames of the bottom row of FIG. 20 illustrate a DL framework and data flow during training experiments. The DL network(s) contain two parts: convolutional layer(s) and fully-connected (FC) layers. The convolutional layers were fine-tuned/calibrated with a low learning rate, and the following fully-connected layers were trained with a regular learning rate. In this way, the hyper-parameters from the training experiments were tuned/calibrated with cross-validation. With the proper hyper-parameters found, the whole training dataset was used for training. At the prediction stage, the model was tested on image data from different markers.
[00115] FIG. 21 illustrates FACS sorting of murine HSCs and MPPs using LSK/SLAM markers and cell imaging. The top four frames of FIG. 21 illustrate representative FACS density dot plots showing the gating strategy employed to identify and isolate LT-HSCs, ST-HSCs, and MPPs from BM. The middle frames of FIG. 21 illustrate DIC and fluorescence images taken (e.g., immediately) after FACS sorting. Representative images are shown. Scale bar = 10 µm. The bottom four frames of FIG. 21 illustrate cell size distributions of sorted HSCs and MPPs on FACS plots.
[00116] Development of a DL model to distinguish LT-HSCs, ST-HSCs, and MPPs that were sorted with LSK/SLAM markers is described herein. To build the one or more DL model(s) to distinguish HSCs and MPPs, one or more steps were taken: image data collection, image data processing, model training, and evaluation (illustrated in the frames of the top three rows of FIG. 20). At least one objective of data processing was to establish a training dataset. An image processing toolbox was customized to segment one or more DIC images into single-cell image crops, and then the crops were used as training image data with their corresponding cell types as ground truth class labels (illustrated in the four frames of the bottom of FIG. 20).
[00117] Model training began with 4,050, 7,868, and 9,676 cell inputs from LT-HSCs, ST-HSCs, and MPPs, respectively. During the training, data augmentation was applied to enhance data diversity and avoid overfitting. To deal with dataset imbalance, oversampling was practiced to balance the significance of the minority subset, in this case the LT-HSCs; a sketch is given below. One or more models were trained based on a CNN (illustrated in the four frames of the bottom of FIG. 20). To obviate the need for big datasets, transfer learning was practiced using filters learned on large, annotated datasets from ImageNet, which can be reused for a new purpose where data may be limited.
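One common way the described oversampling may be practiced is with a weighted sampler that draws minority-class (LT-HSC) examples more often, as sketched below. The sampling method and the scaled-down placeholder class counts are assumptions, since the exact balancing scheme is not specified herein.

```python
# A minimal sketch (assumed method) of oversampling the minority class with a
# weighted sampler so that each mini-batch is approximately class-balanced.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

counts = [405, 787, 968]                 # scaled-down stand-ins for the
                                         # LT-HSC / ST-HSC / MPP class sizes
images = torch.randn(sum(counts), 1, 64, 64)               # placeholder crops
labels = torch.cat([torch.full((n,), i, dtype=torch.long)
                    for i, n in enumerate(counts)])

weights = (1.0 / torch.bincount(labels).float())[labels]   # rarer class gets
sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)          # a higher weight
loader = DataLoader(TensorDataset(images, labels), batch_size=32,
                    sampler=sampler)                       # balanced batches
```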
[00118] The convolutional layers in the framework were fine-tuned/calibrated, with the following fully connected layers trained from scratch. One or more models were designed as a multi-class classifier that fits the images in the training dataset to learn the distinctions between the classes labeled as LT-HSC, ST-HSC, and MPP. The trained model was applied to predict the type of new cell image data, and the overall performance was evaluated (illustrated in the four frames of the bottom of FIG. 20).
[00119] One or more DL models predict HSCs and MPPs with high accuracy. After training the DL model(s) with the single-cell inputs from LT-HSCs, ST-HSCs, and MPPs, the model(s) were fed with unseen single-cell inputs from the validation datasets of LSK/SLAM sorting. For example, five-fold cross-validation was leveraged to find the best hyper-parameters for the model's training. In each run of the cross-validation, one-fifth of the dataset was withheld for validation and the rest was used for training. Iterative training of the model was used to tune/calibrate the hyper-parameters over five runs, such that one or more, or each, example was used for validation at least once. The performance of the model was evaluated by the mean overall accuracy over the five runs with different data as the validation subset. The final model(s) were trained with the best hyper-parameters and validated on a 20% holdout.
[00120] The confusion matrix (illustrated in the top left frame of FIG. 22) concludes with 74% overall accuracy with default thresholding, which gives the prediction class according to the highest score. The top right frame of FIG. 22 illustrates the performance metrics including recall, precision, and F1-score for the three different classes. A ROC-AUC curve (illustrated in the bottom left frame of FIG. 22) indicates that the model successfully learned cell features at various threshold values.
[00121] The decision threshold on the output probabilities was dynamically adjusted to examine the change in the prediction. With different decision thresholds applied, the number of total data points and the corresponding well-classified counts (illustrated in the bottom middle frame of FIG. 22) are shown as solid curves, and the overall accuracy is calculated accordingly, reported as the dashed line. The model may perform better with increased threshold values. When the probability gate was set to 0.6, the overall accuracy rose to 85%. A high decision gate reduces the false positive rate, but it also reduces the number of HSCs being classified. With a threshold value of 0.8, the cell count drops to 1,000, but the prediction accuracy jumps to 96%. This strategy is useful in applications where a small number of HSCs need to be selected for transplant.
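A short sketch of this decision-threshold sweep is given below: only cells whose top softmax probability clears the gate are kept, trading the number of classified cells for accuracy. The placeholder predictions and labels are assumptions used only to make the sketch self-contained.

```python
# A minimal sketch of confidence gating: accuracy is computed only over cells
# whose highest predicted probability meets the decision threshold.
import numpy as np

def gated_accuracy(probs, y_true, threshold):
    """probs: (N, 3) softmax outputs; y_true: (N,) integer class labels."""
    keep = probs.max(axis=1) >= threshold            # confident cells only
    if not keep.any():
        return 0.0, 0
    correct = probs[keep].argmax(axis=1) == y_true[keep]
    return float(correct.mean()), int(keep.sum())

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=500)          # placeholder predictions
y = np.where(rng.random(500) < 0.8,                  # placeholder labels that
             probs.argmax(axis=1),                   # mostly agree with the
             rng.integers(0, 3, 500))                # model's top choice

for t in (0.34, 0.6, 0.8):
    acc, n = gated_accuracy(probs, y, t)
    print(f"gate {t}: accuracy {acc:.2f} over {n} classified cells")
```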
[00122] The model performances were investigated at various sizes of training datasets by dynamically changing the sample size and registering the corresponding model's overall accuracy (illustrated in the bottom right frame of FIG. 22). To begin with, 80% of the total dataset was randomly sampled to serve as the full-scale training set and the remainder as the validation set. The model was then incrementally trained with a fraction of the training data until the entire training set was used (10 sample fractions from 0.1 to 1.0). In this way, five (5) iterations were practiced, and the change in the model's mean performance over these iterations was plotted (illustrated in the bottom right frame of FIG. 22). Results showed that the model performance is positively correlated with training sample size. However, the correlation was not linear. When 80% of the training set was used, the model's predictive performance was nearly as good as that after training with the entire training set.
[00123] FIG. 22 illustrates a summary of the model's performance on the holdout validation data of LSK/SLAM sorting. The top left frame illustrates a confusion matrix - the prediction by the model on the validation hold-out for different classes. The top right frame illustrates the recall, precision, and F1-score of the trained model. The bottom left frame illustrates a ROC-AUC curve. This curve reports the relationship between the true-positive rate and false-positive rate for the three classes and their micro and macro averages. The bottom middle frame illustrates the accuracy and remaining data point count under different thresholds. As the threshold gating increased from the baseline (0.33), the total number of cell counts filtered by the threshold is shown as the "correct" and "total" curves. The dashed curve shows how the overall accuracy changes. The bottom right frame illustrates how the DL model's performance is positively correlated with the training sample size. When given 10% to 100% of the training set in five iterations of training experiments with 20 epochs, the model's performance changes accordingly. The mean and the corresponding 95% confidence interval of the best overall accuracy from the training experiments are shown.
[00124] Interpretation of the DL model(s) is described herein. After training, the DL model(s) appeared to have learned intrinsic, cell-type-specific morphological features. To address the issue, principal components analysis (PCA) was (e.g., first) practiced to visualize the data points by dimensionality reduction before and after DL model training. The trained CNN operated on the input images, during which high-dimensional information (64 x 64) was reduced into a two-dimensional space. Compared with the original data distributions, cell-type-specific clusters were formed before the fully connected layers were ready to perform prediction (illustrated in the top two frames of FIG. 23). A class activation map (CAM) was constructed from the convolutional layers (illustrated in the bottom three frames of FIG. 23). CAMs are commonly used to explain how a DL model learns to classify an input image into a particular class. A Score-CAM was applied: a visual explanation method that may utilize perturbation over an input image and may record the corresponding responses in the model(s)' output.
[00125] When the DL model(s) received single-cell input images, the perturbation occurring in the regions essential to the model’s reception caused a significant change in the model’s prediction. In such a way, the Score-CAM produced a heatmap of the class-discriminative features for the cell images, in which high intensity represented regions attracting strong attention from the one or more DL model(s) (illustrated in the bottom three frames of FIG. 23). Obtained results indicated that cellular morphological features necessary for classification had been extracted from the original images.
[00126] FIG. 23 illustrates an interpretation of the one or more DL model(s). The top left and right frames illustrate data clustering generated by PCA before and after data training. The bottom frames illustrate a visual explanation of the one or more DL model(s) with Score-CAM. Cells from the three classes were randomly selected and their attention heatmaps in the one or more DL model(s) were produced by Score-CAM, a visual explanation method that utilizes perturbation over an input image and records the corresponding responses in the model's output. Given an input image, the perturbation occurring in the regions useful (e.g., essential) to the model's reception may lead to a significant change in the model's prediction, which may translate to a strong activation intensity. When the model processes the input images, the Score-CAM may produce the class-discriminative visualization for the cell images from different classes. In one or more scenarios, some regions may receive higher attention (e.g., the center/darker regions) from the model(s), while other regions (e.g., the peripheral/lighter regions) may receive less attention from the model(s).
[00127] For example, when training the model, the model’s robustness was analyzed with different training datasets used. ScoreCAM and feature distribution plots were used to interpret how the model understood/processed the images. The model’s output probability was used to reconstruct the given DIC view, in which the cells were assigned the scores given/output by the model to see how they matched with the fluorescence intensities, among other reasons.
[00128] Model training (e.g., supervised training) may include a Decision Tree rule, a Random Forest rule, a Support Vector Machine rule, a Naive Bayes Classification rule, and/or a Logistic Regression rule.
[00129] One or more DL model(s) may distinguish LT-HSCs, ST-HSCs, and MPPs that were sorted by a different set of surface markers. LSK/CD34/CD135 is another well-established set of surface markers to identify murine HSCs. To demonstrate the robustness and the generalizability of the one or more DL model(s), they were tested on HSCs and MPPs that were sorted with the new markers (illustrated in the top left frame of FIG. 24). Images of these cells might not (e.g., might never) have been used in DL model training. Cell images were processed to generate single-cell validation datasets as previously described. One or more DL models used herein maintained a high accuracy when predicting these new data. As shown in the top right frame of FIG. 24, the overall accuracy of prediction is 73%, and the precision and recall for the LT-HSC group are 0.86 and 0.73, respectively. These results suggest that the cellular morphological features extracted by the one or more DL model(s) are intrinsic to cell types and independent of surface markers.
[00130] The one or more DL model(s) may distinguish LT-HSCs, ST-HSCs, and MPPs that were sorted by intracellular reporters. To further exclude the possibility that surface marker staining somehow tagged HSCs and MPPs differently to allow differentiation by the one or more DL model(s), HSCs sorted from the BM of two transgenic mouse models were used, both of which have an HSC-specific intracellular GFP reporter. The first mouse model is the α-catulin-GFP mouse. α-catulin is a protein with homology to α-catenin and has been found to be expressed predominantly in mouse HSCs. In the α-catulin-GFP mouse, the GFP coding sequence is knocked into the α-catulin gene locus; therefore, GFP expression is under the control of the α-catulin gene promoter. In α-catulin-GFP mice, α-catulin-GFP+ c-Kit+ cells are highly enriched HSCs and are almost exclusively CD150+CD48-, which is comparable to the LT-HSC population described above.
[00131] As anticipated, when the DL models were used to predict LSK/α-catulin-GFP+ cells in the DIC images (illustrated in the bottom left frames of FIG. 24), the model identified most cells as LT-HSCs (74%), 14.0% as ST-HSCs, and 12% as MPPs (illustrated in the bottom right frame of FIG. 24). Taken together, the results obtained herein demonstrate that after proper training, the one or more DL models used herein have become a reliable classifier whose function is sensitive to neither surface marker stain nor GFP reporter expression.
[00132] FIG. 24 illustrates an example of the one or more DL model(s)-based classification of HSCs/MPPs sorted with LSK/CD34/CD135 surface markers and α-catulin-GFP. The four top left frames of FIG. 24 illustrate representative FACS density dot plots showing the gating strategy employed to identify and isolate LT-HSCs, ST-HSCs, and MPPs from BM using LSK/CD34/CD135 surface markers. The top right frame of FIG. 24 illustrates the performance of the DL model in predicting cell types of HSCs/MPPs from the four top left frames of FIG. 24, gauged by its precision, recall, and F1-score. The two bottom left frames of FIG. 24 illustrate example DIC and fluorescence images of LSK/α-catulin-GFP+ cells that were taken (e.g., immediately) after FACS sorting. Representative images are shown. The scale bar = 10 µm. The bottom right frame of FIG. 24 illustrates that a majority of LSK/α-catulin-GFP+ cells were classified as LT-HSCs by the one or more DL model(s).
[00133] Evi1 is a transcription factor of the SET/PR domain protein family and has been shown to play a critical role in maintaining HSC stemness. In Evi1-GFP mice, where the GFP reporter is controlled by the Evi1 gene promoter, GFP fluorescence is mainly restricted to LT-HSCs, ST-HSCs, and MPPs. One or more DL models used herein were challenged with single-cell inputs derived from the images of LSK/Evi1-GFP+ cells that were sorted from the BM of Evi1-GFP mice. Being a three-way classifier, the one or more DL model(s) classified those cells into three categories (LT-HSC, ST-HSC, and MPP) as anticipated. The certainty of each decision-making varies greatly, as evidenced by the confidence score of each cell classification, which can be anywhere between 0.34 and 1.0 (illustrated in the four top frames and the three middle frames of FIG. 2).
[00134] Based on one or more models' classifications, the percentage of high-GFP cells (fluorescence unit > 3000) in HSCs is much higher than that in MPPs (LT-HSCs:ST-HSCs:MPPs), as illustrated in the bottom three frames of FIG. 2. The difference in GFP fluorescence intensity became more significant when the confidence score was increased from 0.34 to 0.5 or 0.8. As shown in the bottom right frame of FIG. 22, increasing the confidence score threshold greatly improves the accuracy of classification of the one or more DL model(s). It was found that at a confidence score of 0.8, the top 3% of cells with the highest GFP levels were exclusively LT-HSCs or ST-HSCs. This finding is consistent with FACS results.
[00135] FIG. 2 illustrates a model tested on Evi1-GFP+ populations. The four frames of the top row of FIG. 2 illustrate the DIC image and the corresponding fluorescent label, the predicted cell type, and the predicted probability for each cell from the deep learning model. The middle three frames of FIG. 2 illustrate the model's output given the cell image crops from the original DIC image. The corresponding predicted types and the model's probability output are shown in the three frames of the bottom of FIG. 2.
[00136] DL model’s prediction was confirmed by functional transplantation Functional transplantation guided by the findings of the one or more DL models. The results shown in FIG. 2 suggest that in a given LSK/Evil-GFP+ cell population, the higher GFP level, the more likely the cell is a HSC. For the top 3% cells with the highest GFP level, the chance of being a HSC is nearly 100%. To confirm this finding, a competitive reconstitution experiment was performed with FACS-sorted top 3% high-GFP or bottom 3% low-GFP cells from LSK/Evil-GFP+ cells. Transplantation of 5 or 10 high-GFP or low-GFP cells (CD45.2) along with 3x105 wild type (WT) “competitor” cells (to provide short-term radioprotection to each mouse) was made into lethally irradiated recipient mice (CD45.1). After 4 months, BM was harvested from the transplanted mice and measured chimerism (e.g., percentage of donor-derived cells).
[00137] The number of chimeric-positive mice, defined by convention as >1% donor-derived (CD45.2) cells in either the bone marrow or peripheral blood, was significantly higher for the high-GFP group (5/5 mice for 5 cells and 5/5 for 10 cells; P < 0.01, unpaired t test) compared to the low-GFP group (0/5 and 0/4 for 5 cells and 10 cells). The degree of chimerism for the high-GFP 5-cell group (mean, 8.116%) and 10-cell group (mean, 13.67%) was substantially higher than that for the low-GFP 5-cell group (mean, 0.0348%) and 10-cell group (mean, 0.0635%). These transplantation results substantiate the usefulness of the one or more DL model(s) in HSC study.
[00138] Clinically, HSCs/MPPs are the most relevant component of bone marrow transplants, which are a mainstay of life-saving therapy for treating patients with leukemia and congenital blood disorders. HSC/MPP research heavily relies on the separation of HSCs and MPPs, and FACS sorting is the only technology available to do so. FACS is a powerful tool that has great applications in immunology, stem cell biology, bacteriology, virology, and cancer biology, as well as in clinical diagnosis. The technology has made dramatic advances over the last 30 years, allowing unprecedented studies of the immune system and other areas of cell biology. However, this technology has several key drawbacks, as it requires antibody staining and lasers as light sources to produce both scattered and fluorescent light signals. It is well known that both antibody staining and lasers can impair cell viability and stem cell activity.
[00139] A new and more gentle sorting technology may be useful to facilitate HSC research. The one or more DL-based platform(s) use an antibody label-free and laser-free approach to identify distinct subpopulations of HSC/MPP with high accuracy. It provides a proof-of-principle demonstration that DL can recognize very subtle morphological differences between subpopulations of HSC/MPP from their light microscopic images. This technology might have broader applications to identify and isolate other cell populations in the hematopoiesis system. It may provide a basis for developing the next generation of label-free cell sorting approaches.
[00140] A high-quality training dataset is useful (e.g., essential) to successful DL model training. Using a processing toolbox (as described herein), accurate marking and preparation of the training and validation datasets was made. With the image datasets derived from LSK/SLAM sorting, one or more DL models were trained, and the resulting DL model(s) were able to classify a particular cell, which the model(s) had never seen, into one of the three categories (LT-HSC, ST-HSC, MPP) with high accuracy. Results obtained herein showed that distinct morphological features of each cell population exhibited in light microscopic images were extracted by the fine-tuned/calibrated convolutional layers. Although the convolutional network pre-trained on ImageNet has had great success in general multi-class image classification tasks, the pre-trained parameters needed to be adjusted for the purpose of cell image classification, during which a proper selection of learning procedures for the convolutional layers was important.
[00141] A higher learning rate could cause overfitting issues, but a low learning rate would lead to unacceptably slow training experiments. The proper ratio of learning rates for the convolutional layers and FC layers was selected by cross-validation. It was found that when the convolutional layers were trained with 1% of the learning rate for the FC layers, the validation gave the most preferable and/or robust results. The ROC-AUC curve in the bottom left frame of FIG. 22 indicated that the model has high robustness and precision. The model's interpretation shown in FIG. 23 also suggested that the DL model evaluated the input image and made the corresponding prediction by focusing on image features in cellular structures, indicating that the discriminating differences between each cell type were learned by the one or more DL model(s).
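By way of illustration only, the ratio search may be sketched as follows. This is a minimal sketch, assuming a cross_validate helper that runs one k-fold training experiment for a given ratio and returns mean validation accuracy; the candidate ratios and all names are illustrative rather than the exact procedure used herein.

```python
# Minimal sketch of selecting the conv-layer learning rate as a fraction of
# the FC-layer rate; `cross_validate` is an assumed helper (one k-fold run).
from typing import Callable, Dict, Tuple

FC_LR = 5e-4                          # FC-layer learning rate reported herein
CANDIDATE_RATIOS = (1.0, 0.1, 0.01)   # 0.01 (i.e., 1%) gave the best validation

def pick_conv_lr(cross_validate: Callable[[float], float]) -> Tuple[float, Dict[float, float]]:
    scores = {r: cross_validate(r) for r in CANDIDATE_RATIOS}
    best = max(scores, key=scores.get)
    return FC_LR * best, scores       # chosen conv-layer LR and all scores
```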
[00142] The accuracy of classification can be further improved with more training data. In biomedical image applications, where inadequate data is a common challenge, an investigation of the impact of training sample size enables cost-effective experiments and estimates the confidence in the generalizability of a model trained on a limited amount of data. From experiments on the model performance at various sample sizes (shown in the bottom right frame of FIG. 22), it was found that when given greater than 50% of the dataset, the model started to perform reasonably well. Increasing the scale of the dataset also reduced the uncertainty of the model's performance, indicating that larger datasets tended to produce training experiments with lower variance. To further examine the model's performance, one or more DL systems may be trained with more image data to improve the accuracy of prediction and identification.
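By way of illustration, such a sample-size study may be sketched as follows. This is a hedged sketch: the train_and_score helper, the fractions, and the repeat count are assumptions standing in for full training/validation runs on data subsets.

```python
# Minimal sketch of a training-sample-size study: retrain on growing fractions
# of the data, with repeats, and record mean accuracy and its spread.
import random
import statistics
from typing import Callable, Dict, Sequence, Tuple

def sample_size_curve(examples: Sequence, train_and_score: Callable[[list], float],
                      fractions=(0.25, 0.5, 0.75, 1.0), repeats=3) -> Dict[float, Tuple[float, float]]:
    curve = {}
    for frac in fractions:
        n = int(len(examples) * frac)
        scores = [train_and_score(random.sample(list(examples), n))
                  for _ in range(repeats)]
        curve[frac] = (statistics.mean(scores), statistics.pstdev(scores))
    return curve   # fraction -> (mean accuracy, spread across repeats)
```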
[00143] When supervised-learning DL models are trained and tested, the training and test subsets are commonly assumed to follow a similar distribution. However, in real applications, the joint distribution of inputs to the model can vary between training and test data, a situation known as covariate shift. To further estimate the robustness of the results, the model was tested on input HSC image data classified via other marker combinations. These data samples were prepared at different times and/or from different batches, and they were not used during the training experiments. The test results (e.g., FIG. 2 and FIG. 24) suggested that the model was robust to dataset shifts from various sampling sources. The model's performance on the HSPCs sorted via CD34 was tested; the precision of the model on this unseen dataset matched the model's performance on the validation subset from LSK-SLAM.
[00144] The model(s) were also tested on the samples from α-catulin sorting, and a recall of 74% was obtained. Single-cell image crops from Evi1-GFP+ cells were collected, and the model's predictions were applied to this dataset with a gating threshold of 0.5. The results (e.g., FIG. 2) showed that the model predicted that 90% of them are HSCs. Interestingly, although the model was not trained to predict fluorescence intensities, it was observed that the distribution of the probability output of the model for each cell subpopulation approximately matched the distribution of the fluorescence intensities (illustrated in the top four frames of FIG. 2). This indicates that it might be possible to use DL for in silico fluorescence staining of a cell from a DIC image.
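By way of illustration, the gating step may be sketched as follows. The two-class stand-in network, the class index chosen for "HSC", and the helper name are illustrative assumptions, not the trained model described herein.

```python
# Minimal sketch of gating the model's probability output at a 0.5 threshold.
import torch
import torch.nn as nn

def hsc_fraction(model: nn.Module, crops: torch.Tensor,
                 hsc_index: int = 0, threshold: float = 0.5) -> float:
    """Fraction of cells whose predicted HSC probability exceeds the gate;
    class index 0 = HSC is an illustrative assumption."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)[:, hsc_index]
    return (probs > threshold).float().mean().item()

# Demonstration with a stand-in model and random 64x64 crops:
demo_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))
demo_crops = torch.rand(100, 1, 64, 64)
print(f"gated HSC fraction: {hsc_fraction(demo_model, demo_crops):.2f}")
```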
[00145] One or more morphological features may serve as important features for DL to distinguish LT-HSC, ST-HSC, and MPP. It is well known that LT-HSC, ST-HSC, and MPP have different self-renewal capabilities. Morphologically, how LT-HSC, ST-HSC, and MPP differ is an important biological question. One or more scenarios contemplate using perturbations of several morphological features of HSC/MPP and improved ML systems for the differentiation/distinguishing techniques described herein.
[00146] One or more materials and associated methods were used in developing the one or more techniques described herein.
Animals
[00147] C57BL/6 (CD45.2+), C57BL/6-BoyJ (CD45.1+), and α-catulinGFP mice were purchased from The Jackson Laboratory. Evi1-IRES-GFP knock-in mice (Evi1GFP mice) were kindly provided by the University of Tokyo. All mice were used at 8-12 weeks of age. They were bred and maintained in the animal facility at Cooper University Health Care. All procedures and protocols followed NIH-mandated guidelines for animal welfare and were approved by the Institutional Animal Care and Use Committee (IACUC) of Cooper University Health Care.
Antibodies
[00148] The following antibodies were used: Lineage cocktail-PE (components include anti-mouse CD3, clone 17A2; anti-mouse Ly-6G/Ly-6C, clone RB6-8C5; anti-mouse CD11b, clone M1/70; anti-mouse CD45R/B220, clone RA3-6B2; anti-mouse TER-119/Erythroid cells, clone Ter-119; BioLegend, cat# 78035), Lineage cocktail-PE (components include anti-mouse CD3, clone 17A2; anti-mouse Ly-6G/Ly-6C, clone RB6-8C5; anti-mouse CD11b, clone M1/70; anti-mouse CD45R/B220, clone RA3-6B2; anti-mouse TER-119/Erythroid cells, clone Ter-119; anti-mouse CD5, clone 53-7.3; R&D Systems, cat# FFC001A), c-Kit-FITC (S18020H, BioLegend, cat# 161603), c-Kit-PE/Cy7 (2B8, BioLegend, cat# 105814), Sca-1-APC (D7, eBioscience, cat# 17-5981-82), Sca-1-BV421 (D7, BioLegend, cat# 108127), Sca-1-PE (D7, BioLegend, cat# 108107), CD150-BV421 (TC15-12F12.2, BioLegend, cat# 115925), CD48-PE/Cy7 (HM48-1, eBioscience, cat# 25-0481-80), CD48-BV711 (HM48-1, BioLegend, cat# 103439), CD34-FITC (RAM34, eBioscience, cat# 11-0341-82), CD135 (A2F10, BioLegend, cat# 135311).
Immunostaining and Flow Cytometry
[00149] Murine BM cells were flushed out from the long bones (tibias and femurs) and ilia with DPBS without calcium or magnesium (Corning). After lysis of red blood cells and rinsing with DPBS, BM cells were stained with antibodies on ice for 30 min. For BM cells from C57BL/6 mice, the following antibodies were used: Lin-PE, c-Kit-FITC, Sca-1-APC, CD150-BV421, and CD48-PE/Cy7. For BM cells from Evi1GFP or α-catulinGFP mice, the following antibodies were used: Lin-PE, Sca-1-APC, and c-Kit-PE/Cy7. Cells were sorted on a Sony SH800Z automated cell sorter. Negative controls for gating were set using cells without antibody staining. The data were analyzed using the software accompanying the cell sorter.
DIC Image Acquisition
[00150] FACS-sorted cells were plated in coverglass-bottomed chambers (Cellvis) and maintained in DPBS/2% FBS throughout image acquisition. An Olympus FV3000 confocal microscope was used to take DIC and fluorescence images simultaneously at a resolution of 2048×2048 pixels.
Data Preprocessing
[00151] A MATLAB image toolbox (as described herein) may facilitate one or more techniques described herein. The toolbox took the image from the track view that contains bright-field images. This toolbox was used to segment single cells with a pixel size of 64×64 from DIC images and to label the single-cell crops by cell type. The toolbox had two steps: 1) detecting the single cell from bright-field views; and 2) characterizing the cell targets and removing the outliers (debris and cell clusters) by applying size thresholding and a uniqueness check. Data augmentation was applied to the training examples by arbitrary image transformation, including random rotation, horizontal flipping, and/or brightness adjustment on the original single-cell crops.
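For illustration, the augmentation step may be sketched in Python using torchvision (the toolbox itself is implemented in MATLAB); the transformation parameters shown are illustrative assumptions.

```python
# Minimal sketch of the augmentations described above applied to one 64x64
# single-cell crop: random rotation, horizontal flipping, brightness adjustment.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),    # random rotation
    transforms.RandomHorizontalFlip(p=0.5),    # horizontal flipping
    transforms.ColorJitter(brightness=0.2),    # brightness adjustment
])

crop = torch.rand(1, 64, 64)   # stand-in for one segmented single-cell DIC crop
augmented = augment(crop)      # one randomly transformed training example
print(augmented.shape)         # torch.Size([1, 64, 64])
```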
[00152] Oversampling was applied to the minority classes in each run during the training experiment, such that the image data for each class could be balanced when used as training examples. The oversampling algorithm randomly sampled training images from the minority classes until the number of examples reached the number in the majority class for each run. In this way, the training dataset for one or more, or each, run contained equivalent numbers of data examples for one or more, or all, of the classes.
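A minimal sketch of such class-balancing oversampling follows; the variable names and the toy class counts are illustrative.

```python
# Minimal sketch: minority classes are randomly re-sampled (with replacement)
# until every class matches the majority-class count for the run.
import random
from collections import defaultdict
from typing import List, Tuple

def oversample(examples: List[Tuple[object, str]]) -> List[Tuple[object, str]]:
    by_class = defaultdict(list)
    for image, label in examples:
        by_class[label].append((image, label))
    target = max(len(items) for items in by_class.values())   # majority count
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(random.choices(items, k=target - len(items)))
    random.shuffle(balanced)
    return balanced

data = [("img", "LT-HSC")] * 50 + [("img", "ST-HSC")] * 20 + [("img", "MPP")] * 30
print(len(oversample(data)))   # 150: every class brought up to 50 examples
```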
Deep Learning Framework and Training
[00153] From the model selection, the following settings were chosen for training the models. Model selection was performed with cross-validation, and ResNet-50 was selected for the convolutional layers. The convolutional layers were pre-trained on ImageNet, with customized starting layers to match the size of the input single-cell images, followed by four fully-connected layers trained from scratch. An ADAM optimizer was applied with a weight decay of 0.05; the learning rate was set to 5×10⁻⁴ for the fully-connected layers, and the convolutional layers were fine-tuned/calibrated by retraining them at 1% of that learning rate. The final training outcome was reported with a training and validation split of 8:2, and the model was trained with a batch size of 256 on a Tesla P100 GPU on the Google Colab platform for 20 epochs via PyTorch 1.10.0.
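For illustration, this configuration may be sketched in PyTorch as follows. The exact head sizes, the first-layer customization, and the three-class output are illustrative assumptions; only the reported hyper-parameters (Adam, weight decay 0.05, FC learning rate 5×10⁻⁴, convolutional layers at 1% of that rate) are taken from the text.

```python
# Hedged sketch of the reported training configuration.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(pretrained=True)       # torchvision API circa PyTorch 1.10
backbone.conv1 = nn.Conv2d(1, 64, kernel_size=3,  # customized starting layer for
                           stride=1, padding=1,   # single-channel 64x64 crops
                           bias=False)            # (illustrative assumption)
backbone.fc = nn.Identity()                       # keep only the conv trunk

head = nn.Sequential(                             # four FC layers from scratch
    nn.Linear(2048, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 32), nn.ReLU(),
    nn.Linear(32, 3),                             # LT-HSC / ST-HSC / MPP
)
model = nn.Sequential(backbone, head)

fc_lr = 5e-4
optimizer = torch.optim.Adam(
    [{"params": backbone.parameters(), "lr": fc_lr * 0.01},  # conv at 1% of FC rate
     {"params": head.parameters(), "lr": fc_lr}],
    weight_decay=0.05)
```

Transplantations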
[00154] Transplantations were performed as previously described. Congenic recipient mice (CD45.1) were lethally irradiated (1000 rad, split dose 3 h apart). Purified donor cells, along with 3×10⁵ wild-type (WT) “competitor” cells (CD45.1), were injected into the retro-orbital plexus, and hematopoietic reconstitution was monitored over time in the peripheral blood based on CD45.2 expression.
[00155] Referring now to FIG. 3, a diagram 300 illustrates an example technique for distinguishing cells. At 302, the process may start or restart.
[00156] At 304, a processing device may receive a plurality of first images, each of the plurality of first images depicting first cells of a first type or a second type. At 306, the processing device may, perhaps for each of the plurality of first images, receive an indicator identifying whether the first image depicts a first cell of the first type or the second type.
[00157] At 308, the processing device may input, into a deep-learning (DL) model, the plurality of first images and the indicator for each of the plurality of first images. At 310, the processing device may process, via the DL model, the plurality of first images and the indicator for each of the plurality of first images.
[00158] At 312, the processing device may input, into the DL model, a second image comprising a second cell of the first type or the second type. At 314, the processing device may determine, via the DL model, whether the second cell is of the first type or the second type, based, at least in part, on the processing of the plurality of first images and the indicator for each of the plurality of first images. At 316, the process may stop or restart.
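For illustration, the steps of FIG. 3 may be sketched in code as follows; the toy network and the single training update shown are illustrative stand-ins for the DL model and its full training procedure.

```python
# Hedged sketch mapping the steps of FIG. 3 onto code.
import torch
import torch.nn as nn

class DLModel(nn.Module):
    """Illustrative two-class stand-in for the DL model of FIG. 3."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))
    def forward(self, x):
        return self.net(x)

# Steps 304-306: receive first images and their per-image type indicators.
first_images = torch.rand(32, 1, 64, 64)
indicators = torch.randint(0, 2, (32,))   # 0 = first type, 1 = second type

# Steps 308-310: input and process the labeled images (one update shown).
model = DLModel()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss = nn.functional.cross_entropy(model(first_images), indicators)
optimizer.zero_grad(); loss.backward(); optimizer.step()

# Steps 312-314: input a second image and determine its type.
second_image = torch.rand(1, 1, 64, 64)
print("first type" if model(second_image).argmax(1).item() == 0 else "second type")
```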
[00159] FIG. 4 is a block diagram of a hardware configuration of an example device that may function as a process control device/logic controller that may perform/host at least a part of one or more elements of the DL/ML techniques described herein, for example. The hardware configuration 400 may be operable to facilitate delivery of information from an internal server of a device. The hardware configuration 400 can include a processor 410, a memory 420, a storage device 430, and/or an input/output device 440. One or more of the components 410, 420, 430, and 440 can, for example, be interconnected using a system bus 450. The processor 410 can process instructions for execution within the hardware configuration 400. The processor 410 can be a single-threaded processor or the processor 410 can be a multi-threaded processor. The processor 410 can be capable of processing instructions stored in the memory 420 and/or on the storage device 430.
[00160] The memory 420 can store information within the hardware configuration 400. The memory 420 can be a computer-readable medium (CRM), for example, a non-transitory CRM. The memory 420 can be a volatile memory unit, and/or can be a non-volatile memory unit.
[00161] The storage device 430 can be capable of providing mass storage for the hardware configuration 400. The storage device 430 can be a computer-readable medium (CRM), for example, a non-transitory CRM. The storage device 430 can, for example, include a hard disk device, an optical disk device, flash memory and/or some other large capacity storage device. The storage device 430 can be a device external to the hardware configuration 400.
[00162] The input/output device 440 may provide input/output operations for the hardware configuration 400. The input/output device 440 (e.g., a transceiver device) can include one or more of a network interface device (e.g., an Ethernet card), a serial communication device (e.g., an RS-232 port), one or more universal serial bus (USB) interfaces (e.g., a USB 2.0 port) and/or a wireless interface device (e.g., an 802.11 card). The input/output device can include driver devices configured to send communications to, and/or receive communications from, one or more networks. The input/output device 440 may be in communication with one or more input/output modules (not shown) that may be proximate to the hardware configuration 400 and/or may be remote from the hardware configuration 400. The one or more output modules may provide input/output functionality in the digital signal form, discrete signal form, TTL form, analog signal form, serial communication protocol, fieldbus protocol communication and/or other open or proprietary communication protocol, and/or the like.
[00163] The camera device 460 may provide digital video input/output capability for the hardware configuration 400. The camera device 460 may communicate with any of the elements of the hardware configuration 400, perhaps for example via system bus 450. The camera device 460 may capture digital images and/or may scan images of various kinds, such as Universal Product Code (UPC) codes and/or Quick Response (QR) codes, for example, among other images as described herein. In one or more scenarios, the camera device 460 may be the same and/or substantially similar to any of the other camera devices described herein.
[00164] The camera device 460 may include at least one microphone device and/or at least one speaker device. The input/output of the camera device 460 may include audio signals/packets/components, perhaps for example separate/separable from, or in some (e.g., separable) combination with, the video signals/packets/components of the camera device 460.
[00165] The camera device 460 may be in wired and/or wireless communication with the hardware configuration 400. In one or more scenarios, the camera device 460 may be external to the hardware configuration 400. In one or more scenarios, the camera device 460 may be internal to the hardware configuration 400.
[00166] Those skilled in the art will appreciate that the subject matter described herein may at least facilitate distinguishing between at least two types of cells with ML/DL techniques.
[00167] The subject matter of this disclosure, and components thereof, can be realized by instructions that upon execution cause one or more processing devices to carry out the processes and/or functions described herein. Such instructions can, for example, comprise interpreted instructions, such as script instructions, e.g., JavaScript or ECMAScript instructions, or executable code, and/or other instructions stored in a computer readable medium (e.g., on a non-transitory computer readable medium).
[00168] Implementations of the subject matter and/or the functional operations described in this specification and/or the accompanying figures can be provided in digital electronic circuitry, in computer software, firmware, and/or hardware, including the structures disclosed in this specification and their structural equivalents, and/or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, and/or to control the operation of, data processing apparatus.
[00169] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and/or declarative or procedural languages. It can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, and/or other unit suitable for use in a computing environment. A computer program may or might not correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs and/or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, and/or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that may be located at one site or distributed across multiple sites and/or interconnected by a communication network.
[00170] The processes and/or logic flows described in this specification and/or in the accompanying figures may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and/or generating output, thereby tying the process to a particular machine (e.g., a machine programmed to perform the processes described herein). The processes and/or logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) and/or an ASIC (application specific integrated circuit).
[00171] Computer readable media suitable for storing computer program instructions and/or data may include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and/or flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto optical disks; and/or CD ROM and DVD ROM disks. The processor and/or the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[00172] While this specification and the accompanying figures contain many specific implementation details, these should not be construed as limitations on the scope of any invention and/or of what may be claimed, but rather as descriptions of features that may be specific to described example implementations. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in perhaps one implementation. Various features that are described in the context of perhaps one implementation can also be implemented in multiple implementations separately or in any suitable sub-combination. Although features may be described above as acting in certain combinations and/or perhaps even (e.g., initially) claimed as such, one or more features from a claimed combination can in some cases be excised from the combination. The claimed combination may be directed to a sub-combination and/or variation of a sub-combination.
[00173] While operations may be depicted in the drawings in an order, this should not be understood as requiring that such operations be performed in the particular order shown and/or in sequential order, and/or that all illustrated operations be performed, to achieve useful outcomes. The described program components and/or systems can generally be integrated together in a single software product and/or packaged into multiple software products.
[00174] Examples of the subject matter described in this specification have been described. The actions recited in the claims can be performed in a different order and still achieve useful outcomes, unless expressly noted otherwise. For example, the processes depicted in the accompanying figures do not require the particular order shown, and/or sequential order, to achieve useful outcomes. Multitasking and parallel processing may be advantageous in one or more scenarios.
[00175] While the present disclosure has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only certain examples have been shown and described, and that all changes and modifications that come within the spirit of the present disclosure are desired to be protected.

Claims

What Is Claimed Is:
1. A computer-implemented method of distinguishing cells, the method comprising:
(a) receiving a plurality of first images, each of the plurality of first images depicting first cells of a first type or a second type;
(b) for each of the plurality of first images, receiving an indicator identifying whether the first image depicts a first cell of the first type or the second type;
(c) inputting, into a deep-learning (DL) model, the plurality of first images and the indicator for each of the plurality of first images;
(d) processing, by the DL model, the plurality of first images and the indicator for each of the plurality of first images;
(e) inputting, into the DL model, a second image comprising a second cell of the first type or the second type; and
(f) determining, via the DL model, whether the second cell is of the first type or the second type, based at least in part, on the processing of the plurality of first images and the indicator for each of the plurality of first images, wherein at least steps (c)-(f) are performed by one or more processors of one or more computing devices.
2. The method according to claim 1, wherein the first type is a stem cell and the second type is not a stem cell, or the first type is a cancer cell and the second type is not a cancer cell.
3. The method according to any foregoing claim, wherein the first type of cell is a hematopoietic stem cell or a cancer cell.
4. The method according to any foregoing claim, wherein the DL model performs one or more trained machine learning techniques.
5. The method according to any foregoing claim, wherein at least one of the first cells and the second cell are hematopoietic stem cells.
6. The method according to any foregoing claim, wherein at least one of the first cells is a hematopoietic stem cell and the second cell is not a hematopoietic stem cell.
7. The method according to any foregoing claim, wherein the second type of cell is a differentiated cell.
8. The method according to any foregoing claim, wherein at least one of: the first type of cell, or the second type of cell is at least one of: a long-term (LT) Hematopoietic Stem Cell (HSC), a short-term (ST) Hematopoietic Stem Cell (HSC), or a multipotent progenitor (MPP) Hematopoietic Stem Cell (HSC).
9. The method according to any foregoing claim, wherein the DL model is a machine learning model, the indication identifying whether the second image is associated with the stem cell or is unassociated with the stem cell being determined via rules of the machine learning model.
10. The method according to claim 9, wherein the machine learning model comprises a supervised learning modality.
11. The method according to claim 10, wherein the supervised learning modality comprises at least one of a Decision Tree rule, a Random Forest rule, a Support Vector Machine rule, a Naive Bayes Classification rule, or a Logistic Regression rule.
12. The method according to any foregoing claim, further comprising providing the cells to a patient when the second cell is determined to be a hematopoietic stem cell.
13. The method according to any foregoing claim, wherein the DL model determines whether the second cell is of the first type or the second type based on the shape, size, or color of the second cell.
14. The method according to any foregoing claim, further comprising obtaining at least some of the plurality of first images via at least one of: bright field microscopy, or fluorescent microscopy.
15. The method according to any foregoing claim, wherein the determining, via the DL model, whether the second cell is of the first type or the second type is conducted via label-free detection.
16. The method according to any foregoing claim, further comprising generating one or more Differential Interference Contrast (DIC) images based on at least one output from the DL model.
17. The method according to any foregoing claim, further comprising: processing the plurality of first images, the processing comprising: cropping at least some of the plurality of first images that contain a single cell; manually selecting at least some of the cropped single cell images; and forming at least one training dataset for the DL model based, at least in part, on the selected cropped single-cell images.
18. The method according to claim 17, further comprising performing at least one of: a brightness normalization, or a background normalization on the selected cropped single-cell images, the forming the at least one training dataset for the DL model being further based on the normalized single-cell images.
19. The method according to claim 18, further comprising augmenting the at least one training dataset for the DL model, the augmenting comprising applying at least one of: an affine transformation, an elastic transformation, a random rotation transformation, a horizontal flipping transformation, or a brightness adjustment transformation to the at least one training dataset.
20. The method according to claim 19, wherein the at least one training dataset comprises the plurality of first images and the indicator for each of the plurality of first images, and the processing the plurality of first images and the indicator for each of the plurality of first images further comprises training the DL model using, at least in part, the training dataset.
21. The method according to any foregoing claim, wherein the determining, via the DL model, whether the second cell is of the first type or the second type, further comprises determining a probability that the second cell is of the first type or the second type.
22. The method according to any foregoing claim, wherein the DL model comprises one or more convolutional layers.
23. The method according to any foregoing claim, further comprising applying five-fold cross-validation with a training dataset to determine one or more hyper-parameters for training the DL model.
24. The method according to any foregoing claim, wherein the determining, via the DL model, whether the second cell is of the first type or the second type comprises at least one of: antibody label-free, or laser-free analysis of the second image.
25. The method according to any foregoing claim, wherein the determining, via the DL model, whether the second cell is of the first type or the second type comprises analysis of one or more morphological differences between subpopulations of the first type of cell and the second type of cell.
26. The method according to any foregoing claim, wherein the first type corresponds to a subpopulation of peripheral blood mononuclear cells (PBMC) cells and the second type does not correspond to a subpopulation of PBMC cells, or the first type corresponds to a first subpopulation of PBMC cells and the second type corresponds to a second subpopulation of PBMC cells.
27. The method according to claim 26, wherein a subpopulation of PBMC cells comprises at least one of: lymphocyte cells, monocyte cells, or dendritic cells.
28. The method according to claim 27, wherein the lymphocyte cells comprise at least one of: T cells, B cells, or NK cells.
29. The method according to any foregoing claim, wherein the DL model comprises, at least in part, one or more layers, and the processing the plurality of first images and the indicator for each of the plurality of first images further comprises: determining, by the DL model from the indicator, which of each of the plurality of first images depicts the first cell as that of the first type; determining, by the DL model from the indicator, which of each of the plurality of first images depicts the first cell as that of the second type; associating, by the DL model, one or more characteristics of each of the plurality of first images determined to depict the first cell of the first type with one or more first image identification parameters of a cell of the first type; associating, by the DL model, one or more characteristics of each of the plurality of first images determined to depict the first cell of the second type with one or more second image identification parameters of a cell of the second type; and tuning, by the DL model, at least some of the one or more layers of the DL model based on at least one of: the first image identification parameters, or the second image identification parameters.
30. The method according to claim 29, wherein the one or more layers are convolutional layers, and the tuning is performed, by the DL model, at one or more learning rates associated with the one or more convolutional layers.