CN116997972A - Sensing biological cells in a sample for cell type identification - Google Patents

Sensing biological cells in a sample for cell type identification

Info

Publication number
CN116997972A
Authority
CN
China
Prior art keywords
cell
cells
high entropy
sensor data
data
Prior art date
Legal status
Pending
Application number
CN202280019636.3A
Other languages
Chinese (zh)
Inventor
E·德雷纳尔迪斯
V·萨沃华
M·张伯伦
Current Assignee
Sanofi Aventis France
Original Assignee
Sanofi Aventis France
Priority date
Filing date
Publication date
Application filed by Sanofi Aventis France
Priority claimed from PCT/US2022/012303 external-priority patent/WO2022155328A1/en
Publication of CN116997972A publication Critical patent/CN116997972A/en

Landscapes

  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

A cell sampler is configured to: sense a physical phenomenon of biological cells in a sample receiver with a sensor; and transmit sensor data generated from the sensing of the biological cells to a processing device. The processing device is configured to: receive the sensor data from the cell sampler; identify individual cells of the biological cells using the sensor data; for each individual cell: generate a cell type for the individual cell using the sensor data; generate a feature vector for the individual cell using the sensor data; classify at least some cell types as rare using the sensor data; for each rare cell type: access the feature vectors of individual cells of the rare cell type; generate a bootstrap vector for the rare cell type by applying noise to the feature vectors of individual cells of the rare cell type; and generate a cell database by aggregating the bootstrap vectors and the feature vectors.

Description

Sensing biological cells in a sample for cell type identification
Technical Field
This document describes techniques for using sensor data to identify and classify biological cells.
Background
Single cell analysis in cell biology is the study of genomics, transcriptomics, proteomics, metabolomics, and cell-to-cell interactions at the single cell level. Because heterogeneity exists in both eukaryotic and prokaryotic cell populations, analyzing single cells makes it possible to discover mechanisms that are not apparent when bulk cell populations are studied. Techniques such as Fluorescence Activated Cell Sorting (FACS) allow precise separation of selected single cells from complex samples, while high throughput single cell partitioning techniques enable simultaneous molecular analysis of hundreds or thousands of individual unsorted cells.
Disclosure of Invention
A technique for identifying single cells, including previously unknown cells, is described. Sensor data is collected from a sample of biological cells, and a machine learning classifier can be used to classify each cell sensed in the sample. To train these classifiers, a training set is created based on cell identity. Some cells in the sample are relatively numerous and can therefore be used directly as a training corpus. However, rare cells may not provide enough data points, or data points with sufficient discriminative power, to train a reliable machine learning classifier. For these rare cells, the training data may be derived from the rare examples combined with mathematical noise whose statistical distribution matches the known variation among known cells. In this way, high quality datasets may be generated, and high quality classifiers may be trained from them. By using these high quality classifiers, the cell sampler and its associated computing device can better sense and identify biological cells.
In an example, a system may be adapted to sense data from a sample of biological cells. The system includes a cell sampler including a sample receiver and one or more sensors, wherein the cell sampler is configured to: sense a physical phenomenon of biological cells in the sample receiver with the sensors; and transmit sensor data generated from the sensing of the biological cells to a processing device. The system includes a processing device including a computer memory and one or more processors, the processing device configured to: receive the sensor data from the cell sampler; identify individual cells of the biological cells using the sensor data; for each individual cell: generate a cell type for the individual cell using the sensor data and generate a feature vector for the individual cell using the sensor data; classify at least some of the cell types as rare using the sensor data; for each rare cell type: access the feature vectors of individual cells of the rare cell type and generate a bootstrap vector for the rare cell type by applying noise to those feature vectors; and generate a cell database by aggregating the bootstrap vectors and the feature vectors of individual cells of common cell types. Other examples include methods, computer-readable media, apparatus, and software.
Examples may include some, all, or none of the following features. The processing device is further configured to perform at least one of the group consisting of: i) storing at least one of the cell databases in a data repository as a result of sensing the biological cells; ii) transmitting a report of at least one of the cell databases over a data network; and iii) in response to generating at least one of the cell databases, initiating an automated process without specific user input to initiate the automated process. To generate a cell type for an individual cell using the sensor data, the processing device is further configured to submit the sensor data to one or more machine learning classifiers configured to receive the sensor data as input and generate an indication of a cell type as output. The one or more machine learning classifiers include a plurality of classifiers arranged in a hierarchical decision tree, with a set of machine learning classifiers configured to vote on a classification at each of a plurality of nodes of the decision tree. The root node of the decision tree has a child node for immune cells and a child node for non-immune cells. The machine learning classifier is trained on an initial database of training data; and the processing device is further configured to: generate an updated database of training data by incorporating at least one of the cell databases into the initial database; and train an updated machine learning classifier using the updated database. The processing device is further configured to: identify one of the individual cells as a high entropy cell based on the high entropy cell being found in a cluster having a high entropy level; disassociate the generated cell type from the high entropy cell; and classify the high entropy cell as a novel cell type. The processing device is further configured to: identify one of the individual cells as a high entropy cell based on the high entropy cell being found in a cluster having a high entropy level; disassociate the generated cell type from the high entropy cell; and perform at least one of the group consisting of: i) storing information about the high entropy cell in a data repository as a result of sensing the biological cells; ii) transmitting a report on the high entropy cell over a data network; and iii) in response to identifying the high entropy cell, initiating an automated process without specific user input to initiate the automated process. Identifying one of the individual cells as a high entropy cell includes calculating a Shannon entropy value of the high entropy cell. The noise is generated based on statistical measures of previously analyzed cells. The processing device is further configured to generate the noise based on a statistical measure of the sensor data.
Implementations may include any, all, or none of the following features. Single cell analysis techniques are advanced. Machine-learned classifiers can be trained on very rare cell data that would not be usable without this technique. This allows sensors and their associated controllers to be created that can classify these rare cells when they are encountered. Further, previously unknown cell types may be identified and analyzed. That analysis may be incorporated into a classifier to improve performance when those rare cells are encountered a second time.
Other features, aspects, and potential advantages will become apparent from the accompanying description and drawings.
Drawings
FIG. 1 illustrates an example system for sensing data from a sample of biological cells.
Fig. 2 shows an example of data that may be used in sensing data from a sample of biological cells.
FIG. 3 shows a lane diagram of an example process of sensing data from a sample of biological cells.
Fig. 4 shows a schematic diagram of an example of a computing device and a mobile computing device.
Fig. 5 illustrates an example process of sensing data from a sample of biological cells.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
Cell identification is improved by using sensing techniques and classification techniques that begin with training data from common cell types, update the training data by bootstrapping data of rare cells, and then train a machine learning classifier based on the training data. These classifiers can then be arranged in a hierarchical decision tree that can be used to classify the sensed cells.
FIG. 1 illustrates an example system 100 for sensing data from a sample of biological cells. In the system 100, the cell sampler 102 cooperates with the processing device 104 to generate a machine-learned classifier 118 that can be used to classify new cells and identify new cell types that were previously unknown.
The cell sampler 102 is any one or combination of devices capable of receiving a sample of cells 106 in a sample receiver and sensing physical phenomena of the cells 106 with one or more sensors. Example cell samplers 102 include, but are not limited to, well-based or droplet-based cell sequencers. Some example cell samplers 102 include devices that use microfluidic structures to perform single cell partitioning and bar coding. In some examples, the cell sampler 102 performs multidimensional and transcriptomic sensing.
The cell sampler 102 is in data communication with a processing device 104 that includes computer memory and one or more processors capable of executing instructions to receive data, perform data calculations, generate reports, transmit data over a network, and the like. As will be appreciated, the processing device 104 may include one or more apparatuses, such as a computer, monitor, data networking device, and the like. Some or all of the processing device 104 may be physically integrated with the cell sampler 102, for example, in the form of a dedicated device controller. Some or all of the processing device 104 may be geographically remote but in data communication over one or more networks, including the internet.
The system 100 may operate to create training data on which the machine learning classifier 118 may be trained. The cell sampler 102 receives a sample of the cells 106 and senses physical phenomena of the cells 106. From this sensing, sensor data 108 is generated, recording data that reflects the phenomena. Individual cells 106 are identified (that is, many distinct single cells 106 are distinguished from one another) and classified 110 as common or rare. For cells 106 of a common cell type, common cell features 112 are identified and associated with their corresponding type. For rare cell types, additional features 114 are bootstrapped from the features directly sensed and recorded in the sensor data 108.
The common cell features 112 and bootstrap features 114 are combined into one or more machine learning datasets 116. By using the bootstrap features 114, the device 104 is able to construct a dataset suitable for training a classifier even for cell types for which only one or a few cells are available. Such techniques can train a machine-learned classifier from fewer sensed physical phenomena than would otherwise be required, and can therefore classify rarer cell types than would otherwise be possible.
Further sensor data 120 may then be submitted to the one or more machine-learned classifiers 118 for analysis. For cell types that have been seen before (including rare types that would otherwise be too scarce to support machine learning training), the sensor data 120 may be assigned a cell classification 122. In addition, new cell types may be identified 124 for recording and/or further study. This may advantageously advance single cell identification, classification, and sequencing techniques.
Fig. 2 shows an example of data that may be used in sensing data from a sample of biological cells. For example, the data shown in FIG. 2 may be used by system 100 or other systems. The data shown herein may be recorded in one or more data stores, used by a processor in short-term memory storage, transmitted over a data network, etc.
The sensor data 108/120 includes data generated by the sensors and/or by controllers operating the sensors. Sensors include hardware that, under given environmental conditions, generates electrical signals based on characteristics of the environment. In other words, the sensor data 108/120 reflects the physical condition of the cells 106.
Single cell record 200 records information about specific cells in a cell sample. The record 200 may be in a structured format with fields to store, for example, the name of the cell type, the feature vector 202, the date of creation, the sample identifier to which the single cell belongs, etc., and/or a reference to similar data fields.
Feature vector 202 may store a set of features (e.g., array, list, vector) determined for a single cell, and may be stored as part of single cell record 200. In one implementation, each index of feature vector 202 records a value to reflect a single gene expression of a single cell, although other data repository schemes may be used.
Noise 204 may store a set of random or pseudo-random values (e.g., as an array, list, or vector) that have been adjusted to conform to one or more statistical rules. For example, mean, standard deviation, and range values may be compiled from a record of the variation observed in known common cells, and the noise 204 may be generated to exhibit the same mean, standard deviation, and range. In some cases, the noise 204 is generated based on statistical measures of cells previously analyzed in the system 100. For example, the processing device 104 may be configured to generate the noise 204 based on a statistical measure of the sensor data.
Bootstrap vector 206 may store a set of features (e.g., as an array, list, or vector) generated by applying noise 204 to a feature vector 202. In this case, the bootstrap vector 206 contains values similar to, and within reasonable variation of, the feature vector 202 of a rare cell type. This may be advantageous, for example, when too few rare cells have been found to generate enough feature vectors 202 for a particular task. One such task is training a machine learning classifier, but other tasks are also possible.
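As a concrete illustration (not a prescribed implementation), the following Python sketch generates noise whose per-gene spread matches the variation observed among common cells and applies it to rare-cell feature vectors; the function names, the Gaussian noise model, and the clipping steps are illustrative assumptions.

```python
import numpy as np

def noise_from_common(common: np.ndarray, n: int, rng=None) -> np.ndarray:
    """Draw noise whose statistics match the gene-wise variation
    (deviation from the per-gene mean) seen among common cells."""
    rng = np.random.default_rng() if rng is None else rng
    deviations = common - common.mean(axis=0)          # observed "changes"
    std = deviations.std(axis=0)
    lo, hi = deviations.min(axis=0), deviations.max(axis=0)
    noise = rng.normal(0.0, std, size=(n, common.shape[1]))
    return np.clip(noise, lo, hi)                      # respect the observed range

def bootstrap_vectors(rare: np.ndarray, common: np.ndarray,
                      n_per_cell: int = 100, rng=None) -> np.ndarray:
    """Expand each rare-cell feature vector into n_per_cell perturbed copies."""
    rng = np.random.default_rng() if rng is None else rng
    out = [vec + noise_from_common(common, n_per_cell, rng) for vec in rare]
    return np.clip(np.vstack(out), 0.0, None)          # expression stays non-negative
```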
Cell database 208 contains data representing a number of cells. For example, the cell database 208 may include the feature vector 202 and the bootstrap vector 206. The cell database 208 may be used for a number of useful tasks. One such task is training a machine learning classifier, but other tasks are also possible.
The cell classifier 210 includes functionality configured to receive the feature vector 202 as an input and return a classification value as an output. For example, the feature vectors 202 of unclassified, recently sensed cells may be submitted to the cell classifier 210 for a first classification.
The cell classifiers 210 may be arranged in a hierarchical decision tree, with a set of machine-learned classifiers 210 configured to vote on the classification at each of a plurality of nodes 212 of the decision tree. The cell classifier 210 may thus provide a single classification, a series of classifications with confidence values, or classifications at varying levels of specificity corresponding to the levels of the decision tree.
Entropy values 214 and 216 may record entropy values for individual cells or clusters of cells. For high entropy clusters or cells in high entropy clusters, a high value 214 may be recorded. For low entropy clusters or cells in low entropy clusters, a low value 216 may be recorded.
FIG. 3 shows a lane diagram of an example process of sensing data from a sample of biological cells. Process 300 may be performed by, for example, system 100, and thus elements of system 100 will be used in this example, although other systems may be used to perform process 300 and other processes.
In this example, the processing device 104 includes a computer apparatus 302, a data repository 304, and a networking client 306. The devices 302 through 306 are each geographically separated and connected to one or more data networks, including the internet. However, other elements of the processing device 104 may be used in other examples.
The cell sampler 102 is configured to sense, with the sensors, a physical phenomenon of biological cells in the sample receiver. For example, an operator (e.g., a human technician or an automated material-handling robot) may load a sample of biological cells into the sample receiver of the cell sampler 102 and may issue a command (e.g., press a button or transmit a data message) to analyze the cells.
The cell sampler 102 is configured to communicate 310 sensor data generated from the sensing of the biological cells to the processing device, and the processing device 104 is configured to receive 312 the sensor data from the cell sampler 102. For example, the cell sampler 102 may send data messages from the sensing directly to the computer device 302, may store the data in the data repository 304 and send messages to the computer device 302 with pointers to the data, or may otherwise communicate the data.
The processing device 104 is configured to use the sensor data to identify 314 individual cells of the biological cells. For example, the computer device 302 may parse the received data and create a corresponding unique identifier (e.g., a bar code) for each single cell.
For each individual cell, the processing device 104 is configured to generate 316 a cell type for the individual cell using the sensor data. For example, computer device 302 may use one or more techniques to classify each single cell. The computer device 302 may submit the sensor data to one or more machine-learned classifiers configured to receive the sensor data as input and generate an indication of the cell type as output. This may generate an indication of the cell type for each unique identifier and thus for a single cell.
In some cases, the one or more machine learning classifiers include a plurality of classifiers. These classifiers may work together (e.g., by pooling votes or confidence levels) to create a classification. These classifiers may be arranged in a hierarchical decision tree. The tree may have a set of machine-learned classifiers configured to vote on the classification at each node of the tree. The vote may be used to create a classification.
The tree may be generic and therefore used when a completely unknown type of cell or other situation is to be received. In some cases, the tree may be constructed for a particular use. One such use is to differentiate and classify immune cells. In this case, the tree may be organized such that the root node of the decision tree has child nodes for immune cells and child nodes for non-immune cells. Thus, each cell may be first classified as immunized (e.g., and retained for further analysis) or non-immunized (e.g., discarded from further analysis). Immune cells, non-immune cells, or both immune and non-immune cells may be further classified.
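The following Python sketch shows one possible arrangement of voting classifiers in a hierarchical decision tree whose root splits immune from non-immune cells; the Node structure, the majority-vote rule, and the placeholder threshold classifiers are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from collections import Counter

Classifier = Callable[[list], str]   # maps a feature vector to a label

@dataclass
class Node:
    name: str
    committee: List[Classifier] = field(default_factory=list)   # voters at this node
    children: Dict[str, "Node"] = field(default_factory=dict)   # label -> subtree

    def classify(self, features: list) -> List[str]:
        """Walk the tree from this node, letting the committee vote at each level."""
        path = [self.name]
        node = self
        while node.committee and node.children:
            votes = Counter(clf(features) for clf in node.committee)
            label, _ = votes.most_common(1)[0]
            path.append(label)
            node = node.children.get(label)
            if node is None:
                break
        return path

# Root split: immune vs. non-immune, as in the example organization above.
root = Node(
    name="cell",
    committee=[lambda f: "immune" if f[0] > 0.5 else "non-immune"],  # placeholder voter
    children={
        "immune": Node(name="immune",
                       committee=[lambda f: "T cell" if f[1] > 0.5 else "B cell"],
                       children={"T cell": Node("T cell"), "B cell": Node("B cell")}),
        "non-immune": Node(name="non-immune"),
    },
)

print(root.classify([0.9, 0.2]))   # ['cell', 'immune', 'B cell']
```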
For each individual cell, the processing device 104 is configured to generate 318 a feature vector for the individual cell using the sensor data. The vector may record various characteristics of the cell. As will be appreciated, each cell may have a corresponding vector of the same format, with the first element of each vector recording the same type of data across all vectors, the second element recording another type of data across all vectors, and so on.
In addition to the uses described herein, feature vectors may also be used as inputs for other operations. For example, feature vectors may be used for deconvolution and signature analysis, among other purposes.
For each individual cell, the processing device 104 is configured to classify 320 at least some of the cell types as rare using the sensor data. For example, any cell type in the sample having a cell count less than a threshold may be classified as rare. The threshold may be a static value (e.g., 2, 10, 100) or a value derived from another value (e.g., less than two standard deviations below the mean, or the N smallest cell type populations). This other value may be associated with the sample (i.e., for finding rare cells in the sample) or with another dataset (i.e., for finding rare cells when all known available cells are considered).
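A minimal sketch of such a rarity test, assuming a count-based static threshold and a mean-minus-two-standard-deviations fallback, is shown below; both threshold choices are illustrative.

```python
from collections import Counter

def rare_types(cell_types, static_threshold=None, z_cutoff=2.0):
    """Flag cell types whose count falls below a static threshold or more
    than z_cutoff standard deviations below the mean count in the sample."""
    counts = Counter(cell_types)
    values = list(counts.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    rare = set()
    for cell_type, n in counts.items():
        if static_threshold is not None and n < static_threshold:
            rare.add(cell_type)
        elif std > 0 and n < mean - z_cutoff * std:
            rare.add(cell_type)
    return rare

print(rare_types(["B"] * 500 + ["T"] * 480 + ["pDC"] * 3, static_threshold=10))
# {'pDC'}
```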
For each rare cell type, the processing device 104 is configured to access 322 a feature vector for the individual cells of the rare cell type. For example, the computer device 302 may access all feature vectors and filter out feature vectors of common cells. In another example, the computer device 302 may construct and submit a query that returns feature vectors for only rare cells.
For each rare cell type, the processing device 104 is configured to generate 324 a bootstrap vector for the rare cell type by applying noise to the feature vectors of individual cells of the rare cell type. For example, each feature vector may have I elements with data ranging from 0 to M. The noise may contain values between 0 and M that are random and conform to the variation found among common cell types. The computer device 302 may combine each feature vector element with the next unused number in the noise using wrap-around (modular) addition, so that the value remains between 0 and M but is altered by the noise. Other forms of combining besides wrap-around addition may be used; the choice may depend on, for example, the manner in which the data is represented and stored.
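A minimal sketch of the wrap-around combination described above, assuming integer-coded feature values in the range 0 to M, could look like the following.

```python
import numpy as np

def wrap_add(feature_vector: np.ndarray, noise: np.ndarray, m: int) -> np.ndarray:
    """Combine a feature vector with noise using wrap-around (modular) addition
    so every element stays within the valid range [0, M]."""
    # (M + 1) distinct values: 0, 1, ..., M
    return (feature_vector + noise) % (m + 1)

features = np.array([0, 3, 7, 10])          # rare-cell feature vector, M = 10
noise    = np.array([2, 9, 1, 5])           # next unused values from the noise stream
print(wrap_add(features, noise, m=10))      # [2 1 8 4]
```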
The processing device 104 is configured to generate 326 a cell database by aggregating bootstrap vectors and feature vectors of individual cells of common cell types. For example, the computer device 302 may start from all feature vectors generated, or only from the feature vectors of common cells, and add all of the bootstrap vectors generated at 324 to the set. In some cases, the computer device 302 may be configured to run one or more post-processing checks on the database to ensure that it meets minimum criteria established for a particular use. For example, a minimum number of data entries for machine learning classification may be established.
The processing device 104 is further configured to store 328 at least one of the cell databases in a data repository as a result of sensing the biological cells. For example, the data repository 304 may store the cell database in long-term, stable storage and may then respond to queries on the cell database as they are received.
The processing device 104 is further configured to transmit 330 a report of at least one of the cell databases over the data network. For example, the networking client 306 may send a report about a patient's cells to a clinician for use in the patient's diagnostic care.
The processing device 104 is further configured to initiate 332, in response to generating at least one of the cell databases, an automated process without specific user input to initiate the automated process. For example, the networking client 306 may run one or more quality checks on the database and initiate one or more processes if the database passes the checks.
One example of such a process is the training of a machine learning classifier. In some cases, the classifier used by the computer device 302 may itself have been created in this way. That is, the machine learning classifier is trained on an initial database of training data, which is then updated. In this case, the processing device 104 is configured to generate an updated database of training data by incorporating at least one of the cell databases into the initial database, and to train an updated machine learning classifier using the updated database. The updated database will thus include a greater number of cell types, allowing for more flexible classification.
Another example of such a process is the classification of high entropy cells. The processing device may identify one of the individual cells as a high entropy cell based on the cell being found in a cluster having a high entropy level (measured by, for example, but not limited to, Shannon entropy). For example, a cluster of N cells to which N, or nearly N, different cell types have been assigned may serve as an indication that the cluster is actually composed of N individual cells of a previously unknown type for which no specific classifier exists.
In this case, the processing device 104 may disassociate the generated cell type from the high entropy cells and instead classify the high entropy cells as a novel cell type. In response, the processing device may perform a number of useful actions, such as: i) storing information about the high entropy cells in a data repository as a result of sensing the biological cells; ii) transmitting a report on the high entropy cells over the data network; and/or iii) in response to identifying the high entropy cell, initiating an automated process without specific user input to initiate the process.
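One way such a check could be implemented is sketched below, assuming normalized Shannon entropy over the labels assigned within each cluster and an illustrative threshold; neither the data layout nor the cutoff is prescribed by the description above.

```python
import math
from collections import Counter

def normalized_shannon_entropy(labels) -> float:
    """Shannon entropy of the label distribution, normalized to [0, 1]."""
    counts = Counter(labels)
    total = len(labels)
    h = -sum((n / total) * math.log2(n / total) for n in counts.values())
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / h_max

def flag_novel_clusters(cluster_labels: dict, threshold: float = 0.9) -> list:
    """Return clusters whose assigned cell types are so mixed (high entropy)
    that the cluster is better treated as a previously unknown cell type."""
    return [cid for cid, labels in cluster_labels.items()
            if len(labels) > 1 and normalized_shannon_entropy(labels) > threshold]

clusters = {
    "c1": ["T cell"] * 40,                                   # low entropy: confident T cells
    "c2": ["T cell", "B cell", "NK", "monocyte", "pDC"] * 4, # high entropy: novel candidate
}
print(flag_novel_clusters(clusters))   # ['c2']
```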
Fig. 4 illustrates an example of a computing device 400 and an example of a mobile computing device that may be used to implement the techniques described herein. Computing device 400 is intended to represent various forms of digital computers such as: laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Mobile computing devices are intended to represent various forms of mobile devices such as: personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
Computing device 400 includes a processor 402, a memory 404, a storage device 406, a high-speed interface 408 coupled to memory 404 and a plurality of high-speed expansion ports 410, and a low-speed interface 412 coupled to low-speed expansion ports 414 and storage device 406. Each of the processor 402, memory 404, storage 406, high-speed interface 408, high-speed expansion port 410, and low-speed interface 412 are interconnected using various buses, and may be mounted on a common motherboard or in other manners as desired. The processor 402 may process instructions for execution within the computing device 400, including instructions stored in the memory 404 or on the storage device 406, to display graphical information for a GUI on an external input/output device, such as a display 416 coupled to the high-speed interface 408. In other implementations, multiple processors and/or multiple buses may be used, as desired, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a set of blade servers, or a multiprocessor system).
Memory 404 stores information within computing device 400. In some implementations, the memory 404 is one or more volatile memory units. In some implementations, the memory 404 is one or more nonvolatile memory units. Memory 404 may also be another form of computer-readable medium, such as a magnetic or optical disk.
Storage device 406 is capable of providing mass storage for computing device 400. In some implementations, the storage device 406 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices (including devices in a storage area network or other configurations). The computer program product may be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The computer program product may also be tangibly embodied in a computer-readable medium or machine-readable medium, such as the memory 404, the storage device 406, or a memory on the processor 402.
High-speed interface 408 manages bandwidth-intensive operations for computing device 400 while low-speed interface 412 manages lower bandwidth-intensive operations. Such allocation of functions is merely exemplary. In some implementations, the high-speed interface 408 is coupled to the memory 404, the display 416 (e.g., via a graphics processor or accelerator), and to a high-speed expansion port 410 that can accept various expansion cards (not shown). In an implementation, low-speed interface 412 is coupled to storage 406 and low-speed expansion port 414. The low-speed expansion port 414, which may include various communication ports (e.g., USB, bluetooth, ethernet, wireless ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device, such as a switch or router, for example, through a network adapter.
As shown in the figures, the computing device 400 may be implemented in a number of different forms. For example, it may be implemented as a standard server 420, or multiple times in such a server bank. Furthermore, it may be implemented in a personal computer such as laptop 422. It may also be implemented as part of a rack server system 424. Alternatively, components from computing device 400 may be combined with other components in a mobile device (not shown), such as mobile computing device 450. Each of such devices may include one or more of computing device 400 and mobile computing device 450, and the entire system may be made up of multiple computing devices in communication with each other.
The mobile computing device 450 includes a processor 452, memory 464, input/output devices such as a display 454, a communication interface 466, and other components, such as a transceiver 468. The mobile computing device 450 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the processor 452, memory 464, display 454, communication interface 466, and transceiver 468 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as desired.
Processor 452 can execute instructions within mobile computing device 450, including instructions stored in memory 464. Processor 452 may be implemented as a chipset of chips that include separate analog and digital processors. The processor 452 may provide, for example, for coordination of the other components of the mobile computing device 450, such as control of user interfaces, applications run by the mobile computing device 450, and wireless communication by the mobile computing device 450.
The processor 452 may communicate with a user through a control interface 458 and a display interface 456 coupled to the display 454. The display 454 may be, for example, a TFT (thin film transistor liquid crystal display) display or an OLED (organic light emitting diode) display, or other suitable display technology. The display interface 456 may include suitable circuitry for driving the display 454 to present graphical and other information to a user. The control interface 458 may receive commands from a user and convert them for submission to the processor 452. In addition, an external interface 462 may provide communication with the processor 452 to enable the mobile computing device 450 to communicate with other devices in the near area. External interface 462 may be provided for wired communication, for example, in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
Memory 464 stores information within mobile computing device 450. Memory 464 may be implemented as one or more of one or more computer-readable media, one or more volatile memory units, or one or more non-volatile memory units. Expansion memory 474 may also be provided and connected to mobile computing device 450 through expansion interface 472, which may include, for example, a SIMM (Single in line memory Module) card interface. Expansion memory 474 may provide additional storage for mobile computing device 450 or may also store applications or other information for mobile computing device 450. Specifically, expansion memory 474 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 474 may be provided as a security module for mobile computing device 450 and may be programmed with instructions that permit secure use of mobile computing device 450. Further, secure applications may be provided via the SIMM cards along with additional information, such as placing identifying information on the SIMM cards in an indestructible manner.
The memory may include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, the computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The computer program product may be a computer-readable medium or machine-readable medium such as memory 464, expansion memory 474, or memory on processor 452. In some implementations, the computer program product may be received in a propagated signal, e.g., through transceiver 468 or external interface 462.
The mobile computing device 450 may communicate wirelessly through a communication interface 466, which may include digital signal processing circuitry if necessary. Communication interface 466 may provide communication under various modes or protocols, such as GSM (global system for mobile communications) voice calls, SMS (short message service), EMS (enhanced message service) or MMS (multimedia message service) messaging, CDMA (code division multiple access), TDMA (time division multiple access), PDC (personal digital cellular system), WCDMA (wideband code division multiple access), CDMA2000, or GPRS (general packet radio service), among others. Such communication may occur, for example, using radio frequencies through transceiver 468. Further, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, the GPS (global positioning system) receiver module 470 may provide additional navigation-related and location-related wireless data to the mobile computing device 450, which may be used as needed by applications running on the mobile computing device 450.
The mobile computing device 450 may also communicate audibly using an audio codec 460 that may receive verbal information from the user and convert it to usable digital information. The audio codec 460 may likewise generate audible sound for a user, such as through a speaker (e.g., in a handset of the mobile computing device 450). Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.), and may also include sound generated by applications operating on the mobile computing device 450.
As shown in the figures, the mobile computing device 450 may be implemented in a number of different forms. For example, it may be implemented as a cellular telephone 480. It may also be implemented as part of a smart phone 482, personal digital assistant, or other similar mobile device.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementations in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display device) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, audible feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server) or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In one example, the technique successfully separated immune cells from non-immune cells in data from three mixed-tissue experiments containing cells from kidney, synovium, and lung, generated using plate-based or droplet-based techniques. It also correctly rejected non-immune labels in an example dataset derived from blood. Immune and non-immune cells exhibited significant differences in gene expression for well-established immune and non-immune cell markers (such as PTPRC and CD53) (p-value < 0.05, Wilcoxon rank-sum test), indicating broadly accurate classification of immune and non-immune cells in peripheral tissues as well as blood.
In another example, data were generated from human whole blood using cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) to observe cell type specific protein expression. In these data, the technique identified cell types consistent with the expected protein expression: CD19+ B cells, CD19+CD25+ memory B cells, CD19+CD25-CCR7+ naive B cells, CD14++CD16- classical monocytes, CD14+CD16++ nonclassical monocytes, CD3+ T cells, CD45RA+CD4+ naive T cells, CD45RO+CD4+ memory T cells, CD4+TIGIT+FOXP3+ regulatory T cells, CD45RO+CD8+ effector memory T cells, CD56+CD3- NK cells, CLEC10A+ dendritic cells (DCs), MZB1+ plasma cells, and CD56+CD3- NK cells. Notably, the technique did not detect any macrophages in these data obtained from blood, consistent with the idea that monocyte differentiation occurs in tissue rather than blood.
In another example, a recent Accelerating Medicines Partnership (AMP) study isolated human cells from joint synovial tissue (n = 8,920 cells from n = 26 human samples) and performed scRNA-seq; flow cytometry was also performed. The proteins observed in that study were well-established lineage-specific markers for four different cell types: CD45+CD3+ T cells, CD45+CD3-CD19+ B cells, CD45+CD14+ monocytes, and CD45-CD31-PDPN+ fibroblasts, allowing us to compare the previously established flow cytometry labels with the labels generated by our method. Using only the transcriptional measurements for each cell, the technique recovered 98.2% of the flow cytometry labels (95% C.I. [98.0%; 98.5%], p-value < 0.001, two-sided binomial test, n = 8,334 cells). In addition, the technique produced accurate classifications when as few as 200 unique genes were detected per cell (95.2% average recall; 95% C.I. [76.2%; 99.9%], p-value < 0.001, two-sided binomial test; n = 21 cells), demonstrating that the technique is robust when classifying cells with low sequencing depth. Next, we turned our attention to the classification of cell types that extends beyond the flow cytometry panel, reaching the deepest level of the technique's annotation hierarchy and resulting in new cell type annotations. To help verify these annotations, we note that the markers identified here are consistent with well-established biology (e.g., FOXP3 in regulatory T cells and CD19 in B cells), indicating that the technique accurately classifies these cell types. However, we also note that CD19 transcripts were detected in only 46.9% (n = 734/1,564) of CD45+CD3-CD19+ B cells, demonstrating the importance of using a cell type classifier (i.e., this technique) to identify cell phenotypes in scRNA-seq data.
The fact that this technique achieves accurate classification in single cell data is surprising. Single cell data is considered technically distinct from sequencing experiments performed on collections of cells. For example, the underlying distribution of gene transcripts in single cell data is close to Poisson (or negative binomial), and dropout (transcripts that are present but go undetected, or are transiently absent, in a cell) is a defining feature of this data. It is hypothesized that the use of neural networks helps overcome these limitations because of the nonlinearity of the classifier and the ability of a neural network to classify based on subtle changes in the gene expression profiles that distinguish cell types. This technique allows reliable classification of data from different samples, tissues, species, and diseases sequenced using well-based or droplet-based techniques. The observed changes in cell phenotype are consistent with known macrophage biology, and thus the technique can be used to study changes in cell phenotype arising from the biological background of the dataset. Such consistent identification across tissues, diseases, and species is not possible with other methods, such as some surface protein measurements in flow cytometry (FACS) analysis, because those measurements are context dependent. Thus, this is the only known measurement-based classification that is unbiased (as explained elsewhere in this document).
In another example, novel cell type populations were classified based on single cell data. New data was introduced and compared with the bootstrapped data described above, which allowed the technique to learn and refine classifications for regulatory T cells, γδ T cells, and plasmacytoid dendritic cells from the training dataset. Notably, pDCs were classified in additional datasets, demonstrating that the technique learns cell populations across different single cell datasets.
In another example, the technique is used to classify cells from model organisms, which typically lack flow-sorted reference datasets. The technique classified cynomolgus monkey and mini-pig PBMCs by using homologous gene symbols across species, without any additional species-specific training.
In an example, the technique is applicable to the study of disease biology using four different data sets. This analysis reveals shared and distinct markers for cell types and identifies cell types enriched in diseased tissue. This technique identified two enriched populations in the dataset.
In an example, the technique is used on large data sets. This technique classifies a large number (i.e., >300,000 cells) of scRNA-seq data.
One such example is shown in FIG. 5. The technique (a) accurately and consistently maps single cell identities to a detailed hierarchy of known immunophenotypes; (b) identifies novel populations of cells; and (c) reveals disease biology from single cell data. In general, the method converts scRNA-seq data into an objective readout that can be used to study immune cells across diseases, technologies, species, and tissues.
To annotate cell phenotypes in single cell data, the techniques described herein can use machine learning to classify each cell in unlabeled scRNA-seq data according to a detailed hierarchy of immunophenotypes and/or non-immunophenotypes (such as fibroblasts, endothelial cells, and epithelial cells). It will be appreciated that the technique may also be applied to other phenotypes. The method is based on a neural network classifier trained on a reference dataset of bulk gene expression profiles of pure cell types derived from flow-sorted cells. The training includes identifying transcriptional gene signatures of cell types using differential gene expression analysis and/or other sources of previously established gene signatures. Some of these sources may contain as few as one or two samples per cell population, which may be too few for machine learning methods that typically require hundreds or thousands of samples. To generate useful training data, the techniques described herein bootstrap a dataset from the rare samples in order to train a machine learning classifier, such as a neural network classifier.
In one example implementation, the technique uses a reference dataset of pure cell types with 713 microarray samples annotated to 157 cell types. In this dataset, ribosomal protein and mitochondrial genes were removed, bone marrow derived samples were removed (leaving n = 544 samples corresponding to the remaining 113 cell types), and a subset of genes previously broadly identified as exhibiting cell type specific expression was used (n = 10,808). Within this subset, the technique, using relative-count normalization, identified genes in the dataset that were significantly (p-value < 0.05) differentially expressed between samples annotated as different cell types. This did not produce differentially expressed genes for comparisons between memory B cells and naive B cells, between plasma cells and B cells, between memory CD4 T cells and naive CD4 T cells, between regulatory T cells and memory CD4 T cells, between memory CD8 T cells and naive CD8 T cells, or between effector memory CD8 T cells and central memory CD8 T cells; in these cases the technique used previously identified gene signatures.
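As an illustration of the differential expression step, the sketch below flags genes whose expression differs between two groups of reference samples using a Wilcoxon rank-sum test at p < 0.05; multiple-testing correction and the exact normalization used by the technique are omitted, and the toy data and gene names are placeholders.

```python
import numpy as np
from scipy.stats import ranksums

def signature_genes(expr_a: np.ndarray, expr_b: np.ndarray, gene_names, alpha=0.05):
    """Return genes significantly differentially expressed between two
    groups of samples (rows = samples, columns = genes)."""
    hits = []
    for j, gene in enumerate(gene_names):
        stat, p = ranksums(expr_a[:, j], expr_b[:, j])
        if p < alpha:
            hits.append((gene, p))
    return hits

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, size=(20, 3))    # group A reference samples (toy data)
b = rng.normal(5.0, 1.0, size=(20, 3))    # group B reference samples (toy data)
b[:, 0] += 2.0                             # first gene shifted in group B
print(signature_genes(a, b, ["GENE1", "GENE2", "GENE3"]))
```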
To create a predictive model of cell type, a training dataset is first created from the samples in the dataset by pooling the samples at each level of the hierarchy shown in FIG. 5, bootstrapping by random resampling with replacement within each group (e.g., n = 1,000 bootstraps for immune cells, n = 1,000 bootstraps for lymphocytes, etc.), and then sampling from a random normal distribution whose mean and standard deviation are set by the mean and standard deviation of the resampled features. The technique then trains a set of neural networks (n = 100) with automatically optimized hyperparameters.
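A sketch of this pooling-and-bootstrapping step, under the assumption that each group's synthetic rows are drawn from a normal distribution parameterized by the resampled samples, might look like the following; the group names and matrix shapes are placeholders.

```python
import numpy as np

def bootstrap_group(samples: np.ndarray, n_boot: int = 1000, rng=None) -> np.ndarray:
    """Create n_boot synthetic training rows for one group in the hierarchy.

    1. Resample the group's samples with replacement.
    2. Draw each synthetic row from a normal distribution whose mean and
       standard deviation come from the resampled features.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, samples.shape[0], size=n_boot)     # resample with replacement
    resampled = samples[idx]
    mu, sigma = resampled.mean(axis=0), resampled.std(axis=0)
    return rng.normal(mu, sigma, size=(n_boot, samples.shape[1]))

# Pool reference samples at one level of the hierarchy, e.g. "immune" and "lymphocyte".
rng = np.random.default_rng(1)
reference = {"immune": rng.random((12, 5)), "lymphocyte": rng.random((4, 5))}
training = {group: bootstrap_group(X, n_boot=1000, rng=rng)
            for group, X in reference.items()}
print({g: X.shape for g, X in training.items()})   # each group -> (1000, 5)
```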
The technique then constructs a k-nearest neighbor (KNN) network. After classification with each neural network, each cell is assigned the most frequent label among itself and its nearest neighbors. Each cell's unique identifier (e.g., a barcode) is assigned the cell type label corresponding to the highest probability derived from the average over the set of neural networks (n = 100); that is, the probabilities are averaged over the set of classifiers and the cell type label corresponds to the highest averaged probability. The technique generates a report of the prediction error (standard deviation). Individual cell barcodes in the KNN network are labeled "unclassified" when they have a large (more than 2 standard deviations greater than the mean) normalized Shannon entropy within their four nearest neighbors. This process may occur at any level of the hierarchy (e.g., an unclassified T cell subtype).
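The sketch below illustrates the ensemble averaging, nearest-neighbor label smoothing, and entropy-based "unclassified" flag just described; the array layouts and helper names are assumptions, and the neighbor index lists are taken as precomputed.

```python
import numpy as np
from collections import Counter

def ensemble_labels(prob_stack: np.ndarray, class_names) -> np.ndarray:
    """Average probabilities over the ensemble (axis 0) and take the argmax."""
    mean_prob = prob_stack.mean(axis=0)                 # (n_cells, n_classes)
    return np.asarray(class_names)[mean_prob.argmax(axis=1)]

def smooth_labels(labels: np.ndarray, neighbors) -> np.ndarray:
    """Assign each cell the most frequent label among itself and its neighbors."""
    smoothed = []
    for i, nbrs in enumerate(neighbors):
        votes = Counter([labels[i], *labels[nbrs]])
        smoothed.append(votes.most_common(1)[0][0])
    return np.array(smoothed)

def unclassified_mask(labels: np.ndarray, neighbors) -> np.ndarray:
    """Flag cells whose 4-nearest-neighbor label entropy is > mean + 2*std."""
    ent = []
    for nbrs in neighbors:
        counts = np.array(list(Counter(labels[nbrs[:4]]).values()), dtype=float)
        p = counts / counts.sum()
        h = -(p * np.log2(p)).sum()
        h_max = np.log2(len(counts)) if len(counts) > 1 else 1.0
        ent.append(h / h_max)
    ent = np.array(ent)
    return ent > ent.mean() + 2 * ent.std()
```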
Cell barcodes labeled "unclassified" that significantly (p < 0.01, hypergeometric distribution test) occupy a Louvain cluster in the KNN network are relabeled with tags corresponding to the top two expressed genes, as determined by z-score transformation. The tag may extend from the last node that was classified (e.g., "T cell, other").
For single cell classification, the technique's analysis starts from unfiltered counts. First, all cell barcodes with fewer than 200 detected genes are removed. Next, all cell barcodes with a large (greater than the mean plus two standard deviations) percentage of mitochondrial gene expression are removed. Next, all genes not detected in any cell barcode, as well as all mitochondrial and ribosomal genes, are removed. The library size is then normalized to the mean library size.
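A sketch of these filtering steps is shown below, assuming a barcodes-by-genes count matrix and the common "MT-"/"RPL"/"RPS" gene-name prefixes for mitochondrial and ribosomal genes; both the orientation and the naming convention are assumptions.

```python
import numpy as np

def qc_filter(counts: np.ndarray, gene_names):
    """counts: barcodes x genes matrix of unfiltered UMI counts (rows = barcodes)."""
    gene_names = list(gene_names)

    # 1. Remove barcodes with fewer than 200 detected genes.
    counts = counts[(counts > 0).sum(axis=1) >= 200]

    # 2. Remove barcodes whose mitochondrial fraction exceeds mean + 2*std.
    mito = np.array([g.startswith("MT-") for g in gene_names])
    frac = counts[:, mito].sum(axis=1) / np.maximum(counts.sum(axis=1), 1)
    counts = counts[frac <= frac.mean() + 2 * frac.std()]

    # 3. Remove genes never detected, plus mitochondrial and ribosomal genes.
    ribo = np.array([g.startswith(("RPL", "RPS")) for g in gene_names])
    keep = (counts.sum(axis=0) > 0) & ~mito & ~ribo
    counts, genes = counts[:, keep], [g for g, k in zip(gene_names, keep) if k]

    # 4. Normalize each barcode to the mean library size.
    lib = counts.sum(axis=1, keepdims=True)
    counts = counts / np.maximum(lib, 1) * lib.mean()
    return counts, genes
```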
To classify cell types, the technique takes the subset of the expression matrix corresponding to the intersection of the gene signatures in the reference dataset and the genes in the scRNA-seq matrix. After this step, each cell barcode is normalized to the mean library size, and then each gene is scaled by dividing by the maximum expression value of that gene in any cell barcode. Any genes with zero standard deviation are removed. Next, K-soft imputation is performed, and the data is scaled again.
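The subset-normalize-scale sequence could be sketched as follows (the imputation step is shown separately after the formula below); the helper name and the small epsilon guards are illustrative.

```python
import numpy as np

def prepare_for_classifier(counts: np.ndarray, genes, signature_genes):
    """Subset to signature genes, normalize, and max-scale per gene."""
    genes = list(genes)
    keep = [i for i, g in enumerate(genes) if g in set(signature_genes)]
    X = counts[:, keep].astype(float)

    # Normalize each cell barcode (row) to the mean library size of the subset.
    lib = X.sum(axis=1, keepdims=True)
    X = X / np.maximum(lib, 1e-9) * lib.mean()

    # Scale each gene by its maximum value across barcodes, then drop
    # genes with zero standard deviation.
    X = X / np.maximum(X.max(axis=0), 1e-9)
    nonconstant = X.std(axis=0) > 0
    return X[:, nonconstant], [genes[i] for i, k in zip(keep, nonconstant) if k]
```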
By systematically identifying cell type populations in single cell data, the technique identifies generic feature vectors and context-specific feature vectors. We predict that these feature vectors will be useful for several techniques that require well-established cellular gene expression vectors, such as gene-expression-based enrichment/signature scoring (e.g., GSVA/GSEA) and cell type deconvolution (e.g., CIBERSORT).
To impute the value of each gene in each cell, the total number of genes detected in each cell is set as the diagonal of a cell-by-cell matrix W_jj. Next, the technique uses the adjacency matrix A_jj, and the k-th power of A_jj, to establish cells with direct and higher-order (up to degree k) connections in the KNN network, forming a network-based imputation operator D_jj. The operator is weighted by the total number of genes detected in each cell and normalized such that each row sums to two. The estimated expression matrix E'_ij is then calculated directly by operating on the observed expression matrix E_ij:

E'_ij = E_ij · D_jj
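A sketch of this network-based imputation is shown below; because the translation does not fully specify how the operator D_jj is assembled, the connectivity accumulation, weighting order, and row normalization here are assumptions rather than the patent's exact construction.

```python
import numpy as np

def impute(E: np.ndarray, A: np.ndarray, k: int = 2, row_sum: float = 2.0) -> np.ndarray:
    """E: genes x cells observed expression (E_ij); A: cells x cells KNN adjacency.

    Build a network-based operator D_jj from direct and up-to-k-step
    connections, weight it by the number of genes detected in each cell,
    normalize its rows, and apply it to the observed expression matrix.
    """
    n = A.shape[0]
    W = np.diag((E > 0).sum(axis=0).astype(float))      # W_jj: detected genes per cell
    C = np.zeros((n, n))
    P = np.eye(n)
    for _ in range(k):                                   # direct and higher-order links
        P = P @ A
        C += (P > 0)
    D = C @ W                                            # weight by detected-gene counts
    D = D / np.maximum(D.sum(axis=1, keepdims=True), 1e-9) * row_sum
    return E @ D                                         # E'_ij = E_ij D_jj
```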

Claims (33)

1. A system for sensing data from a sample of biological cells, the system comprising:
a cell sampler comprising a sample receiver and one or more sensors, wherein the cell sampler is configured to:
sense a physical phenomenon of biological cells in the sample receiver with the one or more sensors; and
transmit sensor data generated from the sensing of the biological cells to a processing device; and
a processing device comprising a computer memory and one or more processors, the processing device configured to:
receive the sensor data from the cell sampler;
identify individual cells of the biological cells using the sensor data;
for each individual cell:
generate a cell type for the individual cell using the sensor data;
generate a feature vector for the individual cell using the sensor data;
classify at least some of the cell types as rare using the sensor data;
for each rare cell type:
access the feature vectors of individual cells of the rare cell type;
generate a bootstrap vector for the rare cell type by applying noise to the feature vectors of individual cells of the rare cell type; and
generate a cell database by aggregating the bootstrap vectors and the feature vectors of individual cells of common cell types.
2. The system of claim 1, wherein the processing device is further configured to perform at least one of the group consisting of: i) storing at least one of the cell databases in a data repository as a result of sensing the biological cells; ii) transmitting a report of at least one of the cell databases over a data network; and iii) in response to generating at least one of the cell databases, initiating an automated process without specific user input to initiate the automated process.
3. The system of claim 1, wherein to generate a cell type of the individual cell using the sensor data, the processing device is further configured to submit the sensor data to one or more machine learning classifiers configured to receive the sensor data as input and generate an indication of a cell type as output.
4. The system of claim 3, wherein the one or more machine-learned classifiers comprise a plurality of classifiers arranged in a hierarchical decision tree with a set of machine-learned classifiers configured to vote on a classification at each of a plurality of nodes of the decision tree.
5. The system of claim 4, wherein the root node of the decision tree has child nodes for immune cells and child nodes for non-immune cells.
6. The system of claim 3, wherein:
the machine learning classifier is trained on an initial database of training data; and
the processing device is further configured to:
generate an updated database of training data by incorporating at least one of the cell databases into the initial database; and
train an updated machine learning classifier using the updated database.
7. The system of claim 6, wherein the processing device is further configured to:
identify one of the individual cells as a high entropy cell based on the high entropy cell being found in a cluster having a high entropy level;
disassociate the generated cell type from the high entropy cell; and
classify the high entropy cell as a novel cell type.
8. The system of claim 1, wherein the processing device is further configured to:
identifying one of the individual cells as a high entropy cell as a result of the high entropy cell being found in a cluster having a high entropy level;
disassociating the generated cell type from the high entropy cell; and
performing at least one of the group consisting of: i) storing information about the high entropy cells to a data repository as a result of sensing the biological cells; ii) transmitting a report on the high entropy cells over a data network; and iii) in response to identifying the high entropy cell, initiating an automated process without specific user input to initiate the automated process.
9. The system of claim 8, wherein identifying one of the individual cells as a high entropy cell comprises calculating a Shannon entropy value of the high entropy cell.
10. The system of any one of claims 1-2, wherein the noise is generated based on a statistical measure of previously analyzed cells.
11. The system of any of claims 1-2, wherein the processing device is further configured to generate the noise based on a statistical measure of the sensor data.
12. A method for sensing data from a sample of biological cells, the method comprising:
identifying individual cells of the biological cells using the sensor data;
for each individual cell:
generating a cell type of the individual cell using the sensor data;
generating a feature vector of the individual cells using the sensor data;
classifying at least some of the cell types as rare using the sensor data;
for each rare cell type:
accessing a feature vector of individual cells of the rare cell type;
generating a bootstrap vector for the rare cell type by applying noise to a feature vector of individual cells of the rare cell type; and
generating a cell database by aggregating the bootstrap vectors and the feature vectors of individual cells of common cell types.
13. The method of claim 12, the method further comprising at least one of the group consisting of: i) storing at least one of the cell databases into a data repository as a result of sensing the biological cells; ii) transmitting a report of at least one of the cell databases over a data network; and iii) in response to generating at least one of the cell databases, initiating an automated process without specific user input to initiate the automated process.
14. The method of claim 12, wherein generating a cell type of the individual cell using the sensor data comprises submitting the sensor data to one or more machine-learning classifiers configured to receive the sensor data as input and generate an indication of a cell type as output.
15. The method of claim 14, wherein the one or more machine-learning classifiers comprise a plurality of classifiers arranged in a hierarchical decision tree, with a set of machine-learning classifiers configured to vote on a classification at each of a plurality of nodes of the decision tree.
16. The method of claim 15, wherein the root node of the decision tree has child nodes for immune cells and child nodes for non-immune cells.
17. The method of claim 14, wherein:
the machine learning classifier is trained on an initial database of training data; and
the method further comprises:
generating an updated database of training data by incorporating at least one of the cell databases into the initial database; and
training an updated machine learning classifier using the updated database.
18. The method of claim 17, the method further comprising:
identifying one of the individual cells as a high entropy cell as a result of the high entropy cell being found in a cluster having a high entropy level;
disassociating the generated cell type from the high entropy cell; and
classifying the high entropy cell as a novel cell type.
19. The method of claim 12, the method further comprising:
identifying one of the individual cells as a high entropy cell as a result of the high entropy cell being found in a cluster having a high entropy level;
Disassociating the generated cell type from the high entropy cell; and
performing at least one of the group consisting of: i) storing information about the high entropy cells to a data repository as a result of sensing the biological cells; ii) transmitting a report on the high entropy cells over a data network; and iii) in response to identifying the high entropy cell, initiating an automated process without specific user input to initiate the automated process.
20. The method of claim 19, wherein identifying one of the individual cells as a high entropy cell comprises calculating a Shannon entropy value of the high entropy cell.
21. The method of any one of claims 12 to 13, wherein the noise is generated based on a statistical measure of previously analyzed cells.
22. The method of any of claims 12 to 13, further comprising generating the noise based on a statistical measure of the sensor data.
23. A computer-readable medium tangibly storing instructions that, when executed by one or more processors, cause the processors to perform operations comprising:
identifying individual cells in the collection of biological cells using the sensor data;
for each individual cell:
generating a cell type of the individual cell using the sensor data;
generating a feature vector of the individual cells using the sensor data;
classifying at least some of the cell types as rare using the sensor data;
for each rare cell type:
accessing a feature vector of individual cells of the rare cell type;
generating a bootstrap vector for the rare cell type by applying noise to a feature vector of individual cells of the rare cell type; and
generating a cell database by aggregating the bootstrap vectors and the feature vectors of individual cells of common cell types.
24. The computer-readable medium of claim 23, the operations further comprising at least one of the group consisting of: i) storing at least one of the cell databases into a data repository as a result of sensing the biological cells; ii) transmitting a report of at least one of the cell databases over a data network; and iii) in response to generating at least one of the cell databases, initiating an automated process without specific user input to initiate the automated process.
25. The computer-readable medium of claim 23, wherein generating a cell type of the individual cell using the sensor data comprises submitting the sensor data to one or more machine-learning classifiers configured to receive the sensor data as input and generate an indication of a cell type as output.
26. The computer-readable medium of claim 25, wherein the one or more machine-learning classifiers comprise a plurality of classifiers arranged in a hierarchical decision tree, with a set of machine-learning classifiers configured to vote on a classification at each of a plurality of nodes of the decision tree.
27. The computer-readable medium of claim 26, wherein a root node of the decision tree has child nodes for immune cells and child nodes for non-immune cells.
28. The computer-readable medium of claim 25, wherein:
the machine learning classifier is trained on an initial database of training data; and
the operations further comprise:
generating an updated database of training data by incorporating at least one of the cell databases into the initial database; and
training an updated machine learning classifier using the updated database.
29. The computer-readable medium of claim 28, the operations further comprising:
identifying one of the individual cells as a high entropy cell as a result of the high entropy cell being found in a cluster having a high entropy level;
disassociating the generated cell type from the high entropy cell; and
classifying the high entropy cell as a novel cell type.
30. The computer-readable medium of claim 23, the operations further comprising:
identifying one of the individual cells as a high entropy cell as a result of the high entropy cell being found in a cluster having a high entropy level;
disassociating the generated cell type from the high entropy cell; and
performing at least one of the group consisting of: i) storing information about the high entropy cells to a data repository as a result of sensing the biological cells; ii) transmitting a report on the high entropy cells over a data network; and iii) in response to identifying the high entropy cell, initiating an automated process without specific user input to initiate the automated process.
31. The computer-readable medium of claim 30, wherein identifying one of the individual cells as a high entropy cell comprises calculating a Shannon entropy value of the high entropy cell.
32. The computer-readable medium of any one of claims 23 to 24, wherein the noise is generated based on a statistical measure of previously analyzed cells.
33. The computer-readable medium of any one of claims 23 to 24, the operations further comprising generating the noise based on a statistical measure of the sensor data.
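The independent claims all turn on the same data-augmentation step: bootstrap vectors for a rare cell type are produced by applying noise to that type's feature vectors, where the noise is derived from a statistical measure of the sensor data or of previously analyzed cells (claims 10-11, 21-22 and 32-33), and candidate novel cells are flagged via a Shannon entropy value (claims 9, 20 and 31). The sketch below shows one plausible reading of those steps; the Gaussian noise model, the per-feature standard deviation as the statistical measure, and all function and variable names are illustrative assumptions rather than the patent's definitive implementation.

```python
import numpy as np

def bootstrap_rare_cell_type(feature_vectors, n_needed, rng=None):
    """Generate bootstrap vectors for a rare cell type by applying noise to
    its observed feature vectors (one reading of the augmentation in claim 1).

    feature_vectors : (n_rare_cells x n_features) observed feature vectors
    n_needed        : number of bootstrap vectors to generate
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(feature_vectors, dtype=float)

    # Statistical measure of the sensor data: per-feature standard deviation
    # across the observed rare cells (the claims leave the exact measure open).
    if X.shape[0] > 1:
        sigma = X.std(axis=0, ddof=1)
    else:
        sigma = np.full(X.shape[1], 1e-3)  # fallback when only one cell was seen

    # Resample observed vectors with replacement and perturb them with noise
    # whose spread matches the measured per-feature variation.
    picks = rng.integers(0, X.shape[0], size=n_needed)
    noise = rng.normal(loc=0.0, scale=sigma, size=(n_needed, X.shape[1]))
    return X[picks] + noise

def build_cell_database(common_vectors, rare_vectors_by_type, n_per_rare_type=500):
    """Aggregate bootstrap vectors with the feature vectors of individual cells
    of common cell types to form the cell database."""
    parts = [np.asarray(common_vectors, dtype=float)]
    for vectors in rare_vectors_by_type.values():
        parts.append(bootstrap_rare_cell_type(vectors, n_per_rare_type))
        parts.append(np.asarray(vectors, dtype=float))
    return np.vstack(parts)

def shannon_entropy(probabilities):
    """Shannon entropy (in bits) of, e.g., a cell's per-type membership
    probabilities; a high value marks a candidate high entropy cell."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

How many synthetic vectors to draw per rare type, and whether Gaussian noise is an adequate model, would in practice be tuned against held-out classifier performance on the updated training database.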
CN202280019636.3A 2021-01-15 2022-01-13 Sensing biological cells in a sample for cell type identification Pending CN116997972A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163137843P 2021-01-15 2021-01-15
US63/137,843 2021-01-15
EP21315038.6 2021-03-17
PCT/US2022/012303 WO2022155328A1 (en) 2021-01-15 2022-01-13 Sensing of biological cells in a sample for cell type identification

Publications (1)

Publication Number Publication Date
CN116997972A true CN116997972A (en) 2023-11-03

Family

ID=88527081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280019636.3A Pending CN116997972A (en) 2021-01-15 2022-01-13 Sensing biological cells in a sample for cell type identification

Country Status (1)

Country Link
CN (1) CN116997972A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination