CN116189771A - Cell type detection method, system, device and medium - Google Patents

Publication number
CN116189771A
Authority
CN
China
Prior art keywords
cell
cells
expression
detected
cell type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211622308.0A
Other languages
Chinese (zh)
Inventor
肖雨沙
李锡丹
韩蓝青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Original Assignee
CYAGEN BIOSCIENCES (GUANGZHOU) Inc
Research Institute Of Tsinghua Pearl River Delta
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CYAGEN BIOSCIENCES (GUANGZHOU) Inc and Research Institute Of Tsinghua Pearl River Delta
Priority to CN202211622308.0A
Publication of CN116189771A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application discloses a method, system, device and medium for detecting cell types. The method acquires whole-genome expression data and marker gene information for a batch of cells to be detected, together with an image containing those cells; determines, from the whole-genome expression data and the marker gene information, a first expression level of the marker genes for each cell to be detected; screens out a set of significantly expressing cells according to the first expression level, the set comprising a plurality of first cells; determines the cell type of each first cell from its whole-genome expression data and marker gene information; and determines the cell type of each second cell by a K-nearest neighbor algorithm according to the cell types of the first cells and the image, a second cell being any cell to be detected other than a first cell. The method can determine the types of a large number of cells rapidly and accurately and can be applied to single-cell type labeling. The application is widely applicable in the technical field of bioinformatics.

Description

Cell type detection method, system, device and medium
Technical Field
The application relates to the technical field of bioinformatics, and in particular to a method, a system, a device and a medium for detecting cell types.
Background
Single-cell sequencing refers to an emerging technology for high-throughput sequencing and analysis of genomes and transcriptomes at the level of individual cells. It is generally used to reveal heterogeneity between cells, which helps to understand cell functions and cell-to-cell crosstalk in depth, improves understanding of complex ecosystems and diseases in multicellular organisms, and allows the pathogenic cells of complex diseases to be located precisely. With massive amounts of single-cell sequencing data now being generated, comprehensively integrating and analyzing these data and mining the information they contain has become a shared goal of the bioinformatics field.
However, there is currently no ideal method for labeling a large number of single cells with cell types efficiently and accurately when performing single-cell sequencing tasks. In the related art, most single-cell type labeling relies on a cell marker database; that is, cell types are detected from the high expression of marker genes. However, some cells may have lost genes or be in a division and growth stage, so that the expression level of their marker genes is significantly lower than normal, and cell types determined in this way therefore tend to be less accurate.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the related art to a certain extent.
It is therefore an object of embodiments of the present application to provide a method, system, device and medium for detecting a cell type.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the application comprises the following steps:
in one aspect, embodiments of the present application provide a method for detecting a cell type, the method comprising:
acquiring whole genome expression quantity data and marker gene information of a batch of cells to be detected and an image comprising the batch of cells to be detected;
determining a first expression level of a marker gene corresponding to each cell to be detected according to the whole genome expression level data and the marker gene information;
screening out a significant expression cell set from the cells to be detected according to the first expression quantity; the set of significantly expressed cells includes a number of first cells;
determining the cell type of the first cell according to the whole genome expression amount data and the marker gene information corresponding to the first cell;
determining the cell type of a second cell by a K-nearest neighbor algorithm according to the cell type corresponding to each first cell and the image; the second cells are the cells to be detected other than the first cells.
In addition, the method for detecting a cell type according to the above embodiment of the present application may further have the following additional technical features:
further, in one embodiment of the present application, after the step of obtaining the whole genome expression amount data of the batch of cells to be detected, the method further comprises:
converting the whole genome expression amount data into matrix data; the matrix data comprises n rows and 1 column, wherein the numerical element of each row corresponds to the expression amount data of one gene, the rows are arranged according to a preset order of the gene types, and n is a positive integer.
Further, in one embodiment of the present application, obtaining marker gene information includes:
obtaining species information of the cell to be detected;
and acquiring the marker gene information according to the species information.
Further, in one embodiment of the present application, the screening the significantly expressed cell set from the cells to be detected according to the first expression level includes:
calculating the average value of the expression quantity of the marker genes corresponding to the cells to be detected according to the first expression quantity;
and screening out a significant expression cell set from the cells to be detected according to the first expression quantity and the expression quantity average value.
Further, in one embodiment of the present application, the screening the significant expression cell set from the cells to be detected according to the first expression level and the expression level average value includes:
calculating the expression reference value of the cell to be detected by the following formula:
Figure BDA0004002923230000021
wherein P_i represents the expression reference value of the i-th cell to be detected, k_i represents the first expression level of the marker gene of the i-th cell to be detected, λ represents the average value of the expression levels, and e represents Euler's number (the base of the natural logarithm);
and screening out a significant expression cell set from the cells to be detected according to the expression reference value.
Further, in one embodiment of the present application, the determining, by a K-nearest neighbor algorithm, the cell type of the second cell according to the cell type corresponding to each of the first cells and the image includes:
training a KNN model according to the cell type corresponding to each first cell and the image, and determining a K value in the KNN model;
determining from the image the K first cells closest to the second cell;
determining the cell type that occurs most often among the K first cells, and taking that cell type as the cell type of the second cell.
Further, in one embodiment of the present application, said determining K first cells closest to the second cell from the image comprises:
calculating the Euclidean distance between the second cell and each of the first cells;
and sorting the first cells by Euclidean distance, and selecting from them the K first cells with the smallest Euclidean distance.
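The nearest-neighbor steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes each cell is represented by hypothetical (x, y) centroid coordinates extracted from the image (the source does not specify which image features are used), and the function name `knn_cell_type` and the sample data are invented for the example.

```python
import math
from collections import Counter


def knn_cell_type(second_xy, first_cells, k):
    """Predict the type of one second cell from labeled first cells.

    first_cells is a list of ((x, y), cell_type) pairs; the coordinates are
    assumed to be cell centroids measured in the image (an assumption).
    """
    # Sort the first cells by Euclidean distance to the second cell.
    ranked = sorted(first_cells, key=lambda item: math.dist(second_xy, item[0]))
    # Keep the K closest first cells and collect their types.
    nearest_types = [cell_type for _, cell_type in ranked[:k]]
    # Majority vote: the most frequent type among the K neighbors wins.
    return Counter(nearest_types).most_common(1)[0][0]


# Illustrative data: three first cells of type A and two of type B.
first_cells = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"),
    ((9.0, 8.5), "B"),
]

near_a = knn_cell_type((2.0, 2.0), first_cells, k=3)
near_b = knn_cell_type((8.5, 9.0), first_cells, k=3)
print(near_a, near_b)
```

With K = 3, the second cell near the upper-right group still has one type A cell among its three neighbors, but the two type B neighbors outvote it.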
In another aspect, embodiments of the present application provide a device for detecting a cell type, the device comprising:
the acquisition module is used for acquiring whole genome expression quantity data and marker gene information of a batch of cells to be detected and an image comprising the batch of cells to be detected;
the processing module is used for determining the first expression quantity of the marker gene corresponding to each cell to be detected according to the whole genome expression quantity data and the marker gene information;
the screening module is used for screening out a significant expression cell set from the cells to be detected according to the first expression quantity; the set of significantly expressed cells includes a number of first cells;
the first labeling module is used for determining the cell type of the first cell according to the whole genome expression quantity data and the marker gene information corresponding to the first cell;
the second labeling module is used for determining the cell type of the second cell through a K-nearest neighbor algorithm according to the cell type corresponding to each first cell and the image; the second cells are the cells to be detected other than the first cells.
In another aspect, embodiments of the present application provide a computer device, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a method of detecting a cell type as described above.
In another aspect, embodiments of the present application further provide a computer readable storage medium having stored therein a program executable by a processor, where the program executable by the processor is configured to implement the method for detecting a cell type described above when executed by the processor.
The advantages and benefits of the present application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the present application.
The embodiments of the present application disclose a method for detecting a cell type, comprising: acquiring whole genome expression quantity data and marker gene information of a batch of cells to be detected and an image comprising the batch of cells to be detected; determining a first expression level of a marker gene corresponding to each cell to be detected according to the whole genome expression level data and the marker gene information; screening out a significant expression cell set from the cells to be detected according to the first expression quantity, the set including a number of first cells; determining the cell type of each first cell according to the whole genome expression amount data and the marker gene information corresponding to that first cell; and determining the cell type of each second cell by a K-nearest neighbor algorithm according to the cell type corresponding to each first cell and the image, the second cells being the cells to be detected other than the first cells. The method can rapidly and accurately determine the types of a large number of cells, can be applied to single-cell type labeling, and facilitates the subsequent execution of single-cell sequencing and other tasks.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the accompanying drawings are briefly described below. It should be understood that the drawings described below are provided only for convenience and clarity in describing some embodiments of the technical solutions of the present application, and that those skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for detecting a cell type according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a screening of a collection of significantly expressing cells provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of determining a cell type of a second cell by a K-nearest neighbor algorithm provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application is further described below with reference to the drawings and specific examples. The described embodiments should not be construed as limitations on the present application, and all other embodiments, which may be made by those of ordinary skill in the art without the exercise of inventive faculty, are intended to be within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Artificial intelligence (AI) is the theory, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions. Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Single-cell sequencing refers to an emerging technology for high-throughput sequencing and analysis of genomes and transcriptomes at the level of individual cells. It is generally used to reveal heterogeneity between cells, which helps to understand cell functions and cell-to-cell crosstalk in depth, improves understanding of complex ecosystems and diseases in multicellular organisms, and allows the pathogenic cells of complex diseases to be located precisely. With massive amounts of single-cell sequencing data now being generated, comprehensively integrating and analyzing these data and mining the information they contain has become a shared goal of the bioinformatics field.
However, there is currently no ideal method for labeling a large number of single cells with cell types efficiently and accurately when performing single-cell sequencing tasks. In the related art, most single-cell type labeling relies on a cell marker database; that is, cell types are detected from the high expression of marker genes. However, some cells may have lost genes or be in a division and growth stage, so that the expression level of their marker genes is significantly lower than normal, and cell types determined in this way therefore tend to be less accurate.
In view of this, the embodiments of the present application provide a method for detecting a cell type. When detecting cell types, the method uses the whole-genome expression data and marker gene information of the cells to be detected to determine a first expression level of the marker genes for each cell, screens out the first cells, in which the marker genes are significantly expressed, and determines the cell type of each first cell according to the expression of its marker genes. For the second cells, in which marker gene expression is not significant, the K-nearest neighbors (KNN) algorithm, a machine learning technique within artificial intelligence, is used to classify the second cells based on the image data of the cells, thereby determining their cell types. In this way, the types of a large number of cells can be determined rapidly and accurately; the method can be applied to single-cell type labeling and facilitates the subsequent execution of single-cell sequencing and other tasks.
Referring to FIG. 1, which is a flow chart of a method for detecting a cell type according to an embodiment of the present application, the method for detecting a cell type includes, but is not limited to:
Step 110, acquiring whole genome expression quantity data and marker gene information of a batch of cells to be detected and an image comprising the batch of cells to be detected;
In this step, whole-genome expression data, marker gene information, and image data are obtained for the cells whose cell type is to be detected. Commonly used gene expression indexes include FPKM and TPM; this application does not limit the index type. The gene expression data may be calculated with reference to the prior art, which is not described here. In the embodiments of the present application, the marker gene information records the marker genes of cells. A marker gene is a gene with a known function or known sequence that can serve as a specific marker. For cell type detection, the expression of marker genes often differs greatly between cell types, so cell types can be distinguished based on the expression data of the marker genes. It will be appreciated that the marker gene information may be collected through existing information channels, which this application does not limit. In some embodiments, the species to which the cells to be detected belong may first be obtained, and the marker gene information known for that species may then be queried based on the species information. The image of the batch of cells to be detected can be acquired by microscopic observation equipment, and the image can be preprocessed in advance, for example by filtering and denoising, to improve its data precision.
Step 120, determining a first expression level of a marker gene corresponding to each cell to be detected according to the whole genome expression level data and the marker gene information;
In this step, after the whole-genome expression data and marker gene information of each cell to be detected have been obtained, the marker genes can be located within the whole genome and their expression data looked up. A cell to be detected may have several marker genes, so the expression data of the individual marker genes can be summed to obtain the total marker gene expression, which this application calls the first expression level. It will be appreciated that if the first expression level of a cell to be detected is high overall, the cell is most likely in a state of normal gene expression, for example not in a division and growth stage. Conversely, if the first expression level is low overall, the cell may have lost genes or be in a division and growth stage; classifying it from the expression data of its marker genes is then likely to be inaccurate.
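The accumulation described in this step can be sketched as below. The sketch assumes, for illustration only, that one cell's whole-genome expression data is held as a mapping from gene name to expression value; the gene names, the values, and the function name `first_expression_level` are all invented for the example.

```python
def first_expression_level(expression, marker_genes):
    """Sum the expression values of the marker genes for one cell.

    expression maps gene name -> expression value (e.g. TPM or FPKM).
    Marker genes missing from the mapping count as 0.0 (an assumption).
    """
    return sum(expression.get(gene, 0.0) for gene in marker_genes)


# Made-up expression data for a single cell and a made-up marker gene list.
cell_expression = {"gene1": 4.2, "gene2": 0.0, "gene3": 0.1, "gene4": 35.0}
marker_genes = ["gene1", "gene2", "gene3"]

level = first_expression_level(cell_expression, marker_genes)
print(level)
```

Only the marker genes contribute to the total; the highly expressed non-marker gene `gene4` is ignored.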
Step 130, screening out a significant expression cell set from the cells to be detected according to the first expression quantity; the set of significantly expressed cells includes a number of first cells;
In this step, after the first expression level of each cell to be detected has been determined, a set of significantly expressing cells can be screened out of the cells to be detected according to the first expression level. The cells in this set are denoted first cells, and the other cells to be detected are denoted second cells. In other words, the embodiments of the present application divide the cells to be detected according to their first expression levels into a set of cells in which the marker genes are significantly expressed (first cells) and a set of cells in which they are not (second cells). For example, referring to FIG. 2, in some embodiments an expression level threshold may be set as a dividing line and the first expression level of each cell to be detected compared with it: if the first expression level exceeds the threshold, the cell is determined to be a first cell, i.e. significantly expressing; otherwise, it is determined to be a second cell, i.e. not significantly expressing. The expression level threshold may be set according to the specific case, and this application does not limit its value.
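A minimal sketch of this threshold-based split follows. The cell identifiers, the expression values, the threshold of 2.0, and the function name `split_by_threshold` are hypothetical; the source leaves the threshold value open.

```python
def split_by_threshold(first_levels, threshold):
    """Partition cells into (first_cells, second_cells).

    first_levels maps cell id -> first expression level. A cell is a first
    cell only if its level strictly exceeds the threshold.
    """
    first_cells = [cid for cid, lvl in first_levels.items() if lvl > threshold]
    second_cells = [cid for cid, lvl in first_levels.items() if lvl <= threshold]
    return first_cells, second_cells


# Made-up first expression levels for four cells.
first_levels = {"cell_1": 8.4, "cell_2": 0.6, "cell_3": 5.1, "cell_4": 1.9}

first_cells, second_cells = split_by_threshold(first_levels, threshold=2.0)
print(first_cells, second_cells)
```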
Step 140, determining the cell type of the first cell according to the whole genome expression amount data and the marker gene information corresponding to the first cell;
In this step, based on the foregoing description, the marker genes of a first cell are known to be significantly expressed, so its cell type can be determined from known statistical information according to the specific expression of its marker genes. For example, if cells of a certain cell type express one or more marker genes at significantly higher levels than other cell types do, whether each first cell belongs to that cell type can be determined by comparing the expression levels of these marker genes. More specifically, it may be determined whether the expression level of each marker gene of the first cell falls within the expression range of that marker gene for a known cell type, and the cell type to which the first cell most likely belongs is determined accordingly.
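One simple way to turn this comparison into code is sketched below. The scoring rule (mean expression of each type's marker genes, highest score wins) is an assumption chosen for illustration; the source only says that the type is determined from the expression of the marker genes and known statistical information. All names and values are hypothetical.

```python
def assign_cell_type(expression, markers_by_type):
    """Return (best_type, scores) for one first cell.

    Each candidate type is scored by the mean expression of its marker
    genes in this cell; this scoring rule is an illustrative assumption.
    """
    scores = {
        cell_type: sum(expression.get(g, 0.0) for g in genes) / len(genes)
        for cell_type, genes in markers_by_type.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type, scores


# Made-up marker gene lists for two cell types and one cell's expression.
markers_by_type = {"type_A": ["g1", "g2"], "type_B": ["g3", "g4"]}
cell_expression = {"g1": 0.2, "g2": 0.1, "g3": 3.5, "g4": 2.5}

best_type, scores = assign_cell_type(cell_expression, markers_by_type)
print(best_type, scores)
```

A range check against known per-type expression intervals, as the text also mentions, could replace the mean score without changing the overall structure.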
It should be emphasized that, in the embodiments of the present application, the first cells have already been screened as cells whose marker genes are significantly expressed, so the cell type determined in step 140 is relatively accurate. For a second cell, the low overall marker gene expression alone cannot show whether the low expression is characteristic of its cell type or results from some other abnormality, so its cell type must be determined by other means.
Step 150, determining the cell type of the second cell by a K-nearest neighbor algorithm according to the cell type corresponding to each first cell and the image; the second cells are the cells to be detected other than the first cells.
In this step, the positional relationship between cells in the image data reflects, to some extent, the correlation between their types. Cells of the same type generally lie in relatively close regions of the image, so once the cell types of the first cells have been determined, the second cells can be classified and labeled with a K-nearest neighbor algorithm. For example, referring to FIG. 3, suppose that in a region of the image most cells have been determined to belong to one of two types, cell type A and cell type B. For a second cell of unknown type, the KNN algorithm can then determine which group it is closer to, and hence whether it belongs to cell type A or cell type B. In this way, the cell type of a second cell whose marker genes are not significantly expressed can be determined more accurately.
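The method elsewhere states that a K value is determined when the KNN model is trained, but the source does not say how. One common option, assumed here purely for illustration, is leave-one-out validation over the labeled first cells: each first cell is hidden in turn, predicted from the remaining ones, and the K with the highest accuracy is kept. The (x, y) coordinates, the candidate K values, and the function names are hypothetical.

```python
import math
from collections import Counter


def knn_vote(query_xy, labeled, k):
    """Majority type among the k labeled cells closest to query_xy."""
    ranked = sorted(labeled, key=lambda item: math.dist(query_xy, item[0]))
    return Counter(t for _, t in ranked[:k]).most_common(1)[0][0]


def choose_k(labeled, candidate_ks):
    """Return (best_k, accuracy) by leave-one-out validation."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        hits = 0
        for i, (xy, true_type) in enumerate(labeled):
            rest = labeled[:i] + labeled[i + 1:]  # hide the i-th first cell
            hits += knn_vote(xy, rest, k) == true_type
        acc = hits / len(labeled)
        if acc > best_acc:  # strict '>' keeps the smallest K on ties
            best_k, best_acc = k, acc
    return best_k, best_acc


# Two spatial groups of first cells, plus one A-labeled cell sitting inside
# the B group (for example, a doubtful label).
labeled = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"), ((2.0, 2.0), "A"),
    ((8.0, 8.0), "B"), ((8.0, 9.0), "B"), ((9.0, 8.0), "B"), ((9.0, 9.0), "B"),
    ((8.5, 8.5), "A"),
]

best_k, best_acc = choose_k(labeled, candidate_ks=[1, 3, 5])
print(best_k, round(best_acc, 3))
```

In this toy data K = 1 performs poorly because every type B cell has the stray A-labeled cell as its single nearest neighbor, whereas K = 3 lets the surrounding B cells outvote it.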
In some embodiments, after the step of obtaining whole genome expression data for a batch of cells to be tested, the method further comprises:
converting the whole genome expression amount data into matrix data; the matrix data comprises n rows and 1 column, wherein the numerical element of each row corresponds to the expression amount data of one gene, the rows are arranged according to a preset order of the gene types, and n is a positive integer.
In the embodiments of the present application, after the whole-genome expression data of a cell to be detected has been obtained, it can be converted into matrix data to facilitate processing. Specifically, the matrix data has n rows and 1 column, where n is a positive integer equal to the number of entries in the whole-genome expression data, i.e. the number of genes in the whole genome of the cell to be detected. Each row of the matrix holds one numerical element, which corresponds to the expression data of one gene; the gene corresponding to each row may be assigned in advance, in a predetermined order, according to the species information. For example, suppose that the whole genome of a certain species contains 4 genes whose expression data differ between the various cells of the species. For this species, an ordering of the genes may be set in advance, for example by the initial letter of the gene name or by function, which this application does not limit. Once the ordering has been fixed, the whole-genome expression data acquired for any cell to be detected of that species can be arranged into matrix data in that order.
For example, assume that the whole genome includes four genes A, B, C and D, and that the obtained whole genome expression amount data are as follows: the expression amount of gene A is 0.94, that of gene B is 0, that of gene C is 2.13, and that of gene D is 1.36. If the preset order of the gene types in the whole genome is C-B-A-D, then converting the whole genome expression amount data yields matrix data with 4 rows and 1 column, in which the first row is 2.13, the second row is 0, the third row is 0.94, and the fourth row is 1.36. It can be understood that, because the whole genome expression amount data is converted into matrix data according to a preset gene order, the data structure is unified: for cells of the same species, the data at the same position of the matrix always has the same meaning, which facilitates subsequent processing.
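The conversion described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `to_column_matrix` and the use of a dictionary keyed by gene name are assumptions made for the example and are not specified by the application.

```python
from typing import Dict, List


def to_column_matrix(expression: Dict[str, float],
                     gene_order: List[str]) -> List[List[float]]:
    """Arrange per-gene expression values into an n x 1 matrix.

    `expression` maps a gene name to its expression amount (hypothetical
    input format); `gene_order` is the preset gene order for the species.
    """
    missing = [gene for gene in gene_order if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for genes: {missing}")
    # One row per gene, one numerical element per row.
    return [[expression[gene]] for gene in gene_order]


# Values from the example in the description, with preset order C-B-A-D.
expression = {"A": 0.94, "B": 0.0, "C": 2.13, "D": 1.36}
matrix = to_column_matrix(expression, ["C", "B", "A", "D"])
print(matrix)  # [[2.13], [0.0], [0.94], [1.36]]
```

Because every cell of the same species is converted with the same `gene_order`, row i of any two matrices refers to the same gene.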
In some embodiments, the screening the set of significantly expressed cells from the cells to be detected according to the first expression level comprises:
calculating the average value of the expression quantity of the marker genes corresponding to the cells to be detected according to the first expression quantity;
and screening out a significant expression cell set from the cells to be detected according to the first expression quantity and the expression quantity average value.
In the embodiment of the application, when the significant expression cell set is selected from the cells to be detected, the expression quantity average value of the marker genes corresponding to each cell to be detected can be calculated according to the first expression quantity data, and then the first cell is determined according to the expression quantity average value. For example, the first cell and the second cell may be distinguished from each other by using the expression level average as the expression level threshold.
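A minimal sketch of this mean-threshold screening is given below. The function name `split_by_mean`, the input format (a mapping from a cell identifier to its first expression level) and the choice of a strict "greater than" comparison are assumptions made for illustration; the application only states that the average may serve as the threshold.

```python
from statistics import mean
from typing import Dict, List, Tuple


def split_by_mean(first_expression: Dict[str, float]
                  ) -> Tuple[List[str], List[str], float]:
    """Split cells into first cells (marker significantly expressed)
    and second cells, using the mean expression level as the threshold."""
    threshold = mean(list(first_expression.values()))
    # Assumption: a cell counts as "significantly expressed" only when
    # its first expression level is strictly above the average.
    first_cells = [c for c, v in first_expression.items() if v > threshold]
    second_cells = [c for c, v in first_expression.items() if v <= threshold]
    return first_cells, second_cells, threshold


# Hypothetical first expression levels for four cells.
levels = {"cell_1": 2.0, "cell_2": 0.0, "cell_3": 1.5, "cell_4": 0.5}
first, second, threshold = split_by_mean(levels)
print(first, second, threshold)
```

With these hypothetical values the average is 1.0, so `cell_1` and `cell_3` form the significantly expressed cell set and the other two cells are left for the KNN step.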
In other embodiments, the screening the significantly expressed cell collection from the cells to be detected according to the first expression level and the expression level average value includes:
calculating the expression reference value of the cell to be detected by the following formula:
Figure BDA0004002923230000081
wherein Pi represents the expression reference value of the i-th cell to be detected, ki represents the first expression level of the marker gene of the i-th cell to be detected, λ represents the expression level average value, and e represents Euler's number (the base of the natural logarithm);
and screening out a significant expression cell set from the cells to be detected according to the expression reference value.
In this embodiment, the first cell and the second cell may be further divided by the above-described expression level reference value. Specifically, for the expression reference value of a certain cell to be detected, if it is smaller than a certain value (a comparison threshold value given in advance, which can be flexibly set), it may be indicated that the marker gene of the cell to be detected is significantly expressed, that is, the cell to be detected belongs to the first cell. In other words, in the embodiment of the present application, the smaller the value of the expression reference value, the higher the probability that the cell to be detected belongs to the first cell, so the probability that the cell to be detected belongs to the first cell can be determined according to the relationship of negative correlation, and then the cell to be detected with higher probability is divided into the significant expression cell set, and the specific functional relationship between the two is not limited in the present application.
In some embodiments, said determining the cell type of the second cell by K-nearest neighbor algorithm from the respective cell type of the first cell and the image comprises:
training a KNN model according to the cell type corresponding to each first cell and the image, and determining a K value in the KNN model;
determining from the image K first cells closest to the second cells;
determining the cell type with the largest number of the K first cells, and determining the cell type with the largest number of the K first cells as the cell type of the second cells.
In this embodiment of the present application, when the K-nearest neighbor algorithm is used, a KNN model may be trained on the positions in the image of the first cells whose cell types have been determined. The value range of K may be defined in advance, and the optimal number of neighbors, that is, the K value, is then found through training. Specifically, K ∈ [n, M − n], where n represents the number of possible cell types and M represents the number of cells.
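One way to search a predefined range for a suitable K is leave-one-out validation over the already labeled first cells. The application does not state how the training is performed, so the following is only a sketch under that assumption; the names `knn_predict` and `choose_k`, the (x, y) coordinate format, and the toy data are all hypothetical.

```python
from collections import Counter
from math import dist
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def knn_predict(train_xy: List[Point], train_labels: List[str],
                query: Point, k: int) -> str:
    """Majority vote among the k labeled points nearest to `query`."""
    order = sorted(range(len(train_xy)),
                   key=lambda i: dist(train_xy[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, the label encountered first among the neighbors wins.
    return votes.most_common(1)[0][0]


def choose_k(xy: List[Point], labels: List[str],
             candidates: Iterable[int]) -> Tuple[int, float]:
    """Return the candidate K with the best leave-one-out accuracy."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        if not 1 <= k <= len(xy) - 1:
            continue  # K must not exceed the remaining labeled cells
        correct = 0
        for i in range(len(xy)):
            rest_xy = xy[:i] + xy[i + 1:]
            rest_labels = labels[:i] + labels[i + 1:]
            if knn_predict(rest_xy, rest_labels, xy[i], k) == labels[i]:
                correct += 1
        acc = correct / len(xy)
        if acc > best_acc:  # keep the smallest K among equally good ones
            best_k, best_acc = k, acc
    if best_k is None:
        raise ValueError("no valid candidate value for K")
    return best_k, best_acc


# Toy data: two well-separated groups of first cells (n = 2, M = 8).
xy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["A"] * 4 + ["B"] * 4
print(choose_k(xy, labels, range(2, 7)))  # candidates 2..6
```

Other selection strategies, such as k-fold cross-validation, would fit the same description; leave-one-out is used here only because it needs no extra parameters.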
Then, when type detection is performed on a second cell, the K first cells closest to that second cell can be found in the image. For example, the Euclidean distance between the second cell and each first cell may be calculated, the first cells sorted by this distance, and the K first cells with the smallest Euclidean distance selected. The cell type of the second cell may then be determined as the cell type that occurs most often among these K first cells. For example, suppose that K is 10 and that, of the 10 first cells nearest to a certain second cell, 7 belong to cell type A, 2 belong to cell type B, and 1 belongs to cell type C. Cell type A is the most numerous, which indicates that the second cell lies in a region dominated by cell type A, so the cell type of the second cell can be determined as cell type A.
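The distance sorting and majority vote described above can be sketched as follows. The function name `classify_second_cell`, the representation of each first cell as a pair of (x, y) position and cell type, and the coordinates themselves are assumptions chosen so that the 10 nearest first cells split 7/2/1 as in the example.

```python
from collections import Counter
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def classify_second_cell(query: Point,
                         first_cells: List[Tuple[Point, str]],
                         k: int) -> Tuple[str, Dict[str, int]]:
    """Assign `query` the most common type among its k nearest first cells."""
    if not 1 <= k <= len(first_cells):
        raise ValueError("k must be between 1 and the number of first cells")
    # Sort the first cells by Euclidean distance to the second cell.
    ranked = sorted(first_cells, key=lambda cell: dist(cell[0], query))
    votes = Counter(cell_type for _, cell_type in ranked[:k])
    return votes.most_common(1)[0][0], dict(votes)


# Hypothetical positions: 7 nearby cells of type A, 2 of type B, 1 of
# type C, plus three distant type-B cells that fall outside the 10 nearest.
first_cells = [
    ((1, 0), "A"), ((0, 1), "A"), ((-1, 0), "A"), ((0, -1), "A"),
    ((1, 1), "A"), ((-1, 1), "A"), ((1, -1), "A"),
    ((2, 0), "B"), ((0, 2), "B"), ((2, 1), "C"),
    ((9, 9), "B"), ((9, 10), "B"), ((10, 9), "B"),
]
cell_type, votes = classify_second_cell((0, 0), first_cells, k=10)
print(cell_type, votes)  # A {'A': 7, 'B': 2, 'C': 1}
```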
It can be understood that the cell type detection method provided in the embodiments of the present application determines the first expression level of the marker gene of each cell to be detected from the whole genome expression amount data and the marker gene information of the cells to be detected, screens out the first cells, in which the marker gene is significantly expressed, according to the first expression level, and determines the cell type of each first cell according to the expression of its marker gene. For the second cells, in which marker gene expression is not significant, the K-nearest neighbors (KNN) algorithm, a machine learning technique, is applied to the image data of the cells to classify each second cell according to the first cells nearest to it, thereby determining its cell type. The method can improve the accuracy of cell type detection and labeling, and helps improve the efficiency and accuracy of subsequent analysis and processing of the cell data.
The embodiment of the application also provides a cell type detection device, which comprises:
the acquisition module is used for acquiring whole genome expression quantity data, marker gene information and images comprising the batch of cells to be detected;
the processing module is used for determining the first expression quantity of the marker gene corresponding to each cell to be detected according to the whole genome expression quantity data and the marker gene information;
the screening module is used for screening out a significant expression cell set from the cells to be detected according to the first expression quantity; the set of significantly expressed cells includes a number of first cells;
the first labeling module is used for determining the cell type of the first cell according to the whole genome expression quantity data and the marker gene information corresponding to the first cell;
the second labeling module is used for determining the cell type of the second cell through a K nearest neighbor algorithm according to the cell type corresponding to each first cell and the image; the second cell is a cell to be detected other than the first cells.
It will be appreciated that the content of the embodiments of the cell type detection method shown in fig. 1 is applicable to the embodiments of the cell type detection device; the functions implemented by the device embodiments are the same as those of the method embodiments shown in fig. 1, and the advantages achieved are likewise the same.
Referring to fig. 4, the embodiment of the application further discloses a computer device, including:
at least one processor 201;
at least one memory 202 for storing at least one program;
the at least one program, when executed by the at least one processor 201, causes the at least one processor 201 to implement an embodiment of a method for detecting a cell type as shown in fig. 1.
It will be appreciated that the embodiments of the method for detecting a cell type shown in fig. 1 are applicable to the embodiments of the computer device, and the functions of the embodiments of the computer device are the same as those of the embodiments of the method for detecting a cell type shown in fig. 1, and the advantages achieved are the same as those achieved by the embodiments of the method for detecting a cell type shown in fig. 1.
The present application also discloses a computer readable storage medium in which a processor executable program is stored, which when executed by a processor is adapted to carry out an embodiment of a method for detecting a cell type as shown in fig. 1.
It will be appreciated that the embodiments of the method for detecting a cell type shown in fig. 1 are applicable to the embodiments of the computer-readable storage medium, and the functions of the embodiments of the computer-readable storage medium are the same as those of the embodiments of the method for detecting a cell type shown in fig. 1, and the advantages achieved are the same as those achieved by the embodiments of the method for detecting a cell type shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of this application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the present application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical system and/or software module or may be implemented in separate physical systems or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the systems disclosed herein will be apparent to engineers of ordinary skill in the art in view of their attributes, functions, and internal relationships. Thus, those of ordinary skill in the art will be able to implement the present application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to limit the scope of the application, which is to be defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic system) with one or more wires, a portable computer diskette (magnetic system), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber system, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. Alternatively, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the foregoing description of the present specification, descriptions of the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like, are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the principles and spirit of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments, and one skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are intended to be included in the scope of the present invention as defined by the appended claims.

Claims (10)

1. A method for detecting a cell type, the method comprising:
acquiring whole genome expression quantity data and marker gene information of a batch of cells to be detected and an image comprising the batch of cells to be detected;
determining a first expression level of a marker gene corresponding to each cell to be detected according to the whole genome expression level data and the marker gene information;
screening out a significant expression cell set from the cells to be detected according to the first expression quantity; the set of significantly expressed cells includes a number of first cells;
determining the cell type of the first cell according to the whole genome expression amount data and the marker gene information corresponding to the first cell;
determining the cell type of a second cell by a K nearest neighbor algorithm according to the cell type corresponding to each first cell and the image; the second cell is a cell to be detected other than the first cells.
2. The method according to claim 1, wherein after the step of obtaining whole genome expression data of a plurality of cells to be detected, the method further comprises:
converting the whole genome expression amount data into matrix data; the matrix data comprises n rows and 1 column, wherein the numerical element of each row corresponds to the expression amount data of one gene, the rows are arranged according to a preset order of the gene types, and n is a positive integer.
3. The method of claim 1, wherein obtaining marker gene information comprises:
obtaining species information of the cell to be detected;
and acquiring the marker gene information according to the species information.
4. The method according to claim 1, wherein the step of selecting a significant expression cell population from the cells to be detected based on the first expression level comprises:
calculating the average value of the expression quantity of the marker genes corresponding to the cells to be detected according to the first expression quantity;
and screening out a significant expression cell set from the cells to be detected according to the first expression quantity and the expression quantity average value.
5. The method according to claim 4, wherein the step of selecting a significant expression cell set from the cells to be detected based on the first expression level and the expression level average value comprises:
calculating the expression reference value of the cell to be detected by the following formula:
Figure FDA0004002923220000011
wherein Pi represents the expression reference value of the i-th cell to be detected, ki represents the first expression level of the marker gene of the i-th cell to be detected, λ represents the expression level average value, and e represents Euler's number (the base of the natural logarithm);
and screening out a significant expression cell set from the cells to be detected according to the expression reference value.
6. The method according to claim 1, wherein determining the cell type of the second cell by K-nearest neighbor algorithm based on the respective cell type of the first cell and the image comprises:
training a KNN model according to the cell type corresponding to each first cell and the image, and determining a K value in the KNN model;
determining from the image K first cells closest to the second cells;
determining the cell type with the largest number of the K first cells, and determining the cell type with the largest number of the K first cells as the cell type of the second cells.
7. The method of claim 6, wherein determining K first cells closest to the second cell from the image comprises:
calculating euclidean distances between the second cells and the respective first cells;
and sequencing the first cells according to the Euclidean distance, and selecting K first cells with the minimum Euclidean distance from the first cells.
8. A device for detecting a cell type, the device comprising:
the acquisition module is used for acquiring whole genome expression quantity data, marker gene information and images comprising the batch of cells to be detected;
the processing module is used for determining the first expression quantity of the marker gene corresponding to each cell to be detected according to the whole genome expression quantity data and the marker gene information;
the screening module is used for screening out a significant expression cell set from the cells to be detected according to the first expression quantity; the set of significantly expressed cells includes a number of first cells;
the first labeling module is used for determining the cell type of the first cell according to the whole genome expression quantity data and the marker gene information corresponding to the first cell;
the second labeling module is used for determining the cell type of the second cell through a K nearest neighbor algorithm according to the cell type corresponding to each first cell and the image; the second cell is a cell to be detected other than the first cells.
9. A computer device, comprising:
at least one processor;
at least one memory for storing at least one program;
when said at least one program is executed by said at least one processor, said at least one processor is caused to carry out a method of detecting a cell type according to any one of claims 1-7.
10. A computer-readable storage medium having stored therein a program executable by a processor, characterized in that: the processor executable program when executed by a processor is for implementing a method of detecting a cell type according to any one of claims 1-7.
CN202211622308.0A 2022-12-16 2022-12-16 Cell type detection method, system, device and medium Pending CN116189771A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211622308.0A CN116189771A (en) 2022-12-16 2022-12-16 Cell type detection method, system, device and medium

Publications (1)

Publication Number Publication Date
CN116189771A true CN116189771A (en) 2023-05-30

Family

ID=86439299

Country Status (1)

Country Link
CN (1) CN116189771A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination