CN116978452A

CN116978452A - Cell type annotation method and device, electronic equipment and storage medium

Info

Publication number: CN116978452A
Application number: CN202310122192.2A
Authority: CN
Inventors: 王亮; 姚建华
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2023-02-06
Filing date: 2023-02-06
Publication date: 2023-10-31

Abstract

The application provides a cell type annotation method, which comprises the following steps: acquiring cell data to be marked and a marker gene for marking cell types; determining an automatic annotation model matched with the cell data to be marked; clustering the cell data to be marked through the automatic annotation model to obtain a clustering result; annotating the clustering result based on the marker gene to obtain a first annotation result, so that the cell data to be marked can be annotated quickly and the speed of cell type annotation is improved; adjusting the incorrect annotation result to obtain a corrected annotation result; adjusting the un-annotated result to obtain a supplementary annotation result; and combining the correct annotation result, the corrected annotation result and the supplementary annotation result to obtain the type annotation result of the cell to be marked. On the basis of machine annotation, the application thus further adjusts, in a timely manner, the results that the machine learning approach failed to annotate or annotated incorrectly.

Description

Cell type annotation method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of biological information analysis technology, and in particular, to a method and apparatus for annotating cell types, an electronic device, and a storage medium.

Background

With the continuous development of bioinformatics, spatial transcriptome technology has been widely applied in medical research across different fields, such as tumor research, neuroscience, developmental biology, and molecular pathology. In the course of medical research, in order to conduct studies such as differential gene expression analysis, cell development trajectory analysis, and Gene Ontology (GO) enrichment analysis based on spatial transcriptome cell data, it is necessary to perform cell annotation on the spatial transcriptome cell data. In the related art, when cell type annotation is performed, the accuracy of neural networks trained by machine learning is low, and the efficiency of manual cell type annotation is low, so that the requirements of spatial transcriptome technology on the accuracy and efficiency of cell type annotation cannot be met.

Disclosure of Invention

In view of this, the embodiment of the invention provides a cell type annotation method, a device, an electronic device and a storage medium, and the technical scheme of the embodiment of the invention is realized as follows:

The embodiment of the invention provides a cell type annotation method, which comprises the following steps:

acquiring cell data to be marked and a marker gene for marking cell types;

determining an automatic annotation model matched with the cell data to be marked;

clustering the cell data to be marked through the automatic annotation model to obtain a clustering result;

annotating the clustering result based on the marker gene to obtain a first annotation result, wherein the first annotation result comprises: correct annotation result, incorrect annotation result and un-annotated result;

adjusting the incorrect annotation result to obtain a corrected annotation result;

adjusting the un-annotated result according to the correct annotation result associated with the un-annotated result to obtain a supplementary annotation result;

and combining the correct annotation result, the corrected annotation result and the supplementary annotation result to obtain the type annotation result of the cell to be marked.
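As a minimal sketch of the final combining step, assume each partial result is represented as a mapping from a cluster identifier to a cell type label. This representation, the function name `merge_annotations`, and the cluster identifiers are hypothetical and chosen only for illustration; the application does not specify a data structure.

```python
from typing import Dict


def merge_annotations(
    correct: Dict[str, str],
    corrected: Dict[str, str],
    supplementary: Dict[str, str],
) -> Dict[str, str]:
    """Combine the three partial results into one cluster -> cell type mapping."""
    merged: Dict[str, str] = {}
    for part in (correct, corrected, supplementary):
        # A cluster should appear in exactly one of the three partial results.
        overlap = merged.keys() & part.keys()
        if overlap:
            raise ValueError(f"cluster annotated twice: {sorted(overlap)}")
        merged.update(part)
    return merged


final_annotation = merge_annotations(
    correct={"c0": "T cell"},        # annotated correctly by the model
    corrected={"c1": "B cell"},      # incorrect annotation, then adjusted
    supplementary={"c2": "T cell"},  # un-annotated, then supplemented
)
```

Rejecting overlapping cluster identifiers is a design choice of this sketch; it makes a cluster that was both corrected and supplemented fail loudly rather than be silently overwritten.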

The embodiment of the invention also provides a cell type annotation device, which comprises:

the information transmission module is used for acquiring cell data to be marked and marker genes for marking cell types;

The information processing module is used for determining an automatic annotation model matched with the cell data to be marked;

the information processing module is used for clustering the cell data to be marked through the automatic annotation model to obtain a clustering result;

the information processing module is configured to annotate the clustering result based on the marker gene to obtain a first annotation result, where the first annotation result includes: correct annotation result, incorrect annotation result and un-annotated result;

the information processing module is used for adjusting the incorrect annotation result to obtain a corrected annotation result;

the information processing module is used for adjusting the un-annotated result according to the correct annotation result associated with the un-annotated result to obtain a supplementary annotation result;

and the information processing module is used for combining the correct annotation result, the corrected annotation result and the supplementary annotation result to obtain the type annotation result of the cell to be marked.

In the above scheme,

the information processing module is used for acquiring at least one single-cell expression profile matched with the identification information according to the identification information of the cells to be marked;

the information processing module is used for combining the at least one single-cell expression profile to obtain the cell data to be marked;

the information processing module is used for acquiring a marker gene matched with the identification information according to the identification information of the cells to be marked.
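The combining of single-cell expression profiles into the cell data to be marked can be sketched as building a cell-by-gene matrix. This is only an illustration under stated assumptions: each profile is assumed to be a mapping from gene name to expression value, genes absent from a profile are assumed to have expression 0.0, and the function name `combine_profiles` and the cell identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def combine_profiles(
    profiles: Dict[str, Dict[str, float]],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Merge per-cell profiles into (cell ids, gene names, cell-by-gene matrix)."""
    cell_ids = sorted(profiles)
    # Union of all genes seen in any profile, in a stable order.
    genes = sorted({gene for profile in profiles.values() for gene in profile})
    # Assumption: a gene missing from a profile is treated as not expressed.
    matrix = [[profiles[cell].get(gene, 0.0) for gene in genes] for cell in cell_ids]
    return cell_ids, genes, matrix


cell_ids, genes, matrix = combine_profiles(
    {
        "cell_b": {"CD3D": 2.0},
        "cell_a": {"MS4A1": 1.5, "CD3D": 0.5},
    }
)
```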

In the above scheme,

the information processing module is used for determining that the automatic annotation model matched with the cell data to be marked is a uniform manifold approximation and projection (UMAP) model when the type annotation result of the cell to be marked needs to characterize the continuity and organization of differentiation among all cell populations;

and the information processing module is used for determining that the automatic annotation model matched with the cell data to be marked is a t-distributed stochastic neighbor embedding (t-SNE) model when the type annotation result of the cell to be marked does not need to characterize the continuity and organization of differentiation among all cell populations.
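The selection rule above can be sketched as a simple branch. The function name `select_annotation_model` and the boolean flag `needs_continuity` are hypothetical; how the need to characterize differentiation continuity is decided is not specified here, so the sketch takes it as an input.

```python
def select_annotation_model(needs_continuity: bool) -> str:
    """Return the name of the model matched to the cell data to be marked."""
    # UMAP when continuity and organization of differentiation among cell
    # populations must be characterized; t-SNE otherwise.
    return "UMAP" if needs_continuity else "t-SNE"


model_for_trajectory_study = select_annotation_model(needs_continuity=True)
model_for_plain_clustering = select_annotation_model(needs_continuity=False)
```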

In the above scheme,

the information processing module is used for acquiring each clustering group in the clustering result;

the information processing module is used for searching gene fragments matched with the marker genes in each cluster group based on the marker genes;

The information processing module is used for annotating each cluster group when the gene fragment is acquired, so as to obtain a cluster group annotation result;

and the information processing module is used for combining the cluster group annotation results to obtain the first annotation result.
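The marker-based annotation of cluster groups can be sketched as follows. Several points are assumptions of this sketch rather than details given above: each cluster group is summarized by the set of genes detected in it, a cluster is labeled with the cell type whose marker set overlaps it most, and a cluster with no marker hit is left un-annotated (`None`). The function name `annotate_clusters`, the cluster identifiers, and the small marker table are illustrative only.

```python
from typing import Dict, Optional, Set


def annotate_clusters(
    cluster_genes: Dict[str, Set[str]],
    markers: Dict[str, Set[str]],
) -> Dict[str, Optional[str]]:
    """Label each cluster group by the marker set it matches best."""
    first_annotation: Dict[str, Optional[str]] = {}
    for cluster_id, genes in cluster_genes.items():
        best_type: Optional[str] = None
        best_hits = 0
        for cell_type, marker_set in markers.items():
            hits = len(genes & marker_set)  # marker genes found in this cluster
            if hits > best_hits:
                best_type, best_hits = cell_type, hits
        first_annotation[cluster_id] = best_type  # None means un-annotated
    return first_annotation


markers = {"T cell": {"CD3D", "CD3E"}, "B cell": {"CD79A", "MS4A1"}}
cluster_genes = {
    "c0": {"CD3D", "CD3E", "IL7R"},
    "c1": {"MS4A1", "CD79A"},
    "c2": {"GAPDH"},
}
first_annotation = annotate_clusters(cluster_genes, markers)
```

Whether a label produced this way is correct or incorrect is not decided by the sketch; that distinction is made in the later adjustment steps.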

In the above scheme,

the information processing module is used for adjusting the incorrect annotation result in response to an annotation result adjustment instruction;

the information processing module is used for updating the cell data to be marked according to the corrected annotation result to obtain updated cell data;

the information processing module is used for storing the updated cell data as metadata.
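A minimal sketch of applying an adjustment instruction and keeping the result as per-cell metadata is given below. The record layout (`cell_id`, `cluster`, `cell_type`), the function name `apply_correction`, and the example labels are assumptions made for illustration; the form of the adjustment instruction and of the metadata is not specified above.

```python
from typing import Dict, List


def apply_correction(
    cell_metadata: List[Dict[str, str]],
    cluster_id: str,
    corrected_type: str,
) -> List[Dict[str, str]]:
    """Return updated metadata in which one cluster carries the corrected label."""
    updated: List[Dict[str, str]] = []
    for record in cell_metadata:
        record = dict(record)  # copy so the original records stay unchanged
        if record["cluster"] == cluster_id:
            record["cell_type"] = corrected_type
        updated.append(record)
    return updated


cells = [
    {"cell_id": "a", "cluster": "c1", "cell_type": "T cell"},
    {"cell_id": "b", "cluster": "c2", "cell_type": "B cell"},
]
updated_cells = apply_correction(cells, cluster_id="c1", corrected_type="NK cell")
```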

In the above scheme,

the information processing module is used for searching a first high-expression gene in a correct annotation result associated with the un-annotated result;

the information processing module is used for determining a second high-expression gene in the un-annotated result;

the information processing module is used for determining that the correct annotation result is the same as the supplementary annotation result when the first high-expression gene and the second high-expression gene are the same;

the information processing module is used for triggering a manual annotation process when the first high-expression gene and the second high-expression gene are different, wherein the manual annotation process is used for manually adjusting the un-annotated result to obtain the supplementary annotation result.
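The comparison of high-expression genes can be sketched as follows. The sketch assumes that each result is summarized by a mapping from gene name to mean expression and that "the high-expression gene" is the single gene with the largest value; both are simplifications introduced here. The function names `top_gene` and `supplement_annotation` are hypothetical, and a return value of `None` stands in for triggering the manual annotation process.

```python
from typing import Dict, Optional


def top_gene(mean_expression: Dict[str, float]) -> str:
    """Return the gene with the highest mean expression."""
    return max(mean_expression, key=lambda gene: mean_expression[gene])


def supplement_annotation(
    unannotated_expression: Dict[str, float],
    associated_expression: Dict[str, float],
    associated_label: str,
) -> Optional[str]:
    """Reuse the associated correct label when the high-expression genes match."""
    first_gene = top_gene(associated_expression)    # from the correct result
    second_gene = top_gene(unannotated_expression)  # from the un-annotated result
    if first_gene == second_gene:
        return associated_label
    return None  # genes differ: hand over to manual annotation


same = supplement_annotation(
    {"CD3D": 4.0, "GAPDH": 1.0}, {"CD3D": 6.0, "MS4A1": 0.2}, "T cell"
)
different = supplement_annotation(
    {"MS4A1": 3.0, "CD3D": 0.1}, {"CD3D": 6.0, "MS4A1": 0.2}, "T cell"
)
```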

In the above scheme,

the information processing module is used for updating the single-cell expression profile of the cells to be marked according to the type annotation result of the cells to be marked to obtain an updated single-cell expression profile;

the information processing module is used for carrying out format conversion on the updated single-cell expression profile and storing the converted single-cell expression profile into a target database.
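As an illustration of the format conversion and storage step, the sketch below serializes an updated profile to JSON and writes it to an in-memory SQLite database. Both choices are stand-ins: the target format and the target database are not named above, and the table name `profiles`, the function name `store_profile`, and the cell identifier are hypothetical.

```python
import json
import sqlite3
from typing import Any, Dict


def store_profile(conn: sqlite3.Connection, cell_id: str, profile: Dict[str, Any]) -> None:
    """Convert an updated profile to JSON text and store it under its cell id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profiles (cell_id TEXT PRIMARY KEY, payload TEXT)"
    )
    payload = json.dumps(profile, sort_keys=True)  # format conversion step
    conn.execute("INSERT OR REPLACE INTO profiles VALUES (?, ?)", (cell_id, payload))
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the target database
store_profile(conn, "cell_001", {"cell_type": "T cell", "expression": {"CD3D": 5.2}})

row = conn.execute(
    "SELECT payload FROM profiles WHERE cell_id = ?", ("cell_001",)
).fetchone()
restored = json.loads(row[0])
```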

The embodiment of the application also provides electronic equipment, which comprises:

a memory for storing executable instructions;

and the processor is used for realizing the foregoing cell type annotation method when executing the executable instructions stored in the memory.

The embodiment of the application also provides a computer readable storage medium, which stores executable instructions, wherein the executable instructions realize the foregoing cell type annotation method when being executed by a processor.

The embodiment of the application has the following beneficial effects:

1) The embodiment of the application acquires the cell data to be marked and the marker gene for marking the cell type; determines an automatic annotation model matched with the cell data to be marked; clusters the cell data to be marked through the automatic annotation model to obtain a clustering result; and annotates the clustering result based on the marker gene to obtain a first annotation result, wherein the first annotation result comprises: correct annotation result, incorrect annotation result and un-annotated result. Therefore, quick annotation of the cell data to be marked can be realized, and the speed of cell type annotation is improved.

2) The incorrect annotation result is adjusted to obtain a corrected annotation result; the un-annotated result is adjusted according to the correct annotation result associated with the un-annotated result to obtain a supplementary annotation result; and the correct annotation result, the corrected annotation result and the supplementary annotation result are combined to obtain the type annotation result of the cell to be marked. Therefore, on the basis of machine annotation, the results that the machine learning approach failed to annotate or annotated incorrectly are further adjusted in a timely manner, and the accuracy of cell type annotation can be improved.

Drawings

FIG. 1 is a schematic diagram of the usage scenario of a cell type annotation method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the structure of a cell type annotation device according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of an alternative method for annotating cell types according to an embodiment of the present invention;

FIG. 4 is a schematic view of clustering effects of different automatic annotation models according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a process for annotating clustering results based on marker genes in an embodiment of the present invention;

FIG. 6 is a schematic diagram of the effect of annotating clustering results based on marker genes in an embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating the adjustment of the incorrect annotation result according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating the adjustment of the incorrect annotation result according to an embodiment of the present invention;

FIG. 9 is a schematic diagram illustrating the effect of adjusting the incorrect annotation result according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a process for obtaining a supplementary annotation result in an embodiment of the present invention;

FIG. 11 is a schematic diagram of a process for obtaining a supplementary annotation result in an embodiment of the present invention;

FIG. 12 is a schematic diagram of a process for obtaining a supplementary annotation result in an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present invention.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.

Before the embodiments of the present invention are described in further detail, the terms involved in the embodiments are explained; the following explanations apply to these terms.

1) In response to: used to represent the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations may be performed in real time or with a set delay; unless otherwise specified, there is no limitation on the execution order of the operations performed.

2) Neural Networks (NN): an artificial neural network (Artificial Neural Network, ANN), abbreviated as neural network or neural-like network, is a mathematical or computational model that mimics the structure and function of biological neural networks (the central nervous system of animals, particularly the brain) for estimating or approximating functions in the field of machine learning and cognitive sciences.

3) K-means: an unsupervised clustering method that iteratively groups cells into a specified number of clusters by calculating the similarity between them.

4) Spatial transcriptome (Spatial Transcriptomics, ST): the collection of all transcripts in a cell under given physiological conditions, including, for example, messenger RNA, ribosomal RNA, transfer RNA and other non-coding RNA. Alternatively, a spatial transcriptome may be considered to be the collection of all mRNAs.

5) Cell annotation result: the processing result obtained after analysis and cell annotation of spatial transcriptome data, used for identifying the identity of the cell data in the spatial transcriptome, so that the cell type information, cell attribute characteristics and the like of the spatial transcriptome data can be obtained quickly.
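The K-means definition in item 3) can be illustrated with a minimal pure-Python sketch. It uses squared Euclidean distance as the (dis)similarity measure and takes the initial centroids as an argument so the result is deterministic; both are choices of this sketch, and the function names and the toy two-dimensional "cells" are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], centroids: Sequence[Point], max_iter: int = 100) -> List[int]:
    """Return the cluster index of each point, starting from the given centroids."""
    centers = list(centroids)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


cells = [(0.1, 0.2), (0.0, 0.4), (5.0, 5.1), (5.2, 4.9)]
labels = kmeans(cells, centroids=[(0.0, 0.0), (5.0, 5.0)])
```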

Before introducing the cell type annotation method provided by the application, the cell type annotation methods in the related art are described first. In the related art, when cell type annotation is performed, the accuracy of neural networks trained by machine learning is low, and manual cell type annotation is inefficient, so that the requirements of spatial transcriptome technology on the accuracy and efficiency of cell type annotation cannot be met. In the related art, several different clusters may correspond to the same cell type. In this case, the algorithms of the related art cannot automatically integrate the several clusters belonging to the same cell type, and manual work is required to modify and edit such information, which increases the time for cell annotation. Meanwhile, with cell type annotation based on deep learning methods, the predicted probability that some cells belong to a certain cell type may exceed 0.50 while the confidence of that assignment remains low. Such cells may require manual analysis of the proportions of their expressed genes for further judgment, which also increases the time for cell annotation.

In order to overcome the above-mentioned drawbacks, refer to FIG. 1, which is a schematic diagram of a usage scenario of the cell type annotation method according to an embodiment of the present invention. Terminals (including a terminal 10-1 and a terminal 10-2) are provided with clients capable of performing different functions. Through these clients, the terminals acquire different single-cell expression profiles from the corresponding server 200 through a network 300 for browsing. The terminals are connected to the server 200 through the network 300; the network 300 may be a wide area network, a local area network, or a combination of the two, and data transmission is implemented using wireless links. The types of single-cell expression profiles that the terminals acquire from the server 200 through the network 300 may be the same or different. For example, the terminals can acquire pathological images or pathological videos matched with a target object from the server 200 through the network 300, or can acquire a single-cell expression profile matched with the current target object from the server 200 through the network 300 for browsing. Batch annotation can be performed by acquiring a set of multiple single-cell expression profiles from the microscope 400. The server 200 can store single-cell expression profiles corresponding to different target objects, or store auxiliary analysis information matched with the single-cell expression profiles of the target objects.

The automatic annotation model in the artificial intelligence field deployed by the server can collect images of a sample to be observed by using a camera on a traditional optical microscope and analyze the real-time images in combination with a machine learning algorithm. Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense the environment, obtain knowledge, and use the knowledge to obtain optimal results.

In particular, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that reacts in a manner similar to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

It should be noted that the samples observed under the microscope system (a medical device in contact with a single-cell section of the target object) may vary. A PBMC sample may contain T cells, B cells, monocytes, and granulocytes; a tumor sample may include tumor cells, epithelial cells, endothelial cells, fibroblasts, T cells, B cells, macrophages, dendritic cells, mast cells, and tissue cell types corresponding to different tumor types, such as liver cells in liver cancer, astrocytes, club cells in lung cancer, secretory cells, goblet cells, etc., which are not particularly limited in the present application.

The server 200 transmits the pathology information of the same target object to the terminal (terminal 10-1 and/or terminal 10-2) through the network 300, and the user of the terminal (terminal 10-1 and/or terminal 10-2) analyzes the pathology information of the target object. As an example, the server 200 is deployed in the field of sequencing analysis, and the process of annotating cell data of a spatial transcriptome may be performed either at the terminal 10-1 or at the server 200. For example, on receiving cell data input by a user through the terminal 10-1, cell annotation can be performed locally at the terminal 10-1 to obtain a cell annotation result; alternatively, the cell data may be sent to the server 200, so that the server 200 receives the cell data, performs cell annotation according to the cell data to obtain a cell annotation result, and then returns the cell annotation result to the terminal 10-1, thereby implementing cell annotation of the cell data of the transcriptome to be predicted.

The following describes the structure of the cell type annotation device according to the embodiments of the present invention in detail, and the cell type annotation device may be implemented in various forms, such as a dedicated terminal with a cell type annotation function, or a server provided with a cell type annotation function, for example, the server 200 in fig. 1. Fig. 2 is a schematic diagram of the composition structure of a cell type annotation device according to an embodiment of the present invention, and it will be understood that fig. 2 only shows an exemplary structure of the cell type annotation device, but not all the structure, and some or all of the structures shown in fig. 2 may be implemented as required.

The cell type annotation device provided by the embodiment of the invention comprises: at least one processor 201, a memory 202, a user interface 203, and at least one network interface 204. The various components in the cell type annotation device 20 are coupled together by a bus system 205. It is understood that the bus system 205 is used to enable connected communications between these components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 205 in fig. 2.

The user interface 203 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.

It will be appreciated that the memory 202 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operation on the terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application may comprise various applications.

In some embodiments, the cell type annotation device provided in the embodiments of the present invention may be implemented by combining software and hardware. By way of example, the cell type annotation device provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to perform the cell type annotation method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.

As an example of implementation of the cell type annotation device provided by the embodiment of the present invention by combining software and hardware, the cell type annotation device provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, the software modules may be located in a storage medium, the storage medium is located in the memory 202, the processor 201 reads executable instructions included in the software modules in the memory 202, and the cell type annotation method provided by the embodiment of the present invention is completed by combining necessary hardware (including, for example, the processor 201 and other components connected to the bus 205).

By way of example, the processor 201 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.

As an example of a hardware implementation of the cell type annotation device provided by the embodiment of the present invention, the device may be implemented directly by the processor 201 in the form of a hardware decoding processor, for example, by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components, to implement the cell type annotation method provided by the embodiment of the present invention.

The memory 202 in embodiments of the present invention is used to store various types of data to support the operation of the cell type annotation device 20. Examples of such data include any executable instructions for operation on the cell type annotation device 20; a program implementing the cell type annotation method of embodiments of the invention may be included in the executable instructions.

In other embodiments, the cell type annotation device provided in the embodiments of the present invention may be implemented in software. FIG. 2 shows the cell type annotation device stored in the memory 202, which may be software in the form of a program, a plug-in, etc., and includes a series of modules. As an example of the program stored in the memory 202, the cell type annotation device includes the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the cell type annotation device are read into RAM by the processor 201 and executed, the cell type annotation method provided by the embodiment of the present invention is implemented. The functions of the respective software modules are described below.

The information transmission module 2081 is configured to acquire the cell data to be marked and the marker gene for marking cell types.

The information processing module 2082 is configured to determine an automatic annotation model that matches the cell data to be marked.

The information processing module 2082 is configured to cluster the cell data to be marked through the automatic annotation model, so as to obtain a clustering result.

The information processing module 2082 is configured to annotate the clustering result based on the marker gene to obtain a first annotation result, where the first annotation result includes: correct annotation result, incorrect annotation result, and un-annotated result.

The information processing module 2082 is configured to adjust the incorrect annotation result to obtain a corrected annotation result.

The information processing module 2082 is configured to adjust the un-annotated result according to the correct annotation result associated with the un-annotated result, so as to obtain a supplementary annotation result.

The information processing module 2082 is configured to combine the correct annotation result, the corrected annotation result, and the supplementary annotation result to obtain the type annotation result of the cell to be marked.

In the above scheme,

the information processing module 2082 is configured to obtain, according to the identification information of the cells to be marked, at least one single cell expression profile that is matched with the identification information;

the information processing module 2082 is configured to combine the at least one single cell expression profile to obtain the cell data to be marked;

the information processing module 2082 is configured to obtain, according to the identification information of the cell to be marked, a marker gene that matches the identification information.

In the above scheme,

the information processing module 2082 is configured to determine that the automatic annotation model matched with the cell data to be marked is a uniform manifold approximation and projection (UMAP) model when the type annotation result of the cell to be marked needs to characterize continuity and organization of differentiation among all cell populations;

the information processing module 2082 is configured to determine that the automatic annotation model matched with the cell data to be marked is a t-distributed stochastic neighbor embedding (t-SNE) model when the type annotation result of the cell to be marked does not need to characterize continuity and organization of differentiation among all cell populations.

In the above scheme,

the information processing module 2082 is configured to obtain each cluster group in the cluster result;

the information processing module 2082 is configured to search, based on the marker genes, for gene segments that match the marker genes in each cluster group;

the information processing module 2082 is configured to annotate each cluster group when the gene segment is acquired, so as to obtain a cluster group annotation result;

the information processing module 2082 is configured to combine the cluster group annotation results to obtain the first annotation result.

In the above scheme,

the information processing module 2082 is configured to adjust the incorrect annotation result in response to an annotation result adjustment instruction;

the information processing module 2082 is configured to update the cell data to be marked according to the corrected annotation result, so as to obtain updated cell data;

the information processing module 2082 is configured to store the updated cell data as metadata.

In the above scheme,

the information processing module 2082 is configured to find a first high-expression gene in a correct annotation result associated with the un-annotated result;

the information processing module 2082 is configured to determine a second highly expressed gene in the un-annotated result;

the information processing module 2082 is configured to trigger a manual annotation process when the first high-expression gene and the second high-expression gene are different, where the manual annotation process is used to manually adjust the un-annotated result to obtain the supplementary annotation result.

In the above scheme,

the information processing module 2082 is configured to update a single-cell expression profile of the cell to be marked according to the type annotation result of the cell to be marked, so as to obtain an updated single-cell expression profile;

the information processing module 2082 is configured to perform format conversion on the updated single cell expression profile, and store the updated single cell expression profile in a target database.

According to the electronic device shown in fig. 2, in one aspect of the application, the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in various alternative implementations of the cell type annotation methods provided by the application.

In order to overcome the above-mentioned drawbacks, the cell type annotation method provided by the embodiment of the present application is described below with reference to the cell type annotation device shown in fig. 2. Referring to fig. 3, fig. 3 is a schematic flow chart of an alternative cell type annotation method according to the embodiment of the present application. It will be understood that the steps shown in fig. 3 may be performed by various electronic devices running the cell type annotation device, for example, a cell analysis instrument, a server, or a server cluster with an image classification processing function, to implement annotation of cell types in the cell data to be marked. The steps shown in fig. 3 are described below.

Step 301: the cell type annotation device acquires the cell data to be marked and the marker genes for cell type annotation.

In some embodiments of the present application, the cell type annotation method can annotate cell data to be marked that includes only one single cell expression profile, or annotate in batch a set of cell data to be marked that includes a plurality of single cell expression profiles; the user can flexibly adjust the number of single cell expression profiles according to different usage requirements. When obtaining the cell data to be marked and the marker genes for cell type annotation, at least one single cell expression profile matched with the identification information of the cells to be marked is first obtained; the at least one single cell expression profile is then combined to obtain the cell data to be marked; finally, the marker genes matched with the identification information are obtained according to the identification information of the cells to be marked.
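The acquisition in step 301 can be illustrated with a minimal sketch. The in-memory stores `PROFILE_STORE` and `MARKER_STORE`, the function name `acquire_inputs`, and the expression values are hypothetical and are not part of the application; the sketch only assumes that the identification information is a key under which both the single cell expression profiles and the marker genes can be looked up.

```python
# Hypothetical stores keyed by identification information (values are made up).
PROFILE_STORE = {
    "human_bone_marrow": [
        {"cell_id": "c1", "expression": {"HBB": 9.0, "HBA1": 8.0, "CD3E": 0.0}},
        {"cell_id": "c2", "expression": {"HBB": 0.1, "HBA1": 0.0, "CD3E": 5.0}},
    ],
}
MARKER_STORE = {
    "human_bone_marrow": {"Erythroid cells": ["HBB", "HBA1"], "T cell": ["CD3E"]},
}


def acquire_inputs(identification_info):
    """Return (cell data to be marked, marker genes) for one identifier."""
    profiles = PROFILE_STORE.get(identification_info, [])
    if not profiles:
        raise KeyError(f"no single cell expression profile for {identification_info!r}")
    # Combine the matched profiles into one cell-by-gene mapping.
    cell_data = {p["cell_id"]: dict(p["expression"]) for p in profiles}
    # Look up the marker genes matched with the same identification information.
    marker_genes = MARKER_STORE[identification_info]
    return cell_data, marker_genes


cell_data, marker_genes = acquire_inputs("human_bone_marrow")
```

Whether one profile or many are matched, the combined mapping has the same shape, which is one way to support both single and batch annotation.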

In some embodiments of the invention, when cell transcription is performed using the annotation result, the cell data to be labeled may be information obtained by genetic sequencing of the sample (tissue or site) to be predicted. For example, a sequencing spot may be selected on a section of a sample to be predicted and gene sequenced to obtain cellular data of the transcriptome to be predicted. The tissue may be brain tissue, heart tissue, lung tissue, or the like. The sequencing spots may be some cells in the section used to obtain transcript information for the transcriptome to be predicted.

In some embodiments of the present invention, when the annotation result is used for cell transcription, the cell data to be marked is spatial transcriptome cell data, and the method for obtaining the spatial transcriptome cell data may include: 1) combining microscopic imaging technology and sequencing technology to obtain the transcriptome data in a section of a given tissue, and then calling the 10x Genomics Visium platform to analyze and visualize the data; 2) transcriptome analysis methods based on transcriptome sequencing and laser capture microdissection (LCM), which may include, for example, the newer spatial transcriptome method of transcriptome in vivo analysis (TIVA), padlock probe in situ sequencing (ISS), and fluorescent in situ sequencing (FISSEQ).

In some embodiments of the invention, when the annotation result is used for cellular transcription, the cell type annotation result corresponding to the transcriptome to be predicted may be a cellular object that is homologous to the transcriptome to be predicted and the cell annotation result is known. For example, a cell type annotated object may be a single cell or a group of cells comprised by a gene sequencing object corresponding to a transcriptome to be predicted. Wherein, single cells refer to the cell data formed by sequencing each sample in units of cells. Cell group refers to cell data, such as bulk data, formed by sequencing a plurality of cells per sample, and may also be sub-cell level cell data per sample.

In some embodiments of the invention, when the annotation result is used for cellular transcription, the gene sequencing object corresponding to the transcriptome to be predicted may be a gene sequencing object homologous to the transcriptome to be predicted, e.g., a gene sequencing object taken from the same tissue or site as the transcriptome to be predicted. Illustratively, the transcriptome to be predicted taken from the brain corresponds to a gene sequencing subject taken from the brain, and not to a gene sequencing subject taken from the heart. Cellular data of the transcriptome to be predicted taken from the heart corresponds to the gene sequencing subjects taken from the heart. If the cell data of the transcriptome to be predicted is a brain slice, the determined gene sequencing object should be taken from the same region of the brain; if the cellular data of the transcriptome to be predicted is a heart slice, it is determined that the corresponding gene sequencing object should also be taken from the same region of the heart.

In step 301, marker genes can be obtained in a targeted manner according to the intended use of the cell annotation result. A marker gene is a gene whose function and position on the chromosome have been determined and which can be used as a reference for analyzing other genes, for example the HBA1 and HBB genes. The HBA1/HBA2 gene is located on chromosome 16 (16p13.3), has 3 exons, encodes a chain of 141 amino acids, and is about 29 kb; the HBB gene is located on chromosome 11 (11p15.4), has 3 exons, and encodes a chain of 146 amino acids. The product of the marker genes HBA1/HBB is hemoglobin, the protein that transports oxygen in erythrocytes and makes blood red. Hemoglobin consists of globin and heme, and the globin part is a tetramer consisting of two pairs of different globin chains (alpha chains and beta chains). Different peptide chains constitute the six hemoglobins known in humans.

Step 302: the cell type annotation device determines an automatic annotation model that matches the cell data to be labeled.

Step 303: the cell type annotation device clusters the cell data to be marked through the automatic annotation model to obtain a clustering result.

In step 303, the purpose of clustering the cell data to be marked through the automatic annotation model is to group cells into large classes according to the similarity (or distance) of the gene expression patterns of the cells in the cell data to be marked, and then to cluster the large classes into mathematically meaningful sub-groups. All of the cell data to be marked may be clustered to determine the types of all cells, or, according to the usage requirements of the user, only a designated part of the cell data to be marked within the observation field of view may be clustered to determine the types of the designated cells.

In step 303, when clustering the cell data to be marked, the optional clustering manners include:

1) K-means clustering, which iteratively finds the centers of the K clusters and assigns each cell to the nearest center by minimizing the Euclidean distance. K-means clustering is sensitive to outliers and tends to find clusters of equal size, so it is suited to fast clustering when the cell data to be marked contains no rare cell types.

2) SingleR automated clustering. SingleR is an automated labeling method for single cell RNA sequencing (scRNAseq) data. Given a reference sample set (single cell or bulk) with known labels, SingleR automatically labels new cells from the test dataset based on their similarity to the reference. Thus, the burden of manually interpreting clusters and defining marker genes needs to be borne only once, for the reference dataset, and this biological knowledge can be propagated to new datasets in an automated fashion. SingleR comes with 7 reference datasets of its own, 5 of which are human data and 2 of which are mouse data, so that clustering of rare cell types can be achieved.

3) Depending on the application of the cell data to be marked, cell clustering based on single cell transcriptome sequencing data may be performed, or other non-single-cell migration modes may be adopted. For example, when each sample contains a plurality of cells, the corresponding algorithm may combine deconvolution with migration to obtain the initial cell annotation result of the transcriptome to be predicted; when the non-single-cell data is at the subcellular level, the corresponding algorithm may combine aggregation with migration to obtain the initial cell annotation result of the transcriptome to be predicted.
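As an illustration of clustering manner 1), the following is a minimal pure-Python sketch of K-means with Euclidean distance. The function name `kmeans`, the fixed iteration count, and the seeded random initialization are choices made for this sketch only; each point stands for the expression vector of one cell.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each cell goes to the nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its assigned cells.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Two well-separated groups of hypothetical 2-gene expression vectors.
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(cells, k=2)
```

Because every center is a mean of its members, a single outlying cell can pull a center noticeably, which is the outlier sensitivity noted above.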

In some embodiments of the present invention, referring to FIG. 4, FIG. 4 is a schematic view of the clustering effects of different automatic annotation models according to embodiments of the present invention. The automatic annotation model in FIG. 4 covers 2 cases:

1) When the type annotation result of the cells to be marked needs to characterize the continuity and organization of differentiation among all cell populations, the automatic annotation model matching the cell data to be marked is determined to be a uniform manifold approximation and projection (UMAP) model. For example, when the helper T cells, the suppressor T cells, the effector T cells, the cytotoxic T cells and the memory T cells included in T cells are annotated, the automatic annotation model is the uniform manifold approximation and projection model, so that the continuity and organization of differentiation among the helper T cell, suppressor T cell, effector T cell, cytotoxic T cell and memory T cell populations can be determined.

2) When the type annotation result of the cells to be marked does not need to characterize the continuity and organization of differentiation among all cell populations, the automatic annotation model matching the cell data to be marked is determined to be a t-distributed stochastic neighbor embedding (t-SNE) model. For example, when T cells and B cells are annotated, the automatic annotation model is the t-distributed stochastic neighbor embedding model, so that the clustering speed can be effectively improved.
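The two cases above amount to a single decision, which can be sketched as follows. The function name `select_annotation_model` and the boolean flag are hypothetical; the sketch only returns the model name and does not run either model.

```python
def select_annotation_model(needs_differentiation_continuity: bool) -> str:
    """Return the name of the model matched with the cell data to be marked."""
    if needs_differentiation_continuity:
        # Case 1): continuity and organization of differentiation must be shown.
        return "UMAP"
    # Case 2): only separate populations are needed, so the faster choice is used.
    return "t-SNE"


# Annotating T cell subtypes (helper, suppressor, effector, ...) falls under case 1).
model_for_t_cell_subtypes = select_annotation_model(True)
# Distinguishing T cells from B cells falls under case 2).
model_for_t_vs_b = select_annotation_model(False)
```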

Referring to fig. 5, fig. 5 is a schematic diagram of the process of annotating a clustering result based on marker genes in the embodiment of the present invention. A cell annotation tool and existing marker genes are used to annotate the cell types of the 44 cluster groups shown in fig. 4. The user can see detailed attribute information of the data in the "Data Description" region shown in fig. 5, which improves the controllability of the annotation process. In the annotation process shown in fig. 5, 2 parameters need to be entered in the parameter input part. The first is the name of the cluster grouping on which cell annotation is to be performed; to select the grouping of the 44 clusters shown in fig. 4, "spectral_leiden_4" can be entered or selected in the drop-down menu. The second is a marker gene file; for example, since the input data is human bone marrow data, the existing human_immune file of "correspondence between cell type name and gene name" is used to realize the cell type annotation.

In some embodiments of the present invention, the file of the correspondence between cell type names and gene names may be referred to in table 1:

TABLE 1

In addition, since the computing power of the hardware devices used by different users for cell annotation differs, the uniform manifold approximation and projection (UMAP) model and the t-distributed stochastic neighbor embedding (t-SNE) model can be flexibly selected according to the computing power of the hardware device. UMAP uses an exponential probability distribution in the high-dimensional space and, unlike t-SNE, is not restricted to the Euclidean distance; any distance can be substituted. UMAP uses binary cross entropy (CE) as its cost function, rather than the K-L divergence used by t-SNE. UMAP assigns the initial low-dimensional coordinates using the graph Laplacian, in contrast to the random normal initialization used by t-SNE. UMAP uses stochastic gradient descent (SGD) instead of regular gradient descent (GD), which both speeds up computation and reduces memory consumption, so the UMAP model is preferred for terminals with insufficient floating point computing power.
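The difference between the two cost functions can be made concrete with a small sketch. Here `p` and `q` are assumed to be lists of pairwise similarities in the high-dimensional and low-dimensional spaces respectively, each value in [0, 1]; the function names and the `eps` guard against log(0) are choices of this sketch, not of either model's reference implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """K-L style cost: only the attractive term p * log(p / q)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def binary_cross_entropy(p, q, eps=1e-12):
    """Binary cross entropy cost: attractive term plus a repulsive term
    (1 - p) * log((1 - p) / (1 - q)) for pairs that should stay apart."""
    total = 0.0
    for pi, qi in zip(p, q):
        total += pi * math.log((pi + eps) / (qi + eps))
        total += (1 - pi) * math.log((1 - pi + eps) / (1 - qi + eps))
    return total


high_dim = [0.9, 0.1]  # hypothetical similarities of two cell pairs
low_dim = [0.5, 0.5]   # a poor embedding that treats both pairs alike
kl_cost = kl_divergence(high_dim, low_dim)
ce_cost = binary_cross_entropy(high_dim, low_dim)
```

The extra repulsive term is what penalizes placing dissimilar cells close together, which relates to the preservation of organization among cell populations discussed above.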

Step 304: the cell type annotation device annotates the clustering result based on the marker gene to obtain a first annotation result, wherein the first annotation result comprises: correct annotated results, incorrect annotated results, and un-annotated results.

In some embodiments of the present invention, referring to fig. 6, fig. 6 is a schematic diagram illustrating the effect of annotating a clustering result based on marker genes in an embodiment of the present invention. Each cluster group in the clustering result is obtained; gene fragments matched with the marker genes are searched for in each cluster group based on the marker genes; when the gene fragments are found, each cluster group is annotated to obtain a cluster group annotation result; and the cluster group annotation results are combined to obtain the first annotation result. As shown in fig. 6, after the 44 cluster groups are annotated with the marker genes, the first annotation result includes: 40 cluster groups with correct annotation results, the B cell cluster group with an erroneous annotation result, and clusters 7, 20 and 41 as un-annotated results.
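One way to picture the marker-based annotation of step 304 is the following sketch. It assumes, for illustration only, that "matching" a marker gene means the mean expression of a cell type's marker genes in a cluster reaches a threshold; the function name `annotate_clusters`, the threshold value, and the expression values are hypothetical. Clusters that match no cell type stay un-annotated (`None`), and a cell type assigned to a second cluster receives a numeric suffix, mirroring the "B cell" and "B cell-2" labels described for fig. 6.

```python
from collections import defaultdict


def annotate_clusters(cluster_means, marker_genes, threshold=1.0):
    """cluster_means: {cluster_id: {gene: mean expression}}
    marker_genes:  {cell_type: [gene, ...]}
    Returns {cluster_id: label or None}."""
    annotations = {}
    times_used = defaultdict(int)
    for cluster_id in sorted(cluster_means):
        expr = cluster_means[cluster_id]
        # Score each cell type by the mean expression of its marker genes.
        scores = {
            cell_type: sum(expr.get(g, 0.0) for g in genes) / len(genes)
            for cell_type, genes in marker_genes.items()
            if genes
        }
        best = max(scores, key=scores.get) if scores else None
        if best is None or scores[best] < threshold:
            annotations[cluster_id] = None  # un-annotated result
            continue
        times_used[best] += 1
        suffix = "" if times_used[best] == 1 else f"-{times_used[best]}"
        annotations[cluster_id] = best + suffix
    return annotations


markers = {"B cell": ["MS4A1"], "Erythroid cells": ["HBB", "HBA1"]}
means = {0: {"MS4A1": 5.0}, 1: {"MS4A1": 4.0}, 2: {"HBB": 0.2}}
first_annotation = annotate_clusters(means, markers)
```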

In some embodiments of the present invention, when cells are clustered based on single cell transcriptome sequencing data, the gene expression information and the spatial information can be comprehensively analyzed by identifying marker genes with cell type specific expression, so as to determine an initial cell annotation result of the transcriptome to be predicted. Alternatively, prior knowledge of the corresponding field can be introduced: the cell data of the transcriptome to be predicted is clustered by a clustering algorithm to obtain clustering results, and the initial cell annotation result of each clustering result in the transcriptome to be predicted is then determined manually using prior biological knowledge. Alternatively, a pre-established transcription profile database of known cell types can be queried: the cell characteristics of the cell data of the transcriptome to be predicted, whose type is unknown, are compared with the cell characteristics in the transcription profile database of known cell types, and the cell type with the same cell characteristics is determined to be the initial cell annotation result of the transcriptome to be predicted.

Step 305: the cell type annotation device adjusts the error annotation result to obtain a corrected annotation result.

In step 304, only three clusters, 7, 20 and 41, are not annotated with a cell type. One possibility is that these clusters are internally heterogeneous and contain cells of different types, so that none of the known cell classes matches them. Another possibility is that they are a new cell type, so that the human_immune file of "correspondence between cell type name and gene name" does not record such cells.

In addition, different clusters may be identified as the same cell type, giving erroneous annotation results such as "B cell" and "B cell-2": both clusters are identified as 'B cell', and a numeric suffix is appended to distinguish them. The same can be found for other cell types, such as "Naive T cells", for which 12 clusters are identified; these results therefore need to be modified.

Referring to fig. 7, fig. 7 is a schematic diagram illustrating adjustment of erroneous annotation results according to an embodiment of the invention. The erroneous annotation result is adjusted in response to an annotation result adjustment instruction; the cell data to be marked is updated according to the corrected annotation result to obtain updated cell data; and the updated cell data is saved as metadata. The annotation result adjustment instruction can trigger the "update data" control component shown in fig. 7, so that the code end automatically writes the manually modified content into the single-cell data and at the same time creates new metadata, which the user can conveniently call.

Referring to fig. 8, fig. 8 is a schematic diagram illustrating adjustment of erroneous annotation results according to an embodiment of the invention. When the erroneous annotation result of the B cell cluster group is corrected, "B cell-2" is clicked to display the Annostate button, B cell-2 is changed to B cell in "Category Name", and the "ok" control component is clicked for confirmation.

Referring to fig. 9, fig. 9 is a schematic diagram showing the effect of adjusting the erroneous annotation result in the embodiment of the present invention, wherein B cell-2 is changed to B cell, so that different clusters of the same cell type are merged.
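The correction described for figs. 7 to 9 can be sketched as a label rename that also produces a metadata record. The function name `apply_adjustment` and the metadata fields are hypothetical; the application does not specify the format in which the updated cell data is saved as metadata.

```python
def apply_adjustment(cell_labels, old_name, new_name):
    """Rename one annotation label and return (updated labels, metadata record).

    cell_labels maps a cell identifier to its current annotation label."""
    updated = {
        cell: (new_name if label == old_name else label)
        for cell, label in cell_labels.items()
    }
    metadata = {
        "renamed": {old_name: new_name},
        "cells_changed": sum(1 for label in cell_labels.values() if label == old_name),
    }
    return updated, metadata


labels = {"c1": "B cell", "c2": "B cell-2", "c3": "Naive T cells"}
updated_labels, record = apply_adjustment(labels, "B cell-2", "B cell")
```

Returning a new mapping instead of modifying the input keeps the erroneous annotation result available for comparison with the corrected one.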

Step 306: the cell type annotation device adjusts the un-annotated result according to the correct annotation result related to the un-annotated result to obtain a supplementary annotation result.

Referring to fig. 10, fig. 10 is a schematic diagram illustrating a process of obtaining a supplementary annotation result according to an embodiment of the present invention. A first high-expression gene is searched for in the correct annotation result associated with the un-annotated result (1001 as shown in fig. 10); a second high-expression gene is determined in the un-annotated result (1002 as shown in fig. 10); and when the first high-expression gene and the second high-expression gene are identical, the supplementary annotation result is determined to be the same as the correct annotation result. Referring to fig. 9, the un-annotated result remaining after the processing of steps 301 to 305 is cluster No. 20, and it can be determined that the class neighboring cluster No. 20 is Erythroid cells.

Referring to FIG. 11, FIG. 11 is a schematic diagram showing the procedure for obtaining the supplementary annotation result in the example of the present invention. By analyzing the high-expression genes of "Erythroid cells", it is found that they are HBB and HBA1. As shown in FIG. 11, the shaded areas of the lower small images represent high expression of HBB and HBA1; that is, HBB and HBA1 are the high-expression genes of "Erythroid cells".

Referring to FIG. 12, FIG. 12 is a schematic diagram of the process of obtaining the supplementary annotation result in the example of the present invention. HBB and HBA1 are also high-expression genes of the cells of "cluster No. 20", because these two genes are likewise shown as bright shading, representing high gene expression.

In some embodiments of the present invention, when the first high-expression gene and the second high-expression gene cannot be determined to be the same from fig. 11 and fig. 12, that is, when the first high-expression gene and the second high-expression gene are different, a manual annotation process is triggered, where the manual annotation process is used to manually adjust the un-annotated result to obtain the supplementary annotation result.
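The comparison of high-expression genes in step 306 can be sketched as follows. The sketch assumes, for illustration only, that the "high-expression genes" of a cluster are its top `n` genes by mean expression; the function names, the value of `n`, and the expression values are hypothetical.

```python
def top_genes(mean_expression, n=2):
    """Return the n genes with the highest mean expression as a set."""
    ranked = sorted(mean_expression, key=mean_expression.get, reverse=True)
    return set(ranked[:n])


def supplement_annotation(unannotated_expr, neighbor_expr, neighbor_label, n=2):
    """Return (label, needs_manual_annotation) for one un-annotated cluster."""
    first = top_genes(neighbor_expr, n)      # first high-expression genes
    second = top_genes(unannotated_expr, n)  # second high-expression genes
    if first == second:
        # Same high-expression genes: reuse the neighbor's correct annotation.
        return neighbor_label, False
    # Different genes: hand the cluster over to the manual annotation process.
    return None, True


erythroid = {"HBB": 9.0, "HBA1": 8.0, "CD3E": 0.1}
cluster_20 = {"HBB": 7.0, "HBA1": 7.5, "CD3E": 0.2}
label, needs_manual = supplement_annotation(cluster_20, erythroid, "Erythroid cells")
```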

At this point, the correct annotation result, the corrected annotation result, and the supplementary annotation result have been obtained through steps 301 to 306.

Step 307: the cell type annotation device combines the correct annotation result, the corrected annotation result and the supplementary annotation result to obtain the cell type annotation result to be marked.

All cells to be marked are divided into different cell groups after cell clustering, denoted Cluster 0, 1, 2, ..., n, and the type of each cell group is annotated by the automatic annotation model. Taking n=8 as an example, Cluster 0, Cluster 1, Cluster 2 and Cluster 3 are correct annotation results; Cluster 4, Cluster 5 and Cluster 6 are erroneous annotation results; and Cluster 7 and Cluster 8 are rare cell types and are therefore un-annotated results. Cluster 4 is a type annotation error, Cluster 5 is an incomplete type annotation, and Cluster 6 is a type annotation misplacement. Through the processing of steps 301 to 306, the erroneous annotation results of Cluster 4, Cluster 5 and Cluster 6 are corrected, and the corrected annotation results replace the erroneous annotation results. For the two rare cell types, Cluster 7 (precursor cells that exist only briefly) and Cluster 8 (circulating endothelial cells), supplementary annotation results are obtained by supplementary annotation. Cluster 0, 1, 2, ..., n have then all been annotated and the annotation result of each cluster is correct; by combining the correct annotation results, the corrected annotation results and the supplementary annotation results in step 307, the type annotation results of all the cells to be marked are obtained, and all the annotation results are correct.
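The combination in step 307 can be sketched with the n=8 example above. The function name `merge_annotations` is hypothetical, and apart from the two rare cell types named above the labels are placeholders; the sketch assumes each cluster appears in exactly one of the three partial results and raises an error otherwise.

```python
def merge_annotations(correct, corrected, supplementary):
    """Combine three {cluster_id: label} mappings into one final result."""
    merged = {}
    parts = (("correct", correct), ("corrected", corrected), ("supplementary", supplementary))
    for name, part in parts:
        overlap = merged.keys() & part.keys()
        if overlap:
            raise ValueError(f"clusters {sorted(overlap)} appear twice (in {name})")
        merged.update(part)
    return merged


# Placeholder labels for Cluster 0 to Cluster 6 (hypothetical).
correct = {0: "type A", 1: "type B", 2: "type C", 3: "type D"}
corrected = {4: "type E", 5: "type F", 6: "type G"}
supplementary = {7: "precursor cells", 8: "circulating endothelial cells"}
final_result = merge_annotations(correct, corrected, supplementary)
```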

In some embodiments of the present invention, the cell annotation result obtained in step 307 may be used to identify the transcriptome to be predicted, so that information, characteristics, etc. of the transcriptome to be predicted can be quickly obtained from the annotation result of the transcriptome to be predicted. For example, the annotation result of the transcriptome to be predicted may include a cell type of the transcriptome to be predicted, or may include a plurality of cell attributes of the transcriptome to be predicted under the cell type. For example, the cell type may be an epithelial cell, a T cell, a fibrous cell, a glial cell, an endothelial cell, and the like. The cell attribute corresponding to the T cell may be, for example, helper T cell, suppressor T cell, effector T cell, cytotoxic T cell, memory T cell, or the like. T cells corresponding to different cell attributes have different functions. Helper T cells have the function of assisting humoral immunity and cellular immunity; suppressor T cells have the function of inhibiting cellular immunity and humoral immunity; effector T cells have the function of releasing lymphokines; cytotoxic T cells have the function of killing target cells; memory T cells have the function of remembering specific antigen stimulation.

In some embodiments of the present invention, when the cells to be marked are used for cell transcription, the cells to be marked may be divided into cells of different transcriptomes to be predicted. When annotating the cell data of a transcriptome to be predicted, the cell data of the transcriptome to be predicted is first obtained, where the cell data includes gene expression information of a plurality of sequencing points in the transcriptome to be predicted and spatial information of the plurality of sequencing points. An annotated cell object corresponding to the transcriptome to be predicted is then determined: the tissues or regions of a large number of data sets are obtained, and the cell data from the same tissue or region as the transcriptome to be predicted is determined as the cell object. A new cell annotation result of the transcriptome to be predicted is then determined, in a single cell migration manner, according to the cell object and the gene expression information in the transcriptome to be predicted. In this way, the type annotation result of the cells to be marked can be applied to cell transcription more rapidly, and the efficiency of cell transcription is improved.

After the type annotation of the cells to be marked is completed, and depending on the different usage requirements of the user, the single cell expression profile of the cells to be marked is updated according to the type annotation result of the cells to be marked, so that an updated single cell expression profile is obtained; the updated single cell expression profile is then format-converted and stored in a target database. For example, the updated single cell expression profile may be converted according to the protocols used by different platforms (e.g., Drop-seq, 10x Cell Ranger).
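The update, format conversion, and storage described above can be sketched as follows. The sketch uses JSON as the converted format and an SQLite table as the target database purely for illustration; it does not reproduce the Drop-seq or 10x Cell Ranger formats, and the function names and table layout are hypothetical.

```python
import json
import sqlite3


def update_profiles(profiles, cell_types):
    """Attach the final type annotation to each single cell expression profile."""
    return {
        cell_id: {"expression": dict(expr), "cell_type": cell_types.get(cell_id)}
        for cell_id, expr in profiles.items()
    }


def store_profiles(conn, updated):
    """Convert each updated profile to JSON and store it; returns rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotated_profiles "
        "(cell_id TEXT PRIMARY KEY, cell_type TEXT, expression_json TEXT)"
    )
    rows = [
        (cell_id, rec["cell_type"], json.dumps(rec["expression"], sort_keys=True))
        for cell_id, rec in updated.items()
    ]
    conn.executemany("INSERT OR REPLACE INTO annotated_profiles VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


profiles = {"c1": {"HBB": 9.0, "HBA1": 8.0}, "c2": {"CD3E": 5.0}}
updated = update_profiles(profiles, {"c1": "Erythroid cells", "c2": "T cell"})
connection = sqlite3.connect(":memory:")  # in-memory database for the example
written = store_profiles(connection, updated)
```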

In summary, the embodiment of the application has the following technical effects:

1) The embodiment of the application obtains the cell data to be marked and the marker gene for marking the cell type; determining an automatic annotation model matched with the cell data to be marked; clustering the cell data to be marked through an automatic annotation model to obtain a clustering result; annotating the clustering result based on the marker gene to obtain a first annotation result, wherein the first annotation result comprises: correct annotation result, incorrect annotation result and un-annotated result; therefore, quick annotation of the cell data to be marked can be realized, and the speed of cell type annotation is improved.

2) The erroneous annotation result is adjusted to obtain a corrected annotation result; the un-annotated result is adjusted according to the correct annotation result associated with the un-annotated result to obtain a supplementary annotation result; and the correct annotation result, the corrected annotation result and the supplementary annotation result are combined to obtain the type annotation result of the cells to be marked. Therefore, on the basis of machine annotation, the results that the machine learning approach fails to annotate or annotates incorrectly are further adjusted in a timely manner, and the accuracy of cell type annotation can be improved.

3) Because the adjustment of the un-annotated result and the adjustment of the error annotated result can be realized by the cell type annotation software, a user can intuitively know the process of adjusting the annotated result, so that the controllability of cell type annotation is higher.

The above embodiments are merely examples of the present invention, and are not intended to limit the scope of the present invention, so any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method of cell type annotation, the method comprising:

acquiring cell data to be marked and a marker gene for marking cell types;

adjusting the error annotation result to obtain a corrected annotation result;

2. The method according to claim 1, wherein the obtaining of the cell data to be marked and the marker genes for cell type labeling comprises:

according to the identification information of the cells to be marked, obtaining at least one single cell expression profile matched with the identification information;

combining the at least one single cell expression profile to obtain the cell data to be marked;

and obtaining a marker gene matched with the identification information according to the identification information of the cells to be marked.

3. The method of claim 1, wherein said determining an automatic annotation model that matches said cell data to be marked comprises:

when the type annotation result of the cells to be marked needs to characterize the continuity and organization of differentiation among all cell populations, determining the automatic annotation model matched with the cell data to be marked as a uniform manifold approximation and projection (UMAP) model;

and when the type annotation result of the cells to be marked does not need to characterize the continuity and organization of differentiation among all cell populations, determining the automatic annotation model matched with the cell data to be marked as a t-distributed stochastic neighbor embedding (t-SNE) model.

4. The method of claim 1, wherein annotating the clustered results based on the marker gene results in a first annotated result, comprising:

acquiring each cluster group in the cluster result;

searching gene fragments matched with the marker genes in each cluster group based on the marker genes;

when the gene fragment is obtained, annotating each cluster group to obtain a cluster group annotation result;

and combining the cluster group annotation results to obtain the first annotation result.

5. The method according to claim 1, wherein the method further comprises:

responding to the annotation result adjustment instruction, and adjusting the error annotation result;

updating the cell data to be marked according to the correction annotation result to obtain updated cell data;

and saving the updated cell data as metadata.

6. The method of claim 1, wherein adjusting the un-annotated result based on the correct annotation result associated with the un-annotated result to obtain a supplementary annotation result comprises:

searching for a first high-expression gene in the correct annotation result associated with the un-annotated result;

determining a second high-expression gene in the un-annotated result;

determining that the correct annotation result and the supplementary annotation result are the same when the first high-expression gene and the second high-expression gene are the same;

and triggering a manual annotation process when the first high-expression gene and the second high-expression gene are different, wherein the manual annotation process is used for manually adjusting the un-annotated result to obtain the supplementary annotation result.

7. The method according to claim 1, wherein the method further comprises:

updating the single-cell expression profile of the cells to be marked according to the type annotation result of the cells to be marked to obtain an updated single-cell expression profile;

and carrying out format conversion on the updated single cell expression profile and storing the updated single cell expression profile into a target database.

8. A cell type annotation device, the device comprising:

9. An electronic device, the electronic device comprising:

a memory for storing executable instructions;

a processor for implementing the cell type annotation method according to any of claims 1 to 7 when executing the executable instructions stored in the memory.

10. A computer readable storage medium storing executable instructions which when executed by a processor implement the cell type annotation method of any of claims 1 to 7.

11. A computer readable storage medium storing executable instructions for execution by a processor to implement the cell type annotation method of any one of claims 1 to 7.