CN117476101A

CN117476101A - Method, system, equipment and medium for distinguishing malignant cells by using multi-omics sequencing data

Info

Publication number: CN117476101A
Application number: CN202311568169.2A
Authority: CN
Inventors: 郭国骥; 叶昉; 张爽; 傅雨婷
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2023-11-22
Filing date: 2023-11-22
Publication date: 2024-01-30

Abstract

The invention discloses a method, a system, equipment and a medium for distinguishing malignant cells by using multi-omics sequencing data, belonging to the technical field of tumor single-cell sequencing. The method comprises performing high-throughput single-cell multi-omics sequencing by using molecularly tagged microbeads, and then carrying out copy number variation analysis based on the multi-omics sequencing data, so as to distinguish malignant cells in the tumor and in the paratumor tissue. The invention can accurately distinguish the genome sequence characteristics and gene expression patterns of malignant cells in tumors at the multi-omics level, and has great application value in the detection and auxiliary diagnosis of clinical tumor samples.

Description

Method, system, equipment and medium for distinguishing malignant cells by using multi-omics sequencing data

Technical Field

The invention belongs to the technical field of tumor single-cell sequencing, and particularly relates to a method, a system, equipment and a medium for distinguishing malignant cells by using multi-omics sequencing data.

Background

Tumors are the disease with the highest morbidity and mortality worldwide. Tumorigenesis derives, to a certain extent, from original malignant cells that acquire stemness after accumulating mutations. Tumor heterogeneity is shaped by the cell types of different phenotypes and morphologies that these malignant cells generate through proliferation and differentiation, under changes in the endogenous tumor microenvironment and induction by exogenous conditions. Tumorigenesis and tumor development within various organs and tissues derive from intratumoral heterogeneity, and the evolution of different tumors also shares a common feature: the evolution of tumor clones carrying different mutations results in one or more clone types with a survival advantage, which determine the molecular characteristics and microenvironment of the tumor. This process is dynamic and complex. Intratumoral heterogeneity is a key factor in the development of resistance to chemotherapy, targeted drug therapy and immunotherapy during clinical treatment, and in recurrence and mortality.

With the progress of high-throughput second-generation sequencing technology in recent years, deep genome sequencing studies of different tumor types have revealed that genome instability and various somatic mutations are closely related to the formation of tumor heterogeneity and to tumor survival and evolution. Multidimensional analysis of tumors at cellular resolution helps to further clarify how intratumoral heterogeneity forms and how clones evolve, to explore the common and distinct mechanisms of tumor occurrence and development, and to address important problems such as clinical tumor recurrence and drug resistance. However, cell-level multidimensional analyses and comparative studies of the occurrence, development and internal heterogeneity of different tumor types are still relatively few, and at the technical level there is a lack of independently developed, low-cost and relatively high-throughput platforms.

In the past, molecular characterization of tumor tissues usually relied on genome sequencing at the bulk (cell-population) level, gene expression analysis (transcriptome sequencing, gene expression microarrays or fluorescent quantitative analysis), and protein localization and expression at the tissue level. Limited by the resolution of these techniques, gene expression detection at the population level cannot reflect the heterogeneity of the cells within. Single-cell sequencing can detect gene expression or genomic variation at single-cell precision, and provides a new opportunity to analyze intratumoral heterogeneity and evolutionary trajectories. In the field of tumor research, single-cell sequencing can help address a series of problems, such as primary tumor heterogeneity, the tumor microenvironment, and the association between primary and recurrent or metastatic tumor foci, from multi-omics dimensions such as the genome, transcriptome, proteome, metabolome and epigenome.

The mechanisms of tumor cell origin and the heterogeneity of tumor cell evolution revealed by single-cell omics, together with the molecular characteristics of malignant cell mutations and of internally heterogeneous cells, can provide further clues for the diagnosis and prevention of tumors, and have great potential for translation in mechanistic research and in diagnosis and prevention. Currently, single-cell omics studies of tumors focus on molecular typing of cells based on transcriptome gene expression. With the development of commercial single-cell technology platforms and high-throughput sequencers, single-cell transcriptome atlases of various tumor animal models and human clinical tumor samples have been published. These tumor cell atlases systematically characterize the heterogeneity of intratumoral cells and immune microenvironment cells. Therefore, developing a rapid tumor cell identification method based on microwell single-cell multi-omics sequencing has important clinical significance.

Disclosure of Invention

In order to solve the technical problems, the technical scheme provided by the invention is as follows:

The first aspect of the invention provides a method for distinguishing malignant cells based on single-cell multi-omics sequencing, which comprises the following steps:

S1, obtaining a tumor sample and a paratumor sample, preparing a single-nucleus suspension from each, mixing the single-nucleus suspension with molecularly tagged microbeads, loading the mixture into a microwell chip, capturing and labeling the nucleic acid sequences of the nuclei in situ in the microwells, and adding a cell identity tag and a molecular tag;

S2, constructing a sequencing library, and performing at least two of single-cell transcriptome sequencing, single-cell chromatin accessibility sequencing, single-cell genome sequencing and single-cell methylation sequencing to obtain different single-cell sequencing data;

S3, for each type of single-cell sequencing data, performing the following analysis separately:

S31, obtaining the average copy number variation levels of the tumor sample and of the paratumor sample, which serve as the malignant copy number variation expectation and the normal copy number variation expectation, respectively;

S32, dividing the single-cell sequencing data of the tumor sample and the paratumor sample into N subsets, and judging each subset according to the following criteria:

if the average copy number variation level of the subset is less than the normal copy number variation expectation, the subset is a normal subset and its cells are normal cells; if the average copy number variation level of the subset is greater than the malignant copy number variation expectation, the subset is a malignant subset and its cells are malignant cells; if the average copy number variation level of the subset lies between the normal copy number variation expectation and the malignant copy number variation expectation, the subset is an intermediate-state subset;

S33, dividing each intermediate-state subset into N subsets again, and classifying them according to the criteria in S32;

S34, repeating step S33 until no more normal or malignant subsets are identified, or until the maximum number of iterations Y is reached,

wherein N = 20 to 100 and Y = 10 to 50;

S4, performing correlation analysis on the chromosome copy number variation patterns of the malignant cells identified from the different single-cell sequencing data in step S3, and combining the malignant cells by using chromosome regions with the same copy number variation pattern.
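The iterative classification in steps S31 to S34 can be illustrated with a minimal Python sketch. It is not the patented implementation: the source does not say how the N subsets are formed, so the sketch assumes a simple split of cells sorted by a per-cell copy number variation score. The names `split_into_subsets` and `classify_cells`, the per-cell score dictionaries, and the assumption that cell identifiers are unique across both samples are all hypothetical.

```python
from statistics import mean


def split_into_subsets(cells, n):
    """Assumed partitioning: sort (cell_id, score) pairs by score, cut into up to n chunks."""
    ordered = sorted(cells, key=lambda c: c[1])
    n = max(1, min(n, len(ordered)))
    size, extra = divmod(len(ordered), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(ordered[start:end])
        start = end
    return chunks


def classify_cells(tumor_scores, paratumor_scores, n_subsets=20, max_iter=10):
    """Label each cell 'normal', 'malignant' or 'intermediate' (steps S31 to S34)."""
    # S31: sample-level averages serve as the two expectations.
    normal_exp = mean(paratumor_scores.values())
    malignant_exp = mean(tumor_scores.values())
    pending = list(tumor_scores.items()) + list(paratumor_scores.items())
    labels = {}
    for _ in range(max_iter):            # S34: at most Y iterations
        if not pending:
            break
        intermediate, resolved = [], False
        for subset in split_into_subsets(pending, n_subsets):   # S32 / S33
            level = mean(score for _, score in subset)
            if level < normal_exp:
                label = "normal"
            elif level > malignant_exp:
                label = "malignant"
            else:
                intermediate.extend(subset)
                continue
            resolved = True
            for cell_id, _ in subset:
                labels[cell_id] = label
        pending = intermediate
        if not resolved:                 # S34: no new normal or malignant subset
            break
    for cell_id, _ in pending:
        labels[cell_id] = "intermediate"
    return labels
```

The sketch assumes that the malignant expectation is larger than the normal expectation; in practice the subsets would more plausibly come from clustering than from a sorted split, and N and Y would be chosen within the ranges stated in the text.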

In some embodiments of the present invention, in step S1, the molecularly tagged microbeads and the nuclei are mixed at a ratio of 1:1 and then loaded onto the microwell chip, so that the nuclei carry cell identity tags, which facilitates rapid identification of nuclei from different cell types in subsequent analysis. Preferably, for transcriptome sequencing and chromatin accessibility sequencing, reverse transcription or genome fragmentation, respectively, is performed at the same time as the nuclei are labeled with cell identity tags.

Further, step S1 also includes: resuspending and fixing the nucleus suspension with any one of an aldehyde fixative (such as paraformaldehyde), an alcohol fixative (such as ethanol), an acid fixative and a cross-linking agent, so that the nucleic acids and proteins in the nuclei are cross-linked and fixed and nucleic acid molecules can enter the cells or nuclei and react more effectively. Preferably, for genome sequencing, the nuclei are not subjected to any organic-solvent fixation, so that the transposase can enter the nuclei and react more effectively.

In some embodiments of the invention, in step S1, the nucleic acid molecules labeled in situ in the nuclei are mRNA and DNA. The poly-T tail carried by the nucleic acid molecules of known base sequence on the microbead surface can hybridize to mRNA in the treated nuclei; the random or fixed sequences carried by the nucleic acid molecules of known base sequence on the microbead surface can hybridize to DNA in the treated nuclei.

In some embodiments of the invention, step S3 further comprises, before the analysis, a pseudo-bulk processing step:

single-cell sequencing data from the same sample are summed according to cell number for pseudo-bulk processing, a pseudo-bulk set is constructed by summing the single-cell sequencing data of neighboring cells selected by Euclidean distance, and the data are normalized.
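A minimal Python sketch of this pseudo-bulk step follows. It rests on stated assumptions: each cell has a count vector and a low-dimensional embedding, neighbors are the k nearest cells of the same sample by Euclidean distance, and normalization rescales each summed profile to a fixed total. The function name `pseudo_bulk` and the parameters `k` and `scale` are hypothetical; the source gives no values for them.

```python
import math


def pseudo_bulk(counts, embedding, sample_of, k=2, scale=1e4):
    """Sum each cell's counts with its k nearest same-sample neighbors, then normalize.

    counts:    {cell_id: [count per feature]}
    embedding: {cell_id: [coordinates]} used for Euclidean distances (assumed input)
    sample_of: {cell_id: sample name}
    """
    result = {}
    for cell, vec in counts.items():
        # Only cells from the same sample are eligible neighbors.
        peers = [c for c in counts if c != cell and sample_of[c] == sample_of[cell]]
        peers.sort(key=lambda c: math.dist(embedding[cell], embedding[c]))
        group = [cell] + peers[:k]
        summed = [sum(counts[c][i] for c in group) for i in range(len(vec))]
        total = sum(summed)
        # Normalize to a fixed library size (assumed normalization scheme).
        result[cell] = [scale * v / total if total else 0.0 for v in summed]
    return result
```

Summing neighbors reduces the sparsity of single-cell profiles before copy number estimation; the choice of k trades resolution against noise.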

In some embodiments of the invention, in step S4, combining the malignant cells by using chromosome regions with the same copy number variation pattern specifically comprises: screening chromosome regions whose copy number variation direction is the same across the data types (both amplification or both deletion), and drawing a chromosome variation pattern map of the malignant cells, so as to combine the malignant cells.
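The cross-modality comparison of step S4 can be sketched as follows, assuming that each data type yields one average copy number variation score per chromosome region, centered on zero so that positive values indicate amplification and negative values deletion. The function names, the region labels and the `cutoff` parameter are hypothetical illustrations, not values from the source.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def concordant_regions(rna_cnv, atac_cnv, cutoff=0.1):
    """Keep regions where both data types agree on amplification or on deletion."""
    out = {}
    for region in rna_cnv.keys() & atac_cnv.keys():
        r, a = rna_cnv[region], atac_cnv[region]
        if r > cutoff and a > cutoff:
            out[region] = "amplification"
        elif r < -cutoff and a < -cutoff:
            out[region] = "deletion"
    return out
```

Regions returned by `concordant_regions` would then serve as the shared chromosome regions used to combine malignant cells identified from the different data types.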

In some embodiments of the invention, the method further comprises a step of performing cell subtype identification based on any one of the single-cell sequencing data:

grouping all microbeads pairwise according to the cell identity tags in the sequencing data to form microbead pairs;

traversing the microbead pairs and calculating, for each pair, the similarity of the sequences captured by the two microbeads, and ranking the microbead pairs by similarity;

then, according to the number of microwells actually contained in the microwell chip, merging the microbead pairs whose sequence similarity is higher than a preset threshold;

finally, merging the gene matrices of cells derived from the tumor sample and the paratumor sample, performing dimensionality reduction, feature selection, differential analysis and cell subpopulation clustering on the merged single-cell omics matrix, and annotating the cell subpopulations based on public databases.

In this process, a similarity score is calculated from the similarity in distribution of the random sequences carried by the cell identity tags that the microbeads capture, so as to determine when several microbeads are located in the same microwell, and the genetic sequence information of all microbeads in the same microwell is merged. When several nuclei are in the same microwell, the merged genetic sequence information is assigned back to individual nuclei by means of the cell identity tags, so that multi-omics data with single-cell resolution can be obtained.
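The pairwise similarity ranking described above can be sketched in Python with a Jaccard index over the sets of sequences each microbead captured. This is an illustration under assumptions: the input format (a dictionary from bead identifier to a set of captured sequences), the function names and the threshold value are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections of captured sequences."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_bead_pairs(captured):
    """Score every bead pair and sort from most to least similar."""
    pairs = [
        (jaccard(captured[x], captured[y]), x, y)
        for x, y in combinations(sorted(captured), 2)
    ]
    return sorted(pairs, reverse=True)


def select_pairs(ranked, threshold):
    """Keep pairs whose similarity exceeds the preset threshold."""
    return [(x, y) for score, x, y in ranked if score > threshold]
```

Scoring all pairs is quadratic in the number of microbeads, so a real pipeline would likely restrict comparisons to beads that share at least one captured sequence.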

For transcriptome sequencing, the primer sequence attached to the microbeads includes four parts: a library linker sequence, a cell tag sequence, a molecular tag sequence and a nucleic acid capture sequence. The library linker sequence is used for subsequent on-machine sequencing; the cell tag sequence is used to identify different cells; the molecular tag sequence is composed of random bases, and each DNA molecule carries a unique molecular tag sequence used to distinguish different DNA molecules during pooled sequencing; the nucleic acid capture sequence contains a poly-T tail or a random primer sequence for capturing RNA molecules.

For genome sequencing, library construction uses transposase to fragment the open regions of chromatin; the primer sequence attached to the microbeads includes four parts: a library linker sequence, a cell tag sequence, a molecular tag sequence and a nucleic acid capture sequence. The library linker sequence is used for subsequent on-machine sequencing; the cell tag sequence is used to identify different cells; the molecular tag sequence is composed of random bases, and each DNA molecule carries a unique molecular tag sequence used to distinguish different DNA molecules during pooled sequencing; the nucleic acid capture sequence contains a hybridization sequence complementary to the transposase adapter sequence, for capturing the DNA molecules fragmented by the transposase.
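As an illustration of the four-part primer structure, the following Python sketch splits a read into its cell tag, molecular tag and capture portions by fixed offsets. The segment lengths are purely hypothetical, since the source does not state them, and `PrimerLayout`, `ParsedRead` and `parse_primer` are illustrative names.

```python
from typing import NamedTuple


class PrimerLayout(NamedTuple):
    """Assumed segment lengths in bases; real lengths are not given in the source."""
    linker: int
    cell_tag: int
    molecular_tag: int


class ParsedRead(NamedTuple):
    cell_tag: str
    molecular_tag: str
    capture: str


def parse_primer(read, layout=PrimerLayout(linker=4, cell_tag=6, molecular_tag=4)):
    """Split a read laid out as linker + cell tag + molecular tag + capture sequence."""
    tag_start = layout.linker
    umi_start = tag_start + layout.cell_tag
    capture_start = umi_start + layout.molecular_tag
    if len(read) < capture_start:
        raise ValueError("read is shorter than the assumed primer layout")
    return ParsedRead(
        cell_tag=read[tag_start:umi_start],
        molecular_tag=read[umi_start:capture_start],
        capture=read[capture_start:],
    )
```

Reads sharing a cell tag are attributed to the same cell, and reads that also share a molecular tag can be collapsed as copies of one original molecule.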

In some embodiments of the present invention, merging the microbead pairs whose sequence similarity is higher than the preset threshold specifically covers the following cases:

(1) when a microwell contains one cell and one microbead, the cell identity tag and the genetic sequence information captured by the microbead are used directly as the genetic information matrix of that single cell;

(2) when a microwell contains several cells and one microbead, the genetic sequence information captured by the microbead is assigned to the cells according to the cell identity tags of the paired cells, as the genetic information matrices of those cells;

(3) when a microwell contains one cell and several microbeads, the genetic sequences captured by the microbeads are accumulated and assigned to the cell, as the genetic information matrix of that single cell;

(4) when a microwell contains several cells and several microbeads, the genetic sequences captured by the microbeads are accumulated and then assigned to the cells according to the cell identity tags of the paired cells, as the genetic information matrices of those cells.
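The four cases above can be handled uniformly by first grouping microbeads into microwells and then counting per combination of microwell and cell identity tag. The Python sketch below uses a small union-find for the grouping. The read format `(bead_id, cell_tag, gene)`, the function names and the gene names in the usage note are illustrative assumptions.

```python
from collections import Counter, defaultdict


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_beads_into_wells(beads, same_well_pairs):
    """Map every bead to a representative bead that stands for its microwell."""
    parent = {b: b for b in beads}
    for x, y in same_well_pairs:
        rx, ry = find(parent, x), find(parent, y)
        if rx != ry:
            parent[max(rx, ry)] = min(rx, ry)
    return {b: find(parent, b) for b in beads}


def build_cell_matrix(reads, same_well_pairs):
    """Accumulate counts per (microwell, cell identity tag), covering cases (1) to (4)."""
    beads = {bead for bead, _, _ in reads}
    beads |= {b for pair in same_well_pairs for b in pair}
    well_of = group_beads_into_wells(beads, same_well_pairs)
    matrix = defaultdict(Counter)
    for bead, cell_tag, gene in reads:
        matrix[(well_of[bead], cell_tag)][gene] += 1
    return {key: dict(counts) for key, counts in matrix.items()}
```

For example, with reads from beads `b1` and `b2` judged to share a microwell, counts from both beads carrying the same cell identity tag are summed into one cell (case 3), while reads carrying a second tag in that microwell form a separate cell (case 4).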

In some embodiments of the invention, the method further comprises predicting key transcription factors and/or their target genes in the identified malignant cells, and performing molecular typing of the malignant cells.

The invention can cross-validate and further assist in identifying malignant cells in samples of suspected cancer (malignant tumor), and can rapidly determine, from the single-cell omics dimension, the cell lineage origin, proportion and copy number variation pattern of the malignant cells, as well as molecular typing indicators such as target genes, thereby supporting auxiliary diagnosis.

In a second aspect, the invention provides a system for distinguishing malignant cells based on single-cell multi-omics sequencing, comprising the following modules:

a data input module, used for receiving different single-cell sequencing data of a tumor sample and a paratumor sample, obtained from at least two of single-cell transcriptome sequencing, single-cell chromatin accessibility sequencing, single-cell genome sequencing and single-cell methylation sequencing;

a malignant cell distinguishing module, connected to the data input module and used for performing the following analysis on each type of single-cell sequencing data separately:

S34, repeating step S33 until no more normal or malignant subsets are identified, or until the maximum number of iterations Y is reached, wherein N = 20 to 100 and Y = 10 to 50;

S4, performing correlation analysis on the chromosome copy number variation patterns of the malignant cells identified from the different single-cell sequencing data in step S3, and combining the malignant cells by using chromosome regions with the same copy number variation pattern.

Further, the system also comprises:

a cell subtype identification module, connected to the data input module and to the malignant cell distinguishing module, and used for performing cell subtype identification according to the following steps:

finally, merging the gene matrices of cells derived from the tumor sample and the paratumor sample, performing dimensionality reduction, feature selection, differential analysis and cell subpopulation clustering on the merged single-cell omics matrix, and annotating the cell subpopulations based on public databases;

and for determining the variation patterns of malignant cells in different cell subpopulations based on the malignant cells identified by the malignant cell distinguishing module.

Still further, the system also comprises a key target gene and regulatory network enrichment module, connected to the malignant cell distinguishing module and used for predicting key transcription factors and/or their target genes in the identified malignant cells and performing molecular typing of the malignant cells.

A third aspect of the present invention provides a computer apparatus comprising:

a memory for storing a computer program;

a processor for performing, when executing the computer program, the steps of the method for distinguishing malignant cells based on single-cell multi-omics sequencing according to any embodiment of the first aspect of the invention.

A fourth aspect of the invention provides a computer-readable storage medium,

the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method for distinguishing malignant cells based on single-cell multi-omics sequencing according to any embodiment of the first aspect of the invention.

The beneficial effects of the invention are that

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a method, equipment and medium for rapid tumor cell identification based on microwell single-cell multi-omics sequencing. Based on a microwell and microbead system, the multi-omics genetic information of tumor samples is detected at the single-cell level in a high-throughput manner. Based on this multi-omics information, malignant cells in the tumor are identified rapidly and accurately, and the characteristic regulatory patterns of their target genes are enriched, providing a reference for the typing and auxiliary diagnosis of clinical tumors.

According to the invention, correlation analysis is carried out on the chromosome copy number variation patterns obtained from different single-cell sequencing data, chromosome regions whose copy number variation direction is amplification or deletion are screened, and a chromosome variation pattern map of the malignant cells is drawn. The malignant cells obtained by iterative grouping on the average copy number level of each cell are merged. The proportion of the core malignant cell subpopulation and its distribution within each cell lineage are determined. The genome variation pattern is then determined, the key target genes and regulatory network of the malignant cells are enriched, and molecular typing of the malignant cells is carried out. The invention can integrate multi-omics data of malignant tumor cells to construct a regulatory network. Regulatory network construction in the prior art is essentially based on single-cell transcriptome gene expression data alone, whereas integrating omics data such as the tumor single-cell transcriptome and single-cell chromatin accessibility to construct a regulatory network at single-cell resolution is original to the invention.

Drawings

FIG. 1 shows the cell subtypes defined by genomic chromatin accessibility in mouse lung tumor samples, paratumor samples and normal reference control lung samples, together with the sample sources of the cell subtypes. Adj represents paratumor samples, Normal represents contralateral normal lung samples, and Tumor represents tumor samples.

FIG. 2 shows, for the cell subtypes defined by genomic chromatin accessibility in mouse lung tumor samples, paratumor samples and contralateral normal lung samples, the projected distribution of malignant cells (malignant) and non-malignant normal cells (nonmalignant) as defined by the degree of copy number variation.

FIG. 3 shows the copy number variation at the chromatin accessibility level identified by Copy-scAT, grouped by malignancy. NMF_cluster denotes clusters obtained by non-negative matrix factorization, an unsupervised clustering method; by this method, cluster 3 is the malignant cell subset, which matches the distribution defined by copy number variation.

FIG. 4 shows the chromosome-wide copy number variation patterns of malignant cells (malignant) versus non-malignant cells (nonmalignant) predicted at the transcriptome level by inferCNV.

FIG. 5 shows the correlation between the average copy number variation score at the transcriptome level identified by inferCNV and the average copy number variation score at the chromatin accessibility level identified by Copy-scAT, and the chromosome bands on which copy number deletion (del effect) and copy number amplification (dup effect) are consistent.

FIG. 6 shows the enrichment results for the key target genes of malignant cells of mouse lung tumors and their regulatory network. The pink genes are the selected key target genes, the shade of a node's color represents its network centrality, and the node size represents the number of genes it interacts with.

Detailed Description

Unless otherwise indicated, implied from the context, or conventional in the art, all parts and percentages in the present application are based on weight, and the test and characterization methods used are current as of the filing date of the present application. Where applicable, the disclosure of any patent, patent application, or publication referred to in this application is incorporated by reference in its entirety, and the equivalent patents to those cited are incorporated by reference, particularly as they relate to the definitions of terms in the art. If the definition of a particular term disclosed in the prior art does not conform to any definition provided in this application, the definition of that term provided in this application controls.

Numerical ranges in this application are approximations and may therefore include values outside the range unless otherwise indicated. A numerical range includes all values from the lower value to the upper value in increments of 1 unit, provided that there is a spacing of at least 2 units between any lower value and any higher value. For ranges containing values less than 1 or containing fractions greater than 1 (e.g., 1.1, 1.5, etc.), 1 unit is suitably considered to be 0.0001, 0.001, 0.01, or 0.1. For a range containing single-digit values less than 10 (e.g., 1 to 5), 1 unit is generally considered to be 0.1. These are merely specific examples of what is intended, and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered expressly stated in this application.

The terms "comprises," "comprising," "including," and their derivatives do not exclude the presence of any other component, step or procedure, and are not related to whether or not such other component, step or procedure is disclosed in the present application. For the avoidance of any doubt, all use of the terms "comprising," "including," or "having" herein, unless expressly stated otherwise, may include any additional additive, adjuvant, or compound. Rather, the term "consisting essentially of … …" excludes any other component, step or process from the scope of any of the terms recited below, except as necessary for operability. The term "consisting of … …" does not include any components, steps or processes not specifically described or listed. The term "or" refers to the listed individual members or any combination thereof unless explicitly stated otherwise.

In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the embodiments.

The following examples are presented herein to demonstrate preferred embodiments of the present invention. It will be appreciated by those skilled in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, and that common understanding is incorporated herein by reference.

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the claims.

The experimental methods in the following examples are conventional methods unless otherwise specified. The instruments used in the following examples are laboratory conventional instruments unless otherwise specified; the test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores.

Example 1 Microwell-based single-cell multi-omics library preparation and sequencing of lung tumor samples from aged mice

1. Sample preparation

Tumor tissue and paratumor control tissue were isolated from aged C57BL/6 mice identified as having lung tumors. The two tissues were snap-frozen, ground into powder in liquid nitrogen, and then added to nuclear lysis buffer and lysed on ice. The single-nucleus suspension was obtained after washing by centrifugation.

2. Transcriptomic library construction

Nuclei were fixed with 4% paraformaldehyde (PFA). The single-nucleus suspensions of the tumor and paratumor tissues of different samples were added to separate centrifuge tubes, and reverse transcription primers carrying different cell identity tag sequences, reverse transcriptase, reverse transcription reaction buffer, dNTPs, RNase inhibitor, 50% PEG8000 and 10% Triton X-100 were added to each tube. After thorough mixing, the tubes were placed in a PCR instrument for the isothermal reverse transcription reaction. After the reverse transcription reaction was complete, the nuclei were washed with 3× SSC and with PBS, in preparation for chip loading.

For chip loading, the nuclei and the molecularly tagged microbeads were mixed in equal proportion, and an amplification system containing isothermal polymerase and high-fidelity polymerase was added. According to the actual sample amount, the reverse-transcribed nuclei of the different tumor samples were loaded quickly and evenly into the microwell chip with a pipette. The occupancy of microbeads and cells in the microwells was checked under a microscope to ensure that the well-occupancy rate of nuclei and microbeads in the microwell chip was above 70%. Sealing oil was then added to seal the microwell chip so that each microwell formed an independent reaction space, and the chip was placed in a PCR thermal cycler for amplification.

After the reaction, the liquid and the molecularly tagged microbeads in the chip were collected thoroughly by repeated centrifugation, and the reaction solution was aspirated and transferred to a new centrifuge tube. DNA purification magnetic beads were added to purify the amplified cDNA. The amplified cDNA was added to an amplification system containing sequencing adapters (P5 and P7) carrying a sequencing tag (index) and high-fidelity polymerase, and amplified to obtain an indexed sequencing library. The library was purified with DNA Clean Beads, its concentration was determined with a Qubit 3.0 fluorometer, and it was stored at -20 ℃. An appropriate amount of the sequencing library was then sequenced according to the loading requirements of the sequencer.

3. Epigenomic (chromatin accessibility) library construction

The single-nucleus suspensions of the tumor and paratumor tissues of different samples were added to separate centrifuge tubes, and a digestion system consisting of transposase carrying different cell identity tag sequences, 2× digestion reaction buffer, 1% digitonin, 10% Tween-20 and 1× PBS was added to each tube. After thorough mixing, the digestion reaction was carried out for half an hour at a constant temperature of 37-55 ℃. The reaction was stopped on ice, and the nuclei were collected by centrifugation and washed twice with PBS wash buffer, in preparation for chip loading.

For chip loading, the nuclei and the molecularly tagged microbeads were mixed in equal proportion, and 50 mM EDTA and 2× high-fidelity polymerase were added and mixed. According to the actual sample amount, the treated nuclei of the different tumor samples were loaded quickly and evenly into the microwell chip with a pipette. The occupancy of microbeads and cells in the microwells was checked under a microscope to ensure that the well-occupancy rate of nuclei and microbeads in the microwell chip was above 70%. The tube cover was fastened and the centrifuge tube was incubated at a constant 50 ℃ for half an hour to release the genome fragments. An amplification system containing isothermal polymerase and high-fidelity polymerase was then added to the microwell chip, sealing oil was added to seal the chip so that each microwell formed an independent reaction space, and the chip was placed in a PCR thermal cycler for amplification.

After the reaction, the liquid and the molecularly tagged microbeads in the chip were collected thoroughly by repeated centrifugation, and the reaction solution was aspirated and transferred to a new centrifuge tube. DNA purification magnetic beads were added to purify the amplified DNA. The amplified DNA was added to an amplification system containing sequencing adapters P5 and P7 and high-fidelity polymerase, and amplified to obtain an indexed sequencing library. The library was purified with DNA Clean Beads, its concentration was determined with a Qubit 3.0 fluorometer, and it was stored at -20 ℃. An appropriate amount of the sequencing library was then sequenced according to the loading requirements of the sequencer.

Example 2 Cell subtype identification of aged mouse lung tumor samples based on single-cell multi-omics

The raw fastq data from the high-throughput sequencing in Example 1 were extracted and filtered according to the cell identity tag sequences introduced during reverse transcription/transposase digestion and the molecular tag sequences carried on the microbeads, and the sequencing data were aligned to the mouse reference genome to obtain a single-cell transcriptome matrix and a chromatin accessibility matrix.

First, according to the cell identity tags extracted by position from the sequencing data, all microbeads are grouped and paired to form microbead pairs.

Then, each microbead pair is traversed and the similarity of the capture sequences of its two microbeads is calculated (Jaccard similarity score), and the microbead pairs are ranked by similarity. In transcriptome sequencing, the capture sequences included in the calculation are the random primer sequences; in chromatin accessibility sequencing, the capture sequences included in the calculation are the captured genetic information.
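
The pair scoring and ranking described above can be sketched as follows. This is a minimal illustration only, not the implementation of the invention: the input layout (a dictionary from bead barcode to the set of capture sequences seen on that bead), the function names `jaccard` and `rank_bead_pairs`, and the exhaustive pairing of all beads are assumptions made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_bead_pairs(bead_sequences):
    """Score every bead pair and sort from most to least similar.

    bead_sequences: dict mapping a bead barcode to the set of capture
    sequences observed on that bead (hypothetical input layout).
    Returns a list of (score, bead_x, bead_y) tuples.
    """
    scored = [
        (jaccard(bead_sequences[x], bead_sequences[y]), x, y)
        for x, y in combinations(sorted(bead_sequences), 2)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored


# Toy data: bead_A and bead_B share three of five distinct sequences.
beads = {
    "bead_A": {"s1", "s2", "s3", "s4"},
    "bead_B": {"s2", "s3", "s4", "s5"},
    "bead_C": {"s9"},
}
ranked = rank_bead_pairs(beads)
```

Scoring all pairs is quadratic in the number of beads; a practical pipeline would presumably restrict the comparison to beads that share at least one cell identity tag.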

Then, according to the number of wells actually contained in the microwell chip (10,000), microbead pairs with high sequence similarity are merged and their microbeads are judged to be in the same microwell; the Jaccard similarity score of the 10,000th-ranked microbead pair is used as the threshold above which sequence similarity is considered high. The following classification is then performed:

(1) When one cell and one microbead are found in the microwell, the cell identity tag and the genetic sequence information captured by the microbead are used directly as the genetic information matrix of that single cell.

(2) When multiple cells and one microbead are found in the microwell, the genetic sequence information captured by the microbead is assigned to the individual cells according to the cell identity tags (from the reverse transcription/digestion step) paired with each captured sequence, giving the genetic information matrix of the multiple cells.

(3) When one cell and multiple microbeads are found in the microwell, the genetic sequences captured by the microbeads are accumulated and assigned to that cell as the genetic information matrix of the single cell.

(4) When multiple cells and multiple microbeads are found in the microwell, the genetic sequences captured by the microbeads are first accumulated and then assigned to the individual cells according to the cell identity tags (from the reverse transcription/digestion step) paired with each captured sequence, giving the genetic information matrix of the multiple cells.
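
One way to handle the four cases uniformly is to merge beads that share a microwell and then key every read by its (well, cell identity tag) pair, as in the sketch below. This is an illustrative assumption rather than the method as implemented: the union-find merge, the `(bead, cell_tag, feature)` read layout, and the names `merge_beads` and `build_cell_matrix` are all hypothetical.

```python
from collections import Counter, defaultdict


def merge_beads(ranked_pairs, n_wells):
    """Union the top n_wells bead pairs so beads in one well share a root.

    ranked_pairs: list of (score, bead_x, bead_y), most similar first.
    Returns a function mapping a bead to its well representative.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, a, b in ranked_pairs[:n_wells]:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return find


def build_cell_matrix(reads, find):
    """Accumulate feature counts per (well, cell identity tag).

    reads: iterable of (bead, cell_tag, feature) tuples (assumed layout).
    Beads in the same well are pooled (cases 3 and 4); distinct cell tags
    in one well stay separate cells (cases 2 and 4).
    """
    cells = defaultdict(Counter)
    for bead, cell_tag, feature in reads:
        cells[(find(bead), cell_tag)][feature] += 1
    return cells


# Toy data: only the best-scoring pair is merged (n_wells=1).
pairs = [(0.6, "bead_A", "bead_B"), (0.0, "bead_A", "bead_C")]
find = merge_beads(pairs, n_wells=1)
reads = [
    ("bead_A", "tag1", "Gene1"),
    ("bead_B", "tag1", "Gene1"),
    ("bead_B", "tag2", "Gene2"),
    ("bead_C", "tag3", "Gene1"),
]
cells = build_cell_matrix(reads, find)
```

In this toy run, bead_A and bead_B are pooled into one well holding two cells (tag1 and tag2), while bead_C remains a well with a single cell.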

Finally, the tumor cells and paratumor tissue cells are combined, and the downstream matrices are generated and processed with Seurat for the transcriptome data and ArchR for the chromatin accessibility data. The top 2,000 variable feature genes are selected, PCA and dimensionality reduction are performed on the single-cell gene expression matrix, and the characteristic single-cell subgroups are projected onto a two-dimensional plane. For each subgroup, the most characteristic gene set is obtained with differential enrichment tools such as FindMarkers. The cell subgroups are classified and defined with reference to existing gene annotation databases such as PanglaoDB, assigning them to specific cell types of different lineages and identifying the different epithelial, stromal and immune cell types within the tumor tissue. Subtype classification is also performed based on genomic chromatin accessibility, as shown in FIG. 1, in which the sample source of each cell is also labeled.

Example 3 Identification of malignant cells in a mouse lung tumor sample

Cell types in the tumor and paratumor specimens are identified using the single-cell genomic chromatin accessibility and transcriptome data. Based on the number of cells obtained by sequencing from the tumor and paratumor specimens, a merging ratio is selected (in this example, 100 cells within a subgroup are merged into one quasi-population cell). A quasi-population set is constructed by summing the data of 100 adjacent cells, with adjacency determined by Euclidean distance. The gene expression count matrix of the merged cells (columns are cells, rows are genes) is then summed in the Seurat software; that is, for each sequenced gene (each row), the counts across the 100 cells (100 columns) are added and the sum is used as the gene expression value of the quasi-population. Dimensionality reduction and clustering are performed on the quasi-population gene expression matrix, the data are normalized, and the quasi-population single-cell data are rescaled to the order of 10⁶.
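
The quasi-population construction can be sketched as below. This is a simplified illustration under stated assumptions: cells are grouped greedily (each seed cell takes its nearest unassigned neighbours by Euclidean distance in a reduced space), and the rescaling is interpreted as making every quasi-population column sum to 10⁶. The function name `build_quasi_populations` and the input layout are hypothetical; the example above performs the summation in Seurat.

```python
import math


def build_quasi_populations(coords, counts, k):
    """Greedily group each seed cell with its k-1 nearest unassigned cells.

    coords: per-cell coordinates in a reduced space (list of tuples).
    counts: gene-by-cell count matrix (rows are genes, columns are cells).
    Returns (groups, scaled), where scaled is a gene-by-group matrix whose
    columns each sum to 1e6. The last group may hold fewer than k cells.
    """
    unassigned = list(range(len(coords)))
    groups = []
    while unassigned:
        seed = unassigned[0]
        # Sort remaining cells by Euclidean distance to the seed cell.
        unassigned.sort(key=lambda c: math.dist(coords[seed], coords[c]))
        groups.append(unassigned[:k])
        unassigned = unassigned[k:]
    # Sum the counts of each gene (row) over the cells of each group.
    summed = [[sum(row[c] for c in g) for g in groups] for row in counts]
    totals = [sum(col) for col in zip(*summed)]
    scaled = [
        [v * 1e6 / t if t else 0.0 for v, t in zip(row, totals)]
        for row in summed
    ]
    return groups, scaled


# Toy data: 4 cells in two obvious neighbourhoods, 2 genes, groups of 2.
coords = [(0, 0), (0, 1), (10, 10), (10, 11)]
counts = [[1, 2, 3, 4],
          [1, 0, 1, 0]]
groups, scaled = build_quasi_populations(coords, counts, k=2)
```

With `k=100` this mirrors the merging ratio used in the example; the greedy seed order is an arbitrary choice and other neighbour-grouping schemes would serve equally well.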

Copy number variation analysis of the lung tumor sample data set is performed at the quasi-population level for both the transcriptome and the genome, using the inferCNV software together with the Copy-scAT software, to quantitatively characterize copy number deletion (del effect) and copy number amplification (dup effect) patterns across the different chromosomes.

First, assuming that both malignant and normal cells exist in the tumor and paratumor tissues, the copy number regions of the paratumor tissue are initially taken as the normal control, and the average copy number (average copy number variation level) of the tumor and paratumor tissue cells is calculated at the quasi-population level. The average copy number variation levels of the paratumor tissue and the tumor tissue are used as the "normal copy number variation expectation" and the "malignant copy number variation expectation", respectively. For transcriptome data, gene expression levels are scaled to a range of -1 to 1.

The quasi-population set constructed from the single-cell data of lung paratumor tissue and lung tumor tissue is divided into 50 subsets using a hierarchical clustering algorithm, and the subsets are defined as follows:

if the average copy number variation level of a subset is less than the "normal copy number variation expectation", the subset is defined as "normal";

if the average copy number variation level of a subset is greater than the "malignant copy number variation expectation", the subset is defined as "malignant";

if the average copy number variation level of a subset lies between the two, the subset is defined as "intermediate".

Subsets classified as "intermediate" enter the next round of hierarchical clustering, i.e., they are re-divided into 50 subsets and classified again. This is repeated until no further "normal" or "malignant" subsets appear or the maximum number of iterations is reached. The final copy number variation labels are then projected onto the single-cell clustering result, the malignant cells defined by the multi-omics data are merged, and it is examined whether the malignant cells form a separately aggregated subgroup or show a scattered distribution.
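
The iterative three-way labelling can be sketched as follows. Several simplifications are assumed and are not taken from the text: each quasi-population is summarised by a single scalar (its average copy number variation level), and the clustering step is single-linkage hierarchical clustering on that scalar, which for one-dimensional data is equivalent to cutting the sorted values at the largest gaps. The names `single_linkage_1d` and `classify` and the toy thresholds are hypothetical.

```python
def single_linkage_1d(values, n_clusters):
    """Single-linkage clustering of scalars into at most n_clusters groups.

    In one dimension this equals cutting the sorted order at the
    n_clusters - 1 largest gaps. Returns lists of indices into values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_cuts = max(min(n_clusters, len(order)) - 1, 0)
    cuts = sorted(
        range(1, len(order)),
        key=lambda j: values[order[j]] - values[order[j - 1]],
        reverse=True,
    )[:n_cuts]
    clusters, start = [], 0
    for cut in sorted(cuts):
        clusters.append(order[start:cut])
        start = cut
    clusters.append(order[start:])
    return [c for c in clusters if c]


def classify(levels, normal_exp, malignant_exp, n_subsets=50, max_iter=10):
    """Label each quasi-population 'normal', 'malignant' or 'intermediate'."""
    labels = {}
    pending = list(range(len(levels)))
    for _ in range(max_iter):
        if not pending:
            break
        resolved, next_pending = False, []
        pending_levels = [levels[i] for i in pending]
        for subset in single_linkage_1d(pending_levels, n_subsets):
            members = [pending[j] for j in subset]
            mean = sum(levels[i] for i in members) / len(members)
            if mean < normal_exp:
                label = "normal"
            elif mean > malignant_exp:
                label = "malignant"
            else:
                next_pending.extend(members)  # re-cluster in the next round
                continue
            resolved = True
            for i in members:
                labels[i] = label
        pending = next_pending
        if not resolved:  # no new normal or malignant subset appeared
            break
    for i in pending:
        labels[i] = "intermediate"
    return [labels[i] for i in range(len(levels))]


# Toy levels with hypothetical expectations 0.05 (normal) and 0.25 (malignant).
levels = [0.01, 0.02, 0.30, 0.32, 0.10, 0.11]
result = classify(levels, normal_exp=0.05, malignant_exp=0.25, n_subsets=3)
```

In a real analysis the clustering would more plausibly operate on the full per-region copy number profiles of the quasi-populations rather than on a scalar summary.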

The results are shown in FIG. 2: the predicted malignant cells are mostly derived from tumor tissue samples, and a proportion of cells in malignant transition is also found in the paratumor tissue. The region-level and whole-chromosome copy number variation patterns are shown in FIG. 3 and FIG. 4. In the malignant cell population identified in the lung tumor sample, chromosome 8, chromosome 16 and chromosome 17 show significant copy number amplification, while malignant cells in the tumor and paratumor tissues exhibit copy number deletions on chromosome 4, chromosome 5 and chromosome 11.

To integrate and correlate the copy number variation patterns of the different omics, the annotated chromosome regions are divided into bands, the average copy number variation score (ranging from -2 for deletion to 2 for amplification) is quantified for the bands that overlap between the omics, and the copy number "amplification" and "deletion" calls of the different omics are compared band by band. For the inferCNV transcriptome-level data and the Copy-scAT genomic chromatin accessibility-level data, a total of 42 overlapping chromosomal bands were obtained, and the copy number amplification or deletion of each band was marked. As shown in FIG. 5, the correlation between the copy number variation results of the multi-omics analysis reached 0.73 and was significant (p-value of 0.0018).
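
The band-level comparison can be sketched as below. The text does not state which correlation coefficient was used, so Pearson correlation is an assumption here, as are the band identifiers, the function names and the toy scores, which are illustrative values and not results of this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def direction(score):
    """Map a band score in [-2, 2] to a copy number call."""
    if score > 0:
        return "amplification"
    if score < 0:
        return "deletion"
    return "neutral"


def compare_bands(rna_scores, atac_scores):
    """Correlate band scores over the bands present in both modalities.

    Each argument maps a band identifier to its average CNV score.
    Returns (shared bands, correlation, per-band call pairs).
    """
    shared = sorted(set(rna_scores) & set(atac_scores))
    r = pearson([rna_scores[b] for b in shared],
                [atac_scores[b] for b in shared])
    calls = {b: (direction(rna_scores[b]), direction(atac_scores[b]))
             for b in shared}
    return shared, r, calls


# Hypothetical band scores for the two modalities.
rna = {"chr8_b1": 1.2, "chr16_b1": 0.8, "chr4_b1": -1.0,
       "chr5_b1": -0.6, "chrX_b1": 0.1}
atac = {"chr8_b1": 1.0, "chr16_b1": 0.5, "chr4_b1": -0.9, "chr5_b1": -0.2}
shared, r, calls = compare_bands(rna, atac)
```

Bands present in only one modality (here `chrX_b1`) are dropped before the correlation is computed, matching the restriction to overlapping bands.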

Example 4 Construction of the gene expression regulatory network of malignant cells

The SCRIP software is used to predict key transcription factors and their target genes in the lung tumor malignant cells identified by the multi-omics combination, and a regulatory network is constructed.

First, using the clusterProfiler tool with a significance p-value threshold of 0.1, low-quality target genes are filtered out and potential key pathways are enriched, yielding 21 commonly enriched pathways. Then 49 target genes highly correlated with these pathways are screened and designated key node target genes.

The interaction network of these target genes is drawn using the target gene interaction information in the STRING database, proteins outside the network are removed, and the key node target genes of normal and malignant cells are enriched using thresholds of average log fold change in expression (avg.logFC) > 0.25 and Benjamini-Hochberg (BH) adjusted p-value < 0.05, as shown in FIG. 6.
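
The final threshold filter can be sketched as follows, with the BH adjustment written out explicitly. The gene names, statistics and function names are hypothetical, and the sketch assumes the raw p-values and average log fold changes have already been computed by the upstream differential analysis.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(stats, min_logfc=0.25, max_padj=0.05):
    """Keep genes with avg.logFC > min_logfc and BH-adjusted p < max_padj.

    stats: dict mapping gene name to (avg_logfc, raw_p_value).
    """
    genes = list(stats)
    padj = bh_adjust([stats[g][1] for g in genes])
    return [g for g, q in zip(genes, padj)
            if stats[g][0] > min_logfc and q < max_padj]


# Hypothetical statistics for four placeholder genes.
stats = {
    "GeneA": (1.2, 0.001),
    "GeneB": (0.3, 0.04),
    "GeneC": (0.1, 0.0005),
    "GeneD": (0.9, 0.5),
}
selected = select_genes(stats)
```

In this toy run GeneB passes the fold-change threshold but its adjusted p-value (0.04 × 4 / 3 ≈ 0.053) exceeds 0.05, which shows why the adjustment matters when several genes are tested at once.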

With this framework, known key target genes highly related to lung tumors, such as Tp63, Foxc2 and Nkx2-1, can be enriched. The malignant cell population is characterized by epithelial-mesenchymal transition; the tumor sample examined here belongs to lung squamous cell carcinoma, and the result indicates a transformation process from epithelial lung adenocarcinoma to stromal-type lung squamous cell carcinoma. This demonstrates that the method can rapidly and accurately identify the key target genes that regulate malignant cells in tumors and their regulatory network.

All documents mentioned in this application are incorporated by reference as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the claims appended hereto.

Claims

1. A method for distinguishing malignant cells based on single-cell multi-omics sequencing, comprising the steps of:

wherein N = 20 to 100 and Y = 10 to 50;

S4, performing correlation analysis on the chromosome copy number variation patterns of the malignant cells identified from the different single-cell sequencing data in step S3, and merging the malignant cells using chromosome regions with the same copy number variation pattern; screening chromosomal regions whose copy number variation direction is amplification or deletion to draw a chromosomal variation pattern map of the malignant cells; and merging the iteratively grouped malignant cells at the level of average copy number per cell.

2. The method of claim 1, further comprising the step of performing cell subtype identification based on any single cell sequencing data:

3. The method for distinguishing malignant cells based on single-cell multi-omics sequencing according to claim 2, wherein the merging of the microbead pairs with sequence similarity higher than a preset threshold is specifically:

4. The method for distinguishing malignant cells based on single-cell multi-omics sequencing according to claim 1, wherein step S3 further comprises, before the analysis is performed, the step of performing a quasi-population treatment:

5. The method of claim 1, further comprising predicting key transcription factors and/or target genes in the identified malignant cells for molecular typing of the malignant cells.

6. A system for distinguishing malignant cells based on single-cell multi-omics sequencing, comprising the following modules:

S34, repeating step S33 until there are no more normal subsets or malignant subsets, or the maximum number of iterations Y is reached, where N = 20-100 and Y = 10-50.

7. The system for distinguishing malignant cells based on single-cell multi-omics sequencing of claim 6, further comprising:

8. The system for distinguishing malignant cells based on single-cell multi-omics sequencing of claim 6, further comprising:

a key target gene and regulatory network enrichment module, connected with the malignant cell distinguishing module and configured to predict the key transcription factors and/or target genes in the identified malignant cells and to perform molecular typing of the malignant cells.

9. A computer device, comprising:

a memory for storing a computer program;

a processor for performing the steps of a method for distinguishing malignant cells based on single-cell multi-omics sequencing according to any of claims 1-6 when executing the computer program.

10. A computer-readable storage medium comprising,

the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a method for distinguishing malignant cells based on single-cell multi-omics sequencing as claimed in any one of claims 1-6.