CN117476101A - Method, system, equipment and medium for distinguishing malignant cells by using multi-omics sequencing data
- Publication number
- CN117476101A (application number CN202311568169.2A)
- Authority
- CN
- China
- Prior art keywords
- cell
- sequencing
- cells
- malignant
- copy number
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/40—Population genetics; Linkage disequilibrium
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a method, a system, equipment and a medium for distinguishing malignant cells by using multi-omics sequencing data, belonging to the technical field of tumor single-cell sequencing. The method comprises performing high-throughput single-cell multi-omics sequencing using molecularly labelled microbeads, and then performing copy number variation analysis on the multi-omics sequencing data to distinguish malignant cells in the tumor and in the paratumor (tumor-adjacent) tissue. The invention can accurately resolve the genome sequence characteristics and gene expression patterns of malignant cells in tumors at the multi-omics level, and has considerable application value in the detection and auxiliary diagnosis of clinical tumor samples.
Description
Technical Field
The invention belongs to the technical field of tumor single-cell sequencing, and in particular relates to a method, a system, equipment and a medium for distinguishing malignant cells by using multi-omics sequencing data.
Background
Tumors are among the diseases with the highest morbidity and mortality worldwide. Tumorigenesis originates, to a certain extent, from initial malignant cells that acquire stemness after accumulating mutations. Through changes in the endogenous tumor microenvironment and induction by exogenous conditions, these malignant cells proliferate and differentiate into cell types with different phenotypes and morphologies, which shape the heterogeneity of the tumor. Tumor occurrence and development in various organs and tissues stem from this intratumoral heterogeneity. The evolution of different tumors also shares a common feature: the evolution of tumor clones carrying different mutations produces one or more clone types with a survival advantage, and these clones determine the molecular characteristics and microenvironment of the tumor. This process is dynamic and complex. Intratumoral heterogeneity is a key factor in the development of resistance to chemotherapy, targeted drug therapy and immunotherapy during clinical treatment, and in recurrence and death.
With the progress of high-throughput second-generation sequencing technology in recent years, deep genome sequencing studies of different tumor types have revealed that genome instability and various somatic mutations are closely related to the formation of heterogeneity and to the survival and evolution of tumors. Multidimensional analysis of tumors at single-cell resolution helps to clarify how intratumoral heterogeneity forms and how clones evolve, to explore the common and distinct mechanisms of tumor occurrence and development, and to address important clinical problems such as tumor recurrence and drug resistance. However, cell-level multidimensional analyses and comparative studies of the occurrence, development and internal heterogeneity of different tumor types are still relatively few, and independently developed, low-cost, relatively high-throughput technology platforms are lacking.
In the past, molecular characterization of tumor tissues usually relied on genome sequencing at the cell-population (bulk) level, gene expression analysis (transcriptome sequencing, gene expression chips or fluorescence-based quantitative analysis), and protein localization and expression at the tissue level. Limited by the resolution of these techniques, gene expression detection at the population level cannot reflect the heterogeneity of the cells within the tissue. Single-cell sequencing can detect gene expression or genomic variation at single-cell precision and so distinguish individual cells, which provides a new opportunity for analysing heterogeneity and evolutionary trajectories inside tumors. In tumor research, single-cell sequencing can help address a series of problems, such as primary tumor heterogeneity, the tumor microenvironment, and the relationship between primary and recurrent or metastatic tumor foci, across multi-omics dimensions including the genome, transcriptome, proteome, metabolome and epigenome.
The mechanisms of tumor cell generation and the heterogeneity of tumor cell evolution revealed by single-cell omics, together with the molecular characteristics of malignant cell mutations and of internally heterogeneous cells, can provide further clues for the diagnosis and prevention of tumors, and have considerable translational potential in mechanistic research and in diagnosis and prevention. Currently, single-cell omics studies of tumors focus on molecular typing of cells based on transcriptome gene expression. With the development of commercial single-cell technology platforms and high-throughput sequencers, single-cell transcriptome atlases of various tumor animal models and human clinical tumor samples have been published. These tumor cell atlases systematically characterize the heterogeneity of intratumoral cells and of immune microenvironment cells. Therefore, developing a rapid tumor cell identification method based on microwell single-cell multi-omics sequencing is of considerable clinical significance.
Disclosure of Invention
To solve the above technical problems, the invention provides the following technical solutions:
The first aspect of the invention provides a method for distinguishing malignant cells based on single-cell multi-omics sequencing, which comprises the following steps:
S1, obtaining a tumor sample and a paratumor sample, preparing a single-nucleus suspension from each, mixing the single-nucleus suspension with molecularly labelled microbeads, loading the mixture into a microwell chip, capturing and labelling the base sequences of the cell nuclei in situ in the microwells, and adding a cell identity tag and a molecular tag;
S2, constructing a sequencing library, and performing at least two of single-cell transcriptome sequencing, single-cell chromatin accessibility sequencing, single-cell genome sequencing and single-cell methylation sequencing to obtain different single-cell sequencing data;
S3, for each kind of single-cell sequencing data, performing the following analysis separately:
S31, obtaining the average copy number variation level in the tumor sample and in the paratumor sample, which serve as the malignant copy number variation expectation and the normal copy number variation expectation respectively,
S32, dividing the single-cell sequencing data of the tumor sample and the paratumor sample into N subsets, and judging each subset according to the following criteria:
if the average copy number variation level of the subset is less than the normal copy number variation expectation, the subset is a normal subset and its cells are normal cells; if the average copy number variation level of the subset is greater than the malignant copy number variation expectation, the subset is a malignant subset and its cells are malignant cells; if the average copy number variation level of the subset lies between the normal copy number variation expectation and the malignant copy number variation expectation, the subset is an intermediate-state subset,
S33, dividing each intermediate-state subset into N subsets again, and classifying these subsets according to the criteria in S32;
S34, repeating step S33 until no further normal or malignant subsets are produced, or until the maximum number of iterations Y is reached,
wherein N = 20 to 100 and Y = 10 to 50;
S4, performing correlation analysis on the chromosome copy number variation patterns of the malignant cells identified from the different single-cell sequencing data in step S3, and merging the malignant cells by means of chromosome regions that have the same copy number variation pattern.
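The iterative classification in S31 to S34 can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the patented implementation: each cell is assumed to already carry one scalar copy number variation score, the function name `classify_by_cnv` and its parameters are hypothetical, and, because the text does not say how the N subsets are formed, the sketch simply sorts the unresolved cells by score and cuts them into equal-sized chunks.

```python
import numpy as np

def classify_by_cnv(scores, is_tumor, n_subsets=20, max_iter=10):
    """Label each cell 'normal', 'malignant' or 'intermediate' (S31 to S34)."""
    scores = np.asarray(scores, dtype=float)
    is_tumor = np.asarray(is_tumor, dtype=bool)
    # S31: sample-level averages serve as the two expectations.
    malignant_exp = scores[is_tumor].mean()
    normal_exp = scores[~is_tumor].mean()
    labels = np.full(scores.shape, "intermediate", dtype=object)
    pending = np.arange(scores.size)  # indices of cells not yet resolved
    for _ in range(max_iter):  # S34: at most Y iterations
        if pending.size == 0:
            break
        # S32/S33: split the unresolved cells into up to N subsets.
        # Assumption: sorting by score stands in for the unspecified grouping.
        order = pending[np.argsort(scores[pending], kind="stable")]
        subsets = np.array_split(order, min(n_subsets, order.size))
        still_pending, resolved_any = [], False
        for subset in subsets:
            mean_level = scores[subset].mean()
            if mean_level < normal_exp:
                labels[subset] = "normal"
                resolved_any = True
            elif mean_level > malignant_exp:
                labels[subset] = "malignant"
                resolved_any = True
            else:
                still_pending.append(subset)
        pending = (np.concatenate(still_pending) if still_pending
                   else np.array([], dtype=int))
        # S34: stop once a pass yields no new normal or malignant subset.
        if not resolved_any:
            break
    return labels
```

Cells that are still unresolved when the loop ends keep the label `intermediate`. The defaults `n_subsets=20` and `max_iter=10` are the lower bounds of the ranges given for N and Y.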
In some embodiments of the invention, in step S1, the molecularly labelled microbeads and the cell nuclei are mixed at a ratio of 1:1 and then loaded onto the microwell chip, so that the nuclei carry cell identity tags, which makes it easier to quickly identify nuclei from different cell types in subsequent analysis. Preferably, for transcriptome sequencing and chromatin accessibility sequencing, reverse transcription or genome fragmentation is performed at the same time as the nuclei are labelled with cell identity tags.
Further, step S1 also includes: resuspending and fixing the nuclear suspension with any one of an aldehyde fixative (such as paraformaldehyde), an alcohol fixative (such as ethanol), an acid fixative and a cross-linking agent, so that the nucleic acids and proteins in the nuclei are cross-linked and fixed and the nucleic acid molecules can enter the cells or nuclei and react more effectively. Preferably, for genome sequencing, the nuclei are not fixed with any organic solvent, so that the transposase can enter the nuclei and react more effectively.
In some embodiments of the invention, in step S1, the nucleic acid molecules labelled in situ in the nuclei are mRNA and DNA. The poly-T tail carried by the nucleic acid molecules of known base sequence on the microbead surface can hybridize to mRNA in the treated nuclei; the random or fixed sequences carried by the nucleic acid molecules of known base sequence on the microbead surface can hybridize to DNA in the treated nuclei.
In some embodiments of the invention, step S3 further comprises, before the analysis, a pseudo-bulk processing step:
single-cell sequencing data from the same sample are summed over a set number of cells; the pseudo-bulk set is constructed by summing the single-cell sequencing data of neighbouring cells, with neighbours determined by Euclidean distance; the data are then normalized.
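A minimal sketch of this pooling step is given below, under stated assumptions: the input is a cells-by-features count matrix from a single sample, each cell is pooled with its `k` nearest cells by Euclidean distance, and normalization is total-count scaling followed by `log1p`. The function name `pseudo_bulk`, the parameters `k` and `scale`, and the choice of normalization are hypothetical; the text only states that neighbouring cells are summed and the data are normalized.

```python
import numpy as np

def pseudo_bulk(counts, k=5, scale=1e4):
    """Pool each cell with its k nearest cells, then normalize.

    counts: cells x features matrix taken from ONE sample.
    """
    counts = np.asarray(counts, dtype=float)
    # Squared pairwise Euclidean distances between cells.
    sq = (counts ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * counts @ counts.T
    # Indices of the k closest cells per row (normally includes the cell itself).
    k = min(k, counts.shape[0])
    neighbours = np.argsort(d2, axis=1, kind="stable")[:, :k]
    # Sum the counts of the neighbouring cells.
    pooled = counts[neighbours].sum(axis=1)
    # Assumed normalization: scale to a fixed total, then log-transform.
    totals = pooled.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty profiles
    return np.log1p(pooled / totals * scale)
```

Pooling neighbours reduces the sparsity of per-cell profiles before copy number levels are estimated, at the cost of blurring differences between very similar cells.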
In some embodiments of the invention, in step S4, the specific steps of merging malignant cells by means of chromosome regions with the same copy number variation pattern are: screening the chromosome regions whose copy number variation direction is amplification or deletion, and drawing a chromosome variation pattern map of the malignant cells, on the basis of which the malignant cells are merged.
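The screening in step S4 can be sketched as follows. This is only an illustration under assumptions: each modality is assumed to provide one mean copy number profile of its malignant cells per chromosome region, expressed relative to a reference and centred on zero; the name `shared_cnv_regions` and the `threshold` used to call amplification or deletion are hypothetical.

```python
import numpy as np

def shared_cnv_regions(profile_a, profile_b, threshold=0.1):
    """Compare malignant-cell CNV profiles from two sequencing modalities.

    Returns the Pearson correlation, the indices of regions where both
    profiles agree on amplification or deletion, and the direction
    (+1 amplification, -1 deletion) of each such region.
    """
    a = np.asarray(profile_a, dtype=float)
    b = np.asarray(profile_b, dtype=float)
    # Correlation analysis of the two chromosome-level patterns.
    corr = float(np.corrcoef(a, b)[0, 1])

    def direction(x):
        # +1 above the threshold, -1 below its negative, 0 otherwise.
        return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0))

    dir_a, dir_b = direction(a), direction(b)
    # Keep regions that are non-neutral and point the same way in both.
    shared = np.flatnonzero((dir_a == dir_b) & (dir_a != 0))
    return corr, shared.tolist(), dir_a[shared].tolist()
```

Malignant cells from the two modalities could then be merged on the regions returned here, for example by grouping cells that carry the same direction in every shared region.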
In some embodiments of the invention, the method further comprises a step of identifying cell subtypes based on any one of the single-cell sequencing data:
grouping all the microbeads pairwise according to the cell identity tags in the sequencing data to form microbead pairs;
traversing every microbead pair and calculating the similarity of the sequences captured by the two microbeads, then ranking the microbead pairs by similarity;
then merging the microbead pairs whose sequence similarity is higher than a preset threshold, according to the number of microbeads actually contained in each microwell;
finally, merging the gene matrices of the cells derived from the tumor sample and from the paratumor sample respectively, performing dimension reduction, feature selection, differential analysis and cell subpopulation clustering on the merged single-cell omics matrices, and annotating the cell subpopulations based on a public database.
In this process, similarity scores are calculated from the similarity of the distribution of random sequences in the cell identity tags carried and captured by the microbeads in the data, which determines whether several microbeads lie in the same microwell. The genetic sequence information of all microbeads in the same microwell is merged; for several cell nuclei in the same microwell, the merged genetic sequence information is assigned back to individual nuclei by means of the cell identity tags, so that multi-omics data at single-cell resolution can be obtained.
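The pairing and merging of microbeads can be sketched as below. The sketch makes several assumptions that are not in the text: each microbead is represented by the set of tag sequences it captured, similarity is measured with the Jaccard index, and beads linked by any pair above the threshold are treated as sharing a microwell (a union-find grouping). The function name `group_beads` and the default threshold are hypothetical.

```python
from itertools import combinations

def group_beads(bead_tags, threshold=0.5):
    """Group microbeads that appear to share a microwell.

    bead_tags: dict mapping bead id -> set of captured tag sequences.
    Returns a list of sets of bead ids, one set per inferred microwell.
    """
    parent = {bead: bead for bead in bead_tags}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Traverse every bead pair and score it (Jaccard index, an assumption).
    pairs = []
    for a, b in combinations(sorted(bead_tags), 2):
        union = bead_tags[a] | bead_tags[b]
        sim = len(bead_tags[a] & bead_tags[b]) / len(union) if union else 0.0
        pairs.append((sim, a, b))
    # Rank pairs by similarity, highest first.
    pairs.sort(reverse=True)
    for sim, a, b in pairs:
        if sim <= threshold:
            break  # remaining pairs are at or below the preset threshold
        parent[find(a)] = find(b)  # merge the two beads into one well

    wells = {}
    for bead in bead_tags:
        wells.setdefault(find(bead), set()).add(bead)
    return sorted(wells.values(), key=sorted)
```

For example, `group_beads({"b1": {"A", "B", "C"}, "b2": {"A", "B", "C", "D"}, "b3": {"X", "Y"}})` places `b1` and `b2` in one microwell and leaves `b3` on its own.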
For transcriptome sequencing, the primer sequence attached to the microbead comprises four parts: a library linker sequence, a cell tag sequence, a molecular tag sequence and a nucleic acid capture sequence. The library linker sequence is used for subsequent sequencing on the instrument; the cell tag sequence is used to identify different cells; the molecular tag sequence consists of random bases, and each DNA molecule carries a unique molecular tag sequence that identifies it among the other DNA molecules during pooled sequencing; the nucleic acid capture sequence contains a poly-T tail or a random primer sequence for capturing RNA molecules.
For genome sequencing, library construction uses transposase to fragment the open regions of chromatin; the primer sequence attached to the microbead comprises four parts: a library linker sequence, a cell tag sequence, a molecular tag sequence and a nucleic acid capture sequence. The library linker sequence is used for subsequent sequencing on the instrument; the cell tag sequence is used to identify different cells; the molecular tag sequence consists of random bases, and each DNA molecule carries a unique molecular tag sequence that identifies it among the other DNA molecules during pooled sequencing; the nucleic acid capture sequence contains a hybridization sequence that matches the transposase linker sequence, for capturing the DNA molecules fragmented by the transposase.
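As an illustration of the four-part primer layout, the sketch below splits a primer sequence into its parts. The segment lengths are not given in the text, so they are passed in as parameters; the order of the parts along the sequence is assumed to follow the order in which they are listed, and the names `BeadPrimer` and `split_primer` are hypothetical.

```python
from typing import NamedTuple

class BeadPrimer(NamedTuple):
    linker: str         # library linker, used for sequencing
    cell_tag: str       # identifies the cell
    molecular_tag: str  # random bases, identifies the molecule
    capture: str        # poly-T, random primer or transposase-matching sequence

def split_primer(seq, linker_len, cell_len, tag_len):
    """Split a primer into its four parts; all lengths are caller-supplied."""
    a = linker_len
    b = a + cell_len
    c = b + tag_len
    if len(seq) < c:
        raise ValueError("sequence is shorter than the three fixed-length parts")
    return BeadPrimer(seq[:a], seq[a:b], seq[b:c], seq[c:])
```

With made-up segments, `split_primer("GGGG" + "ACGTAC" + "CAT" + "TTTTTT", 4, 6, 3)` returns the four parts separately, with `capture` holding the poly-T stretch.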
In some embodiments of the invention, merging the microbead pairs whose sequence similarity is higher than the preset threshold specifically covers the following cases:
(1) The microwell contains one cell and one microbead: the cell identity tag and the genetic sequence information captured by the microbead are used directly as the genetic information matrix of that single cell;
(2) The microwell contains several cells and one microbead: the genetic sequence information captured by the microbead is assigned to the cells according to the cell identity tags of the paired cells, and serves as the genetic information matrix of each cell;
(3) The microwell contains one cell and several microbeads: the genetic sequences captured by the microbeads are accumulated and assigned to the cell, and serve as the genetic information matrix of that single cell;
(4) The microwell contains several cells and several microbeads: the genetic sequences captured by the microbeads are accumulated and then assigned to the cells according to the cell identity tags of the paired cells, and serve as the genetic information matrix of each cell.
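The four cases above can be handled by one accumulation rule, sketched here under assumptions: every captured read is represented as a `(bead_id, cell_tag, gene)` record, and a mapping from bead to microwell is already available. The names `build_cell_matrices`, `records` and `bead_to_well` are hypothetical.

```python
from collections import Counter, defaultdict

def build_cell_matrices(records, bead_to_well):
    """Accumulate per-cell gene counts from bead-level capture records.

    records: iterable of (bead_id, cell_tag, gene) tuples.
    bead_to_well: dict mapping bead id -> microwell id.
    Returns {(well, cell_tag): {gene: count}}.
    """
    matrices = defaultdict(Counter)
    for bead_id, cell_tag, gene in records:
        well = bead_to_well[bead_id]
        # Reads from every bead in a well are accumulated (cases 3 and 4)
        # and split by cell identity tag (cases 2 and 4); case 1 is the
        # trivial situation with one bead and one tag.
        matrices[(well, cell_tag)][gene] += 1
    return {key: dict(counts) for key, counts in matrices.items()}
```

Keying the result on the pair of microwell and cell identity tag means that beads sharing a well contribute to the same cell, while several cells in one well remain separate.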
In some embodiments of the invention, the method further comprises predicting key transcription factors and/or their target genes in the identified malignant cells, and performing molecular typing of the malignant cells.
The invention can cross-validate and further assist the identification of malignant cells in suspected cancer (malignant tumor) samples, and can quickly determine, across multi-omics dimensions, the cell lineage origin, proportion and copy number variation patterns of the malignant cells, as well as molecular typing indicators such as target genes, thereby supporting auxiliary diagnosis.
In a second aspect, the invention provides a system for distinguishing malignant cells based on single-cell multi-omics sequencing, comprising the following modules:
a data input module, used for receiving different single-cell sequencing data of a tumor sample and a paratumor sample obtained from at least two of single-cell transcriptome sequencing, single-cell chromatin accessibility sequencing, single-cell genome sequencing and single-cell methylation sequencing;
a malignant cell distinguishing module, connected to the data input module and used for performing the following analysis on each single-cell sequencing dataset separately:
S31, obtaining the average copy number variation levels in the tumor sample and the paratumor sample, as the malignant copy number variation expectation and the normal copy number variation expectation, respectively;
S32, dividing the single-cell sequencing data of the tumor sample and the paratumor sample into N subsets, and judging each subset according to the following criteria:
if the average copy number variation level of the subset is less than the normal copy number variation expectation, the subset is a normal subset and its cells are normal cells; if the average copy number variation level of the subset is greater than the malignant copy number variation expectation, the subset is a malignant subset and its cells are malignant cells; if the average copy number variation level of the subset lies between the normal copy number variation expectation and the malignant copy number variation expectation, the subset is an intermediate subset;
S33, re-dividing the intermediate subsets into N subsets and classifying them according to the criteria in S32;
S34, repeating step S33 until no more normal or malignant subsets appear, or the maximum number of iterations Y is reached, wherein N = 20 to 100 and Y = 10 to 50;
S4, performing correlation analysis on the chromosome copy number variation patterns of the malignant cells identified from the different single-cell sequencing data in step S3, and merging the malignant cells using the chromosome regions with the same copy number variation pattern.
Further, the system further comprises:
a cell subtype identification module, connected to the data input module and the malignant cell distinguishing module respectively, and used for performing cell subtype identification according to the following steps:
grouping all microbeads pairwise according to the cell identity tags in the sequencing data to form microbead pairs;
traversing each microbead pair to calculate the similarity of the microbead capture sequences, and ranking the microbead pairs by similarity;
then merging the microbead pairs whose sequence similarity is above a preset threshold, according to the number of microwells actually contained in the microwell chip;
finally, merging the gene matrices of cells derived from the tumor sample and the paratumor sample, performing dimension reduction, feature selection, differential analysis and cell subgroup clustering on the merged single-cell omics matrices, and annotating the cell subgroups based on a public database;
and for determining the variation pattern of malignant cells in the different cell subgroups based on the malignant cells identified by the malignant cell distinguishing module.
Still further, the system also comprises a key target gene and regulatory network enrichment module, connected to the malignant cell distinguishing module and used for predicting key transcription factors and/or their target genes in the identified malignant cells and for performing molecular typing of the malignant cells.
A third aspect of the present invention provides a computer apparatus comprising:
a memory for storing a computer program;
a processor for implementing, when executing the computer program, the steps of the method for distinguishing malignant cells based on single-cell multi-omics sequencing according to any embodiment of the first aspect of the invention.
A fourth aspect of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for distinguishing malignant cells based on single-cell multi-omics sequencing according to any embodiment of the first aspect of the invention.
Beneficial effects of the invention
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a method, device and medium for rapidly identifying tumor cells based on microwell single-cell multi-omics sequencing. Based on a microwell-microbead system, multi-omics genetic information of tumor samples is detected at the single-cell level in a high-throughput manner. Based on the multi-omics information, malignant cells in the tumor are identified rapidly and accurately, and the characteristic regulatory patterns of their target genes are enriched, providing a reference for the typing and auxiliary diagnosis of clinical tumors.
According to the invention, correlation analysis is performed on the chromosome copy number variation patterns obtained from the different single-cell sequencing data, chromosome regions whose copy number variation direction is amplification or deletion are screened, and a chromosome variation pattern map of the malignant cells is drawn. The malignant cells identified by iterative grouping on the average copy number level of each cell are merged. The proportion of the core malignant cell subset and its distribution within the individual cell lineages are determined. The genome variation pattern is further determined, the key target genes and regulatory network of the malignant cells are enriched, and molecular typing of the malignant cells is performed. The invention can integrate the multi-omics data of malignant tumor cells to construct a regulatory network. Regulatory network construction in the prior art is essentially based on single-cell transcriptome gene expression data alone; integrating omics data such as the single-cell transcriptome and single-cell chromatin accessibility of a tumor to construct a regulatory network at single-cell resolution is first proposed by the present invention.
Drawings
FIG. 1 shows the cell subtypes defined by genomic chromatin accessibility, and the sample source of each subtype, for mouse lung tumor samples, paratumor samples, and normal reference control lung samples. Adj denotes paratumor samples, Normal denotes contralateral normal lung samples, and Tumor denotes tumor samples.
FIG. 2 shows the projected distribution of malignant cells (malignant) and non-malignant normal cells (nonmalignant), as defined by the degree of copy number variation, over the cell subtypes defined by genomic chromatin accessibility for mouse lung tumor samples, paratumor samples, and contralateral normal lung samples.
FIG. 3 shows the copy number variation at the genomic chromatin accessibility level identified by Copy-scAT, grouped by malignancy. NMF_cluster denotes the clusters obtained by non-negative matrix factorization (NMF), an unsupervised clustering method; by this method, cluster 3 is a malignant cell subset, which matches the distribution defined by copy number variation.
FIG. 4 shows the chromosome-wide copy number variation patterns of malignant cells (malignant) versus non-malignant cells (nonmalignant) predicted at the transcriptome level by inferCNV.
FIG. 5 shows the correlation between the average copy number variation score at the transcriptome level identified by inferCNV and the average copy number variation score at the genomic chromatin accessibility level identified by Copy-scAT, together with the chromosome bands on which copy number deletion (del effect) and copy number amplification (dup effect) are consistent between the two.
FIG. 6 shows the enrichment results for the key target genes of mouse lung tumor malignant cells and their regulatory network. Genes shown in pink are the selected key target genes; node color intensity represents network centrality, and node size represents the number of interacting genes.
Detailed Description
Unless otherwise indicated, implied by the context, or customary in the art, all parts and percentages in the present application are based on weight, and the test and characterization methods used are current as of the filing date of the present application. Where applicable, the disclosure of any patent, patent application, or publication referred to in this application is incorporated by reference in its entirety, and the equivalent patents to those cited are also incorporated by reference, particularly as they relate to the definitions of terms in the art. If the definition of a particular term disclosed in the prior art does not conform to any definition provided in this application, the definition of that term provided in this application controls.
Numerical ranges in this application are approximations, so that it may include the numerical values outside of the range unless otherwise indicated. The numerical range includes all values from the lower value to the upper value that increase by 1 unit, provided that there is a spacing of at least 2 units between any lower value and any higher value. For ranges containing values less than 1 or containing fractions greater than 1 (e.g., 1.1,1.5, etc.), then 1 unit is suitably considered to be 0.0001,0.001,0.01, or 0.1. For a range containing units of less than 10 (e.g., 1 to 5), 1 unit is generally considered to be 0.1. These are merely specific examples of what is intended to be provided, and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application.
The terms "comprises," "comprising," "including," and their derivatives do not exclude the presence of any other component, step or procedure, regardless of whether such other component, step or procedure is disclosed in the present application. For the avoidance of doubt, all uses of the terms "comprising," "including," or "having" herein may, unless expressly stated otherwise, include any additional additive, adjuvant, or compound. In contrast, the term "consisting essentially of ..." excludes any other component, step or procedure from the scope of what is subsequently recited, except those necessary for operability. The term "consisting of ..." excludes any component, step or procedure not specifically described or listed. The term "or" refers to the listed members individually or in any combination, unless explicitly stated otherwise.
In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the embodiments.
The following examples are presented herein to demonstrate preferred embodiments of the present invention. It will be appreciated by those skilled in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the claims.
The experimental methods in the following examples are conventional methods unless otherwise specified. The instruments used in the following examples are laboratory conventional instruments unless otherwise specified; the test materials used in the examples described below, unless otherwise specified, were purchased from conventional biochemical reagent stores.
Example 1 Microwell-based single-cell multi-omics library preparation and sequencing of lung tumor samples from aged mice
1. Sample preparation
Tumor tissue and paratumor control tissue were isolated from aged C57BL/6 mice identified as having lung tumors. The two tissues were snap-frozen, ground into powder in liquid nitrogen, and then added to nuclear lysis buffer and lysed on ice. A single-nucleus suspension was obtained after centrifugation and washing.
2. Transcriptomic library construction
Nuclei were fixed with 4% paraformaldehyde (PFA). The single-nucleus suspensions of tumor and paratumor tissue from the different samples were added to separate centrifuge tubes, and reverse transcription primers carrying different cell identity tag sequences, reverse transcriptase, reverse transcription reaction buffer, dNTPs, RNase inhibitor, 50% PEG 8000 and 10% Triton X-100 were added to each tube. After mixing, the tubes were placed in a PCR instrument for an isothermal reverse transcription reaction. After the reverse transcription reaction was complete, the nuclei were washed with 3× SSC and with PBS in preparation for chip loading.
For chip loading, the nuclei and the molecular tag microbeads were mixed in equal proportion, and an amplification system containing isothermal polymerase and high-fidelity polymerase was added. According to the actual sample amount, the reverse-transcribed nuclei of the different tumor samples were loaded quickly and evenly into the microwell chip with a pipette, and the loading of microbeads and nuclei into the microwells was checked under a microscope to ensure that the well-loading rate of nuclei and microbeads in the microwell chip exceeded 70%. Sealing oil was then added to seal the microwell chip so that each well formed an independent reaction space, and the chip was placed in a PCR thermal cycler for amplification.
After the reaction, the liquid and the molecular tag microbeads in the chip were fully collected by repeated centrifugation, and the reaction liquid was transferred to a new centrifuge tube. DNA purification magnetic beads were added for purification to obtain the amplified cDNA solution. This solution was added to an amplification system containing sequencing adapters (P5 and P7) carrying a sequencing tag (index) and high-fidelity polymerase, and the library was amplified to obtain an indexed sequencing library. The library was purified with DNA Clean Beads, its concentration was determined with a Qubit 3.0 fluorometer, and it was stored at -20 °C. An appropriate amount of the sequencing library was then sequenced according to the loading requirements of the sequencer.
3. Epigenomics: chromatin accessibility library construction
The single-nucleus suspensions of tumor and paratumor tissue from the different samples were added to separate centrifuge tubes. To each tube was added a digestion system consisting of transposase carrying different cell identity tag sequences, 2× digestion reaction buffer, 1% digitonin, 10% Tween-20 and 1× PBS. After thorough mixing, the digestion reaction was carried out for half an hour at a constant temperature of 37-55 °C. The reaction was stopped on ice, the nuclei were collected by centrifugation, washed twice with PBS wash buffer, and prepared for chip loading.
For chip loading, the nuclei and the molecular tag microbeads were mixed in equal proportion, and 50 mM EDTA and 2× high-fidelity polymerase were added and mixed. According to the actual sample amount, the transposase-treated nuclei of the different tumor samples were loaded quickly and evenly into the microwell chip with a pipette, and the loading of microbeads and nuclei into the microwells was checked under a microscope to ensure that the well-loading rate of nuclei and microbeads in the microwell chip exceeded 70%. The tube cap was fastened and the centrifuge tube was kept at a constant temperature of 50 °C for half an hour to release the genome fragments. An amplification system containing isothermal polymerase and high-fidelity polymerase was then added to the microwell chip, sealing oil was added to seal the chip so that each well formed an independent reaction space, and the chip was placed in a PCR thermal cycler for amplification.
After the reaction, the liquid and the molecular tag microbeads in the chip were fully collected by repeated centrifugation, and the reaction liquid was transferred to a new centrifuge tube. DNA purification magnetic beads were added for purification to obtain the amplified DNA solution. This solution was added to an amplification system containing sequencing adapters P5 and P7 carrying a sequencing tag (index) and high-fidelity polymerase, and the library was amplified to obtain an indexed sequencing library. The library was purified with DNA Clean Beads, its concentration was determined with a Qubit 3.0 fluorometer, and it was stored at -20 °C. An appropriate amount of the sequencing library was then sequenced according to the loading requirements of the sequencer.
Example 2 Cell subtype identification of aged mouse lung tumor samples based on single-cell multi-omics
The raw FASTQ data from the high-throughput sequencing in Example 1 were extracted and filtered according to the cell identity tag sequences introduced by reverse transcription/transposase cleavage and the molecular tag sequences carried on the microbeads, and the sequencing data were aligned to the mouse reference genome to obtain a single-cell transcriptome matrix and a chromatin accessibility matrix.
First, all microbeads were grouped pairwise, according to the cell identity tags extracted by position from the sequencing data, to form microbead pairs.
Then, each microbead pair was traversed and the similarity of the microbead capture sequences (Jaccard similarity score) was calculated, and the microbead pairs were ranked by similarity. In transcriptome sequencing, the capture sequences included in the calculation are the random primer sequences; in chromatin accessibility sequencing, they are the captured genetic information.
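The pairwise scoring and ranking step can be illustrated with a minimal sketch. It assumes that each microbead is represented by the set of sequences it captured; the function names `jaccard` and `rank_bead_pairs` and the bead identifiers are hypothetical and are not taken from the original disclosure.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_bead_pairs(bead_sequences):
    """Score every microbead pair and sort from most to least similar.

    bead_sequences: dict mapping a bead identifier to the set of sequences
    captured by that bead (an assumed, simplified data layout).
    Returns a list of (score, bead_x, bead_y) tuples.
    """
    scored = [
        (jaccard(bead_sequences[x], bead_sequences[y]), x, y)
        for x, y in combinations(sorted(bead_sequences), 2)
    ]
    # Highest score first; ties are broken by bead identifier for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored
```

Traversing all pairs is quadratic in the number of microbeads, so a practical implementation would probably restrict the comparison to beads that share at least one captured sequence; that optimization is omitted here.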
Then, according to the number of wells actually contained in the microwell chip (10,000), the microbead pairs with high sequence similarity were merged (the Jaccard similarity score of the 10,000th-ranked microbead pair was used as the threshold for high sequence similarity), the merged microbeads were judged to be in the same microwell, and the following cases were handled:
(1) Where the calculation indicates one cell and one microbead in a microwell, the cell identity tag and the genetic sequence information captured by the microbead are used directly as the genetic information matrix of that single cell.
(2) Where the calculation indicates a plurality of cells and one microbead in a microwell, the genetic sequence information captured by the microbead is assigned to the cells according to the cell identity tags that the paired cells received in the reverse transcription/enzyme cleavage step, as the genetic information matrices of those cells.
(3) Where the calculation indicates one cell and a plurality of microbeads in a microwell, the genetic sequences captured by the microbeads are accumulated and assigned to the cell, as the genetic information matrix of that single cell.
(4) Where the calculation indicates a plurality of cells and a plurality of microbeads in a microwell, the genetic sequences captured by the microbeads are first accumulated and then assigned to the cells according to the cell identity tags that the paired cells received in the reverse transcription/enzyme cleavage step, as the genetic information matrices of those cells.
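The four cases above can be handled by one rule: accumulate the reads of all microbeads judged to share a microwell, then split them by cell identity tag. The following is a minimal sketch under assumptions that are not part of the original disclosure: reads are simplified to (bead, cell tag, feature) triples, merged microbead pairs are grouped with a union-find structure, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def group_beads_into_wells(beads, merged_pairs):
    """Beads linked by a high-similarity pair are assigned to the same well.

    Returns a dict mapping each bead to a representative bead of its well.
    """
    parent = {b: b for b in beads}

    def find(b):
        # Follow parent links to the representative, compressing the path.
        while parent[b] != b:
            parent[b] = parent[parent[b]]
            b = parent[b]
        return b

    for x, y in merged_pairs:
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[max(rx, ry)] = min(rx, ry)
    return {b: find(b) for b in beads}


def build_cell_matrix(reads, bead_to_well):
    """Accumulate reads per well, then assign them to cells by cell identity tag.

    reads: iterable of (bead, cell_tag, feature) triples.
    Returns {(well, cell_tag): {feature: count}}.
    """
    matrix = defaultdict(Counter)
    for bead, cell_tag, feature in reads:
        matrix[(bead_to_well[bead], cell_tag)][feature] += 1
    return {cell: dict(counts) for cell, counts in matrix.items()}
```

With one cell and one microbead the rule reduces to case (1); several cell tags on one bead give case (2); several beads with one cell tag give case (3); and several of each give case (4).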
Finally, the tumor cells and paratumor tissue cells were merged, and downstream gene expression matrices were generated and processed with the Seurat software for the transcriptome data and the ArchR software for the chromatin accessibility data. The top 2,000 feature genes were selected, PCA and dimension reduction were performed on the single-cell gene expression matrix, and the characteristic single-cell subgroups were projected onto a two-dimensional plane. For each subgroup, the most characteristic gene set was obtained using differential enrichment tools such as FindMarkers. The cell subgroups were classified and defined with reference to existing gene annotation databases such as PanglaoDB, assigning them to specific cell types of different lineages and identifying the different epithelial, stromal and immune cell types within the tumor tissue. Subtype classification was performed based on genomic chromatin accessibility, as shown in FIG. 1, in which the sample source of each cell is also labeled.
Example 3 Identification of malignant cells in a mouse lung tumor sample
Cell types in the tumor and paratumor samples were identified using single-cell genomic chromatin accessibility and transcriptome data. Based on the number of cells obtained by sequencing from the tumor and paratumor samples, a merging ratio was selected (in this example, 100 cells within a subgroup were merged into one quasi-population set cell). A quasi-population set was constructed by pooling the data of 100 neighboring cells according to Euclidean distance, and the gene expression count matrix of the merged cells (columns are cells, rows are genes) was then summed in the Seurat software; that is, for each sequenced gene (each row), the sum of its counts across the 100 cells (100 columns) was calculated and used as the gene expression matrix of the quasi-population set. The quasi-population set gene expression matrix was subjected to dimension reduction and clustering and to data normalization, in which the quasi-population set single-cell data were re-normalized to the order of 10^6.
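The pooling and re-normalization described above can be sketched as follows. The greedy seed-and-nearest-neighbor grouping is an assumption made for illustration (the text specifies only that neighboring cells are pooled by Euclidean distance), the scaling is assumed to be a simple rescaling of each pooled profile to a total of 10^6, and the function names are hypothetical.

```python
import math


def build_quasi_population_sets(embedding, counts, group_size):
    """Greedily pool each unassigned cell with its nearest unassigned neighbors.

    embedding: dict cell -> coordinate tuple (e.g. reduced-dimension space).
    counts: dict cell -> {gene: count}.
    Returns a list of (member_cells, summed_counts) tuples.
    """
    unassigned = sorted(embedding)
    groups = []
    while unassigned:
        seed = unassigned[0]
        # Sort remaining cells by Euclidean distance to the seed cell.
        nearest = sorted(
            unassigned,
            key=lambda c: (math.dist(embedding[seed], embedding[c]), c),
        )
        members = nearest[:group_size]
        summed = {}
        for cell in members:
            for gene, n in counts[cell].items():
                summed[gene] = summed.get(gene, 0) + n  # per-gene sum over cells
        groups.append((members, summed))
        unassigned = [c for c in unassigned if c not in members]
    return groups


def scale_to_million(summed):
    """Rescale a pooled count profile so that its counts total 10^6."""
    total = sum(summed.values())
    if total == 0:
        return dict(summed)
    return {gene: n * 1e6 / total for gene, n in summed.items()}
```

In Example 3 the group size would be 100; a smaller value is convenient for checking the behavior on toy data.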
Copy number variation analysis of the lung tumor sample data sets was performed at the transcriptome and genome levels (at the quasi-population set level) using the inferCNV software and the Copy-scAT software in combination, to quantitatively characterize copy number deletion (del effect) and copy number amplification (dup effect) patterns across different chromosomes.
First, assuming that both malignant cells and normal cells are present in the tumor and paratumor tissue, the copy number profile of the paratumor tissue was initially taken as the normal control, and the average copy number (average copy number variation level) of the tumor and paratumor tissue cells was calculated at the quasi-population set level. The average copy number variation levels of the paratumor tissue and the tumor tissue were used as the "normal copy number variation expectation" and the "malignant copy number variation expectation", respectively. For transcriptome data, gene expression levels were scaled to the range -1 to 1.
The quasi-population sets constructed from the single-cell data of the lung paratumor tissue and the lung tumor tissue were divided into 50 subsets using a hierarchical clustering algorithm, with the following definitions:
if the average copy number variation level of a subset is less than the "normal copy number variation expectation", the subset is defined as "normal";
if the average copy number variation level of a subset is greater than the "malignant copy number variation expectation", the subset is defined as "malignant";
if the average copy number variation level of a subset lies between the two, the subset is defined as "intermediate".
Subsets classified as "intermediate" enter the next round of hierarchical clustering, i.e., they are re-divided into 50 subsets and classified again. This is repeated until no more "normal" or "malignant" subsets appear or the maximum number of iterations is reached. The final copy number variation labels are then projected onto the single-cell clustering result, the malignant cells defined by the multiple omics are merged, and it is determined whether the malignant cells form a separately aggregated subset or follow a scattered distribution pattern.
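The iterative classification loop can be sketched as below. To keep the sketch self-contained, hierarchical clustering is replaced by a plain stand-in that sorts the quasi-population sets by copy number variation score and cuts them into contiguous chunks; this substitution, the function names and the default parameter values (50 subsets, 10 iterations) are assumptions for illustration, not the disclosed implementation.

```python
def split_into_subsets(items, n):
    """Stand-in for hierarchical clustering: sort by score, cut into <= n chunks."""
    ordered = sorted(items, key=lambda kv: kv[1])
    n = min(n, len(ordered))
    size, extra = divmod(len(ordered), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(ordered[start:end])
        start = end
    return chunks


def classify_by_cnv(scores, normal_expect, malignant_expect,
                    n_subsets=50, max_iter=10):
    """Label each item 'normal', 'malignant' or 'intermediate'.

    scores: dict mapping an item name to its average CNV level.
    Intermediate subsets are re-split each round until a round resolves
    nothing or max_iter rounds have run.
    """
    labels = {}
    pending = list(scores.items())
    for _ in range(max_iter):
        if not pending:
            break
        next_pending, resolved = [], False
        for subset in split_into_subsets(pending, n_subsets):
            mean = sum(score for _, score in subset) / len(subset)
            if mean < normal_expect:
                label = "normal"
            elif mean > malignant_expect:
                label = "malignant"
            else:
                next_pending.extend(subset)  # stays intermediate this round
                continue
            resolved = True
            for name, _ in subset:
                labels[name] = label
        pending = next_pending
        if not resolved:
            break  # no new normal or malignant subsets appeared
    for name, _ in pending:
        labels[name] = "intermediate"
    return labels
```

Because each subset is judged by its mean, the outcome depends on how the subsets are formed; a real run would use the clustering of the copy number profiles rather than the one-dimensional split used here.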
The results are shown in FIG. 2: most of the predicted malignant cells derive from the tumor tissue samples, and a proportion of cells in malignant transition is also found in the paratumor tissue. The copy number variation patterns at the regional and whole-chromosome levels are shown in FIG. 3 and FIG. 4: in the malignant cell population identified in the lung tumor sample, chromosomes 8, 16 and 17 show significant copy number amplification, while malignant cells in the tumor and paratumor tissue show copy number deletions on chromosomes 4, 5 and 11.
To integrate and correlate the copy number variation patterns of the different omics, the annotated chromosome regions are divided into bands, the average copy number variation score (ranging from -2 for deletion to 2 for amplification) is quantified for each band shared by the different omics, and the copy number "amplification" and "deletion" calls of the omics are compared band by band. For the inferCNV transcriptome-level data and the Copy-scAT genomic chromatin accessibility-level data, a total of 42 overlapping chromosomal bands were obtained, and the copy number amplification or deletion of each band was marked. As shown in FIG. 5, the correlation between the copy number variation results of the multi-omics analysis reached 0.73 and was significant (p-value 0.0018).
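The band-level comparison can be sketched as follows. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, as are the function names and the band labels used for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def consistent_bands(rna_scores, atac_scores):
    """Bands whose average CNV score has the same sign in both omics.

    Both arguments map a band label to an average CNV score in [-2, 2].
    Returns {band: 'amplification' | 'deletion'} for the shared bands.
    """
    calls = {}
    for band in sorted(rna_scores.keys() & atac_scores.keys()):
        r, a = rna_scores[band], atac_scores[band]
        if r > 0 and a > 0:
            calls[band] = "amplification"
        elif r < 0 and a < 0:
            calls[band] = "deletion"
    return calls
```

The correlation would be computed over the shared bands only, for example `pearson([rna[b] for b in shared], [atac[b] for b in shared])`, where `shared` is the sorted list of overlapping band labels.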
Example 4 construction of malignant cell Gene expression regulatory network
Key transcription factors and their target genes in the lung tumor malignant cells identified by the combined multi-omics analysis were predicted using the SCRIP software, and a regulatory network was constructed.
First, using the clusterProfiler tool with a significance p-value threshold of 0.1, low-quality target genes were filtered out and potential key pathways were enriched, yielding 21 commonly enriched pathways; 49 target genes highly correlated with these pathways were screened and designated key node target genes.
The interaction network of these target genes was drawn using the target gene interaction information in the STRING database, and proteins outside the network were removed. Key node target genes of normal and malignant cells were then enriched using thresholds of average log fold change (avg.logFC) > 0.25 and Benjamini-Hochberg (BH) adjusted p-value < 0.05, as shown in FIG. 6.
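The final filtering step combines a fold-change cutoff with a BH-adjusted significance cutoff. The sketch below shows the standard Benjamini-Hochberg adjustment and the two thresholds given above; the function names and the input layout (gene mapped to an (avg.logFC, raw p-value) pair) are assumptions for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_key_genes(stats, min_logfc=0.25, max_padj=0.05):
    """Keep genes with avg.logFC > min_logfc and BH-adjusted p < max_padj.

    stats: dict mapping a gene name to (avg_logfc, raw_p_value).
    """
    genes = sorted(stats)
    padj = bh_adjust([stats[g][1] for g in genes])
    return [
        g for g, q in zip(genes, padj)
        if stats[g][0] > min_logfc and q < max_padj
    ]
```

The adjustment depends on how many genes are tested together, so it should be applied to the full set of tested genes before the fold-change filter, as done here.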
Using this framework, known key target genes highly related to lung tumors, such as Tp63, Foxc2 and Nkx2-1, were enriched. The malignant cell population was characterized by epithelial-mesenchymal transition; the tumor sample examined belongs to lung squamous cell carcinoma, and the results indicate a transition from epithelial lung adenocarcinoma to stromal lung squamous cell carcinoma. This demonstrates that the method can rapidly and accurately identify the key target genes regulating malignant cells in a tumor and their regulatory network.
All documents mentioned in this application are incorporated by reference as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the claims appended hereto.
Claims (10)
1. A method for distinguishing malignant cells based on single-cell multi-omics sequencing, comprising the following steps:
S1, obtaining a tumor sample and a paratumor sample, preparing a single-nucleus suspension from each, mixing the suspension with molecular tag microbeads and loading the mixture into a microwell chip, capturing and labeling the base sequences of the nuclei in situ in the microwells, and adding a cell identity tag and a molecular tag;
S2, constructing a sequencing library, and performing at least two of single-cell transcriptome sequencing, single-cell chromatin accessibility sequencing, single-cell genome sequencing and single-cell methylation sequencing to obtain different single-cell sequencing data;
S3, performing the following analysis on each single-cell sequencing dataset separately:
S31, obtaining the average copy number variation levels in the tumor sample and the paratumor sample, as the malignant copy number variation expectation and the normal copy number variation expectation, respectively;
S32, dividing the single-cell sequencing data of the tumor sample and the paratumor sample into N subsets, and judging each subset according to the following criteria:
if the average copy number variation level of the subset is less than the normal copy number variation expectation, the subset is a normal subset and its cells are normal cells; if the average copy number variation level of the subset is greater than the malignant copy number variation expectation, the subset is a malignant subset and its cells are malignant cells; if the average copy number variation level of the subset lies between the normal copy number variation expectation and the malignant copy number variation expectation, the subset is an intermediate subset;
S33, re-dividing the intermediate subsets into N subsets and classifying them according to the criteria in S32;
S34, repeating step S33 until no more normal or malignant subsets appear, or the maximum number of iterations Y is reached,
wherein N = 20 to 100 and Y = 10 to 50;
S4, performing correlation analysis on the chromosome copy number variation patterns of the malignant cells identified from the different single-cell sequencing data in step S3, and merging the malignant cells using the chromosome regions with the same copy number variation pattern; chromosome regions whose copy number variation direction is amplification or deletion are screened to draw a chromosome variation pattern map of the malignant cells, and the malignant cells identified by iterative grouping on the average copy number level of each cell are merged.
2. The method of claim 1, further comprising the step of performing cell subtype identification based on any one of the single-cell sequencing data:
grouping all microbeads pairwise according to the cell identity tags in the sequencing data to form microbead pairs;
traversing the microbead pairs, calculating for each pair the similarity of the sequences captured by the two microbeads, and ranking the microbead pairs by similarity;
then, merging the microbead pairs whose sequence similarity is higher than a preset threshold according to the number of cells and microbeads actually contained in each microwell;
finally, merging the gene matrices of the cells derived from the tumor sample and from the paratumor sample, performing dimension reduction, feature selection, differential analysis and cell subpopulation clustering on the merged single-cell omics matrix, and annotating the cell subpopulations based on a public database.
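The pairing, similarity and ranking steps of claim 2 can be sketched as follows. The claim does not name a similarity measure, so the Jaccard index over sets of captured sequences is an assumption, as are the function names, the bead identifiers and the threshold of 0.5.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets of captured sequences."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_bead_pairs(beads):
    """Score every unordered pair of beads and sort by descending similarity."""
    pairs = [
        (jaccard(beads[x], beads[y]), x, y)
        for x, y in combinations(sorted(beads), 2)
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs

def pairs_above(pairs, threshold):
    """Keep the bead pairs whose similarity exceeds the preset threshold."""
    return [(x, y) for score, x, y in pairs if score > threshold]

# Toy input: bead identity tag -> set of captured sequences (made-up values).
beads = {
    "bead_A": {"f1", "f2", "f3", "f4"},
    "bead_B": {"f1", "f2", "f3", "f5"},
    "bead_C": {"f9", "f10"},
}

ranked = rank_bead_pairs(beads)
to_merge = pairs_above(ranked, threshold=0.5)
```

Here `bead_A` and `bead_B` share three of five distinct sequences, so they are the only pair passed on to the merging step. The traversal is quadratic in the number of beads, which is acceptable for a sketch but would need indexing or blocking on real data.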
3. The method for distinguishing malignant cells based on single-cell multi-omics sequencing according to claim 2, wherein merging the microbead pairs whose sequence similarity is higher than a preset threshold specifically comprises:
(1) when the microwell contains one cell and one microbead, directly taking the cell identity tag and the genetic sequence information captured by the microbead as the genetic information matrix of the single cell;
(2) when the microwell contains a plurality of cells and one microbead, distributing the genetic sequence information captured by the microbead to the cells according to the cell identity tags of the paired cells, as the genetic information matrices of the cells;
(3) when the microwell contains one cell and a plurality of microbeads, accumulating the genetic sequences captured by the microbeads and assigning them to the cell, as the genetic information matrix of the single cell;
(4) when the microwell contains a plurality of cells and a plurality of microbeads, accumulating the genetic sequences captured by the microbeads and then distributing them to the cells according to the cell identity tags of the paired cells, as the genetic information matrices of the cells.
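The four cases above can be handled by one routine that first accumulates the reads of all microbeads in a microwell and then distributes them by cell identity tag. The sketch below assumes that every captured read carries the identity tag of the cell it came from; that data model, the function name `well_to_cell_matrix` and the toy gene names are illustrative assumptions and are not stated in the claim.

```python
from collections import Counter

def well_to_cell_matrix(cells, bead_reads):
    """Build per-cell gene counts for one microwell.

    cells      -- list of cell identity tags present in the microwell
    bead_reads -- dict: bead id -> list of (cell_tag, gene) reads
    """
    # Cases (3) and (4): accumulate the reads captured by all beads.
    pooled = [read for reads in bead_reads.values() for read in reads]

    # Cases (1) and (3): a single cell receives everything that was captured.
    if len(cells) == 1:
        return {cells[0]: Counter(gene for _, gene in pooled)}

    # Cases (2) and (4): distribute reads to cells by their identity tags.
    matrix = {cell: Counter() for cell in cells}
    for tag, gene in pooled:
        if tag in matrix:
            matrix[tag][gene] += 1
    return matrix

# Case (3): one cell, two beads.
one_cell = well_to_cell_matrix(
    ["c1"],
    {"b1": [("c1", "TP53"), ("c1", "MYC")], "b2": [("c1", "MYC")]},
)

# Case (4): two cells, two beads.
two_cells = well_to_cell_matrix(
    ["c1", "c2"],
    {"b1": [("c1", "TP53"), ("c2", "MYC")], "b2": [("c2", "MYC"), ("c1", "EGFR")]},
)
```

Treating case (1) as the degenerate form of case (3), and case (2) as the degenerate form of case (4), keeps the code short; a production pipeline might keep the four branches separate for auditing.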
4. The method for distinguishing malignant cells based on single-cell multi-omics sequencing according to claim 1, wherein step S3 further comprises, before the analysis, a step of quasi-population processing:
summing the single-cell sequencing data of cells from the same sample according to the cell number, constructing a quasi-population set by summing the single-cell sequencing data sets of neighboring cells according to Euclidean distance, and normalizing the data.
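A minimal sketch of the quasi-population step, assuming it is applied to the cells of one sample at a time. Each cell is pooled with its `k` nearest neighbors by Euclidean distance and the summed counts are scaled to a fixed total. The choice of `k`, the coordinates used for the distance, the normalization to a constant sum, and the function name `quasi_population` are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def quasi_population(counts, coords, k=1, scale=1.0):
    """Pool each cell with its k nearest neighbors and normalize.

    counts -- list of per-cell count vectors (same length for every cell)
    coords -- list of per-cell coordinates used for the Euclidean distance
    """
    pooled = []
    for i, here in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        neighbors = sorted(others, key=lambda j: dist(here, coords[j]))[:k]
        members = [i] + neighbors
        # Sum the counts of the cell and its neighbors, feature by feature.
        summed = [sum(counts[j][g] for j in members) for g in range(len(counts[i]))]
        total = sum(summed)
        # Normalize so that every quasi-population profile sums to `scale`.
        pooled.append([scale * v / total if total else 0.0 for v in summed])
    return pooled

# Toy data: four cells of one sample, two features, two well-separated groups.
coords = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
counts = [[1, 0], [3, 0], [0, 2], [0, 4]]
profiles = quasi_population(counts, coords, k=1)
```

Pooling neighbors reduces the sparsity of single-cell data before copy number levels are estimated, at the cost of blurring differences between adjacent cells.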
5. The method of claim 1, further comprising predicting key transcription factors and/or target genes in the identified malignant cells for molecular typing of the malignant cells.
6. A system for distinguishing malignant cells based on single-cell multi-omics sequencing, comprising the following modules:
a data input module, for receiving different single-cell sequencing data of a tumor sample and a paratumor sample obtained from at least two of single-cell transcriptome sequencing, single-cell chromatin accessibility sequencing, single-cell genome sequencing and single-cell methylation sequencing;
a malignant cell differentiation module, connected with the data input module and used for performing the following analysis on each type of single-cell sequencing data separately:
S31, obtaining the average copy number variation levels in the tumor sample and in the paratumor sample, and taking them as the malignant copy number variation expectation and the normal copy number variation expectation, respectively;
S32, dividing the single-cell sequencing data of the tumor sample and the paratumor sample into N subsets, and judging each subset according to the following criteria:
if the average copy number variation level of the subset is less than the normal copy number variation expectation, the subset is a normal subset and its cells are normal cells; if the average copy number variation level of the subset is greater than the malignant copy number variation expectation, the subset is a malignant subset and its cells are malignant cells; if the average copy number variation level of the subset lies between the normal copy number variation expectation and the malignant copy number variation expectation, the subset is an intermediate subset;
S33, dividing the intermediate subsets into N subsets again, and classifying them according to the criteria in S32;
S34, repeating step S33 until no further normal or malignant subsets appear, or until the maximum number of iterations Y is reached, wherein N = 20 to 100 and Y = 10 to 50.
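The iterative classification in S31 to S34 can be sketched as below. The sketch assumes that a per-cell copy number variation score has already been computed and that cell identifiers are unique across both samples. The claim does not say how the data are divided into N subsets, so sorting cells by score and cutting them into contiguous chunks is an assumption, as are the function names. The toy call uses N = 3 and Y = 5 only because the example has seven cells; the claimed ranges are N = 20 to 100 and Y = 10 to 50.

```python
from statistics import mean

def split(cells, n):
    """Split (cell_id, score) pairs, sorted by score, into at most n subsets."""
    ordered = sorted(cells, key=lambda c: c[1])
    n = min(n, len(ordered))
    size, extra = divmod(len(ordered), n)
    subsets, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        subsets.append(ordered[start:end])
        start = end
    return subsets

def classify(tumor, paratumor, n=20, max_iter=10):
    """tumor / paratumor: dict cell_id -> CNV score. Returns cell_id -> label."""
    malignant_exp = mean(tumor.values())       # S31: malignant expectation
    normal_exp = mean(paratumor.values())      # S31: normal expectation
    labels = {}
    pending = list(tumor.items()) + list(paratumor.items())
    for _ in range(max_iter):                  # S34: at most Y rounds
        if not pending:
            break
        still_intermediate, resolved = [], False
        for subset in split(pending, n):       # S32 (first round) / S33 (later)
            level = mean(score for _, score in subset)
            if level < normal_exp:
                label = "normal"
            elif level > malignant_exp:
                label = "malignant"
            else:
                still_intermediate.extend(subset)
                continue
            resolved = True
            for cell_id, _ in subset:
                labels[cell_id] = label
        pending = still_intermediate
        if not resolved:                       # S34: no new normal/malignant subset
            break
    for cell_id, _ in pending:
        labels[cell_id] = "intermediate"
    return labels

# Toy CNV scores (made-up values).
tumor = {"t1": 0.9, "t2": 0.8, "t3": 0.1, "t4": 0.5}
paratumor = {"p1": 0.05, "p2": 0.1, "p3": 0.15}
labels = classify(tumor, paratumor, n=3, max_iter=5)
```

In the toy data, `t3` comes from the tumor sample but falls into a low-CNV subset and is labeled normal, which is the kind of case the subset-level comparison is meant to catch. Cells still unresolved when the loop stops are reported as intermediate rather than forced into either class.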
7. The system for distinguishing malignant cells based on single-cell multi-omics sequencing of claim 6, further comprising:
a cell subtype identification module, connected with the data input module and the malignant cell differentiation module respectively, and used for performing cell subtype identification according to the following steps:
grouping all microbeads pairwise according to the cell identity tags in the sequencing data to form microbead pairs;
traversing the microbead pairs, calculating for each pair the similarity of the sequences captured by the two microbeads, and ranking the microbead pairs by similarity;
then, merging the microbead pairs whose sequence similarity is higher than a preset threshold according to the number of cells and microbeads actually contained in each microwell;
finally, merging the gene matrices of the cells derived from the tumor sample and from the paratumor sample, performing dimension reduction, feature selection, differential analysis and cell subpopulation clustering on the merged single-cell omics matrix, and annotating the cell subpopulations based on a public database;
and for determining the variation patterns of malignant cells in different cell subpopulations based on the malignant cells identified by the malignant cell differentiation module.
8. The system for distinguishing malignant cells based on single-cell multi-omics sequencing of claim 6, further comprising:
a key target gene and regulatory network enrichment module, connected with the malignant cell differentiation module and used for predicting the key transcription factors and/or target genes in the identified malignant cells and performing molecular typing of the malignant cells.
9. A computer device, comprising:
a memory for storing a computer program;
a processor for performing the steps of the method for distinguishing malignant cells based on single-cell multi-omics sequencing according to any of claims 1-6 when executing the computer program.
10. A computer-readable storage medium, wherein
the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method for distinguishing malignant cells based on single-cell multi-omics sequencing as claimed in any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311568169.2A CN117476101A (en) | 2023-11-22 | 2023-11-22 | Method, system, equipment and medium for distinguishing malignant cells by using multicellular sequencing data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311568169.2A CN117476101A (en) | 2023-11-22 | 2023-11-22 | Method, system, equipment and medium for distinguishing malignant cells by using multicellular sequencing data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117476101A true CN117476101A (en) | 2024-01-30 |
Family
ID=89636203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311568169.2A Pending CN117476101A (en) | 2023-11-22 | 2023-11-22 | Method, system, equipment and medium for distinguishing malignant cells by using multicellular sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117476101A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117854600A (en) * | 2024-03-07 | 2024-04-09 | 北京大学 | Cell identification method, device, equipment and storage medium based on multiple sets of chemical data |
CN117854600B (en) * | 2024-03-07 | 2024-05-21 | 北京大学 | Cell identification method, device, equipment and storage medium based on multiple sets of chemical data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108753967B (en) | Gene set for liver cancer detection and panel detection design method thereof | |
Baek et al. | Single-cell ATAC sequencing analysis: from data preprocessing to hypothesis generation | |
US11335437B2 (en) | Set membership testers for aligning nucleic acid samples | |
Ebert et al. | Genomic approaches to hematologic malignancies | |
CN110800063B (en) | Detection of tumor-associated variants using cell-free DNA fragment size | |
Riddick et al. | Integration and analysis of genome-scale data from gliomas | |
Bohers et al. | cfDNA sequencing: technological approaches and bioinformatic issues | |
CN106676182A (en) | Low-frequency gene fusion detection method and device | |
Larsson et al. | Comparative microarray analysis | |
Li et al. | Bioinformatics-based identification of methylated-differentially expressed genes and related pathways in gastric cancer | |
CN110910950A (en) | Flow method for combined analysis of single-cell scRNA-seq and scATAC-seq | |
CN117476101A (en) | Method, system, equipment and medium for distinguishing malignant cells by using multicellular sequencing data | |
CN109637587B (en) | Method, device, storage medium, processor and method for standardizing transcriptome data expression quantity for detecting gene fusion mutation | |
Mamatjan et al. | Molecular signatures for tumor classification: an analysis of the cancer genome atlas data | |
Iaccarino et al. | LncRNA as cancer biomarkers | |
WO2021150990A1 (en) | Small rna disease classifiers | |
Foster et al. | A targeted capture approach to generating reference sequence databases for chloroplast gene regions | |
Miles et al. | Genetic testing and tissue banking for personalized oncology: Analytical and institutional factors | |
Costa et al. | Bioinformatics research methodology of non-coding RNAs in cardiovascular diseases | |
Mitchell et al. | Inter-platform comparability of microarrays in acute lymphoblastic leukemia | |
Marques et al. | Single-Cell RNA sequencing of oligodendrocyte lineage cells from the mouse central nervous system | |
WO2020194057A1 (en) | Biomarkers for disease detection | |
CN114875118B (en) | Methods, kits and devices for determining cell lineage | |
CN111020710A (en) | ctDNA high-throughput detection of hematopoietic and lymphoid tissue tumors | |
Porter et al. | StemBase: a resource for the analysis of stem cell gene expression data | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||