CN117912552B

CN117912552B - A method, device and medium for analyzing single-cell transcriptome data of a PDX model

Info

Publication number: CN117912552B
Application number: CN202410054361.8A
Authority: CN
Inventors: 葛长利; 韩斐然; 郎秋蕾
Original assignee: Hangzhou Link Care Medical Laboratory Co ltd
Current assignee: Hangzhou Link Care Medical Laboratory Co ltd
Priority date: 2023-01-07
Filing date: 2023-01-07
Publication date: 2025-03-28
Anticipated expiration: 2043-01-07
Also published as: CN116153401A; CN116153401B; CN117912551A; CN117912551B; CN117912552A

Abstract

The invention discloses a method, device and medium for analyzing single-cell transcriptome data of a PDX model, and belongs to the technical field of bioinformatics. The analysis method comprises: aligning single-cell transcriptome sequencing data of a PDX model to a mixed human-mouse genome library to obtain a cell-gene expression profile matrix; identifying each cell as a human cell, a mouse cell or a double cell according to the proportion of expressed human or mouse genes; extracting the sequences of the identified cells from the single-cell transcriptome sequencing data based on the cell barcode; and aligning the extracted sequences to the reference genome of the source species to obtain the corresponding cell-gene expression profile. The resulting cell-gene expression profile matrix can be used for differential gene analysis, functional enrichment analysis and the like, yielding more data and providing technical support for the clinical treatment of cancer. Furthermore, a cell-integrated gene expression profile covering multiple species can be obtained based on an integrated gene sequence set, and the interaction mechanism of human and mouse cells can be further studied through cell cluster analysis.

Description

Analysis method, equipment and medium of single-cell transcriptome data of PDX model

Related patent

The application is a divisional application of Chinese patent application No. 2023100214777, filed on January 7, 2023 and entitled "Single cell transcriptome data analysis method, system, equipment and medium based on PDX".

Technical Field

The invention belongs to the technical field of biological data processing, and particularly relates to an analysis method, equipment and medium for single-cell transcriptome data of a PDX model.

Background

The PDX (patient-derived tumor xenograft) model is a tumor model constructed by transplanting patient-derived tumor tissue or primary cells into NSG (immunodeficient) mice. Because the tumor tissue is transplanted directly into NSG mice without artificial culture, the model retains most characteristics of the primary tumor at the histopathological, molecular biological and genetic levels, and closely resembles the clinical situation. The PDX model is the tumor model closest to clinical specimens so far, and is of great significance for clinical tumor assessment, treatment and prognosis.

Single cell transcriptome sequencing of the PDX model makes it possible to further analyze cell types and gene expression characteristics at different stages of a tumor, thereby providing guidance for tumor treatment. Currently, cellranger, the official 10x analysis software, can analyze the mixed multi-species libraries of a PDX model and produce the corresponding expression profile matrices and clustering results. However, because human and mouse share a large number of homologous genes, some reads from human-derived cells still align to the mouse genome even when the data are analyzed against a mixed human-mouse genome, so the cell clustering results obtained directly from this expression profile, and the downstream data analysis and mining results, are inaccurate. Likewise, because of the homologous sequences in the human and mouse genomes, the human genome alone cannot be used directly for the analysis, since some mouse-derived cells would then be misidentified as human cells.

In addition, PDX model studies not only examine how gene expression in human tumor cells changes under different treatment schemes, in order to find corresponding drug targets, but also need to examine how mouse cells interact with the introduced human cells. It is therefore important to separate the cell-gene expression profiles of individual species from the cell sequencing data of a PDX model, and to obtain cell-gene expression profiles spanning multiple species. However, there is currently no method for systematically analyzing single cell transcriptomes based on the PDX model.

Disclosure of Invention

In order to solve at least one of the technical problems, the invention adopts the following technical scheme:

The first aspect of the invention provides a single cell transcriptome data analysis method based on PDX, comprising the steps of:

S1, comparing single-cell transcriptome sequencing data of an obtained PDX model with a mixed genome library of a human and a mouse to obtain a cell-gene expression profile matrix based on the mixed genome library, wherein the mixed genome library is obtained by combining a reference genome file and a gene annotation file of the human and the mouse;

S2, recognizing the human cells, the mouse cells or the double cells according to whether the proportion of expressed human genes or the proportion of expressed mouse genes in the cells is greater than or equal to a first preset threshold value P1;

S3, extracting sequences recognized as human cells and sequences recognized as mouse cells from the single cell transcriptome sequencing data based on cell barcode;

S4, comparing the obtained sequence of the human cell with a human reference genome, comparing the obtained sequence of the mouse cell with the mouse reference genome to obtain a corresponding cell-gene expression profile,

Wherein P1 is set such that the double cell rate differs from half the multicellular rate by no more than 5%, the double cell rate and the multicellular rate being calculated by the following formulas:

Double cell rate = (double cell number / (human cell number + mouse cell number + double cell number)) × 100%;

Multicellular rate = (number of captured cells × 7.589 × 10^-6 + 5.272 × 10^-4) × 100%.

Since the multicellular rate covers, in addition to double cells (a mixture of human and mouse cells), mixtures of human with human cells and of mouse with mouse cells, the double cell rate is theoretically equal to 1/2 of the multicellular rate. In the present invention, "the difference between the double cell rate and half of the multicellular rate is not more than 5%" means that |double cell rate - multicellular rate/2| / (multicellular rate/2) × 100% is not more than 5%. If P1 is set too high, too many cells are judged to be double cells, which does not reflect the actual situation.
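The criterion for choosing P1 can be sketched in a few lines of Python. This is only an illustration of the two formulas above; the function names and the cell counts in the usage lines are hypothetical and are not taken from the patent.

```python
def multiplet_rate(captured_cells: int) -> float:
    """Expected multicellular rate (as a fraction) from the fitted linear formula."""
    return captured_cells * 7.589e-6 + 5.272e-4


def doublet_rate(n_human: int, n_mouse: int, n_double: int) -> float:
    """Observed double cell rate (as a fraction)."""
    return n_double / (n_human + n_mouse + n_double)


def p1_is_acceptable(n_human: int, n_mouse: int, n_double: int,
                     tolerance: float = 0.05) -> bool:
    """True if the double cell rate deviates from half the multicellular
    rate by no more than `tolerance` (relative deviation)."""
    captured = n_human + n_mouse + n_double
    expected = multiplet_rate(captured) / 2
    observed = doublet_rate(n_human, n_mouse, n_double)
    return abs(observed - expected) / expected <= tolerance


# Hypothetical counts, chosen only for illustration.
print(f"{multiplet_rate(8565):.2%}")
print(p1_is_acceptable(n_human=6000, n_mouse=2290, n_double=275))
```

In practice one would recompute the three cell counts for each candidate P1 and keep a value for which `p1_is_acceptable` holds.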

In some embodiments of the invention, P1 = 70%.

In some embodiments of the present invention, in step S1, when the genome file and the gene annotation file of human and mouse are combined, in order to avoid duplication of genes and chromosomes, specific tags are respectively added before the gene ID, the gene name, and the chromosome for distinguishing. For example, "human" is added before the human gene ID, gene name, chromosome, and "mouse" is added before the mouse gene ID, gene name, chromosome. Further, library files are generated for comparison based on the combined genomic file and gene annotation file.
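A minimal sketch of the tagging step, assuming a standard FASTA header line and a nine-column GTF record with `gene_id` and `gene_name` attributes. The underscore separator, the function names and the example identifier `ENSG0001` are assumptions made for illustration; the patent only states that a species tag is added before the gene ID, gene name and chromosome.

```python
import re


def tag_fasta_header(line: str, tag: str) -> str:
    """Prefix the sequence (chromosome) name of a FASTA header with the species tag."""
    if line.startswith(">"):
        return ">" + tag + "_" + line[1:]
    return line  # sequence lines pass through unchanged


def tag_gtf_line(line: str, tag: str) -> str:
    """Prefix the chromosome, gene_id and gene_name of one GTF record with the species tag."""
    if line.startswith("#") or not line.strip():
        return line  # comments and blank lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    fields[0] = f"{tag}_{fields[0]}"  # column 1 is the chromosome name
    fields[8] = re.sub(r'(gene_id|gene_name) "([^"]+)"',
                       rf'\1 "{tag}_\2"', fields[8])
    return "\t".join(fields) + "\n"


# Illustrative record; the gene ID is made up.
record = '1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG0001"; gene_name "CD3E";\n'
print(tag_gtf_line(record, "human"), end="")
```

Applying the same tag to the FASTA headers and to the GTF chromosome column keeps the two files consistent, which is what allows them to be concatenated across species without name collisions.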

In some embodiments of the present invention, step S2 specifically includes:

S21, counting the number Nh of expressed human genes and the number Nm of expressed mouse genes in cells;

S22, calculating the ratio Ph of expressed human genes and the ratio Pm of expressed mouse genes in the cell, where Ph = Nh/(Nh + Nm), Pm = Nm/(Nh + Nm);

S23, if Ph is greater than or equal to a first preset threshold value P1, the cell is identified as a human cell, and if Pm is greater than or equal to the first preset threshold value P1, the cell is identified as a mouse cell.

In some embodiments of the invention, cells that neither meet Ph equal to or greater than a first preset threshold value P1 nor meet Pm equal to or greater than a first preset threshold value P1 are judged as double cells, i.e., cells that have both human and mouse gene expression.
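Steps S21 to S23 and the double cell rule can be sketched as a single function. The function name and the returned labels are illustrative choices; the default threshold of 0.70 corresponds to the embodiment in which P1 = 70%.

```python
def classify_cell(nh: int, nm: int, p1: float = 0.70) -> str:
    """Classify one barcode from its counts of expressed human (nh) and mouse (nm) genes."""
    total = nh + nm
    if total == 0:
        raise ValueError("no expressed genes detected for this barcode")
    ph = nh / total  # proportion of expressed human genes
    pm = nm / total  # proportion of expressed mouse genes
    if ph >= p1:
        return "human"
    if pm >= p1:
        return "mouse"
    return "double"  # neither proportion reaches P1


print(classify_cell(1871, 61))
print(classify_cell(897, 1284))
print(classify_cell(102, 1053))
```

Because Ph + Pm = 1, the two conditions cannot both hold as long as P1 is above 50%, so the order of the two checks does not matter in that range.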

A barcode, also known as an index or tag, is commonly used in sequencing technology to distinguish sequences from different sources. In the present invention, barcodes are used to distinguish different cells: sequencing sequences carrying the same barcode in the sequencing result come from the same cell, so different barcodes represent different cells. In some descriptions of the invention, "barcode" and "cell" have the same meaning.

In some embodiments of the invention, in step S3, the extracting the sequence recognized as a human cell and the sequence recognized as a mouse cell specifically includes:

S31, identifying the barcode of the sequencing sequence, and comparing the barcode with the cell barcode to obtain a base matching coefficient Mi, wherein Mi = Lm/Lb, Lm is the number of bases of the sequencing sequence's barcode that match the cell barcode, and Lb is the number of bases of the cell barcode;

S32, extracting the sequence according to Mi and a second preset threshold P2:

If Mi = 100%, the corresponding sequence is extracted directly; if P2 ≤ Mi < 100% and the sequencing quality value of each unmatched base in the sequencing reads is less than 10, the unmatched base is corrected to the correct base and the sequence is then extracted; if Mi < P2, the sequence is not extracted,

Wherein P2 ≥ 80%.

After the human cells and the mouse cells are identified by the step S2, the corresponding cell-gene expression profile of the human cells or the mouse cells may be obtained according to the comparison result of the step S1. However, the alignment of step S1 is based on the mixed genome library, and the gene expression profile of each cell may not be accurate due to the influence of homologous genes. Therefore, based on the step S3 of extracting the original sequencing data of the cells from the human and the mouse, the cell-gene expression profile matrix of the human or the mouse can be obtained by further comparing the original sequencing data with the reference genome of the human and the mouse respectively.

Different sequencing platforms typically have different barcode lengths; e.g., the barcode length of the Illumina 10x single cell sequencing platform is 16, and that of the ink platform is 28. Different P2 values are chosen for different barcode lengths, the usual criterion being that only 1-2 base mismatches are accepted. In some embodiments of the invention, the barcode length is 16 and P2 is set to 90%, allowing only 1 base to be unmatched (15/16 = 93.75% ≥ 90%, whereas 14/16 = 87.5% < 90%).

In some embodiments of the present invention, in step S32, correcting the sequencing reads to the correct bases means replacing each base with a sequencing quality value < 10 in the barcode of the sequencing sequence with the base at the corresponding position of the matched cell barcode. If the sequencing quality value is greater than or equal to 10, no correction is made: the sequencing reads are considered not to belong to that cell, and should either be extracted under another barcode or discarded.
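The extraction rule of steps S31 and S32 can be sketched as follows. The function name and its signature are hypothetical; quality values are assumed to be given as one integer Phred score per barcode base, and the defaults correspond to the embodiment with P2 = 90% and a quality cutoff of 10.

```python
from typing import List, Optional


def assign_barcode(read_bc: str, cell_bc: str, quals: List[int],
                   p2: float = 0.90, min_q: int = 10) -> Optional[str]:
    """Return the cell barcode the read is extracted under, or None if it is not extracted."""
    if not (len(read_bc) == len(cell_bc) == len(quals)):
        raise ValueError("barcodes and quality list must have the same length")
    mismatches = [i for i, (a, b) in enumerate(zip(read_bc, cell_bc)) if a != b]
    mi = (len(cell_bc) - len(mismatches)) / len(cell_bc)  # Mi = Lm / Lb
    if mi == 1.0:
        return cell_bc  # perfect match: extract directly
    if mi >= p2 and all(quals[i] < min_q for i in mismatches):
        return cell_bc  # low-quality mismatches: correct to the cell barcode
    return None  # Mi < P2, or a mismatch with quality >= min_q


cell = "AAACCTGAGCTAACAA"
print(assign_barcode("TAACCTGAGCTAACAA", cell, [5] + [30] * 15))
print(assign_barcode("TAACCTGAGCTAACAA", cell, [30] * 16))
```

A read that returns `None` for one cell barcode may still match another cell barcode, so in a full pipeline this check would be repeated against the other candidates before the read is discarded.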

In some embodiments of the present invention, in step S4, when constructing the cell-gene expression profile matrix, only genome alignment and UMI correction need to be performed, and no cell recognition process is required.

In some embodiments of the invention, after step S2, the following steps are performed:

S3', comparing the single cell transcriptome sequencing data with an integrated gene sequence set of homologous genes of the human and the mouse to obtain a comparison result of the single cell transcriptome sequencing data and the integrated gene sequence set;

S4', obtaining a cell-integrated gene expression profile from the comparison result obtained in the step S3' based on the human cell and the mouse cell barcode identified in the step S2,

Wherein the set of integrated gene sequences of the homologous genes is obtained based on the following steps:

(1) Splicing sequences of homologous genes of the human and the mouse to obtain an integrated gene sequence of each homologous gene, wherein 60-100 N bases are inserted as filler between the gene sequence of the human and the gene sequence of the mouse;

(2) Constructing corresponding annotation files according to the integrated gene sequences, wherein each homologous gene is treated as an independent chromosome, and the homologous gene sequences from human and mouse are treated as 2 transcripts.

In some embodiments of the invention, 60-100 N bases are placed in the integrated gene sequence so that species information can be added to the collated annotation file for the subsequent filtering of aligned reads. In some embodiments of the invention, 80 N bases are provided in the integrated gene sequence. That is, 80 N bases are inserted between the homologous gene sequence derived from human and the homologous gene sequence derived from mouse in one integrated gene sequence. In addition, the N bases prevent reads from being aligned across species during subsequent alignment, i.e., from aligning partly to the human homologous gene and partly to the mouse homologous gene.
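The construction of one integrated gene sequence can be sketched as below. Placing the human sequence first, the function name, and the 1-based inclusive coordinates (as used in GTF files) are assumptions for illustration; the patent specifies only that 60-100 N bases separate the two species' sequences and that each species contributes one transcript on a per-gene chromosome.

```python
def build_integrated_gene(human_seq: str, mouse_seq: str, n_spacer: int = 80):
    """Splice human and mouse homologous gene sequences with an N spacer.

    Returns the integrated sequence and the 1-based inclusive span of each
    species' transcript on it.
    """
    if not 60 <= n_spacer <= 100:
        raise ValueError("spacer length must be between 60 and 100 N bases")
    sequence = human_seq + "N" * n_spacer + mouse_seq
    human_span = (1, len(human_seq))
    mouse_start = len(human_seq) + n_spacer + 1
    mouse_span = (mouse_start, mouse_start + len(mouse_seq) - 1)
    return sequence, {"human": human_span, "mouse": mouse_span}


# Toy sequences, far shorter than real genes.
seq, spans = build_integrated_gene("ACGT", "GGCCA", n_spacer=60)
print(len(seq), spans)
```

The returned spans are what an annotation file would record as the two transcripts, so that an aligned read can later be attributed to a species from its position alone.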

In some embodiments of the present invention, steps S3' and S4' further include a step of filtering the comparison result:

if a read is aligned to a unique position on one integrated gene, the corresponding comparison information is retained; if a read is aligned to multiple positions on the same integrated gene, only one piece of comparison information is extracted and retained; and if a read is aligned to different integrated genes, the corresponding comparison information is filtered out.
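The filtering rule above can be sketched on a simplified representation of the alignment result, in which each hit is a `(read_id, integrated_gene, position)` tuple. This representation and the choice of keeping the first hit encountered are assumptions; the patent does not state which of several hits on the same integrated gene is kept.

```python
from collections import defaultdict


def filter_alignments(hits):
    """Keep at most one hit per read, and only for reads hitting a single integrated gene."""
    by_read = defaultdict(list)
    for read_id, gene, pos in hits:
        by_read[read_id].append((gene, pos))
    kept = {}
    for read_id, read_hits in by_read.items():
        genes = {gene for gene, _ in read_hits}
        if len(genes) == 1:
            kept[read_id] = read_hits[0]  # unique hit, or first of several on one gene
        # reads hitting different integrated genes are dropped
    return kept


hits = [
    ("r1", "Cd3d", 10),                       # unique position: kept
    ("r2", "Cd3d", 10), ("r2", "Cd3d", 500),  # same gene, two positions: one kept
    ("r3", "Cd3d", 10), ("r3", "Cd3e", 40),   # different genes: filtered out
]
print(filter_alignments(hits))
```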

In a second aspect, the present invention provides a single cell transcriptome data analysis system based on PDX, comprising:

the data input module is used for obtaining single-cell transcriptome sequencing data of the PDX model;

The database storage module is used for storing a human reference genome, a mouse reference genome and a mixed genome library of human and mouse, wherein the mixed genome library is obtained by combining a reference genome file and a gene annotation file of the human and the mouse;

The first comparison module is respectively connected with the data input module and the database storage module and is used for comparing the single-cell transcriptome sequencing data with the mixed genome library;

the cell identification module is connected with the first comparison module and is used for identifying human cells, mouse cells or double cells according to whether the ratio of expressed human genes or the ratio of expressed mouse genes in cells is greater than or equal to a first preset threshold value P1;

The sequence acquisition module is respectively connected with the cell identification module and the data input module and is used for extracting sequences identified as human cells and sequences identified as mouse cells from the single cell transcriptome sequencing data based on cell barcode;

the cell-gene expression profile construction module is respectively connected with the sequence acquisition module and the database storage module and is used for comparing the obtained sequence of the human cell with a human reference genome and comparing the obtained sequence of the mouse cell with the mouse reference genome to obtain a corresponding cell-gene expression profile,

Since the multicellular rate covers, in addition to double cells (a mixture of human and mouse cells), mixtures of human with human cells and of mouse with mouse cells, the double cell rate is theoretically equal to 1/2 of the multicellular rate. In the present invention, "the difference between the double cell rate and half of the multicellular rate is not more than 5%" means that |double cell rate - multicellular rate/2| / (multicellular rate/2) × 100% is not more than 5%.

In some embodiments of the invention, P1 = 70%.

In some embodiments of the invention, in the database storage module, when the genome file and the gene annotation file of the human and the mouse are combined, specific tags are respectively added before the gene ID, the gene name and the chromosome for distinguishing in order to avoid duplication of the genes and the chromosomes. For example, "human" is added before the human gene ID, gene name, chromosome, and "mouse" is added before the mouse gene ID, gene name, chromosome. Further, library files are generated for comparison based on the combined genomic file and gene annotation file.

In some embodiments of the invention, the cell recognition module performs cell recognition based on the following steps:

Counting the number Nh of expressed human genes and the number Nm of expressed mouse genes in the cells;

Calculating the ratio Ph of expressed human genes and the ratio Pm of expressed mouse genes in the cell, wherein Ph = Nh/(Nh + Nm), Pm = Nm/(Nh + Nm);

If Ph is greater than or equal to a first preset threshold value P1, the cell is identified as a human cell, and if Pm is greater than or equal to the first preset threshold value P1, the cell is identified as a mouse cell, and the rest are double cells.

In some embodiments of the invention, the sequence acquisition module extracts sequences that are recognized as human cells and sequences that are recognized as mouse cells by:

Identifying the barcode of the sequencing sequence, and comparing the barcode with the cell barcode to obtain a base matching coefficient Mi, wherein Mi = Lm/Lb, Lm is the number of bases of the sequencing sequence's barcode that match the cell barcode, and Lb is the number of bases of the cell barcode;

And performing sequence extraction according to Mi and a second preset threshold P2:

Wherein P2 ≥ 80%.

In some embodiments of the invention, the database storage module is further used for storing a set of integrated gene sequences of human and mouse homologous genes, and the system, whether or not it comprises the sequence acquisition module and the cell-gene expression profile construction module, further comprises:

The second comparison module is respectively connected with the data input module and the database storage module and is used for comparing the single-cell transcriptome sequencing data with an integrated gene sequence set of homologous genes of human and mice to obtain a comparison result of the single-cell transcriptome sequencing data and the integrated genes;

And the cell-integrated gene expression profile construction module is respectively connected with the cell identification module and the second comparison module and is used for obtaining a cell-integrated gene expression profile from the comparison result obtained by the second comparison module based on the human cell and the mouse cell barcode identified by the cell identification module.

A third aspect of the present invention provides a computer apparatus comprising:

A memory for storing a computer program;

A processor for implementing the steps of a single cell transcriptome data analysis method according to any one of the first aspect of the invention when executing the computer program.

A fourth aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a PDX-based single cell transcriptome data analysis method according to any of the first aspect of the present invention.

The beneficial effects of the invention are that

Compared with the prior art, the invention has the following beneficial effects:

The method and the system can be used to construct human and/or mouse cell-gene expression profiles for samples taken before and after treatment of the PDX mouse model (e.g., with drugs or other treatment modes), which can in turn be used for differential gene analysis, functional enrichment analysis and the like, yielding more data and providing technical support for the clinical treatment of cancer.

The method and the system of the invention can be used for obtaining the expression profile of the cell-integrated gene, and can be used for downstream cell clustering and differential gene searching analysis. The clustering result obtained in this way can also cluster human and mouse cells together well, and can be used to learn the interaction mechanism of human and mouse cells.

Drawings

FIG. 1 shows a plot of the number of captured cells versus multicellular rate.

FIG. 2 is a schematic diagram showing the sequencing result of single cell transcriptome of PDX model in example 1 of the present invention.

FIG. 3 shows the clustering result based on the mixed library analysis in example 1 of the present invention.

FIG. 4 shows the results of cluster analysis based on sequencing data corresponding to extracted human cells in example 1 of the present invention.

FIG. 5 shows the number of genes obtained based on the mixed library analysis and the analysis based on the sequencing data corresponding to the extracted human cells in example 1 of the present invention.

FIG. 6 shows clustering results using cell-gene expression profiles constructed based on a hybrid database.

FIG. 7 shows partial homologous gene information for human and mouse.

FIG. 8 shows the sequence of the mouse Cd3d gene.

FIG. 9 shows the integrated gene sequence obtained by splicing human and mouse homologous Cd3d gene sequences.

FIG. 10 shows annotation files of integrated gene sequences obtained by splicing human and mouse homologous Cd3d gene sequences.

FIG. 11 shows the results of a cell-integrated gene-based expression profile matrix clustering analysis in example 2 of the present invention.

FIG. 12 shows a schematic diagram of a cell-gene expression profile construction system of a single species in example 3 of the present invention.

FIG. 13 shows a schematic diagram of a multi-species cell-gene expression profile construction system in example 4 of the present invention.

FIG. 14 shows a schematic diagram of a single cell transcriptome sequencing data analysis system of the PDX model in example 5 of the present invention.

Detailed Description

Unless otherwise indicated, implied by the context, or customary in the art, all parts and percentages in the present application are based on weight, and the test and characterization methods used are those current as of the filing date of the present application. Where applicable, the disclosure of any patent, patent application, or publication referred to in this application is incorporated by reference in its entirety, and the equivalent patents of those cited in this application are likewise incorporated by reference, particularly with respect to their definitions of relevant terms of art. If the definition of a particular term disclosed in the prior art is inconsistent with any definition provided in the present application, the definition of the term provided in the present application controls.

In order to make the technical problems, technical schemes and beneficial effects solved by the invention more clear, the invention is further described in detail below with reference to the embodiments.

Examples

The following examples are presented herein to demonstrate preferred embodiments of the present invention. It will be appreciated by those skilled in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The experimental methods in the following examples are conventional methods unless otherwise specified. The apparatus used in the examples described below, unless otherwise specified, were all laboratory conventional apparatus, and the test materials used in the examples described below, unless otherwise specified, were all purchased from conventional biochemical reagent stores.

Example 1: Construction of cell-gene expression profiles of individual species based on single cell transcriptome data of the PDX model

The present example provides a method for constructing a cell-gene expression profile of a single species based on single cell transcriptome data of a PDX model, comprising the steps of:

1. Obtaining single cell transcriptome data of the PDX model

The target tissue is dissociated to obtain a single cell suspension. Gel beads carrying the barcode information are then combined with a mixture of cells and enzymes and encapsulated in oil-surfactant droplets in a microfluidic system to form GEMs (Gel Beads-in-Emulsion). The gel beads in the GEMs dissolve, the cells are lysed to release mRNA, and barcoded cDNA for sequencing is generated by reverse transcription; after the oil droplets are broken, cDNA amplification, purification and library construction are performed. The completed library is sequenced to obtain single cell transcriptome sequencing data.

2. Obtaining cell-gene expression profiles based on a mouse and human mixed database

Construction of a human and mouse mixed genome library: the human and mouse genome files (GRCh38 and GRCm39) and the gene annotation files (Ensembl release 105: Homo_sapiens.GRCh38.105.chr.gtf.gz, Mus_musculus.GRCm39.105.chr.gtf.gz) were merged, and "human" and "mouse" were added before the gene ID, gene name and chromosome, respectively, to avoid duplication of genes and chromosomes. Library files that can be used for STAR alignment were then generated with the cellranger software from the merged genome file and gene annotation file.

For single cell transcriptome data of the PDX model obtained in step 1, the following procedure was performed using cellranger software:

(1) barcode extraction correction

The first 16 bp of the sequencing R1 read are compared with the 10x whitelist. If the sequence is in the whitelist, the corresponding paired-end reads are retained. If the sequence differs from a whitelist entry by only 1 base and the sequencing quality value of the base at the differing position is less than 10, the mismatched base is corrected according to the whitelist and the reads are retained. Otherwise, the reads are excluded from the subsequent expression profile matrix calculation.

(2) Sequence genome alignment

The sequenced R2 reads were aligned to the reference genome using the STAR software to obtain alignment information, and only reads that aligned uniquely to the sense strand of a transcript were retained and counted in the barcode-gene expression matrix.

(3) Cell recognition

According to the barcode-gene expression matrix, the total UMI count detected for each barcode is computed, and the barcodes are sorted from high to low. For the identification of cells with high RNA content, taking 10000 captured cells as an example, a barcode is considered a cell if its UMI count is greater than 1/10 of the UMI count of the 101st barcode. Some cells with low RNA content may not satisfy this condition. Therefore, barcodes with low expression levels are selected as the background and their gene expression is summarized; the gene expression of each barcode not identified as a cell in the first step is then compared with the background. If the expression is consistent with the background, the barcode is regarded as background; otherwise it is regarded as a cell.
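The first pass of this cell recognition step (cells with high RNA content) can be sketched as follows. Only the UMI threshold rule is shown; the second pass against the background profile is not implemented here. The function name and the toy counts are hypothetical, and the reference rank of 101 is the value given for 10000 captured cells.

```python
def call_high_rna_cells(umi_counts: dict, ref_rank: int = 101) -> set:
    """Return barcodes whose UMI count exceeds 1/10 of the count of the barcode at `ref_rank`."""
    ranked = sorted(umi_counts.values(), reverse=True)
    if len(ranked) < ref_rank:
        raise ValueError("fewer barcodes than the reference rank")
    threshold = ranked[ref_rank - 1] / 10  # 1/10 of the 101st-highest UMI count
    return {bc for bc, n in umi_counts.items() if n > threshold}


# Toy data: 150 barcodes with 1000 UMIs, 30 with 150 UMIs, 20 with 50 UMIs.
counts = {f"bc{i}": 1000 for i in range(150)}
counts.update({f"bc{i}": 150 for i in range(150, 180)})
counts.update({f"bc{i}": 50 for i in range(180, 200)})
cells = call_high_rna_cells(counts)
print(len(cells))
```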

Thus, a cell-gene expression profile matrix based on a mixed genomic library was obtained.

The specific expression profile data format is as follows:

Gene	barcode 1	barcode 2	barcode 3	barcode 4	......	barcode n
human_TNFRSF4	0	1	0	0	......	2
human_CD3E	1	0	0	3	......	0
......	......	......	......	......	......	......
mouse_Gm28417	0	0	1	0	......	0

3. Identification of human and mouse cells

The numbers of expressed human genes and expressed mouse genes in each cell (barcode) were counted according to the gene name prefix and denoted Nh and Nm, respectively. The proportions of genes from human and mouse in each barcode were calculated and denoted Ph and Pm, respectively, where Ph = Nh/(Nh + Nm), Pm = Nm/(Nh + Nm).
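Counting Nh and Nm from the expression profile matrix can be sketched as below. The matrix is represented as a nested dictionary (gene name to per-barcode counts) purely for illustration, a gene is assumed to count as expressed when its count is greater than 0, and the prefixes `human_` and `mouse_` follow the gene names shown in the expression profile format above.

```python
def count_expressed_genes(matrix: dict) -> dict:
    """Map each barcode to (Nh, Nm), the numbers of expressed human and mouse genes."""
    result = {}
    for gene, row in matrix.items():
        for barcode, count in row.items():
            nh, nm = result.get(barcode, (0, 0))
            if count > 0:  # assumption: expressed means a non-zero count
                if gene.startswith("human_"):
                    nh += 1
                elif gene.startswith("mouse_"):
                    nm += 1
            result[barcode] = (nh, nm)
    return result


matrix = {
    "human_TNFRSF4": {"bc1": 0, "bc2": 1, "bc3": 0, "bc4": 0},
    "human_CD3E":    {"bc1": 1, "bc2": 0, "bc3": 0, "bc4": 3},
    "mouse_Gm28417": {"bc1": 0, "bc2": 0, "bc3": 1, "bc4": 0},
}
print(count_expressed_genes(matrix))
```

For real data a sparse matrix would normally be used instead of nested dictionaries; the counting logic is the same.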

Because the alignment uses a mixed genome library of human and mouse, and human and mouse share a large number of homologous genes, even human-derived cells may appear to express a small number of mouse genes. If Ph > 70% for a given barcode, the barcode is considered to be derived from a human cell; if Pm > 70% for a given barcode, the barcode is considered to be derived from a mouse cell; all remaining barcodes are considered double cells. Here, "70%" is the threshold set for identifying a cell as derived from human or mouse.

In the 10x single cell sequencing process, the multicellular rate is related to the number of captured cells, as shown in FIG. 1. The relationship between the multicellular rate and the number of captured cells obtained by fitting is given by the following formula:

Multicellular rate = (number of captured cells × 7.589 × 10^-6 + 5.272 × 10^-4) × 100%

However, the multicellular events above include, in addition to a human cell and a mouse cell mixed in one oil droplet (double cell), cases where human and human cells or mouse and mouse cells are mixed together. Thus, in theory, the double cell rate should be half the multicellular rate, i.e., (number of captured cells × 7.589 × 10^-6 + 5.272 × 10^-4)/2. The inventors found that when the threshold was 70%, the resulting double cell rate was very close to half the multicellular rate.

The resulting statistics are shown in the following table (partial display):

barcode	Nh	Nm	Ph(%)	Pm(%)	cell type
AAACCTGAGCTAACAA	1871	61	96.84	3.16	Human cells
AAACCTGAGTTGCAGG	645	42	93.89	6.11	Human cells
AAGGCAGAGCACACAG	897	1284	41.13	58.87	Double cells
AAGGCAGGTATCTGCA	102	1053	8.83	91.17	Mouse cells
ACGAGCCGTAGCCTAT	1082	192	84.93	15.07	Human cells
......	......	......	......	......	......
TATGCCCCAGCATGAG	321	1673	16.10	83.90	Mouse cells
TGCACCTTCAATCACG	1295	825	61.08	38.92	Double cells

The cell ratio of human and mouse obtained by different actual sample analysis is not completely consistent due to the influences of a PDX model, a sample sampling position, a sampling size, single cell dissociation, cell capturing efficiency and the like. The following table shows the actual number of cells from human and mouse obtained from two different PDX samples and their ratios:

As can be seen from the above table, the number of cells captured in sample 1 (human cell number + mouse cell number + double cell number) is 8565 in total, so the multicellular rate is calculated as (8565 × 7.589 × 10^-6 + 5.272 × 10^-4) × 100% = 6.55%. With the threshold set to 70%, the resulting double cell rate is 3.19%, which differs from half the multicellular rate (3.28%) by only |3.19 - 3.28| / 3.28 × 100% = 2.7%. The number of cells captured in sample 2 (human cell number + mouse cell number + double cell number) is 8639 in total, and the multicellular rate is calculated as (8636 × 7.589 × 10^-6 + 5.272 × 10^-4) × 100% = 6.61%. With the threshold set to 70%, the resulting double cell rate is 3.25%, which differs from half the multicellular rate (3.31%) by only |3.25 - 3.31| / 3.31 × 100% = 1.8%. It can be seen that cell identification using the above method is relatively accurate.

4. Sequence extraction

The sequence information corresponding to the barcodes determined to come from human and mouse cells is extracted from the original sequencing data, respectively. According to the library structure of the single cell transcriptome, the barcode sequence is derived from the first n bases of the R1 end of the paired-end sequencing (e.g., n=16 for the 10x single cell platform, as shown in fig. 2; the cell barcodes of domestic single cell platforms are longer, e.g., the cell barcode length of the android platform is 28). The specific sequence extraction mode is as follows:

1) The base matching coefficient of each sequence is calculated: the length of the cell barcode is denoted Lb, the number of bases actually matched is denoted Lm, and the base matching coefficient Mi is Mi = Lm/Lb.

2) Considering that the cell barcodes of different single cell platforms differ in length, the base matching coefficient threshold is set to 90%, i.e., when the cell barcode is 16 bases long, only 1 base is allowed to mismatch. When Mi = 100%, the corresponding read pair is output directly. When 90% ≤ Mi < 100%, the sequencing quality value of the mismatched base in the sequencing read is checked; if the sequencing quality value at that position is less than 10, the mismatch is considered to be caused by a sequencing error and the base is corrected, i.e., the barcode of the read is corrected to the barcode already identified as a cell, and the corresponding read pair is then output.
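The two steps above can be sketched as follows. This is a hedged illustration rather than the patented implementation: the function names, the representation of quality values as a list of integers, and the decision to skip a read whose mismatched base has a quality value of 10 or more are assumptions, since the text only states what happens when the quality value is below 10.

```python
from typing import List, Optional


def match_coefficient(read_bc: str, cell_bc: str) -> float:
    """Mi = Lm / Lb, where Lm is the number of positions that agree."""
    lb = len(cell_bc)
    lm = sum(a == b for a, b in zip(read_bc, cell_bc))
    return lm / lb


def resolve_barcode(read_bc: str, quals: List[int], cell_bc: str,
                    p2: float = 0.90, min_q: int = 10) -> Optional[str]:
    """Return the cell barcode under which to output the read pair, or None."""
    if len(read_bc) != len(cell_bc) or len(quals) != len(cell_bc):
        return None
    mi = match_coefficient(read_bc, cell_bc)
    if mi == 1.0:
        return cell_bc  # exact match: output directly
    if mi >= p2:
        mismatches = [i for i, (a, b) in enumerate(zip(read_bc, cell_bc)) if a != b]
        # Low quality at every mismatched position is treated as a sequencing error.
        if all(quals[i] < min_q for i in mismatches):
            return cell_bc  # corrected to the identified cell barcode
    # Assumption: all other cases are not output.
    return None


cell = "AAACCTGAGCTAACAA"            # a 16-base cell barcode from the table
one_off = "TAACCTGAGCTAACAA"         # one mismatch at position 0, Mi = 15/16
low_q = [5] + [30] * 15              # low quality at the mismatched base
high_q = [30] * 16                   # high quality everywhere
corrected = resolve_barcode(one_off, low_q, cell)
rejected = resolve_barcode(one_off, high_q, cell)
```

For a 16-base barcode, one mismatch gives Mi = 93.75% and passes the 90% threshold, while two mismatches give Mi = 87.5% and do not, which matches the statement that only 1 base may mismatch.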

5. Re-acquisition of cell-gene expression profile matrices for individual species

Although the expression profile matrix of human-derived cells can be obtained directly from step 3, that cell-gene expression profile is based on a mixed library of human and mouse, so the gene expression profile of each cell may be inaccurate due to the influence of homologous genes.

Therefore, the inventors re-analyzed the raw data of the human- and mouse-derived cells obtained in step 4, using the human or mouse reference genome, respectively, as the library file, and obtained the correct human and mouse cell-gene expression profiles.

Because the sequences extracted in step 4 all come from actual cells, unlike in ordinary single cell transcriptome sequencing, where a water-in-oil droplet may be empty, only genome alignment and UMI correction need to be performed when generating the expression profile matrix, and cell recognition is not required. If cell recognition were performed, an entire class of cells would very likely be removed as background, which would affect subsequent data mining.

By using the method, the gene expression profile matrix of human or mouse cells in the sample can be obtained before and after treatment for the PDX mouse model, and further data mining such as differential gene analysis, functional enrichment analysis and the like can be performed.

FIG. 3 shows clustering results based on mixed library analysis, and it can be seen that human and mouse cells are split on the left and right sides. Sequencing data corresponding to cells from humans were extracted and the results of the cluster analysis are shown in FIG. 4.

The following table shows the results of differential gene analysis based on mixed library analysis and extraction of raw data from human cells:

From the statistics, the number of differentially up-regulated genes obtained from the mixed library is particularly high. This is likely because human-derived and mouse-derived cells are both present in the mixed library, so that the differences are computed across all cells. Therefore, differential analysis was performed on only the 7 human-derived clusters (clusters 3, 4, 5, 8, 9, 10 and 11), and 5066 differential genes were found in total, of which 1209 were mouse genes; these are evidently inapplicable differential genes caused by the mixed library.

For the same barcode (from human cells), mixed library analysis yields a higher gene number because some reads align to the mouse genome, as shown in FIG. 5, where the upper part shows the mixed library analysis and the lower part shows the statistics obtained by analyzing the extracted reads against the human library.

EXAMPLE 2 construction of Multi-species cell-Integrated Gene expression profiling

Example 1 addresses only the analysis of cell expression profiles of individual species. In practice, some PDX models introduce only primary tumor cells; to understand how mouse immune cells act on the tumor cells, how the cells interact with each other, and so on, an expression profile matrix still needs to be generated from the mixed data. However, with the cell-gene expression profile data obtained in step 1 of example 1 from the mixed human and mouse library, the difference in gene names between species causes the clustered cells to divide distinctly into 2 populations (see fig. 6).

For this case, a corrected multi-species cell-gene expression profile matrix was obtained using the following steps:

(1) Homologous gene information for human and mouse is obtained from the Ensembl database. As shown in FIG. 7, the first two columns are the human gene ID and gene name, and the last two columns are the corresponding mouse gene ID and gene name. Human and mouse share 18973 homologous genes in total (i.e., starting from human genes, a mouse homolog can be found for 18973 gene IDs).

(2) The sequence corresponding to each gene is extracted from the human and mouse genomes, respectively, according to the chromosomal position and strand (positive or negative) information of the gene in the genome annotation file. Taking the Cd3d gene as an example, the extracted sequence information is shown in FIG. 8.

(3) The sequences are merged according to the homologous gene information obtained in step (1) (80 N bases are filled between the genes of the two species), and the species information is added to the sorted annotation file to obtain the integrated genes. As an example, the sequence of the integrated gene of the Cd3d gene is shown in FIG. 9.

(4) Based on the integrated gene sequences, a corresponding annotation file is constructed (each homologous gene as one independent chromosome, and the homologous gene sequences from the two species as 2 transcripts), for example, merge0000001, as shown in fig. 10.

(5) The sequencing reads are aligned using the integrated gene sequence set as the reference genome, and the alignment results of the reads against the integrated gene sequences are obtained.

(6) The alignment results are filtered. If a read aligns to a unique position on an integrated gene (the aligned sequence comes from a single species and the alignment does not span the N bases), the corresponding alignment information is retained. For a read with multiple alignments, if the alignments fall at several positions of the same integrated gene (all from the same species and none spanning the N bases), one piece of alignment information is retained; if the alignments fall on different integrated genes, the corresponding alignment information is filtered out. About 40% of the alignment information is obtained in this step.

(7) Based on the human and mouse barcode information obtained in example 1 (with double cell information removed), the expression profile matrix of cell-integrated genes is obtained from the filtered alignment information of step (6).
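The merging in steps (3) and (4) can be sketched as follows. This is only an illustration under stated assumptions: placing the human sequence before the mouse sequence, the 1-based inclusive coordinates, and the function name `build_integrated_gene` are hypothetical, since the actual layout is defined by FIG. 9 and FIG. 10; the 80 N spacer and the contig name merge0000001 come from the text.

```python
N_SPACER = "N" * 80  # 80 N bases between the two species' gene sequences


def build_integrated_gene(human_seq: str, mouse_seq: str):
    """Return the merged sequence and 1-based inclusive spans per species.

    Assumption: the human sequence is placed first, then the spacer, then mouse.
    """
    merged = human_seq + N_SPACER + mouse_seq
    human_span = (1, len(human_seq))
    mouse_start = len(human_seq) + len(N_SPACER) + 1
    mouse_span = (mouse_start, mouse_start + len(mouse_seq) - 1)
    return merged, {"human": human_span, "mouse": mouse_span}


# Toy sequences, not real gene sequences.
toy_human = "ACGT" * 5   # 20 bases
toy_mouse = "TTGCA" * 3  # 15 bases
merged_seq, spans = build_integrated_gene(toy_human, toy_mouse)

# One annotation record per species: the integrated gene acts as its own
# chromosome, and each species' sequence is one transcript on it.
records = [
    {"chrom": "merge0000001", "transcript": species, "start": s, "end": e}
    for species, (s, e) in spans.items()
]
```

Keeping the per-species spans alongside the merged sequence makes it possible to tell later whether an alignment lies in the human part, the mouse part, or crosses the spacer.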

Based on the expression profile matrix of the cell-integrated genes, downstream analyses such as cell clustering and differential gene searching can be performed. In the clustering results obtained in this way, human and mouse cells cluster well together (as shown in FIG. 11), which can be used to study the interaction mechanism of human and mouse cells.
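The filtering rule of step (6) and the matrix construction of step (7) can be sketched as follows. The record type `Aln`, its fields, and the function names are hypothetical. Two points are assumptions not stated in the text: a read is dropped whenever any of its alignments spans the N bases or its alignments hit both species, and the matrix counts reads rather than UMIs for brevity.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, NamedTuple, Set


class Aln(NamedTuple):
    read_id: str
    barcode: str
    gene: str      # integrated gene name, e.g. "merge0000001"
    species: str   # "human" or "mouse" part of the integrated gene
    spans_n: bool  # True if the alignment crosses the N spacer


def filter_alignments(alns: Iterable[Aln]) -> List[Aln]:
    """Keep at most one alignment per read, following the rule in step (6)."""
    by_read = defaultdict(list)
    for a in alns:
        by_read[a.read_id].append(a)
    kept = []
    for hits in by_read.values():
        if any(h.spans_n for h in hits):
            continue  # crosses the N bases (assumption: drop the whole read)
        if len({h.gene for h in hits}) != 1:
            continue  # aligned to different integrated genes: filtered out
        if len({h.species for h in hits}) != 1:
            continue  # hits both species (assumption: filtered out)
        kept.append(hits[0])  # unique hit, or one of several equivalent hits
    return kept


def count_matrix(kept: Iterable[Aln], singlet_barcodes: Set[str]) -> Counter:
    """Sparse (barcode, integrated gene) -> read count, double cells excluded."""
    counts = Counter()
    for a in kept:
        if a.barcode in singlet_barcodes:
            counts[(a.barcode, a.gene)] += 1
    return counts


alns = [
    Aln("r1", "BC1", "merge0000001", "human", False),  # unique: kept
    Aln("r2", "BC1", "merge0000001", "human", False),  # two hits, same gene
    Aln("r2", "BC1", "merge0000001", "human", False),
    Aln("r3", "BC2", "merge0000001", "mouse", False),  # different genes: dropped
    Aln("r3", "BC2", "merge0000002", "mouse", False),
    Aln("r4", "BC2", "merge0000002", "mouse", True),   # spans N: dropped
    Aln("r5", "BC3", "merge0000002", "mouse", False),  # BC3 is a double cell
]
kept = filter_alignments(alns)
matrix = count_matrix(kept, singlet_barcodes={"BC1", "BC2"})
```

In this toy input, reads r1, r2 and r5 survive the filter, and only r1 and r2 reach the matrix because BC3 is not in the set of human and mouse barcodes.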

EXAMPLE 3 Single species cell-Gene expression profiling System

This example provides a system for constructing single species cell-gene expression profile based on single cell transcriptome data of PDX model, as shown in fig. 12, comprising:

a database storage module for storing a reference genome of a human, a mouse, and a mixed genome library of a human and a mouse, wherein the mixed genome library is obtained by combining reference genome files of the human and the mouse with gene annotation files, as in the method described in example 1;

the first comparison module is respectively connected with the data input module and the database storage module and is used for comparing single-cell transcriptome sequencing data with the mixed genome database;

The cell identification module is connected with the first comparison module and is used for identifying whether the cell is a human cell or a mouse cell according to whether the proportion of expressed human genes or the proportion of expressed mouse genes in the cell is greater than or equal to 70%;

The sequence acquisition module is respectively connected with the cell identification module and the data input module and is used for extracting sequences identified as human cells and sequences identified as mouse cells from single cell transcriptome sequencing data based on cell barcode;

The cell-gene expression profile construction module is respectively connected with the sequence acquisition module and the database storage module and is used for comparing the obtained sequence of the human cell with a human reference genome and comparing the obtained sequence of the mouse cell with the mouse reference genome to obtain a corresponding cell-gene expression profile.

Example 4 Multi-species cell-Integrated Gene expression profiling System

This example provides a multi-species cell-integrated gene expression profile construction system based on example 3, as shown in fig. 13, comprising:

The cell identification module is connected with the first comparison module and is used for identifying whether the cell is a human cell, a mouse cell or a double cell according to whether the proportion of expressed human genes or the proportion of expressed mouse genes in the cell is greater than or equal to 70%;

The second comparison module is respectively connected with the data input module and the database storage module and is used for comparing the single cell transcriptome sequencing data with the integrated gene sequence set of the homologous genes of human and mouse to obtain a comparison result against the integrated genes, wherein the integrated gene sequence set is constructed as described in example 2;

The cell-integrated gene expression profile construction module is respectively connected with the cell identification module and the second comparison module and is used for obtaining a cell-integrated gene expression profile from the comparison result obtained by the second comparison module, based on the barcodes of the human cells and mouse cells identified by the cell identification module.

Example 5 single cell transcriptome sequencing data analysis System of PDX model

This example provides a single cell transcriptome sequencing data analysis system for the PDX model that combines the systems of examples 3 and 4, as shown in fig. 14, comprising:

The cell-gene expression profile construction module is respectively connected with the sequence acquisition module and the database storage module and is used for comparing the obtained sequence of the human cell with a human reference genome and comparing the obtained sequence of the mouse cell with a mouse reference genome to obtain a corresponding cell-gene expression profile,

Further comprises:

All documents mentioned in this disclosure are incorporated by reference in this disclosure as if each were individually incorporated by reference. Further, it will be appreciated that various changes and modifications may be made by those skilled in the art after reading the above teachings, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Claims

1. A method for analyzing single cell transcriptome data of a PDX model, comprising the steps of:

S5, comparing the single cell transcriptome sequencing data with an integrated gene sequence set of homologous genes of the human and the mouse to obtain a comparison result of the single cell transcriptome sequencing data and the integrated genes;

S6, obtaining a cell-integrated gene expression profile from the comparison result obtained in the step S5 based on the human cell and the mouse cell barcode identified in the step S2,

constructing corresponding annotation files according to the integrated gene sequences, wherein homologous genes are taken as an independent chromosome, homologous gene sequences from human and mice are taken as 2 transcripts,

The P1 is set so that the double cell rate differs from half the multicellular rate by no more than 5%, the double cell rate and multicellular rate being calculated by the following formulas:

2. The method of analyzing single cell transcriptome data of claim 1, wherein step S2 comprises:

S23, if Ph is greater than or equal to a first preset threshold value P1, the cell is identified as a human cell, and if Pm is greater than or equal to the first preset threshold value P1, the cell is identified as a mouse cell, and the rest are double cells.

3. The method according to claim 1, wherein in step S3, the extracting the sequence recognized as a human cell and the sequence recognized as a mouse cell specifically comprises:

S32, extracting the sequence according to Mi and a second preset threshold P2:

Wherein P2 is more than or equal to 80 percent.

4. A method of analyzing single cell transcriptome data of a PDX model according to claim 3, wherein the correction of sequencing reads to correct bases is correction of bases having a sequencing quality value <10 in the barcode of the sequencing sequence to bases at positions corresponding to the barcode of the matched cell.

5. The method of analyzing single cell transcriptome data according to claim 1, further comprising the step of filtering the comparison results in steps S5 and S6:

6. A computer device, comprising:

A memory for storing a computer program;

A processor for performing the steps of a method for analysis of single cell transcriptome data of a PDX model according to any one of claims 1-5 when said computer program is executed.

7. A computer-readable storage medium comprising,

The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a method for analyzing single cell transcriptome data of a PDX model according to any one of claims 1 to 5.