CN116364182A - Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data - Google Patents

Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data Download PDF

Info

Publication number
CN116364182A
CN116364182A CN202310470734.5A CN202310470734A CN116364182A CN 116364182 A CN116364182 A CN 116364182A CN 202310470734 A CN202310470734 A CN 202310470734A CN 116364182 A CN116364182 A CN 116364182A
Authority
CN
China
Prior art keywords
analysis
cell
gene
cells
tcr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310470734.5A
Other languages
Chinese (zh)
Inventor
李丽翠
董亚晨
朱彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Majorbio Bio Pharm Technology Co ltd
Original Assignee
Shanghai Majorbio Bio Pharm Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Majorbio Bio Pharm Technology Co ltd filed Critical Shanghai Majorbio Bio Pharm Technology Co ltd
Priority to CN202310470734.5A priority Critical patent/CN116364182A/en
Publication of CN116364182A publication Critical patent/CN116364182A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses an integrated analysis method of single cell transcriptome and TCR and BCR sequencing data. The bioinformatics analysis flow of single cell transcriptome+TCR/BCR sequencing data combined analysis is established, single cell transcriptome data analysis and immune group library data analysis can be simultaneously carried out directly through one set of flow, and two sets of chemical data are combined to obtain the immune group library characteristics of T/B cells with different transcriptome data.

Description

Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data
Technical Field
The invention belongs to the technical field of bioinformatics, and relates to an integrated analysis method of single cell transcriptome and TCR and BCR sequencing data.
Background
Single cell sequencing can determine the transcriptional expression level of a gene at the single cell level. Since 2009, the professor Shang Fu in Beijing university published a first single-cell trace mRNA library-building sequencing scheme, single-cell sequencing technology and industry developed rapidly, fluidigm, smart-seq, 10× Genomics, BDRhapsody and the like were sequentially put forward a single-cell sequencer, and single-cell transcriptome sequencing was widely applied to research in fields of tumor diseases, growth and development, neuroscience, immune mechanisms and the like, and particularly in basic research of tumor immune microenvironment, infectious diseases and other blood related disease mechanisms, single-cell transcriptome sequencing has become a current popular research means.
T cells and B cells are two types of cells that the body is responsible for specific immunity. T Cell Receptor (TCR) and B Cell Receptor (BCR) are receptor proteins on the surface of T cell membrane and B cell membrane, respectively, which can specifically recognize and bind antigen. Theoretically, a TCR and BCR (TCR/BCR) are expressed on only one cell, and 10 can be generated by V (D) J rearrangement in order to be able to resist a large number of different foreign pathogens 6 ~10 8 TCR/BCR. When the organism generates specific immunity, the T/B cells which can resist the corresponding antigens can be cloned and amplified in a large quantity, so that pathogens or related specific antigens can be rapidly cleared. Through immune repertoire (TCR/BCR) sequencing, the clone typing of various T cells and B cells can be detected and found, the immune microenvironment change when tumors or infectious diseases occur is revealed, the target point of immune treatment is found, and meanwhile, the TCR/BCR sequencing can also be used for antibody development, medication, vaccine evaluation and other directions.
Typically, single cell transcriptome and TCR/BCR sequencing studies are in a split state, and single data analysis and recombination are inaccurate, e.g., CN113838528A discloses a single cell level coupling visualization method based on single cell immune repertoire data, which includes: screening and collecting the original data set of the existing single-cell immune database to obtain a plurality of single-cell immune database original data sets, extracting the characteristics of each single-cell immune database original data set to obtain TCR/BCR-transcriptome information of each cell in each single-cell immune database original data set, and constructing a database and a terminal browser interface to realize interactive visualization of all information of the single-cell immune database original data sets in the terminal browser interface.
In the above context, single cell transcriptome+tcr/BCR combined sequencing analysis procedures are expected to simultaneously analyze single cell transcriptome and TCR/BCR sequencing data, and to achieve combined analysis of single cell transcriptome data and immune repertoire data. Therefore, there is a need to establish a bioinformatics analysis method for single cell transcriptome+tcr/BCR sequencing data and to construct a complete set of automated data analysis procedures.
Disclosure of Invention
Aiming at the defects and actual demands of the prior art, the invention provides an integrated analysis method for single-cell transcriptome and TCR and BCR sequencing data, which establishes a bioinformatics analysis flow of single-cell transcriptome+TCR/BCR sequencing data combined analysis to obtain immune repertoire characteristics of T/B cells with different transcriptome data.
In order to achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, the invention provides a method of integrated analysis of single cell transcriptome with TCR and BCR sequencing data, the method comprising the steps of:
(1) Quality evaluation and statistics are respectively carried out on single cell transcriptome sequencing data and TCR/BCR sequencing data of each sample;
(2) Performing sequence comparison and annotation on reads of the TCR/BCR obtained in the step (1) and a reference V (D) Jfragments sequence to obtain a Consensus sequence and a closype typing;
comparing the reads of the single cell transcriptome obtained in the step (1) with a reference genome respectively, and quantifying the gene expression quantity of each sample;
(3) Carrying out batch-removing and combining on the gene expression quantity of each sample obtained in the step (2), and removing batch effect among samples;
(4) Classifying cell subsets and performing differential analysis based on the single cell transcriptome sequencing data processed in the step (3);
(5) Annotating each cell population of step (4);
(6) Performing V (D) J gene characteristic analysis, V-Jpaired characteristic analysis and CDR3 characteristic analysis on the Clonotype typing obtained in the step (2);
(7) Combining the Clonotype typing obtained in the step (2), and then carrying out multi-sample comparison analysis;
(8) Mapping the obtained TCR/BCR into single cell transcriptome data according to Barcode, and obtaining the Clonotype characteristic of the TCR/BCR in the single cell transcriptome data.
In the invention, different from the existing single-cell transcriptome analysis and immune repertoire analysis, a set of flow paths (shown in figure 1) for simultaneously carrying out single-cell transcriptome data analysis and immune repertoire data analysis is established, and two sets of data are combined for analysis to obtain immune repertoire characteristics of T/B cells with different transcriptome data.
Preferably, the step (1) specifically includes:
and carrying out data quality statistics and basic quality control on single-cell transcriptome sequencing data and T/B lymphocyte genome sequence data generated by a sequencer to remove reads containing sequencing joints, unknown bases and quality values lower than 10, so as to obtain high-quality reads.
Preferably, step (2) is performed using 10 x Genomics official analysis software cellrange rmulti.
Preferably, the step (2) specifically includes:
(2-1) downloading the reference genome index file and the TCR/BCR reference sequence index file which are already constructed by the human or mouse from the official website according to the species to be tested;
(2-2) aligning and annotating the reads of the TCR/BCR obtained in the step (1) with the reference V (D) Jfragments sequences to obtain a consistent sequence and a Clonotype typing (Clonotype); the Clonotype typing is conducted by CellRanger according to CDR3 amino acid sequences in the spliced Consensu, and the effective Barcode is abundance of the Clonotype;
(2-3) comparing the reads of the single cell transcriptome obtained in the step (1) with a reference genome and quantifying the gene expression level of each sample;
preferably, the aligning and annotating in the step (2-2) specifically includes:
sequence alignment: aligning reads of TCR/BCR obtained in step (1) with V (D) Jfragments of TCR/BCR reference sequence using 10 XGenomics official analysis software CellRangermulti, retaining a pair reads with at least 15bp alignment to the fragment region;
contig assembly: debruijngaph assembly with k=20 for reads aligned by sequence using cellrange rms;
cell filtration: filtering the Barcode by using CellRangermulti;
assembling a Contig annotation: (a) The assembled Contig sequence is aligned to the vdjc region through Smith-Waterman and germline sequence alignment, and labeling is performed; (b) Sequence finding start codon, finding CDR3motif (Cys-FGXG/WGXG); (c) The contig marker was screened as productive according to the following conditions: the v region contains the start codon, cdr region and v-j region does not contain the stop codon, spanning the v-j region;
contig screening: cells filtered by Barcode are retained, and there are more than 2 UMI-supported Contents of ProductiveContig annotated at Contig;
sequence assembling annotation, namely further splicing all Contig obtained after Contig screening to obtain a consistent sequence (Consensus sequence) supported by a sample CDR 3.
Preferably, the aligning and quantifying in step (2-3) specifically includes:
(2-3-1) aligning the cDNA sequence fragments in the single cell transcriptome sequencing data to a reference genome;
(2-3-2) filtering and correcting the barcodes and UMIs: the method comprises the steps that (1) the barcode extracted from a Read1 sequence is aligned with a known barcode sequence of a database, all barcodes which are completely matched with the database and have at most one mismatch are reserved, the mismatch only occurs at a position with a quality value lower than 10 bases, and for the mismatched barcode, the mismatched barcode is replaced with a white list barcode with highest posterior probability according to the observation frequency of the barcode in a sample; removing the single oligomeric strand, UMIs containing N or containing bases with mass values below 10; regarding the low-quality base, it should be noted that the sequenced sequence has a quality value corresponding to each base, the quality value represents the accuracy of the base, and a quality value of 10 represents the error rate of 1% for the base sequencing;
(2-3-3) based on the number of UMI in each cell of each gene, a corresponding quantitative result of the gene expression level was obtained.
The quantitative process uses a large amount of computing resources, consumes long time, uses an HPC task delivery system based on SLURM to calculate each sample in parallel, and increases the timeliness of analysis.
Preferably, step (3) comprises:
(3-1) low quality cell filtration and normalization of individual samples, retaining high quality cells for subsequent analysis;
(3-2) after removing the low-quality cells, combining the gene expression quantity matrices of all samples into one expression quantity matrix for subsequent analysis;
preferably, in the step (3-1), a high threshold UpperLimit of each screening condition is determined according to the formula (1);
upperlimit=75% quantile + (75% quantile-25% quantile) ×1.5 (1);
the criteria for low quality cells were:
(a) The number of genes identified in single cells is 200-UpperLimit;
(b) The total number of UMI in a single cell is less than UpperLimit;
(c) The expression quantity proportion of the mitochondrial gene of UMI in single cells is smaller than UpperLimit;
preferably, in step (3-2), all cells of all samples are combined into one expression level matrix according to the gene names, the combined expression level matrix is determinant, the row names are gene names, and the column names are the combination of cell barcode and sample names.
Preferably, the batch removing method of the integral of the semat in the step (3) is used for integrating data and removing batch effects among samples, and specifically comprises the following steps:
data normalization: the expression level is standardized by using the Log Normalization method of the Normalization command of the SEurat software, and the expression level is standardized and calculated as formula (2);
g=log [ (UMIG/(TotalUMI) ×10000+1] formula (2);
wherein G is the standardized expression level of the target gene G, and UMIG is the UMI number of the gene G in the target cell; totalemi is the sum of all UMI amounts in the target cells;
characteristic gene selection: selecting a feature vector based on a variance stabilizing conversion (vst) method, calculating the mean value and variance of each gene, and respectively log 10-transforming, fitting the relationship between the logarithm (variance) and the logarithm (mean) by using a local polynomial regression, normalizing the feature value by using the observed mean value and the expected variance (given by the fitting line), and if the normalized value is greater than a clipping value, setting the normalized value as the clipping value, wherein the clipping value is an evolution of the total number of cells; according to the method, corresponding to each gene, calculating the variance of the normalized values of all cells, wherein the variance represents the single cell dispersion after the expression of the control mean, sorting the characteristic values according to the variance, and selecting 2000 genes with the highest normalized variances as the characteristic genes;
identifying a common anchor: using Canonical Correlation Analysis (CCA) to reduce dimensions, projecting the datasets of two samples into a correlated low dimensional space, determining a biological state common among cells, identifying two similar cell pairs as anchor points; under the condition of no supervision, sequentially identifying anchor points between every two samples, scoring the anchor points, reducing the weight of the anchor points with lower scores, and improving the weight of the anchor points with higher scores, thereby determining the anchor points of all samples and constructing a weight matrix;
batch effect correction: the difference between the anchor point cells and the expression quantity is converted with the weight matrix to obtain a conversion matrix, and the original expression quantity matrix is subtracted from the conversion matrix to obtain an expression quantity matrix after batch removal.
Preferably, the step (4) specifically includes the following steps:
(4-1) clustering and grouping cells using a graph theory-based clustering algorithm, comprising the steps of:
(a) Constructing a clustering relation among cells: constructing a KNN cluster relation diagram based on Euclidean distance by utilizing the first 30 most obvious main components obtained in PCA analysis;
(b) Optimizing the weight value of the clustering relation distance between cells: optimizing a weight value of the distance between cells by using Jaccard similarity;
(c) Clustering and grouping: performing cell grouping optimization by using a Louvain algorithm;
(4-2) cell subpopulation comparative analysis:
extracting barcodes and UMI of different samples in each group according to the clustering result of the step (4-1), and respectively drawing different cell numbers of sub-groups among different samples to form a bar graph, a gene and UMI distribution difference box graph;
(4-3) the criteria for determining differentially expressed genes are:
(a) The screened genes are expressed in the target subgroup or the control subgroup and more than 25% of samples;
(b) The P value is less than or equal to 0.05;
(c) The gene expression multiple logFC is more than or equal to 0.25, namely the gene up-regulation multiple is more than or equal to 2A 0.25.
Preferably, in the step (5), correlation analysis is performed on the single-cell transcriptome expression quantity matrix based on GEO/NCBI annotated cell types and the sample expression quantity matrix to obtain annotation results of the analyzed cell group, and the method specifically comprises the following steps:
(5-1) screening the reference data set for genes consistent with the test data set for subsequent analysis;
(5-2) identifying differentially expressed genes between each tag of the reference dataset, performing differential analysis on the median of gene expression in each pair of tags, and using the first 10 genes with the largest differences as genes for identifying two tags, thereby obtaining top10 differentially expressed genes between two cell types, wherein all gene sets are de-duplicated and used as marker genes for pair comparison;
(5-3) performing spearman correlation analysis on each cell of the test data set and the reference data set based on the marker gene, wherein the 80 percentile of the correlation coefficient of each cell type is used as a score of the cell type, so that the correlation scores of the cells and all the tags are obtained;
(5-4) identifying all tags having a score less than the highest score of no more than 0.05 as potential tags for the cell, identifying a new set of marker genes again from the reference dataset based on the potential tags, and repeating the calculation of the score using only the new marker genes, and using the highest scoring tag as annotated tag for the cell.
And selecting the cell type with the largest annotated cell number as the annotated result of the group according to the cell number corresponding to the cell type annotated by each group.
Preferably, the V (D) J gene feature analysis in the step (6) includes statistical analysis of V/J gene abundance distribution, statistical analysis of CDR3V/J gene sequence length distribution, and clustering analysis of V/J gene expression levels between samples.
Preferably, the V-Jpanned profiling includes statistical analysis of V-Jpanned abundance distribution and frequency distribution.
Preferably, the CDR3 profiling includes analysis using a CDR3 abundance profile and a CDR3 base sequence length profile.
Preferably, the combining in step (7) comprises combining cells of all samples together, and rerun the clonal grouping algorithm to obtain a summary table of clostypes.
Preferably, the multiple sample comparison analysis includes inter-sample clostype diversity analysis, overlay closotype cluster analysis, and overlay closotype variance analysis.
Preferably, the inter-sample clostype diversity analysis involves combining the names of the same clostype between different samples, and then displaying the number of clostypes shared (overlapping) and unique (nonveriapping) between samples with a wien diagram.
Preferably, the overlay closetype cluster analysis comprises cluster analysis by plotting a sample MDS map and a sample cluster map.
Preferably, the overlay closotype differential analysis includes an abundance comparison analysis of the overlay closotype of the inter-sample top10 and a differential analysis by plotting the same V-Jpairs frequency Ciros across samples.
Preferably, the tsne/umap plot in step (8) is used to show whether TCR/BCR and Top5 cloneteype are present in which cells.
Preferably, the method for integrated analysis of single cell transcriptome with TCR and BCR sequencing data further comprises the step of generating a junction report for single cell transcriptome and TCR/BCR automated analysis results, respectively.
Preferably, the step of generating the report of the questions includes sorting results in the automated process according to the products, and generating a catalog of results and a report of web pages.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a method for simultaneously analyzing single-cell transcriptome and T/BCR sequencing data, which can reveal the real complexity of immune response through the high diversity of antigen receptors of T cells and B cells and the expression of cellular genes and can accelerate the new knowledge of adaptive immune response;
(2) The method has the advantages that the analysis is automatically completed and the junction report is generated through one-key code concatenation, the flow is complete, the content is rich, the visual picture is displayed, and the interaction type junction report enables a user to evaluate the characteristics of the immune group library more intuitively and conveniently;
(3) The whole flow is reconstructed based on the Perl language, two HPC task delivery systems of SGE and SLURM are introduced, and timeliness of analysis is increased by utilizing parallel calculation.
Drawings
FIG. 1 is a schematic diagram of a single cell transcriptome+TCR/BCR sequencing data combinatorial analysis scheme;
FIG. 2 is a thermal diagram of a V-JPAireuse distribution in an embodiment of the present invention;
FIG. 3 is a diagram of a V-Jpameduse distribution circle in an embodiment of the present invention;
FIG. 4 is a histogram of statistics of the V gene duty ratio of different samples according to an embodiment of the present invention;
FIG. 5 is a graph showing the distribution heat of different samples V-JPAireuse according to an embodiment of the present invention;
FIG. 6 is a statistical diagram of T/BCR diversity in an embodiment of the present invention;
FIG. 7 is a diagram of T/BCR data tsne included in single cell transcriptome data according to an embodiment of the present invention;
FIG. 8 is a top5 clolotype profile according to an embodiment of the invention.
Detailed Description
The technical means adopted by the invention and the effects thereof are further described below with reference to the examples and the attached drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof.
The specific techniques or conditions are not identified in the examples and are described in the literature in this field or are carried out in accordance with the product specifications. The reagents or equipment used were conventional products available for purchase through regular channels, with no manufacturer noted.
Example 1
The sequencing data of this example included 2 samples of human Peripheral Blood (PBMC) with single cell transcriptome + TCR + BCR sequencing.
In this example, for the integrated analysis of single cell transcriptome and TCR/BCR sequencing data, see fig. 1, the quality control is performed on the raw data, the single cell transcriptome, TCR and BCR data are analyzed separately, and the single cell transcriptome and TCR/BCR sequencing data are analyzed in an integrated manner. Finally, sorting and visually displaying analysis results; the specific steps of the analysis method are as follows:
1. sequencing data quality control: quality evaluation and statistics are respectively carried out on single cell transcriptome+TCR/BCR sequencing data of each sample, a main flow code calls a sequencing data quality control sub-script according to the sample name and the original data, sequencing data quality control is carried out on the sub-samples, a result is output to a naming file of a calling analysis method, and the sub-samples are output and stored;
2. quantitative analysis of single cell transcriptome data, TCR/BCR annotation and typing: performing sequence comparison and annotation on filtered TCR/BCR high-quality reads and reference V (D) Jfragments sequences to obtain a Consensus sequence and a closype typing; the high-quality reads of the single-cell transcriptome are respectively compared with a reference genome to quantify the gene expression quantity of each sample, a main flow code calls a sub-script according to the sample name and the original data, quantitative analysis of single-cell transcriptome data and TCR/BCR data annotation and typing are carried out on each sample, each parameter adopts a default value, the result is output to a naming file of a calling analysis method, and the sub-samples are output and stored;
3. single cell transcriptome data analysis: respectively carrying out cell filtration, multi-sample batch merging, data dimension reduction clustering and cell annotation analysis on each sample by a program, respectively obtaining quantitative results of each sample by the program according to sample names, constructing a Seperat object in an R environment for analysis, and outputting analysis results to a named file of a calling analysis method, wherein the analysis results of single sample are output and stored in a sample;
4. TCR/BCR analysis: the V (D) J gene characteristic analysis is carried out on each sample through a program, wherein the V (D) J gene characteristic analysis comprises the statistical analysis of the distribution situation of the abundance of the V/J gene and the statistical analysis of the distribution situation of the length of the CDR3V/J gene sequence. Performing sample-to-sample clostype diversity analysis, overlay closingclostype cluster analysis and overlay clostype difference analysis, calling a sub-script by a main flow code, annotating and typing result paths of each sample TCR/BCR according to sample names, performing statistical analysis, outputting results of each parameter by default values into a named file of a calling analysis method, outputting and storing the classified samples, and displaying a visualized result, wherein the visualized result is shown in fig. 2-6, the visual result is shown in fig. 2 as a thermal diagram of a VJ gene pair, each behavior V gene is listed as a J gene, and TRA and TRB are respectively shown; FIG. 3 is a chart of the use of circos for the VJ gene pairs, outer circles representing each gene, center connecting lines representing the VJ gene pairs, line thickness representing the abundance of the VJ gene pairs, TRA and TRB being shown separately; FIG. 4 is a bar graph of the abundance of the V (D) J gene in different samples, with the abscissa representing the name of the gene and the ordinate representing the abundance of the corresponding gene in each sample; FIG. 5 is a heat map of VJ gene pairs for different samples, behavior gene names, listed as samples; FIG. 6 is a Sample TCR/BCR diversity index showing the diversity of immune repertoires in each Sample (Sample) or Group (Group).
5. Joint analysis: single cell transcriptome data and TCR/BCR data from 2 samples were analyzed in combination by procedure and the analysis was visualized. In this embodiment, the input of the program includes 2 samples of single-cell transcriptome combined expression quantity and a cluster result constructed SEurat object, the program combines two types of data according to cell Barcode, single-cell transcriptome combined TCR and single-cell transcriptome combined BCR are respectively analyzed, the results are output to a file named by calling an analysis method, the TCR and the BCR are separately displayed, the visual results are shown in fig. 7 and 8, fig. 7 is a TCR distribution UMAP diagram, and the TCR combined single-cell transcriptome shows the existence and distribution of the TCR in cells; FIG. 8 is a TCRtop5Clonotype profile, TCR in combination with single cell transcriptome data, showing the distribution of Top5 clonality in cells in the TCR data.
The invention is built by using a Perl language, and the subsequent analysis is based on the developed R script.
In summary, the present invention provides a method for simultaneously analyzing single cell transcriptome and T/BCR sequencing data, which can reveal the real complexity of immune response through the high diversity of antigen receptors of T cells and B cells and the cellular gene expression, and can accelerate the new knowledge of adaptive immune response.
The applicant states that the detailed method of the present invention is illustrated by the above examples, but the present invention is not limited to the detailed method described above, i.e. it does not mean that the present invention must be practiced in dependence upon the detailed method described above. It should be apparent to those skilled in the art that any modification of the present invention, equivalent substitution of raw materials for the product of the present invention, addition of auxiliary components, selection of specific modes, etc., falls within the scope of the present invention and the scope of disclosure.

Claims (10)

1. A method for integrated analysis of single cell transcriptome with TCR and BCR sequencing data, comprising the steps of:
(1) Quality evaluation and statistics are respectively carried out on single cell transcriptome sequencing data and TCR/BCR sequencing data of each sample;
(2) Performing sequence comparison and annotation on reads of the TCR/BCR obtained in the step (1) and a reference V (D) Jfragments sequence to obtain a consistent sequence and a closype typing;
comparing the reads of the single cell transcriptome obtained in the step (1) with a reference genome respectively, and quantifying the gene expression quantity of each sample;
(3) Carrying out batch-removing and combining on the gene expression quantity of each sample obtained in the step (2), and removing batch effect among samples;
(4) Classifying cell subsets and performing differential analysis based on the single cell transcriptome sequencing data processed in the step (3);
(5) Annotating each cell population of step (4);
(6) Performing V (D) J gene characteristic analysis, V-Jpaired characteristic analysis and CDR3 characteristic analysis on the Clonotype typing obtained in the step (2);
(7) Combining the Clonotype typing obtained in the step (2), and then carrying out multi-sample comparison analysis;
(8) Mapping the obtained TCR/BCR into single cell transcriptome data according to Barcode, and obtaining the Clonotype characteristic of the TCR/BCR in the single cell transcriptome data.
2. The method of claim 1, wherein step (1) comprises:
and carrying out data quality statistics and basic quality control on single-cell transcriptome sequencing data and T/B lymphocyte genome sequence data generated by a sequencer to remove reads containing sequencing joints, unknown bases and quality values lower than 10, so as to obtain high-quality reads.
3. The method of claim 1 or 2, wherein step (2) comprises:
(2-1) downloading the reference genome index file and the TCR/BCR reference sequence index file which are already constructed by the human or mouse from the official website according to the species to be tested;
(2-2) aligning and annotating the reads of the TCR/BCR obtained in the step (1) with the reference V (D) Jfragments sequences to obtain a consistent sequence and Clonotype typing;
(2-3) comparing the reads of the single cell transcriptome obtained in the step (1) with a reference genome and quantifying the gene expression level of each sample;
preferably, the aligning and annotating in the step (2-2) specifically includes:
sequence alignment: aligning reads of TCR/BCR obtained in step (1) with V (D) Jfragments of TCR/BCR reference sequence using 10 XGenomics official analysis software CellRangermulti, retaining a pair reads with at least 15bp alignment to the fragment region;
contig assembly: debruijngaph assembly with k=20 for reads aligned by sequence using cellrange rms;
cell filtration: filtering the Barcode by using CellRangermulti;
assembling a Contig annotation: (a) The assembled Contig sequence is aligned to the vdjc region through Smith-Waterman and germline sequence alignment, and labeling is performed; (b) Sequence finding start codon, finding CDR3motif (Cys-FGXG/WGXG); (c) The contig marker was screened as productive according to the following conditions: the v region contains the start codon, cdr region and v-j region does not contain the stop codon, spanning the v-j region;
contig screening: cells filtered by Barcode are retained, and there are more than 2 UMI-supported Contents of ProductiveContig annotated at Contig;
and (3) sequence assembly annotation, namely further splicing all the Contigs obtained after the Contig screening to obtain a consistent sequence supported by a sample CDR 3.
4. The method of claim 3, wherein the aligning and quantifying of step (2-3) specifically comprises:
(2-3-1) aligning the cDNA sequence fragments in the single cell transcriptome sequencing data to a reference genome;
(2-3-2) filtering and correcting the barcodes and UMIs: the method comprises the steps that (1) the barcode extracted from a Read1 sequence is aligned with a known barcode sequence of a database, all barcodes which are completely matched with the database and have at most one mismatch are reserved, the mismatch only occurs at a position with a quality value lower than 10 bases, and for the mismatched barcode, the mismatched barcode is replaced with a white list barcode with highest posterior probability according to the observation frequency of the barcode in a sample; removing the single oligomeric strand, UMIs containing N or containing bases with mass values below 10;
(2-3-3) based on the number of UMI in each cell of each gene, a corresponding quantitative result of the gene expression level was obtained.
5. The method of integrated analysis of single cell transcriptome with TCR and BCR sequencing data of any of claims 1-4, wherein step (3) comprises:
(3-1) low quality cell filtration and normalization of individual samples, retaining high quality cells for subsequent analysis;
(3-2) after removing the low-quality cells, combining the gene expression quantity matrices of all samples into one expression quantity matrix for subsequent analysis;
preferably, in the step (3-1), a high threshold UpperLimit of each screening condition is determined according to the formula (1);
upperlimit=75% quantile + (75% quantile-25% quantile) ×1.5 (1);
the criteria for low quality cells were:
(a) The number of genes identified in single cells is 200-UpperLimit;
(b) The total number of UMI in a single cell is less than UpperLimit;
(c) The expression quantity proportion of the mitochondrial gene of UMI in single cells is smaller than UpperLimit;
preferably, in step (3-2), all cells of all samples are combined into one expression level matrix according to the gene names, the combined expression level matrix is determinant, the row names are gene names, and the column names are the combination of cell barcode and sample names.
6. The method of claim 1-5, wherein the step (3) of integrating data by the batch removal method of the integral of semat and removing batch effects between samples comprises:
data normalization: the expression level is standardized by using the Log Normalization method of the Normalization command of the SEurat software, and the expression level is standardized and calculated as formula (2);
g=log [ (UMIG/(TotalUMI) ×10000+1] formula (2);
wherein G is the standardized expression level of the target gene G, and UMIG is the UMI number of the gene G in the target cell; totalemi is the sum of all UMI amounts in the target cells;
characteristic gene selection: selecting a feature vector based on a variance stabilizing conversion method, calculating the mean value and variance of each gene, respectively carrying out log10 conversion, fitting the relation between the variance and the mean value by using local polynomial regression, normalizing the feature value by using the observed mean value and the expected variance, and setting the normalized value as a clipping value if the normalized value is larger than the clipping value, wherein the clipping value is an evolution of the total number of cells;
identifying a common anchor: using typical correlation analysis to reduce dimension, projecting the data sets of two samples into a correlated low-dimensional space, judging the biological state shared among cells, and identifying two similar cell pairs as anchor points; under the condition of no supervision, sequentially identifying anchor points between every two samples, scoring the anchor points, reducing the weight of the anchor points with lower scores, and improving the weight of the anchor points with higher scores, thereby determining the anchor points of all samples and constructing a weight matrix;
batch effect correction: the difference between the anchor point cells and the expression quantity is converted with the weight matrix to obtain a conversion matrix, and the original expression quantity matrix is subtracted from the conversion matrix to obtain an expression quantity matrix after batch removal.
7. The method of claim 1-6, wherein step (4) comprises the steps of:
(4-1) clustering and grouping cells using a graph theory-based clustering algorithm, comprising the steps of:
(a) Constructing a clustering relation among cells: constructing a KNN cluster relation diagram based on Euclidean distance by utilizing the first 30 most obvious main components obtained in PCA analysis;
(b) Optimizing the weight value of the clustering relation distance between cells: optimizing a weight value of the distance between cells by using Jaccard similarity;
(c) Clustering and grouping: performing cell grouping optimization by using a Louvain algorithm;
(4-2) cell subpopulation comparative analysis:
extracting barcodes and UMI of different samples in each group according to the clustering result of the step (4-1), and respectively drawing different cell numbers of sub-groups among different samples to form a bar graph, a gene and UMI distribution difference box graph;
(4-3) the criteria for determining differentially expressed genes are:
(a) The screened genes are expressed in the target subgroup or the control subgroup and more than 25% of samples;
(b) The P value is less than or equal to 0.05;
(c) The gene expression multiple logFC is more than or equal to 0.25, namely the gene up-regulation multiple is more than or equal to 2A 0.25.
8. The method of claim 1-7, wherein step (5) comprises the steps of:
(5-1) screening the reference data set for genes consistent with the test data set for subsequent analysis;
(5-2) identifying differentially expressed genes between each tag of the reference dataset, performing differential analysis on the median of gene expression in each pair of tags, and using the first 10 genes with the largest differences as genes for identifying two tags, thereby obtaining top10 differentially expressed genes between two cell types, wherein all gene sets are de-duplicated and used as marker genes for pair comparison;
(5-3) performing spearman correlation analysis on each cell of the test data set and the reference data set based on the marker gene, wherein the 80 percentile of the correlation coefficient of each cell type is used as a score of the cell type, so that the correlation scores of the cells and all the tags are obtained;
(5-4) identifying all tags having a score below the highest score of no more than 0.05 as potential tags for the cell, identifying a new set of marker genes again from the reference dataset based on the potential tags, and repeatedly calculating scores using only the new marker genes, and using the highest scoring tag as annotated tag for the cell;
and selecting the cell type with the largest annotated cell number as the annotated result of the group according to the cell number corresponding to the cell type annotated by each group.
9. The method of claim 1-8, wherein the V (D) J gene profiling in step (6) comprises statistical analysis of V/J gene abundance distribution, CDR3V/J gene sequence length distribution, and sample-to-sample V/J gene expression levels;
preferably, the V-Jpanned profile includes a statistical analysis of V-Jpanned abundance distribution and frequency distribution;
preferably, the CDR3 profiling includes analysis using a CDR3 abundance profile and a CDR3 base sequence length profile.
10. The method of integrated analysis of single cell transcriptome with TCR and BCR sequencing data according to any of claims 1-9, wherein the combining in step (7) comprises combining cells of all samples together, re-running a clonal grouping algorithm, obtaining a summary table of clostypes;
preferably, the multiple sample comparison analysis comprises inter-sample clostype diversity analysis, overlay closotype cluster analysis, and overlay closotype variance analysis;
preferably, the inter-sample clostype diversity analysis includes merging the names of the same Clonotype among different samples, and then displaying the number of clostype sharing overlapping and unique nonveriapping among samples with a wien diagram;
preferably, the overlay closetype cluster analysis comprises cluster analysis by plotting a sample MDS map and a sample cluster map;
preferably, the overlay closotype difference analysis includes an abundance comparison analysis of the overlay closotype of the inter-sample top10 and a difference analysis by plotting the same V-Jpairs frequency Ciros plot between samples;
preferably, the tsne/umap plot in step (8) is used to show whether TCR/BCR and Top5 cloneteype are present in which cells.
CN202310470734.5A 2023-04-27 2023-04-27 Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data Pending CN116364182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310470734.5A CN116364182A (en) 2023-04-27 2023-04-27 Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310470734.5A CN116364182A (en) 2023-04-27 2023-04-27 Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data

Publications (1)

Publication Number Publication Date
CN116364182A true CN116364182A (en) 2023-06-30

Family

ID=86909341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310470734.5A Pending CN116364182A (en) 2023-04-27 2023-04-27 Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data

Country Status (1)

Country Link
CN (1) CN116364182A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118280456A (en) * 2024-06-03 2024-07-02 江西师范大学 Mitochondrial DNA data normalization method and integrated application platform
CN118380052A (en) * 2024-06-24 2024-07-23 安诺优达基因科技(北京)有限公司 Genome structure prediction method and electronic device
CN118571324A (en) * 2024-07-31 2024-08-30 杭州华大生命科学研究院 Data processing method and device, storage medium and electronic device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118280456A (en) * 2024-06-03 2024-07-02 江西师范大学 Mitochondrial DNA data normalization method and integrated application platform
CN118380052A (en) * 2024-06-24 2024-07-23 安诺优达基因科技(北京)有限公司 Genome structure prediction method and electronic device
CN118571324A (en) * 2024-07-31 2024-08-30 杭州华大生命科学研究院 Data processing method and device, storage medium and electronic device

Similar Documents

Publication Publication Date Title
US20240218445A1 (en) Methods for clonotype screening
Barennes et al. Benchmarking of T cell receptor repertoire profiling methods reveals large systematic biases
CN116364182A (en) Integrated analysis method for single cell transcriptome and TCR and BCR sequencing data
Deutsch et al. Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics
US20190332963A1 (en) Systems and methods for visualizing a pattern in a dataset
CN110033860B (en) Method for improving detection rate of genetic metabolic diseases based on machine learning
CA2823061A1 (en) Data analysis of dna sequences
CN115295084A (en) Method and system for visually analyzing data of tumor neoantigen immune repertoire
CN115458052A (en) Gene mutation analysis method, equipment and storage medium based on first generation sequencing
US9953133B2 (en) Biological data annotation and visualization
CN114496089B (en) Pathogenic microorganism identification method
CN112786103A (en) Method and device for analyzing feasibility of target sequencing Panel for estimating tumor mutation load
JP5403563B2 (en) Gene identification method and expression analysis method in comprehensive fragment analysis
CN112365991B (en) Method for mining doubt signal facing SRS combined adverse reaction signal
US10672505B2 (en) Biological data annotation and visualization
van Dongen Fast multi-resolution consensus clustering
CN117789823B (en) Identification method, device, storage medium and equipment of pathogen genome co-evolution mutation cluster
WO2024187428A1 (en) Assembly process for constructing high-quality microbial genomes on basis of stlfr metagenomic sequencing data
CN112365990B (en) Strong signal screening method for adverse reaction signals of SRS combined medication
US20230420078A1 (en) Scrnaseq analysis systems
Anaissi et al. A benchmark of pre-processing effect on single cell RNA sequencing integration methods
US20230212560A1 (en) Systems, methods, and media for determining relative quality of oligonucleotide preparations
US20240203531A1 (en) Cell type annotation
Ettetuani et al. Meta-analysis for a therapeutic target involved in the activation of the genes associated with c3 glomerulopathy
Yan et al. obaDIA: one-step biological analysis pipeline for quantitative proteomics data from data independent acquisition mass spectrometry

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination