CN116072217A

CN116072217A - Single cell transcriptome data availability processing method, medium and equipment

Info

Publication number: CN116072217A
Application number: CN202310126779.0A
Authority: CN
Inventors: 陈哲名; 郎秋蕾; 陈志锋
Original assignee: Hangzhou Lianchuan Gene Diagnosis Technology Co ltd
Current assignee: Hangzhou Lianchuan Gene Diagnosis Technology Co ltd
Priority date: 2022-11-02
Filing date: 2022-11-02
Publication date: 2023-05-05
Anticipated expiration: 2042-11-02
Also published as: CN115424668A; CN116072217B; CN115424668B

Abstract

The invention discloses a single cell transcriptome data availability analysis method and relates to biological data analysis methods. The method comprises the following steps: sorting the barcodes by gene expression level in descending order; obtaining the inflection points of the variation amplitude of the gene expression level; traversing all inflection points, classifying the barcodes into a cell region, an empty droplet region and a magnetic bead region, and counting the number of barcodes in each region; extracting the expression profiles of all barcodes in the cell region; counting the number of reads aligned to the reference genome and calculating the average number of reads per cell; when the gene expression level corresponding to at least one inflection point is greater than G1, the gene expression level corresponding to at least one inflection point is greater than G2 and less than G1, the number of barcodes in the cell region is greater than K3, the number of barcodes in the empty droplet region is greater than K4, and the average number of reads per cell is greater than K6, the sample data is judged to be available; otherwise, the sample data is judged to be unavailable. The invention can systematically analyze the availability of single-cell transcriptome data, provide an early warning of data availability before downstream analysis, and save analysts' time and effort.

Description

Single cell transcriptome data availability processing method, medium and equipment

Cross Reference to Related Applications

This application is based on the application with application number 2022113631393, filed on November 2, 2022, and entitled: a single cell transcriptome data availability analysis method, medium and apparatus.

Technical Field

The invention relates to a biological data analysis method, in particular to a single cell transcriptome data availability processing method, medium and equipment.

Background

Single cell transcriptome sequencing technology can obtain expression information for nearly ten thousand genes in a single cell, distinguish the transcriptional characteristics of the various cell types in biological tissues, and comprehensively reveal the heterogeneity of gene expression among cells. The core of a high-throughput single-cell sequencing platform is to add a unique sequence tag to each cell and, during sequencing, to treat nucleic acid sequences carrying the same tag as coming from the same cell. The 10X Genomics single-cell transcriptome sequencing platform is currently in wide use. It combines microfluidics, oil droplet encapsulation and barcode tags to sort and capture cells at high throughput, can separate and label 500 to tens of thousands of single cells in one run, and yields transcriptome information for each cell after sequencing. Its advantages include high cell throughput, low library construction cost and a short capture period. The technology is mainly used for cell typing and the identification of marker factors; it can divide cell populations, detect differences in gene expression between them, and predict cell differentiation and developmental trajectories, and it plays an increasingly important role in research on disease, immunity and tumors as well as on tissues, organs and development.

A typical single cell transcriptome sequencing technique consists of 6 steps: single cell analysis, RNA isolation, reverse transcription, amplification, library generation and sequencing. The first two steps are particularly important. The 10X Genomics single cell transcriptome sequencing technology uses a microfluidic chip to encapsulate a microbead carrying a barcode tag and a single cell in one droplet. Each microbead carries a unique nucleotide sequence, namely a barcode tag, which labels an individual cell. Each barcode tag is also linked to a unique molecular identifier (UMI), which likewise consists of a nucleotide sequence; each UMI labels one mRNA transcript. After reverse transcription, PCR amplification, library generation and sequencing, the barcode tag and the UMI tag of each sequence in the sequencing data show whether sequences come from the same cell and the same mRNA, which yields the transcriptome expression profile of each single cell.
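The role of the barcode and UMI tags described above can be sketched in a few lines of Python. This is a minimal illustration, not the counting procedure of any particular pipeline: the function name `umi_counts_per_barcode`, the `(barcode, umi, gene)` read tuples and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict

def umi_counts_per_barcode(reads):
    """Count distinct (UMI, gene) pairs per barcode.

    reads: iterable of hypothetical (barcode, umi, gene) tuples, one per aligned read.
    """
    molecules = defaultdict(set)
    for barcode, umi, gene in reads:
        # Reads sharing barcode, UMI and gene are treated as copies of one transcript.
        molecules[barcode].add((umi, gene))
    return {barcode: len(pairs) for barcode, pairs in molecules.items()}

reads = [
    ("AAAC", "TTG", "GeneA"),
    ("AAAC", "TTG", "GeneA"),  # same barcode and UMI: an amplification duplicate
    ("AAAC", "CCA", "GeneA"),  # same cell, different transcript
    ("GGGT", "TTG", "GeneB"),  # different barcode: different droplet
]
counts = umi_counts_per_barcode(reads)
```

Here `counts` maps each barcode to its UMI count, the quantity the rest of the document calls the gene expression level of a barcode.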

Although 10X Genomics single cell transcriptome sequencing technology can detect thousands of cells simultaneously, it presupposes that the droplets (GEMs) encapsulating cells and microbeads are generated normally and that each cell receives a sufficient amount of sequencing data. When GEM generation fails or the number of cells is excessive during the experiment, the sequencing data can hardly reflect the true state of the cells. GEM generation may fail because cells or magnetic beads block a microchannel (collectively referred to as hole blockage), or because the oil droplets do not properly encapsulate the cell suspension (collectively referred to as weighing failure). The former results in an extremely low number of captured cells; the latter blurs the boundary of captured cells and confuses the expression profiles. An excessive number of cells, in turn, may leave each cell with insufficient sequencing depth, making the results severely unstable. Under the prior art, these data problems are not directly reflected in the experimental process or in the data volume, and the sequencing data is often found to be unavailable only after the analysis has progressed to a certain degree, which wastes considerable manpower, computing power and time.

Disclosure of Invention

In order to solve at least one technical problem mentioned in the background art, the invention aims to provide a single-cell transcriptome data availability analysis method, medium and equipment, which can judge whether single-cell transcriptome data is unavailable due to experimental problems, provide data availability early warning before downstream analysis, save analysis time and energy of analysts, and provide basis for subsequent processing according to corresponding processing schemes.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A single cell transcriptome data availability analysis method, comprising the following steps:

S1, sorting the barcodes by gene expression level in descending order;

S2, obtaining the inflection points of the variation amplitude of the gene expression level;

S3, traversing all inflection points and, according to the gene expression level at each inflection point, classifying the barcodes into a cell region, an empty droplet region and a magnetic bead region;

S4, counting the number of barcodes in the cell region, the empty droplet region and the magnetic bead region;

S5, extracting the expression profiles of all barcodes in the cell region;

S6, counting the number of reads aligned to the reference genome, and calculating the average number of reads per cell;

S7, judging that the sample data is available when the gene expression level corresponding to at least one inflection point is greater than G1, the gene expression level corresponding to at least one inflection point is greater than G2 and less than G1, the number of barcodes in the cell region is greater than K3, the number of barcodes in the empty droplet region is greater than K4, and the average number of reads per cell is greater than K6; otherwise, judging that the sample data is unavailable.

Further, the inflection point of the variation amplitude of the gene expression level is obtained as follows:

S21, drawing a scatter diagram with the rank of each barcode as the X axis and the gene expression level as the Y axis;

S22, taking points on the scatter diagram at a specified interval, and obtaining the slope between each pair of adjacent points;

S23, when the slope is decreasing and, while that trend lasts, falls below the set slope threshold for the first time, setting the corresponding point as an inflection point.

Further, before the scatter diagram is drawn in S21, the rank of each barcode and the gene expression level are logarithmically transformed.

Further, the cell region, the empty droplet region and the magnetic bead region are classified as follows:

when the gene expression level corresponding to an inflection point is greater than G1, classifying the barcodes ranked before that inflection point into the cell region; when the gene expression level corresponding to an inflection point lies between G1 and G2, classifying the barcodes that are ranked before that inflection point and are not in the cell region into the empty droplet region; and when the gene expression level corresponding to an inflection point is smaller than G2, classifying the barcodes ranked after that inflection point into the magnetic bead region.

Further, when the sample data is unavailable, the reason is further judged:

calculating the expression proportion of each gene among the barcodes, and counting the number of first genes, whose expression proportion is greater than P1, and the number of second genes, whose expression proportion is greater than P2;

when the gene expression level corresponding to only one inflection point is greater than G2, and the number of first genes is greater than K1 or the number of second genes is greater than K2, judging that the sample data is unavailable because of a weighing failure in the experiment;

when the number of barcodes in the cell region is smaller than K3 and the number of barcodes in the empty droplet region is smaller than K4, judging that the sample data is unavailable because of hole blockage in the experiment;

when the number of barcodes in the cell region is smaller than K3 and the number of barcodes in the empty droplet region is greater than K4, judging that the availability of the sample data is to be confirmed because the number of experimental cells is too small;

when the number of barcodes in the cell region is greater than K5 and the average number of reads per cell is smaller than K6, judging that the sample data is unavailable because the number of experimental cells is excessive.

Further, step S7 is followed by step S8, in which the data is handled according to its availability:

if the sample data is available, subsequent data analysis proceeds normally;

if the sample data is unavailable because of a weighing failure or hole blockage in the experiment, the experiment is repeated with the cell suspension;

if the sample data is unavailable because the number of experimental cells is excessive, the amount of sequencing data is increased.

Further, when the amount of sequencing data is increased, the amount of supplementary sequencing data is:

$$Gb = (5 \times 10^{4} - Read_{cell}) \times Barc_{cell}$$

where Gb is the amount of supplementary sequencing data, Read_cell is the average number of reads per cell, and Barc_cell is the number of barcodes in the cell region.

A computer storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements a single cell transcriptome data availability analysis method as described above.

A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a single cell transcriptome data availability analysis method as described above when executing the computer program.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention judges the availability of single cell transcriptome data by calculating the variation amplitude of the gene expression level, distinguishing three types of barcodes (cell region, empty droplet region and magnetic bead region), and evaluating the number of barcodes of each type, the gene expression proportion and the amount of sequencing data per cell droplet. Compared with the prior art, the technical scheme provided by the invention can systematically analyze the availability of single-cell transcriptome data, provide an early warning of data availability before downstream analysis, and save analysts' time and effort.

2. The invention further analyzes the situation that the sample data is not available, judges whether the single cell transcriptome data is not available due to experimental problems, and provides a corresponding processing method.

Drawings

FIG. 1 is a flow chart of an overall method according to an embodiment of the invention.

FIG. 2 is a scatter diagram of an embodiment of the present invention.

Fig. 3 is a schematic diagram of an inflection point according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Embodiment one:

Referring to fig. 1, the present embodiment provides a single cell transcriptome data availability analysis method, which includes the following steps:

S1, sorting the barcodes in descending order of gene expression level (UMI counts, abbreviated as C_UMI) and assigning each barcode a rank R_n;
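Step S1 amounts to a descending sort followed by rank assignment. The sketch below assumes the UMI counts are available as a plain dictionary; the function name `rank_barcodes` and the toy barcodes are hypothetical. Ties keep their input order because Python's sort is stable, which is a choice of this sketch and not something the embodiment specifies.

```python
def rank_barcodes(umi_counts):
    """Return [(R_n, barcode, C_UMI)] sorted by descending C_UMI.

    umi_counts: hypothetical {barcode: C_UMI} mapping.
    """
    ordered = sorted(umi_counts.items(), key=lambda item: item[1], reverse=True)
    # Ranks start at 1 so that log10 of the rank is defined for every barcode.
    return [(rank, barcode, count)
            for rank, (barcode, count) in enumerate(ordered, start=1)]

ranked = rank_barcodes({"AAAC": 120, "GGGT": 4500, "TTTA": 3})
```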

S2, obtaining an inflection point of the variation amplitude of the gene expression quantity, namely a sharp variation point; the specific solving method is as follows:

S21, drawing a scatter diagram with the barcode rank R_n as the X axis and the gene expression level C_UMI as the Y axis; to amplify the variation amplitude of the gene expression level C_UMI, the present embodiment also applies a log10 transformation to the barcode rank R_n and the gene expression level C_UMI, that is, the scatter diagram is drawn with log10 R_n as the X axis and log10 C_UMI as the Y axis, as shown in fig. 2.

S22, taking points along the X axis of the scatter diagram at a fixed interval; the interval length can be user-defined, or set to 0.2 or 0.3.

The slope k_n between two adjacent points is calculated as:

$$k_n = \frac{y_n - y_{n-1}}{x_n - x_{n-1}}$$

where (x_n, y_n) are the coordinates of the n-th point and (x_{n-1}, y_{n-1}) are the coordinates of the (n-1)-th point. Table 1 below shows the rank and slope of the gene expression level for one example.

Table 1: ranking and slope of gene expression levels

S23, when the slope k_n is decreasing, i.e. k_n < k_{n-1}, and, while that trend lasts, k_n falls below the set slope threshold for the first time, the corresponding point is set as an inflection point; in this embodiment, the slope threshold is -1.

As can be seen from table 1 above, the slope corresponding to the barcode of rank 1996 falls below -1 for the first time during a continuous decrease, so that point is considered an inflection point.

As shown in FIG. 3, step S2 finds the inflection points K_m on the scatter diagram; the number of inflection points is counted for use in subsequent steps.
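One possible reading of step S2 is sketched below in Python. Several details are assumptions of this sketch, not statements of the embodiment: a point is sampled whenever the X coordinate has advanced by at least `step` since the last sampled point, the "corresponding point" of slope k_n is taken to be the n-th point, zero counts are dropped because their logarithm is undefined, and the function names are hypothetical.

```python
import math

def to_log_points(sorted_counts):
    """Map descending UMI counts to (log10 rank, log10 count) points."""
    return [(math.log10(rank), math.log10(count))
            for rank, count in enumerate(sorted_counts, start=1)
            if count > 0]  # assumption: barcodes with zero UMIs are skipped

def find_inflections(points, step=0.2, slope_threshold=-1.0):
    """Return sampled points where a decreasing slope first drops below the threshold."""
    if not points:
        return []
    sampled = [points[0]]
    for x, y in points[1:]:
        if x - sampled[-1][0] >= step:  # assumed sampling rule
            sampled.append((x, y))
    inflections = []
    prev_slope = None
    crossed = False  # True once the current decreasing run has crossed the threshold
    for (x0, y0), (x1, y1) in zip(sampled, sampled[1:]):
        slope = (y1 - y0) / (x1 - x0)
        decreasing = prev_slope is not None and slope < prev_slope
        if not decreasing:
            crossed = False  # the decreasing trend ended; a new run may start later
        elif slope < slope_threshold and not crossed:
            inflections.append((x1, y1))
            crossed = True
        prev_slope = slope
    return inflections

demo = [(0, 5), (1, 4.9), (2, 4.5), (3, 2.0), (4, 1.0), (5, 0.9), (6, 0.85), (7, -1)]
demo_inflections = find_inflections(demo, step=1.0)
```

The returned coordinates are in log space; `10 ** y` recovers the UMI count at an inflection point for comparison with G1 and G2.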

S3, traversing all inflection points and classifying the barcodes into a cell region, an empty droplet region and a magnetic bead region according to the gene expression level. The regions have the following meanings:

Cell region: the barcode represents a droplet that contains a cell;

Empty droplet region: the barcode represents a droplet that contains no cell but contains cell suspension;

Magnetic bead region: the barcode represents a droplet that contains neither a cell nor cell suspension.

The cell region, the empty droplet region and the magnetic bead region are classified as follows:

Two thresholds G1 and G2 (G1 > G2) are set, and all inflection points are traversed:

when the gene expression level C_UMI corresponding to an inflection point is greater than G1, all barcodes ranked before that inflection point are classified into the cell region;

when the gene expression level corresponding to an inflection point lies between G1 and G2, the barcodes that are ranked before that inflection point and are not in the cell region are classified into the empty droplet region;

when the gene expression level corresponding to an inflection point is smaller than G2, the barcodes ranked after that inflection point are classified into the magnetic bead region.

G1 and G2 can be adjusted according to the actual situation and are generally set to 500 and 80, respectively.
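The classification rules of step S3 can be sketched as follows. This is an illustration under stated assumptions: inflection points are given as 1-based ranks, "before" and "after" exclude the inflection barcode itself, and barcodes covered by no rule are labeled "unassigned". The text does not say how counts exactly equal to G1 or G2 are handled, so the sketch leaves them out of every rule. The function name and toy data are hypothetical.

```python
from collections import Counter

def classify_barcodes(sorted_counts, inflection_ranks, g1=500, g2=80):
    """Label each barcode (in descending-count order) as cell, empty, bead or unassigned."""
    n = len(sorted_counts)
    cell_end = 0    # indices [0, cell_end) are cells
    empty_end = 0   # indices [cell_end, empty_end) are empty droplets
    bead_start = n  # indices [bead_start, n) are bare beads
    for rank in inflection_ranks:
        count = sorted_counts[rank - 1]
        if count > g1:
            cell_end = max(cell_end, rank - 1)
        elif g2 < count < g1:
            empty_end = max(empty_end, rank - 1)
        elif count < g2:
            bead_start = min(bead_start, rank)
    labels = []
    for i in range(n):
        if i < cell_end:
            labels.append("cell")
        elif i < empty_end:
            labels.append("empty")
        elif i >= bead_start:
            labels.append("bead")
        else:
            labels.append("unassigned")
    return labels

counts = [3000, 2500, 2000, 600, 200, 150, 100, 90, 10, 5, 2]
labels = classify_barcodes(counts, inflection_ranks=[4, 8, 9])
region_sizes = Counter(labels)
```

`region_sizes` then holds the per-region barcode totals that the later availability checks compare against their thresholds.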

S4, counting the number of barcodes in the cell region, the empty droplet region and the magnetic bead region, denoted Barc_cell, Barc_empty and Barc_bead respectively.

S5, extracting the expression profiles of all barcodes in the cell region, and calculating the expression proportion P of each gene among the barcodes. If the number of barcodes expressing a given gene (say Gene A) is C_A, the expression proportion P is calculated as:

$$P = \frac{C_A}{Barc_{cell}} \times 100\%$$

The number of first genes, whose expression proportion is greater than P1 (50%), and the number of second genes, whose expression proportion is greater than P2 (70%), are counted.
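The expression proportion of step S5 can be computed as in the sketch below. It assumes the cell-region expression profiles are available as a nested dictionary and that a barcode "expresses" a gene when its UMI count for that gene is above zero; both are assumptions of this example, as is the function name.

```python
def count_widely_expressed_genes(profiles, p1=50.0, p2=70.0):
    """Return (first gene number, second gene number) for cell-region barcodes.

    profiles: hypothetical {barcode: {gene: umi_count}} mapping.
    """
    barc_cell = len(profiles)
    if barc_cell == 0:
        return 0, 0
    expressing = {}  # gene -> number of barcodes expressing it (C_A for gene A)
    for genes in profiles.values():
        for gene, count in genes.items():
            if count > 0:
                expressing[gene] = expressing.get(gene, 0) + 1
    proportions = [c / barc_cell * 100 for c in expressing.values()]
    first = sum(p > p1 for p in proportions)
    second = sum(p > p2 for p in proportions)
    return first, second

profiles = {
    "bc1": {"GeneA": 4, "GeneB": 1, "GeneC": 2},
    "bc2": {"GeneA": 7, "GeneB": 3},
    "bc3": {"GeneA": 2, "GeneB": 5},
    "bc4": {"GeneA": 1},
    "bc5": {"GeneA": 9, "GeneB": 0},
}
first_genes, second_genes = count_widely_expressed_genes(profiles)
```

In this toy input GeneA is expressed in all five barcodes, GeneB in three and GeneC in one, so two genes exceed P1 and one exceeds P2.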

S6, using Cell Ranger, the official 10X Genomics software, to count the number of reads aligned to the reference genome, Read_total, and calculating the average number of reads per cell, Read_cell, to determine whether the sequencing amount is sufficient:

$$Read_{cell} = \frac{Read_{total}}{Barc_{cell}}$$

S7, judging whether the sample data is available: when the gene expression level C_UMI corresponding to at least one inflection point is greater than G1, the gene expression level C_UMI corresponding to at least one inflection point is greater than G2 and less than G1, the number of barcodes in the cell region Barc_cell is greater than K3 (2000 in this embodiment), the number of barcodes in the empty droplet region Barc_empty is greater than K4 (30000 in this embodiment), and the average number of reads per cell Read_cell is greater than K6 (20000 in this embodiment), the sample data is judged to be available; otherwise, the sample data is judged to be unavailable.
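The availability test of step S7 is a conjunction of five conditions, which the following sketch writes out. The `SampleStats` container and function names are hypothetical; the default thresholds are the values given for this embodiment, and the average read number follows the Read_total / Barc_cell formula of step S6.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SampleStats:
    inflection_counts: List[float]  # C_UMI at each inflection point
    barc_cell: int                  # barcodes in the cell region
    barc_empty: int                 # barcodes in the empty droplet region
    read_total: int                 # reads aligned to the reference genome

def mean_reads_per_cell(stats):
    """Read_cell = Read_total / Barc_cell (0.0 when there are no cell barcodes)."""
    return stats.read_total / stats.barc_cell if stats.barc_cell else 0.0

def is_available(stats, g1=500, g2=80, k3=2000, k4=30000, k6=20000):
    has_cell_inflection = any(c > g1 for c in stats.inflection_counts)
    has_empty_inflection = any(g2 < c < g1 for c in stats.inflection_counts)
    return (has_cell_inflection
            and has_empty_inflection
            and stats.barc_cell > k3
            and stats.barc_empty > k4
            and mean_reads_per_cell(stats) > k6)

good = SampleStats([1200, 150, 20], barc_cell=5000, barc_empty=60000,
                   read_total=5000 * 30000)
few_cells = SampleStats([1200, 150, 20], barc_cell=1500, barc_empty=60000,
                        read_total=1500 * 30000)
```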

In this embodiment, when the sample data is unavailable, the reason is further judged:

when the gene expression level C_UMI corresponding to only one inflection point is greater than G2, and the number of first genes is greater than K1 (900 in this embodiment) or the number of second genes is greater than K2 (300 in this embodiment), the sample data is judged to be unavailable because of a weighing failure in the experiment;

when the number of barcodes in the cell region Barc_cell is smaller than K3 and the number of barcodes in the empty droplet region Barc_empty is smaller than K4, the sample data is judged to be unavailable because of hole blockage in the experiment;

when the number of barcodes in the cell region Barc_cell is smaller than K3 and the number of barcodes in the empty droplet region Barc_empty is greater than K4, the availability of the sample data is judged to be pending confirmation because the number of experimental cells is too small;

when the number of barcodes in the cell region Barc_cell is greater than K5 (20000 in this embodiment) and the average number of reads per cell Read_cell is smaller than K6, the sample data is judged to be unavailable because the number of experimental cells is excessive and the sequencing depth is insufficient;
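The four diagnostic rules above can be collected into one function, as in this sketch. The text does not state a priority among the rules, so the sketch simply reports every rule that matches; that choice, the function name and the reason strings are assumptions. Default thresholds are the values given for this embodiment.

```python
def diagnose(inflection_counts, barc_cell, barc_empty, read_cell,
             first_genes, second_genes,
             g2=80, k1=900, k2=300, k3=2000, k4=30000, k5=20000, k6=20000):
    """Return the list of reasons that match an unavailable sample."""
    reasons = []
    inflections_above_g2 = sum(c > g2 for c in inflection_counts)
    if inflections_above_g2 == 1 and (first_genes > k1 or second_genes > k2):
        reasons.append("weighing failure")
    if barc_cell < k3 and barc_empty < k4:
        reasons.append("hole blockage")
    if barc_cell < k3 and barc_empty > k4:
        reasons.append("too few cells (availability to be confirmed)")
    if barc_cell > k5 and read_cell < k6:
        reasons.append("too many cells (insufficient sequencing depth)")
    return reasons

clogged = diagnose([900, 20], barc_cell=500, barc_empty=1000, read_cell=30000,
                   first_genes=100, second_genes=50)
overloaded = diagnose([1200, 150], barc_cell=25000, barc_empty=60000, read_cell=8000,
                      first_genes=100, second_genes=50)
```

An empty list means that none of the listed reasons applies, which can happen for samples that fail step S7 for a cause the rules do not cover.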

S8, handling the data according to its availability:

if the sample data is unavailable because the number of experimental cells is excessive, the amount of sequencing data is increased; the amount of supplementary sequencing data is:

$$Gb = (5 \times 10^{4} - Read_{cell}) \times Barc_{cell}$$

where Gb is the amount of supplementary sequencing data, Read_cell is the average number of reads per cell, and Barc_cell is the number of barcodes in the cell region.

If the number of experimental cells is too small, the experiment is repeated with the cell suspension.
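The supplementary sequencing formula of step S8 is simple arithmetic, shown here as a worked example. The function name and the input values are hypothetical; the constant 5×10⁴ comes from the formula itself. The result is returned in whatever unit the formula implies (reads per cell multiplied by a barcode count), and the sketch makes no further unit conversion.

```python
def supplementary_data_amount(read_cell, barc_cell, target_reads_per_cell=50_000):
    """Gb = (5e4 - Read_cell) * Barc_cell, as given in step S8."""
    return (target_reads_per_cell - read_cell) * barc_cell

# Hypothetical overloaded sample: 25000 cell barcodes averaging 8000 reads each.
extra = supplementary_data_amount(read_cell=8000, barc_cell=25000)
```

For this input the shortfall is 42000 reads per cell across 25000 barcodes, giving 1,050,000,000.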

Embodiment two:

a computer storage medium having a computer program stored thereon, wherein the program when executed by a processor implements the single cell transcriptome data availability analysis method according to embodiment one.

Embodiment three:

a terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the single cell transcriptome data availability analysis method according to embodiment one of the present invention when executing the computer program.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A single cell transcriptome data availability processing method, comprising the steps of:

S1, sorting the barcodes by gene expression level in descending order and assigning each a rank R_n, and applying a log10 transformation to the barcode rank R_n and the gene expression level C_UMI;

S5, extracting the expression profiles of all barcodes in the cell region;

S7, judging that the sample data is available when the gene expression level corresponding to at least one inflection point is greater than G1, the gene expression level corresponding to at least one inflection point is greater than G2 and less than G1, the number of barcodes in the cell region is greater than K3, the number of barcodes in the empty droplet region is greater than K4, and the average number of reads per cell is greater than K6; otherwise, judging that the sample data is unavailable;

when the sample data is unavailable, the reason is further judged:

when the gene expression level corresponding to only one inflection point is greater than G2, and the number of first genes is greater than K1 or the number of second genes is greater than K2, judging that the sample data is unavailable because the oil droplets did not correctly encapsulate the cell suspension in the experiment;

when the number of barcodes in the cell region is greater than K5 and the average number of reads per cell is less than K6, judging that the sample data is unavailable because the number of experimental cells is excessive;

if the sample data is unavailable because the oil droplets did not correctly encapsulate the cell suspension or because of hole blockage in the experiment, the experiment is repeated with the cell suspension.

2. The method for processing availability of single cell transcriptome data according to claim 1, wherein the method for solving the inflection point of the magnitude of the change in the gene expression level is as follows:

3. The method for processing availability of single cell transcriptome data according to claim 2, wherein ranking of the barcode and the gene expression level are logarithmically processed before the scattergram is drawn in S21.

4. The method for processing availability of single cell transcriptome data according to claim 1, wherein the classification method of the cell region, the empty droplet region and the magnetic bead region is as follows:

5. The method for processing availability of single cell transcriptome data according to claim 1, wherein, when the amount of sequencing data is increased, the amount of supplementary sequencing data is:

$$Gb = (5 \times 10^{4} - Read_{cell}) \times Barc_{cell}$$

6. A computer storage medium having stored thereon a computer program, which when executed by a processor implements the single cell transcriptome data availability processing method according to any one of claims 1 to 5.

7. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the single cell transcriptome data availability processing method according to any one of claims 1 to 5 when the computer program is executed by the processor.