CN115394359A

CN115394359A - Method for identifying human embryonic cell chromosome variation and application

Info

Publication number: CN115394359A
Application number: CN202211322202.9A
Authority: CN
Inventors: 乔杰; 李烨; 严智强; 王玉倩; 闫丽盈; 王楠; 朱小辉; 关硕; 阔瀛; 孔思明
Original assignee: Peking University Third Hospital Peking University Third Clinical Medical College
Current assignee: Peking University Third Hospital Peking University Third Clinical Medical College
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2022-11-25
Anticipated expiration: 2042-10-27
Also published as: CN115394359B

Abstract

The invention relates to a method for identifying chromosomal variation in human embryonic cells. By establishing a normal diploid gene expression matrix, the method obtains a gene-level reference system that serves as an expression baseline. The relative chromosome expression level calculated for an embryo under test indicates the ploidy of its chromosomes. Because an embryo biopsy captures information representative of the whole embryo, transcriptome sequencing can serve as an effective tool for pre-implantation screening. The method generates chromosome karyotypes from changes in RNA expression, and its results are essentially consistent with CNV results calculated from existing whole-genome sequencing.

Description

Method for identifying human embryo cell chromosome variation and application

Technical Field

The invention relates to the field of medical detection, in particular to a method for identifying human embryonic cell chromosome variation and application thereof.

Background

Less than half of human zygotes survive to birth, and some fetuses are born with genetic disease, primarily due to chromosomal deletions or duplications of meiotic or mitotic origin. Currently, embryos are selected for uterine transfer using a combination of morphological criteria, developmental dynamics and aneuploidy genetic testing. However, no single criterion ensures that a viable embryo is selected. Transcriptome data can help identify embryos with high developmental potential, but it is also necessary to know the chromosomal copy number variation (CNV) of those embryos. Although chromosomal CNV can be obtained by bulk DNA assays or by comparing multiple biopsies of a few embryonic cells, these methods are based on genome sequencing and cannot provide transcriptome information at the same time.

At present, two existing techniques identify chromosomal variation in pre-implantation human embryonic cells using single-cell transcriptome sequencing:

(1) RNA-seq libraries are generated from the trophectoderm biopsy and from the remaining whole embryo. Based on the RNA expression value of each sample, the method uses the z-score for standardization, builds an RNA digital karyotype for each autosome across a batch of samples, and reports chromosomes with a z-score greater than 2 or less than -2 as abnormal.

(2) The embryo karyotype is classified from transcriptome data, but this requires deeper RNA-seq sequencing; aneuploidy is inferred from SNP genotyping by integrating allelic imbalance with dosage-related changes in gene expression.

The aforementioned z-score approach removes the noisiest genes (those expressed at <1 RPKM in all samples), then treats each entire chromosome as a transcription unit and computes a z-score for the total gene expression on each chromosome. Chromosomes with z-scores greater than 2 or less than -2 are considered outliers, and the karyotype of the embryo is judged accordingly. However, this method has several disadvantages:

First, unstable gene expression affects the determination of chromosome copy number. Some genes are expressed at very high levels, up to ten thousand RPKM, while others are expressed at single-digit levels. A highly expressed gene therefore has a large influence on the total transcript amount of the chromosome on which it lies. In particular, when highly expressed genes are themselves unstably expressed, they introduce excessive intrinsic noise into karyotype determination. The method filters out only unexpressed genes and does nothing about highly expressed ones. As a result, for chromosomes with few genes, such as chromosome 21, the karyotype calculation is easily distorted by highly expressed genes.

Second, there is a systematic error. In diploid human embryonic trophoblast cells, gene expression is highly heterogeneous between individuals, so measuring chromosome ploidy directly from gene expression levels introduces large errors. The method normalizes only at the chromosome level and applies no correction or normalization at the sample level. Differences in total expression between samples therefore cause bias: chromosomes of samples with low total expression (or few cells) are more likely to be called as deleted, and vice versa.

Finally, the chromosome copy number is a relative value obtained by normalizing a batch of samples together. The method therefore needs a certain number of samples to be compared at once, and it presupposes that most of them are normal diploids so that outliers can be found after normalization. These sample requirements are demanding and are sometimes hard to meet clinically.
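For orientation, the batch z-score approach criticized above can be sketched as follows. This is only an illustrative sketch, not the prior-art implementation: the function name `zscore_outliers`, the use of the sample standard deviation, and the example values are all assumptions.

```python
from statistics import mean, stdev

def zscore_outliers(chrom_totals, threshold=2.0):
    """Return indices of samples whose z-score for one chromosome exceeds the threshold.

    chrom_totals: total expression of a single chromosome, one value per sample.
    Assumption: the z-score uses the sample standard deviation of the batch.
    """
    mu = mean(chrom_totals)
    sd = stdev(chrom_totals)
    z = [(x - mu) / sd for x in chrom_totals]
    return [i for i, v in enumerate(z) if abs(v) > threshold]

# Hypothetical totals for one chromosome in a batch of eight samples.
batch = [100.0] * 7 + [150.0]
flagged = zscore_outliers(batch)

# The same elevated value in a batch of two samples is not flagged,
# because the z-score depends on the rest of the batch.
flagged_small = zscore_outliers([100.0, 150.0])
```

The second call illustrates the batch dependence noted above: the outcome for a sample changes with the number and composition of the samples normalized alongside it.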

Therefore, there is currently no effective method for detecting single-cell chromosomal copy number variation from the transcriptome.

Disclosure of Invention

To overcome the deficiencies of the prior art, we developed a transcriptome analysis method. It infers aneuploidy, and thereby helps evaluate the developmental competence of embryos, by using single-cell transcriptome sequencing data to identify whether the gene expression level on each chromosome of a human embryonic cell matches that of a normal diploid embryo. The invention mainly solves two problems. First, it screens out unstably expressed genes by their coefficient of variation and establishes a gene expression reference system for normal human diploid embryos, eliminating intrinsic noise. Second, it corrects the chromosome expression levels at the sample level using the diploid reference system: the chromosome expression levels of each sample are multiplied by a common coefficient so that their median equals 2, which eliminates the systematic bias in karyotype judgment caused by differences between samples.

Specifically, we establish a normal diploid gene expression matrix to obtain a gene-level reference system that serves as an expression baseline. The relative chromosome expression level calculated for an embryo under test indicates its chromosome ploidy. Because an embryo biopsy captures information representative of the whole embryo, transcriptome sequencing can serve as an effective tool for pre-implantation screening. The results indicate that this technique can generate chromosome karyotypes from changes in RNA expression, and that these are essentially consistent with CNV results calculated from current whole-genome sequencing.

In order to achieve the above technical effects, the following technical solutions are specifically provided:

In a first aspect of the present invention, there is provided a method for detecting single-cell chromosomal copy number variation (CNV) from the transcriptome, the method comprising the following steps:

(1) Screening for stably expressed genes.

After deleting genes whose average expression level across all diploid samples is less than 1, the coefficient of variation (CV) of the expression level of each remaining gene is calculated as follows:

$$CV = \frac{SD}{Mean}$$

where SD is the standard deviation of the gene's expression level across the samples and Mean is the mean expression level of the gene.

The CV values are then ranked from high to low according to their distribution. Genes whose CV values fall in the top 25 percent are regarded as unstably expressed and are screened out; the remaining genes are stably expressed and are retained for the next calculation.

(2) Calculating the average expression level of each stably expressed gene in the diploid standard samples, and forming a new matrix together with the genes, wherein the matrix is the gene expression reference system of a normal human diploid embryo.

(3) Preparation of relative expression quantity matrix

After the diploid gene expression reference system is obtained, the CNV of clinical samples can be calculated from the transcriptome. First, a relative expression matrix is prepared: the genes that overlap with the reference system are selected from the generated expression matrix to form a new matrix, and the expression level of each gene in the new matrix is divided by the average expression level of the corresponding gene in the reference system. Relative expression values higher than 4 are capped at 4, so that excessive fluctuation of a single gene does not unduly affect the overall result. Suppose the expression level of gene X in the reference system is $\bar{E}_X$, and the expression levels of gene X in the $m$ samples to be tested are

$$E_X = (e_{X,1}, e_{X,2}, \ldots, e_{X,m})$$

The relative expression of gene X is then

$$R_X = \left(\frac{e_{X,1}}{\bar{E}_X}, \frac{e_{X,2}}{\bar{E}_X}, \ldots, \frac{e_{X,m}}{\bar{E}_X}\right)$$

Similarly, the relative expression matrix of all genes is obtained by applying the same division to every gene retained in the reference system.

(4) Generation and correction of relative expression matrix in chromosome unit and judgment of CNV

Once the relative expression matrix is obtained, the average relative expression level of the genes on each chromosome is calculated. Specifically, using the alignment file downloaded from the UCSC genome database, each gene in the relative expression matrix is assigned to the chromosome on which it lies, and the average expression level of the genes contained in each chromosome is calculated:

$$C_i = \frac{1}{n}\sum_{j=1}^{n} r_j$$

where $n$ is the number of genes belonging to chromosome $i$ in the diploid reference system and $r_j$ is the relative expression level of the $j$-th of these genes.

This is calculated once for each chromosome, giving a relative expression matrix in units of chromosomes, in which each row holds the relative expression levels of the chromosomes of one sample.

Because of individual differences between samples, the expression level of a sample may deviate from the reference system, while most chromosomes in most samples are normal diploid. The expression levels of the 22 autosomes of each sample are therefore multiplied by a per-sample coefficient so that the median chromosome expression level of the sample equals 2, which corresponds to a normal diploid. CNV is judged after the chromosome relative expression matrix has been normalized in this way: a copy number greater than 2.7 is called a trisomy, and a copy number less than 1.3 is called a monosomy.

In one embodiment, the CNV is determined as follows:

the expression levels of the 22 autosomes in a sample are $C_1, C_2, \ldots, C_{22}$, and the median of these chromosome expression levels is denoted $M$. The chromosome expression coefficient $k$ of the sample is then:

$$k = \frac{2}{M}$$

After $k$ is obtained, the corrected expression levels of the 22 autosomes of the sample are calculated as:

$$C'_i = k \cdot C_i, \quad i = 1, \ldots, 22$$

The values in the resulting final chromosome relative expression matrix represent the chromosome copy number variation (CNV).

Compared with the prior art, the invention has the following remarkable advantages:

(1) Handling of unstable gene expression. After genes not expressed in the samples (mean RPKM < 1) are screened out, the coefficient of variation (CV) of each remaining gene across all samples is calculated, i.e. the standard deviation divided by the mean. The more unstable the expression of a gene, the greater its coefficient of variation. After the genes in the top 25% of CV values are screened out, the remaining genes are stably expressed and are used to generate the diploid expression reference system. This largely eliminates the intrinsic noise caused by unstable gene expression and by differences in baseline expression level, so that the calculated CNV derives from chromosomal copy number variation. Whereas the prior art screens out only lowly expressed genes, the present method addresses the influence of expression stability on CNV and uses stably expressed genes as the basis for judging chromosome copy number, so that the accuracy of the result is far higher than that of the prior art;

(2) Handling of systematic error. The invention calculates the actual expression level of each gene and establishes a gene expression reference system for human diploid embryos. The reference system is used to calibrate the actual expression levels, eliminating the influence of differences in baseline expression between genes. Specifically, after the transcriptome matrix of diploid human embryonic trophoblast cells is obtained, the gene expression values (RPKM) of a sample are calculated and corrected against the reference system to obtain a relative expression matrix. The relative expression matrix removes the differences between genes and brings all genes to the same scale for statistics. The average expression level of the genes on each chromosome is then calculated, and finally all chromosome expression levels of each sample are scaled up or down together so that their median equals 2; the resulting matrix gives the chromosome ploidy.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a distribution of the coefficient of variation of each gene;

FIG. 2 shows the CNV results determined by the method.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it should be understood that they are presented herein only to illustrate and explain the present invention and not to limit the present invention.

Example 1 Construction of the model

Human embryo diploid gene expression level reference system: the average expression level of the stably expressed gene of the normal human diploid embryo was used as a standard control.

1. Single cell transcriptome sequencing

Trophectoderm cells (the outer trophoblast layer of the blastocyst) were obtained by biopsy, and 1, 3 or 5 cells were taken for single-cell transcriptome sequencing.

2. Sequencing data cleaning, alignment and post-alignment processing

First, the data were quality-cleaned with trim_galore (version 0.6.6): next-generation sequencing adapter sequences and low-quality bases were removed using default parameters, and only reads longer than 36 bp after trimming were retained. Next, alignment was performed using RSEM (version 1.3.3) with hg38 as the reference genome, and the expression level of each gene in each sample was calculated with RSEM.

3. Screening of samples and Generation of Gene expression matrices

After the gene expression level of each sample was obtained, the samples were quality-filtered. Samples in which more than 5000 genes have RPKM >1 are taken as qualified samples.

After the qualified samples are selected and the expression level (RPKM) of each gene in each sample is obtained, a matrix is prepared whose columns are the samples and whose rows are the genes.
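The sample-level filter and the gene-by-sample matrix described above can be sketched as follows. This is a minimal sketch under stated assumptions: the names `passes_qc` and `build_matrix`, the nested-dictionary layout, and the toy data are hypothetical, and the gene-count threshold is lowered in the example only so that it can run on three genes.

```python
def passes_qc(sample_rpkm, min_rpkm=1.0, min_genes=5000):
    """True if more than `min_genes` genes have RPKM above `min_rpkm`."""
    expressed = sum(1 for value in sample_rpkm.values() if value > min_rpkm)
    return expressed > min_genes

def build_matrix(samples):
    """Turn {sample: {gene: rpkm}} into {gene: {sample: rpkm}} (rows are genes)."""
    matrix = {}
    for name, expr in samples.items():
        for gene, value in expr.items():
            matrix.setdefault(gene, {})[name] = value
    return matrix

# Hypothetical toy data; real samples have thousands of genes.
samples = {
    "s1": {"A": 5.0, "B": 0.2, "C": 3.0},
    "s2": {"A": 0.5, "B": 0.0, "C": 2.0},
}
# min_genes=1 is used here only because the toy data has three genes.
qualified = {name: expr for name, expr in samples.items()
             if passes_qc(expr, min_genes=1)}
matrix = build_matrix(qualified)
```

With the default `min_genes=5000`, the same functions apply the threshold stated in the text.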

4. Production of human normal diploid embryo gene expression reference system

After the gene expression matrix is obtained, suitable samples and suitable genes are selected for establishing the gene expression reference system. First, the whole-genome PGD result of each selected sample is used to determine whether the trophoblast cells of the embryo sample are normal diploid (the gold standard). Second, the normal diploid samples are selected and their transcriptomes sequenced, after which the genes are screened and the reference system is built.

Because the number of cells differs between samples and the samples themselves differ, the total gene expression $T_s$ differs from sample to sample. To make the samples comparable, the average of the total expression of the normal samples, $\bar{T}$, is calculated, and the expression level $e_{g,s}$ of every gene in each sample is scaled up or down together so that the total gene expression of the sample equals this average:

$$\bar{T} = \frac{1}{n}\sum_{s=1}^{n} T_s, \qquad e'_{g,s} = e_{g,s} \cdot \frac{\bar{T}}{T_s}$$

where $n$ is the sample size;
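The total-expression equalization described above can be sketched as follows. The function name `equalize_totals`, the data layout, and the example values are assumptions for illustration; the sketch also assumes every sample has a non-zero total.

```python
def equalize_totals(samples):
    """Scale every gene in each sample so that all sample totals equal the mean total.

    samples: {sample_name: {gene: expression}}; each total is assumed to be > 0.
    """
    totals = {name: sum(expr.values()) for name, expr in samples.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {
        name: {gene: value * mean_total / totals[name] for gene, value in expr.items()}
        for name, expr in samples.items()
    }

# Hypothetical example: s2 has twice the total expression of s1.
samples = {
    "s1": {"A": 10.0, "B": 30.0},
    "s2": {"A": 20.0, "B": 60.0},
}
scaled = equalize_totals(samples)
```

After scaling, both samples have the same total (60 in this toy example), while the proportions between genes within each sample are unchanged.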

After the gene expression levels are corrected, genes whose average expression level across all diploid samples is less than 1 are deleted; these genes are regarded as not expressed and do not influence karyotype judgment. The coefficient of variation (CV) of the expression level of each remaining gene is then calculated as follows:

$$CV = \frac{SD}{Mean}$$

The greater the coefficient of variation, the less stable the expression of the gene is considered to be.

The CV values are ranked from high to low according to their distribution. Genes whose CV values fall in the top 25 percent are regarded as unstably expressed and are screened out; the remaining genes are stably expressed and are retained for the next calculation.
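A minimal sketch of this screening step is given below. The function names, the use of the sample standard deviation, and rounding the 25 percent down to a whole number of genes are assumptions; the text does not specify them.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = SD / Mean (sample standard deviation assumed)."""
    return stdev(values) / mean(values)

def stable_genes(expr, min_mean=1.0, drop_fraction=0.25):
    """Return the genes kept after the low-expression and CV filters.

    expr: {gene: [expression in each diploid sample]}
    """
    # Delete genes whose mean expression is below the threshold.
    expressed = {g: v for g, v in expr.items() if mean(v) >= min_mean}
    # Rank the remaining genes from highest to lowest CV.
    ranked = sorted(expressed,
                    key=lambda g: coefficient_of_variation(expressed[g]),
                    reverse=True)
    # Drop the top fraction as unstably expressed (rounded down).
    n_drop = int(len(ranked) * drop_fraction)
    return sorted(ranked[n_drop:])

# Hypothetical expression values across three diploid samples.
expr = {
    "G1": [10.0, 10.0, 10.0],
    "G2": [10.0, 12.0, 11.0],
    "G3": [1.0, 30.0, 5.0],   # highly variable
    "G4": [8.0, 9.0, 10.0],
    "G5": [0.1, 0.2, 0.3],    # mean below 1, treated as not expressed
}
kept = stable_genes(expr)
```

In this toy example G5 is removed by the expression filter and G3, which has the highest CV of the four remaining genes, is removed by the stability filter.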

Then the average expression level of each retained gene in the diploid standard samples is calculated and combined with the gene names into a new matrix. This matrix is the gene expression reference system of the normal human diploid embryo.
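As a small illustration, the reference system can be represented as a mapping from each stably expressed gene to its mean expression across the diploid samples. The function name `build_reference` and the example values are hypothetical.

```python
def build_reference(expr, genes):
    """Mean expression of each selected gene across the diploid standard samples.

    expr:  {gene: [expression in each diploid sample]}
    genes: the stably expressed genes to include in the reference system
    """
    return {gene: sum(expr[gene]) / len(expr[gene]) for gene in genes}

# Hypothetical values; G3 is assumed to have been screened out as unstable.
expr = {
    "G1": [10.0, 12.0, 14.0],
    "G2": [4.0, 6.0, 8.0],
    "G3": [1.0, 30.0, 5.0],
}
reference = build_reference(expr, ["G1", "G2"])
```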

5. Preparation of relative expression quantity matrix

After the diploid gene expression reference system is obtained, the CNV of clinical samples can be calculated from the transcriptome. First, a relative expression quantity matrix is prepared. Specifically, genes overlapping the reference system are selected from the matrix generated in step 3 to form a new matrix. The expression level of each gene in the new matrix is divided by the average expression level of the corresponding gene in the reference system to generate the relative expression quantity matrix.

Specifically, suppose the expression level of gene X in the reference system is $\bar{E}_X$, and the expression levels of gene X in the $m$ samples to be tested are

$$E_X = (e_{X,1}, e_{X,2}, \ldots, e_{X,m})$$

The relative expression of gene X is then

$$R_X = \left(\frac{e_{X,1}}{\bar{E}_X}, \frac{e_{X,2}}{\bar{E}_X}, \ldots, \frac{e_{X,m}}{\bar{E}_X}\right)$$

Similarly, the relative expression matrix of all genes is obtained by applying the same division to every gene retained in the reference system.
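The division by the reference system can be sketched for a single sample as follows. The function name `relative_expression` and the values are hypothetical; the cap of 4 follows the limit stated in the Disclosure of Invention.

```python
def relative_expression(sample, reference, cap=4.0):
    """Divide each overlapping gene by its reference mean, capping the ratio at `cap`.

    sample:    {gene: expression in the sample under test}
    reference: {gene: mean expression in the diploid reference system}
    """
    return {
        gene: min(sample[gene] / ref_mean, cap)
        for gene, ref_mean in reference.items()
        if gene in sample
    }

# Hypothetical values: G2 is ten-fold above the reference and is capped at 4;
# G9 is not in the reference system and is ignored.
reference = {"G1": 12.0, "G2": 6.0}
sample = {"G1": 24.0, "G2": 60.0, "G9": 5.0}
rel = relative_expression(sample, reference)
```

A value near 1 means the gene is expressed at the diploid reference level; the later rescaling step maps the per-sample median to 2.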

6. Generation and correction of the relative expression matrix in units of chromosomes, and judgment of CNV

After the relative expression matrix is obtained, the average relative expression level of the genes on each chromosome is calculated. Specifically, using the alignment file downloaded from the UCSC genome database, each gene in the relative expression matrix is mapped to the chromosome on which it resides. The average expression level of the genes contained in each chromosome is calculated:

$$C_i = \frac{1}{n}\sum_{j=1}^{n} r_j$$

where $n$ is the number of genes belonging to chromosome $i$ in the diploid reference system and $r_1, \ldots, r_n$ are the relative expression levels of these genes. This is calculated once for each chromosome, giving a relative expression matrix in units of chromosomes, in which each row holds the relative expression levels of the chromosomes of one sample.
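The per-chromosome averaging can be sketched as follows. The `gene_to_chrom` mapping stands in for the gene-to-chromosome assignment derived from the UCSC file; its format, the function name, and the values are assumptions.

```python
def chromosome_means(rel, gene_to_chrom):
    """Average the relative expression of the genes on each chromosome.

    rel:           {gene: relative expression in one sample}
    gene_to_chrom: {gene: chromosome name}; genes without a mapping are skipped
    """
    sums, counts = {}, {}
    for gene, value in rel.items():
        chrom = gene_to_chrom.get(gene)
        if chrom is None:
            continue
        sums[chrom] = sums.get(chrom, 0.0) + value
        counts[chrom] = counts.get(chrom, 0) + 1
    return {chrom: sums[chrom] / counts[chrom] for chrom in sums}

# Hypothetical values for one sample.
rel = {"G1": 1.0, "G2": 1.5, "G3": 0.5, "G4": 2.0}
gene_to_chrom = {"G1": "chr1", "G2": "chr1", "G3": "chr2"}
chrom_expr = chromosome_means(rel, gene_to_chrom)
```

Applying this function to each sample in turn yields one row of the chromosome-level matrix per sample.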

Because of individual differences between samples, the expression level of a sample may be shifted relative to the reference system, while most chromosomes in most samples are normal diploid. Therefore, the expression levels of the 22 autosomes of each sample are multiplied by a per-sample coefficient so that the median chromosome expression level of the sample equals 2, which corresponds to a normal diploid. CNV is judged after the chromosome relative expression matrix has been normalized by this step.

For example, for sample A, the expression levels of the 22 autosomes are $C_1, C_2, \ldots, C_{22}$, and the median of these chromosome expression levels is denoted $M$. The chromosome expression coefficient $k$ of the sample is then:

$$k = \frac{2}{M}$$

After $k$ is obtained, the corrected expression levels of the 22 autosomes are:

$$C'_i = k \cdot C_i, \quad i = 1, \ldots, 22$$

The values in the resulting final chromosome relative expression matrix represent the chromosome copy number variation, generally referred to as CNV (Copy Number Variation).
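The median rescaling can be sketched as follows. The function name `normalize_to_diploid` and the example values are hypothetical, and five chromosomes are used instead of 22 only to keep the example short.

```python
from statistics import median

def normalize_to_diploid(chrom_expr, target=2.0):
    """Multiply all chromosome values of one sample by a single coefficient
    so that their median equals `target` (2 for a diploid)."""
    coefficient = target / median(chrom_expr.values())
    return {chrom: value * coefficient for chrom, value in chrom_expr.items()}

# Hypothetical sample: chr3 is elevated and chr4 is reduced relative to the rest.
chrom_expr = {"chr1": 1.0, "chr2": 1.0, "chr3": 1.5, "chr4": 0.5, "chr5": 1.0}
copy_number = normalize_to_diploid(chrom_expr)
```

Because the median is used rather than the mean, a minority of abnormal chromosomes does not shift the coefficient, which is why the approach relies on most chromosomes of a sample being normal diploid.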

7. Thresholding and single cell chromosome copy number visualization

The chromosome copy numbers are then partitioned by thresholds for clinical judgment. Clinically, a copy number of 1 represents a chromosomal deletion and a copy number of 3 represents a chromosomal duplication. However, in chimeric (mosaic) embryos some cells are normal diploid while others are monosomic or trisomic, and current omics sequencing mixes the two cell populations together, so the measured CNV is often not an integer. Therefore, following the thresholds used for DNA-based detection of chromosome copy number, 0-1.3 is classified as deletion (monosomy), 1.3-1.7 as chimeric deletion, 1.7-2.3 as normal diploid, 2.3-2.7 as chimeric duplication, and more than 2.7 as duplication (trisomy).
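The threshold ranges above can be expressed as a small classification function. This is a sketch only: the function name and label strings are hypothetical, and the handling of values that fall exactly on 1.3, 1.7, 2.3 or 2.7 is an assumption, since the text gives ranges without stating which side each boundary belongs to.

```python
def classify_copy_number(cn):
    """Map a normalized chromosome copy number to a call.

    Boundary handling (e.g. exactly 1.7 counted as normal) is an assumption.
    """
    if cn < 1.3:
        return "deletion (monosomy)"
    if cn < 1.7:
        return "chimeric deletion"
    if cn <= 2.3:
        return "normal diploid"
    if cn <= 2.7:
        return "chimeric duplication"
    return "duplication (trisomy)"

# Hypothetical normalized copy numbers for one sample.
calls = {chrom: classify_copy_number(value)
         for chrom, value in {"chr1": 2.0, "chr16": 1.0, "chr21": 3.0}.items()}
```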

Example 2 Identification of single-cell chromosome copy number using single-cell RNA sequencing data from blastocyst biopsies

1. Screening of samples and generation of gene expression matrices

After the gene expression level of each sample was obtained, the samples were quality-filtered. At the gene level, RPKM >1 is defined as expressed; at the sample level, samples in which more than 5000 genes have RPKM >1 are qualified samples.

After screening, a total of 39 samples were retained for the next calculation. Of these, 16 samples were shown by DNA sequencing (the gold standard) to be diploid with a normal chromosome number; these 16 were used to build the reference system and the other 23 were used for validation. After the expression level (RPKM) of each gene in each sample was obtained, a matrix whose columns are the samples and whose rows are the genes was prepared separately for the reference samples and for the validation samples.

2. Production of human normal diploid embryo gene expression reference system

The reference sample matrix is used for gene screening and for building the reference system.

First, the genes used to build the gene expression reference system are selected. The total gene expression of the samples is averaged, and the expression levels of each sample are then scaled up or down together so that its total equals this average.

Second, genes with an average expression value of RPKM <1 are deleted; these genes are regarded as not expressed and therefore do not influence karyotype judgment. The coefficient of variation (CV) is then calculated from the expression levels of the remaining genes. The larger the coefficient of variation, the more unstable the expression of the gene is considered to be.

The distribution of the coefficient of variation for each gene in the present 16 samples is shown in FIG. 1.

In the figure, the vertical axis represents the number of genes. The coefficients of variation of 75% of the genes are concentrated between 0 and 1, so genes with CV >1 are regarded as unstably expressed and are screened out, leaving 7390 stably expressed genes.
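In this example the top-25-percent rule reduces to a fixed cutoff of CV = 1. A minimal sketch of that variant, which also reports the fraction of genes at or below the cutoff, is shown below; the function name, the sample standard deviation, and the toy values are assumptions.

```python
from statistics import mean, stdev

def split_by_cv_cutoff(expr, cv_cutoff=1.0):
    """Return (stable genes, fraction of genes with CV <= cutoff).

    expr: {gene: [expression in each reference sample]}, means assumed > 0.
    """
    cv = {gene: stdev(values) / mean(values) for gene, values in expr.items()}
    stable = sorted(gene for gene, value in cv.items() if value <= cv_cutoff)
    return stable, len(stable) / len(cv)

# Hypothetical values; G2 varies strongly between samples.
expr = {
    "G1": [10.0, 12.0, 11.0],
    "G2": [1.0, 30.0, 5.0],
    "G3": [8.0, 9.0, 10.0],
}
stable, fraction = split_by_cv_cutoff(expr)
```

On real data, a fraction close to 0.75 at a cutoff of 1 would correspond to the distribution described for FIG. 1.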

After the unstable genes are screened out, all remaining genes are used to build the reference system and in the subsequent calculations.

The average expression level of each gene in the 16 standard samples is calculated and combined with the gene names into a new matrix. This matrix is the gene expression reference system of the normal human diploid embryo.

3. Preparation of relative expression quantity matrix

After the diploid gene expression reference system is obtained, it can be used to detect the chromosome copy number of the 23 validation samples.

First, a relative expression matrix with respect to the reference system is prepared. Specifically, the 7390 genes that overlap the reference system are selected from the matrix generated in step 1 to form a new matrix. The expression level of each gene in the new matrix is divided by the average expression level of the corresponding gene in the diploid embryo gene expression reference system to generate the relative expression quantity matrix.

4. Generation and correction of the relative expression matrix in units of chromosomes, and judgment of CNV

After the relative expression matrix is obtained, the average relative expression level of the genes on each chromosome is calculated. Specifically, using the alignment file downloaded from the UCSC genome database, each gene in the relative expression matrix is mapped to the chromosome on which it resides. The average expression level of the genes contained in each chromosome is calculated once per chromosome, giving the chromosome relative expression matrix.

Because of individual differences between samples, the expression level of a sample may be shifted relative to the reference system, while most chromosomes in most samples are normal diploid. Therefore, the expression levels of the 22 autosomes of each sample are multiplied by a per-sample coefficient so that the median chromosome expression level of the sample equals 2, which corresponds to a normal diploid. CNV is judged after the chromosome relative expression matrix has been normalized by this step.

A coefficient is obtained in this way for each of the 23 samples to be tested. The relative chromosome expression levels are then rescaled by these coefficients to give the final relative expression levels of the 22 autosomes in each sample.

The values in the resulting matrix are the calculated chromosome CNV (Copy Number Variation) values.

5. Thresholding and single cell chromosome copy number visualization

The chromosome copy numbers are then partitioned by thresholds for clinical judgment: 0-1.3 is classified as deletion (monosomy), 1.3-2.7 as normal diploid or chimeric, and more than 2.7 as duplication (trisomy). The CNV results determined by the method according to these thresholds are shown in FIG. 2.

6. Verification of the results

The results of identifying human embryonic cell chromosomal variation using single-cell transcriptome sequencing data and the gene expression reference system are shown in FIG. 2. The gold-standard results for this batch of samples (copy number variation from DNA whole-genome sequencing) and the results of the above method of the invention (single-cell chromosome copy number identified from RNA whole-transcriptome sequencing) are compared in the following table:

As can be seen from the above table, for all 23 samples the chromosome copy numbers identified by the whole-transcriptome sequencing method are completely consistent with the embryo results based on whole-genome sequencing. The diagnostic accuracy of the conventional method is only 43.4%, whereas that of the present invention reaches 100%.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for detecting chromosomal copy number variation in a single cell by transcriptome, said method comprising the steps of:

after the gene expression level is normalized by RPKM, deleting the genes whose average expression level in all diploid samples is less than 1; then calculating the coefficient of variation of the expression level of each remaining gene as follows:

$$CV = \frac{SD}{Mean}$$

where SD is the standard deviation of the gene expression in each sample, and Mean is the average expression level of the gene;

arranging the coefficient of variation values from high to low according to their distribution, selecting the genes whose coefficient of variation values are in the top 25 percent, regarding them as genes with unstable expression and screening them out, the remaining genes being genes with stable expression that are retained for the next calculation;

calculating the average expression level of each gene in the diploid standard samples, and forming a new matrix together with the genes, wherein the matrix is the gene expression reference system of a normal human diploid embryo;

(3) Preparation of relative expression quantity matrix

After the diploid gene expression reference system is obtained, the copy number variation of clinical samples is calculated from the transcriptome. First, a relative expression matrix is prepared: the genes that overlap with the reference system are selected from the generated matrix to form a new matrix, and the expression level of each gene in the new matrix is divided by the average expression level of the corresponding gene in the reference system to generate the relative expression matrix; wherein relative expression values exceeding 4 are limited to 4, so that the fluctuation of a single gene does not have an excessive influence on the whole. Suppose the expression level of gene X in the reference system is $\bar{E}_X$, and the expression levels of gene X in the $m$ samples to be tested are

$$E_X = (e_{X,1}, e_{X,2}, \ldots, e_{X,m})$$

The relative expression of gene X is then

$$R_X = \left(\frac{e_{X,1}}{\bar{E}_X}, \frac{e_{X,2}}{\bar{E}_X}, \ldots, \frac{e_{X,m}}{\bar{E}_X}\right)$$

Similarly, the relative expression matrix of all genes is obtained by applying the same division to every gene retained in the reference system.

(4) Generation, correction and judgment of copy number variation of relative expression matrix in chromosome unit

After the relative expression matrix is obtained, the average relative expression level of the genes on each chromosome is calculated. Specifically, using the alignment file downloaded from the UCSC genome database, each gene in the relative expression matrix is assigned to the chromosome on which it is located, and the average expression level of the genes contained in each chromosome is calculated:

$$C_i = \frac{1}{n}\sum_{j=1}^{n} r_j$$

where $n$ is the number of genes belonging to chromosome $i$ in the diploid reference system and $r_j$ is the relative expression level of the $j$-th of these genes.

After the chromosome relative expression matrix is normalized, the copy number variation is judged.

2. The method of claim 1, wherein the determining of copy number variation comprises:

the expression levels of the 22 autosomes in each sample are $C_1, C_2, \ldots, C_{22}$, and the median of these chromosome expression levels is denoted $M$; the chromosome expression coefficient $k$ of the sample is then:

$$k = \frac{2}{M}$$

after $k$ is obtained, the expression levels of the 22 autosomes of the sample are calculated as follows:

$$C'_i = k \cdot C_i, \quad i = 1, \ldots, 22$$

the values in the final chromosome relative expression matrix represent the chromosome copy number variation.