CN113012759B

CN113012759B - Method for calculating cffDNA content of male fetus based on X chromosome

Info

Publication number: CN113012759B
Application number: CN202011431098.8A
Authority: CN
Inventors: 袁梦兮; 马丑贤; 李�根; 黄文静; 蒋艳凰; 王振国; 杨仁武
Original assignee: Genetalks Bio Tech Changsha Co ltd
Current assignee: Genetalks Bio Tech Changsha Co ltd
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2022-08-12
Anticipated expiration: 2040-12-09
Also published as: CN113012759A

Abstract

The invention discloses a method for calculating the cffDNA content of a male fetus based on an X chromosome, which comprises the following steps: step S1: obtaining an original sequencing gene sequence; step S2: counting the sequencing gene sequence; step S3: standardizing the number of sequencing gene sequences; step S4: identifying the sex of the fetus; step S5: calculating copy baseline in female X chromosome window; step S6: calculating X chromosome prediction factors of male fetus and detecting abnormal points; step S7: obtaining the cffDNA content of the male fetus. The invention has the advantages of simple principle, simple and convenient operation, high accuracy, high detection efficiency and the like.

Description

Method for calculating cffDNA content of male fetus based on X chromosome

Technical Field

The invention mainly relates to the technical field of gene sequencing and biological information analysis, in particular to a male fetus cffDNA content calculation method based on an X chromosome.

Background

The discovery of cell-free fetal DNA molecules in the plasma of pregnant women brought prenatal testing into a non-invasive era, and circulating cell-free fetal DNA (cffDNA for short) has gradually come to be regarded as an important carrier for detecting fetal abnormalities non-invasively. Various methods of non-invasive prenatal testing (NIPT) based on high-throughput sequencing have been developed and are now rapidly moving into medical practice. In these protocols, the cffDNA content is a crucial parameter for controlling the quality of the test results and for their appropriate clinical interpretation.

The existing methods for estimating and predicting cffDNA content based on biological information mainly comprise the following methods:

(1) Calculation method based on the Y chromosome: this is the earliest method for estimating the fetal DNA content. In the early stage, paternally inherited genetic markers located on the Y chromosome, such as the SRY, DYS14 and ZFY genes, were used to verify the presence of cffDNA by a PCR chip, so that the cffDNA content could be estimated from the ratio of the sequence content of the Y chromosome to that of a given autosome. In the era of noninvasive prenatal testing based on massively parallel sequencing, the proportion of sequencing reads from the whole Y chromosome can be converted into the fetal DNA content; the method is intuitive and accurate and is suitable for measuring the cffDNA content of male fetuses.

(2) Method based on combining pregnant woman plasma sequencing data with paternal genotyping: briefly, at a single nucleotide polymorphism (SNP) site where the biological father and the biological mother are both homozygous but carry different genotypes, the fetus will be heterozygous, and the cffDNA content can then be quantified from the alleles inherited from the father. Although this method can assess the cffDNA content directly and accurately, in actual medical practice the genotyping result of the biological father is often unavailable, so the method is limited in practical application.

(3) To overcome the shortcomings of method (2) in practical application, practitioners proposed a cffDNA content calculation method based on targeted high-depth sequencing. The method performs ultra-high-depth targeted sequencing on DNA from the peripheral blood of a pregnant woman, models the allele counts corresponding to the four latent genotype combinations {AAaa, AAab, ABaa, ABab} of the pregnant woman and the fetus with a binomial mixture model, and calculates the cffDNA content by maximum likelihood estimation within the model. This method produces results very close to those of method (2), but has the following disadvantages: it requires ultra-high-depth sequencing of the sample, with the sequencing depth usually needing to exceed 120x to detect the fetal allele; and targeted sequencing covers only part of the genome and cannot detect fetal chromosomal abnormalities at the whole-genome level, so the method is limited in practical application.

(4) This method is based on the principle that any other allele observed at a maternal homozygous site is theoretically specific to the fetus. A gene chip is first used to genotype the leukocytes of the pregnant woman, and alleles that differ from the maternal homozygous genotype, and are therefore theoretically derived from the father, are then identified in the sequencing data of the peripheral blood of the pregnant woman. If the error bias introduced by sequencing and other technical causes is assumed to remain constant across samples, the cffDNA content is linearly related to the proportion of these fetal heterozygous sites, so a linear regression model can be built by performing the above analysis on samples of known cffDNA content and then used to predict the cffDNA content of unknown samples. Generally, when the sequencing data reaches more than 1M sequences, the correlation coefficient between the cffDNA content calculated by this method and that of method (2) can exceed 0.995. However, different sequencing platforms and genotyping chips may give rise to different noise distributions, so the parameters of a trained model are not universal; in addition, the proportion of heterozygous sites differs between populations of different ethnicities. These intrinsic factors all affect the accuracy of cffDNA prediction.

(5) The seqFF method, based only on low-depth sequencing data of pregnant woman peripheral blood, attempts to estimate cffDNA directly from conventional noninvasive prenatal testing data. Its basic procedure is as follows: single-end random sequencing of maternal peripheral blood is performed, and the normalized read counts in each 50 kb window on the autosomes (except chromosomes 13, 18 and 21) are used to fit a high-dimensional elastic net and a reduced-rank regression model. The Pearson correlation coefficient between this method and the Y-based calculation method can reach 0.93, but training the high-dimensional model requires a large training set, and the accuracy of the method cannot be guaranteed when the cffDNA content is lower than 5%.

(6) Method based on fetal methylation markers: the cffDNA content is estimated from methylation markers characteristic of the placenta. For example, sequences of the RASSF1A promoter region differ in methylation status between the pregnant woman and the fetus; when this region is cleaved with a methylation-sensitive enzyme, hypermethylated sequences from the fetus are unaffected while hypomethylated sequences from the pregnant woman are destroyed, which separates fetal sequences from the maternal background for analysis of the cffDNA content. However, bisulfite-based methylation sequencing methods are expensive, and bisulfite can degrade DNA fragments, making large-scale application in conventional non-invasive prenatal testing difficult.

(7) Method based on the fragment size distribution of cell-free DNA: the principle is that, in the cell-free DNA of a pregnant woman, maternal and fetal DNA fragments have different size distributions, with fetal fragments generally being shorter, so the cffDNA content can be estimated by paired-end sequencing from the ratio between fragments of different lengths. A linear model is fitted using the ratio of the number of fragments in the interval [100,150] to that in the interval [163,169] as the predictor. The correlation coefficient between this method and the Y-based cffDNA content is 0.83, and the accuracy hardly meets the requirements of non-invasive prenatal testing.

(8) Method based on cell-free DNA nucleosome positioning: studies show that the main peak of the fragment length of cell-free DNA in the peripheral blood of pregnant women is at 166 bp, with a series of small spike-like peaks spaced 10 bp apart, while the main peak of the fetal cell-free DNA fragment length is at 143 bp. Scientists speculate that a 166 bp fragment contains a nucleosome core and a linker, whereas DNA molecules at the 143 bp main peak lack the linker. Based on this assumed model, a method for predicting cffDNA from nucleosome positioning was developed, but its accuracy is low and hardly meets clinical requirements.

In conclusion, low-depth massively parallel sequencing of cell-free DNA in the peripheral blood of pregnant women remains the mainstream approach to noninvasive prenatal testing, and the Y-chromosome-based calculation is regarded as the gold standard for calculating the cffDNA of a male fetus. That method requires male cell-free DNA as a control in order to accurately estimate the Y chromosome content obtained with a specific sequencing platform and bioinformatics pipeline, and thereby to accurately infer the cffDNA content. However, running male control samples in routine non-invasive testing increases the sequencing cost and the complexity of process management; in addition, if the fetus has an aneuploidy of the Y chromosome (e.g., lacks the Y chromosome, or has multiple Y chromosomes), the cffDNA content cannot be accurately estimated.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides the method for calculating the content of the cffDNA of the male fetus based on the X chromosome, which has the advantages of simple principle, simple and convenient operation, high accuracy and high detection efficiency.

In order to solve the technical problems, the invention adopts the following technical scheme:

a male fetus cffDNA content calculation method based on an X chromosome comprises the following steps:

step S1: obtaining an original sequencing gene sequence;

step S2: counting the sequence of the sequencing gene;

step S3: standardizing the number of sequencing gene sequences;

step S4: identifying the sex of the fetus;

step S5: calculating copy baseline in female X chromosome window;

step S6: calculating X chromosome prediction factors of male fetus and detecting abnormal points;

step S7: obtaining the cffDNA content of the male fetus by the following formula:

wherein $i \in S_{male}$ denotes the sample with serial number $i$ in the sample set, the formula uses the set of predictors for sample $i$ remaining after removal of the outliers on the X chromosome together with the size of that set, and $pf_{i,X,w}$ is the $w$-th predictor for sample $i$ after removal of the outliers on the X chromosome.
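The formula of step S7 is not legible in this text. As a hedged reconstruction only, the principle given in the Detailed Description (expected X chromosome content $2 - f$ for a male fetus against $2$ for the female-fetus baseline) and the predictor defined in step S602 suggest one consistent form. The symbols $f_i$, $PF_{i,X}$ and $\overline{pf}_i$ are assumptions introduced here for illustration, and whether the averaging is applied before or after exponentiation cannot be determined from the surviving text.

```latex
\frac{NUM_{i,X,k}}{MNUM_{X,k}} \approx \frac{2 - f_i}{2}
\;\Longrightarrow\;
pf_{i,X,k} = \log_2 \frac{NUM_{i,X,k}}{MNUM_{X,k}} \approx \log_2\!\left(1 - \frac{f_i}{2}\right)
\;\Longrightarrow\;
f_i \approx 2\left(1 - 2^{\,\overline{pf}_i}\right),
\qquad
\overline{pf}_i = \frac{1}{|PF_{i,X}|} \sum_{w} pf_{i,X,w}, \quad i \in S_{male}
```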

As a further improvement of the process of the invention: in the step S1, the peripheral blood sample of the pregnant woman for noninvasive prenatal detection is subjected to low-depth sequencing to obtain an original sequencing gene sequence.

As a further improvement of the process of the invention: in the step S1, the original sequencing gene sequence is preprocessed; the pre-processing includes aligning the original sequenced gene sequence to a human reference genome and de-duplicating the results of the alignment.

As a further improvement of the process of the invention: the step S2 includes:

counting the number $UM_{i,j}$ of uniquely aligned gene fragments on autosomes 1-22 and the sex chromosomes of each sample, where $1 \le i \le n$ and $j \in \{1, 2, \ldots, 22, X, Y\}$;

counting the number $UM_{i,X,k}$ of uniquely aligned gene fragments in each window of size $K$ on the X chromosome, where $1 \le i \le n$, $k$ is the window index, and $|X|$ is the length of the X chromosome;

sex chromosome statistics do not account for the unique aligned gene sequence fragments within the pseudoautosomal region.
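The counting in step S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `(chrom, pos)` input format, the 0-based window index, and the pseudoautosomal coordinates (values commonly listed for hg19) are all assumptions and should be checked against the reference genome actually used.

```python
from collections import Counter

# Pseudoautosomal regions (1-based, inclusive) as commonly listed for hg19.
# Assumption for illustration: verify against the reference actually used.
PAR_HG19 = {
    "X": [(60001, 2699520), (154931044, 155260560)],
    "Y": [(10001, 2649520), (59034050, 59363566)],
}

def in_par(chrom, pos, par=PAR_HG19):
    """True if a sex-chromosome position falls inside a pseudoautosomal region."""
    return any(start <= pos <= end for start, end in par.get(chrom, []))

def count_reads(reads, window_size):
    """Count uniquely aligned, de-duplicated reads (hypothetical helper).

    reads: iterable of (chrom, pos) with chrom in "1".."22", "X", "Y".
    Returns (um, um_x): reads per chromosome and reads per X window,
    where window k (0-based) covers positions
    k * window_size + 1 .. (k + 1) * window_size.
    """
    um = Counter()
    um_x = Counter()
    for chrom, pos in reads:
        if in_par(chrom, pos):
            continue  # PAR reads are left out of the sex-chromosome statistics
        um[chrom] += 1
        if chrom == "X":
            um_x[(pos - 1) // window_size] += 1
    return um, um_x
```

Here `um[j]` plays the role of $UM_{i,j}$ and `um_x[k]` the role of $UM_{i,X,k}$ for a single sample.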

As a further improvement of the process of the invention: the step S3 includes the steps of:

step S301: calculating, for each sample, the total number of uniquely aligned gene fragments on autosomes 1-22: $UM_i = \sum_{j=1}^{22} UM_{i,j}$, $1 \le i \le n$;

Step S302: calculating the relative contents of the X chromosome and the Y chromosome of each sample: $xc_i = UM_{i,X} / UM_i$, $yc_i = UM_{i,Y} / UM_i$, $1 \le i \le n$;

Step S303: calculating the standard (normalized) value $NUM_{i,X,k}$ of the number of uniquely aligned gene fragments in $m$ windows of size $K$ on the X chromosome of each sample, $1 \le i \le n$.
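A minimal sketch of step S3 for one sample follows. The autosomal total and the relative X and Y contents follow the text directly. The formula of step S303 is not legible here, so the sliding form used below (sum the counts of `m` consecutive windows of size K, slide by one window, divide by the autosomal total) is an assumption based on the "large window statistics and small window sliding" description, and `normalize` is a hypothetical name.

```python
def normalize(um, um_x, n_windows, m):
    """Normalize the counts of one sample (hypothetical helper).

    um: mapping chromosome name -> uniquely aligned read count.
    um_x: mapping X window index (0-based) -> read count.
    n_windows: number of size-K windows on the X chromosome.
    m: number of consecutive small windows merged into one large window.
    """
    autosomes = [str(j) for j in range(1, 23)]
    um_i = sum(um.get(j, 0) for j in autosomes)   # S301: autosomal total
    if um_i == 0:
        raise ValueError("no autosomal reads")
    xc = um.get("X", 0) / um_i                    # S302: relative X content
    yc = um.get("Y", 0) / um_i                    # S302: relative Y content
    # S303 (assumed form): large window = m consecutive small windows,
    # sliding by one small window at a time.
    num = []
    for k in range(n_windows - m + 1):
        total = sum(um_x.get(l, 0) for l in range(k, k + m))
        num.append(total / um_i)
    return um_i, xc, yc, num
```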

As a further improvement of the process of the invention: in the step S4, when the Y chromosome content $yc_i$ is greater than a given threshold $\sigma$, the fetus is judged to be a male fetus; otherwise, the fetus is judged to be a female fetus.

As a further improvement of the process of the invention: the threshold σ is 0.0005.
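The sex call of step S4 is a single threshold comparison. The sketch below uses the threshold 0.0005 given in the text; the function names and the dictionary input are assumptions for illustration.

```python
SIGMA = 0.0005  # threshold on the relative Y chromosome content, from the text

def classify_fetal_sex(yc, sigma=SIGMA):
    """Return "male" when the relative Y content exceeds sigma, else "female"."""
    return "male" if yc > sigma else "female"

def split_by_sex(yc_by_sample, sigma=SIGMA):
    """Split sample ids into (S_female, S_male) lists by the threshold rule."""
    s_female, s_male = [], []
    for sample_id, yc in yc_by_sample.items():
        target = s_male if classify_fetal_sex(yc, sigma) == "male" else s_female
        target.append(sample_id)
    return s_female, s_male
```

A value exactly equal to the threshold is called female, since the text requires the content to be greater than $\sigma$.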

As a further improvement of the process of the invention: the step S5 includes the following steps:

step S501: selecting the samples whose fetal sex was identified as female in the previous step to form the sample set $S_{female}$;

Step S502: calculating, for each window, the median of the standard values of the sequence fragments in $m$ windows of size $K$ on the X chromosome over all female-fetus samples: $MNUM_{X,k} = \mathrm{median}(NUM_{i,X,k})$, $i \in S_{female}$.
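The female baseline of step S5 is a per-window median across the female-fetus samples. A minimal sketch, assuming each sample's normalized values are stored as a list indexed by window (the function name and data layout are hypothetical):

```python
from statistics import median

def female_baseline(num_by_sample, female_ids):
    """Per-window median of normalized X values over female-fetus samples.

    num_by_sample: mapping sample id -> list of NUM values, one per window.
    female_ids: ids of the samples in S_female.
    """
    rows = [num_by_sample[i] for i in female_ids]
    if not rows:
        raise ValueError("S_female is empty")
    # zip(*rows) regroups the values window by window.
    return [median(column) for column in zip(*rows)]
```

Using the median rather than the mean keeps the baseline insensitive to a few female-fetus samples that carry a copy number change in a given window.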

As a further improvement of the process of the invention: the step S6 includes:

step S601: selecting the samples whose fetal sex was identified as male to form the sample set $S_{male}$;

Step S602: calculating the predictor in $m$ windows of size $K$ on the X chromosome of each sample in $S_{male}$: $pf_{i,X,k} = \log_2(NUM_{i,X,k} / MNUM_{X,k})$, $i \in S_{male}$;

Step S603: outliers in the predictor are removed.

As a further improvement of the process of the invention: the step S603 includes:

step S6031: for sample $i$, calculating the 0.05 and 0.95 quantiles $\alpha_{i,0.05}$ and $\alpha_{i,0.95}$ of all predictors;

Step S6032: removing the predictors that are less than $\alpha_{i,0.05}$ or greater than $\alpha_{i,0.95}$, and recording the set of predictors remaining after outlier removal, $i \in S_{male}$.
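Steps S602 and S603 can be sketched as below. The log2 ratio and the 0.05 and 0.95 quantile cut-offs come from the text; the linear-interpolation quantile and the skipping of windows with a zero value are assumptions, since the text does not specify either, and all function names are hypothetical.

```python
import math

def predictors(num_i, baseline):
    """log2 ratio of a male-fetus sample to the female baseline, per window."""
    # Windows with a non-positive value are skipped (assumption: log2 undefined).
    return [math.log2(x / b) for x, b in zip(num_i, baseline) if x > 0 and b > 0]

def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics (assumed rule)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def trim_outliers(pf, lo=0.05, hi=0.95):
    """Keep only the predictors lying between the lo and hi quantiles."""
    ordered = sorted(pf)
    a_lo, a_hi = quantile(ordered, lo), quantile(ordered, hi)
    return [p for p in pf if a_lo <= p <= a_hi]
```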

Compared with the prior art, the invention has the advantages that:

1. according to the method for calculating the content of the male cffDNA based on the X chromosome, disclosed by the invention, the content of the X chromosome is selected as a prediction factor of the content of the male cffDNA, so that the problem that the content of the cffDNA cannot be accurately estimated by using the Y chromosome when the Y chromosome of a fetus has aneuploid abnormality is solved.

2. According to the method for calculating the content of the male fetus cffDNA based on the X chromosome, the X chromosome is selected as a prediction factor of the content of the male fetus cffDNA, and because the X chromosome is 3 times as long as the Y chromosome, the required sequencing data amount is 1/3 of the Y chromosome method under the same statistical accuracy, so that the detection cost is further reduced.

3. According to the method for calculating the content of the male fetal cffDNA based on the X chromosome, the detection of male free DNA is not needed as a comparison, but the sequencing data of the peripheral blood free DNA of the pregnant woman of the female fetus is used as a comparison, the data can be directly obtained in the conventional noninvasive prenatal detection, and the additional sequencing cost and the process management complexity are reduced.

4. According to the method for calculating the content of the cffDNA of the male fetus based on the X chromosome, disclosed by the invention, data are up-sampled in a sliding window mode, so that the statistical effect is improved, the requirement on the sequencing quantity is reduced, and the detection cost is further reduced.

Drawings

FIG. 1 is a schematic flow chart of the present invention in a specific application example.

Detailed Description

The invention will be described in further detail below with reference to the drawings and specific examples.

There are 23 pairs of chromosomes in a human, where chromosomes 1-22 are autosomes and X and Y are sex chromosomes; the sex chromosome combination in males is XY, and the sex chromosome combination in females is XX. For a male fetus, the cell-free DNA in the peripheral blood of the pregnant woman is a mixture of maternal DNA and fetal DNA, so the combined sex chromosome complement of the pregnant woman and the fetus is XX (maternal) plus XY (fetal). Assuming that the cffDNA content is $f$, the expected value of the X chromosome content is $2 \times (1 - f) + f = 2 - f$. Since the X chromosome content is correlated with the cffDNA content, the cffDNA content can be predicted indirectly from the X chromosome; further, the same property holds for each fixed-length sub-region of the X chromosome.
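The relationship above can be checked numerically. The following is a minimal sketch rather than part of the patent text: the function names are hypothetical, and the reference value of 2 for a female-fetus sample assumes that both mother and fetus carry two X copies.

```python
def expected_x_ratio(f):
    """Expected X content of a male-fetus sample relative to a female-fetus sample.

    f: cffDNA content as a fraction between 0 and 1.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    x_male_fetus = 2.0 * (1.0 - f) + f   # mother XX, fetus XY: equals 2 - f
    x_female_fetus = 2.0                 # mother XX, fetus XX
    return x_male_fetus / x_female_fetus

def fraction_from_ratio(ratio):
    """Invert ratio = (2 - f) / 2 to recover f."""
    return 2.0 * (1.0 - ratio)
```

For example, a cffDNA content of 10% lowers the relative X content to 0.95, which is the small shift the method has to resolve.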

As shown in fig. 1, the invention provides a method for calculating the cffDNA content of male fetus based on X chromosome based on the above principle, which comprises the following steps:

step S1: obtaining an original sequencing gene sequence;

performing low-depth sequencing on n (n is greater than 30) non-invasive prenatal detection pregnant woman peripheral blood samples to obtain an original sequencing gene sequence, and preprocessing the original sequencing gene sequence;

step S2: counting the sequencing gene sequence;

step S3: standardizing the number of sequencing gene sequences;

step S4: identifying the sex of the fetus;

step S5: calculating copy baseline in female X chromosome window;

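step S6: calculating X chromosome prediction factors of male fetus and detecting abnormal points;

step S7: obtaining the cffDNA content of the male fetus from the predictors, where $pf_{i,X,w}$ is the $w$-th predictor of sample $i$ after removal of the outliers on the X chromosome.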

In a specific application example, in step S1, the original sequenced gene sequence is further preprocessed; the pre-processing includes aligning the original sequenced gene sequence to the human reference genome (e.g., in practice, the genome version may be chosen to be hg19), and de-duplicating the results.

In a specific application example, the step S2 includes:

counting the number $UM_{i,X,k}$ of gene fragments uniquely aligned within each window of size $K$ (for example, $K = 5000$) on the X chromosome, where $1 \le i \le n$, $k$ is the window index, and $|X|$ is the length of the X chromosome;

sex chromosome statistics do not account for the unique aligned gene sequence segments within the pseudoautosomal region (PAR region).

In a specific application example, the step S3 may include the following steps:

each step being carried out for every sample $i$, $1 \le i \le n$;

Step S303: calculating the standard (normalized) value of the number of gene fragments uniquely aligned within $m$ (for example, $m = 12$) windows of size $K$ on the X chromosome of each sample.

A "large-window statistics, small-window sliding" scheme is adopted so that, under low-depth sequencing, each window contains enough sequencing fragments while the number of windows also remains sufficient.

In a specific application example, in the step S4, when the Y chromosome content $yc_i$ is greater than a given threshold $\sigma$, the fetus is judged to be a male fetus; otherwise, the fetus is judged to be a female fetus. According to actual needs, $\sigma$ can be set to 0.0005 in a specific implementation, based on an expert database.

In a specific application example, the step S5 may include the following steps:

In a specific application example, the step S6 may include the following steps:

step S601: selecting the samples whose sex identification result in step S4 is male fetus to form the sample set $S_{male}$;

Step S603: removing abnormal points in the predictors. Because the content of maternal cell-free DNA is significantly higher than the cffDNA content, a copy number variation of the pregnant woman within a statistical window causes the estimated cffDNA content to be affected by the maternal background sequence, which distorts the calculation result.

Further, step S603 may include the following flow:

$i \in S_{male}$.
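The reason for removing abnormal points can be illustrated with noise-free synthetic numbers. This sketch is hedged throughout: the inversion $f = 2(1 - 2^{\mathrm{mean}(pf)})$ is an assumed form consistent with the $2 - f$ principle, not a formula recovered from the text; the 100 windows and the maternal duplication covering 3 of them are invented; and the affected windows are dropped by construction as a stand-in for the quantile trimming of step S603.

```python
import math
from statistics import mean

def estimate_f(pf):
    """Assumed inversion of pf = log2((2 - f) / 2), applied to the mean predictor."""
    return 2.0 * (1.0 - 2.0 ** mean(pf))

f_true = 0.10

# Unaffected window: mother has 2 X copies, male fetus has 1.
pf_normal = math.log2((2.0 * (1.0 - f_true) + f_true) / 2.0)

# Window inside a maternal duplication: mother has 3 copies, fetus still 1.
pf_dup = math.log2((3.0 * (1.0 - f_true) + f_true) / 2.0)

pf_all = [pf_normal] * 97 + [pf_dup] * 3          # invented synthetic sample

biased = estimate_f(pf_all)                        # maternal CNV left in place
clean = estimate_f([p for p in pf_all if p == pf_normal])  # CNV windows removed
```

With these numbers `clean` recovers the true 10%, while `biased` falls to roughly 8%, even though only 3% of the windows are affected.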

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims

1. A male fetus cffDNA content calculation method based on an X chromosome is characterized by comprising the following steps:

step S1: obtaining an original sequencing gene sequence;

step S2: counting the sequencing gene sequence;

step S3: standardizing the number of sequencing gene sequences;

step S4: identifying the sex of the fetus;

step S5: calculating copy baseline in female X chromosome window;

wherein the size of the predictor set enters the formula, and $pf_{i,X,w}$ is the predictor with index $w$ for sample $i$ after removal of the outliers on the X chromosome;

the step S5 includes the following steps:

Step S502: calculating the median of the standard values of m window sequence fragments with the size of K on X chromosome of all female fetus samples

The step S6 includes:

Step S602: calculating S _male Predictors in m windows of size K on each sample X chromosome in the set

Step S603: outliers in the predictor are removed.

2. The method for calculating the cffDNA content in male fetus according to claim 1, wherein in step S1, the peripheral blood sample of the pregnant woman for noninvasive prenatal testing is subjected to low-depth sequencing to obtain the original sequencing gene sequence.

3. The method for calculating the cffDNA content in male fetus according to claim 2, wherein in step S1, the original sequenced gene sequence is preprocessed; the pre-processing includes aligning the original sequenced gene sequence to a human reference genome and de-duplicating the results of the alignment.

4. The method for calculating the content of cffDNA of X-chromosome-based male fetus according to claim 1,2 or 3, wherein said step S2 comprises:

| X | is the length of the X chromosome;

sex chromosome statistics do not account for the unique aligned gene sequence fragments within pseudoautosomal regions.

5. The method for calculating the X chromosome-based male fetus cffDNA content according to claim 1,2 or 3, wherein the step S3 comprises the steps of:

6. The method for calculating the cffDNA content in male fetus according to claim 1, 2 or 3, wherein in step S4, when the Y chromosome content $yc_i$ is greater than the given threshold $\sigma$, the fetus is judged to be a male fetus; otherwise, the fetus is judged to be a female fetus.

7. The method according to claim 6, wherein the threshold σ is 0.0005.

8. The method for calculating the cffDNA content in X-chromosome-based male fetus according to claim 1, wherein the step S603 comprises:

$i \in S_{male}$.