WO2023010242A1 - Method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data - Google Patents

Method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data

Info

Publication number
WO2023010242A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
reads
regions
machine learning
regression model
Prior art date
Application number
PCT/CN2021/110058
Other languages
English (en)
French (fr)
Inventor
张通达
白勇
詹念吉
林润铭
鞠佳
金鑫
Original Assignee
深圳华大生命科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大生命科学研究院
Priority to PCT/CN2021/110058
Publication of WO2023010242A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • the invention belongs to the field of biotechnology, and more specifically, the invention provides a method and system for estimating the concentration of fetal nucleic acid in non-invasive prenatal genetic testing data.
  • the cell-free DNA in the plasma of pregnant women can be used to analyze the health status of the fetus.
  • Non-invasive testing is based on the cell-free DNA in the plasma of pregnant women to speculate whether the fetus suffers from genetic diseases such as trisomy syndrome.
  • a key parameter in noninvasive data analysis is fetal DNA concentration.
  • for male fetuses, the proportion of Y chromosome data can be used directly to infer fetal DNA concentration, whereas for female fetuses other algorithms have to be developed.
  • the paper "Maternal plasma fetal DNA fractions in pregnancies with low and high risks for fetal chromosomal aneuploidies” discloses a method for estimating fetal DNA concentration using Y chromosome data, which is only applicable to male fetuses.
  • the paper "Size-based molecular diagnostics using plasma DNA for noninvasive prenatal testing" discloses a method for estimating fetal DNA concentration from the size difference between maternal and fetal cell-free nucleic acid (cfDNA) fragments, based on the ratio of fragments with insert lengths of 100 to 150 bp to those with insert lengths of 163 to 169 bp. This method requires paired-end (PE) sequencing or electrophoresis experiments to estimate fetal nucleic acid concentration, so the single-end (SE) sequencing data of NIPT is not suitable for it.
  • PE paired-end
  • SE single-end
  • the paper "Determination of fetal DNA fraction from the plasma of pregnant women using sequence read counts” describes a method called seqFF, which divides the whole genome into windows and counts the number of sequencing reads in all windows. Due to the direct window division, a single window may contain both the fetal data-enriched area and the pregnant woman's data-enriched area, which reduces the resolution. Although this method does not require additional data and is not limited to male fetuses, it can only be used to estimate samples with high fetal nucleic acid concentrations, and is not suitable for samples with fetal nucleic acid concentrations within 5%.
  • there is therefore a need for a method for estimating the concentration of fetal nucleic acid in the plasma cell-free DNA of a pregnant woman.
  • the purpose of the present invention is to provide a method for estimating the concentration of fetal nucleic acid in plasma free DNA of pregnant women.
  • the present invention provides a method of estimating the concentration of fetal nucleic acid in noninvasive prenatal genetic testing data, said method comprising:
  • the machine learning model is trained with data from a plurality of pregnant women carrying male fetuses; for each of these women, the data includes the fetal nucleic acid concentration calculated from the Y chromosome sequencing depth and the copy ratios of her reads on the plurality of gene regions and/or the plurality of promoter regions.
  • the free nucleic acid fragment comes from the peripheral plasma of a pregnant woman, the liver of a pregnant woman and/or the placenta.
  • the free nucleic acid fragment is free DNA.
  • the plurality of gene regions and/or the plurality of promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
  • the length of the multiple gene regions is 7-2473538 bp, and the length of the multiple promoter regions is 199-43798 bp.
  • the number of the plurality of gene regions and/or the plurality of promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, and most preferably more than 200,000.
  • the machine learning model is a machine learning regression model.
  • the machine learning regression model includes a linear regression model and a nonlinear regression model.
  • the machine learning regression model is a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on a random forest algorithm or a regression model based on a deep neural network.
  • the plurality of gene regions and/or the plurality of promoter regions are from ENSEMBLE.
  • the fetal nucleic acid concentration (Fraction_fetal) calculated from Y chromosome depth is:
  • Depth_Y is the average coverage depth of the Y chromosome
  • Depth_autosomes is the average coverage depth of the sequencing data on the autosomes, preferably the autosomes do not include chromosomes 13, 18 and 21.
  • the copy ratio of the reads on the plurality of gene regions and/or the plurality of promoter regions is:
  • copy_ratio_ip is the copy ratio of region p of sample i
  • reads_number_ip is the number of reads in region p of sample i
  • length_p is the total length of region p
  • reads_number_i is the number of reads in sample i
  • length_ref is the total length of the reference genome; there are n samples and m regions in total.
  • the present invention provides a method for constructing a machine learning model for estimating the concentration of fetal nucleic acid in non-invasive prenatal genetic testing data in the first aspect of the present invention, the method comprising:
  • the free nucleic acid fragments are from peripheral plasma of pregnant women, liver of pregnant women and/or placenta.
  • the free nucleic acid fragment is free DNA.
  • the plurality of gene regions and/or the plurality of promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
  • the length of the multiple gene regions is 7-2473538bp, and the length of the multiple promoter regions is 199-43798bp
  • the number of the plurality of gene regions and/or the plurality of promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, most preferably more than 200,000
  • the machine learning model is a machine learning regression model.
  • the machine learning models include linear regression models and nonlinear regression models.
  • the machine learning model is a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on a random forest algorithm or a regression model based on a deep neural network.
  • the plurality of gene regions and/or the plurality of promoter regions are from ENSEMBLE.
  • the fetal nucleic acid concentration (Fraction_fetal) calculated from Y chromosome depth is:
  • Depth_Y is the average coverage depth of the Y chromosome
  • Depth_autosomes is the average coverage depth of the sequencing data on the autosomes, preferably the autosomes do not include chromosomes 13, 18 and 21.
  • the copy ratio of the reads on the plurality of gene regions and/or the plurality of promoter regions is:
  • copy_ratio_ip is the copy ratio of region p of sample i
  • reads_number_ip is the number of reads in region p of sample i
  • length_p is the total length of region p
  • reads_number_i is the number of reads in sample i
  • length_ref is the total length of the reference genome; there are n samples and m regions in total.
  • the present invention provides a machine learning model for estimating the concentration of fetal nucleic acid in non-invasive prenatal genetic testing data in the first aspect of the present invention, the machine learning model is constructed according to the method of the second aspect of the present invention .
  • the present invention provides a system for estimating the concentration of fetal nucleic acid in noninvasive prenatal genetic testing data, the system comprising:
  • a sequencing data acquisition module for obtaining sequencing data of cell-free nucleic acid fragments of a pregnant woman, wherein the sequencing data includes a number of reads;
  • a copy ratio calculation module for calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions;
  • a model training module for training with data from multiple pregnant women carrying male fetuses, wherein the training comprises inputting the fetal nucleic acid concentration calculated from the Y chromosome sequencing depth and the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into a machine learning model to obtain a trained machine learning model;
  • a prediction module for predicting with data from the sample of the pregnant woman to be tested, wherein the prediction comprises inputting the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into the trained machine learning model to predict the fetal nucleic acid concentration.
  • the free nucleic acid fragments are from peripheral plasma of pregnant women, liver of pregnant women and/or placenta.
  • the free nucleic acid fragments are free DNA.
  • the plurality of gene regions and/or the plurality of promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
  • the length of the multiple gene regions is 7-2473538 bp, and the length of the multiple promoter regions is 199-43798 bp.
  • the number of the plurality of gene regions and/or the plurality of promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, most preferably more than 200,000
  • the machine learning model is a machine learning regression model.
  • the machine learning regression model includes a linear regression model and a nonlinear regression model.
  • the machine learning regression model is a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on a random forest algorithm or a regression model based on a deep neural network.
  • the plurality of gene regions and/or the plurality of promoter regions are from ENSEMBLE.
  • the fetal nucleic acid concentration (Fraction_fetal) calculated from Y chromosome depth is:
  • Depth_Y is the average coverage depth of the Y chromosome
  • Depth_autosomes is the average coverage depth of the sequencing data on the autosomes, preferably the autosomes do not include chromosomes 13, 18 and 21.
  • the copy ratio of the reads on the plurality of gene regions and/or the plurality of promoter regions is:
  • copy_ratio_ip is the copy ratio of region p of sample i
  • reads_number_ip is the number of reads in region p of sample i
  • length_p is the total length of region p
  • reads_number_i is the number of reads in sample i
  • length_ref is the total length of the reference genome; there are n samples and m regions in total.
  • the method of the present invention does not require high-depth sequencing, PE sequencing, methylation sequencing, and additional sequencing of parents.
  • the method of the present invention is capable of estimating fetal nucleic acid concentrations using only NIPT data.
  • Figure 1 shows the predicted fetal nucleic acid concentrations of 600 pregnant women's samples, calculated with a model trained on sample data from 2,400 pregnant women carrying male fetuses.
  • in the present invention, reads are counted over gene regions and promoter regions that have biological functions.
  • compared with a scheme that simply divides the genome into windows of a fixed length, a region unit with a biological function is less likely to contain both a region highly enriched in maternal data and a region highly enriched in fetal data, so the resolution is relatively higher.
  • the gene region and the promoter region include regions obtained by extending a gene or promoter.
  • the present invention includes training a machine learning model with the data of multiple pregnant women with male fetuses to obtain a trained machine learning model.
  • the training comprises inputting the fetal nucleic acid concentration calculated from the Y chromosome depth and the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into the machine learning model.
  • the copy ratios of the reads in multiple gene regions and/or multiple promoter regions are input into the trained machine learning model to predict the fetal nucleic acid concentration.
  • for the training samples and the test samples, it is preferable to use the same multiple gene regions and/or multiple promoter regions. Because the method of the present invention is a calculation method developed on data from genes and promoter regions with biological functions, its resolution is higher, and it can be applied to samples with fetal nucleic acid concentrations of 5% or less.
  • the machine learning model is a machine learning regression model, including, for example, a linear regression model and a nonlinear regression model.
  • the machine learning regression model may be a ridge regression model, a lasso regression model, a regression model based on a random forest algorithm or a regression model based on a deep neural network.
  • Ridge regression is a multiple linear regression model whose essence is to fit a linear function such that the target variable y is a linear combination of independent variables x (also known as features). Ridge regression further reduces the risk of model overfitting by imposing penalties on the independent variable coefficients (ie, characteristic coefficients) of the linear function (ie, performing L2 regularization processing).
  • for both the training samples and the test samples, plasma cell-free DNA (cfDNA) was extracted from the pregnant women and sequenced to obtain the sequencing reads of each sample before subsequent analysis.
  • in the first step, the raw sequencer output (fq format) of all samples used for model training and prediction passes quality control and is then aligned to the human reference genome hg38 using the samse mode of BWA; Picard is used to remove duplicate reads from the alignments and to calculate the duplication rate, and the base quality score recalibration (BQSR) function of a variant-calling toolkit such as GATK is used for local correction of the alignments.
  • in the second step, the gene region file and regulatory region file for hg19/hg38 are downloaded from ENSEMBLE, and the sex chromosomes and chromosomes 13, 18 and 21, the three chromosomes on which trisomy may occur, are filtered out.
  • the third step is to train the model.
  • for male fetuses, the average coverage depth of the sequencing data on the autosomes (Depth_autosomes) and the average coverage depth of the Y chromosome (Depth_Y) are calculated, from which the nucleic acid concentration of male fetuses (Fraction_fetal) is obtained.
  • the calculation formula is:
  • copy_ratio_ip is the copy ratio of region p of sample i
  • reads_number_ip is the number of reads in region p of sample i that pass quality control (MAPQ>30)
  • length_p is the total length of region p
  • reads_number_i is the number of reads in sample i that pass quality control (MAPQ>30)
  • length_ref is the total length of the reference genome; there are n samples and m regions in total.
  • m is 54119 gene regions, 160209 promoter regions, or 214328 for the combination of both.
  • the length of the gene region is 7-2473538 bp; the length of the promoter region is 199-43798 bp.
  • in the fourth step, the fetal nucleic acid concentration estimated from the Y chromosome data is used as the Y value
  • and the copy ratios of all genes and promoters are used as the X values to train a machine learning model such as ridge regression
  • the ridge regression model is implemented using the LinearRegression module in the sklearn python package.
  • sample data from 2400 pregnant women carrying male fetuses were used for training, with L2 regularization and cross-validation, to obtain a weight estimate for each region; the trained ridge regression model was then saved.
  • the feature weights in this model are equivalent to the weight coefficients β of the linear regression model, namely:
  • y_i is the nucleic acid concentration of the male fetus calculated from the Y chromosome depth of sample i
  • {β_0 … β_p} are the coefficients of the features in the model
  • {x_i1 … x_ip} are the copy ratios of all p gene and promoter regions in sample i.
  • the method of the present invention can introduce phenotypic data such as gestational age and age as features other than genes and promoter regions, and use the same method for calculation to increase calculation accuracy.
  • the fifth step is to calculate the concentration of fetal nucleic acid.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

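The invention belongs to the field of biotechnology and discloses a method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data. The method comprises: (1) obtaining sequencing data of cell-free nucleic acid fragments of a pregnant woman to be tested, wherein the sequencing data comprises a number of reads; (2) calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions; (3) inputting the copy ratios of the reads on the multiple gene regions and/or multiple promoter regions into a trained machine learning model to predict the fetal nucleic acid concentration, wherein the machine learning model is trained with data from multiple pregnant women carrying male fetuses, the data comprising, for each of these women, the fetal nucleic acid concentration calculated from Y chromosome sequencing depth and the copy ratios of her reads on the multiple gene regions and/or multiple promoter regions. The method of the invention can estimate fetal nucleic acid concentration using NIPT data alone.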

Description

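Method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data
Technical Field
The invention belongs to the field of biotechnology. More specifically, the invention provides a method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data.
Background
Cell-free DNA in the plasma of pregnant women can be used to analyze the health of the fetus. Non-invasive testing uses maternal plasma cell-free DNA to infer whether the fetus has a genetic disease such as a trisomy syndrome. A key parameter in non-invasive data analysis is the fetal DNA concentration. For male fetuses, the fetal DNA concentration can be inferred directly from the proportion of Y chromosome data, whereas for female fetuses other algorithms have to be developed. The paper "Maternal plasma fetal DNA fractions in pregnancies with low and high risks for fetal chromosomal aneuploidies" discloses a method for estimating fetal DNA concentration from Y chromosome data, which is applicable only to male fetuses. The paper "Size-based molecular diagnostics using plasma DNA for noninvasive prenatal testing" discloses a method for estimating fetal DNA concentration from the size difference between maternal and fetal cell-free nucleic acid (cfDNA) fragments, based on the ratio of fragments with insert lengths of 100 to 150 bp to those with insert lengths of 163 to 169 bp. This method requires paired-end (PE) sequencing or electrophoresis experiments to estimate fetal nucleic acid concentration, so the single-end (SE) sequencing data of NIPT is not suitable for it. The paper "Determination of fetal DNA fraction from the plasma of pregnant women using sequence read counts" describes a method called seqFF, which divides the whole genome into windows and counts the sequencing reads in every window. Because the windows are defined directly, a single window may contain both a region enriched in fetal data and a region enriched in maternal data, which lowers the resolution. Although this method needs no additional data and is not limited to male fetuses, it can only be used for samples with relatively high fetal nucleic acid concentrations and is not suitable for samples with fetal nucleic acid concentrations of 5% or less. The paper "Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus" describes a calculation method based on parental and fetal alleles, which requires high-depth sequencing. The paper "Noninvasive prenatal methylomic analysis by genomewide bisulfite sequencing of maternal plasma DNA" describes a calculation method based on methylation features, which requires methylation sequencing.
There is therefore a need for a method for estimating the fetal nucleic acid concentration in the plasma cell-free DNA of pregnant women.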
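Summary of the Invention
In view of the problems in the prior art, the object of the invention is to provide a method for estimating the fetal nucleic acid concentration in the plasma cell-free DNA of pregnant women.
Accordingly, in a first aspect, the invention provides a method for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data, the method comprising:
(1) obtaining sequencing data of cell-free nucleic acid fragments of a pregnant woman to be tested, wherein the sequencing data comprises a number of reads;
(2) calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions;
(3) inputting the copy ratios of the reads on the multiple gene regions and/or multiple promoter regions into a trained machine learning model to predict the fetal nucleic acid concentration,
wherein the machine learning model is trained with data from multiple pregnant women carrying male fetuses, the data comprising, for each of these women, the fetal nucleic acid concentration calculated from Y chromosome sequencing depth and the copy ratios of her reads on the multiple gene regions and/or multiple promoter regions.
In one embodiment, in (1), the cell-free nucleic acid fragments are derived from the peripheral plasma of the pregnant woman, the liver of the pregnant woman and/or the placenta.
In one embodiment, in (1), the cell-free nucleic acid fragments are cell-free DNA.
In one embodiment, the multiple gene regions and/or multiple promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
In one embodiment, the length of the multiple gene regions is 7-2473538 bp, and the length of the multiple promoter regions is 199-43798 bp.
In one embodiment, the number of the multiple gene regions and/or multiple promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, and most preferably more than 200,000.
In one embodiment, the machine learning model is a machine learning regression model.
In one embodiment, the machine learning regression model includes linear regression models and nonlinear regression models.
In one embodiment, the machine learning regression model is a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on the random forest algorithm, or a regression model based on a deep neural network.
In one embodiment, the multiple gene regions and/or multiple promoter regions are from ENSEMBLE.
In one embodiment, the fetal nucleic acid concentration calculated from Y chromosome depth (Fraction_fetal) is:
Figure PCTCN2021110058-appb-000001
Depth_Y is the average coverage depth of the Y chromosome, and Depth_autosomes is the average coverage depth of the sequencing data on the autosomes; preferably the autosomes do not include chromosomes 13, 18 and 21.
In one embodiment, the copy ratio of the reads on the multiple gene regions and/or multiple promoter regions is:
Figure PCTCN2021110058-appb-000002
where copy_ratio_ip is the copy ratio of region p of sample i, reads_number_ip is the number of reads in region p of sample i, length_p is the total length of region p, reads_number_i is the number of reads in sample i, and length_ref is the total length of the reference genome; there are n samples and m regions in total.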
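In a second aspect, the invention provides a method for constructing the machine learning model used in the first aspect of the invention to estimate fetal nucleic acid concentration in non-invasive prenatal genetic testing data, the method comprising:
(a) obtaining sequencing data of cell-free nucleic acid fragments of multiple pregnant women carrying male fetuses, wherein the sequencing data comprises a number of reads;
(b) for each of the multiple pregnant women, calculating the fetal nucleic acid concentration from the Y chromosome sequencing depth, and calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions;
(c) inputting the fetal nucleic acid concentration calculated from the Y chromosome sequencing depth and the copy ratios of the reads on the multiple gene regions and/or multiple promoter regions into a machine learning model for training, to obtain a trained machine learning model.
In one embodiment, in (a), the cell-free nucleic acid fragments are derived from the peripheral plasma of the pregnant woman, the liver of the pregnant woman and/or the placenta.
In one embodiment, in (a), the cell-free nucleic acid fragments are cell-free DNA.
In one embodiment, the multiple gene regions and/or multiple promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
In one embodiment, the length of the multiple gene regions is 7-2473538 bp, and the length of the multiple promoter regions is 199-43798 bp.
In one embodiment, the number of the multiple gene regions and/or multiple promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, and most preferably more than 200,000.
In one embodiment, the machine learning model is a machine learning regression model.
In one embodiment, the machine learning model includes linear regression models and nonlinear regression models.
In one embodiment, the machine learning model is a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on the random forest algorithm, or a regression model based on a deep neural network.
In one embodiment, the multiple gene regions and/or multiple promoter regions are from ENSEMBLE.
In one embodiment, the fetal nucleic acid concentration calculated from Y chromosome depth (Fraction_fetal) is:
Figure PCTCN2021110058-appb-000003
Depth_Y is the average coverage depth of the Y chromosome, and Depth_autosomes is the average coverage depth of the sequencing data on the autosomes; preferably the autosomes do not include chromosomes 13, 18 and 21.
In one embodiment, the copy ratio of the reads on the multiple gene regions and/or multiple promoter regions is:
Figure PCTCN2021110058-appb-000004
where copy_ratio_ip is the copy ratio of region p of sample i, reads_number_ip is the number of reads in region p of sample i, length_p is the total length of region p, reads_number_i is the number of reads in sample i, and length_ref is the total length of the reference genome; there are n samples and m regions in total.
In a third aspect, the invention provides a machine learning model used in the first aspect of the invention to estimate fetal nucleic acid concentration in non-invasive prenatal genetic testing data, the machine learning model being constructed according to the method of the second aspect of the invention.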
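In a fourth aspect, the invention provides a system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data, the system comprising:
a sequencing data acquisition module for obtaining sequencing data of cell-free nucleic acid fragments of a pregnant woman, wherein the sequencing data comprises a number of reads;
a copy ratio calculation module for calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions;
a model training module for training with data from multiple pregnant women carrying male fetuses, wherein the training comprises inputting the fetal nucleic acid concentration calculated from the Y chromosome sequencing depth and the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into a machine learning model to obtain a trained machine learning model;
a prediction module for predicting with data from the sample of the pregnant woman to be tested, wherein the prediction comprises inputting the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into the trained machine learning model to predict the fetal nucleic acid concentration.
In one embodiment, in the sequencing data acquisition module, the cell-free nucleic acid fragments are derived from the peripheral plasma of the pregnant woman, the liver of the pregnant woman and/or the placenta.
In one embodiment, in the sequencing data acquisition module, the cell-free nucleic acid fragments are cell-free DNA.
In one embodiment, the multiple gene regions and/or multiple promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
In one embodiment, the length of the multiple gene regions is 7-2473538 bp, and the length of the multiple promoter regions is 199-43798 bp.
In one embodiment, the number of the multiple gene regions and/or multiple promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, and most preferably more than 200,000.
In one embodiment, the machine learning model is a machine learning regression model.
In one embodiment, the machine learning regression model includes linear regression models and nonlinear regression models.
In one embodiment, the machine learning regression model is a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on the random forest algorithm, or a regression model based on a deep neural network.
In one embodiment, the multiple gene regions and/or multiple promoter regions are from ENSEMBLE.
In one embodiment, the fetal nucleic acid concentration calculated from Y chromosome depth (Fraction_fetal) is:
Figure PCTCN2021110058-appb-000005
Depth_Y is the average coverage depth of the Y chromosome, and Depth_autosomes is the average coverage depth of the sequencing data on the autosomes; preferably the autosomes do not include chromosomes 13, 18 and 21.
In one embodiment, the copy ratio of the reads on the multiple gene regions and/or multiple promoter regions is:
Figure PCTCN2021110058-appb-000006
where copy_ratio_ip is the copy ratio of region p of sample i, reads_number_ip is the number of reads in region p of sample i, length_p is the total length of region p, reads_number_i is the number of reads in sample i, and length_ref is the total length of the reference genome; there are n samples and m regions in total.
The method of the invention requires neither high-depth sequencing, nor PE sequencing, nor methylation sequencing, nor additional sequencing of the parents. The method of the invention can estimate fetal nucleic acid concentration using NIPT data alone.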
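Brief Description of the Drawings
Figure 1 shows the predicted fetal nucleic acid concentrations of 600 pregnant women's samples, calculated with a model trained on sample data from 2400 pregnant women carrying male fetuses.
Detailed Description
In the invention, reads are counted over gene regions and promoter regions that have biological functions. Compared with a scheme that simply divides the genome into windows of a fixed length, a region unit with a biological function is less likely to contain both a region highly enriched in maternal data and a region highly enriched in fetal data, so the resolution is relatively higher. Preferably, the gene regions and promoter regions include regions obtained by extending a gene or promoter.
The invention includes training a machine learning model with data from multiple pregnant women carrying male fetuses to obtain a trained machine learning model, wherein the training comprises inputting the fetal nucleic acid concentration calculated from Y chromosome depth and the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into the machine learning model. For the cell-free nucleic acid fragments of a pregnant woman to be tested, the copy ratios of the reads on the multiple gene regions and/or multiple promoter regions are input into the trained machine learning model to predict the fetal nucleic acid concentration. For the training samples and the samples to be tested, the same multiple gene regions and/or multiple promoter regions are preferably used. Because the method of the invention is a calculation method developed on data from genes and promoter regions with biological functions, its resolution is higher, and it can be applied to samples with fetal nucleic acid concentrations of 5% or less.
In the invention, the machine learning model is a machine learning regression model, including for example linear regression models and nonlinear regression models. The machine learning regression model may be a ridge regression model, a lasso regression model, a regression model based on the random forest algorithm, or a regression model based on a deep neural network. Ridge regression is a multiple linear regression model that fits a linear function so that the target variable y is a linear combination of the independent variables x (also called features). Ridge regression additionally reduces the risk of overfitting by penalizing the coefficients of the independent variables (the feature coefficients) of the linear function, that is, by applying L2 regularization.
The method and system of the invention are illustrated below with reference to the following specific example.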
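Example
Samples from multiple pregnant women known to be carrying male fetuses serve as training samples. For the training samples and the test samples, circulating cell-free DNA (cfDNA) is extracted from maternal plasma and sequenced to obtain the sequencing reads of each sample, followed by the subsequent analysis.
Step 1: the raw sequencer output (fq format) of all samples used for model training and prediction passes quality control and is then aligned to the human reference genome hg38 using the samse mode of BWA; Picard is used to remove duplicate reads from the alignments and to calculate the duplication rate, and the base quality score recalibration (BQSR) function of a variant-calling toolkit such as GATK is used for local correction of the alignments. The non-invasive prenatal testing (NIPT) data of the male-fetus pregnancies are prepared, for example alignment files in bam/cram format for SE35 data.
Step 2: the gene region file and regulatory region file for hg19/hg38 are downloaded from ENSEMBLE, and the sex chromosomes and chromosomes 13, 18 and 21, the three chromosomes on which trisomy may occur, are filtered out.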
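The filtering in Step 2 can be sketched as follows. This is a minimal illustration, not the applicant's code: the tuple layout `(chromosome, start, end, name)` and the function names are hypothetical, and the source does not say how the region files are parsed.

```python
# Chromosomes removed in Step 2: the sex chromosomes and 13, 18, 21.
EXCLUDED = {"X", "Y", "13", "18", "21"}


def normalize_chrom(chrom: str) -> str:
    """Strip an optional 'chr' prefix so 'chr13' and '13' compare equal."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def filter_regions(regions):
    """Keep regions on autosomes other than chromosomes 13, 18 and 21.

    `regions` is an iterable of (chromosome, start, end, name) tuples
    (a hypothetical layout chosen for this sketch).
    """
    kept = []
    for chrom, start, end, name in regions:
        c = normalize_chrom(chrom)
        # isdigit() drops X, Y, MT and unplaced scaffolds in one test.
        if c.isdigit() and c not in EXCLUDED:
            kept.append((c, start, end, name))
    return kept


example = [
    ("chr1", 100, 200, "geneA"),
    ("13", 1, 50, "geneB"),
    ("X", 5, 10, "geneC"),
    ("chr21", 7, 9, "geneD"),
    ("MT", 1, 2, "geneE"),
    ("2", 30, 90, "promoterF"),
]
autosomal = filter_regions(example)
```

Applying the same filtered region list to training and test samples keeps the feature columns aligned, which matches the stated preference for using the same regions in both.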
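Step 3: train the model. For male fetuses, the average coverage depth of the sequencing data on the autosomes (Depth_autosomes) and the average coverage depth of the Y chromosome (Depth_Y) are calculated; the fetal nucleic acid concentration of a male fetus (Fraction_fetal) is then given by: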
Figure PCTCN2021110058-appb-000007
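The formula itself survives only as an image reference in this text, so the sketch below rests on an assumption: a male fetus carries one Y chromosome against two copies of each autosome, which suggests Fraction_fetal = 2 × Depth_Y / Depth_autosomes. The factor of 2 and the function name are assumptions of this sketch, not values confirmed by the source.

```python
def fetal_fraction_from_y(depth_y: float, depth_autosomes: float,
                          copy_factor: float = 2.0) -> float:
    """Estimate the fetal fraction of a male-fetus sample from Y depth.

    depth_y:         average coverage depth of the Y chromosome
    depth_autosomes: average coverage depth of the autosomes
                     (preferably excluding chromosomes 13, 18 and 21)
    copy_factor:     ASSUMED scaling of 2 (one Y copy versus two
                     autosomal copies); not taken from the source.
    """
    if depth_autosomes <= 0:
        raise ValueError("depth_autosomes must be positive")
    if depth_y < 0:
        raise ValueError("depth_y must be non-negative")
    return copy_factor * depth_y / depth_autosomes


# Hypothetical depths for illustration only.
fraction = fetal_fraction_from_y(depth_y=0.005, depth_autosomes=0.1)
```

Passing `copy_factor=1.0` reproduces a plain depth ratio if that turns out to be the form used in the original figure.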
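The hg38 gene region file and regulatory region file are downloaded from ENSEMBLE, and the sex chromosomes and chromosomes 13, 18 and 21, the three chromosomes on which trisomy may occur, are filtered out. This leaves 54119 gene regions and 160209 promoter regions. The copy ratio of every gene and promoter region is calculated using:
Figure PCTCN2021110058-appb-000008
where copy_ratio_ip is the copy ratio of region p of sample i, reads_number_ip is the number of reads in region p of sample i that pass quality control (MAPQ>30), length_p is the total length of region p, reads_number_i is the number of reads in sample i that pass quality control (MAPQ>30), and length_ref is the total length of the reference genome; there are n samples and m regions in total. Here m is 54119 gene regions, 160209 promoter regions, or 214328 for the combination of both. The length of the gene regions is 7-2473538 bp; the length of the promoter regions is 199-43798 bp.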
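The copy ratio formula is also only an image reference here. Judging from the variable definitions, a natural reading is the read density of a region divided by the genome-wide read density of the sample, copy_ratio_ip = (reads_number_ip / length_p) / (reads_number_i / length_ref). The sketch below implements that assumed form in plain Python; the function name and the toy numbers are hypothetical.

```python
def copy_ratios(region_reads, region_lengths, total_reads, ref_length):
    """Return copy_ratio[i][p] for n samples and m regions.

    region_reads[i][p]: QC-passing reads (MAPQ>30) of sample i in region p
    region_lengths[p]:  length of region p in bp
    total_reads[i]:     QC-passing reads (MAPQ>30) of sample i
    ref_length:         total length of the reference genome in bp

    ASSUMED form: (reads_ip / length_p) / (reads_i / length_ref).
    """
    ratios = []
    for reads_i, total_i in zip(region_reads, total_reads):
        if total_i <= 0:
            raise ValueError("each sample needs at least one read")
        genome_density = total_i / ref_length  # reads per bp, genome-wide
        ratios.append([
            (reads_ip / length_p) / genome_density
            for reads_ip, length_p in zip(reads_i, region_lengths)
        ])
    return ratios


# Toy input: 2 samples, 2 regions.
example = copy_ratios(
    region_reads=[[10, 20], [0, 5]],
    region_lengths=[100, 200],
    total_reads=[1000, 500],
    ref_length=10_000,
)
```

Under this reading, a value of 1 means a region is covered at the genome-wide average density, and the resulting n × m matrix is what serves as the X values in Step 4.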
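Step 4: with the fetal nucleic acid concentration estimated from Y chromosome data as the Y value and the copy ratios of all genes and promoters as the X values, a machine learning model such as ridge regression is trained. The ridge regression model is implemented with the LinearRegression module of the sklearn Python package. Sample data from 2400 pregnant women carrying male fetuses are used for training, with L2 regularization and cross-validation, to obtain a weight estimate for each region; the trained ridge regression model is then saved. The feature weights in this model are equivalent to the weight coefficients β of a linear regression model, that is:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}, \quad i = 1, \dots, n,$$
where y_i is the male-fetus nucleic acid concentration of sample i derived from Y chromosome depth, {β_0 … β_p} are the coefficients of the features in the model, and {x_i1 … x_ip} are the copy ratios of all p gene and promoter regions in sample i.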
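A minimal sketch of Step 4 is given below. It uses scikit-learn's `RidgeCV`, which combines L2 regularization with cross-validated selection of the penalty; that class choice, the alpha grid, the fold count and the synthetic data are all assumptions of this sketch, since the text only names the sklearn package.

```python
import numpy as np
from sklearn.linear_model import RidgeCV


def train_ridge(X, y, alphas=(0.1, 1.0, 10.0, 100.0), cv=5):
    """Fit ridge regression, choosing the L2 penalty by cross-validation.

    X: (n_samples, n_regions) copy-ratio matrix
    y: (n_samples,) fetal fractions derived from Y chromosome depth
    The alpha grid and the fold count are illustrative assumptions.
    """
    model = RidgeCV(alphas=np.asarray(alphas), cv=cv)
    model.fit(X, y)
    return model


# Synthetic stand-in data: 200 samples, 50 regions (the source uses
# 2400 samples and up to 214328 regions).
rng = np.random.default_rng(0)
X_train = rng.normal(1.0, 0.05, size=(200, 50))
true_beta = rng.normal(0.0, 0.2, size=50)
y_train = 0.10 + (X_train - 1.0) @ true_beta + rng.normal(0, 0.002, size=200)

model = train_ridge(X_train, y_train)

# Step 5 in miniature: predict for new samples using the same regions,
# in the same column order as during training.
X_new = rng.normal(1.0, 0.05, size=(5, 50))
y_pred = model.predict(X_new)
```

After fitting, `model.coef_` holds one weight per region and `model.intercept_` plays the role of β_0; a fitted model of this kind can be saved with Python's `pickle` module.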
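In one example of the invention, other machine learning or deep learning algorithms may also be used, for example least squares linear regression or lasso linear regression. The loss function of linear regression is the mean squared error, which is minimized during training. Lasso regression and ridge regression modify the loss function of standard linear regression by adding L1 and L2 regularization, respectively. Adding regularization helps address overfitting in linear regression. During model construction, the regularization coefficient can also be determined by cross-validation to improve the accuracy and robustness of the model.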
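The three loss functions described above can be written out as follows. This is a standard textbook form, not a formula from the source; the symbol λ for the regularization coefficient is an assumption, and the scaling constants in front of each term differ between implementations.

```latex
% n samples, p region features; \lambda \ge 0 is the regularization coefficient
\begin{aligned}
\text{Least squares:}\quad
  & L(\beta) = \frac{1}{n}\sum_{i=1}^{n}
    \Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} \\
\text{Lasso (L1):}\quad
  & L_{\text{lasso}}(\beta) = L(\beta) + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \\
\text{Ridge (L2):}\quad
  & L_{\text{ridge}}(\beta) = L(\beta) + \lambda\sum_{j=1}^{p}\beta_j^{2}
\end{aligned}
```

The intercept β_0 is conventionally left unpenalized. The L1 term tends to drive some region weights exactly to zero, while the L2 term shrinks all of them smoothly.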
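In one example of the invention, the method can introduce phenotypic data such as gestational week and maternal age as features in addition to the gene and promoter regions, and apply the same calculation to improve accuracy.
Step 5: calculate the fetal nucleic acid concentration. For a maternal plasma cfDNA sample whose fetal nucleic acid concentration is to be determined, the copy ratios of the sample's reads on the genes and promoters are calculated, the genes and promoters including those used to train the machine learning model, and the model obtained in Step 4 is used to predict the fetal nucleic acid concentration.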
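Results: using the model trained on sample data from 2400 pregnant women carrying male fetuses, fetal nucleic acid concentrations were predicted for 600 pregnant women's samples. The comparison with the fetal nucleic acid concentrations calculated from Y chromosome depth is shown in Figure 1: the correlation (R²) between the fetal nucleic acid concentration calculated from Y chromosome depth (horizontal axis) and that calculated by the method of the invention (vertical axis) is 0.83. In addition, the inventors ran the calculation with gene regions alone, obtaining an accuracy value of 0.83, measured as the correlation R² with the Y chromosome based result, and with promoter regions alone, obtaining an accuracy value of 0.73.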
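The text does not say how R² was computed. One common reading, assumed here, is the squared Pearson correlation between the Y-based concentrations and the model predictions; the sketch below implements that in plain Python with hypothetical values.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical values for illustration only (not data from the source).
y_based = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 3.0, 2.0, 4.0]
r2 = pearson_r2(y_based, predicted)
```

A coefficient of determination computed against the identity line would penalize systematic bias as well, so the two definitions can give different numbers for the same predictions.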

Claims (16)

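  1. A method for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data, the method comprising:
    (1) obtaining sequencing data of cell-free nucleic acid fragments of a pregnant woman to be tested, wherein the sequencing data comprises a number of reads;
    (2) calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions;
    (3) inputting the copy ratios of the reads on the multiple gene regions and/or multiple promoter regions into a trained machine learning model to predict the fetal nucleic acid concentration,
    wherein the machine learning model is trained with data from multiple pregnant women carrying male fetuses, the data comprising, for each of these women, the fetal nucleic acid concentration calculated from Y chromosome sequencing depth and the copy ratios of her reads on the multiple gene regions and/or multiple promoter regions.
  2. The method according to claim 1, wherein the method for constructing the machine learning model in (3) comprises:
    (a) obtaining sequencing data of cell-free nucleic acid fragments of multiple pregnant women carrying male fetuses, wherein the sequencing data comprises a number of reads;
    (b) for each of the multiple pregnant women, calculating the fetal nucleic acid concentration from the Y chromosome sequencing depth, and calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions;
    (c) inputting the fetal nucleic acid concentration calculated from the Y chromosome sequencing depth and the copy ratios of the reads on the multiple gene regions and/or multiple promoter regions into a machine learning model for training, to obtain a trained machine learning model.
  3. The method according to claim 1 or 2, wherein in (1) the cell-free nucleic acid fragments are derived from the peripheral plasma of the pregnant woman, the liver of the pregnant woman and/or the placenta, and preferably the cell-free nucleic acid fragments are cell-free DNA.
  4. The method according to any one of claims 1-3, wherein in (1) the multiple gene regions and/or multiple promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
  5. The method according to any one of claims 1-4, wherein the number of the multiple gene regions and/or multiple promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, and most preferably more than 200,000.
  6. The method according to any one of claims 1-5, wherein the machine learning model is a machine learning regression model, including for example linear regression models and nonlinear regression models, preferably a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on the random forest algorithm, or a regression model based on a deep neural network.
  7. The method according to any one of claims 1-6, wherein the fetal nucleic acid concentration calculated from Y chromosome depth (Fraction_fetal) is:
    Figure PCTCN2021110058-appb-100001
    Depth_Y is the average coverage depth of the Y chromosome, and Depth_autosomes is the average coverage depth of the sequencing data on the autosomes; preferably the autosomes do not include chromosomes 13, 18 and 21.
  8. The method according to any one of claims 1-7, wherein the copy ratio of the reads on the multiple gene regions and/or multiple promoter regions is:
    Figure PCTCN2021110058-appb-100002
    where copy_ratio_ip is the copy ratio of region p of sample i, reads_number_ip is the number of reads in region p of sample i, length_p is the total length of region p, reads_number_i is the number of reads in sample i, and length_ref is the total length of the reference genome; there are n samples and m regions in total.
  9. A machine learning model constructed by the method according to any one of claims 2-8.
  10. A system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data, the system comprising:
    a sequencing data acquisition module for obtaining sequencing data of cell-free nucleic acid fragments of a pregnant woman, wherein the sequencing data comprises a number of reads;
    a copy ratio calculation module for calculating the copy ratios of the reads on multiple gene regions and/or multiple promoter regions;
    a model training module for training with data from multiple pregnant women carrying male fetuses, wherein the training comprises inputting the fetal nucleic acid concentration calculated from the Y chromosome sequencing depth and the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into a machine learning model to obtain a trained machine learning model;
    a prediction module for predicting with data from the sample of the pregnant woman to be tested, wherein the prediction comprises inputting the copy ratios of the reads on multiple gene regions and/or multiple promoter regions into the trained machine learning model to predict the fetal nucleic acid concentration.
  11. The system according to claim 10, wherein in the sequencing data acquisition module the cell-free nucleic acid fragments are derived from the peripheral plasma of the pregnant woman, the liver of the pregnant woman and/or the placenta, and preferably the cell-free nucleic acid fragments are cell-free DNA.
  12. The system according to claim 10 or 11, wherein the multiple gene regions and/or multiple promoter regions are from autosomes, more preferably autosomes other than chromosomes 13, 18 and 21.
  13. The system according to any one of claims 10-12, wherein the number of the multiple gene regions and/or multiple promoter regions is more than 10,000, preferably more than 50,000, more preferably more than 100,000, and most preferably more than 200,000.
  14. The system according to any one of claims 10-13, wherein the machine learning model is a machine learning regression model, including for example linear regression models and nonlinear regression models, preferably a ridge regression model, a lasso regression model, a least squares linear regression model, a regression model based on the random forest algorithm, or a regression model based on a deep neural network.
  15. The system according to any one of claims 10-14, wherein the fetal nucleic acid concentration calculated from Y chromosome depth (Fraction_fetal) is:
    Figure PCTCN2021110058-appb-100003
    Depth_Y is the average coverage depth of the Y chromosome, and Depth_autosomes is the average coverage depth of the sequencing data on the autosomes; preferably the autosomes do not include chromosomes 13, 18 and 21.
  16. The system according to any one of claims 10-15, wherein the copy ratio of the reads on the multiple gene regions and/or multiple promoter regions is:
    Figure PCTCN2021110058-appb-100004
    where copy_ratio_ip is the copy ratio of region p of sample i, reads_number_ip is the number of reads in region p of sample i, length_p is the total length of region p, reads_number_i is the number of reads in sample i, and length_ref is the total length of the reference genome; there are n samples and m regions in total.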
PCT/CN2021/110058 2021-08-02 2021-08-02 Method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data WO2023010242A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/110058 WO2023010242A1 (zh) 2021-08-02 2021-08-02 Method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/110058 WO2023010242A1 (zh) 2021-08-02 2021-08-02 Method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data

Publications (1)

Publication Number Publication Date
WO2023010242A1 true WO2023010242A1 (zh) 2023-02-09

Family

ID=85154036

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/110058 WO2023010242A1 (zh) 2021-08-02 2021-08-02 Method and system for estimating fetal nucleic acid concentration in non-invasive prenatal genetic testing data

Country Status (1)

Country Link
WO (1) WO2023010242A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100216151A1 (en) * 2004-02-27 2010-08-26 Helicos Biosciences Corporation Methods for detecting fetal nucleic acids and diagnosing fetal abnormalities
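CN104232777A (zh) * 2014-09-19 2014-12-24 天津华大基因科技有限公司 Method and device for simultaneously determining fetal nucleic acid content and chromosomal aneuploidy
CN105296606A (zh) * 2014-07-25 2016-02-03 深圳华大基因股份有限公司 Method and device for determining the proportion of cell-free nucleic acids in a biological sample, and use thereof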

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21952152

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE