WO2021134513A1 - Method and device for determining chromosome aneuploidy and for constructing a classification model - Google Patents

Method and device for determining chromosome aneuploidy and for constructing a classification model

Info

Publication number
WO2021134513A1
Authority
WO
WIPO (PCT)
Prior art keywords
chromosome
sample
feature
concentration
aneuploidy
Application number
PCT/CN2019/130625
Other languages
English (en)
French (fr)
Inventor
张红云
袁玉英
柴相花
周丽君
王梦杰
刘强
尹烨
Original Assignee
深圳华大医学检验实验室
Application filed by 深圳华大医学检验实验室
Priority to EP19958118.2A: EP4086356A4 (en)
Priority to PCT/CN2019/130625: WO2021134513A1 (zh)
Priority to US17/612,515: US20220336047A1 (en)
Priority to AU2019480813A: AU2019480813A1 (en)
Priority to KR1020227003512A: KR20220122596A (ko)
Priority to CA3141362A: CA3141362A1 (en)
Priority to JP2021569370A: JP7467504B2 (ja)
Priority to CN201980004859.0A: CN111226281B (zh)
Priority to IL277746A: IL277746A (en)
Publication of WO2021134513A1 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Definitions

  • The present invention relates to the field of biotechnology, particularly non-invasive prenatal genetic testing, and specifically to a method and device for determining chromosome aneuploidy and to a corresponding method and device for constructing a machine learning classification model.
  • Prenatal screening methods are usually divided into two categories: invasive methods (also called prenatal diagnosis) and non-invasive methods.
  • The former mainly include amniocentesis, chorionic villus sampling and cord blood sampling; the latter include ultrasound, maternal peripheral serum marker determination, and fetal cell detection.
  • Invasive methods such as chorionic villus sampling (CVS) or amniocentesis are used to obtain cells isolated from the fetus, which can be used for routine prenatal diagnosis.
  • CVS: chorionic villus sampling
  • Non-invasive prenatal screening mainly uses high-throughput sequencing to analyze cell-free fetal DNA in the peripheral blood of pregnant women in order to assess the risk of common chromosomal aneuploidies in the fetus.
  • The usual screening scope covers aneuploidy of chromosome 21 (T21), chromosome 18 (T18), chromosome 13 (T13) and the sex chromosomes.
  • NIPT based on read-count quantification: the main principle of this method is to use alignment software to map the sequencing reads (sometimes called "sequencing sequences") to predefined windows, and then to apply an appropriate method to detect aneuploidy of the chromosome to be tested.
  • NIPT based on single nucleotide polymorphisms (SNPs): the main principle of this method is to capture and sequence the genomic DNA of both parents and the fetal cell-free DNA at predetermined SNP site regions, and then to use the genotype information of the parents and the fetus in a Bayesian model to detect aneuploidy of the chromosome to be tested.
  • SNP: single nucleotide polymorphism
  • NIPT based on the size of DNA fragments:
  • PE: paired-end sequencing technology
  • a Z test is used to detect aneuploidy of the chromosome to be tested based on reference samples.
  • an object of the present invention is to provide a method that can effectively determine chromosome aneuploidy.
  • the present invention provides a method for determining whether a fetus has chromosomal aneuploidy.
  • the method includes: (1) obtaining nucleic acid sequencing data from a pregnant woman sample, where the pregnant woman sample contains cell-free fetal nucleic acid and the nucleic acid sequencing data consists of a plurality of sequencing reads; (2) determining, based on the nucleic acid sequencing data, the fetal concentration of the pregnant woman sample and the inverse estimated concentration of a predetermined chromosome, where the inverse estimated concentration is determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of a first comparison chromosome, the predetermined chromosome includes the chromosome to be tested and a second comparison chromosome, and the first comparison chromosome includes at least one autosome different from the predetermined chromosome; (3) the first feature is determined based on the difference between the inverse estimated concentration of the test
  • This method can effectively determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current strategy of setting thresholds based on the number of sequencing reads, eliminates the detection gray zone, and at the same time can shorten the sample testing cycle, improve the customer experience, and significantly reduce sequencing and testing costs.
  • the above method may also have the following additional technical features:
  • the pregnant woman sample includes a pregnant woman's peripheral blood.
  • the nucleic acid sequencing data is obtained by paired-end sequencing, single-end sequencing, or single-molecule sequencing.
  • the fetal concentration is determined by the following steps: (a) aligning the nucleic acid sequencing data from the pregnant woman sample with a reference sequence, so as to determine the number of sequencing reads that fall within a predetermined window; and (b) determining the fetal concentration of the pregnant woman sample based on the number of sequencing reads that fall within the predetermined window.
  • the number of sequencing reads of the first comparison chromosome is the average number of sequencing reads of a plurality of autosomes, and the plurality of autosomes includes at least one autosome known not to have aneuploidy.
  • the number of sequencing reads of the first comparison chromosome is the average number of sequencing reads of at least 15 autosomes; optionally, it is the average number of sequencing reads of at least 20 autosomes.
  • the number of sequencing reads of the first comparison chromosome is the average number of sequencing reads of all autosomes.
  • the inverse estimated concentration is determined according to the following formula:
  • j represents the number of the chromosome for which the inverse estimated concentration needs to be determined
  • Fj represents the inverse estimated concentration of chromosome j
  • Rr represents the average number of sequencing reads of the multiple autosomes
  • Rj represents the number of reads sequenced on chromosome j.
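The formula itself is missing from this text; only its symbol definitions remain. As a hedged illustration, one form consistent with those definitions, and with the statement that the inverse estimated concentration is based on the difference between Rj and Rr, is Fj = 2 * (Rj - Rr) / Rr. This follows if a trisomic fetus contributes one extra copy of chromosome j. The factor 2, the division by Rr, and the assumption that read counts are already normalized so that euploid chromosomes are comparable are all assumptions of this sketch, not the confirmed formula of the patent. The read counts below are hypothetical.

```python
from statistics import mean


def inverse_estimated_concentration(r_j: float, r_r: float) -> float:
    """Assumed form F_j = 2 * (R_j - R_r) / R_r; not confirmed by the source."""
    if r_r <= 0:
        raise ValueError("R_r must be positive")
    return 2.0 * (r_j - r_r) / r_r


# Hypothetical normalized read counts per autosome.
reads = {"chr1": 1000.0, "chr2": 1000.0, "chr21": 1050.0}

# R_r: average read count over the autosomes used as the first comparison chromosome.
r_r = mean(reads.values())

# F_j for every chromosome j.
f_by_chrom = {chrom: inverse_estimated_concentration(r, r_r) for chrom, r in reads.items()}
```

Under this assumed form, a chromosome whose read count equals Rr gets Fj = 0, and a chromosome with 5% more reads than Rr gets Fj = 0.1.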
  • the first feature is determined based on the difference between the inverse estimated concentration of the chromosome to be tested and the average value of the inverse estimated concentrations of the second comparison chromosome.
  • the second comparison chromosome includes at least 10 autosomes.
  • the second comparison chromosome includes 15 autosomes.
  • it further includes: determining the inverse estimated concentrations of a plurality of autosomes; and, after sorting them in ascending order, selecting the target-ranked autosomes as the second comparison chromosome.
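The selection of the second comparison chromosome can be sketched as a sort followed by a slice. The text does not say which ranks count as the "target" ranks, so `start_rank` and `count` below are hypothetical parameters. The default of 15 follows the embodiment that uses 15 autosomes, and excluding the chromosome to be tested from the candidates is an assumption.

```python
def select_second_comparison(f_by_chrom, test_chrom, start_rank=0, count=15):
    """Sort autosomes by inverse estimated concentration (ascending) and
    return the names at the requested ranks, skipping the test chromosome."""
    candidates = [(f, chrom) for chrom, f in f_by_chrom.items() if chrom != test_chrom]
    candidates.sort()  # ascending by F; ties are broken by chromosome name
    return [chrom for _, chrom in candidates[start_rank:start_rank + count]]


# Hypothetical inverse estimated concentrations.
example_f = {"chr1": 0.02, "chr2": -0.01, "chr3": 0.00, "chr21": 0.10}
selected = select_second_comparison(example_f, "chr21", start_rank=0, count=2)
```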
  • the first feature is determined by the following formula:
  • X1 represents the first feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fr represents the average value of the inverse estimated concentration of the second comparison chromosome.
  • the second feature is determined by the following formula:
  • X2 represents the second feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fa represents the fetal concentration
  • the first feature and the second feature are standardized, so that the absolute values of the first feature and the second feature are each independently between 0 and 1.
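The formulas for X1 and X2 are also missing from this text. A minimal sketch, assuming the simplest forms that match the verbal descriptions (X1 = Fi - Fr and X2 = Fi - Fa), is given below. Standardization is illustrated by dividing by the largest absolute value in a reference set, which is only one of several ways to bring absolute values into the range 0 to 1. Both the formulas and the scaling method are assumptions.

```python
from statistics import mean


def first_feature(f_i, f_second_comparison):
    """Assumed X1 = F_i - F_r, with F_r the mean inverse estimated
    concentration of the second comparison chromosomes."""
    return f_i - mean(f_second_comparison)


def second_feature(f_i, f_a):
    """Assumed X2 = F_i - F_a, with F_a the fetal concentration."""
    return f_i - f_a


def scale_to_unit(values):
    """Divide by the largest absolute value so that every |v| is at most 1."""
    peak = max(abs(v) for v in values)
    return [v / peak if peak else 0.0 for v in values]


# Hypothetical values for one sample.
x1 = first_feature(0.10, [0.0, 0.02])
x2 = second_feature(0.10, 0.12)
scaled = scale_to_unit([0.09, -0.03, 0.045])
```

Under these assumed forms, a euploid sample would tend to have X1 near 0 and X2 near -Fa, while a trisomic sample would tend to have X1 near Fa and X2 near 0, which is what would make the two-dimensional feature separable.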
  • step (4) the ratio of the number of positive samples to the number of negative samples is not less than 1:4.
  • step (4) the ratio of the number of positive samples to the number of negative samples does not exceed 4:1.
  • step (4) the ratio of the number of the positive samples to the negative samples is 1:0.1-5.
  • step (4) the ratio of the number of positive samples to the number of negative samples is 1:0.25-4.
  • neither the positive sample nor the negative sample has aneuploidy for chromosomes other than the chromosome to be tested.
  • the first feature and the second feature are used to determine the two-dimensional feature vectors of the pregnant woman sample and the control samples; the distance between samples is determined based on the two-dimensional feature vectors, and the pregnant woman sample is classified as belonging to the positive control samples or the negative control samples, so as to determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the distance is Euclidean distance, Manhattan distance or Chebyshev distance.
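For reference, the three named distances between two feature vectors can be written as follows. This is a generic sketch of the standard definitions; the function names are illustrative.

```python
def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)
distances = (euclidean(p, q), manhattan(p, q), chebyshev(p, q))
```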
  • step (4) further includes: (4-1) calculating the distance between the pregnant woman sample and each control sample; (4-2) sorting the obtained distances in ascending order; (4-3) based on the sorting, selecting a predetermined number of control samples, starting from the smallest distance; (4-4) determining the number of positive samples and the number of negative samples among the predetermined number of control samples; (4-5) determining the classification result of the pregnant woman sample by majority vote.
  • the predetermined number is not more than 20.
  • the predetermined number is 3-10.
  • step (4-2) before the sorting, the distance between the sample to be tested and the predetermined control sample is weighted in advance.
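Steps (4-1) to (4-5), including the optional weighting of distances before sorting, can be sketched as below. The text does not specify how the weights are chosen, so the per-control multiplicative `weights` argument is a hypothetical interpretation. Euclidean distance is used via `math.dist`, and the control samples are made-up values.

```python
import math
from collections import Counter


def knn_classify(sample, controls, k=5, weights=None):
    """controls: list of ((x1, x2), label) with label 'positive' or 'negative'.
    weights: optional per-control multipliers applied to the distances (assumed form)."""
    if weights is None:
        weights = [1.0] * len(controls)
    # (4-1) distance from the sample to every control sample, weighted if requested
    scored = [(w * math.dist(sample, feat), label)
              for (feat, label), w in zip(controls, weights)]
    # (4-2) sort distances in ascending order
    scored.sort(key=lambda item: item[0])
    # (4-3) keep the k nearest control samples
    nearest = scored[:k]
    # (4-4) count positive and negative samples among them
    votes = Counter(label for _, label in nearest)
    # (4-5) majority decision
    return votes.most_common(1)[0][0]


# Hypothetical standardized (X1, X2) vectors of control samples.
controls = [
    ((0.9, 0.0), "positive"), ((0.8, -0.1), "positive"), ((1.0, 0.1), "positive"),
    ((0.0, -0.9), "negative"), ((0.1, -1.0), "negative"), ((-0.1, -0.8), "negative"),
]
call_a = knn_classify((0.85, -0.05), controls, k=3)
call_b = knn_classify((0.05, -0.9), controls, k=3)
```

An odd k avoids tied votes between the two classes; with an even k this sketch returns whichever label was counted first among the tied ones.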
  • the present invention provides a device for determining whether a fetus has chromosomal aneuploidy, characterized by comprising: a data acquisition module for acquiring nucleic acid sequencing data from a pregnant woman sample, where the pregnant woman sample contains cell-free fetal nucleic acid and the nucleic acid sequencing data consists of a plurality of sequencing reads; a fetal concentration-inverse estimated concentration determination module for determining, based on the nucleic acid sequencing data, the fetal concentration of the pregnant woman sample and the inverse estimated concentration of a predetermined chromosome, where the inverse estimated concentration is determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of a first comparison chromosome, the predetermined chromosome includes the chromosome to be tested and a second comparison chromosome, and the first comparison chromosome includes at least one autosome different from the predetermined chromosome; a feature determination module for determining the first feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the inverse estimated concentration of the second comparison chromosome, and the second feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the fetal concentration; and an aneuploidy determination module configured to determine, based on the first feature and the second feature and using the corresponding data of control samples, whether the fetus of the pregnant woman has aneuploidy for the chromosome to be tested, wherein the control samples include a positive sample and a negative sample, the positive sample has aneuploidy for the chromosome to be tested, and the negative sample does not have aneuploidy for the chromosome to be
  • the device for determining whether a fetus has chromosomal aneuploidy can effectively implement the method for determining whether a fetus has chromosomal aneuploidy, so as to effectively determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current strategy of setting thresholds based on the number of sequencing reads, eliminates the detection gray zone, and can also shorten the sample testing cycle, improve the customer experience, and significantly reduce sequencing and testing costs.
  • the above-mentioned device may also have the following additional technical features:
  • the fetal concentration-inverse estimated concentration determination module includes: an alignment unit, configured to align the nucleic acid sequencing data from the pregnant woman sample with a reference sequence, so as to determine the number of sequencing reads that fall within a predetermined window; and a fetal concentration calculation unit for determining the fetal concentration of the pregnant woman sample based on the number of sequencing reads that fall within the predetermined window.
  • the fetal concentration-reverse estimated concentration determination module includes: a reverse estimated concentration calculation unit configured to determine the reverse estimated concentration according to the following formula:
  • j represents the number of the chromosome for which the inverse estimated concentration needs to be determined
  • Fj represents the inverse estimated concentration of chromosome j
  • Rr represents the average number of sequencing reads of the multiple autosomes
  • Rj represents the number of reads sequenced on chromosome j.
  • the fetal concentration-inverse estimated concentration determination module includes: a second comparison chromosome determining unit, used to sort the inverse estimated concentrations of a plurality of autosomes in ascending order and to use the target-ranked autosomes as the second comparison chromosome.
  • the feature determination module includes: a first feature determining unit, configured to determine the first feature using the following formula:
  • X1 represents the first feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fr represents the average value of the inverse estimated concentration of the second comparison chromosome.
  • the feature determining module includes: a second feature determining unit, configured to determine the second feature using the following formula:
  • X2 represents the second feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fa represents the fetal concentration
  • the feature determination module includes: a standardization processing unit, configured to standardize the first feature and the second feature, so that the absolute values of the first feature and the second feature are each independently between 0 and 1.
  • the aneuploidy determination module is configured to use the first feature and the second feature to determine the two-dimensional feature vectors of the pregnant woman sample and the control samples and, based on the inter-sample distance determined from the two-dimensional feature vectors, to classify the pregnant woman sample as belonging to the positive control samples or the negative control samples, so as to determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the distance is Euclidean distance, Manhattan distance or Chebyshev distance.
  • the aneuploidy determination module is configured to use a k-nearest neighbor model to determine the classification result of the pregnant woman sample.
  • the K value adopted by the k-nearest neighbor model does not exceed 20.
  • the K value adopted by the k-nearest neighbor model is 3-10.
  • the distance between the samples is weighted.
  • the present invention provides a computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the aforementioned method for determining whether a fetus has chromosomal aneuploidy are implemented. Therefore, the method for determining whether a fetus has chromosomal aneuploidy described above can be effectively implemented, so that it can be effectively determined whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current strategy of setting thresholds based on the number of sequencing reads, eliminates the detection gray zone, and can also shorten the sample testing cycle, improve the customer experience, and significantly reduce sequencing and testing costs.
  • the present invention provides an electronic device, which includes: the aforementioned computer-readable storage medium; and one or more processors configured to execute the program. Therefore, the method for determining whether a fetus has chromosomal aneuploidy described above can be effectively implemented, so that it can be effectively determined whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current strategy of setting thresholds based on the number of sequencing reads, eliminates the detection gray zone, and can also shorten the sample testing cycle, improve the customer experience, and significantly reduce sequencing and testing costs.
  • the present invention proposes a method for constructing a machine learning classification model.
  • the method includes: (a) for each of a plurality of pregnant woman samples: obtaining nucleic acid sequencing data of the pregnant woman sample, where the pregnant woman sample contains cell-free fetal nucleic acid, the nucleic acid sequencing data consists of a plurality of sequencing reads, the pregnant woman samples include at least one positive sample and at least one negative sample, the positive sample has aneuploidy for the chromosome to be tested, and the negative sample does not have aneuploidy for the chromosome to be tested; determining, based on the nucleic acid sequencing data, the fetal concentration of the pregnant woman sample and the inverse estimated concentration of a predetermined chromosome, where the inverse estimated concentration is determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of a first comparison chromosome, the predetermined chromosome includes the chromosome to be tested and a second comparison chromosome, and the first comparison chromosome includes at least one autosome different from the predetermined chromosome; and determining the first feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the inverse estimated concentration of the second comparison chromosome, and the second feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the fetal concentration; and (b) using the plurality of pregnant woman samples as samples, performing machine learning training with the first feature and the second feature of the samples, so as to construct a machine learning classification model for determining aneuploidy.
  • a machine learning classification model can thus be effectively constructed, so that the classification model can be further used to identify and classify unknown samples to determine whether there is chromosome aneuploidy for a specific chromosome.
  • the machine learning classification model is a KNN model.
  • the KNN model adopts Euclidean distance.
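For a KNN model, "training" amounts to storing the labeled feature vectors so that new samples can be compared against them. The sketch below illustrates this under stated assumptions: the class name `KnnModel`, its methods, and the label strings are hypothetical, and the check for at least one positive and one negative sample mirrors the requirement in step (a). Euclidean distance is used, as in the text.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class KnnModel:
    """Hypothetical container for a two-feature KNN classifier."""
    k: int = 5
    points: list = field(default_factory=list)
    labels: list = field(default_factory=list)

    def fit(self, features, labels):
        """Store the (X1, X2) vectors and labels of the training samples."""
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        if not {"positive", "negative"} <= set(labels):
            raise ValueError("need at least one positive and one negative sample")
        self.points = [tuple(p) for p in features]
        self.labels = list(labels)
        return self

    def predict(self, sample):
        """Majority vote among the k training samples nearest in Euclidean distance."""
        order = sorted(range(len(self.points)),
                       key=lambda i: math.dist(sample, self.points[i]))
        votes = Counter(self.labels[i] for i in order[:self.k])
        return votes.most_common(1)[0][0]


# Hypothetical standardized training features.
train_x = [(0.9, 0.0), (0.8, -0.1), (1.0, 0.1), (0.0, -0.9), (0.1, -1.0), (-0.1, -0.8)]
train_y = ["positive"] * 3 + ["negative"] * 3
model = KnnModel(k=3).fit(train_x, train_y)
prediction = model.predict((0.0, -0.95))
```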
  • the present invention provides a device for constructing a machine learning classification model, which includes: a feature acquisition module for performing the following separately for each of a plurality of pregnant woman samples: acquiring nucleic acid sequencing data from the pregnant woman sample, where the pregnant woman sample contains cell-free fetal nucleic acid, the nucleic acid sequencing data consists of a plurality of sequencing reads, the pregnant woman samples include at least one positive sample and at least one negative sample, the positive sample has aneuploidy for the chromosome to be tested, and the negative sample does not have aneuploidy for the chromosome to be tested; determining, based on the nucleic acid sequencing data, the fetal concentration of the pregnant woman sample and the inverse estimated concentration of a predetermined chromosome, where the inverse estimated concentration is determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of a first comparison chromosome, the predetermined chromosome includes the chromosome to be tested and a second comparison chromosome, and the first comparison chromosome includes at least one autosome different from the predetermined chromosome; and determining the first feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the inverse estimated concentration of the second comparison chromosome, and the second feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the fetal concentration; and a training module used to perform machine learning training using the
  • the device can effectively implement the aforementioned method of constructing a machine learning classification model, thereby effectively constructing a machine learning classification model, so that the classification model can be further used to identify and classify unknown samples to determine whether there is chromosome aneuploidy for a specific chromosome.
  • the machine learning classification model is a KNN model.
  • the present invention proposes a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the steps of the aforementioned method for constructing a machine learning classification model.
  • the aforementioned method of constructing a machine learning classification model can thus be effectively implemented, so that a machine learning classification model can be effectively constructed and further used to identify and classify unknown samples to determine whether there is chromosome aneuploidy for a specific chromosome.
  • Figure 1 shows a schematic flow chart of a method for determining whether a fetus has chromosomal aneuploidy according to an embodiment of the present invention
  • Figure 2 shows a schematic flow chart of a method for determining fetal concentration according to an embodiment of the present invention
  • Figure 3 shows a schematic flow chart of a method for classifying pregnant woman samples according to an embodiment of the present invention
  • Figure 4 shows a block diagram of a device for determining whether a fetus has chromosomal aneuploidy according to an embodiment of the present invention
  • Figure 5 shows a block diagram of a fetal concentration-inverse concentration determination module according to an embodiment of the present invention
  • Figure 6 shows a block diagram of a feature determining module according to an embodiment of the present invention
  • Figure 7 shows a block diagram of constructing a machine learning classification model according to an embodiment of the present invention
  • Figures 12 and 13 show the ROC curve corresponding to the parameter k when the KNN model is used to detect T13 according to an embodiment of the present invention.
  • the present invention provides a method for determining whether a fetus has chromosomal aneuploidy.
  • the method for determining whether a fetus has chromosomal aneuploidy according to an embodiment of the present invention will be described in detail below by referring to FIGS. 1 to 3.
  • the method for determining whether a fetus has chromosomal aneuploidy includes:
  • the pregnant woman sample that can be used includes, but is not limited to, the peripheral blood of the pregnant woman.
  • NIPT: non-invasive prenatal testing
  • after a pregnant woman sample, such as the peripheral blood of a pregnant woman, is obtained, nucleic acid sequencing can be performed on the sample to obtain the nucleic acid sequencing data of the pregnant woman sample.
  • the nucleic acid sequencing data consists of a plurality, or a large number, of sequencing reads.
  • the method for sequencing the nucleic acid molecules of the pregnant woman sample is not particularly limited. Specifically, any sequencing method known to those skilled in the art can be used to sequence the nucleic acid molecules of the pregnant woman sample, including but not limited to paired-end sequencing, single-end sequencing, or single-molecule sequencing.
  • the obtained sequencing data, consisting of a large number of sequencing reads, can be filtered according to quality control standards to remove sequencing reads with sequencing quality problems, which can improve the accuracy of subsequent data analysis.
  • the fetal concentration of the pregnant woman sample and the inverse estimated concentration of a specific chromosome can be determined.
  • the fetal concentration refers to the ratio of the amount of cell-free nucleic acid from the fetus to the total amount of cell-free nucleic acid in a pregnant woman sample, such as peripheral blood.
  • the fetal concentration increases with gestational age. For example, around the 12th gestational week, the ratio of fetal cell-free nucleic acid (sometimes referred to simply as "fetal cell-free DNA") to total cell-free nucleic acid (i.e. the "fetal concentration") can reach 10-14%, and after the 20th gestational week this ratio can exceed 20%.
  • the fetal concentration will be abnormal. Therefore, the fetal concentration can be used as an important indicator to characterize the samples of pregnant women.
  • known methods for estimating the fetal concentration include the Y chromosome estimation method, the SNP-based fetal-specific SNP site method, and the nucleosome-based imprinting method. The inventors of the present invention found that these methods have their limitations.
  • the Y chromosome estimation method is not suitable for female fetuses; the SNP-based fetal-specific SNP site method requires the father's DNA sample, which is sometimes difficult to obtain; and the nucleosome-based imprinting method has poor accuracy and requires deep sequencing when constructing the model.
  • the fetal concentration in a nucleic acid sample can be determined through the following steps, which specifically include:
  • S210 Compare the nucleic acid sequencing data from the pregnant woman sample with the reference sequence, so as to determine the number of sequencing reads that fall into the predetermined window;
  • S220 Determine the fetal concentration of the pregnant woman sample based on the number of sequencing reads that fall into the predetermined window.
  • the method for determining the fetal concentration is based on the number of sequencing reads in a specific window (ie, a certain length of nucleic acid sequence), which is positively correlated with the fetal concentration. Therefore, by determining the number of sequencing reads in at least one predetermined window, the fetal concentration of the pregnant woman's sample can be obtained inversely, for example, in a weighted average manner.
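  • steps S210 and S220 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the 60K window length (borrowed from the example later in the text), the weights and the read positions are all assumptions, and a plain weighted sum stands in for whatever weighting the trained model actually applies.

```python
from collections import Counter


def count_reads_per_window(read_starts, window_size):
    """S210 (sketch): count aligned reads per fixed-length window.

    read_starts: 0-based start positions of reads aligned to one chromosome.
    Returns a Counter mapping window index -> number of reads.
    """
    return Counter(pos // window_size for pos in read_starts)


def estimate_fetal_concentration(window_counts, window_weights, intercept=0.0):
    """S220 (sketch): weighted sum of per-window read counts.

    window_weights: hypothetical pre-trained weight per window index.
    Windows that have no weight are ignored.
    """
    return intercept + sum(
        weight * window_counts.get(index, 0)
        for index, weight in window_weights.items()
    )


# Illustrative values only.
counts = count_reads_per_window([10, 59_999, 60_000, 130_000], window_size=60_000)
fetal = estimate_fetal_concentration(counts, {0: 0.02, 1: 0.01, 2: 0.03})
```

  • in practice the weights would come from the training step described in the following lines, and read counts would normally be normalized for sequencing depth before weighting.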
  • the predetermined window can be determined by means of statistics or machine learning.
  • the predetermined window is obtained by continuously dividing specific chromosomes of the reference genome sequence, and the weight of each predetermined window is further used to determine the fetal concentration.
  • the weight of each predetermined window is predetermined by using training samples. As a result, the results are accurate, reliable, and repeatable.
  • the weight is determined using at least one of a ridge regression statistical model and a neural network model.
  • the neural network model adopts the TensorFlow learning system.
  • the parameters of the TensorFlow learning system include: using the number of sequencing reads in each autosomal window as the input layer; using the fetal concentration as the output layer; using ReLU as the neuron type; and using an optimization algorithm selected from at least one of Adam, SGD and Ftrl, preferably Ftrl.
  • the parameters of the TensorFlow learning system further include: a learning rate of 0.002; 1 hidden layer; and 200 neurons in the hidden layer.
  • the results are accurate and reliable.
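  • the network shape described above (per-window read counts in, one ReLU hidden layer, fetal concentration out) can be illustrated with a forward pass in plain Python. This is only a sketch: it does not use TensorFlow, the hidden layer has 2 neurons instead of 200 to keep the numbers readable, and every weight and bias below is a made-up value rather than a trained parameter.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def forward(window_counts, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a network with a single ReLU hidden layer.

    window_counts: one (normalized) read count per window (input layer).
    hidden_weights: one weight vector per hidden neuron.
    Returns the scalar output, interpreted here as the fetal concentration.
    """
    hidden = [
        relu(sum(w * x for w, x in zip(weights, window_counts)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical parameters for a 2-window, 2-neuron toy network.
estimate = forward(
    window_counts=[1.0, 2.0],
    hidden_weights=[[0.5, 0.25], [-1.0, 0.0]],
    hidden_biases=[0.0, 0.5],
    output_weights=[0.1, 0.2],
    output_bias=0.01,
)
```

  • training such a network (learning rate, optimizer choice) is outside this sketch; only the inference arithmetic is shown.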
  • the term "weight" used herein is a relative concept that applies to a given indicator.
  • the weight of an indicator refers to the relative importance of that indicator in the overall evaluation.
  • the "weight of a predetermined window" refers to the relative importance of that predetermined window among all predetermined windows.
  • a "connection weight" refers to the relative importance of a connection between two different layers among all connections between those two layers.
  • PCT/CN2018/07204 title of invention: method and device for determining the proportion of free nucleic acid from a predetermined source in a biological sample
  • the full text is incorporated by reference.
  • the method can obtain fetal concentration data simply, quickly and accurately.
  • the obtained fetal concentration data can be more effectively applied to the method of the present invention to determine whether the fetus has chromosomal aneuploidy.
  • not only can the fetal concentration be determined, but the inverse estimated concentration of the predetermined chromosomes can also be determined.
  • the term "inverse estimated concentration" used herein refers to a measure that characterizes the difference between the DNA content of a specific chromosome and that of a normal chromosome. Specifically, it can be expressed by the difference between the number of sequencing reads of the specific chromosome and that of normal chromosomes. For example, in an ideal state, for a chromosome with trisomy, the inverse estimated concentration represents the DNA content of the one extra chromosome; for a normal chromosome, because there is no extra chromosome, the inverse estimated concentration is 0.
  • the term "normal chromosome" refers to a chromosome without chromosomal aneuploidy, and does not mean that the chromosome has no other abnormalities.
  • the expression "number of sequencing reads of ..." is used many times herein, for example "number of sequencing reads of a normal chromosome", "number of sequencing reads of a specific chromosome" and "number of sequencing reads that fall into a predetermined window". In each case it refers to the number of sequencing reads that can be matched with the region in question.
  • the nucleic acid sequencing result can be aligned against a reference sequence such as hg19.
  • when conventional software such as SOAP is used for the alignment, the sequencing reads aligned to a specific region are regarded as the sequencing reads of that region.
  • the step of determining the number of corrected sequencing reads includes:
  • inverse estimated concentration refers to a measure that characterizes the difference between the DNA content of a specific chromosome and the DNA content of normal chromosomes. Therefore, the inverse estimated concentration can be used as an important indicator for characterizing pregnant women's samples. According to an embodiment of the present invention, the inverse estimation concentration is determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of the first comparison chromosome.
  • the term "predetermined chromosome" includes the chromosome to be tested, that is, the chromosome for which aneuploidy needs to be determined.
  • the predetermined chromosome also includes a second comparison chromosome.
  • the second comparison chromosome includes at least one autosome. It should be noted that the inverse estimated concentration is calculated separately for each predetermined chromosome, so an inverse estimated concentration is obtained for the chromosome to be tested and for each second comparison chromosome.
  • the first comparison chromosome and the second comparison chromosome are derived from the same sample as the chromosome to be tested, instead of using data from other samples for analysis.
  • the second comparison chromosome includes at least 10 autosomes. According to an embodiment of the present invention, the second comparison chromosome includes 15 autosomes.
  • the inverse estimated concentration can be used as an indicator of whether a chromosome is abnormal, so the second comparison chromosome can be selected by using the inverse estimated concentration. According to an embodiment of the present invention, the method further includes: determining the inverse estimated concentrations of a plurality of autosomes; and selecting the top-ranked autosomes as the second comparison chromosome, ranked in ascending order of inverse estimated concentration. As described above, the smaller the inverse estimated concentration, the higher the probability that the chromosome is a normal chromosome.
  • a suitable autosome can be selected as the second comparison chromosome.
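  • the ranking-based selection of second comparison chromosomes can be sketched as below. The function name, the example values and the decision to rank by absolute value (so that a strongly negative value is not mistaken for "small") are assumptions made for illustration; the text itself only states that the order runs from small to large.

```python
def select_second_comparison(inverse_concentrations, count, exclude=()):
    """Pick the `count` autosomes with the smallest inverse estimated concentration.

    inverse_concentrations: dict mapping chromosome number -> inverse
        estimated concentration.
    exclude: chromosomes that must not be chosen, e.g. the chromosome
        to be tested.
    Ranking uses the absolute value (an assumption of this sketch).
    """
    candidates = {
        chrom: value
        for chrom, value in inverse_concentrations.items()
        if chrom not in exclude
    }
    ranked = sorted(candidates, key=lambda chrom: abs(candidates[chrom]))
    return ranked[:count]


# Illustrative values only; chromosome 21 is the chromosome to be tested.
chosen = select_second_comparison(
    {1: 0.01, 2: -0.002, 3: 0.05, 21: 0.12}, count=2, exclude=(21,)
)
```

  • with the embodiments mentioned above, `count` would be at least 10, for example 15.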
  • whether the number of a chromosome is abnormal can also be judged from experience. For example, statistical analysis shows that some chromosomes almost never exhibit aneuploidy, so these chromosomes can be used as the second comparison chromosome.
  • the inverse estimated concentration characterizes the difference between a specific chromosome and a normal chromosome. Therefore, according to an embodiment of the present invention, the first comparison chromosome includes at least one autosome that is different from the predetermined chromosome. It should be noted that the first comparison chromosome and the second comparison chromosome mentioned here may overlap. Specifically, in the formula for the inverse estimated concentration, one specific chromosome is selected from the predetermined chromosomes at a time; the remaining chromosomes, although they may fall under the meaning of "second comparison chromosome", still belong to the concept of "autosome different from the predetermined chromosome".
  • for example, if chromosome 23 is selected as the chromosome to be tested and chromosomes 2 to 5 are used as the second comparison chromosome, then chromosomes 2 to 5 can still be used as the first comparison chromosome when calculating the inverse estimated concentration of chromosome 23.
  • the first comparison chromosome may include multiple autosomes, and the average number of their sequencing reads may be used when calculating the inverse estimated concentration. In this way, the efficiency and accuracy of sequencing data analysis can be further improved.
  • the number of sequencing reads of the first comparison chromosome is an average number of sequencing reads of a plurality of autosomes, the plurality of autosomes including at least one autosome that is known to have no aneuploidy.
  • the number of sequencing reads of the first comparison chromosome is an average number of sequencing reads of at least 15 autosomes.
  • the number of sequencing reads of the first comparison chromosome is an average number of sequencing reads of at least 20 autosomes.
  • the number of sequencing reads of the first comparison chromosome is the average number of sequencing reads of all autosomes. In this way, by selecting the average number of reads for multiple chromosomes, the differences between the chromosomes can be eliminated.
  • the inverse estimated concentration is determined according to the following formula:
  • j represents the number of the chromosome for which the inverse estimated concentration needs to be determined
  • Fj represents the inverse estimated concentration of chromosome j
  • Rr represents the average number of sequencing reads of the multiple autosomes
  • Rj represents the number of sequencing reads of chromosome j.
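  • the role of Rj and Rr can be illustrated with a small sketch. The exact formula is not reproduced in this text, so the relative-difference form used below, F_j = scale * (R_j - R_r) / R_r with scale = 2, is an assumption chosen only because it behaves as the description requires: it is 0 for a normal chromosome and, in the ideal case, equals the fetal fraction when the fetus carries one extra copy. The function name and the read counts are hypothetical, and the read counts are assumed to be already corrected and comparable across chromosomes.

```python
def inverse_estimated_concentration(reads_j, reference_reads, scale=2.0):
    """ASSUMED form, not the source formula: F_j = scale * (R_j - R_r) / R_r.

    reads_j: R_j, (corrected) number of sequencing reads of chromosome j.
    reference_reads: read counts of the first comparison chromosomes,
        averaged to give R_r.
    scale: 2.0 reflects that one extra fetal copy raises the read count
        by half the fetal fraction in the ideal case.
    """
    r_r = sum(reference_reads) / len(reference_reads)
    return scale * (reads_j - r_r) / r_r


# Illustrative values only.
f_trisomy = inverse_estimated_concentration(1050, [1000, 1000, 1000])
f_normal = inverse_estimated_concentration(1000, [1000, 1000, 1000])
```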
  • the fetal concentration and the inverse estimated concentration determined in this step are both affected by the chromosome aneuploidy to varying degrees, so these two parameters can be used in subsequent aneuploidy detection.
  • these parameters can be further used as the characteristic values of the sample, so that machine learning can be further used for analysis.
  • the first feature is determined from the difference between the inverse estimated concentration of the chromosome to be tested and the inverse estimated concentration of the second comparison chromosome, and the second feature is determined from the difference between the inverse estimated concentration of the chromosome to be tested and the previously determined fetal concentration.
  • therefore, the obtained first feature and second feature can both be regarded as features that are affected by aneuploidy, and can be effectively applied to subsequent analysis.
  • those skilled in the art can use a variety of algorithms to characterize the aforementioned differences, for example, by calculating the difference of the values, the ratio of the values, and so on.
  • the inverse estimated concentration of the second comparison chromosome is preferably the average inverse estimated concentration of multiple autosomes. As a result, the efficiency and accuracy of the analysis can be further improved.
  • the first feature is determined by the following formula:
  • X1 represents the first feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fr represents the average value of the inverse estimated concentration of the second comparison chromosome.
  • the second feature is determined by the following formula:
  • X2 represents the second feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fa represents the fetal concentration
  • on the one hand, the first feature and the second feature thus obtained each reflect the difference on which they are based; on the other hand, the resulting values are of the same order of magnitude, which prevents a single parameter from having an excessive influence on the analysis result. If the features are selected inappropriately, the subsequent analysis results may be biased.
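  • a minimal sketch of the two features follows. The text notes that the "difference" may be expressed as a subtraction, a ratio, and so on; plain subtraction is used here purely as an assumption, and the function names and numbers are illustrative, not taken from the source formulas.

```python
def first_feature(f_i, second_comparison_concentrations):
    """X1 (sketch): F_i minus F_r, where F_r is the mean inverse estimated
    concentration of the second comparison chromosomes."""
    f_r = sum(second_comparison_concentrations) / len(second_comparison_concentrations)
    return f_i - f_r


def second_feature(f_i, fetal_concentration):
    """X2 (sketch): F_i minus F_a, the fetal concentration of the sample."""
    return f_i - fetal_concentration


# Illustrative values only: a test chromosome whose inverse estimated
# concentration (0.10) is close to the fetal concentration (0.12).
x1 = first_feature(0.10, [0.01, -0.01, 0.0])
x2 = second_feature(0.10, 0.12)
```

  • under this subtraction form, a trisomic chromosome tends toward a large X1 and an X2 near 0, while a normal chromosome tends toward an X1 near 0 and a clearly negative X2.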
  • the distance between samples should be calculated according to the characteristics of the samples (for example, the feature of sample x 1 The characteristics of sample x 2 are Then the distance between samples x 1 and x 2 is ), if the feature value difference between the two samples is very large, for example, the distance is
  • the obtained first feature and second feature are standardized, so that the absolute values of the first feature and the second feature are each independently between 0 and 1.
  • the means for standardizing the first feature and the second feature is not particularly limited. Specifically, a batch of data of the same dimension (that is, all first-feature values or all second-feature values) can be processed according to the following formula:
  • min and max are the minimum and maximum values of this batch of values
  • oldvalue represents the value before processing
  • newvalue represents the value after normalization processing
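  • the symbols min, max, oldvalue and newvalue match ordinary min-max scaling, newvalue = (oldvalue - min) / (max - min). The sketch below assumes that standard form, since the formula itself is not reproduced in this text; the function name and the handling of a constant batch are choices made for this example.

```python
def min_max_scale(values):
    """Scale one batch of same-dimension values into [0, 1].

    Assumes the standard min-max form:
        newvalue = (oldvalue - min) / (max - min)
    A constant batch (max == min) is mapped to all zeros to avoid
    division by zero; that convention is specific to this sketch.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Each feature dimension is scaled on its own batch.
scaled_x1 = min_max_scale([2.0, 4.0, 6.0])
```

  • when the scaling is fitted on training samples, the same min and max would normally be reused for the sample to be tested so that all samples share one scale.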
  • S400: determine the aneuploidy based on the first feature and the second feature
  • the values of the first feature and the second feature are both affected by aneuploidy. Therefore, after the first feature and the second feature are obtained, the corresponding data of control samples can be used to determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the control sample includes a positive sample and a negative sample, the positive sample has aneuploidy for the chromosome to be tested, and the negative sample does not have aneuploidy for the chromosome to be tested.
  • the determination of whether the test chromosome has aneuploidy can be realized.
  • the inventor found in the research process that the number of positive samples and negative samples satisfying a certain ratio can further improve the accuracy of the analysis.
  • the ratio of the number of positive samples and negative samples is not less than 1:4.
  • the ratio of the number of positive samples to the number of negative samples does not exceed 4:1.
  • the ratio of the numbers of the positive samples and the negative samples is 1:0.1-5.
  • the ratio of the numbers of the positive samples and the negative samples is 1:0.25-4.
  • neither the positive sample nor the negative sample has aneuploidy for chromosomes other than the chromosome to be tested.
  • the classification reference ability of the control sample can be further improved.
  • the method of classifying with the first feature and the second feature is not particularly limited, and a variety of machine learning methods, such as neural networks and SVM, can be used.
  • the first feature and the second feature may be used to determine a two-dimensional feature vector for the pregnant woman sample and for each control sample. Based on the distance between samples determined from the two-dimensional feature vectors, the pregnant woman sample is classified as belonging with the positive control samples or the negative control samples, so as to determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the distance that can be used includes, but is not limited to, Euclidean distance, Manhattan distance, or Chebyshev distance.
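  • the three distances named above have standard definitions, shown here for two-dimensional feature vectors (X1, X2). The function names and the sample vectors are illustrative.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Two hypothetical feature vectors.
sample_a, sample_b = (0.0, 0.0), (3.0, 4.0)
distances = (
    euclidean(sample_a, sample_b),
    manhattan(sample_a, sample_b),
    chebyshev(sample_a, sample_b),
)
```

  • because all three depend on raw coordinate differences, they are only meaningful once both features have been brought to a comparable scale, which is the purpose of the standardization step.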
  • KNN (K-nearest neighbor)
  • the classification process includes the following steps:
  • S450: based on the majority voting rule, determine the classification result of the pregnant woman sample.
  • the predetermined number is not more than 20. According to an embodiment of the present invention, the predetermined number is 3-10.
  • the K value can be an odd number to avoid situations where a decision cannot be made.
  • the K value finally selected may differ between chromosomes to be tested. For example, according to an embodiment of the present invention, the K value finally selected is 7 for T13 and T18 detection and 9 for T21 detection.
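  • the KNN classification with majority voting can be sketched as follows, assuming Euclidean distance and the label convention +1 for positive and -1 for negative that the example section uses. The function name and the control samples are hypothetical.

```python
import math
from collections import Counter


def knn_classify(sample, controls, k):
    """Classify `sample` by majority vote among its k nearest control samples.

    sample: (x1, x2) feature vector of the pregnant woman sample.
    controls: list of ((x1, x2), label) pairs, label +1 or -1.
    k: number of neighbours; an odd value avoids tied votes.
    """
    nearest = sorted(controls, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, already standardized control samples.
controls = [
    ((0.10, 0.10), -1), ((0.20, 0.10), -1), ((0.15, 0.20), -1),
    ((0.80, 0.90), +1), ((0.90, 0.80), +1), ((0.85, 0.85), +1),
]
positive_call = knn_classify((0.82, 0.88), controls, k=3)
negative_call = knn_classify((0.12, 0.12), controls, k=3)
```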
  • the distance between the sample to be tested and a predetermined control sample may be weighted in advance. As a result, the accuracy of the detection can be further improved.
  • the weighting coefficients of this weighting, or the K value of the KNN model, can be obtained through machine learning, using known samples as the training set.
  • the output of the model is the category y to which the sample x belongs
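  • one common way to weight distances in a KNN model is inverse-distance voting, in which closer control samples count for more. The text does not say which weighting scheme is used, so the scheme below is a stand-in chosen for illustration; the function name, the `eps` guard and the sample values are assumptions.

```python
import math
from collections import defaultdict


def weighted_knn_classify(sample, controls, k, eps=1e-9):
    """KNN in which each neighbour votes with weight 1 / (distance + eps).

    Inverse-distance weighting is an illustrative choice, not necessarily
    the weighting used in the source. `eps` avoids division by zero when
    a control sample coincides with the sample to be tested.
    """
    nearest = sorted(controls, key=lambda item: math.dist(sample, item[0]))[:k]
    scores = defaultdict(float)
    for features, label in nearest:
        scores[label] += 1.0 / (math.dist(sample, features) + eps)
    return max(scores, key=scores.get)


# One very close positive control outweighs two distant negative controls,
# although a plain majority vote with k=3 would return -1.
controls = [((0.50, 0.50), +1), ((0.90, 0.90), -1), ((0.95, 0.90), -1)]
label = weighted_knn_classify((0.52, 0.50), controls, k=3)
```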
  • the method can effectively determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current threshold-setting strategy based on the number of sequencing reads, eliminates the gray area of detection, and at the same time can shorten the sample detection cycle, improve the customer experience, and significantly reduce sequencing and detection costs.
  • an embodiment of the present application also provides a corresponding device for implementing the foregoing method.
  • the present invention provides a device for determining whether a fetus has chromosomal aneuploidy.
  • the device for determining whether a fetus has chromosomal aneuploidy includes:
  • the data acquisition module 100 is used to acquire nucleic acid sequencing data from a sample of pregnant women.
  • the pregnant woman sample contains free fetal nucleic acid, and the nucleic acid sequencing data is composed of multiple sequencing reads;
  • the fetal concentration-inverse concentration determination module 200 is used to determine the fetal concentration of the pregnant woman sample and the inverse estimated concentration of the predetermined chromosome based on the nucleic acid sequencing data.
  • the inverse estimated concentration is determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of the first comparison chromosome; the predetermined chromosome includes the chromosome to be tested and the second comparison chromosome, and the first comparison chromosome includes at least one autosome that is different from the predetermined chromosome;
  • the feature determination module 300 is configured to determine the first feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the inverse estimated concentration of the second comparison chromosome, and to determine the second feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the fetal concentration;
  • the aneuploidy determination module 400 is used to determine, based on the first feature and the second feature and using the corresponding data of control samples, whether the fetus of the pregnant woman has aneuploidy for the chromosome to be tested.
  • the control sample includes a positive sample and a negative sample; the positive sample has aneuploidy for the chromosome to be tested, and the negative sample does not have aneuploidy for the chromosome to be tested.
  • the device for determining whether a fetus has chromosomal aneuploidy can effectively implement the method for determining whether a fetus has chromosomal aneuploidy, so as to effectively determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current threshold-setting strategy based on the number of sequencing reads, eliminates the detection gray area, and can also shorten the sample detection cycle, improve the customer experience, and significantly reduce sequencing and detection costs.
  • the fetal concentration-inverse concentration determination module 200 includes:
  • the comparison unit 210 is configured to compare the nucleic acid sequencing data from a pregnant woman sample with a reference sequence, so as to determine the number of sequencing reads that fall into a predetermined window;
  • the fetal concentration calculation unit 220 is configured to determine the fetal concentration of the pregnant woman sample based on the number of sequencing reads that fall into the predetermined window.
  • the fetal concentration-inverse estimated concentration determination module 200 further includes:
  • the inverse estimated concentration calculation unit 230 is configured to determine the inverse estimated concentration according to the following formula:
  • j represents the number of the chromosome for which the inverse estimated concentration needs to be determined
  • Fj represents the inverse estimated concentration of chromosome j
  • Rr represents the average number of sequencing reads of the multiple autosomes
  • Rj represents the number of sequencing reads of chromosome j.
  • the fetal concentration-inverse concentration determination module 200 includes:
  • the second comparison chromosome determination unit 240 is configured to select the top-ranked autosomes as the second comparison chromosome, ranked in ascending order of the inverse estimated concentrations of the plurality of autosomes.
  • the feature determination module 300 includes:
  • the first feature determining unit 310 is configured to determine the first feature through the following formula:
  • X1 represents the first feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fr represents the average value of the inverse estimated concentration of the second comparison chromosome.
  • the feature determining module 300 further includes:
  • the second feature determining unit 320 is configured to determine the second feature using the following formula:
  • X2 represents the second feature
  • i represents the number of the chromosome to be tested
  • Fi represents the inverse estimated concentration of the chromosome to be tested
  • Fa represents the fetal concentration
  • the feature determining module 300 further includes:
  • the standardization processing unit 330 is configured to perform standardization processing on the first feature and the second feature, so that the absolute values of the first feature and the second feature are independently between 0 and 1 respectively.
  • the aneuploidy determination module 400 is configured to use the first feature and the second feature to determine the two-dimensional feature vectors of the pregnant woman sample and the control samples, and, based on the distance between samples determined from the two-dimensional feature vectors, to classify the pregnant woman sample as belonging with the positive control samples or the negative control samples, so as to determine whether the fetus has aneuploidy for the chromosome to be tested.
  • the distance is Euclidean distance, Manhattan distance or Chebyshev distance.
  • the aneuploidy determination module is configured to use a k-nearest neighbor model to determine the classification result of the pregnant woman sample.
  • the K value adopted by the k-nearest neighbor model does not exceed 20.
  • the K value adopted by the k-nearest neighbor model is 3-10.
  • the distance between the samples is weighted.
  • the present invention provides a computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the aforementioned method for determining whether a fetus has chromosomal aneuploidy are implemented.
  • therefore, the method for determining whether a fetus has chromosomal aneuploidy described above can be effectively implemented, so that it can be effectively determined whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current threshold-setting strategy based on the number of sequencing reads, eliminates the detection gray area, and can also shorten the sample detection cycle, improve the customer experience, and significantly reduce sequencing and detection costs.
  • the present invention provides an electronic device, which includes: the aforementioned computer-readable storage medium; and one or more processors configured to execute the program in the computer-readable storage medium. Therefore, the method for determining whether a fetus has chromosomal aneuploidy described above can be effectively implemented, so that it can be effectively determined whether the fetus has aneuploidy for the chromosome to be tested.
  • the method replaces the current threshold-setting strategy based on the number of sequencing reads, eliminates the detection gray area, and can also shorten the sample detection cycle, improve the customer experience, and significantly reduce sequencing and detection costs.
  • the present invention proposes a method for constructing a machine learning classification model.
  • the method includes:
  • the pregnant women samples contain fetal free nucleic acids.
  • the nucleic acid sequencing data consists of multiple sequencing reads.
  • the pregnant women samples include at least one positive sample and at least one negative sample.
  • the positive sample has aneuploidy for the chromosome to be tested, and the negative sample does not have aneuploidy for the chromosome to be tested;
  • the predetermined chromosome includes the chromosome to be tested.
  • the first comparison chromosome includes at least one autosome different from the predetermined chromosome; the first feature is determined based on the difference between the inverse estimated concentration of the chromosome to be tested and the inverse estimated concentration of the second comparison chromosome, and the second feature is determined based on the difference between the inverse estimated concentration of the chromosome to be tested and the fetal concentration;
  • a machine learning classification model can thus be effectively constructed, so that the classification model can be further used to identify and classify unknown samples to determine whether there is chromosomal aneuploidy for a specific chromosome.
  • the machine learning classification model is a KNN model.
  • the KNN model adopts Euclidean distance.
  • the present invention provides a device for constructing a machine learning classification model.
  • the device includes:
  • the feature acquisition module 800 is used to perform the following separately for each of multiple pregnant woman samples: acquire nucleic acid sequencing data from the pregnant woman sample, where the pregnant woman sample contains free fetal nucleic acid, the nucleic acid sequencing data consists of multiple sequencing reads, and the pregnant woman samples include at least one positive sample and at least one negative sample, the positive sample having aneuploidy for the chromosome to be tested and the negative sample not having aneuploidy for the chromosome to be tested; determine, based on the nucleic acid sequencing data, the fetal concentration of the pregnant woman sample and the inverse estimated concentration of the predetermined chromosome, the inverse estimated concentration being determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of the first comparison chromosome.
  • the predetermined chromosome includes the chromosome to be tested and the second comparison chromosome
  • the first comparison chromosome includes at least one autosome different from the predetermined chromosome; and determine the first feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the inverse estimated concentration of the second comparison chromosome, and determine the second feature based on the difference between the inverse estimated concentration of the chromosome to be tested and the fetal concentration;
  • the training module 900 is configured to use the multiple pregnant woman samples as samples to perform machine learning training, so as to construct a machine learning classification model for determining whether the fetus has aneuploidy.
  • the device can effectively implement the foregoing method of constructing a machine learning classification model, thereby effectively constructing a machine learning classification model, so that the classification model can be further used to identify and classify unknown samples to determine whether there is chromosomal aneuploidy for a specific chromosome.
  • the machine learning classification model is a KNN model.
  • the KNN model adopts Euclidean distance.
  • the present invention proposes a computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, it implements the steps of the aforementioned method for constructing a machine learning classification model.
  • the foregoing method of constructing a machine learning classification model can thus be effectively implemented, so that a machine learning classification model can be effectively constructed and further used to identify and classify unknown samples to determine whether there is chromosomal aneuploidy for a specific chromosome.
  • the features and advantages described above for the method for determining whether a fetus has chromosomal aneuploidy are applicable to the computer-readable storage medium of the constructed model, and will not be repeated here.
  • This example uses 3075 samples with follow-up results, sequenced on the BGISEQ-500 platform from 2017 to 2018 (male fetuses: 1716 cases; female fetuses: 1359 cases; negative samples: 2215 cases; chromosome 21 trisomy (T21): 637 cases; chromosome 18 trisomy (T18): 165 cases; chromosome 13 trisomy (T13): 58 cases), for model training and model prediction.
  • the reference genome (GRCh37) is divided into consecutive, adjacent windows of a fixed length (60K is used in this method), windows in N regions are filtered out, and the GC content of each window is computed to obtain the reference window file hg19.gc;
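  • the windowing step can be sketched as follows. This is a toy illustration under stated assumptions: the function name is hypothetical, the sequence is an in-memory string rather than a genome file, a window is discarded if it contains any N base (the text only says windows in the N area are filtered out), and an incomplete trailing window is dropped.

```python
def gc_windows(sequence, window_size):
    """Split a sequence into adjacent fixed-length windows and compute GC content.

    Returns a list of (window_start, gc_fraction) for windows free of N.
    Windows containing any N are skipped; a trailing partial window is
    ignored (both are simplifying assumptions of this sketch).
    """
    windows = []
    for start in range(0, len(sequence) - window_size + 1, window_size):
        window = sequence[start:start + window_size].upper()
        if "N" in window:
            continue
        gc_fraction = (window.count("G") + window.count("C")) / window_size
        windows.append((start, gc_fraction))
    return windows


# Toy sequence with a window size of 4 instead of 60K.
reference_windows = gc_windows("ACGTNNNNGGCCAT", window_size=4)
```

  • the resulting per-window GC fractions are what a later GC-correction step would use when adjusting the read counts.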
  • Filtering and preliminary statistics: based on the alignment results, select the uniquely and perfectly aligned reads, remove duplicate reads and reads with base mismatches to obtain the effective reads, and then count the number of effective reads and the GC content of each window according to the windows in the hg19.gc file;
  • j represents the number of the chromosome, Indicates the number of GC-corrected sequencing reads that can match the reference sequence of chromosome j, Represents the average number of GC-corrected sequencing reads that can be matched with all autosomal reference sequences.
  • Sample set division and data preprocessing: the sample set is randomly divided into a training set, a validation set and a test set at a ratio of 6:2:2; data preprocessing is performed on the samples of each set so that every sample obtains a two-dimensional feature vector and a corresponding label (-1 for negative, +1 for positive).
  • Model training consists of two parts: KNN model training and k value selection. At this time, the Euclidean distance and the majority voting rule are selected.
  • KNN model training For classification decision function:
  • Figures 10 and 11 show the ROC curves when the KNN model detects T18 and the parameter k is selected as 6, 7, 8, and 9 respectively.
  • Figures 12 and 13 show the ROC curves when the KNN model detects T13 and the parameter k is selected as 6, 7, 8, and 9, respectively. According to the results of Figs. 8-13, the final selection k for T13 and T18 is 7, and the final selection k for T21 is 9.
  • Model prediction Based on the model trained in the above steps, the test set is predicted, and the prediction results are shown in the following table.
  • SVM (Support Vector Machine)
  • In the T21 test, the SVM model has 14 false positive samples, while the KNN model has only 3; in the T18 test, the SVM model has 8 false positives, while the KNN model has only 5; in the T13 test, the SVM model has 8 false positives, while the KNN model has 6. For T21, T18 and T13 alike, the KNN model has a lower false positive rate than the SVM model.
  • The inventors attribute the lower false positive rate of the KNN model mainly to the models themselves: KNN classifies according to many fine-grained local clusters, whereas SVM simply separates the samples into two classes, so SVM is less fine-grained than KNN.

Abstract

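Provided is a method for determining whether a fetus has chromosomal aneuploidy, the method comprising: (1) obtaining nucleic acid sequencing data from a sample of a pregnant woman; (2) determining, based on the nucleic acid sequencing data, the fetal concentration of the sample and the reverse-estimated concentrations of predetermined chromosomes, the predetermined chromosomes including a chromosome to be tested and a second comparison chromosome; (3) determining a first feature based on the difference between the reverse-estimated concentration of the chromosome to be tested and that of the second comparison chromosome, and determining a second feature based on the difference between the reverse-estimated concentration of the chromosome to be tested and the fetal concentration; and (4) determining, based on the first and second features and using the corresponding data of control samples, whether the fetus has aneuploidy of the chromosome to be tested, wherein the control samples include positive samples, which have aneuploidy of the chromosome to be tested, and negative samples, which do not.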

Description

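Method and device for determining chromosomal aneuploidy and constructing a classification model
Technical Field
The present invention relates to the field of biotechnology, in particular to non-invasive prenatal genetic testing, and specifically to a method and device for determining chromosomal aneuploidy and a corresponding method and device for constructing a machine learning classification model.
Background
Prenatal screening methods generally fall into two categories: invasive methods (also called prenatal diagnosis) and non-invasive methods. The former mainly include amniocentesis, chorionic villus sampling and cord blood sampling; the latter include ultrasound examination, measurement of maternal peripheral serum markers and fetal cell testing. Cells isolated from the fetus can be obtained by invasive methods such as chorionic villus sampling (CVS) or amniocentesis and used for conventional prenatal diagnosis. Although such methods diagnose fetal aneuploidy with high accuracy, they are invasive and carry a certain risk for both the pregnant woman and the fetus.
Conventional non-invasive screening methods, such as prenatal serological screening, are generally less accurate.
Dennis Lo et al. discovered cell-free fetal DNA in maternal plasma and serum, which opened a new avenue for non-invasive prenatal testing (NIPT). NIPT mainly uses high-throughput sequencing to analyze cell-free fetal DNA in maternal peripheral blood in order to assess the risk of common fetal chromosomal aneuploidies. The current screening scope commonly covers aneuploidy of chromosome 21 (T21), chromosome 18 (T18) and chromosome 13 (T13), and the sex chromosomes.
Common existing techniques that use high-throughput sequencing of cell-free fetal DNA in maternal peripheral blood to detect common fetal chromosomal aneuploidies are as follows:
1. NIPT based on quantifying the number of sequencing reads: alignment software is used to place the sequencing reads into predefined windows, and a suitable method is then used to test the chromosome of interest for aneuploidy.
2. NIPT based on single nucleotide polymorphisms (SNPs): according to predetermined SNP regions, the genomic DNA of both parents and the cell-free fetal DNA are captured and sequenced, and the genotype information of the parents and the fetus is then used in a Bayesian model to test the chromosome of interest for aneuploidy.
3. NIPT based on DNA fragment size: paired-end (PE) sequencing is used to selectively extract cell-free fetal DNA fragments based on the difference in size distribution between fetal and maternal DNA fragments, and a Z-test against reference samples is then used to test the chromosome of interest for aneuploidy.
However, each of these existing non-invasive prenatal testing methods has its drawbacks, which are summarized in the following table for ease of understanding:
Figure PCTCN2019130625-appb-000001
Figure PCTCN2019130625-appb-000002
Therefore, current methods for determining chromosomal aneuploidy by non-invasive means still need improvement.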
发明内容
本发明旨在至少解决现有技术中存在的技术问题之一。为此,本发明的一个目的在于提出一种能够有效地确定染色体非整倍性的方法。
根据本发明的一个方面,本发明提供了一种确定胎儿是否存在染色体非整倍性的方法,根据本发明的实施例,该方法包括:(1)获取来自孕妇样本的核酸测序数据,所述孕妇样本含有胎儿游离核酸,所述核酸测序数据由多个测序读段构成;(2)基于所述核酸测序数据确定所述孕妇样本的胎儿浓度以及预定染色体的反估浓度,所述反估浓度是基于所述预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,所述预定染色体包括待测染色体和第二比较染色体,所述第一比较染色体包括至少一个不同于所述预定染色体的常染色体;(3)基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度的差异确定第一特征,基于所述待测染色体的反估浓度与所述胎儿浓度的差异确定第二特征;和(4)基于所述第一特征和第二特征,利用对照样本的相应数据,确定所述胎儿针对所述待测染色体是否存在非整倍性,其中,所述对照样本包括阳性样本和阴性样本,所述阳性样本针对所述待测染色体具有非整倍性,所述阴性样本针对所述待测染色体不具有非整倍性。
通过该方法能够有效地确定胎儿针对待测染色体是否具有非整倍性,另外,根据本发明的实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。
根据本发明的实施例,上述方法还可以具有下列附加技术特征:
根据本发明的实施例,所述孕妇样本包括孕妇外周血。
根据本发明的实施例,所述核酸测序样本是通过双末端测序、单末端测序或者单分子测序获得的。
根据本发明的实施例,所述胎儿浓度是通过下列步骤确定的:(a)将来自所述孕妇样本的所述核酸测序数据与参照序列比对,以便确定落入预定窗口的测序读段的数目;和(b)基于所述落入预定窗口的测序读段的数目,确定所述孕妇样本的胎儿浓度。
根据本发明的实施例,在步骤(2)中,所述第一比较染色体的测序读段数目为多条常染色体的平均测序读段数目,所述多条常染色体包括至少一个已知不具有非整倍性的常染色体。
根据本发明的实施例,在步骤(2)中,所述第一比较染色体的测序读段数目为至少15条常染色体的平均测序读段数目,可选的,第一比较染色体的测序读段数目为至少20条常染色体的平均测序读段数目,可选的,第一比较染色体的测序读段数目为全部常染色体的平均测序读段数目。
根据本发明的实施例,反估浓度是按照下列公式确定的:
Fj=2*|Rj-Rr|/(Rr)
其中
j表示需要确定所述反估浓度的染色体的编号,
Fj表示第j号染色体的反估浓度,
Rr表示所述多条常染色体的平均测序读段数目,和
Rj表示第j号染色体的测序读段数目。
根据本发明的实施例,在步骤(3)中,基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度平均值的差异确定第一特征。
根据本发明的实施例,所述第二比较染色体包含至少10条常染色体。
根据本发明的实施例,所述第二比较染色体包含15条常染色体。
根据本发明的实施例,进一步包括:确定多条常染色体的所述反估浓度;和按照由小至大的优先顺序,选择目标排序的常染色体作为所述第二比较染色体。
根据本发明的实施例,所述第一特征是通过下列公式确定的:
X1=Fi-Fr
其中
X1表示第一特征,
i表示所述待测染色体的编号,
Fi表示所述待测染色体的所述反估浓度,
Fr表示所述第二比较染色体的反估浓度平均值。
根据本发明的实施例,所述第二特征是通过下列公式确定的:
Figure PCTCN2019130625-appb-000003
其中
X2表示第二特征,
i表示所述待测染色体的编号,
Fi表示所述待测染色体的所述反估浓度,
Fa表示所述胎儿浓度。
根据本发明的实施例,在进行步骤(4)之前,所述第一特征和所述第二特征进行标准化处理,以便所述第一特征和所述第二特征的绝对值分别独立地处于0~1之间。
根据本发明的实施例,在步骤(4)中,所述阳性样本和所述阴性样本的数目比例不低于1:4。
根据本发明的实施例,在步骤(4)中,所述阳性样本和所述阴性样本的数目比例不超过4:1。
根据本发明的实施例,在步骤(4)中,所述阳性样本和所述阴性样本的数目比例为1:0.1~5。
根据本发明的实施例,在步骤(4)中,所述阳性样本和所述阴性样本的数目比例为1:0.25~4。
根据本发明的实施例,所述阳性样本和所述阴性样本针对所述待测染色体以外的其他染色体均不存在非整倍性。
根据本发明的实施例,在步骤(4)中,采用所述第一特征和所述第二特征确定所述孕妇样本和所述对照样本的二维特征向量,基于由所述二维特征向量确定的样本间距离,将所述孕妇样本在所述阳性对照样本和所述阴性对照样本之间进行归类,以便确定所述胎儿针对所述待测染色体是否存在非整倍性。
根据本发明的实施例,所述距离为欧几里得距离、曼哈顿距离或切比雪夫距离。
根据本发明的实施例,在步骤(4)中,进一步包括:(4-1)分别计算所述孕妇样本与所述对照样本之间的距离;(4-2)将所得到的所述距离进行排序,所述排序基于由小到大的顺序;(4-3)基于所述排序,从小到大选择预定数量的对照样本;(4-4)分别确定所述预定数量的所述对照样本中阳性样本和阴性样本的数目;(4-5)基于多数决策法,确定将所述孕妇样本的归类结果。
根据本发明的实施例,所述预定数量为不超过20。
根据本发明的实施例,所述预定数量为3~10。
根据本发明的实施例,在步骤(4-2)中,在进行所述排序之前,预先对所述待测样本与预定所述对照样本之间的距离进行加权处理。
在本发明的第二方面,本发明提供了一种确定胎儿是否存在染色体非整倍性的装置,其特征在于,包括:数据获取模块,用于获取来自孕妇样本的核酸测序数据,所述孕妇样本含有胎儿游离核酸,所述核酸测序数据由多个测序读段构成;胎儿浓度-反估浓度确定模块,用于基于所述核酸测序数据确定所述孕妇样本的胎儿浓度以及预定染色体的反估浓度,所述反估浓度是基于所述预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,所述预定染色体包括待测染色体和第二比较染色体,所述第一比较染色体包括至少一个不同于所述预定染色体的常染色体;特征确定模块,用于基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度的差异确定第一特征,基于所述待测染色体的反估浓度与所述胎儿浓度的差异确定第二特征;和非整倍性确定模块,用于基于所述第一特征和第二特征,利用对照样本的相应数据,确定所述孕妇的胎儿针对所述待测染色体是否存在非整倍性,其中,所述对照样本包括阳性样本和阴性样本,所述阳性样本针对所述待测染色体具有非整倍性,所述阴性样本针对所述待测染色体不具有非整倍性。利用根据本发明的实施例的确定胎儿是否存在染色体非整倍性的装置,能够有效地实施前面所描 述的确定胎儿是否存在染色体非整倍性的方法,从而能够有效地确定胎儿针对待测染色体是否存在非整倍性。另外,根据本发明的实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。
根据本发明的实施例,上述装置还可以具有下列附加技术特征:
根据本发明的实施例,所述胎儿浓度-反估浓度确定模块包括:比对单元,用于将来自所述孕妇样本的所述核酸测序数据与参照序列比对,以便确定落入预定窗口的测序读段的数目;和胎儿浓度计算单元,用于基于所述落入预定窗口的测序读段的数目,确定所述孕妇样本的胎儿浓度。
根据本发明的实施例,所述胎儿浓度-反估浓度确定模块包括:反估浓度计算单元,用于按照下列公式确定所述反估浓度:
Fj=2*|Rj-Rr|/(Rr)
其中
j表示需要确定所述反估浓度的染色体的编号,
Fj表示第j号染色体的反估浓度,
Rr表示所述多条常染色体的平均测序读段数目,和
Rj表示第j号染色体的测序读段数目。
根据本发明的实施例,所述胎儿浓度-反估浓度确定模块包括:第二比较染色体确定单元用于将多条常染色体的所述反估浓度按照由小至大的优先顺序,选择目标排序的常染色体作为所述第二比较染色体。
根据本发明的实施例,所述特征确定模块包括:
第一特征确定单元,用于通过下列公式确定所述第一特征:
X1=Fi-Fr
其中
X1表示第一特征,
i表示所述待测染色体的编号,
Fi表示所述待测染色体的所述反估浓度,
Fr表示所述第二比较染色体的反估浓度平均值。
根据本发明的实施例,所述特征确定模块包括:第二特征确定单元,用于通过下列公式确定所述第二特征:
Figure PCTCN2019130625-appb-000004
其中
X2表示第二特征,
i表示所述待测染色体的编号,
Fi表示所述待测染色体的所述反估浓度,
Fa表示所述胎儿浓度。
根据本发明的实施例,所述特征确定模块包括:标准化处理单元,用于对所述第一特征和所述第二特征进行标准化处理,以便所述第一特征和所述第二特征的绝对值分别独立地处于0~1之间。
根据本发明的实施例,所述非整倍性确定模块用于采用所述第一特征和所述第二特征确定所述孕妇样本和所述对照样本的二维特征向量,基于由所述二维特征向量确定的样本间距离,将所述孕妇样本在所述阳性对照样本和所述阴性对照样本之间进行归类,以便确定所述胎儿针对所述待测染色体是否存在非整倍性。
根据本发明的实施例,所述距离为欧几里得距离、曼哈顿距离或切比雪夫距离。
根据本发明的实施例,所述非整倍性确定模块用于采用k-近邻模型确定将所述孕妇样本的归类结果。
根据本发明的实施例,所述k-近邻模型采用的K值为不超过20。
根据本发明的实施例,所述k-近邻模型采用的K值为3~10。
根据本发明的实施例,所述k-近邻模型中,对所述样本间距离进行加权处理。
在本发明的第三方面,本发明提出了一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现前面所述确定胎儿是否存在染色体非整倍性的方法的步骤。由此,能够有效地实施前面所描述的确定胎儿是否存在染色体非整倍性的方法,从而能够有效地确定胎儿针对待测染色体是否存在非整倍性。另外,根据本发明的实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。
在本发明的第四方面,本发明提出了一种电子设备,其包括:前面所述的计算机可读存储介质;以及一个或者多个处理器,用于执行所述计算机可读存储介质中的程序。由此,能够有效地实施前面所描述的确定胎儿是否存在染色体非整倍性的方法,从而能够有效地确定胎儿针对待测染色体是否存在非整倍性。另外,根据本发明的实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。
在本发明的第五方面,本发明提出了一种构建机器学习分类模型的方法,根据本发明的实施例,该方法包括:(a)针对多个孕妇样本的每一个分别进行:获取来自所述孕妇样本的核酸测序数据,所述孕妇样本含有胎儿游离核酸,所述核酸测序数据由多个测序读段构成,所述孕妇样本包括至少一个阳性样本和至少一个阴性样本,所述阳性样本针对待测染色体具有非整倍性,所述阴性样本针对所述待测染色体不具有非整倍性;基于所述核酸测序数据确定所述孕妇样本的胎儿浓度以及预定染色体的反估浓度,所述反估浓度是基于 所述预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,所述预定染色体包括待测染色体和第二比较染色体,所述第一比较染色体包括至少一个不同于所述预定染色体的常染色体;和基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度的差异确定第一特征,基于所述待测染色体的反估浓度与所述胎儿浓度的差异确定第二特征,(b)将所述多个孕妇样本作为样本,利用所述样本的第一特征和第二特征,进行机器学习训练,以便构建用于确定胎儿是否具有非整倍性的器学习分类模型。利用该方法,根据本发明的实施例,能够有效地构建机器学习的分类模型,从而进一步可以利用该分类模型对未知的样本进行识别和归类,以确定针对特定的染色体是否存在染色体非整倍性。
根据本发明的实施例,所述机器学习分类模型为KNN模型。
根据本发明的实施例,所述KNN模型采用欧几里得距离。
在本发明的第六方面,本发明提供了一种构建机器学习分类模型的装置,其包括:特征获取模块,用于针对多个孕妇样本的每一个分别进行:获取来自所述孕妇样本的核酸测序数据,所述孕妇样本含有胎儿游离核酸,所述核酸测序数据由多个测序读段构成,所述孕妇样本包括至少一个阳性样本和至少一个阴性样本,所述阳性样本针对待测染色体具有非整倍性,所述阴性样本针对所述待测染色体不具有非整倍性;基于所述核酸测序数据确定所述孕妇样本的胎儿浓度以及预定染色体的反估浓度,所述反估浓度是基于所述预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,所述预定染色体包括待测染色体和第二比较染色体,所述第一比较染色体包括至少一个不同于所述预定染色体的常染色体;和基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度的差异确定第一特征,基于所述待测染色体的反估浓度与所述胎儿浓度的差异确定第二特征,训练模块,用于将所述多个孕妇样本作为样本,进行机器学习训练,以便构建用于确定胎儿是否具有非整倍性的器学习分类模型。利用该装置能够有效地实施前面所述的构建机器学习分类模型的方法,从而能够有效地构建机器学习的分类模型,从而进一步可以利用该分类模型对未知的样本进行识别和归类,以确定针对特定的染色体是否存在染色体非整倍性。
根据本发明的实施例,所述机器学习分类模型为KNN模型。
在本发明的第七方面,本发明提出了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现权利要求前面所述用于构建机器学习分类方法的步骤。由此,可以有效地实施前面所述的构建机器学习分类模型的方法,从而能够有效地构建机器学习的分类模型,从而进一步可以利用该分类模型对未知的样本进行识别和归类,以确定针对特定的染色体是否存在染色体非整倍性。
本发明的附加方面和优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本发明的实践了解到。
附图说明
本发明的上述和/或附加的方面和优点从结合下面附图对实施例的描述中将变得明显和容易理解,其中:
图1显示了根据本发明一个实施例的确定胎儿是否存在染色体非整倍性的方法的流程示意图;
图2显示了根据本发明一个实施例的确定胎儿浓度的方法的流程示意图;
图3显示了根据本发明一个实施例的对孕妇样本进行归类的方法的流程示意图;
图4显示了根据本发明一个实施例的确定胎儿是否存在染色体非整倍性的装置的框图;
图5显示了根据本发明一个实施例的胎儿浓度-反估浓度确定模块的框图;
图6显示了根据本发明一个实施例的特征确定模块的框图;
图7显示了根据本发明一个实施例的构建机器学习分类模型的框图;
图8和9显示了根据本发明一个实施例利用KNN模型对T21检测时参数k对应的ROC曲线;
图10和11显示了根据本发明一个实施例利用KNN模型对T18检测时参数k对应的ROC曲线;和
图12和13显示了根据本发明一个实施例利用KNN模型对T13检测时参数k对应的ROC曲线。
具体实施方式
下面详细描述本发明的实施例。下面描述的实施例是示例性的,仅用于解释本发明,而不能理解为对本发明的限制。需要说明的是,本申请可用于众多通用或专用的计算装置环境或配置中。例如:个人计算机、服务器计算机、手持设备或便携式设备、平板型设备、多处理器装置、包括以上任何装置或设备的分布式计算环境等等。本申请可以在由计算机执行的计算机可执行指令的一般上下文中描述,例如程序模块。一般地,程序模块包括执行特定任务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本申请,在这些分布式计算环境中,由通过通信网络而被连接的远程处理设备来执行任务。在分布式计算环境中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。
根据本发明的一个方面,本发明提供了一种确定胎儿是否存在染色体非整倍性的方法。下面通过参考图1~3,对根据本发明实施例的确定胎儿是否存在染色体非整倍性的方法进行详细描述。
参考图1,根据本发明的实施例,该确定胎儿是否存在染色体非整倍性的方法包括:
S100:获取来自孕妇样本的核酸测序数据
根据本发明的实施例,在该步骤中,首先获取来自孕妇样本的核酸测序数据,该孕妇 样本含有胎儿游离核酸,例如根据本发明的实施例,可以采用的孕妇样本包括但不限于孕妇外周血。如前所述,Dennis Lo等人在母体血浆和血清中发现有非细胞的游离胎儿DNA,为无创产前诊断(NIPT)提供了新思路。通过采用孕妇外周血,不会对对孕妇造成创伤,避免了由于取样而造成的流产风险。根据本发明的实施例,在获取孕妇样本,例如孕妇外周血,可以对这些样本进行核酸测序,以便获得该孕妇样本的核酸测序数据,通常,该核酸测序数据是由多个或者大量测序读段(read)构成的。根据本发明的实施例,对孕妇样本的核酸分子进行测序的方法并不受特别限制,具体的,可以采用本领域技术人员已知的任何测序方法,例如包括但不限于通过双末端测序、单末端测序或者单分子测序对孕妇样本的核酸分子进行测序。
本领域技术人员能够理解的是,在获得核酸测序数据之后,可以根据质控标准,对所得到的由大量测序读段构成的测序数据进行过滤和筛选处理,除去存在测序质量问题的测序读段,从而可以提高后续数据分析的准确性。
S200确定反估浓度和胎儿浓度
获取来自孕妇样本的核酸测序数据之后,通过对核酸测序数据的测序读段数目进行分析,可以确定该孕妇样本的胎儿浓度以及特定染色体的反估浓度。
根据本发明的实施例,胎儿浓度是指孕妇样本,例如外周血中的游离核酸中,来自胎儿的游离核酸的数目占总游离核酸数目的比例。通常,该胎儿浓度的数值会随着孕周的增加而提高,例如,在第12孕周左右时,胎儿游离核酸(有时直接称为“胎儿游离DNA”)占总游离核酸的比例(即“胎儿浓度”)可以达到10~14%,在第20孕周之后,这个比例可以达到20%以上。当胎儿存在异常状况,例如存在染色体非整倍性时,胎儿浓度会出现异常。由此,胎儿浓度可以作为表征孕妇样本的一个重要指标。
本领域技术人员能够通过各种已知的方法获取孕妇样本中的胎儿浓度数据。例如,根据本发明的一个实施例,可以采用包括但不限于Y染色体估算法、基于SNP的胎儿特异SNP位点法、基于核小体印迹法等方法。然而,本发明的发明人发现,这些方法均有其局限性,例如Y染色体估算法不适用于女性胎儿,基于SNP的胎儿特异SNP位点法需要获取父亲的DNA样本(有时这些样本是比较难以获得的),基于核小体印迹法的准确性差同时在构建模型时需要进行深度测序。
参考图2,根据本发明的实施例,可以通过下列步骤确定核酸样本中的胎儿浓度,具体的,包括:
S210:将来自孕妇样本的核酸测序数据与参照序列比对,以便确定落入预定窗口的测序读段的数目;和
S220:基于所述落入预定窗口的测序读段的数目,确定所述孕妇样本的胎儿浓度。
该确定胎儿浓度的方法是基于特定窗口(即一定长度的核酸序列)中的测序读段数目,是与胎儿浓度呈正相关的。因此,通过确定至少一个预定窗口的测序读段的数目,可以反推获得孕妇样本的胎儿浓度,例如加权平均的方式。该预定窗口可以通过统计学的手段或 者机器学习的手段进行确定。根据本发明的实施例,预定窗口是通过对参考基因组序列的特定染色体进行连续划分而获得的,进一步利用各预定窗口的权重,确定胎儿浓度。根据本发明的一些具体示例,各预定窗口的权重是通过利用训练样品预先确定的。由此,结果准确可靠,可重复性好。
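According to embodiments of the present invention, the weights are determined using at least one of a ridge regression statistical model and a neural network model. According to some embodiments, the neural network model uses the TensorFlow learning system. According to some specific examples, the parameters of the TensorFlow learning system include: the numbers of sequencing reads in the windows of the autosomes as the input layer; the fetal concentration as the output layer; ReLU as the neuron type; and at least one optimization algorithm selected from Adam, SGD and Ftrl, preferably Ftrl. Preferably, the parameters further include: a learning rate of 0.002; one hidden layer; and 200 neurons in the hidden layer. The results are thus accurate and reliable. It should be noted that the term "weight" used herein is a relative concept defined with respect to a given indicator: the weight of an indicator is its relative importance in the overall evaluation. For example, the "weight of a predetermined window" is the relative importance of that window among all predetermined windows, and a "connection weight" is the relative importance of one connection between two different layers among all such connections.
This method for determining the fetal concentration is described in detail in PCT/CN2018/072045 (title of invention: Method and device for determining the proportion of cell-free nucleic acids of predetermined origin in a biological sample), which is incorporated herein by reference in its entirety and is not repeated here. The method yields fetal concentration data simply, quickly and accurately, and the data so obtained can be applied effectively in the method of the present invention for determining whether a fetus has chromosomal aneuploidy.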
另外,获取来自孕妇样本的核酸测序数据之后,不仅可以确定胎儿浓度,还可以进一步确定预定染色体的反估浓度。
在本文中所使用的术语“反估浓度”是指表征特定染色体的DNA含量与正常染色体的DNA含量之间差异的量度,具体的,可以用特定染色体的测序读段数目与正常染色体的测序读段数目的差异来进行表示。例如,理想状态下,对于存在三体的染色体,其反估浓度为表征多余的一条染色体的DNA含量的量,对于正常染色体,则因为没有多余出来的染色体,所以其反估浓度为0.
因本文主要集中于染色体非整倍性的分析,因此,在本文中所使用的术语“正常染色体”是指不存在染色体非整倍性的染色体,而不意味着该染色体不存在其他的异常状况。
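In addition, the expression "number of sequencing reads of ..." is used repeatedly herein, for example "number of sequencing reads of a normal chromosome", "number of sequencing reads of a specific chromosome" and "number of sequencing reads falling into a predetermined window". It means the number of sequencing reads that can be matched to the region in question: for example, when the sequencing results are aligned to a reference sequence such as hg19 using conventional software such as SOAP, the reads that align to a given region are counted as the reads of that region. According to embodiments of the present invention, only "uniquely aligned sequencing reads", i.e. reads that align to exactly one position of the reference sequence, may be counted as falling into a given region. Further, because sequencing may be biased by factors affecting the sequencing instrument, for example GC content, the resulting read counts can be corrected, for example for GC content. Specifically, according to embodiments of the present invention, the step of determining the corrected number of sequencing reads includes:
A reference sequence such as the human genome (GRCh37) is divided into multiple windows, and the high-throughput sequencing reads are aligned to the human reference genome (GRCh37) using bwa (0.7.7-r441). The number of reads aligned within each window of each chromosome is counted; the number of reads in the i-th window is denoted URi, and the GC content of the reference genome in the i-th window is denoted GCi. The read counts and GC contents of the windows are fitted, and the original per-window read counts are corrected based on the fitted coefficients; the number of effective sequences in the i-th window after GC correction is denoted URAi.
Thus, selecting uniquely aligned sequencing reads and applying GC content correction effectively improves the accuracy and precision of the sequencing data analysis.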
如前所述,“反估浓度”是指表征特定染色体的DNA含量与正常染色体的DNA含量之间差异的量度,因此,该反估浓度可以作为表征孕妇样本的一个重要指标。根据本发明的实施例,反估浓度是基于预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的。
这里所使用的术语“预定染色体”包括待测染色体,即需要确定其是否存在非整倍性的染色体,另外,预定染色体还包括第二比较染色体,根据本发明的实施例,第二比较染色体包括至少一条常染色体。需要说明的是,反估浓度是针对预定染色体的每一条分别进行计算的,因此针对待测染色体和第二比较染色体的每一条,会分别得到与该染色体对应的反估浓度。另外,需要说明的是,对于第一比较染色体和第二比较染色体与待测染色体均来源于相同的样本,而不是采用其他样本的数据进行分析。
根据本发明的实施例,第二比较染色体包含至少10条常染色体。根据本发明的实施例,第二比较染色体包含15条常染色体。另外,如前所述,反估浓度可以作为表征染色体是否存在异常的一个指标,因此,可以通过借助反估浓度来进行第二比较染色体的选择。根据本发明的实施例,进一步包括:确定多条常染色体的所述反估浓度;和按照由小至大的优先顺序,选择目标排序的常染色体作为所述第二比较染色体。根据前面所描述的,反估浓度越小说明该染色体作为正常染色体的概率越高。例如,通过将所有常染色体按照反估浓度(可以采用绝对反估浓度的绝对值)由小至大进行排序,然后选择反估浓度比较小的排位前15的常染色体做第二比较染色体。由此,可以在不确定染色体非整倍性状态的前提下,选择合适的常染色体作为第二比较染色体。当然,本领域技术人员能够理解,在实践中可以通过经验确定其染色体数目是否存在异常的情形,例如通过统计分析发现某些染色体几乎不存在非整倍性,由此,可以将这些染色体作为第二比较染色体。
另外,关于第一比较染色体,如前所述,反估浓度是希望表征特征染色体与正常染色体之间的差异,因此,根据本发明的实施例,第一比较染色体包括至少一个不同于所述预定染色体的常染色体。需要说明的是,这里所说的第一比较染色体和第二比较染色体可能是由交叉的,具体的,在进行反估浓度计算式,会在预定染色体中选择一个特定的染色体,由此,其余的染色体尽管有可能被“第二比较染色体”的含义所覆盖,但仍属于“不同于预定染色体的常染色体”的概念范围。例如,选定第23号染色体作为待测染色体,第2~5 号染色体作为第二比较染色体,则当计算第23号染色体的反估浓度时,第2~5号染色体仍然可以作为第一比较染色体。另外,根据本发明的实施例,第一比较染色体可以包括多条常染色体,在计算反估浓度时,选择其平均测序读段数目即可。这样,可以进一步提高测序数据分析的效率和准确性。根据本发明的实施例,第一比较染色体的测序读段数目为多条常染色体的平均测序读段数目,该多条常染色体包括至少一个已知不具有非整倍性的常染色体。根据本发明的实施例,第一比较染色体的测序读段数目为至少15条常染色体的平均测序读段数目,可选的,第一比较染色体的测序读段数目为至少20条常染色体的平均测序读段数目,可选的,第一比较染色体的测序读段数目为全部常染色体的平均测序读段数目。这样,通过选择多条染色体的平均测序读段数目,可以消除各染色体之间的差异。
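According to embodiments of the present invention, the reverse-estimated concentration is determined according to the following formula:
Fj=2*|Rj-Rr|/(Rr)
where
j denotes the number of the chromosome whose reverse-estimated concentration is to be determined,
Fj denotes the reverse-estimated concentration of chromosome j,
Rr denotes the average number of sequencing reads of the multiple autosomes, and
Rj denotes the number of sequencing reads of chromosome j.
The inventors found that the reverse-estimated concentration calculated with this formula can be used effectively in the subsequent machine learning classification model.
As noted above, the fetal concentration and the reverse-estimated concentration determined in this step are both affected, to different degrees, by chromosomal aneuploidy, so both parameters can subsequently be used for aneuploidy detection.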
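The formula Fj=2*|Rj-Rr|/(Rr) can be illustrated with a minimal Python sketch. The function name, the dictionary layout and the example counts are assumptions made for illustration only; the counts are assumed to be already comparable between chromosomes (for example GC-corrected per-window averages), which the text does not spell out. Under the idealized assumption that a trisomic chromosome yields Rr*(1+f/2) reads at fetal concentration f, the formula returns f.

```python
def reverse_estimated_concentration(read_counts, reference_chroms=None):
    """Return {chromosome: Fj} with Fj = 2 * |Rj - Rr| / Rr.

    read_counts: {chromosome name: comparable read count Rj} (hypothetical layout).
    reference_chroms: chromosomes averaged to obtain Rr; defaults to all keys.
    """
    if reference_chroms is None:
        reference_chroms = list(read_counts)
    # Rr: average read count over the first comparison chromosomes
    rr = sum(read_counts[c] for c in reference_chroms) / len(reference_chroms)
    return {c: 2 * abs(rj - rr) / rr for c, rj in read_counts.items()}


# Illustrative numbers only, not data from the application
counts = {"chr1": 1000.0, "chr2": 1000.0, "chr21": 1060.0, "chr22": 940.0}
f = reverse_estimated_concentration(counts)
```

Because of the absolute value, a chromosome with too few reads receives the same score as one with an equal excess, so the value alone does not distinguish a gain from a loss.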
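S300: Determining the first feature and the second feature
After the fetal concentration and the reverse-estimated concentrations have been determined, these parameters can be used as feature values of the sample for further analysis by machine learning.
Specifically, according to embodiments of the present invention, the first feature is determined from the difference between the reverse-estimated concentration of the chromosome to be tested and that of the second comparison chromosome, and the second feature is determined from the difference between the previously determined reverse-estimated concentration of the chromosome to be tested and the fetal concentration. Both features can therefore be regarded as features affected by aneuploidy and can be used effectively in the subsequent analysis. According to embodiments of the present invention, a person skilled in the art may use various calculations to express these differences, for example the arithmetic difference or the ratio of the values.
As noted above, the reverse-estimated concentration of the second comparison chromosome is preferably the average over multiple autosomes, which further improves the efficiency and accuracy of the analysis.
In addition, according to embodiments of the present invention, the first feature is determined by the following formula:
X1=Fi-Fr
where
X1 denotes the first feature,
i denotes the number of the chromosome to be tested,
Fi denotes the reverse-estimated concentration of the chromosome to be tested, and
Fr denotes the average reverse-estimated concentration of the second comparison chromosomes.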
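As a hedged illustration of X1=Fi-Fr, the sketch below also picks the second comparison chromosomes by sorting the reverse-estimated concentrations in ascending order, as the description suggests. The function name, the `n_compare` parameter and the example values are assumptions; the description uses 15 comparison chromosomes, while the example uses 2 to stay small.

```python
def first_feature(f, target, n_compare=15):
    """Return (X1, second comparison chromosomes) for the target chromosome.

    f: {chromosome: reverse-estimated concentration} (hypothetical layout).
    The n_compare chromosomes with the smallest concentrations, excluding
    the target, are used as the second comparison chromosomes.
    """
    candidates = sorted((c for c in f if c != target), key=lambda c: f[c])
    second = candidates[:n_compare]
    fr = sum(f[c] for c in second) / len(second)  # Fr: their average
    return f[target] - fr, second


# Illustrative values only
f_values = {"chr1": 0.01, "chr2": 0.03, "chr3": 0.02, "chr21": 0.12}
x1, chosen = first_feature(f_values, "chr21", n_compare=2)
```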
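According to embodiments of the present invention, the second feature is determined by the following formula:
Figure PCTCN2019130625-appb-000005
where
X2 denotes the second feature,
i denotes the number of the chromosome to be tested,
Fi denotes the reverse-estimated concentration of the chromosome to be tested, and
Fa denotes the fetal concentration.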
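According to embodiments of the present invention, the first and second features obtained in this way each reflect the difference they are based on, and their values are of the same order of magnitude, which prevents a single parameter from dominating the analysis. If the features are chosen poorly, the subsequent analysis may be biased. For example, the KNN model computes the distance between samples from their features (for example, the features of sample x1 are
Figure PCTCN2019130625-appb-000006
the features of sample x2 are
Figure PCTCN2019130625-appb-000007
and the distance between samples x1 and x2 is
Figure PCTCN2019130625-appb-000008
); if the feature values of the two samples differ greatly in scale, for example if the distance is
Figure PCTCN2019130625-appb-000009
then, although the two feature dimensions are equally important, the second dimension clearly has a much larger influence on the distance.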
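The scale effect described above can be reproduced with a short Python sketch. The three distance functions correspond to the Euclidean, Manhattan and Chebyshev distances named elsewhere in this application; the two sample vectors are invented for illustration and are not the values shown in the original figures.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def chebyshev(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


# Hypothetical samples: the second feature is on a much larger scale
x1 = (0.10, 200.0)
x2 = (0.90, 230.0)
d_euclid = euclidean(x1, x2)
d_manhattan = manhattan(x1, x2)
d_chebyshev = chebyshev(x1, x2)
```

In this example the first feature differs by 0.8 and the second by 30, so all three distances are almost entirely determined by the second feature.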
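To eliminate this effect, according to embodiments of the present invention, the first and second features are normalized before the subsequent steps, so that the absolute values of the first feature and of the second feature each independently lie between 0 and 1. The normalization method is not particularly limited. Specifically, for a batch of data of the same dimension (all first features or all second features), the following formula can be applied:
newValue=(oldValue-min)/(max-min)
where min and max are the minimum and maximum of the batch, oldValue is the value before processing, and newValue is the normalized value.
This prevents any single feature from unduly influencing the final result and improves the accuracy of the analysis.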
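A minimal sketch of this min-max normalization follows. The function name is an assumption, and the handling of a constant batch (max equal to min) is a choice made here to avoid division by zero; the text does not address that case.

```python
def min_max_scale(values):
    """Apply newValue = (oldValue - min) / (max - min) to one feature column."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 6.0])
```

When new samples are classified later, one plausible choice is to reuse the min and max of the control samples rather than recomputing them, so that old and new samples share the same scale.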
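S400: Determining aneuploidy based on the first feature and the second feature
As noted above, the values of both the first and the second feature are affected by aneuploidy. Therefore, once the two features have been obtained, the corresponding data of control samples are used to determine whether the fetus has aneuploidy of the chromosome to be tested. Specifically, the control samples include positive and negative samples: positive samples have aneuploidy of the chromosome to be tested, and negative samples do not.
Using the first and second features as classification features, the sample to be tested is classified between the positive and negative samples with respect to the chromosome to be tested, which determines whether that chromosome is aneuploid. In the course of their research, the inventors found that keeping the numbers of positive and negative samples within a certain ratio further improves the accuracy of the analysis. For example, according to embodiments of the present invention, the ratio of positive to negative samples is not lower than 1:4; according to embodiments, it does not exceed 4:1; according to embodiments, it is 1:0.1~5; and according to embodiments, it is 1:0.25~4. The inventors found that these ratios avoid biasing the model: with too many positive samples the results lean positive, i.e. the false positive rate is high; conversely, with too many negative samples the results lean negative, i.e. the false negative rate is high.
According to embodiments of the present invention, neither the positive nor the negative samples have aneuploidy of any chromosome other than the chromosome to be tested, which further improves the value of the control samples as a classification reference.
According to embodiments of the present invention, the method of classification using the first and second features is not particularly limited; various machine learning methods can be used, such as neural networks or SVM. In the course of in-depth research, the inventors found that neural networks require a very large training set, while SVM may need additional parameters to classify accurately. According to embodiments of the present invention, the first and second features can be used to determine two-dimensional feature vectors for the pregnant-woman sample and the control samples, and the pregnant-woman sample is classified between the positive and negative control samples based on the inter-sample distances determined from these vectors, so as to determine whether the fetus has aneuploidy of the chromosome to be tested. According to embodiments of the present invention, the distances that can be used include, but are not limited to, the Euclidean, Manhattan and Chebyshev distances.
Specifically, according to embodiments of the present invention, a K-nearest neighbor (KNN) model can be used for the classification. For ease of understanding, the KNN procedure is briefly described below with reference to Figure 3: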
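According to embodiments of the present invention, the classification includes the following steps:
S410: computing the distance between the pregnant-woman sample and each control sample;
S420: sorting the distances in ascending order;
S430: based on the sorted order, selecting a predetermined number of control samples, starting from the smallest distance (this predetermined number is the K value of the KNN model);
S440: counting the positive and negative samples among the selected control samples;
S450: determining the classification of the pregnant-woman sample by majority vote.
According to embodiments of the present invention, the predetermined number does not exceed 20; according to embodiments, it is 3 to 10. For convenience, K can be odd, which avoids ties in which no decision can be made. A person skilled in the art will understand that the final K may differ between chromosomes to be tested; for example, according to one embodiment, k = 7 is chosen for T13 and T18 detection and k = 9 for T21 detection.
In addition, according to embodiments of the present invention, the distances between the sample to be tested and the control samples may be weighted before sorting, which further improves the accuracy of the test.
A person skilled in the art will understand that both the weighting coefficients and the K value of the KNN model can be obtained by machine learning, using known samples as a training set.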
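Steps S410 to S450 can be sketched in a few lines of Python. This is an unweighted illustration using the Euclidean distance; the function name, the data layout and the control samples are hypothetical, and the features are assumed to be normalized already.

```python
import math
from collections import Counter


def knn_classify(query, controls, k=7):
    """Classify a 2-D feature vector by majority vote among k nearest controls.

    controls: list of ((x1, x2), label) pairs with label +1 (positive)
    or -1 (negative).
    """
    # S410 and S420: compute distances and sort in ascending order
    ranked = sorted(controls, key=lambda item: math.dist(query, item[0]))
    # S430 and S440: keep the k nearest controls and count their labels
    votes = Counter(label for _, label in ranked[:k])
    # S450: majority vote (an odd k avoids ties between two classes)
    return votes.most_common(1)[0][0]


# Hypothetical control samples
controls = [
    ((0.10, 0.12), -1), ((0.15, 0.08), -1), ((0.05, 0.10), -1),
    ((0.80, 0.85), +1), ((0.90, 0.75), +1), ((0.85, 0.80), +1),
]
label_a = knn_classify((0.82, 0.78), controls, k=3)
label_b = knn_classify((0.12, 0.10), controls, k=3)
```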
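Specifically, according to embodiments of the present invention, this can be done by the following steps:
A. Sample set selection
Samples with follow-up results are selected as the sample set and divided into a training set, a validation set and a test set at a ratio of 6:2:2.
B. Model training
Model input: the value of k; the training data set T={(x1,y1),(x2,y2),...,(xN,yN)}, where xi∈R^n is the n-dimensional feature vector of a sample, yi∈{+1,-1}, i=1,2,...,N is the negative/positive label of the sample (-1 for negative, +1 for positive), and N is the size of the sample set.
Model output: the class y to which sample x belongs.
C. Model validation
k is initialized to 1 and adjusted on the validation set (for example by cross-validation or grid search) until the model predicts with good accuracy.
D. Model prediction
The trained model is used to predict the test set in order to evaluate its predictive performance.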
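Steps A to D can be sketched as follows. This is a simplified illustration rather than the procedure actually used: validation accuracy stands in for the selection criterion, the search is a plain scan over k = 1 to 20, and the synthetic samples, function names and random seed are all assumptions.

```python
import math
import random
from collections import Counter


def split_6_2_2(samples, seed=0):
    """Shuffle and split into training, validation and test sets (6:2:2)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a, b = n * 6 // 10, n * 8 // 10
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def knn_predict(x, train, k):
    nearest = sorted(train, key=lambda s: math.dist(x, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(data, train, k):
    hits = sum(knn_predict(x, train, k) == y for x, y in data)
    return hits / len(data)


def select_k(train, valid, k_values=range(1, 21)):
    # max() keeps the first (smallest) k among equally accurate candidates
    return max(k_values, key=lambda k: accuracy(valid, train, k))


# Synthetic, well-separated samples for illustration only
samples = ([((0.10 + 0.001 * i, 0.10), -1) for i in range(50)]
           + [((0.90 - 0.001 * i, 0.90), +1) for i in range(50)])
train, valid, test = split_6_2_2(samples)
best_k = select_k(train, valid)
test_accuracy = accuracy(test, train, best_k)
```

KNN has no fitted parameters beyond the stored training samples, so "training" here amounts to keeping the training set and choosing k.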
由此,通过该方法能够有效地确定胎儿针对待测染色体是否具有非整倍性,另外,根据本发明的实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。
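In the second aspect of the present invention, corresponding to the above method, embodiments of the present application further provide a device for implementing the method. Specifically, the present invention provides a device for determining whether a fetus has chromosomal aneuploidy. Referring to Figure 4, the device includes: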
数据获取模块100,用于获取来自孕妇样本的核酸测序数据,孕妇样本含有胎儿游离核酸,核酸测序数据由多个测序读段构成;
胎儿浓度-反估浓度确定模块200,用于基于核酸测序数据确定孕妇样本的胎儿浓度以及预定染色体的反估浓度,反估浓度是基于预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,预定染色体包括待测染色体和第二比较染色体,第一比较染色体包括至少一个不同于预定染色体的常染色体;
特征确定模块300,用于基于待测染色体的反估浓度与第二比较染色体的反估浓度的差异确定第一特征,基于待测染色体的反估浓度与胎儿浓度的差异确定第二特征;和
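an aneuploidy determination module 400, configured to determine, based on the first and second features and using the corresponding data of control samples, whether the fetus of the pregnant woman has aneuploidy of the chromosome to be tested, wherein the control samples include positive samples and negative samples, the positive samples having aneuploidy of the chromosome to be tested and the negative samples not having aneuploidy of the chromosome to be tested.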
利用根据本发明的实施例的确定胎儿是否存在染色体非整倍性的装置,能够有效地实施前面所描述的确定胎儿是否存在染色体非整倍性的方法,从而能够有效地确定胎儿针对待测染色体是否存在非整倍性。另外,根据本发明的实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。
参考图5,根据本发明的实施例,胎儿浓度-反估浓度确定模块200包括:
比对单元210,用于将来自孕妇样本的所述核酸测序数据与参照序列比对,以便确定落入预定窗口的测序读段的数目;和
胎儿浓度计算单元220,用于基于落入预定窗口的测序读段的数目,确定孕妇样本的胎儿浓度。
根据本发明的实施例,胎儿浓度-反估浓度确定模块200还包括:
反估浓度计算单元230,用于按照下列公式确定所述反估浓度:
Fj=2*|Rj-Rr|/(Rr)
其中
j表示需要确定所述反估浓度的染色体的编号,
Fj表示第j号染色体的反估浓度,
Rr表示所述多条常染色体的平均测序读段数目,和
Rj表示第j号染色体的测序读段数目。
根据本发明的实施例,胎儿浓度-反估浓度确定模块200包括:
第二比较染色体确定单元240,用于将多条常染色体的所述反估浓度按照由小至大的优先顺序,选择目标排序的常染色体作为所述第二比较染色体。
根据本发明的实施例,特征确定模块300包括:
第一特征确定单元310,用于通过下列公式确定第一特征:
X1=Fi-Fr
其中
X1表示第一特征,
i表示所述待测染色体的编号,
Fi表示所述待测染色体的所述反估浓度,
Fr表示所述第二比较染色体的反估浓度平均值。
根据本发明的实施例,特征确定模块300还包括:
第二特征确定单元320,用于通过下列公式确定第二特征:
Figure PCTCN2019130625-appb-000010
其中
X2表示第二特征,
i表示所述待测染色体的编号,
Fi表示所述待测染色体的所述反估浓度,
Fa表示所述胎儿浓度。
根据本发明的实施例,特征确定模块300还包括:
标准化处理单元330,用于对所述第一特征和所述第二特征进行标准化处理,以便所述第一特征和所述第二特征的绝对值分别独立地处于0~1之间。
根据本发明的实施例,非整倍性确定模块400用于采用所述第一特征和所述第二特征确定所述孕妇样本和所述对照样本的二维特征向量,基于由所述二维特征向量确定的样本间距离,将所述孕妇样本在所述阳性对照样本和所述阴性对照样本之间进行归类,以便确定所述胎儿针对所述待测染色体是否存在非整倍性。
根据本发明的实施例,所述距离为欧几里得距离、曼哈顿距离或切比雪夫距离。
根据本发明的实施例,所述非整倍性确定模块用于采用k-近邻模型确定将所述孕妇样本的归类结果。
根据本发明的实施例,所述k-近邻模型采用的K值为不超过20。
根据本发明的实施例,所述k-近邻模型采用的K值为3~10。
根据本发明的实施例,所述k-近邻模型中,对所述样本间距离进行加权处理。
需要说明的是,前面针对确定胎儿是否存在染色体非整倍性的方法所描述的特征和优点均适用于该确定胎儿是否存在染色体非整倍性的装置,在此不再赘述。
在本发明的第三方面,本发明提出了一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现前面所述确定胎儿是否存在染色体非整倍性的方法的步骤。由此,能够有效地实施前面所描述的确定胎儿是否存在染色体非整倍性的方法,从而能够有效地确定胎儿针对待测染色体是否存在非整倍性。另外,根据本发明的 实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。
本领域技术人员能够理解的是,前面针对确定胎儿是否存在染色体非整倍性的方法所描述的特征和优点均适用于该计算机可读存储介质,在此不再赘述。
在本发明的第四方面,本发明提出了一种电子设备,其包括:前面所述的计算机可读存储介质;以及一个或者多个处理器,用于执行所述计算机可读存储介质中的程序。由此,能够有效地实施前面所描述的确定胎儿是否存在染色体非整倍性的方法,从而能够有效地确定胎儿针对待测染色体是否存在非整倍性。另外,根据本发明的实施例,在实施该方法的过程中,发现该方法替代了目前基于测序序列数目中的阈值设定策略,消除了检测灰区,同时还能够缩短样本检测周期,提高客户体验度,并且能够显著降低测序和检测成本。本领域技术人员能够理解的是,前面针对确定胎儿是否存在染色体非整倍性的方法所描述的特征和优点均适用于该电子设备,在此不再赘述。
在本发明的第五方面,本发明提出了一种构建机器学习分类模型的方法,根据本发明的实施例,该方法包括:
(a)针对多个孕妇样本的每一个分别进行:
获取来自孕妇样本的核酸测序数据,孕妇样本含有胎儿游离核酸,核酸测序数据由多个测序读段构成,孕妇样本包括至少一个阳性样本和至少一个阴性样本,阳性样本针对待测染色体具有非整倍性,阴性样本针对待测染色体不具有非整倍性;
基于核酸测序数据确定孕妇样本的胎儿浓度以及预定染色体的反估浓度,反估浓度是基于预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,预定染色体包括待测染色体和第二比较染色体,第一比较染色体包括至少一个不同于预定染色体的常染色体;和基于待测染色体的反估浓度与第二比较染色体的反估浓度的差异确定第一特征,基于待测染色体的反估浓度与胎儿浓度的差异确定第二特征,
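(b) using the plurality of pregnant-woman samples as samples and performing machine learning training with the first and second features of the samples, so as to construct a machine learning classification model for determining whether a fetus has aneuploidy.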
利用该方法,根据本发明的实施例,能够有效地构建机器学习的分类模型,从而进一步可以利用该分类模型对未知的样本进行识别和归类,以确定针对特定的染色体是否存在染色体非整倍性。根据本发明的实施例,机器学习分类模型为KNN模型。根据本发明的实施例,KNN模型采用欧几里得距离。
本领域技术人员能够理解的是,前面针对确定胎儿是否存在染色体非整倍性的方法所描述的特征和优点均适用于该构建模型的方法,在此不再赘述。
在本发明的第六方面,本发明提供了一种构建机器学习分类模型的装置。
参考图7,该装置包括:
特征获取模块800,用于针对多个孕妇样本的每一个分别进行:获取来自孕妇样本的核酸测序数据,孕妇样本含有胎儿游离核酸,核酸测序数据由多个测序读段构成,孕妇样本包括至少一个阳性样本和至少一个阴性样本,阳性样本针对待测染色体具有非整倍性,阴性样本针对待测染色体不具有非整倍性;基于核酸测序数据确定孕妇样本的胎儿浓度以及预定染色体的反估浓度,反估浓度是基于预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,预定染色体包括待测染色体和第二比较染色体,第一比较染色体包括至少一个不同于预定染色体的常染色体;和基于待测染色体的反估浓度与第二比较染色体的反估浓度的差异确定第一特征,基于待测染色体的反估浓度与胎儿浓度的差异确定第二特征;和
训练模块900,用于将多个孕妇样本作为样本,进行机器学习训练,以便构建用于确定胎儿是否具有非整倍性的器学习分类模型。利用该装置能够有效地实施前面的构建机器学习分类模型的方法,从而能够有效地构建机器学习的分类模型,从而进一步可以利用该分类模型对未知的样本进行识别和归类,以确定针对特定的染色体是否存在染色体非整倍性。
根据本发明的实施例,机器学习分类模型为KNN模型。
利用该装置,根据本发明的实施例,能够有效地构建机器学习的分类模型,从而进一步可以利用该分类模型对未知的样本进行识别和归类,以确定针对特定的染色体是否存在染色体非整倍性。根据本发明的实施例,机器学习分类模型为KNN模型。根据本发明的实施例,KNN模型采用欧几里得距离。
本领域技术人员能够理解的是,前面针对确定胎儿是否存在染色体非整倍性的方法所描述的特征和优点均适用于该构建模型的装置,在此不再赘述。
在本发明的第七方面,本发明提出了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现权利要求前面用于构建机器学习分类方法的步骤。由此,可以有效地实施前面的构建机器学习分类模型的方法,从而能够有效地构建机器学习的分类模型,从而进一步可以利用该分类模型对未知的样本进行识别和归类,以确定针对特定的染色体是否存在染色体非整倍性。本领域技术人员能够理解的是,前面针对确定胎儿是否存在染色体非整倍性的方法所描述的特征和优点均适用于该构建模型的计算机可读存储介质,在此不再赘述。
下面将结合实施例对本发明的方案进行解释。本领域技术人员将会理解,下面的实施例仅用于说明本发明,而不应视为限定本发明的范围。实施例中未注明具体技术或条件的,按照本领域内的文献所描述的技术或条件或者按照产品说明书进行。实施例中未注明具体条件者,按照常规条件或制造商建议的条件进行。所用试剂或仪器未注明生产厂商者,均为可以通过市场获得的常规产品。
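Example 1:
This example uses 3075 samples with follow-up results, sequenced on the BGISEQ-500 platform between 2017 and 2018 (male fetuses: 1716; female fetuses: 1359; negative samples: 2215; trisomy 21 (T21): 637; trisomy 18 (T18): 165; trisomy 13 (T13): 58), for model training and model prediction.
First, the reference genome (GRCh37) is divided into contiguous adjacent windows of fixed length (60K in this method), windows within N regions are filtered out, and the GC content of each window is computed to obtain the reference window file hg19.gc;
Next, the sequences (35 bp) obtained by SE sequencing on the CG platform are aligned (BWA V0.7.7-r441) to the reference genome (GRCh37);
Filtering and preliminary statistics: based on the alignment results, uniquely and perfectly aligned sequences are selected, and duplicate sequences and sequences with base mismatches are removed to obtain the effective sequences; the number of effective sequences and the GC content of each window are then counted according to the windows in the hg19.gc file;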
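The construction of the reference window file can be illustrated with a small Python sketch. It assumes that a window is discarded when it contains any N base, which is one reading of "windows within N regions"; the function name, the tuple layout and the toy sequence are hypothetical, and the real hg19.gc file format is not described in the text.

```python
def reference_windows(sequence, window_size):
    """Split a sequence into adjacent fixed-length windows with GC content.

    Returns (start, end, gc_fraction) tuples; windows containing 'N'
    are skipped (an assumption about how N regions are filtered).
    """
    windows = []
    for start in range(0, len(sequence) - window_size + 1, window_size):
        chunk = sequence[start:start + window_size].upper()
        if "N" in chunk:
            continue
        gc = (chunk.count("G") + chunk.count("C")) / window_size
        windows.append((start, start + window_size, gc))
    return windows


# Toy sequence with a window size of 4 instead of 60K
toy_windows = reference_windows("ACGTGGCCNNNNATAT", 4)
```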
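GC correction, with the following steps:
For a given sample, the number of effective sequences in the i-th window is denoted URi, the GC content of the reference genome in that window is denoted GCi (recorded in the hg19.gc file), and the mean number of effective sequences over all windows on the autosomes (chromosomes 1 to 22) is denoted
Figure PCTCN2019130625-appb-000011
The effective sequence counts and GC contents of all autosomal windows are fitted (a cubic spline fit is used in this example) to obtain the relationship between the two: ur=f(gc);
The windows of all chromosomes are corrected:
Figure PCTCN2019130625-appb-000012
The number of effective sequences in the i-th window after GC correction is denoted URAi.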
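The correction formula itself appears only as an image in the source, so the sketch below rests on two stated assumptions: the correction is multiplicative, URAi = URi * mean / f(GCi), and f(gc) is approximated by the mean count of windows sharing the same rounded GC content instead of the cubic spline fit used in the example. The function name and the numbers are illustrative.

```python
from collections import defaultdict


def gc_correct(read_counts, gc_values, ndigits=2):
    """Return GC-corrected counts, assuming URA_i = UR_i * mean / f(GC_i).

    f(gc) is taken as the mean count of all windows whose GC content
    rounds to the same value (a simple stand-in for a spline fit).
    """
    bins = defaultdict(list)
    for ur, gc in zip(read_counts, gc_values):
        bins[round(gc, ndigits)].append(ur)
    expected = {b: sum(v) / len(v) for b, v in bins.items()}
    overall = sum(read_counts) / len(read_counts)
    corrected = []
    for ur, gc in zip(read_counts, gc_values):
        f_gc = expected[round(gc, ndigits)]
        corrected.append(ur * overall / f_gc if f_gc > 0 else 0.0)
    return corrected


# Illustrative windows: GC 0.40 windows are over-represented
ura = gc_correct([100.0, 120.0, 80.0, 100.0], [0.40, 0.40, 0.50, 0.50])
```

After correction, the windows of each GC bin average to the overall mean, which removes the dependence of the counts on GC content in this toy case.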
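The reverse-estimated concentration of each chromosome is calculated according to the following formula:
Figure PCTCN2019130625-appb-000013
where j denotes the chromosome number,
Figure PCTCN2019130625-appb-000014
denotes the number of GC-corrected sequencing reads that can be matched to the reference sequence of chromosome j, and
Figure PCTCN2019130625-appb-000015
denotes the average number of GC-corrected sequencing reads that can be matched to the reference sequences of all autosomes.
The fetal concentration is determined by a conventional method or by the method disclosed in PCT/CN2018/072045.
A KNN model is trained on the sample set and used for sample prediction, with the following steps: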
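(a) Sample set division and data preprocessing: the sample set is randomly divided into a training set, a validation set and a test set at a ratio of 6:2:2; data preprocessing is performed on the samples of each set so that every sample obtains a two-dimensional feature vector and a corresponding label (-1 for negative, +1 for positive).
(b) Selection of the hyperparameter k: the inventors found that a small k means predicting from training samples in a small neighborhood, so the prediction is very sensitive to nearby sample points, the overall model becomes complex, and overfitting is likely. A large k means predicting from training samples in a large neighborhood, so training samples far from (dissimilar to) the new input also affect the prediction and cause errors. In the limiting case where k equals the size of the training set, every new input is simply predicted to belong to the most frequent class in the training set, whatever its true class. Therefore, in the practice of the present invention, k is generally given a relatively small value.
(c) Model training: this comprises two parts, KNN model training and selection of k. The Euclidean distance and the majority voting rule are used here.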
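KNN model training: consider the classification decision function
f: R^n → {c1, c2} = {-1, +1}
where R^n is the n-dimensional feature space and -1 and +1 are the sample labels (-1 for negative, +1 for positive). The misclassification probability is then
P(Y≠f(X)) = 1 - P(Y=f(X))
For a given sample x∈X, let Nk(x) be the set of its k nearest training sample points. If the class assigned to the region covered by Nk(x) is cj, the misclassification probability is
(1/k) * Σ_{xi∈Nk(x)} I(yi≠cj) = 1 - (1/k) * Σ_{xi∈Nk(x)} I(yi=cj)
where I is the indicator function. To minimize the misclassification probability, the sum
Σ_{xi∈Nk(x)} I(yi=cj)
must be maximized. Therefore, once k has been chosen, model training amounts to maximizing
Σ_{xi∈Nk(x)} I(yi=cj)
which is what the majority voting rule achieves.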
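Selection of k: k is initialized to 1 (k∈{1,2,...,20}), and k is determined on the validation set by linear search. The results are shown in Figures 8 to 13, which are ROC curves for different values of k and reflect the performance of the corresponding classifier. The evaluation criterion is the AUC, i.e. the area under the ROC curve: the larger the AUC, the better the classification performance. Figures 8 and 9 show the ROC curves for T21 detection with the KNN model for k = 6, 7, 8 and 9. Figures 10 and 11 show the ROC curves for T18 detection with the KNN model for k = 6, 7, 8 and 9. Figures 12 and 13 show the ROC curves for T13 detection with the KNN model for k = 6, 7, 8 and 9. Based on the results in Figures 8 to 13, k = 7 was finally chosen for T13 and T18, and k = 9 for T21.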
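The AUC used as the evaluation criterion can be computed without plotting the ROC curve, using its equivalent pairwise-ranking form: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below assumes each sample's score is the fraction of positive samples among its k nearest neighbors; that scoring rule, the function name and the example values are assumptions, since the text does not say how the ROC curves were produced.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and +1/-1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    # Count positive/negative pairs ranked correctly; ties count one half
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative scores, e.g. fraction of positive neighbors for each sample
example_auc = auc([0.9, 0.8, 0.3, 0.2, 0.3], [1, 1, 1, -1, -1])
```

A linear search over k would evaluate this quantity on the validation set for each candidate k and keep the value with the largest AUC.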
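(d) Model prediction: the test set is predicted with the model trained in the above steps; the prediction results are shown in the tables below.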
Figure PCTCN2019130625-appb-000019
Figure PCTCN2019130625-appb-000020
Figure PCTCN2019130625-appb-000021
The sensitivity, specificity, PPV and ACC of the detection are shown in the following table.
Test  Sensitivity  Specificity  PPV     ACC
T21   100%         99.38%       97.60%  99.51%
T18   100%         99.13%       86.84%  99.18%
T13   100%         99.00%       62.50%  99.01%
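For reference, the four reported metrics follow from the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) by their standard definitions. The sketch below uses hypothetical counts, not the confusion matrices of this example, which are available only as images.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard definitions of sensitivity, specificity, PPV and accuracy."""
    return {
        "sensitivity": tp / (tp + fn),   # detected positives / all positives
        "specificity": tn / (tn + fp),   # correct negatives / all negatives
        "ppv": tp / (tp + fp),           # true positives / all positive calls
        "acc": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only
m = screening_metrics(tp=10, fp=6, tn=594, fn=0)
```

With few true positives, a handful of false positives lowers the PPV sharply even when specificity stays near 99%, which is the pattern visible in the T13 row.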
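2.5 Comparison with the SVM model
Using the same training, validation and test sets, the samples were classified as negative or positive with an SVM (support vector machine), with the following results:
Figure PCTCN2019130625-appb-000022
Figure PCTCN2019130625-appb-000023
Figure PCTCN2019130625-appb-000024
The sensitivity, specificity, PPV and ACC of the detection are shown in the following table.
Test  Sensitivity  Specificity  PPV     ACC
T21   100%         97.13%       89.71%  97.71%
T18   100%         98.61%       80.49%  98.69%
T13   100%         98.67%       55.56%  98.69%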
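The data show that, for both the KNN and the SVM model, no T13, T18 or T21 case in the test set was missed, and the sensitivity reached 100% in every case. However, in T21 detection the SVM model produced 14 false positives, whereas the KNN model produced only 3; in T18 detection the SVM model produced 8 false positives, whereas the KNN model produced only 5; in T13 detection the SVM model produced 8 false positives, whereas the KNN model produced 6. For T21, T18 and T13 alike, the KNN model has a lower false positive rate than the SVM model.
The inventors attribute the lower false positive rate of the KNN model mainly to the models themselves: KNN classifies according to many fine-grained local clusters, whereas SVM simply separates the samples into two classes, so SVM is less fine-grained than KNN.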
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。
尽管已经示出和描述了本发明的实施例,本领域的普通技术人员可以理解:在不脱离本发明的原理和宗旨的情况下可以对这些实施例进行多种变化、修改、替换和变型,本发明的范围由权利要求及其等同物限定。

Claims (46)

  1. 一种确定胎儿是否存在染色体非整倍性的方法,其特征在于,包括:
    (1)获取来自孕妇样本的核酸测序数据,所述孕妇样本含有胎儿游离核酸,所述核酸测序数据由多个测序读段构成;
    (2)基于所述核酸测序数据确定所述孕妇样本的胎儿浓度以及预定染色体的反估浓度,所述反估浓度是基于所述预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,所述预定染色体包括待测染色体和第二比较染色体,所述第一比较染色体包括至少一个不同于所述预定染色体的常染色体;
    (3)基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度的差异确定第一特征,基于所述待测染色体的反估浓度与所述胎儿浓度的差异确定第二特征;和
    (4)基于所述第一特征和第二特征,利用对照样本的相应数据,确定所述胎儿针对所述待测染色体是否存在非整倍性,其中,所述对照样本包括阳性样本和阴性样本,所述阳性样本针对所述待测染色体具有非整倍性,所述阴性样本针对所述待测染色体不具有非整倍性。
  2. 根据权利要求1所述的方法,其特征在于,所述孕妇样本包括孕妇外周血。
  3. 根据权利要求1所述的方法,其特征在于,所述核酸测序样本是通过双末端测序、单末端测序或者单分子测序获得的。
  4. 根据权利要求1所述的方法,其特征在于,所述胎儿浓度是通过下列步骤确定的:
    (a)将来自所述孕妇样本的所述核酸测序数据与参照序列比对,以便确定落入预定窗口的测序读段的数目;和
    (b)基于所述落入预定窗口的测序读段的数目,确定所述孕妇样本的胎儿浓度。
  5. 根据权利要求1所述的方法,其特征在于,在步骤(2)中,所述第一比较染色体的测序读段数目为多条常染色体的平均测序读段数目,所述多条常染色体包括至少一个已知不具有非整倍性的常染色体。
  6. 根据权利要求5所述的方法,其特征在于,在步骤(2)中,所述第一比较染色体的测序读段数目为至少15条常染色体的平均测序读段数目,
    可选的,第一比较染色体的测序读段数目为至少20条常染色体的平均测序读段数目,
    可选的,第一比较染色体的测序读段数目为全部常染色体的平均测序读段数目。
  7. 根据权利要求5所述的方法,其特征在于,反估浓度是按照下列公式确定的:
    Fj=2*|Rj-Rr|/(Rr)
    其中
    j表示需要确定所述反估浓度的染色体的编号,
    Fj表示第j号染色体的反估浓度,
    Rr表示所述多条常染色体的平均测序读段数目,
    Rj表示第j号染色体的测序读段数目。
  8. 根据权利要求1所述的方法,其特征在于,在步骤(2)中,所述第二比较染色体包含多个不具有非整倍性的常染色体,并且在步骤(3)中,基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度平均值的差异确定第一特征。
  9. 根据权利要求8所述的方法,其特征在于,所述第二比较染色体包含至少10条常染色体。
  10. 根据权利要求8所述的方法,其特征在于,所述第二比较染色体包含15条常染色体。
  11. 根据权利要求8所述的方法,其特征在于,进一步包括:
    确定多条常染色体的所述反估浓度;和
    按照由小至大的优先顺序,选择目标排序的常染色体作为所述第二比较染色体。
  12. 根据权利要求1所述的方法,其特征在于,所述第一特征是通过下列公式确定的:
    X1=Fi-Fr
    其中
    X1表示第一特征,
    i表示所述待测染色体的编号,
    Fi表示所述待测染色体的反估浓度,
    Fr表示所述第二比较染色体的反估浓度平均值。
  13. 根据权利要求12所述的方法,其特征在于,所述第二特征是通过下列公式确定的:
    Figure PCTCN2019130625-appb-100001
    其中,
    X2表示第二特征,
    i表示所述待测染色体的编号,
    Fi表示所述待测染色体的反估浓度,
    Fa表示所述胎儿浓度。
  14. 根据权利要求1~13任一项所述的方法,其特征在于,在进行步骤(4)之前,对所述第一特征和所述第二特征进行标准化处理,以便所述第一特征和所述第二特征的绝对值分别独立地处于0~1之间。
  15. 根据权利要求1所述的方法,其特征在于,在步骤(4)中,所述阳性样本和所述阴性样本的数目比例不低于1:4。
  16. 根据权利要求1所述的方法,其特征在于,在步骤(4)中,所述阳性样本和所述阴性样本的数目比例不超过4:1。
  17. 根据权利要求1所述的方法,其特征在于,在步骤(4)中,所述阳性样本和所述阴性样本的数目比例为1:0.1~5。
  18. 根据权利要求1所述的方法,其特征在于,在步骤(4)中,所述阳性样本和所述 阴性样本的数目比例为1:0.25~4。
  19. 根据权利要求1所述的方法,其特征在于,所述阳性样本和所述阴性样本针对所述待测染色体以外的其他染色体均不存在非整倍性。
  20. 根据权利要求1所述的方法,其特征在于,在步骤(4)中,采用所述第一特征和所述第二特征确定所述孕妇样本和所述对照样本的二维特征向量,基于由所述二维特征向量确定的样本间距离,将所述孕妇样本在所述阳性对照样本和所述阴性对照样本之间进行归类,以便确定所述胎儿针对所述待测染色体是否存在非整倍性。
  21. 根据权利要求20所述的方法,其特征在于,所述距离为欧几里得距离、曼哈顿距离或切比雪夫距离。
  22. 根据权利要求20所述的方法,其特征在于,在步骤(4)中,进一步包括:
    (4-1)分别计算所述孕妇样本与所述对照样本之间的距离;
    (4-2)将所得到的所述距离进行排序,所述排序基于由小到大的顺序;
    (4-3)基于所述排序,从小到大选择预定数量的对照样本;
    (4-4)分别确定所述预定数量的所述对照样本中阳性样本和阴性样本的数目;
    (4-5)基于多数决策法,确定将所述孕妇样本的归类结果。
  23. 根据权利要求22所述的方法,其特征在于,所述预定数量为不超过20。
  24. 根据权利要求22所述的方法,其特征在于,所述预定数量为3~10。
  25. 根据权利要求22所述的方法,其特征在于,在步骤(4-2)中,在进行所述排序之前,预先对所述待测样本与预定所述对照样本之间的距离进行加权处理。
  26. 一种确定胎儿是否存在染色体非整倍性的装置,其特征在于,包括:
    数据获取模块,用于获取来自孕妇样本的核酸测序数据,所述孕妇样本含有胎儿游离核酸,所述核酸测序数据由多个测序读段构成;
    胎儿浓度-反估浓度确定模块,用于基于所述核酸测序数据确定所述孕妇样本的胎儿浓度以及预定染色体的反估浓度,所述反估浓度是基于所述预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,所述预定染色体包括待测染色体和第二比较染色体,所述第一比较染色体包括至少一个不同于所述预定染色体的常染色体;
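    a feature determination module, configured to determine a first feature based on the difference between the reverse-estimated concentration of the chromosome to be tested and the reverse-estimated concentration of the second comparison chromosome, and to determine a second feature based on the difference between the reverse-estimated concentration of the chromosome to be tested and the fetal concentration; and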
    非整倍性确定模块,用于基于所述第一特征和第二特征,利用对照样本的相应数据,确定所述孕妇的胎儿针对所述待测染色体是否存在非整倍性,其中,所述对照样本包括阳性样本和阴性样本,所述阳性样本针对所述待测染色体具有非整倍性,所述阴性样本针对所述待测染色体不具有非整倍性。
  27. 根据权利要求26所述的装置,其特征在于,所述胎儿浓度-反估浓度确定模块包括:
    比对单元,用于将来自所述孕妇样本的所述核酸测序数据与参照序列比对,以便确定落入预定窗口的测序读段的数目;和
    胎儿浓度计算单元,用于基于所述落入预定窗口的测序读段的数目,确定所述孕妇样本的胎儿浓度。
  28. 根据权利要求26所述的装置,其特征在于,所述胎儿浓度-反估浓度确定模块包括:
    反估浓度计算单元,用于按照下列公式确定所述反估浓度:
    Fj=2*|Rj-Rr|/(Rr)
    其中
    j表示需要确定所述反估浓度的染色体的编号,
    Fj表示第j号染色体的反估浓度,
    Rr表示多条常染色体的平均测序读段数目,和
    Rj表示第j号染色体的测序读段数目。
  29. 根据权利要求26所述的装置,其特征在于,所述胎儿浓度-反估浓度确定模块包括:
    第二比较染色体确定单元用于将多条常染色体的所述反估浓度按照由小至大的优先顺序,选择目标排序的常染色体作为所述第二比较染色体。
  30. The device according to claim 26, wherein the feature determination module comprises:
    第一特征确定单元,用于通过下列公式确定所述第一特征:
    X1=Fi-Fr
    其中
    X1表示第一特征,
    i表示所述待测染色体的编号,
    Fi表示所述待测染色体的所述反估浓度,
    Fr表示所述第二比较染色体的反估浓度平均值。
  31. The device according to claim 26, wherein the feature determination module comprises:
    第二特征确定单元,用于通过下列公式确定所述第二特征:
    Figure PCTCN2019130625-appb-100002
    其中
    X2表示第二特征,
    i表示所述待测染色体的编号,
    Fi表示所述待测染色体的所述反估浓度,
    Fa表示所述胎儿浓度。
  32. The device according to claim 26, wherein the feature determination module comprises:
    标准化处理单元,用于对所述第一特征和所述第二特征进行标准化处理,以便所述第一特征和所述第二特征的绝对值分别独立地处于0~1之间。
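  33. The device according to claim 26, wherein the aneuploidy determination module is configured to use the first feature and the second feature to determine two-dimensional feature vectors of the pregnant-woman sample and the control samples, and to classify the pregnant-woman sample between the positive control samples and the negative control samples based on the inter-sample distances determined from the two-dimensional feature vectors, so as to determine whether the fetus has aneuploidy of the chromosome to be tested.
  34. The device according to claim 33, wherein the distance is the Euclidean distance, the Manhattan distance or the Chebyshev distance.
  35. The device according to claim 26, wherein the aneuploidy determination module is configured to determine the classification of the pregnant-woman sample using a k-nearest neighbor model.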
  36. 根据权利要求35所述的装置,其特征在于,所述k-近邻模型采用的K值为不超过20。
  37. 根据权利要求35所述的装置,其特征在于,所述k-近邻模型采用的K值为3~10。
  38. 根据权利要求35所述的装置,其特征在于,所述k-近邻模型中,对所述样本间距离进行加权处理。
  39. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现权利要求1-25中任一项所述方法的步骤。
  40. 一种电子设备,其特征在于,包括:
    权利要求39中所述的计算机可读存储介质;以及
    一个或者多个处理器,用于执行所述计算机可读存储介质中的程序。
  41. 一种构建机器学习分类模型的方法,其特征在于,包括:
    (a)针对多个孕妇样本的每一个分别进行:
    获取来自所述孕妇样本的核酸测序数据,所述孕妇样本含有胎儿游离核酸,所述核酸测序数据由多个测序读段构成,所述孕妇样本包括至少一个阳性样本和至少一个阴性样本,所述阳性样本针对待测染色体具有非整倍性,所述阴性样本针对所述待测染色体不具有非整倍性;
    基于所述核酸测序数据确定所述孕妇样本的胎儿浓度以及预定染色体的反估浓度,所述反估浓度是基于所述预定染色体的测序读段数目和第一比较染色体的测序读段数目的差异确定的,所述预定染色体包括待测染色体和第二比较染色体,所述第一比较染色体包括至少一个不同于所述预定染色体的常染色体;和
    基于所述待测染色体的反估浓度与所述第二比较染色体的反估浓度的差异确定第一特征,基于所述待测染色体的反估浓度与所述胎儿浓度的差异确定第二特征,
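    (b) using the plurality of pregnant-woman samples as samples and performing machine learning training with the first feature and the second feature of the samples, so as to construct a machine learning classification model for determining whether a fetus has aneuploidy.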
  42. 根据权利要求41所述的方法,其特征在于,所述机器学习分类模型为KNN模型。
  43. 根据权利要求42所述的方法,其特征在于,所述KNN模型采用欧几里得距离。
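  44. A device for constructing a machine learning classification model, comprising:
    a feature acquisition module, configured to perform the following for each of a plurality of pregnant woman samples, respectively:
    obtaining nucleic acid sequencing data from the pregnant woman sample, the pregnant woman sample containing fetal cell-free nucleic acid, the nucleic acid sequencing data consisting of a plurality of sequencing reads, the pregnant woman samples including at least one positive sample and at least one negative sample, the positive sample having aneuploidy for the chromosome to be tested, and the negative sample not having aneuploidy for the chromosome to be tested;
    determining, based on the nucleic acid sequencing data, the fetal concentration of the pregnant woman sample and the reverse-estimated concentration of a predetermined chromosome, the reverse-estimated concentration being determined based on the difference between the number of sequencing reads of the predetermined chromosome and the number of sequencing reads of a first comparison chromosome, the predetermined chromosome including the chromosome to be tested and a second comparison chromosome, and the first comparison chromosome including at least one autosome different from the predetermined chromosome; and
    determining a second feature based on the difference between the reverse-estimated concentration of the chromosome to be tested and the fetal concentration, and determining a first feature based on the difference between the reverse-estimated concentration of the chromosome to be tested and the reverse-estimated concentration of the second comparison chromosome,
    a training module, configured to use the plurality of pregnant woman samples as samples and perform machine learning training, so as to construct a machine learning classification model for determining whether a fetus has aneuploidy.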
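  45. The device according to claim 44, wherein the machine learning classification model is a KNN model.
  46. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 41 to 43.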
PCT/CN2019/130625 2019-12-31 2019-12-31 Method and device for determining chromosomal aneuploidy and constructing classification model WO2021134513A1 (zh)

Priority Applications (9)

Application Number Priority Date Filing Date Title
EP19958118.2A EP4086356A4 (en) 2019-12-31 2019-12-31 METHOD FOR DETERMINING CHROMOSOME ANEUPLOIDY AND CONSTRUCTION CLASSIFICATION MODEL AND APPARATUS
PCT/CN2019/130625 WO2021134513A1 (zh) 2019-12-31 2019-12-31 Method and device for determining chromosomal aneuploidy and constructing classification model
US17/612,515 US20220336047A1 (en) 2019-12-31 2019-12-31 Method and device for determining chromosomal aneuploidy and constructing classification model.
AU2019480813A AU2019480813A1 (en) 2019-12-31 2019-12-31 Methods for determining chromosome aneuploidy and constructing classification model, and device
KR1020227003512A KR20220122596A (ko) 2019-12-31 2019-12-31 염색체 이수성 판별 및 분류 모델 구성 방법 및 장치
CA3141362A CA3141362A1 (en) 2019-12-31 2019-12-31 Method and device for determining chromosomal aneuploidy and constructing classification model
JP2021569370A JP7467504B2 (ja) 2019-12-31 2019-12-31 染色体異数性を判定するためおよび分類モデルを構築するための方法およびデバイス
CN201980004859.0A CN111226281B (zh) 2019-12-31 2019-12-31 Method and device for determining chromosomal aneuploidy and constructing classification model
IL277746A IL277746A (en) 2019-12-31 2020-10-01 Method and device for determining chromosome aneuploidy and building a classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130625 WO2021134513A1 (zh) 2019-12-31 2019-12-31 Method and device for determining chromosomal aneuploidy and constructing classification model

Publications (1)

Publication Number Publication Date
WO2021134513A1 true WO2021134513A1 (zh) 2021-07-08

Family

ID=70827394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130625 WO2021134513A1 (zh) 2019-12-31 2019-12-31 Method and device for determining chromosomal aneuploidy and constructing classification model

Country Status (9)

Country Link
US (1) US20220336047A1 (zh)
EP (1) EP4086356A4 (zh)
JP (1) JP7467504B2 (zh)
KR (1) KR20220122596A (zh)
CN (1) CN111226281B (zh)
AU (1) AU2019480813A1 (zh)
CA (1) CA3141362A1 (zh)
IL (1) IL277746A (zh)
WO (1) WO2021134513A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037846A (zh) * 2020-07-14 2020-12-04 广州市达瑞生物技术股份有限公司 cffDNA aneuploidy detection method, system, storage medium and detection device
EP4254418A4 (en) * 2020-11-27 2024-03-27 Bgi Shenzhen METHOD AND SYSTEM FOR DETECTING CHROMOSOMAL ANOMALIES IN FETUS
CN116312813B (zh) * 2023-05-22 2023-08-22 上海科技大学 Method and markers for identifying the passage number of stem cell populations

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
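WO2011130880A1 (zh) * 2010-04-23 2011-10-27 深圳华大基因科技有限公司 Method for detecting fetal chromosomal aneuploidy
WO2013040773A1 (zh) * 2011-09-21 2013-03-28 深圳华大基因科技有限公司 Method and system for determining chromosomal aneuploidy of a single cell
WO2014153755A1 (zh) * 2013-03-28 2014-10-02 深圳华大基因研究院 Method, system and computer-readable medium for determining fetal chromosomal aneuploidy
CN104232777A (zh) * 2014-09-19 2014-12-24 天津华大基因科技有限公司 Method and device for simultaneously determining fetal nucleic acid content and chromosomal aneuploidy
WO2015006932A1 (zh) * 2013-07-17 2015-01-22 深圳华大基因科技有限公司 Method and device for detecting chromosomal aneuploidy
WO2015089726A1 (zh) * 2013-12-17 2015-06-25 深圳华大基因科技有限公司 Method and device for detecting chromosomal aneuploidy
CN104789686A (zh) * 2015-05-06 2015-07-22 安诺优达基因科技(北京)有限公司 Kit and device for detecting chromosomal aneuploidy
CN104789466A (zh) * 2015-05-06 2015-07-22 安诺优达基因科技(北京)有限公司 Kit and device for detecting chromosomal aneuploidy
CN106520940A (zh) * 2016-11-04 2017-03-22 深圳华大基因研究院 Method for detecting chromosomal aneuploidy and copy number variation, and application thereof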
WO2018132400A1 (en) * 2017-01-11 2018-07-19 Quest Diagnostics Investments Llc Method for non-invasive prenatal screening for aneuploidy

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2768978T3 (en) * 2011-10-18 2017-12-18 Multiplicom Nv Fetal CHROMOSOMAL ANEUPLOIDID DIAGNOSIS
US20130267425A1 (en) * 2012-04-06 2013-10-10 The Chinese University Of Hong Kong Noninvasive prenatal diagnosis of fetal trisomy by allelic ratio analysis using targeted massively parallel sequencing
US20160026759A1 (en) * 2014-07-22 2016-01-28 Yourgene Bioscience Detecting Chromosomal Aneuploidy
US20180327844A1 (en) * 2015-11-16 2018-11-15 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
WO2017093561A1 (en) 2015-12-04 2017-06-08 Genesupport Sa Method for non-invasive prenatal testing
CN105844116B (zh) * 2016-03-18 2018-02-27 广州市锐博生物科技有限公司 Method and device for processing sequencing data
WO2019020180A1 (en) 2017-07-26 2019-01-31 Trisomytest, S.R.O. METHOD FOR NON-EFFECTIVE PRENATAL DETECTION OF FETAL CHROMOSOMAL ANEUPLOIDIE FROM MATERNAL BLOOD ON THE BASIS OF A BAYESIAN NETWORK
SK862017A3 (sk) * 2017-08-24 2020-05-04 Grendar Marian Doc Mgr Phd Method of using fetal fraction and chromosome representation in determining aneuploid status in non-invasive prenatal testing
CN108363903B (zh) * 2018-01-23 2022-03-04 和卓生物科技(上海)有限公司 Chromosomal aneuploidy detection system suitable for single cells, and application thereof
CN108611408A (zh) * 2018-02-23 2018-10-02 深圳市瀚海基因生物科技有限公司 Method and device for detecting fetal chromosomal aneuploidy

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIU ROSSA W K: "Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma.", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 105, no. 51, 23 December 2008 (2008-12-23), pages 20458 - 20463, XP002620454, ISSN: 0027-8424, DOI: 10.1073/pnas.0810641105 *
LAU TZE KIN; CHEN FANG; PAN XIAOYU; POOH RITSUKO K; JIANG FUMAN; LI YIHAN; JIANG HUI; LI XUCHAO; CHEN SHENGPEI; ZHANG XIUQING: "Noninvasive prenatal diagnosis of common fetal chromosomal aneuploidies by maternal plasma DNA sequencing.", JOURNAL OF MATERNAL-FETAL AND NEONATAL MEDICINE., vol. 25, no. 8, 31 December 2012 (2012-12-31), pages 1370 - 1374, XP008164835, ISSN: 1057-0802, DOI: 10.3109/14767058.2011.635730 *
See also references of EP4086356A4

Also Published As

Publication number Publication date
AU2019480813A8 (en) 2022-05-12
JP7467504B2 (ja) 2024-04-15
EP4086356A1 (en) 2022-11-09
IL277746A (en) 2021-12-01
US20220336047A1 (en) 2022-10-20
KR20220122596A (ko) 2022-09-02
CN111226281A (zh) 2020-06-02
AU2019480813A1 (en) 2021-12-16
EP4086356A4 (en) 2023-09-27
JP2023517155A (ja) 2023-04-24
CN111226281B (zh) 2023-03-21
CA3141362A1 (en) 2021-07-08

Similar Documents

Publication Publication Date Title
US11854666B2 (en) Noninvasive prenatal screening using dynamic iterative depth optimization
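WO2021062904A1 (zh) Pathological image-based TMB classification method and system, and TMB analysis device
WO2021134513A1 (zh) Method and device for determining chromosomal aneuploidy and constructing classification model
CN108778287B (zh) Method and system for early risk assessment of preterm birth outcome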
US20230222311A1 (en) Generating machine learning models using genetic data
US20210065847A1 (en) Systems and methods for determining consensus base calls in nucleic acid sequencing
US20210102262A1 (en) Systems and methods for diagnosing a disease condition using on-target and off-target sequencing data
CN104951671A (zh) Device for detecting fetal chromosomal aneuploidy based on peripheral blood of a single sample
US20220101135A1 (en) Systems and methods for using a convolutional neural network to detect contamination
US20210166813A1 (en) Systems and methods for evaluating longitudinal biological feature data
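CN110191964B (zh) Method and device for determining the proportion of cell-free nucleic acid of predetermined origin in a biological sample
CN113707330B (zh) Method for constructing a Mongolian medicine syndrome differentiation model, and Mongolian medicine syndrome differentiation system and method
CN115223654A (zh) Method, device and storage medium for detecting fetal chromosomal aneuploidy abnormalities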
US11535896B2 (en) Method for analysing cell-free nucleic acids
US20200105374A1 (en) Mixture model for targeted sequencing
US20230005569A1 (en) Chromosomal and Sub-Chromosomal Copy Number Variation Detection
Lu An embedded method for gene identification in heterogenous data involving unwanted heterogeneity
WO2024107868A1 (en) Systems and methods for identifying clonal expansion of abnormal lymphocytes
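CN114512232A (zh) Edwards syndrome screening system based on cascaded machine learning models
CN117106870A (zh) Method and device for determining fetal concentration
CN117393054A (zh) Method and device for identifying true and false positives of copy number variation in nucleic acid samples and the cell-division origin
CN110428873A (zh) Method and system for detecting chromosome ploidy abnormalities
CN112020565A (zh) Quality control templates for ensuring validity of sequencing-based assays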
Carey Clinical Interpretation of Novel Copy Number Variations

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19958118; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 2021569370 (Country of ref document: JP; Kind code of ref document: A); Ref document number: 3141362 (Country of ref document: CA)
ENP Entry into the national phase. Ref document number: 2019480813; Country of ref document: AU; Date of ref document: 20191231; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2019958118; Country of ref document: EP; Effective date: 20220801
Effective date: 20220801