CN112522387B - Noninvasive prenatal chromosome abnormality detection device - Google Patents


Publication number
CN112522387B
CN112522387B CN202011455655.XA CN202011455655A CN112522387B CN 112522387 B CN112522387 B CN 112522387B CN 202011455655 A CN202011455655 A CN 202011455655A CN 112522387 B CN112522387 B CN 112522387B
Authority
CN
China
Prior art keywords
chromosome
sample
standard deviation
detected
reads
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011455655.XA
Other languages
Chinese (zh)
Other versions
CN112522387A (en
Inventor
张静波
曲丽
王伟伟
徐冰
伍启熹
王建伟
刘倩
唐宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Usci Medical Laboratory Co ltd
Original Assignee
Beijing Usci Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Usci Medical Laboratory Co ltd filed Critical Beijing Usci Medical Laboratory Co ltd
Priority to CN202011455655.XA priority Critical patent/CN112522387B/en
Publication of CN112522387A publication Critical patent/CN112522387A/en
Application granted granted Critical
Publication of CN112522387B publication Critical patent/CN112522387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Organic Chemistry (AREA)
  • Wood Science & Technology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Zoology (AREA)
  • Pathology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to the technical field of bioinformatics and discloses a noninvasive prenatal chromosome abnormality detection device. The device comprises a detection module, a data quality control module, a data preprocessing module, a calculation module and a judgment module. The calculation module comprises a chromosome heterozygosity ratio calculation unit, a chromosome adaptive correction standard deviation calculation unit and a chromosome Z value calculation unit. The chromosome adaptive correction standard deviation calculation unit performs logistic regression on the relation between the sequencing depth of the reference set samples and the standard deviation of the chromosome heterozygosity ratio to obtain a fitting function, and substitutes the sequencing depth of the pregnant woman sample to be detected into the fitting function to obtain the chromosome adaptive correction standard deviation of that sample. The device can improve the detection rate of true-positive samples with chromosome abnormalities.

Description

Noninvasive prenatal chromosome abnormality detection device
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a device for noninvasive prenatal detection of chromosome abnormality.
Background
Prenatal screening is an important component of obstetrical care. Current prenatal examination procedures include amniocentesis and chorionic villus sampling, both of which carry some risk of miscarriage. Non-invasive prenatal testing (NIPT) avoids this risk. Typically, NIPT uses cell-free fetal DNA (cfDNA) in the peripheral blood of the pregnant woman to detect whether the fetus has a chromosomal abnormality. The rapid development of bioinformatics algorithms and tools has opened new avenues for detecting fetal chromosomal abnormalities.
When chromosomal abnormalities are detected with high-throughput sequencing, the pregnant woman sample to be detected undergoes extraction, library construction, sequencing and data preprocessing, and the resulting data must then be interpreted by bioinformatic methods. The conventional NIPT method uses the Z value method: the corrected number of reads of a chromosome in the sample to be detected is compared with the mean and standard deviation of the number of reads of the same chromosome in the reference set samples to calculate a Z value, and the Z value is used to determine whether the fetus has a chromosome abnormality. In real clinical practice, however, false positives and false negatives often occur because the sequencing depth of the reference set differs from that of the test sample. Specifically, this difference changes the standard deviation of the number of chromosome reads and thereby the Z value. The traditional NIPT method ignores this depth dependence when calculating the Z value and applies the same standard deviation to samples of different sequencing depths, which is an obvious defect and can lead to false positive and false negative results.
If the reference set is trained on samples with a higher sequencing depth and the sample to be tested has a lower sequencing depth, the standard deviation used for the chromosome read number is too small, the Z value is too large, and false positives (FP) appear; if the reference set is trained on samples with a lower sequencing depth and the sample to be tested has a higher sequencing depth, the standard deviation used is too large, the Z value is too small, and false negatives (FN) appear.
To solve this problem, the prior art trains reference set data at several sequencing depths to match samples to be tested at different depths, that is, the standard deviation parameter of the chromosome heterozygosity ratio of the reference set is retrained for each specific sequencing depth. This scheme has the following defects: (1) for high-depth samples, training reference set data at the corresponding depth wastes considerable sequencing cost; (2) each time the chromosome Z value of a sample to be tested is calculated, if no reference set has been trained at the sequencing depth of that sample, a new reference set must be constructed, which takes a large amount of time, and the newly constructed reference set serves only test samples at that depth, so the efficiency is low.
Therefore, there is a need to provide a new apparatus for non-invasive prenatal detection of chromosomal abnormalities that solves the problems of the prior art.
Disclosure of Invention
In view of the above technical problems, the present invention provides a noninvasive prenatal chromosomal abnormality detection device capable of increasing the detection rate of true positive samples (TP) for chromosomal abnormalities and reducing the false negative rate (FN) of positive samples.
In order to realize the purpose of the invention, the technical scheme of the invention is as follows:
an apparatus for non-invasive prenatal detection of chromosomal abnormalities, the apparatus comprising: a detection module, a data quality control module, a data preprocessing module, a calculation module and a judgment module;
the calculation module comprises: a chromosome heterozygosity ratio calculation unit, a chromosome adaptive correction standard deviation calculation unit and a chromosome Z value calculation unit;
the chromosome adaptive correction standard deviation calculation unit performs logistic regression on the relation between the sequencing depth of the reference set samples and the standard deviation of the chromosome heterozygosity ratio to obtain a fitting function, and substitutes the sequencing depth of the pregnant woman sample to be detected into the fitting function to obtain the chromosome adaptive correction standard deviation of that sample. The fitting function is given by the R commands fit = lm(as.vector(y) ~ x + I(x^2)) and y(test) = predict(fit, data.frame(x = num))[[1]], where num is the numeric value of x(test), wherein fit represents the fitting function finally adopted in the invention, lm fits a linear model in the R language, x represents the sequencing depth of all reference set samples, y represents the standard deviation of the chromosome heterozygosity ratio of all reference set samples, x(test) represents the sequencing depth of the pregnant woman sample to be detected, and y(test) represents the standard deviation of the chromosome heterozygosity ratio of the pregnant woman sample to be detected, calculated from the fitting function. The fit step in the R language fits the relation between the total number of reads of the reference set samples and the standard deviation of their chromosome heterozygosity ratio to a quadratic function y(fit) = A0 + A1*x(fit) + A2*x(fit)^2, wherein y(fit) is the dependent variable fitted by linear regression and x(fit) is the independent variable of the fitted function, and yields the polynomial coefficients (A0, A1, A2); subsequent test samples are substituted into this function to obtain their adaptive read standard deviation. The predict function in the R language substitutes the sequencing depth of the test sample (i.e. the number of reads x(test)) for the independent variable x(fit) of the fitted function to obtain the dependent variable, i.e. the standard deviation y(test) of the chromosome heterozygosity ratio of the test sample.
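The fit-and-predict step described above can be sketched as follows. This is a minimal illustration in Python rather than the R commands of the invention; it assumes that the lm call with a quadratic term amounts to an ordinary least-squares polynomial fit, and the function names (solve_3x3, fit_quadratic, predict_sd) are hypothetical.

```python
from typing import List, Sequence, Tuple


def solve_3x3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = A0 + A1*x + A2*x^2 via the normal equations."""
    s = [sum(xi ** k for xi in x) for k in range(5)]               # sums of x^k
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    a = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    a0, a1, a2 = solve_3x3(a, t)
    return a0, a1, a2


def predict_sd(coeffs: Tuple[float, float, float], x_test: float) -> float:
    """Evaluate the fitted quadratic at the sequencing depth of the test sample."""
    a0, a1, a2 = coeffs
    return a0 + a1 * x_test + a2 * x_test ** 2
```

Here x would hold the sequencing depths of the reference set and y the corresponding standard deviations of the heterozygosity ratio; the unit of x (for example, millions of reads) is an assumption of the caller and is not fixed by this sketch.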
The invention relates to a device for detecting chromosome abnormality based on high-throughput sequencing data: library products are prepared from DNA by whole genome sequencing technology, the sequences of the library products are obtained by high-throughput sequencing, and the abnormal state of the sample chromosomes is determined after preprocessing and the specific adaptive correction.
The device obtains a detection result by applying the linear fitting function of the invention to the sequencing depth of each sample to be detected. Only one set of reference samples is needed to train the fitting function; the reference set does not have to be retrained for samples of different depths, because the depth of each sample is simply substituted into the fitting function. This saves a large amount of time while keeping the standard deviation parameter of the chromosome heterozygosity ratio accurate.
The device is not applied to diagnosis and treatment of diseases.
In the present invention, the data preprocessing module divides the genome of the pregnant woman sample to be detected that has passed quality control into sliding windows of 20 kb to 100 kb, with an overlap of 10 kb to 50 kb between two adjacent windows, and calculates the number of reads RC in each sliding window;
performs mean normalization correction on the number of reads RC in all sliding windows of the genome of the pregnant woman sample to be detected, to obtain the normalized number of reads RNor; the aim is to eliminate fluctuations in the number of reads per window caused by differences in the total number of reads between samples;
performs GC correction on the normalized numbers of reads RNor of all sliding windows of the genome of the pregnant woman sample to be detected by a smoothing spline method, to obtain the GC-corrected number of reads RGc; the aim is to eliminate the bias in the number of reads caused by the GC content of the genome;
and performs baseline correction on the GC-corrected number of reads RGc in all sliding windows of the genome of the pregnant woman sample to be detected; the aim is to eliminate read number deviations caused by batch effects.
Preferably, the genome of the pregnant woman sample to be tested, which passes the quality control, is divided into sliding windows with the size of 100kb, and 50kb of overlapping sequences exist between two adjacent windows.
If the window width or the overlap is too small, detection takes too long; if it is too large, the detection performance of the algorithm becomes unstable. A 50 kb overlap helps to stabilize the detection performance of the algorithm while saving running time.
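The windowing and read counting described above can be sketched as follows. This is a minimal Python illustration, not the implementation of the invention; it assumes 1-based inclusive coordinates, treats a read as belonging to every window that contains its start position, and uses hypothetical function names.

```python
from typing import List, Sequence, Tuple


def sliding_windows(chrom_length: int, size: int = 100_000,
                    step: int = 50_000) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) windows; adjacent windows overlap by size - step."""
    windows = []
    start = 1
    while start <= chrom_length:
        windows.append((start, min(start + size - 1, chrom_length)))
        start += step
    return windows


def count_reads(read_starts: Sequence[int],
                windows: Sequence[Tuple[int, int]]) -> List[int]:
    """Count, for each window, the reads whose start position lies inside it.

    Simple O(reads x windows) loop, kept for clarity rather than speed.
    """
    counts = [0] * len(windows)
    for pos in read_starts:
        for i, (start, end) in enumerate(windows):
            if start <= pos <= end:
                counts[i] += 1
    return counts
```

Because adjacent windows overlap by half their width, most reads are counted in two windows, so the per-window counts do not sum to the total number of reads.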
In the invention, the number of reads in each sliding window after normalization correction is RNor = RC / (total / BinT), wherein total is the total number of reads of the genome of the pregnant woman sample to be detected and BinT is the total number of sliding windows.
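The normalization formula can be written out directly. The sketch below is illustrative Python; the function name normalize_counts is hypothetical, and total_reads is passed in explicitly because, with overlapping windows, it is not simply the sum of the window counts.

```python
from typing import List, Sequence


def normalize_counts(rc: Sequence[float], total_reads: float) -> List[float]:
    """Compute RNor = RC / (total / BinT) for every sliding window."""
    bin_t = len(rc)                        # BinT: total number of sliding windows
    mean_per_window = total_reads / bin_t  # average number of reads per window
    return [c / mean_per_window for c in rc]
```

A window holding exactly the average number of reads is mapped to 1.0, so values above or below 1.0 indicate relative excess or deficit of reads.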
In the present invention, GC correction is achieved by the R command: RGc = smooth.spline(RNor[ord]).
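To illustrate what a GC correction does, the sketch below deliberately swaps in a simpler technique than the smoothing spline named above: a binned-median correction, in which each window is rescaled by the median of all windows with similar GC content. It is not the R smooth.spline procedure of the invention; the function name, the bin width and the choice of the median are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Sequence


def gc_correct_binned(rnor: Sequence[float], gc: Sequence[float],
                      bin_width: float = 0.01) -> List[float]:
    """Binned-median GC correction (a stand-in for a smoothing spline).

    rnor: normalized read numbers per window; gc: GC fraction (0..1) per window.
    """
    bins: Dict[int, List[float]] = defaultdict(list)
    keys = [int(g / bin_width) for g in gc]     # GC bin index of each window
    for key, value in zip(keys, rnor):
        bins[key].append(value)
    overall = median(rnor)                      # target level after correction
    corrected = []
    for key, value in zip(keys, rnor):
        expected = median(bins[key])            # typical value at this GC level
        corrected.append(value * overall / expected if expected > 0 else 0.0)
    return corrected
```

After correction, windows from GC-rich and GC-poor regions are brought to the same overall level, which is the stated aim of the GC correction step.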
In the present invention, the chromosome heterozygosity ratio calculation unit calculates the mean of the numbers of reads, as preprocessed by the data preprocessing module, over all sliding windows in a chromosome and records it as the heterozygosity ratio hh of the chromosome;
chromosome Z value calculation unit: calculating the Z value of each chromosome according to the self-adaptive correction standard deviation of the chromosome of the sample to be detected:
$$\text{Zscore} = \frac{\text{Sample hh} - \text{Reference hh}}{\text{Adaptive sd}}$$
wherein Zscore is the Z value of a given chromosome of the pregnant woman sample to be detected;
Sample hh: the heterozygosity ratio of that chromosome in the pregnant woman sample to be detected;
Reference hh: the heterozygosity ratio of that chromosome in the reference set samples;
Adaptive sd: the adaptively corrected standard deviation of that chromosome, based on logistic regression.
In the invention, the detection module performs high-throughput sequencing on the cell-free DNA in the peripheral blood of the pregnant woman to obtain the genome sequencing data of the pregnant woman sample to be detected.
In the invention, the data quality control module aligns the sequencing data of the pregnant woman sample to be tested to the human reference genome hg19 and passes files whose number of reads is larger than 2M.
In the invention, the judging module judges whether a chromosome is abnormal according to its Z value: the chromosome is judged abnormal when the Z value is greater than 3 and normal when the Z value is less than 3.
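The Z value calculation and the judgment rule can be sketched together as follows. This is an illustrative Python fragment with hypothetical function names; the text does not say how a Z value of exactly 3 is handled, so treating it as normal here is an assumption.

```python
def z_score(sample_hh: float, reference_hh: float, adaptive_sd: float) -> float:
    """Zscore = (Sample hh - Reference hh) / Adaptive sd."""
    if adaptive_sd <= 0:
        raise ValueError("adaptive_sd must be positive")
    return (sample_hh - reference_hh) / adaptive_sd


def judge(z: float, threshold: float = 3.0) -> str:
    """Abnormal when Z > threshold; Z == threshold is treated as normal (assumption)."""
    return "abnormal" if z > threshold else "normal"
```

For example, judge(z_score(sample_hh, reference_hh, adaptive_sd)) would be evaluated once per chromosome of the sample to be detected.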
The invention has the beneficial effects that:
1. the invention is based on a logistic regression method, and the corrected standard deviation parameters of the chromosome reads are self-adaptively learned according to the actual sequencing depth of the pregnant woman sample to be detected, so that the problem of the difference of the standard deviation of the chromosome reads caused by the difference of the sequencing depth of the reference set sample and the sequencing depth of the pregnant woman sample to be detected is solved. The used function is simple, the overfitting risk is reduced, and the detection speed and accuracy are improved;
2. the invention improves the detection rate of the chromosome abnormality true positive sample in noninvasive prenatal detection and reduces the number of false negative samples.
Drawings
FIG. 1 is a schematic view of the detection process performed by the apparatus of the present invention.
FIG. 2, FIG. 3 and FIG. 4 are graphs showing how the standard deviation of the number of reads of each of the 22 chromosomes varies with sequencing depth. Each graph represents one chromosome; the abscissa is the number of reads of the reference set samples, and the ordinate is the standard deviation of the chromosome at that number of reads.
FIG. 5 shows the number of reads in each window when tested with PN00G18DE1061 in example 3 of the present invention, where the upper graph shows the number of reads in each window before GC correction and the lower graph shows the number of reads in each window after GC correction.
Detailed Description
Preferred embodiments of the present invention will be described in detail with reference to the following examples. It is to be understood that the following examples are given for illustrative purposes only and are not intended to limit the scope of the present invention. Various modifications and alterations of this invention will become apparent to those skilled in the art without departing from the spirit and scope of this invention.
The experimental procedures used in the following examples are all conventional procedures unless otherwise specified. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1 fitting function establishment
In this example, the standard deviation of the number of reads of each of the 22 chromosomes is calculated at different sequencing depths; the resulting graphs are shown in fig. 2, fig. 3 and fig. 4.
Each graph represents one chromosome; the abscissa is the number of reads of the reference set samples, and the ordinate is the standard deviation of the chromosome at that number of reads. As can be seen from fig. 2-4, the standard deviation of the number of chromosome reads varies significantly with sequencing depth and becomes smaller as the number of reads increases (i.e., as the sequencing depth increases), indicating that each sample should be evaluated with a standard deviation parameter matching its own sequencing depth rather than with one standard deviation shared by all samples.
In each graph, the invention fits a function (the curve in the graph) to the correspondence (the points in the graph) between the number of reads of the reference set samples and the standard deviation of each of the 22 chromosomes, that is, the fitting function curve of the invention is obtained from the reference set samples. The fit is given by the R commands fit = lm(as.vector(y) ~ x + I(x^2)) and y(test) = predict(fit, data.frame(x = num))[[1]], where num is the numeric value of x(test), wherein fit represents the fitting function finally adopted in the invention, lm fits a linear model in the R language, x represents the sequencing depth of all reference set samples, y represents the standard deviation of the chromosome heterozygosity ratio of all reference set samples, x(test) represents the sequencing depth of the pregnant woman sample to be detected, and y(test) represents the standard deviation of the chromosome heterozygosity ratio of the pregnant woman sample to be detected, calculated from the fitting function.
The invention predicts the standard deviation that should be used by the sample to be tested according to the curve function. It can also be seen from fig. 2-4 that the fitting function of the present invention fits well to the relevant points, and that the function is reasonable.
The specific process is as follows:
(1) for FIGS. 2-4, a total of 69,667 NIPT simulated healthy samples (no chromosomal abnormalities and no microdeletions or duplications larger than 10 Mb; gestational age greater than 10 weeks) were used as reference set samples for fitting the linear fitting function shown in FIGS. 2-4. The sequencing platform used is the MGI2000 platform, which yields the fq files of the samples; the read length is 50 bp, the sequencing depth is 0.1-1X, and the number of reads per sample is 7M-76M. The preprocessing in steps (2)-(6) is carried out on each of the 69,667 samples;
(2) performing quality control screening on the fq file: removing PCR duplicate fragments, removing low-quality reads (reads containing consecutive N bases, or reads in which the average Phred score of 5 consecutive nucleotides is less than 20), and retaining samples that still have more than 2M reads after this removal;
(3) dividing each chromosome into windows of 100 kb with a window overlap of 50 kb, and calculating the number of reads in each window, a read being counted in a window when its start base position falls within the coordinates of that sliding window;
(4) performing normalization correction on the number of reads in the windows, namely dividing the number of reads RC in each sliding window by the average number of reads over all windows of the sample: the normalized number of reads in each sliding window is RNor = RC / (total / BinT), wherein total is the total number of genome reads of the sample and BinT is the total number of sliding windows;
(5) counting the GC content of the reads in each window, and carrying out GC correction on the number of the reads after the standardization correction by using a smooth spline method in an R language according to the GC content;
(6) calculating, for each chromosome, the average of the GC-corrected numbers of reads over all its windows and recording it as the heterozygosity ratio hh of the chromosome; the heterozygosity ratios of the 22 chromosomes are calculated in this way and their averages over the reference set are recorded as Reference hh. In this embodiment, the average heterozygosity ratios calculated for chromosomes 13, 18 and 21 are 0.0000034, 0.0000052 and 0.0000015 respectively, and these averages are taken as the heterozygosity ratios of the reference set samples. Further, the standard deviations of the heterozygosity ratios of chromosomes 13, 18 and 21 over all samples (69,667 NIPT simulated healthy samples) were calculated to be 0.06009, 0.09525 and 0.012189. These are the SD values used in the conventional detection method, i.e., in the conventional method the standard deviation of chromosome 13 is fixed at 0.06009, that of chromosome 18 at 0.09525 and that of chromosome 21 at 0.012189.
(7) Based on the total number of reads in all windows of each sample, the 69,667 reference set samples were divided into 138 intervals of width 0.5M: (7M,7.5M), (7.5M,8M) … (75.5M,76M). For example, a sample with 7.25M total reads belongs to the interval (7M,7.5M). For each of the 138 intervals, the standard deviation of the heterozygosity ratio of each chromosome was calculated over all samples in the interval, so that 138 standard deviations were obtained for each chromosome. The left boundary of each interval is then used as the abscissa and the corresponding standard deviation as the ordinate, giving 138 coordinate pairs, namely (7, SD1), (7.5, SD2) … (75.5, SD138), i.e., 138 points;
(8) for each of the 22 chromosomes, the 138 points from step (7), with the left boundary of each read-number interval as the abscissa and the standard deviation of the chromosome heterozygosity ratio of all samples in the interval as the ordinate, were fitted with a quadratic function by linear regression in R. The polynomial coefficients of the quadratic function, denoted A0, A1 and A2, are obtained by the least squares method, and the fitted curve of each chromosome is drawn in fig. 2-4. The fitted quadratic function is y(fit) = A0 + A1*x(fit) + A2*x(fit)^2, wherein y(fit) is the dependent variable fitted by linear regression, x(fit) is the independent variable of the fitted function, A0 is the constant term, A1 is the first-order coefficient and A2 is the second-order coefficient of the fitting function;
At this point, the training process for the reference set samples is complete.
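Step (7) above, which groups reference samples into 0.5M-wide read-number intervals and computes one standard deviation per interval, can be sketched as follows. This is an illustrative Python fragment; the function name is hypothetical, depths are assumed to be given in millions of reads, a sample lying exactly on a boundary is assigned to the interval that starts there, and the sample standard deviation (n - 1 denominator, as in R's sd) is assumed because the text does not specify the estimator.

```python
import math
from collections import defaultdict
from statistics import stdev
from typing import Dict, List, Sequence, Tuple


def sd_by_depth_interval(total_reads_m: Sequence[float],
                         hh_values: Sequence[float],
                         width: float = 0.5) -> List[Tuple[float, float]]:
    """Return sorted (left interval boundary, SD of hh) pairs for one chromosome."""
    groups: Dict[float, List[float]] = defaultdict(list)
    for depth, hh in zip(total_reads_m, hh_values):
        left = math.floor(depth / width) * width   # left boundary of the interval
        groups[left].append(hh)
    # An SD needs at least two samples; smaller groups are skipped.
    return sorted((left, stdev(vals)) for left, vals in groups.items() if len(vals) >= 2)
```

The resulting pairs play the role of the points (7, SD1), (7.5, SD2) … that are then fitted by the quadratic function in step (8).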
The coefficients of the quadratic function polynomial constructed by fitting to chromosomes 13, 18 and 21 in this embodiment are shown in table 1.
TABLE 1
Chromosome       A0              A1               A2
Chromosome 13    0.0498219876    -0.0016572671    0.0001552737
Chromosome 18    0.0614557842    -0.0021351633    0.0002116426
Chromosome 21    0.0164491534    -0.0028875793    0.0002817179
Example 2 chromosome abnormality judgment
The present embodiment provides a method for detecting chromosomal abnormalities using the apparatus for noninvasive prenatal detection of chromosomal abnormalities of the present invention. The detection process is schematically shown in FIG. 1. The method comprises the following specific steps:
(1) sequencing the peripheral blood cell-free DNA of the pregnant woman sample to be detected with the detection module; the sequencing platform used is the MGI2000 platform, which yields an fq file (containing the sequenced reads and their sequencing quality), and the read length is 50 bp;
(2) performing quality control screening on the fq file with the data quality control module: removing PCR duplicate fragments and removing low-quality reads (reads containing consecutive N bases, or reads in which the average Phred score of 5 consecutive nucleotides is less than 20); if fewer than 2M reads remain, the library is rebuilt and sequenced again;
(3) dividing each chromosome, with the data preprocessing module, into windows of 100 kb with a window overlap of 50 kb, namely (1,100k), (50k,150k) …, and calculating the number of reads in each window: if the start base position of a read falls within the coordinates of a window, the number of reads of that window is increased by 1, and this is repeated for every read until all reads have been assigned to their windows;
(4) performing normalization correction on the number of reads in the windows with the data preprocessing module, namely dividing the number of reads RC in each sliding window by the average number of reads over all windows of the sample: the normalized number of reads in each sliding window is RNor = RC / (total / BinT), wherein total is the total number of genome reads of the pregnant woman sample to be detected and BinT is the total number of sliding windows;
(5) counting the GC content of the reads in each window with the data preprocessing module, and performing GC correction on the normalized number of reads according to the GC content using the smoothing spline method in the R language; GC correction is achieved by the R command: RGc = smooth.spline(RNor[ord]);
(6) performing baseline correction on the GC-corrected number of reads with the data preprocessing module using a weighted linear regression function, wherein the weight is the reciprocal of the standard deviation of the number of reads of all samples of the batch in each window;
(7) calculating, with the chromosome heterozygosity ratio calculation unit in the calculation module, the average of the preprocessed numbers of reads over all windows of each chromosome, recording it as the heterozygosity ratio hh of the chromosome, and thus calculating the heterozygosity ratios of the 22 chromosomes;
(8) calculating the chromosome adaptive correction standard deviation of the sample to be tested with the chromosome adaptive correction standard deviation calculation unit in the calculation module, from the 22 fitting functions obtained in embodiment 1 and the total number of reads x(test) of all windows in the sample: Adaptive sd = A0 + A1*x(test) + A2*x(test)^2, wherein x(test) is the total number of reads of the sample to be tested (substituted for the independent variable x(fit) of the fitted function), A0, A1 and A2 are the polynomial coefficients fitted in embodiment 1, and Adaptive sd is the adaptive standard deviation parameter (the dependent variable y(fit) obtained by linear regression);
(9) calculating the Z value of each chromosome by using the chromosome Z value calculation unit in the calculation module:
$$\mathrm{Zscore} = \frac{\text{Sample hh} - \text{Reference hh}}{\text{Adaptive sd}}$$
wherein Zscore is the Z value of a given chromosome of the pregnant woman sample to be detected;
Sample hh: the hybridization ratio of that chromosome in the pregnant woman sample to be detected;
Reference hh: the hybridization ratio of that chromosome in the reference set samples of Example 1;
Adaptive sd: the adaptively corrected standard deviation of that chromosome based on logistic regression.
(10) judging whether each chromosome is abnormal by using the judging module: for each chromosome, determining whether the Z value calculated in step (9) is greater than 3; if the Z value is greater than 3, the chromosome is judged abnormal, and if the Z value is less than 3, the chromosome is judged normal.
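Steps (4) and (7) to (10) above reduce to a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, the patent itself works in R, step (4) is implemented as its verbal description (division by the mean window count), and GC correction and baseline correction (steps (5) and (6)) are deliberately left out.

```python
from statistics import mean


def normalize_counts(rc):
    """Step (4): divide each window's read count RC by the mean count over all windows."""
    window_mean = mean(rc)
    return [count / window_mean for count in rc]


def hybridization_ratio(values):
    """Step (7): mean of the preprocessed window values on one chromosome."""
    return mean(values)


def adaptive_sd(a0, a1, a2, x_test):
    """Step (8): evaluate the fitted quadratic A0 + A1*x + A2*x^2 at the sample's read total."""
    return a0 + a1 * x_test + a2 * x_test ** 2


def z_score(sample_hh, reference_hh, sd):
    """Step (9): Zscore = (Sample hh - Reference hh) / Adaptive sd."""
    return (sample_hh - reference_hh) / sd


def is_abnormal(z, threshold=3.0):
    """Step (10): a chromosome is judged abnormal when its Z value exceeds 3."""
    return z > threshold


# Toy window counts (hypothetical); the three values average to 892.
normalized = normalize_counts([936, 892, 848])
```

Keeping each step as a separate function mirrors the module and unit split described in the text, so the Z value unit only needs the hybridization ratio, the reference value and the standard deviation as inputs.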
Example 3
This example validated the method of Example 2 and the conventional method on 135 true positive samples and 3 negative samples of chromosomal abnormality. All 138 samples were from pregnancies of more than 10 gestational weeks, and their peripheral blood data were extracted for validation.
The conventional method differs from the detection process of Example 2 of the invention only in the final Z value calculation step; all other processing, including sequencing, counting the reads in each window, GC correction and baseline correction, is the same as described in Example 2.
The conventional method:
$$\mathrm{Zscore} = \frac{\text{Sample hh} - \text{Reference hh}}{\text{Fixed sd}}$$
The method of the present invention:
$$\mathrm{Zscore} = \frac{\text{Sample hh} - \text{Reference hh}}{\text{Adaptive sd}}$$
As can be seen from the Z value calculations of the conventional method and of the method of the present invention, the denominator of the conventional method is a fixed standard deviation (in this example 0.06009 for chromosome 13, 0.09525 for chromosome 18 and 0.012189 for chromosome 21, all derived from step (6) of Example 1); that is, the same SD is used regardless of the sequencing depth of the test sample. The denominator of the method of the present invention is an adaptive SD: the linear fitting function trained on the reference set samples in Example 1 is used to calculate the adaptive standard deviation parameter corresponding to the actual sequencing depth of the sample to be detected, and the Z value is then calculated from it.
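The effect of the denominator can be illustrated with the chromosome 21 values quoted for sample PN00G18DE1061 later in this example. The sketch below is plain arithmetic on those quoted numbers, with hypothetical variable names; it is not a reproduction of Table 3, which records the actual conventional-method results.

```python
def z_score(sample_hh, reference_hh, sd):
    """Z value for one chromosome: (Sample hh - Reference hh) / sd."""
    return (sample_hh - reference_hh) / sd


SAMPLE_HH = 0.030172         # chromosome 21 hybridization ratio of the sample
REFERENCE_HH = 0.0000015     # chromosome 21 hybridization ratio of the reference set
FIXED_SD_CHR21 = 0.012189    # fixed SD used by the conventional method
ADAPTIVE_SD_CHR21 = 0.00905  # adaptive SD at about 5.1M reads

z_fixed = z_score(SAMPLE_HH, REFERENCE_HH, FIXED_SD_CHR21)
z_adaptive = z_score(SAMPLE_HH, REFERENCE_HH, ADAPTIVE_SD_CHR21)

# With the same numerator, the fixed SD gives a Z value below the threshold of 3,
# while the depth-specific adaptive SD gives a Z value above it.
print(f"fixed SD: {z_fixed:.3f}, adaptive SD: {z_adaptive:.3f}")
```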
Sample source: peripheral blood data of 138 pregnant woman samples obtained in non-invasive prenatal testing.
The detection of chromosome 21 abnormality is described in detail below, taking the sample numbered PN00G18DE1061 as an example:
(1) sequencing the peripheral blood cell-free DNA of the pregnant woman sample to be detected, numbered PN00G18DE1061, on the sequencing platform MGI2000 to obtain an fq file (containing the sequenced reads and their sequencing quality), wherein the read length is 50 bp and the total number of reads of the sample is 6,438,936;
(2) performing quality control on the generated fq file: PCR duplicates are removed, and low-quality reads containing consecutive N bases (or with a Phred score below 20 for 5 consecutive nucleotides) are removed; the total number of remaining reads is 5,129,088, i.e., about 5.1M; since the number of reads is greater than 2M, the analysis of the following steps is performed directly without resequencing.
(3) dividing chromosome 21 into windows of 100 kb with an overlap of 50 kb, which gives 963 windows, and counting the number of reads in each window, i.e., the reads whose starting base positions fall within the sliding window. In this way, the number of reads for each of the 963 windows is obtained; for example, the number of reads of the sample to be detected in the window (46650000, 46750000) is 936, and the numbers of reads of the other windows are not listed.
(4) normalizing the number of reads in the windows, i.e., dividing the number of reads in each sliding window by the average number of reads over all windows of the sample; for example, the average number of reads per window for this sample is 892, so the normalized number of reads for the window (46650000, 46750000) is 936/892 = 1.04;
(5) counting the GC content of the reads in each window, and performing GC correction on the normalized number of reads, according to the GC content, by using the smoothing spline method in the R language; for this test sample, the number of reads in each window after GC correction is shown in the lower graph of FIG. 5, and the number of reads in each window before GC correction is shown in the upper graph of FIG. 5.
(6) performing baseline correction on the GC-corrected number of reads in each window from step (5) by using a weighted linear regression function, wherein the weight is the reciprocal of the standard deviation of the number of reads in the window across all samples of the batch;
(7) calculating the average of the preprocessed number of reads over all windows (963 in total) on chromosome 21, which is 0.030172; that is, the hybridization ratio of chromosome 21 is 0.030172;
(8) according to Example 1, the fitting function for chromosome 21 is: y(fit) = 0.0164491534 - 0.0028875793*x(fit) + 0.0002817179*x(fit)^2; substituting the total number of reads of the sample, 5.1 (in millions), for x(fit) gives an adaptive correction standard deviation of 0.00905 for chromosome 21 of the sample to be detected;
(9) the hybridization ratio of chromosome 21 in the reference set samples obtained in Example 1 is 0.0000015, the hybridization ratio of chromosome 21 of the sample obtained in step (7) is 0.030172, and the adaptive standard deviation of chromosome 21 of the sample obtained in step (8) is 0.00905, so the Z value of chromosome 21 is: Zscore = (Sample hh - Reference hh)/(Adaptive sd) = (0.030172 - 0.0000015)/0.00905 = 3.333;
(10) since the Z value of 3.333 calculated in step (9) is greater than 3, chromosome 21 of the sample numbered PN00G18DE1061 is judged to be abnormal.
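The windowing in step (3) and the arithmetic in steps (4), (8) and (9) can be checked with a short Python sketch. Several things here are assumptions rather than statements from the text: the chromosome 21 length is the hg19 value, windows start every 50 kb from position 0, a read belongs to every window whose half-open interval contains its start position, and the fitting-function coefficients are read with the signs +, -, + (which reproduces the quoted 0.00905 at x = 5.1). Function names and toy read positions are hypothetical.

```python
CHR21_LENGTH = 48_129_895  # assumption: hg19 chr21 length, not stated in the text
WINDOW = 100_000           # 100 kb windows
STEP = 50_000              # adjacent windows overlap by 50 kb


def sliding_windows(length, window=WINDOW, step=STEP):
    """Return (start, end) pairs, clipping the last windows at the chromosome end."""
    return [(start, min(start + window, length)) for start in range(0, length, step)]


def count_reads(read_starts, n_windows, window=WINDOW, step=STEP):
    """Count reads whose start position lies in [start, end) of each window."""
    counts = [0] * n_windows
    for pos in read_starts:
        first = max(0, (pos - window) // step + 1)  # first window containing pos
        last = min(n_windows - 1, pos // step)      # last window containing pos
        for k in range(first, last + 1):
            counts[k] += 1
    return counts


windows = sliding_windows(CHR21_LENGTH)
toy_counts = count_reads([30_000, 100_000, 120_000], len(windows))  # hypothetical reads

# Numbers quoted in steps (4), (7), (8) and (9) for sample PN00G18DE1061.
normalized = 936 / 892                                    # step (4)
sd = 0.0164491534 - 0.0028875793 * 5.1 + 0.0002817179 * 5.1 ** 2  # step (8)
z = (0.030172 - 0.0000015) / sd                           # step (9)
print(len(windows), round(sd, 5), z > 3)
```

Under these assumptions the window count matches the 963 windows reported for chromosome 21, and the window (46650000, 46750000) is one of them.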
The hybridization ratio, adaptive standard deviation and Z value of every sample were obtained with the method of the invention, and chromosome abnormality was judged from them. The hybridization ratios, adaptive standard deviations and Z values of the samples numbered below are shown in Table 2.
TABLE 2
| Sample number | Chromosome | Hybridization ratio | Adaptive standard deviation | Z value |
|---|---|---|---|---|
| PN00G18EE6259 | Chr13 | 0.1803556 | 0.04550987 | 3.963 |
| NE00G19EB0338 | Chr13 | 0.5252057 | 0.05958767 | 8.814 |
| PN00G18PA0136 | Chr18 | 0.2037483 | 0.05633075 | 3.617 |
| PN00G18DD9338 | Chr18 | 0.2300391 | 0.05694036 | 4.04 |
| PN00G16AD0047 | Chr21 | 0.1102264 | 0.00941864 | 11.703 |
| PN00G16AC0579 | Chr21 | 0.2422204 | 0.00948322 | 25.542 |
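The Z values in Table 2 can be recomputed from the other two columns. The sketch below assumes a reference hybridization ratio of 0 for every chromosome; the text gives only the chromosome 21 reference value (0.0000015), which is small enough not to change the third decimal place, and does not state the values for chromosomes 13 and 18. Variable names are hypothetical.

```python
# (sample number, chromosome, hybridization ratio, adaptive SD, reported Z value)
TABLE_2 = [
    ("PN00G18EE6259", "Chr13", 0.1803556, 0.04550987, 3.963),
    ("NE00G19EB0338", "Chr13", 0.5252057, 0.05958767, 8.814),
    ("PN00G18PA0136", "Chr18", 0.2037483, 0.05633075, 3.617),
    ("PN00G18DD9338", "Chr18", 0.2300391, 0.05694036, 4.04),
    ("PN00G16AD0047", "Chr21", 0.1102264, 0.00941864, 11.703),
    ("PN00G16AC0579", "Chr21", 0.2422204, 0.00948322, 25.542),
]

REFERENCE_HH = 0.0  # assumption: reference hybridization ratio treated as negligible

recomputed = []
for sample, chrom, hh, sd, reported in TABLE_2:
    z = (hh - REFERENCE_HH) / sd
    recomputed.append((sample, chrom, z, reported))
    print(f"{sample} {chrom}: recomputed {z:.3f}, reported {reported}")

# Largest gap between recomputed and reported Z values.
max_gap = max(abs(z - reported) for _, _, z, reported in recomputed)
```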
The Z values and judgment results for chromosome 13, chromosome 18 and chromosome 21 of all samples, obtained by the conventional detection method (fixed standard deviation, FSD) and by the method of the present invention (adaptive standard deviation, ASD), are shown in Table 3.
TABLE 3
[Table 3 appears only as images in the original publication: Figures BDA0002828736600000131, BDA0002828736600000141, BDA0002828736600000151 and BDA0002828736600000161.]
Based on the data in Table 3, the sample validation statistics of the two methods are shown in Table 4.
TABLE 4
| Method | Number of false negative samples | Number of false positive samples |
|---|---|---|
| Conventional method | 8 | 3 |
| Method of the invention | 5 | 0 |
As can be seen from the results in Tables 3 and 4, 3 samples that were false negatives under the conventional detection method were judged as true positives by the detection of the present invention. In addition, the 3 samples judged as false positives by the conventional method were judged as true negatives for chromosomal abnormality by the present invention. In summary, the model based on logistic regression adaptive learning parameters provided by the present invention can improve the detection rate of true positive (TP) samples of chromosome abnormality and reduce the number of false positive (FP) samples.
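The counts in Table 4 can be turned into detection rates using the 135 positive and 3 negative samples stated at the start of this example. The rates below are derived arithmetic, not figures reported in the text, and the function name is hypothetical; with only 3 negative samples the specificity figures are indicative at best.

```python
POSITIVES = 135  # true positive samples in the validation set
NEGATIVES = 3    # negative samples in the validation set

# (false negatives, false positives) from Table 4
TABLE_4 = {
    "conventional": (8, 3),
    "invention": (5, 0),
}


def detection_rates(false_neg, false_pos):
    """Return (sensitivity, specificity) for the given error counts."""
    sensitivity = (POSITIVES - false_neg) / POSITIVES
    specificity = (NEGATIVES - false_pos) / NEGATIVES
    return sensitivity, specificity


rates = {name: detection_rates(fn, fp) for name, (fn, fp) in TABLE_4.items()}
for name, (sens, spec) in rates.items():
    print(f"{name}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```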
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (8)

1. An apparatus for noninvasive prenatal detection of chromosomal abnormalities, comprising: a detection module, a data quality control module, a data preprocessing module, a calculation module and a judging module;
the calculation module comprises: a chromosome hybridization ratio calculation unit, a chromosome adaptive correction standard deviation calculation unit and a chromosome Z value calculation unit;
the chromosome adaptive correction standard deviation calculation unit performs logistic regression according to the relation between the sequencing depth of the reference set sample and the standard deviation of the chromosome hybridization ratio to obtain a fitting function, and substitutes the sequencing depth of the pregnant woman sample to be detected based on the fitting function to obtain the chromosome adaptive correction standard deviation of the sample to be detected;
the fitting function is an R function: fit = lm(as.vector(y) ~ x + I(x^2)), y(test) = predict(fit, data.frame(x = x(test)))[[1]], wherein fit represents the fitting function, lm represents fitting a linear model in the R language, x represents the sequencing depths of all reference set samples, y represents the standard deviations of the chromosome hybridization ratio of all reference set samples, x(test) represents the sequencing depth of the pregnant woman sample to be detected, and y(test) represents the standard deviation of the chromosome hybridization ratio of the pregnant woman sample to be detected calculated from the fitting function;
chromosome hybridization ratio calculation unit: calculating the mean of the number of reads, preprocessed by the data preprocessing module, over all sliding windows in a chromosome, and recording it as the hybridization ratio hh of the chromosome;
chromosome Z value calculation unit: calculating the Z value of each chromosome according to the adaptive correction standard deviation of the chromosome of the sample to be detected:
$$\mathrm{Zscore} = \frac{\text{Sample hh} - \text{Reference hh}}{\text{Adaptive sd}}$$
wherein Zscore is the Z value of a given chromosome of the pregnant woman sample to be detected;
Sample hh: the hybridization ratio of that chromosome in the pregnant woman sample to be detected;
Reference hh: the hybridization ratio of that chromosome in the reference set samples;
Adaptive sd: the adaptively corrected standard deviation of that chromosome based on logistic regression.
2. The apparatus of claim 1, wherein the data preprocessing module: divides the quality-controlled genome of the pregnant woman sample to be detected into sliding windows of 20 kb to 100 kb, with an overlapping sequence of 10 kb to 50 kb between two adjacent windows, and calculates the number of reads RC in each sliding window;
performs mean standardization correction on the number of reads RC in all sliding windows of the genome of the pregnant woman sample to be detected;
performs GC correction on the standardized read numbers RNor in all sliding windows of the genome of the pregnant woman sample to be detected by using the smoothing spline method, to obtain the GC-corrected number of reads RGc;
performs baseline correction on the GC-corrected number of reads RGc in all sliding windows of the genome of the pregnant woman sample to be detected.
3. The apparatus of claim 2, wherein the quality-controlled genome of the pregnant woman sample to be detected is divided into sliding windows of 100 kb, with an overlapping sequence of 50 kb between two adjacent windows.
4. The apparatus of claim 2, wherein the number of reads in each sliding window after standardized correction is RNor = RC/(total/BinT), wherein total is the total number of genome reads of the pregnant woman sample to be detected and BinT is the total number of sliding windows (T).
5. The apparatus of claim 2 or 4, wherein the GC correction is implemented by the R command: RGc = smooth.spline(RNor[ord]).
6. The apparatus of claim 2, wherein
the detection module: performs high-throughput sequencing on the peripheral blood cell-free DNA of the pregnant woman to obtain the genome of the pregnant woman sample to be detected.
7. The apparatus of claim 2, wherein
the data quality control module: aligns the sequenced genome of the pregnant woman sample to be detected to the human genome hg19, and performs quality control requiring the number of reads in the file to be greater than 2M.
8. The apparatus of claim 2, wherein
the judging module: judges whether a chromosome is abnormal according to the Z value; the chromosome is judged abnormal when the Z value is greater than 3 and normal when the Z value is less than 3.
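The R call recited in claim 1, lm with the terms x and I(x^2), is an ordinary least-squares fit of a quadratic in the sequencing depth. A minimal pure-Python sketch of such a fit follows, solving the normal equations by Gaussian elimination; the function names and the toy data are hypothetical and do not come from the reference set of Example 1.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = a0 + a1*x + a2*x^2 via the normal equations."""
    power_sums = [sum(xi ** k for xi in x) for k in range(5)]
    rhs = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    # Augmented 3x4 matrix of the normal equations.
    m = [[float(power_sums[i + j]) for j in range(3)] + [float(rhs[i])] for i in range(3)]
    # Forward elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(m[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (m[i][3] - known) / m[i][i]
    return coef


def predict(coef, x_test):
    """Evaluate the fitted quadratic at the sequencing depth of a test sample."""
    a0, a1, a2 = coef
    return a0 + a1 * x_test + a2 * x_test ** 2


# Toy data generated from y = 1 + 2x + 3x^2 (hypothetical).
coef = fit_quadratic([1, 2, 3, 4, 5], [6, 17, 34, 57, 86])
y_test = predict(coef, 6)
```

In practice x would hold the sequencing depths of the reference set samples and y the standard deviations of the chromosome hybridization ratio, with one fit per chromosome.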
CN202011455655.XA 2020-12-10 2020-12-10 Noninvasive prenatal chromosome abnormality detection device Active CN112522387B (en)


Publications (2)

Publication Number Publication Date
CN112522387A CN112522387A (en) 2021-03-19
CN112522387B true CN112522387B (en) 2022-05-20

Family

ID=74999008


Country Status (1)

Country Link
CN (1) CN112522387B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116240273B (en) * 2023-04-19 2023-08-15 北京优迅医学检验实验室有限公司 Method for judging pollution proportion of parent source based on low-depth whole genome sequencing and application thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256296A (en) * 2017-12-29 2018-07-06 北京科迅生物技术有限公司 Data processing method and device
CN110993029A (en) * 2019-12-26 2020-04-10 北京优迅医学检验实验室有限公司 Method and system for detecting chromosome abnormality
CN111868254A (en) * 2018-04-09 2020-10-30 深圳华大生命科学研究院 Construction method and application of gene library

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant