Detailed Description
The following describes embodiments of the present invention in detail. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
It should be noted that the terms "primary", "secondary" and "final" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "primary," "secondary," or "final" may explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, "a plurality" means two or more unless otherwise specified. The "uniquely aligned sequence" and "uniquely aligned and sequenced sequence" in the present invention may sometimes be referred to as "sequence" or "sequenced sequence".
The term "maternal sample" refers herein to a biological sample obtained from a pregnant subject, e.g., a woman.
The term "microdeletion/microduplication" refers to a deletion or duplication on a chromosome that is 1.5 kb to 10 Mb in length.
The term "GC correction" refers to a correction for GC content in a sequence.
Referring to fig. 1, the present invention provides a method for determining fetal chromosomal microdeletion/microduplication, comprising:
S1, obtaining the concentration fm of the fragment containing the microdeletion/microduplication;
S2, obtaining the male fetal nucleic acid concentration fy or the female fetal nucleic acid concentration fs;
S3, calculating the ratio rmY of fm to fy, rmY = fm/fy, or calculating the ratio rms of fm to fs, rms = fm/fs;
S4, filtering out false positives according to the deletion copy number or the duplication copy number;
S5, taking the decimal part dmY of rmY or the decimal part dms of rms, and judging whether dmY or dms is positive; if not, the result is filtered out;
S6, calculating the sum amY of fm and fy, amY = fm + fy, or calculating the sum ams of fm and fs, ams = fm + fs;
S7, filtering the microdeletion/microduplication fragments according to a determination rule, the fragments remaining after filtering being the fetal chromosomal microdeletion/microduplication fragments.
The inventors have surprisingly found that chromosomal microdeletions/microduplications can be accurately determined using the method of the invention, and that the method is particularly suitable for determining fetal chromosomal microdeletions/microduplications from the peripheral blood of pregnant women.
Referring to fig. 2, according to an embodiment of the present invention, the concentration fm of the fragment containing the microdeletion/microduplication in step S1 is obtained by:
S11, obtaining the primary windows without microdeletion/microduplication from the primary windows containing microdeletion/microduplication, and calculating the total number of sequences in, and the total number of, the primary windows containing microdeletion/microduplication, as well as the total number of sequences in, and the total number of, the primary windows without microdeletion/microduplication;
S12, obtaining the average depth d1 of the primary windows containing microdeletion/microduplication, d1 = (total number of sequences in the primary windows containing microdeletion/microduplication) / (total number of primary windows containing microdeletion/microduplication);
S13, obtaining the average depth d2 of the primary windows without microdeletion/microduplication, d2 = (total number of sequences in the primary windows without microdeletion/microduplication) / (total number of primary windows without microdeletion/microduplication);
S14, calculating the concentration fm of the fragment containing the microdeletion/microduplication, fm = 2 × |d2 - d1| / d2.
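As a minimal sketch of the arithmetic in steps S12 to S14, the following Python function computes d1, d2 and fm from per-window sequence counts. The function name `fragment_concentration` and the list-based input layout are illustrative assumptions and are not part of the described method.

```python
def fragment_concentration(counts_with_cnv, counts_without_cnv):
    """Return (d1, d2, fm) from per-window sequence counts.

    counts_with_cnv: sequence counts of the primary windows that contain
        the microdeletion/microduplication (hypothetical input layout).
    counts_without_cnv: sequence counts of the primary windows without it.
    """
    if not counts_with_cnv or not counts_without_cnv:
        raise ValueError("both window sets must be non-empty")
    # S12: average depth of the windows containing the event
    d1 = sum(counts_with_cnv) / len(counts_with_cnv)
    # S13: average depth of the windows without the event
    d2 = sum(counts_without_cnv) / len(counts_without_cnv)
    if d2 == 0:
        raise ValueError("d2 is zero, fm is undefined")
    # S14: fm = 2 * |d2 - d1| / d2
    fm = 2 * abs(d2 - d1) / d2
    return d1, d2, fm
```

For instance, with an average depth of 95 inside the affected windows and 100 outside them, the sketch yields fm = 2 × 5 / 100 = 0.1.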
As will be understood by those skilled in the art, the total number of primary windows without microdeletion/microduplication, and the total number of sequences in them, can be derived from the final windows containing microdeletion/microduplication. For example, a final window has absolute coordinates for its start and end positions; the corresponding secondary windows are located from these coordinates, and the primary windows contained in those secondary windows are then identified. The first and last primary windows are removed to reduce data fluctuation, which gives the final set of primary windows, from which the total number of sequences is calculated.
Referring to fig. 3, according to one embodiment of the present invention, the final window containing microdeletion/microduplication is obtained by:
S111, performing nucleic acid sequencing on a biological sample containing free nucleic acid to obtain a sequencing result consisting of a plurality of sequencing data;
S112, aligning the sequencing result to a reference genome to construct a set of uniquely aligned sequencing sequences, wherein each sequencing sequence in the set matches only one position of the reference genome;
S113, determining the length of each uniquely aligned sequencing sequence in the set;
S114, dividing the reference genome into a plurality of primary windows of a predetermined length, wherein the predetermined length is 1 bp to 5 Mb;
S115, counting the number of uniquely aligned sequencing sequences falling into each primary window;
S116, performing GC correction on the number of sequences falling into each primary window, and performing batch-to-batch adjustment on the corrected result;
S117, combining a predetermined number of adjacent primary windows into a plurality of secondary windows, and determining the number of sequences in each secondary window;
S118, performing a statistical test on each secondary window, calculating a T1 value, and filtering the secondary windows according to the T1 value;
S119, performing a statistical test on the filtered secondary windows, calculating a T2 value, and merging two adjacent secondary windows without significant difference into a final window according to the T2 value;
S120, repeating steps S118 to S119 until no further merging is possible;
S121, performing a hypothesis test on the final windows obtained by the last merging to obtain the final windows containing microdeletion/microduplication.
According to one embodiment of the invention, the biological sample containing free nucleic acids is peripheral blood of a pregnant woman, and the free nucleic acids include free fetal nucleic acids.
According to one embodiment of the invention, the nucleic acid is DNA.
According to one embodiment of the invention, the sequencing result comprises the length and base sequence order of the free nucleic acid. The "length" refers to the length of a nucleic acid, and may be in units of base pairs, i.e., bp.
According to one embodiment of the invention, the sequencing is double-ended sequencing, single-ended sequencing or single molecule sequencing. This facilitates determining the length of the free nucleic acid, which in turn facilitates the subsequent steps.
As will be appreciated by those skilled in the art, free fetal DNA in a blood sample is relatively short, and the length of every free DNA molecule must be obtained. Single-ended sequencing therefore has to read through the entire free DNA molecule; otherwise double-ended sequencing is used.
According to an embodiment of the present invention, the predetermined length in step S114 is 1 bp to 5 Mb, and the predetermined number in step S117 is 5 to 100. Preferably, the predetermined length is 20 to 40 kb.
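The windowing in steps S114, S115 and S117 can be sketched as follows. This is an illustrative sketch only: it assumes a single chromosome, assigns each sequence to a window by its 1-based start position, and keeps a shorter trailing group when the number of primary windows is not a multiple of the group size. The function names are hypothetical.

```python
import math


def count_primary_windows(positions, chrom_length, window_size):
    """Count sequences per primary window (S114 and S115).

    positions: 1-based start positions of uniquely aligned sequences
        (assumption: one position per sequence, single chromosome).
    """
    n_windows = math.ceil(chrom_length / window_size)
    counts = [0] * n_windows
    for pos in positions:
        if 1 <= pos <= chrom_length:
            counts[(pos - 1) // window_size] += 1
    return counts


def merge_into_secondary(primary_counts, windows_per_group):
    """Sum each run of adjacent primary windows into one secondary window (S117).

    Assumption: a trailing group with fewer windows is kept as it is.
    """
    return [
        sum(primary_counts[i:i + windows_per_group])
        for i in range(0, len(primary_counts), windows_per_group)
    ]
```

With a window size of 20 and three primary windows per secondary window, positions 1, 5, 20, 21, 40, 41 and 100 on a region of length 120 give primary counts of 3, 2, 1, 0, 1, 0 and secondary counts of 6 and 1.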
According to one embodiment of the present invention, the GC correction method includes using a local weighted regression method, a linear regression method or a logistic regression method.
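Of the three GC correction options named above, the linear regression variant is the simplest to illustrate. The sketch below is one possible reading, not the method as implemented by the inventors: it fits the window counts against the GC fraction of each window by ordinary least squares and rescales each count by the ratio of the overall mean to the fitted value. The rescaling rule, the function name and the inputs are assumptions.

```python
def gc_correct_linear(counts, gc_fractions):
    """Rescale window counts by a linear fit of count against GC fraction.

    Assumed rule: corrected = count * mean(count) / fitted(count).
    """
    n = len(counts)
    if n < 2 or n != len(gc_fractions):
        raise ValueError("need two or more windows with matching GC values")
    mean_gc = sum(gc_fractions) / n
    mean_count = sum(counts) / n
    sxx = sum((g - mean_gc) ** 2 for g in gc_fractions)
    if sxx == 0:
        # All windows share one GC value, so there is nothing to correct.
        return [float(c) for c in counts]
    sxy = sum((g - mean_gc) * (c - mean_count)
              for g, c in zip(gc_fractions, counts))
    slope = sxy / sxx
    intercept = mean_count - slope * mean_gc
    corrected = []
    for c, g in zip(counts, gc_fractions):
        fitted = intercept + slope * g
        # Leave the count unchanged if the fitted value is not positive.
        corrected.append(c * mean_count / fitted if fitted > 0 else float(c))
    return corrected
```

When the counts depend on GC content in an exactly linear way, every corrected count equals the overall mean, which is the intended effect of removing the GC trend.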
According to one embodiment of the invention, the batch-to-batch adjustment is to calculate a baseline for each primary window corresponding to all samples in the sequenced batch, and perform weighted correction on the number of uniquely aligned sequencing sequences in each primary window according to the baseline.
According to an embodiment of the invention, the T1 value in step S118 is calculated by a Z test or a T test, and the filtering removes the secondary windows whose T1 value is between -3 and 3.
According to an embodiment of the present invention, the T2 value in step S119 is calculated by a rank sum test, a sign test or a run test, and two adjacent windows are regarded as having no significant difference when their T2 value is between -3 and 3.
According to an embodiment of the present invention, the hypothesis test in step S121 is calculated by a Z test or a T test, and the test threshold is 3. That is, when the test statistic is > 3 or < -3, the final window is judged to contain a microdeletion/microduplication.
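The Z test variant of the T1 filter can be sketched as below. The description does not state which reference distribution the statistic is computed against, so the baseline mean and standard deviation are passed in as parameters; that choice, and the function name, are assumptions made for illustration.

```python
def filter_secondary_windows(counts, baseline_mean, baseline_sd, threshold=3.0):
    """Return (index, z) for secondary windows whose z score lies outside
    the interval from -threshold to +threshold.

    Assumption: baseline_mean and baseline_sd describe the expected count
    of an unaffected secondary window, e.g. from reference samples.
    """
    if baseline_sd <= 0:
        raise ValueError("baseline_sd must be positive")
    kept = []
    for index, count in enumerate(counts):
        z = (count - baseline_mean) / baseline_sd
        if abs(z) > threshold:
            kept.append((index, z))
    return kept
```

Windows with a z score between -3 and 3 are dropped, so only windows that deviate markedly from the baseline are carried forward to the merging step.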
Referring to fig. 4, according to an embodiment of the present invention, the male fetal nucleic acid concentration fy in step S2 is obtained by:
S211, sequencing a biological sample containing free nucleic acid to obtain a sequencing result consisting of a plurality of sequencing data;
S212, determining, from the sequencing result, the number of uniquely aligned sequencing sequences falling into each primary window on the Y chromosome in the sample;
S213, counting the sum of the numbers of uniquely aligned sequencing sequences in the primary windows on the Y chromosome, and the total number of those primary windows;
S214, obtaining the average depth dy of the primary windows on the Y chromosome, dy = (total number of uniquely aligned sequencing sequences on the Y chromosome) / (number of primary windows on the Y chromosome);
S215, obtaining the male fetal nucleic acid concentration fy, fy = 2 × dy / d2, where d2 is the average depth of the primary windows without microdeletion/microduplication, d2 = (total number of sequences in the primary windows without microdeletion/microduplication) / (total number of primary windows without microdeletion/microduplication).
As will be understood by those skilled in the art, the total number of primary windows without microdeletion/microduplication, and the total number of sequences in them, can be derived from the final windows containing microdeletion/microduplication.
According to an embodiment of the present invention, step S212 further includes: dividing the reference genome into a plurality of primary windows of a predetermined length, and removing the primary windows on the Y chromosome in which the number of uniquely aligned sequences is more than 5 times the average number of sequences. Preferably, the primary windows are adjusted by GC correction.
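A minimal sketch of the fy calculation, including the removal of over-covered Y chromosome windows, is given below. It assumes that the "average number of sequences" used for the 5-fold cut-off is the mean count over the Y chromosome primary windows; the description does not say which average is meant. The function name is hypothetical.

```python
def male_fetal_concentration(y_counts, d2, outlier_factor=5.0):
    """Estimate fy = 2 * dy / d2 from Y chromosome window counts.

    y_counts: sequence counts of the primary windows on the Y chromosome.
    d2: average depth of the primary windows without
        microdeletion/microduplication.
    Assumption: windows with more than outlier_factor times the mean
    Y window count are removed before dy is computed.
    """
    if not y_counts or d2 <= 0:
        raise ValueError("need Y window counts and a positive d2")
    mean_count = sum(y_counts) / len(y_counts)
    kept = [c for c in y_counts if c <= outlier_factor * mean_count]
    dy = sum(kept) / len(kept)
    return 2 * dy / d2
```

For nine Y windows with 2 sequences each and one window with 82, the mean is 10, the 82-count window exceeds 5 × 10 and is removed, dy is 2, and with d2 = 40 the sketch gives fy = 0.1.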
Referring to fig. 5, according to an embodiment of the present invention, the female fetal nucleic acid concentration fs in step S2 is obtained by:
S221, sequencing a biological sample containing free nucleic acid to obtain a sequencing result consisting of a plurality of sequencing data;
S222, determining, from the sequencing result, the number of uniquely aligned sequencing sequences in the biological sample whose length falls within a predetermined range;
S223, determining the frequency of uniquely aligned sequencing sequences within the predetermined range based on the number of uniquely aligned sequencing sequences whose length falls within the predetermined range;
S224, determining the female fetal nucleic acid concentration fs in the sample from the frequency of uniquely aligned sequencing sequences within the predetermined range according to a predetermined function.
Referring to fig. 6, according to an embodiment of the present invention, the predetermined range in step S222 is determined by the following steps:
S2221, determining the lengths of the uniquely aligned sequencing sequences contained in a plurality of control samples;
S2222, setting a plurality of candidate length ranges, and determining, for each of the control samples, the frequency of uniquely aligned sequencing sequences occurring in each candidate length range;
S2223, determining, for each candidate length range, a correlation coefficient between the frequency of uniquely aligned sequencing sequences occurring in that range in the control samples and the nucleic acid concentration in the control samples;
S2224, selecting at least one candidate length range, or a combination of candidate length ranges, as the predetermined range based on the values of the correlation coefficients.
According to one embodiment of the invention, the predetermined range is determined on the basis of a plurality of control samples whose nucleic acid concentration is known, preferably on the basis of at least 20 control samples.
According to one embodiment of the invention, the control sample is a sample of peripheral blood of a pregnant woman carrying a normal male fetus with a known proportion of free fetal nucleic acid, and the concentration of nucleic acid in the control sample is determined using the Y chromosome.
According to one embodiment of the present invention, the concentration of free fetal nucleic acid in the control sample is determined using the Y chromosome, i.e., by the method of the present invention for determining the concentration of male fetal nucleic acid fy as described above.
According to an embodiment of the present invention, the span of the candidate length range in S2222 is 1-300 bp, preferably 1-20 bp.
According to an embodiment of the invention, the step sizes of the plurality of candidate length ranges are 1-2 bp.
For example, the candidate length ranges are 1-20 bp, 2-21 bp, 3-22 bp, ..., where the span is 20 bp and the step size is 1 bp.
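The search over candidate length ranges in steps S2221 to S2224 can be sketched as follows. The description does not name the correlation coefficient or the selection rule, so this sketch assumes a Pearson coefficient and picks the single range with the largest absolute value; both choices, and all function names, are illustrative assumptions.

```python
import math


def candidate_ranges(min_len, max_len, span, step=1):
    """Sliding inclusive length ranges, e.g. (1, 20), (2, 21), (3, 22)."""
    return [(start, start + span - 1)
            for start in range(min_len, max_len - span + 2, step)]


def frequency_in_range(lengths, low, high):
    """Fraction of sequences whose length lies in [low, high]."""
    return sum(low <= n <= high for n in lengths) / len(lengths)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_range(control_lengths, control_concentrations, ranges):
    """Pick the range whose frequency correlates most strongly (by |r|)
    with the known concentrations of the control samples."""
    def strength(rng):
        freqs = [frequency_in_range(lengths, rng[0], rng[1])
                 for lengths in control_lengths]
        return abs(pearson(freqs, control_concentrations))
    return max(ranges, key=strength)
```

Calling `candidate_ranges(1, 22, 20, 1)` reproduces the ranges 1-20, 2-21 and 3-22 from the example above.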
According to an embodiment of the present invention, the predetermined range in the step S222 is 179bp to 206 bp.
Referring to fig. 7, according to an embodiment of the present invention, the predetermined function in step S224 is obtained by:
S2231, determining, for each of the plurality of control samples, the frequency of uniquely aligned sequencing sequences occurring within the predetermined range;
S2232, fitting the frequencies of uniquely aligned sequencing sequences occurring within the predetermined range in the plurality of control samples to the known nucleic acid concentrations to determine the predetermined function.
According to one embodiment of the invention, the fitting is a linear fitting.
According to one embodiment of the invention, the predetermined function is d = 0.3215 × p + 1.62562, wherein d represents the nucleic acid concentration and p represents the frequency of uniquely aligned sequencing sequences occurring within the predetermined range.
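The linear fitting of steps S2231 and S2232, and the use of the fitted function in step S224, can be sketched as below. The sketch uses ordinary least squares and treats the frequency p as a fraction between 0 and 1; the description does not state the units of p or d, so the coefficients produced by this sketch are not directly comparable with those quoted above. Function names are hypothetical.

```python
def fit_line(frequencies, concentrations):
    """Ordinary least squares fit of concentration = slope * p + intercept."""
    n = len(frequencies)
    if n < 2 or n != len(concentrations):
        raise ValueError("need two or more control samples")
    mean_p = sum(frequencies) / n
    mean_d = sum(concentrations) / n
    sxx = sum((p - mean_p) ** 2 for p in frequencies)
    if sxx == 0:
        raise ValueError("frequencies are all identical")
    sxy = sum((p - mean_p) * (d - mean_d)
              for p, d in zip(frequencies, concentrations))
    slope = sxy / sxx
    return slope, mean_d - slope * mean_p


def estimate_concentration(lengths, length_range, slope, intercept):
    """Apply the fitted line to one sample.

    p is the fraction of sequences whose length lies in length_range
    (inclusive bounds are an assumption).
    """
    low, high = length_range
    p = sum(low <= n <= high for n in lengths) / len(lengths)
    return slope * p + intercept
```

In use, `fit_line` would be run once on the control samples, and `estimate_concentration` would then be applied to each test sample with the chosen range, for example (179, 206).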
According to an embodiment of the present invention, the ratio rmY includes a ratio rmY1 calculated from the deletion copy number and a ratio rmY2 calculated from the duplication copy number, and the ratio rms includes a ratio rms1 calculated from the deletion copy number and a ratio rms2 calculated from the duplication copy number. Step S4 further includes: if rmY1 ≥ 2 for a deletion or rmY2 ≥ 6 for a duplication, the determination is considered unreliable and the result is filtered out as a false positive;
or, if rms1 ≥ 2 for a deletion or rms2 ≥ 6 for a duplication, the determination is considered unreliable and the result is filtered out as a false positive.
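The false positive filter of step S4 reduces to a threshold check on the ratio, with a different limit for deletions and duplications. A minimal sketch follows; the labels 'deletion' and 'duplication' and the function name are assumptions introduced for illustration. The same limits apply to both the male (rmY) and female (rms) ratios.

```python
def is_reliable(ratio, event_type):
    """Return False when the ratio marks the call as a false positive.

    ratio: rmY or rms, i.e. fm divided by the fetal nucleic acid
        concentration.
    event_type: 'deletion' or 'duplication' (hypothetical labels).
    """
    limits = {"deletion": 2.0, "duplication": 6.0}
    if event_type not in limits:
        raise ValueError("event_type must be 'deletion' or 'duplication'")
    # A ratio at or above the limit is treated as unreliable.
    return ratio < limits[event_type]
```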
According to an embodiment of the present invention, the step S5 further includes: dmY is positive if dmY <0.13 or dmY > 0.85;
alternatively, dms is positive if dms <0.15 or dms > 0.791.
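The decimal part test of step S5 can be sketched as follows, assuming the ratio is non-negative, which holds when fm and the fetal concentration are both positive. The constant and function names are illustrative.

```python
import math

# Thresholds taken from the embodiment above: (lower, upper).
MALE_LIMITS = (0.13, 0.85)
FEMALE_LIMITS = (0.15, 0.791)


def is_positive(ratio, fetus_is_male):
    """Return True when the decimal part of the ratio is below the lower
    limit or above the upper limit, i.e. the ratio is close to an integer.
    """
    low, high = MALE_LIMITS if fetus_is_male else FEMALE_LIMITS
    decimal_part = math.modf(ratio)[0]
    return decimal_part < low or decimal_part > high
```

A ratio of 1.05 for a male fetus has a decimal part of about 0.05 and is judged positive, whereas a ratio of 1.5 has a decimal part of 0.5 and is filtered out.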
According to an embodiment of the present invention, the determination rule in step S7 is: if amY is between 0.95 and 1.05, the microdeletion/microduplication fragment is considered to be of maternal origin and is filtered out;
or, if ams is between 0.93 and 1.06, the microdeletion/microduplication fragment is considered to be of maternal origin and is filtered out.
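Steps S6 and S7 can be sketched together: the sum of fm and the fetal concentration is formed and compared with the interval for the fetal sex. The sketch treats the interval bounds as inclusive, which the description does not specify, and the function name is hypothetical.

```python
def is_maternal_origin(fm, fetal_concentration, fetus_is_male):
    """Return True when fm + fetal concentration falls in the interval
    that marks the fragment as maternal (bounds assumed inclusive).
    """
    low, high = (0.95, 1.05) if fetus_is_male else (0.93, 1.06)
    total = fm + fetal_concentration  # amY or ams
    return low <= total <= high
```

Fragments for which this check returns True are filtered out; the remaining fragments are reported as fetal microdeletions/microduplications.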
Referring to fig. 8, in one aspect, the present invention also provides an apparatus 100 for determining fetal chromosomal microdeletion/microduplication, comprising:
a microdeletion/microduplication fragment concentration calculating means 110 for obtaining the concentration fm of the fragment containing the microdeletion/microduplication;
a fetal nucleic acid concentration obtaining means 120 for obtaining a male fetal nucleic acid concentration fy or a female fetal nucleic acid concentration fs;
a ratio calculating means 130 for calculating the ratio rmY of fm to fy, rmY = fm/fy, or for calculating the ratio rms of fm to fs, rms = fm/fs;
a first filtering means 140 for filtering out false positives according to the deletion copy number or the duplication copy number;
a second filtering means 150 for taking the decimal part dmY of rmY or the decimal part dms of rms and judging whether dmY or dms is positive; if not, the result is filtered out;
a sum calculating means 160 for calculating the sum amY of fm and fy, amY = fm + fy, or for calculating the sum ams of fm and fs, ams = fm + fs;
and a third filtering means 170 for filtering the microdeletion/microduplication fragments according to a determination rule, the fragments remaining after filtering being the fetal chromosomal microdeletion/microduplication fragments.
Referring to fig. 9, the microdeletion/microduplication fragment concentration calculating means 110 further comprises:
a primary window obtaining unit 111 for obtaining the primary windows without microdeletion/microduplication from the primary windows containing microdeletion/microduplication, and calculating the total number of sequences in, and the total number of, the primary windows containing microdeletion/microduplication, as well as the total number of sequences in, and the total number of, the primary windows without microdeletion/microduplication;
a first average depth obtaining unit 112 for obtaining the average depth d1 of the primary windows containing microdeletion/microduplication, d1 = (total number of sequences in the primary windows containing microdeletion/microduplication) / (total number of primary windows containing microdeletion/microduplication);
a second average depth obtaining unit 113 for obtaining the average depth d2 of the primary windows without microdeletion/microduplication, d2 = (total number of sequences in the primary windows without microdeletion/microduplication) / (total number of primary windows without microdeletion/microduplication);
a microdeletion/microduplication fragment concentration obtaining means 114 for calculating the concentration fm of the fragment containing the microdeletion/microduplication, fm = 2 × |d2 - d1| / d2.
According to an embodiment of the present invention, the microdeletion/microduplication fragment concentration calculating means 110 further includes a final window obtaining unit 115 for obtaining the final windows containing microdeletion/microduplication, the final window obtaining unit 115 including:
a first sequencing element 1151 for performing nucleic acid sequencing on a biological sample containing free nucleic acids so as to obtain a sequencing result consisting of a plurality of sequencing data;
an alignment element 1152 for aligning the sequencing result to a reference genome to construct a set of unique aligned sequencing sequences, each sequencing sequence in the set of unique aligned sequencing sequences being capable of matching only one position of the reference genome;
a length determining element 1153 for determining the length of each unique aligned sequencing sequence in the set of unique aligned sequencing sequences;
a primary window determining element 1154 for dividing the reference genome into a plurality of primary windows of a predetermined length, wherein the predetermined length is 1 bp to 5 Mb;
a first statistics element 1155 for counting the number of uniquely aligned sequencing sequences falling into each primary window;
a correction element 1156 for performing GC correction on the number of sequences falling in the primary window and performing batch-to-batch adjustment on the corrected result;
a first combining element 1157 for combining a predetermined number of adjacent primary windows into a plurality of secondary windows, determining the number of sequences in each secondary window;
a first filter element 1158 for performing a statistical test on each secondary window, calculating a T1 value, and filtering the secondary window according to the T1 value;
a second merging element 1159 for performing a statistical test on the filtered secondary windows, calculating a T2 value, and merging two adjacent secondary windows without significant difference into a final window according to the T2 value;
a repeat element 1160 for repeatedly activating the first filter element 1158 and the second merging element 1159 until no further merging is possible;
a microdeletion/microduplication final window determining element 1161 for performing a hypothesis test on the final windows obtained by the last merging to obtain the final windows containing microdeletion/microduplication.
According to an embodiment of the present invention, the predetermined number in the first combining element 1157 is 5 to 100. Preferably, the predetermined length is 20 to 40 kb.
According to one embodiment of the invention, the method of GC correction in the correction element 1156 includes using a local weighted regression method, a linear regression method or a logistic regression method.
According to one embodiment of the invention, the batch-to-batch adjustment in the correction element 1156 is to calculate a baseline for each primary window corresponding to all samples in the sequenced batch, and to perform a weighted correction of the number of unique aligned sequencing sequences in each primary window based on the baseline.
According to one embodiment of the invention, the T1 value in the first filter element 1158 is calculated by a Z test or a T test, and the filtering removes the secondary windows whose T1 value is between -3 and 3.
According to an embodiment of the present invention, the T2 value in the second merging element 1159 is calculated by a rank sum test, a sign test or a run test, and two adjacent windows are regarded as having no significant difference when their T2 value is between -3 and 3.
According to one embodiment of the present invention, the hypothesis test in the microdeletion/microduplication final window determining element 1161 is calculated by a Z test or a T test, and the test threshold is 3. That is, when the test statistic is > 3 or < -3, the final window is judged to contain a microdeletion/microduplication.
According to an embodiment of the present invention, the fetal nucleic acid concentration obtaining apparatus 120 further includes a male fetal nucleic acid concentration fy obtaining unit 121, see fig. 11, the male fetal nucleic acid concentration fy obtaining unit 121 including:
a second sequencing element 1211, for sequencing the biological sample containing the free nucleic acid to obtain a sequencing result composed of a plurality of sequencing data;
a first number determining element 1212 for determining, from the sequencing result, a number of uniquely aligned sequencing sequences in the Y chromosome in the biological sample that fall within a primary window;
a second statistical element 1213 for counting the sum of the number of unique aligned sequenced sequences in each primary window on the Y chromosome and the total number of said primary windows;
an average depth obtaining element 1214 for obtaining the average depth dy of the primary windows on the Y chromosome, dy = (total number of uniquely aligned sequencing sequences on the Y chromosome) / (total number of primary windows on the Y chromosome);
a male fetal nucleic acid concentration obtaining element 1215 for obtaining the male fetal nucleic acid concentration fy, fy = 2 × dy / d2, where d2 is the average depth of the primary windows without microdeletion/microduplication, d2 = (total number of sequences in the primary windows without microdeletion/microduplication) / (total number of primary windows without microdeletion/microduplication).
According to an embodiment of the present invention, the first number determining element 1212 further comprises a filtering module 12121 for dividing the reference genome into a plurality of primary windows of a predetermined length, and removing the primary windows on the Y chromosome in which the number of uniquely aligned sequences is more than 5 times the average number of sequences.
According to an embodiment of the present invention, the fetal nucleic acid concentration obtaining apparatus 120 further includes a female fetal nucleic acid concentration fs obtaining unit 122, see fig. 12, the female fetal nucleic acid concentration fs obtaining unit 122 including:
a third sequencing element 1221, configured to sequence a biological sample containing free nucleic acids, and obtain a sequencing result composed of a plurality of sequencing data;
a second number determining element 1222 for determining, from the sequencing result, the number of unique aligned sequencing sequences in the biological sample having a length falling within a predetermined range;
a frequency determining element 1223 configured to determine a frequency of occurrence of unique aligned sequenced sequences within the predetermined range based on the number of unique aligned sequenced sequences whose length falls within the predetermined range;
a female fetal nucleic acid concentration determining element 1224 for determining a female fetal nucleic acid concentration fs in the sample according to a predetermined function based on the frequency of uniquely aligned sequenced sequences present within the predetermined range.
According to an embodiment of the present invention, the female fetal nucleic acid concentration fs obtaining unit 122 further includes a predetermined range determining unit 1225. Referring to fig. 13, the predetermined range determining unit 1225 further includes:
a length determination module 12251 for determining the length of the uniquely aligned sequencing sequences contained in the plurality of control samples;
a first frequency determination module 12252, configured to set a plurality of candidate length ranges, and determine frequencies of unique aligned sequencing sequences of the plurality of control samples appearing in the respective candidate length ranges, respectively;
a correlation coefficient determining module 12253 for determining a correlation coefficient for each of the candidate length ranges with the concentration of nucleic acid in the control sample based on the frequency of occurrence of uniquely aligned sequenced sequences in each of the candidate length ranges for the plurality of control samples and the concentration of nucleic acid in the control samples;
a predetermined range determination module 12254 for determining at least one candidate length range or a candidate length range combination as the predetermined range based on the value of the correlation coefficient.
According to one embodiment of the invention, the predetermined range is determined on the basis of a plurality of control samples in which the concentration of nucleic acids is known, preferably the predetermined range is determined on the basis of at least 20 control samples.
According to one embodiment of the invention, the control sample is a peripheral blood sample of a pregnant woman carrying a normal male fetus with a known proportion of free fetal nucleic acid, and the concentration of free fetal nucleic acid in the control sample is determined using the Y chromosome. I.e., determined by the method of the invention for determining the concentration of male fetal nucleic acid fy as described above.
According to an embodiment of the present invention, the candidate length range spans 1-300 bp, preferably 1-20 bp.
According to an embodiment of the invention, the step sizes of the plurality of candidate length ranges are 1-2 bp.
For example, the candidate length ranges are 1-20 bp, 2-21 bp, 3-22 bp, ..., where the span is 20 bp and the step size is 1 bp.
According to one embodiment of the present invention, the predetermined range is 179bp to 206 bp.
According to an embodiment of the present invention, the female fetal nucleic acid concentration fs obtaining unit 122 further includes a predetermined function determining unit 1226, referring to fig. 14, the predetermined function determining unit 1226 including:
a second frequency determination module 12261 configured to determine a frequency of occurrence of uniquely aligned sequenced sequences within the predetermined range in the plurality of control samples, respectively;
a fitting module 12262 for fitting the frequency of occurrence of uniquely aligned sequencing sequences within the predetermined range in the plurality of control samples to known nucleic acid concentrations to determine the predetermined function.
According to one embodiment of the invention, the fitting is a linear fitting.
According to one embodiment of the invention, the predetermined function is d = 0.3215 × p + 1.62562, wherein d represents the free fetal nucleic acid concentration and p represents the frequency of uniquely aligned sequencing sequences occurring within the predetermined range.
According to an embodiment of the present invention, the first filtering means 140 further includes a false positive determination unit 141 for determining that the result is unreliable and filtering it out as a false positive if rmY1 ≥ 2 for a deletion or rmY2 ≥ 6 for a duplication;
or, if rms1 ≥ 2 for a deletion or rms2 ≥ 6 for a duplication, the determination is considered unreliable and the result is filtered out as a false positive.
According to an embodiment of the present invention, the second filtering means 150 further comprises a positive determination unit 151 for determining that dmY is positive if dmY < 0.13 or dmY > 0.85; alternatively, dms is positive if dms < 0.15 or dms > 0.791.
According to an embodiment of the present invention, the determination rule in the third filtering means 170 is: if amY is between 0.95 and 1.05, the microdeletion/microduplication fragment is considered to be of maternal origin and is filtered out;
or, if ams is between 0.93 and 1.06, the microdeletion/microduplication fragment is considered to be of maternal origin and is filtered out.
Example 1
Firstly, obtaining the concentration fm of the micro-deletion micro-repetitive fragment;
1. performing nucleic acid sequencing on a biological sample containing free nucleic acids to obtain a sequencing result consisting of a plurality of sequencing data;
2. aligning the sequencing results with a reference genome to construct a set of unique aligned sequencing sequences, each sequencing sequence in the set of unique aligned sequencing sequences being capable of matching only one position of the reference genome;
3. determining the length of each unique alignment sequencing sequence in the unique alignment sequencing sequence set;
4. dividing a reference genome into a plurality of primary windows according to a preset length, wherein the preset length is 1 bp-5M, preferably 20 kb-40 kb, such as (1-20 bp, 20-40 bp, 40-80 bp, 80-100 bp, 100-120 bp, ……);
5. counting the number of the unique alignment sequencing sequences of which the lengths fall into each primary window;
6. performing GC correction on the sequence number falling into the primary window, and performing batch-to-batch adjustment on the corrected result, wherein the GC correction method comprises a local weighted regression method, a linear regression method or a logistic regression method;
7. combining a preset number of adjacent primary windows into a plurality of secondary windows, and determining the number of sequences in each secondary window, wherein the preset number is 5-100; for example, 5 primary windows are combined into 1 secondary window, the 5 primary windows are respectively 1-20 bp, 20-40 bp, 40-80 bp, 80-100 bp and 100-120 bp, and the combined secondary window is 1-120 bp.
8. Performing statistical test on each secondary window to calculate a T1 value, wherein the T1 value is obtained by Z test or T test calculation;
9. filtering the secondary windows according to the T1 value, namely filtering out the secondary windows whose T1 value is between -3 and 3;
10. performing a statistical test on the filtered secondary window to calculate a T2 value, wherein the T2 value includes but is not limited to a value calculated according to a rank sum test, a sign test or a run test;
11. merging two adjacent secondary windows without significant difference into a final window according to the T2 value, wherein the T2 values of the two windows are between -3 and 3;
12. repeating steps 8 to 10 until no further merging is possible;
13. performing a hypothesis test on the final windows obtained by the last merging to obtain the final windows containing microdeletion micro-repeats, wherein the hypothesis test is a Z test or a T test, namely, when the test statistic is greater than 3 or less than -3, the final window is judged to contain a microdeletion micro-repeat.
14. Obtaining a primary window without microdeletion micro-repeats according to the primary window with microdeletion micro-repeats, and calculating the total sequence number of the primary window with microdeletion micro-repeats and the total number of the primary windows with microdeletion micro-repeats, and the total sequence number of the primary window without microdeletion micro-repeats and the total number of the primary windows without microdeletion micro-repeats;
15. calculating the average depth d1 of the primary windows containing microdeletion micro-repeats, d1 = total number of sequences in the primary windows containing microdeletion micro-repeats / total number of primary windows containing microdeletion micro-repeats;
16. calculating the average depth d2 of the primary windows without microdeletion micro-repeats, d2 = total number of sequences in the primary windows without microdeletion micro-repeats / total number of primary windows without microdeletion micro-repeats;
17. calculating the concentration fm of the microdeletion micro-repeat fragment, fm = 2 × |d2 - d1| / d2.
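As a hedged illustration of steps 7 to 9 and 14 to 17 above, the following Python sketch merges primary-window counts into secondary windows, flags secondary windows by Z score, and computes fm from the two average depths. The function names and the toy inputs are hypothetical. Using the mean and standard deviation of the sample's own secondary windows as the Z-test reference is an assumption made only for this sketch, since the text does not specify the reference distribution.

```python
from statistics import mean, pstdev


def merge_windows(primary_counts, n_merge):
    """Sum each run of n_merge adjacent primary-window counts into one
    secondary window; an incomplete trailing run is dropped."""
    last_start = len(primary_counts) - n_merge + 1
    return [sum(primary_counts[i:i + n_merge])
            for i in range(0, last_start, n_merge)]


def z_scores(values):
    """Z score of each value against the mean and population standard
    deviation of all values (an assumed reference, see lead-in).
    Raises ZeroDivisionError if all values are identical."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def keep_significant(values, threshold=3.0):
    """Indices of windows with |Z| above the threshold; windows whose
    Z score lies between -3 and 3 are filtered out."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


def fragment_concentration(counts_with, counts_without):
    """fm = 2 * |d2 - d1| / d2, where d1 and d2 are the average depths of
    primary windows with and without the microdeletion or microduplication."""
    d1 = sum(counts_with) / len(counts_with)
    d2 = sum(counts_without) / len(counts_without)
    return 2 * abs(d2 - d1) / d2
```

For instance, two affected primary windows with 95 sequences each against unaffected windows with 100 sequences each give d1 = 95, d2 = 100 and fm = 2 × 5 / 100 = 0.1.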
Secondly, obtaining the concentration fy of the nucleic acid of the male fetus or the concentration fs of the nucleic acid of the female fetus;
1. determining whether the sample to be detected contains Y chromosome, if so, calculating the concentration fy of nucleic acid of the male fetus, and if not, calculating the concentration fs of nucleic acid of the female fetus;
2. if Y chromosome is contained, the concentration of male fetal nucleic acid fy is calculated.
(1) Determining the number of unique aligned sequencing sequences in the Y chromosome in the sample falling into a primary window according to the sequencing result;
(2) removing the primary windows in which the number of unique alignment sequences after GC correction and adjustment is more than 5 times the average sequence number of the primary windows;
(3) counting the sum of the number of unique aligned sequencing sequences in each primary window on the Y chromosome and the total number of the primary windows;
(4) obtaining the average depth dy of the primary windows in the Y chromosome, dy = total number of unique alignment sequencing sequences on the Y chromosome / number of primary windows on the Y chromosome;
(5) obtaining the male fetal nucleic acid concentration fy, fy = 2 × dy / d2, d2 being the average depth of the primary windows without microdeletion micro-repeats, d2 = total number of sequences in the primary windows without microdeletion micro-repeats / number of primary windows without microdeletion micro-repeats.
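Sub-steps (2) to (5) above can be sketched in Python as follows. This is a minimal sketch with a hypothetical function name. It assumes the "average sequence number" used for the 5-fold cut-off is the mean over the Y-chromosome primary windows themselves, which the text does not state explicitly.

```python
def male_fetal_concentration(chr_y_counts, d2, max_fold=5.0):
    """Return fy = 2 * dy / d2.

    chr_y_counts: corrected sequence counts of the chrY primary windows.
    d2: average depth of primary windows without microdeletion/microduplication.
    Windows above max_fold times the mean chrY count are removed first
    (assumed interpretation of the outlier rule).
    """
    mean_count = sum(chr_y_counts) / len(chr_y_counts)
    kept = [c for c in chr_y_counts if c <= max_fold * mean_count]
    dy = sum(kept) / len(kept)  # average depth of the retained chrY windows
    return 2 * dy / d2
```

With nine chrY windows of 5 sequences and one outlier window of 500, the outlier exceeds 5 times the mean (54.5) and is dropped, so dy = 5 and, for d2 = 100, fy = 0.1.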
3. If the Y chromosome is not contained, the concentration fs of the fetal nucleic acid in the female is calculated.
(1) Determining the number of unique aligned sequencing sequences in the biological sample containing free nucleic acids whose length falls within a predetermined range; the predetermined range is 179bp to 206 bp.
The predetermined range is obtained by:
a. at least 20 control samples, i.e., samples containing known concentrations of free fetal nucleic acid, are selected, and this example uses a male fetal control sample in which the concentration of free fetal nucleic acid is determined from the Y chromosome, i.e., by the method described above for male fetal nucleic acid concentration fy.
b. Counting the length of the unique aligned sequencing sequence contained in all the control samples from 0bp to Mbp (M represents the longest length of nucleic acid), and determining the sequence number of the unique aligned sequencing sequence appearing at each length;
c. taking a certain length as a candidate length range, moving and dividing a plurality of candidate length ranges according to the step length of 1-2 bp, such as 1bp, 2bp, 3bp, …,100 bp, … and 300bp, and counting the frequency of unique alignment sequencing sequences appearing in each candidate length range of the control sample;
d. for each candidate length range, or combination of candidate length ranges, determining the correlation coefficient between the frequency of unique aligned sequencing sequences occurring in that range in the plurality of control samples and the nucleic acid concentration of the control samples, so as to find the candidate length range or combination with the stronger correlation; the correlation coefficient is obtained through a correlation calculation, the methods of which include linear regression, logistic regression, locally weighted regression and other methods.
Wherein the span of the candidate length range is 1-300 bp, and preferably 1-20 bp. The step length of the candidate length ranges is 1-2 bp.
e. Determining at least one candidate length range or a candidate length range combination as the predetermined range based on the value of the correlation coefficient.
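Steps b to e above can be illustrated by the following Python sketch, which slides a fixed-span length range over the read-length axis and keeps the range whose frequency correlates most strongly with the known concentrations. The function names, the default span of 20 bp, and the use of the Pearson coefficient are assumptions for illustration; the text only requires some correlation calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def range_frequency(lengths, lo, hi):
    """Fraction of reads whose length lies in [lo, hi] bp."""
    return sum(lo <= n <= hi for n in lengths) / len(lengths)


def best_length_range(samples, concentrations, span=20, step=1, max_len=300):
    """samples: one list of read lengths per control sample.
    Returns (lo, hi, r) for the candidate range with the largest |r|."""
    best = None
    for lo in range(1, max_len - span + 2, step):
        hi = lo + span - 1
        freqs = [range_frequency(s, lo, hi) for s in samples]
        if len(set(freqs)) < 2:  # constant frequency: correlation undefined
            continue
        r = pearson_r(freqs, concentrations)
        if best is None or abs(r) > abs(best[2]):
            best = (lo, hi, r)
    return best
```

A negative r, as in the R values reported in the examples, means that the frequency of reads in the selected range decreases as the fetal concentration increases.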
(2) Counting the frequency of the unique alignment sequencing sequences in the preset range based on the number of the unique alignment sequencing sequences with the length falling in the preset range;
(3) determining a female fetal nucleic acid concentration fs in the sample according to a predetermined function based on the frequency of uniquely aligned sequenced sequences within the predetermined range.
The predetermined function is obtained by:
a. determining the frequency of the unique aligned sequencing sequences in the predetermined range in the plurality of control samples respectively, wherein the predetermined range in the control samples and the frequency of the unique aligned sequencing sequences are obtained by the predetermined range determination method;
b. linearly fitting the frequency of unique aligned sequencing sequence inserts occurring within the predetermined range to known nucleic acid concentrations in the plurality of control samples to determine the predetermined function.
Preferably, the predetermined function is d = -0.3215 × p + 1.62562, wherein d represents the free fetal nucleic acid concentration and p represents the frequency of unique aligned sequenced sequences occurring within the predetermined range.
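The linear fitting of step b and the application of the predetermined function can be sketched in Python as follows. This is an illustrative ordinary least-squares fit with hypothetical function names; the default coefficients are the values quoted above. The text does not state the units of p, so p must be supplied in the same units as were used for the fit.

```python
def fit_line(ps, ds):
    """Ordinary least-squares fit of d = a * p + b; returns (a, b).

    ps: frequencies of reads in the predetermined range, one per control sample.
    ds: known fetal nucleic acid concentrations of the same control samples.
    """
    n = len(ps)
    mp, md = sum(ps) / n, sum(ds) / n
    a = (sum((p - mp) * (d - md) for p, d in zip(ps, ds))
         / sum((p - mp) ** 2 for p in ps))
    b = md - a * mp
    return a, b


def female_fetal_concentration(p, a=-0.3215, b=1.62562):
    """Apply d = a * p + b; the defaults are the coefficients quoted in the text."""
    return a * p + b
```

In use, fit_line would be run once on the control samples, and the resulting coefficients would then be passed to female_fetal_concentration for each sample without a Y chromosome.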
Thirdly, calculating the ratio rmY (fm/fy) of the concentration fm of the micro-deletion micro-repeat fragment to the concentration fy of the male fetal nucleic acid, or calculating the ratio rms (fm/fs) of the concentration fm of the micro-deletion micro-repeat fragment to the concentration fs of the female fetal nucleic acid; wherein the ratio rmY includes two aspects: the ratio rmY1 calculated from the missing copy number and rmY2 calculated from the duplicate copy number, the ratio rms includes two aspects: the ratio rms1 obtained from the calculation of the number of copies missing and the ratio rms2 obtained from the calculation of the number of copies replicated.
Fourthly, if the deletion copy number rmY1 ≥ 2 or the duplication copy number rmY2 ≥ 6, judging that the result is not credible, and filtering it out as a false positive result;
or if the deletion copy number rms1 ≥ 2 or the duplication copy number rms2 ≥ 6, judging that the result is not credible, and filtering it out as a false positive result;
filtering false positives removes the effect of multiple copies, making the results more accurate.
Fifthly, taking the decimal part dmY of rmY or the decimal part dms of rms, and judging whether dmY or dms is positive; otherwise, filtering out the result:
dmY is positive if dmY <0.13 or dmY > 0.85;
alternatively, dms is positive if dms <0.15 or dms > 0.791;
sixthly, calculating the sum of the concentration fm of the microdeletion-containing micro-repetitive fragment and the concentration fy of the fetal nucleic acid in males as amY = fm + fy, or calculating the sum of the concentration fm of the microdeletion-containing micro-repetitive fragment and the concentration fs of the fetal nucleic acid in females as ams = fm + fs;
seventhly, if amY is between 0.95 and 1.05, the microdeletion microreplicated fragment is considered to be from the mother, and the microdeletion microreplicated fragment is filtered;
or if the ams is between 0.93 and 1.06, the microdeletion micro-repeat fragment is from the mother, filtering the microdeletion micro-repeat fragment, and filtering to obtain the fetal chromosome microdeletion micro-repeat fragment.
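The filtering described in the third to seventh steps can be summarised by the following Python sketch. The function name and the returned labels are hypothetical. The sketch assumes that the copy-number ratios (rmY1/rmY2 or rms1/rms2) are both the plain ratio fm divided by the fetal concentration, compared against 2 for a deletion and 6 for a duplication; the text does not define these ratios more precisely.

```python
from math import floor


def classify(fm, f_fetal, sex, kind):
    """Classify one candidate fragment.

    fm: concentration of the microdeletion/microduplication fragment.
    f_fetal: fy for a male fetus or fs for a female fetus.
    sex: "male" or "female"; kind: "deletion" or "duplication".
    Returns "false_positive", "negative", "maternal" or "fetal".
    """
    pos_lo, pos_hi = (0.13, 0.85) if sex == "male" else (0.15, 0.791)
    mat_lo, mat_hi = (0.95, 1.05) if sex == "male" else (0.93, 1.06)

    ratio = fm / f_fetal  # rmY or rms
    limit = 2 if kind == "deletion" else 6  # assumed use of the copy-number limits
    if ratio >= limit:
        return "false_positive"

    decimal_part = ratio - floor(ratio)  # dmY or dms
    if not (decimal_part < pos_lo or decimal_part > pos_hi):
        return "negative"

    total = fm + f_fetal  # amY or ams
    if mat_lo <= total <= mat_hi:
        return "maternal"
    return "fetal"
```

The checks are applied in the order of the text, so a fragment is reported as fetal only if it passes the copy-number filter, has a positive decimal part, and does not sum with the fetal concentration to approximately 1.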
Example 2
1. Sample collection and processing
100 samples were selected from 1 lot and 2ml of peripheral blood was extracted for plasma separation.
2. Library construction
Library construction may be performed with reference to plasma library construction requirements well known to those skilled in the art
3. Sequencing
The sequencing process may be performed on-board with reference to sequencing procedures well known to those skilled in the art.
4. Data analysis
The sequencing results were obtained by paired-end sequencing, and the initial microdeletion microreplication results were obtained by the following analysis steps:
4.1 alignment, aligning the sequencing result to a reference genome, and determining the position of the unique alignment sequencing sequence.
4.2 dividing the reference genome into a plurality of primary windows according to the length of 20kb, counting the number of unique aligned sequencing sequences and GC content in each primary window, and carrying out GC correction by using local weighted regression on the number of sequences falling into the primary window.
4.3 Baseline correction, batch-to-batch adjustment, was performed for all samples within a batch for each primary window.
4.4 merging the adjacent primary windows by taking 100 as a unit, and obtaining a plurality of secondary windows after merging, wherein the length of each secondary window is 2M;
4.5 calculating the T1 value of each secondary window by using a Z test, and filtering out the secondary windows with T1 values between -3 and 3;
4.6 performing a run test on the filtered secondary windows to calculate a T2 value, and merging two adjacent secondary windows whose T2 values are between -3 and 3 into a final window according to the T2 value;
4.7 repeat step 4.5-4.6 until no combination can be achieved;
4.8 performing a Z test on the final windows obtained by the last merging to obtain the microdeletion microreplication results; microdeletion microreplication was detected in 19 samples.
TABLE 1. Detection results for 19 samples
Table 1 shows the results of the detection of 19 samples, wherein the first column is the id of the sample, the second column is the chromosome on which the microdeletion microreplication occurred, the third column is the microdeletion microreplication length of the chromosome, and the fourth column is the T value detected.
4.9 calculating the concentration of the microdeletion micro-repeat fragment according to the microdeletion micro-repeat result, the specific steps are as follows:
calculating the average depth d1 of the primary window containing the microdeletion microreplication in each sample;
calculating the average depth d2 of the primary window without microdeletion of the micro-repeats;
calculating the concentration fm of the microdeletion microduplicate fragment;
calculating the fetal nucleic acid concentration. The table for the above 19 results is as follows:
TABLE 2. Fetal nucleic acid concentration information for 19 samples
4.10 calculating the fetal concentration according to the ratio of chrY to obtain the male fetal concentration of 8 samples, the specific steps are as follows:
removing the primary windows in chromosome chrY in which the number of unique alignment sequences after GC correction and adjustment is more than 5 times the average sequence number;
calculating the average depth dy of the primary window in chrY;
the male fetal nucleic acid concentration fy was calculated and the results are given in the following table:
TABLE 3. Fetal nucleic acid concentration estimated from chrY for 19 samples
4.11 calculating the concentration of the fetus according to the length of the fragment to obtain the concentration of the female fetus of 11 samples, the specific steps are as follows:
counting 41 male samples in the whole batch, and finding out a region with strong correlation between frequency and fetal concentration, wherein the selected region is 179bp-206bp, and the correlation coefficient R is-0.9056996.
Determining the functional relationship between the occurrence frequency of unique alignment sequencing sequences with lengths in the range of 179bp-206bp in the remaining 11 samples and the concentration of free fetal nucleic acids: a linear fit over the selected 179bp-206bp region gives the relationship d = a × p + b, wherein d represents the concentration and p represents the occurrence frequency, and a and b are calculated to be -0.3215 and 1.62562, respectively.
The concentrations of the female fetal samples calculated from the fit are given as follows:
TABLE 4. Fetal nucleic acid concentrations calculated from fragment lengths for 19 samples
4.12 screening the microdeletion microduplication results.
For a male fetus:
calculating the ratio rmY of the concentration fm of the microdeletion microduplicate fragment to the male fetal nucleic acid concentration fy as fm/fy;
filtering according to copy number: deleted fragments with a copy number rmY1 value of more than 2 are filtered out, and duplicated fragments with a copy number rmY2 value of more than 6 are filtered out.
The decimal part of rmY is taken as dmY.
For the remaining fragments, fragments with dmY greater than 0.13 and less than 0.85 are filtered out.
Calculating the sum of the concentration fm of the microdeletion-containing microreplicated fragment and the concentration fy of male fetal nucleic acid as amY = fm + fy;
fragments with amY > 0.95 and amY < 1.05 are filtered out to obtain the male fetus samples containing microdeletion microreplication.
For a female fetus:
calculating the ratio rms of the concentration fm of the micro-deletion micro-repetitive fragment to the concentration fs of the female fetal nucleic acid to be fm/fs;
filtering according to copy number: deleted fragments with a copy number rms1 value of more than 2 are filtered out, and duplicated fragments with a copy number rms2 value of more than 6 are filtered out.
The decimal part of rms is taken as dms.
For the remaining fragments, fragments with dms greater than 0.15 and less than 0.791 are filtered out.
Calculating the sum of the concentration fm of the microdeletion-containing microreplicated fragment and the concentration fs of female fetal nucleic acid as ams = fm + fs;
fragments with ams > 0.93 and ams < 1.06 are filtered out to obtain the female fetus samples containing microdeletion microreplication.
The results obtained are positive as given in table 5 below:
TABLE 5. Filtered microdeletion microreplication results
Sample | 8
Chromosome |
Starting position | 19394465
End position | 27194537
Length of microdeletion microreplication | 7.80M
T value | 5.030
Concentration calculated from microdeletion microreplication | 0.119863
Fetal concentration calculated from chrY |
Fetal concentration calculated from fragment length | 0.126509
By the above processing, a large number of false positives can be filtered out of the microdeletion and microduplication results, leaving accurate results; see fig. 15. In the figure, the abscissa represents the sample number and the ordinate represents the concentration, where fm represents the concentration estimated from the microdeletion or microduplication, fy represents the concentration estimated from chrY for a male fetus sample, and fs represents the concentration estimated from fragment length for a female fetus sample. It can be seen that the sample numbered 28 is the fetal sample finally judged by the above criteria to contain a microduplication.
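As a worked check of the screening criteria, the following Python lines apply the female-fetus thresholds of step 4.12 to the two concentrations listed in Table 5. Treating this sample as a female fetus is an assumption based on the table giving only a fragment-length concentration and no chrY concentration; the variable names are illustrative.

```python
fm = 0.119863  # concentration calculated from the microdeletion/microduplication (Table 5)
fs = 0.126509  # fetal concentration calculated from fragment length (Table 5)

rms = fm / fs          # ratio of the two concentrations
dms = rms - int(rms)   # decimal part of rms
ams = fm + fs          # sum of the two concentrations

is_positive = dms < 0.15 or dms > 0.791  # positive criterion for dms
is_maternal = 0.93 <= ams <= 1.06        # maternal-origin criterion for ams

print(round(rms, 4), round(dms, 4), round(ams, 6), is_positive, is_maternal)
```

Here rms is approximately 0.947, below the copy-number limits of 2 and 6; its decimal part exceeds 0.791, so the fragment is positive; and ams = 0.246372 lies outside 0.93 to 1.06, so the fragment is not attributed to the mother and is retained.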
EXAMPLE 3
The method for determining microdeletion and microduplication in fetal chromosomes in this example is the same as example 2, except that the division in step 4.2 is performed with a 40kb window.
EXAMPLE 4
The method for determining microdeletion and microduplication in fetal chromosomes in this example is the same as example 2, except that in step 4.4 the merging is performed in units of 200, and the length of the secondary window obtained after merging is 4M.
EXAMPLE 5
The method for determining microdeletion and microduplication in fetal chromosomes in this example is the same as example 2, except that 40 male samples are used in step 4.11, the selected region is 185-204 bp, and the correlation coefficient R is-0.87.
A linear fit over the selected 185-204 bp region gives the relationship d = a × p + b, wherein d represents the concentration and p represents the occurrence frequency, and a and b are calculated to be 0.0334 and 1.6657, respectively.
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention shall fall within the protection scope defined by the claims of the present invention.