WO2019213810A1 - Method, device and system for detecting chromosome aneuploidy - Google Patents

Method, device and system for detecting chromosome aneuploidy

Info

Publication number
WO2019213810A1
Authority
WO
WIPO (PCT)
Prior art keywords
window
chromosome
read
sequencing
sequence
Prior art date
Application number
PCT/CN2018/085865
Other languages
English (en)
French (fr)
Inventor
曾立董
吴增丁
金欢
徐伟彬
李林森
赵陆洋
张萌
颜钦
Original Assignee
深圳市真迈生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市真迈生物科技有限公司
Priority to PCT/CN2018/085865 (WO2019213810A1, zh)
Priority to US17/053,054 (US20210130888A1, en)
Priority to EP18917705.8A (EP3795692A4, en)
Publication of WO2019213810A1 (zh)

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/166 Oligonucleotides used as internal standards, controls or normalisation probes

Definitions

  • the present invention relates to the field of bioinformatics, and in particular to a method, device and system for detecting chromosome aneuploidy.
  • Down syndrome: trisomy 21
  • Edwards syndrome: trisomy 18
  • Patau syndrome: trisomy 13
  • These chromosomal aneuploidies result in high morbidity and mortality.
  • Amniocentesis and chorionic villus sampling are standard methods for diagnosing fetal chromosomal abnormalities, but these invasive procedures can themselves cause miscarriage at rates of 0.6% to 1.9%. To avoid these risks, it is necessary to develop a safer, non-invasive method for detecting fetal aneuploidy (NIPT) that can be applied at an earlier gestational age.
  • NIPT: non-invasive detection of fetal aneuploidy abnormalities (non-invasive prenatal testing)
  • NGS: next-generation sequencing
  • the detection of chromosomal aneuploidy is based on the data generated by each sequencing platform.
  • the sensitivity and/or accuracy of the detection still needs to be improved.
  • multiple factors affect detection sensitivity and/or accuracy, for example, the data generated by different sequencing platforms.
  • the lengths of these data differ greatly; the off-machine data are called reads, and the length of a read is called the read length, which ranges from tens of bp to several thousand bp.
  • the read length affects at least the alignment (positioning) of the off-machine data.
  • other factors also affect the confidence of read positioning; for example, the sequencing error rate: generally, the higher the error rate, the lower the confidence.
  • the embodiments of the present invention aim to at least solve one of the technical problems existing in the related art or at least provide an alternative practical solution.
  • a method for detecting chromosomal aneuploidy, comprising: (1) sequencing at least a portion of the nucleic acids in a sample to be tested to obtain a sequencing result including reads; (2) aligning the reads to a first reference sequence to obtain an alignment result.
  • the alignment result includes information on which specific chromosome each read is located; the first reference sequence is the set of regions on the reference genome with an alignment capability of 1, a region with an alignment capability of 1 being a region that aligns to a unique position on the reference genome; (3) for a first chromosome, determining, based on the alignment result, the amount of reads positioned to the first chromosome; (4) comparing the amount of reads positioned to the first chromosome with the amount of reads positioned to the corresponding first chromosome in a negative sample, to determine the number of the first chromosome.
  • the method screens and positions reads using a specific reference sequence, which allows chromosomal aneuploidy detection to be implemented quickly and simply and yields accurate detection results. It is suitable for the detection and analysis of off-machine data from various sequencing platforms, and is especially suitable for reads containing unrecognized bases, that is, reads containing gaps.
  • a device for detecting chromosomal aneuploidy, the device being used to perform the method for detecting chromosomal aneuploidy in the above embodiment of the present invention, the device comprising: a sequencing module configured to sequence at least a portion of the nucleic acids in the sample to be tested and obtain a sequencing result including reads; an alignment module configured to align the reads from the sequencing module to the first reference sequence and obtain an alignment result, the alignment result including information on which chromosome each read is located, the first reference sequence being the set of regions on the reference genome with an alignment capability of 1, a region with an alignment capability of 1 being a region that aligns to a unique position on the reference genome.
  • a quantification module configured to determine, for the first chromosome, the amount of reads positioned to the first chromosome based on the alignment result from the alignment module; and a determination module configured to compare the amount of reads positioned to the first chromosome with the amount of reads positioned to the corresponding first chromosome in the negative sample, to determine the number of the first chromosome.
  • a computer readable medium for storing/carrying a computer executable program; when the program is executed, all or part of the method for detecting chromosomal aneuploidy in the above embodiment of the present invention can be completed by instructing related hardware.
  • the so-called medium includes but is not limited to: a read only memory, a random access memory, a magnetic disk or an optical disk, and the like.
  • a terminal, i.e. a system for detecting chromosomal aneuploidy, the system comprising a computer executable program and a processor, the processor being configured to execute the computer executable program, where executing the computer executable program includes performing the method for detecting chromosomal aneuploidy in the above embodiments of the present invention.
  • the method, device and/or system for detecting chromosomal aneuploidy provided by any of the above embodiments can be used for chromosomal aneuploidy detection, and the detection results obtained have high sensitivity and accuracy. It is suitable for the detection and analysis of off-machine data from various sequencing platforms, and is especially suitable for reads containing unrecognized bases, that is, reads containing gaps.
  • FIG. 1 is a schematic diagram of the distance between two adjacent entries of the reference library in an alignment mode utilized by an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of the communication length of the comparison mode utilized by the embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing the relationship between coefficient of variation and window size in a specific embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the relationship between the normalized sequencing depth of chromosomes and the GC content of chromosomes in a specific embodiment of the present invention.
  • the terms "first" and "second" are used for descriptive purposes only, and are not to be construed as indicating or implying relative importance or implicitly indicating the number or order of the indicated technical features.
  • the meaning of "a plurality" is two or more, unless specifically defined otherwise.
  • Sequencing refers to nucleic acid sequence determination, including DNA sequencing and/or RNA sequencing, including long fragment sequencing and/or short fragment sequencing.
  • Sequencing can be performed on a sequencing platform, which can be selected from, but is not limited to, Illumina's HiSeq/MiSeq/NextSeq platforms, Thermo Fisher/Life Technologies' Ion Torrent platform, BGI's BGISEQ platform, and single-molecule sequencing platforms; either single-end or paired-end sequencing can be used; the sequencing result/data obtained consists of the read-out fragments, which are called reads. The length of a read is called the read length.
  • Embodiments of the present invention provide a method for detecting chromosomal aneuploidy, wherein chromosomal aneuploidy includes an abnormal number of a chromosome or of a part of a chromosome, the method comprising: (1) sequencing at least a portion of the nucleic acids in a sample to be tested to obtain a sequencing result including reads; (2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on which specific chromosome each read is located, the first reference sequence being the set of regions on the reference genome with an alignment capability of 1, a region with an alignment capability of 1 being a region that aligns to a unique position on the reference genome; (3) for a first chromosome, determining, based on the alignment result, the amount of reads positioned to the first chromosome; (4) comparing the amount of reads positioned to the first chromosome with the amount of reads positioned to the corresponding first chromosome in a negative sample, to determine the number of the first chromosome.
  • the method screens and positions reads using a specific reference sequence, which allows chromosomal aneuploidy detection to be implemented quickly and simply and yields accurate detection results. It is suitable for the detection and analysis of off-machine data from various sequencing platforms, and is especially suitable for reads containing unrecognized bases, that is, reads containing gaps.
  • Sequencing can be performed on the entire genome (genome) or on several chromosomes or parts of a chromosome. Generally, this is mainly related to the characteristics of the target chromosome or region including the association of the target chromosome or region with other chromosomes or regions.
  • alignment means sequence alignment, including the process of positioning one or more sequences onto one or more other sequences and the resulting positioning results.
  • this includes the process of positioning a read on a reference sequence, and also the process of obtaining the read positioning/matching result.
  • the reference sequence (reference, ref), also referred to as the reference chromosomal sequence, is a predetermined sequence.
  • the reference sequence may be a DNA and/or RNA sequence determined in advance by oneself, or a DNA and/or RNA sequence determined and published by others; it may be any predetermined reference template of the biological category to which the sample source individual/target individual belongs, such as all or at least a portion of a published genome assembly of the same biological category. If the sample source individual or target individual is a human, its genomic reference sequence (also called the reference genome) may be a human reference genome provided by the UCSC, NCBI or ENSEMBL database, such as HG19, HG38, GRCh36, GRCh37, GRCh38, etc.
  • a resource library containing more reference sequences may be pre-configured; for example, before alignment, a closer or more characteristic sequence is selected or assembled according to factors such as the sex, ethnicity, and region of the target individual.
  • this sequence is used as the reference sequence, which facilitates obtaining more accurate sequence analysis results later.
  • the reference sequence carries the chromosome number and the positional information of each site on the chromosome.
  • the so-called first reference sequence is at least a part of the reference genome; it is a version constructed by the inventors for the purpose of detecting chromosomal aneuploidy, based on the characteristics of the sequencing platform used and of the off-machine data, including factors such as read length, error rate, and data quality.
  • using the first reference sequence to position the reads facilitates quickly obtaining positioning results and reduces the amount of data required for subsequent steps.
  • the alignment capability of a region is determined as follows: sliding a first window of size L1 over the reference genome to obtain a plurality of regions; aligning the regions to the reference genome; and calculating the alignment capability of each region based on the number of positions on the reference genome to which the region aligns.
  • the so-called region or window corresponds to a sequence on the reference genome.
  • the size of the first window and/or the step size of the sliding window can be set in conjunction with the purpose of the detection, the principle of variation detection employed, the length of the read, and the sequence characteristics of the reference genome.
  • the step size of the sliding window is not greater than the size of the first window, so that as many regions as possible with a comparison capability of 1 on the reference genome can be retained, which is advantageous for improving the utilization of the data.
  • L1 can be set according to the read length, for example to any integer between 0.5 and 2 times the read length or the average read length, and the step size of the sliding window can be set to any integer less than 0.5 times, less than 0.2 times, or less than 0.1 times the read length.
  • the reference genome selected for use is HG19 of the UCSC database
  • the read length is 25 bp
  • the L1 is set to 25 bp
  • the step size of the sliding window is set to less than 10 bp, less than 5 bp, or less than 2 bp; for example, a step size of 1 bp is equivalent to an overlap of (L1-1) bp between two adjacent first windows, which helps obtain all regions on the reference genome that meet the specific requirement, makes full use of the sequencing results to obtain more comprehensive alignment results, and improves data utilization.
  • the alignment capability of each region is calculated as the reciprocal of the number of positions on the reference genome to which the region aligns; for example, if a region aligns to a unique position on the reference genome, its alignment capability is 1, and if a region can match 5 positions on the reference genome, its alignment capability is 1/5.
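  • The reciprocal rule above can be illustrated with a minimal sketch. It assumes a toy genome held in a single string and counts only exact, forward-strand occurrences, which is a simplification of what a real aligner does; the function names `alignment_capability` and `unique_regions` are hypothetical and not part of this document.

```python
from collections import Counter


def alignment_capability(genome: str, l1: int, step: int = 1) -> dict[int, float]:
    """Return {window start: 1 / number of exact occurrences of that window}."""
    # Count every length-l1 substring at every offset, so that all occurrences
    # are found even when the sliding step is larger than 1.
    counts = Counter(genome[i:i + l1] for i in range(len(genome) - l1 + 1))
    return {
        start: 1.0 / counts[genome[start:start + l1]]
        for start in range(0, len(genome) - l1 + 1, step)
    }


def unique_regions(genome: str, l1: int, step: int = 1) -> list[int]:
    """Start positions of windows whose alignment capability equals 1."""
    caps = alignment_capability(genome, l1, step)
    return [start for start, cap in caps.items() if cap == 1.0]


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # "ACGT" occurs twice, every other 4-mer once
    print(alignment_capability(toy, 4))
    print(unique_regions(toy, 4))
```

  • In this toy input the two windows containing "ACGT" receive a capability of 1/2 and are excluded, while the remaining windows would be kept for a first reference sequence.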
  • the first reference sequence may be constructed when the target sample is tested, or may be prepared and saved in advance and called when needed.
  • the first reference sequence is at least a portion of a human reference genome from which the regions shown in Table 1 have been removed. Since displaying the entire sequences of the removed regions would take a lot of space, Table 1 indicates these regions by their positions in the human reference genome HG19. It is understood that these regions may correspond to different positions in different versions of the human reference genome.
  • such differences in chromosome start position information do not prevent a person skilled in the art from determining and masking the sequences of the regions in the following table to obtain the first reference sequence. The reference sequence obtained after masking/removing these regions facilitates rapid execution of subsequent steps and accurate detection results.
  • the first reference sequence is at least a portion of the reference genome from which the regions corresponding to the following second windows have been removed: second windows whose sequencing depth is not less than (greater than or equal to) 4 times the average sequencing depth of all second windows, preferably not less than 6 times that average; that is, second windows on the reference genome whose sequencing depth is far greater than the average are removed or masked.
  • sequencing depth is also called depth.
  • it is the number of times a region is covered, and can be expressed as the ratio of the number of reads in the region to the size of the region.
  • the sequencing depth of a second window is the ratio of the number of reads in the second window to the size of the second window.
  • the so-called second window can be obtained by sliding the reference genome with a window of size L2 to obtain a series of second windows of size L2. There may or may not be overlap between adjacent second windows.
  • the step size of the sliding window used to acquire the second windows is set to L2, that is, adjacent second windows neither overlap nor leave a gap (zero-base overlap and zero-base spacing), whereby the reference genome is transformed into a series of second windows that cover the reference genome exactly once, and this series of second windows can be used to represent the genome.
  • the removal/masking process is performed on the specific region of the reference genome, so that after the step (2) is compared using the processed reference sequence (first reference sequence), the influence of some abnormal data on the subsequent statistical analysis can be eliminated.
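  • The window-depth definition and the 4-fold (or 6-fold) masking rule can be sketched as follows. This is a minimal illustration under stated assumptions: reads are represented only by their start coordinates on one sequence, a trailing partial window is ignored, and the names `window_depths` and `windows_to_mask` are hypothetical.

```python
def window_depths(read_starts, genome_length, l2):
    """Depth of each non-overlapping window = reads in window / window size."""
    n_windows = genome_length // l2  # a trailing partial window is ignored
    counts = [0] * n_windows
    for pos in read_starts:
        idx = pos // l2
        if idx < n_windows:
            counts[idx] += 1
    return [count / l2 for count in counts]


def windows_to_mask(depths, fold=4.0):
    """Indices of windows whose depth is at least `fold` times the mean depth."""
    mean_depth = sum(depths) / len(depths)
    return [i for i, depth in enumerate(depths) if depth >= fold * mean_depth]


if __name__ == "__main__":
    # Toy data: one read in each of the first four windows, 36 in the last.
    starts = [0, 10, 20, 30] + [40 + (i % 10) for i in range(36)]
    depths = window_depths(starts, genome_length=50, l2=10)
    print(depths)
    print(windows_to_mask(depths, fold=4.0))
```

  • Passing `fold=6.0` applies the stricter preferred threshold mentioned in the text.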
  • the sequencing depths of the second windows can also be re-assigned to obtain a series of second windows with relatively balanced sequencing depths, so that after the alignment of step (2) is performed using a first reference sequence comprising a series of second windows with relatively similar sequencing depths, the effect of some abnormal data on subsequent statistical analysis can also be eliminated.
  • the sequencing depth of a second window whose percentile is greater than 98 is re-assigned to the sequencing depth of the second window whose percentile equals 98, or the sequencing depth of a second window whose percentile is greater than 99 is re-assigned to the sequencing depth of the second window whose percentile equals 99.
  • the obtained first reference sequence facilitates the elimination of the influence of the abnormal data/area on the detection result, and is advantageous for obtaining an accurate detection result.
  • all of the second windows can be sorted from low to high by sequencing depth, and the sequencing depths of all second windows ranked between the 99th and 100th percentiles are re-assigned, for example to the sequencing depth value of the second window at the 99th percentile, thereby eliminating the effect of windows with abnormally high sequencing depth on subsequent detection.
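  • The percentile re-assignment described above can be sketched as a simple capping step. The sketch assumes a nearest-rank percentile definition, which this document does not specify, and the function name `cap_depths` is hypothetical.

```python
import math


def cap_depths(depths, percentile=99.0):
    """Re-assign every depth above the given percentile to the percentile value."""
    ordered = sorted(depths)
    # Nearest-rank percentile (an assumed definition): the smallest value with
    # at least `percentile` percent of the windows at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    cap = ordered[rank - 1]
    return [min(depth, cap) for depth in depths]


if __name__ == "__main__":
    depths = [float(i) for i in range(1, 101)]
    print(cap_depths(depths, 99.0)[-3:])
    print(cap_depths(depths, 98.0)[-3:])
```

  • The original order of the windows is preserved; only the values above the cap change.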
  • the size L2 of the second window can be determined as needed and adjusted according to the sequencing results.
  • the size of the second window is substantially consistent with the size of most regions/windows whose sequencing depth is abnormally high and/or low.
  • when the sample is a human sample and the reference genome is a human reference genome, L2 can be set to 10-20 Kbp, preferably 12-17 Kbp, based on preliminary statistics of sequencing results and/or alignment results; in one example, the inventors found that setting L2 to 15 Kbp allows abnormal regions/second windows to be found more comprehensively.
  • the alignment can be performed using known alignment software, such as SOAP, BWA, BLAST, MAPQ, and TeraMap, which is not limited in this embodiment.
  • in the alignment, a read pair or a read is allowed at most n base mismatches, for example with n set to 1 or 2. If more than n bases in a read are mismatched, the read pair or read is considered unable to align to the first reference sequence; or, if the n mismatched bases are all located in one read of a read pair, that read of the pair is considered unable to align to the reference sequence.
  • the alignment in step (2) comprises: (a) converting each read into a set of short segments corresponding to the read, obtaining a plurality of sets of short segments; (b) determining the corresponding position of each short segment in the reference library to obtain a first positioning result, the reference library being a hash table constructed based on the first reference sequence, the reference library including multiple entries, and one entry of the reference library corresponding to one seed sequence.
  • the seed sequence can be matched with at least one sequence on the first reference sequence, and the distance on the first reference sequence between the two seed sequences corresponding to two adjacent entries of the reference library is smaller than the length of the short segment; (c) removing, from the first positioning result, any short segment positioned to either of two adjacent entries of the reference library, to obtain a second positioning result; (d) extending based on the short segments from the same read in the second positioning result to obtain the alignment result of the read.
  • the so-called reference library is essentially a hash table, which can directly use the seed sequence as the key (key name) and the position of the seed sequence on the reference sequence as the value (key value).
  • alternatively, the seed sequence can first be converted into a numeric or integer string, and the reference library is built using this numeric or integer string as the key and the position of the seed sequence on the reference sequence as the value.
  • the value, i.e. the position of the seed sequence on the reference sequence, may be one or more positions corresponding to the seed sequence on the reference sequence/chromosome; a position may be represented directly by its true value or a numerical range, or may be re-encoded as custom characters and/or numbers.
  • Hash(seed) → Vector(position)
  • the so-called vector is an object entity that can hold many other elements of the same type, and is therefore also known as a container.
  • This reference library can be built by saving in binary.
  • the hash table may be stored in blocks, and a block head key and a block tail key may be set in the block header; for example, for the sequential block {5, 6, 7, 8, ..., 19, 20}, the block head and block tail (or header and footer) are 5 and 20. Given the number 3, since 3 < 5, it can be seen that 3 does not belong to this block; given the number 10, since 5 < 10 < 20, it can be seen that 10 belongs to this block.
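  • The block head/tail check in the example above can be sketched as a range test that decides whether a key can belong to a block before the block itself is searched. The class `KeyBlock` and the function `find_block` are hypothetical names introduced only for illustration; the actual storage layout is not given in this document.

```python
from dataclasses import dataclass, field


@dataclass
class KeyBlock:
    head: int                                  # smallest key in the block
    tail: int                                  # largest key in the block
    keys: list = field(default_factory=list)   # sorted keys stored in the block

    def may_contain(self, key: int) -> bool:
        """Cheap range test using only the block head and block tail."""
        return self.head <= key <= self.tail


def find_block(blocks, key):
    """Return the first block whose [head, tail] range covers the key, else None."""
    for block in blocks:
        if block.may_contain(key):
            return block
    return None


if __name__ == "__main__":
    block = KeyBlock(head=5, tail=20, keys=list(range(5, 21)))
    print(block.may_contain(3))    # 3 < 5, so it is outside the block
    print(block.may_contain(10))   # 5 <= 10 <= 20, so it is inside the block
```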
  • the so-called reference library can be built when the sequence is to be compared, or it can be pre-built and saved.
  • the method is based on the relationship between the seed sequence length and the reference sequence, which the inventors established through repeated hypothesis and experimental verification, so that the constructed reference library contains a comprehensive set of seed sequences together with the corresponding positions of each seed sequence on the reference sequence.
  • the reference library is compact, has a small memory footprint and can be used for high-speed access queries in sequence location analysis.
  • An entry of the reference library obtained according to this embodiment contains only one key, one key corresponding to at least one value.
  • the method for generating all possible seed sequences to obtain the seed sequence set is not limited.
  • for example, the elements of the base set may be traversed to obtain all possible combinations of a specific length, implemented using recursive and/or looping algorithms.
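  • One looping way to enumerate every possible seed of a given length is sketched below using the standard-library `itertools.product`. The function name `all_seed_sequences` is hypothetical, and a generator is used because the number of seeds grows as 4 to the power of the seed length.

```python
from itertools import product


def all_seed_sequences(length, alphabet="ATCG"):
    """Lazily yield every sequence of the given length over the alphabet."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)


if __name__ == "__main__":
    seeds = list(all_seed_sequences(2))
    print(len(seeds))   # 4 ** 2 combinations
    print(seeds[:5])
```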
  • the first reference sequence is at least a portion of the human genome
  • the human genome comprises approximately 3 billion bases
  • the length of the read to be processed is no less than 25 bp
  • L is an integer in [11, 15] for efficient alignment.
  • the first reference sequence is at least a portion of a human cDNA reference genome
  • the total number of bases of the reference sequence is totalBase
  • the length L of the seed sequence is set based on the total number of bases
  • L(seed) = log(totalBase) * ⁇
  • based on L and the base types of the DNA sequence, which include A, T, C, and G, the set of possible seed sequences is obtained.
  • the reference library is constructed by taking the seed sequence as the key and its position on the reference sequence as the value.
  • the first reference sequence is at least a portion of the DNA genome and transcriptome of a species
  • the total number of bases of the reference sequence is totalBase
  • the length L of the seed sequence is set based on the total number of bases
  • L(seed) = log(totalBase) * ⁇
  • the base types constituting the DNA sequence include A, T, C, and G
  • the base types constituting the RNA sequence include A, U, C, and G
  • all possible seed sequences form the seed sequence set B^L, with B ∈ {A, T, C, G} ∪ {A, U, C, G}; the seed sequences in the set that can match the reference sequence, and their matching positions, are determined.
  • the reference library is then constructed by using each seed sequence that can match the reference sequence as a key and the position of that seed sequence on the reference sequence as the value.
  • the seed sequence can be converted into a string of numeric characters, and the string can be used as a key to build a library, which can improve the speed of accessing the reference library built by the query.
  • the seed sequence is encoded as follows:
  • the base coding rule can be the same as above, and the same encoding conversion can be applied to the first reference sequence, which facilitates quickly obtaining the corresponding position information of the seed sequence on the reference sequence.
  • this also helps improve the access and query speed of the constructed reference library.
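  • As an illustration of turning a seed sequence into a numeric key, the sketch below packs each base into two bits. The mapping A=0, C=1, G=2, T=3 is an assumption made only for this example and is not the coding rule of this document; `encode_seed` and `decode_seed` are hypothetical names.

```python
# Assumed 2-bit base codes for illustration only (not the document's rule).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_seed(seed: str) -> int:
    """Pack a seed into an integer, two bits per base, first base most significant."""
    value = 0
    for base in seed:
        value = (value << 2) | BASE_CODE[base]
    return value


def decode_seed(value: int, length: int) -> str:
    """Inverse of encode_seed for a seed of known length."""
    return "".join(
        CODE_BASE[(value >> (2 * i)) & 3] for i in reversed(range(length))
    )


if __name__ == "__main__":
    key = encode_seed("ACGT")
    print(key, decode_seed(key, 4))
```

  • An integer key of this kind is compact (a 14-base seed fits in 28 bits) and can be compared or hashed faster than a character string, which is the kind of benefit the text attributes to numeric keys.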
  • determining the seed sequences in the seed sequence set that can match the first reference sequence, and their matching positions, includes: sliding a window of size L over the first reference sequence, and
  • matching the seed sequences in the set against the window sequences obtained by the sliding window, to determine the seed sequences that can match the first reference sequence and their matching positions, with a matching fault tolerance rate of ⁇ 1 .
  • the corresponding position information of the seed sequence on the first reference sequence can be quickly obtained, which facilitates rapid construction of the reference library.
  • the so-called fault tolerance is the ratio of the allowed mismatched bases, and the mismatch is selected from at least one of substitutions, insertions, and deletions.
  • the so-called match is a strict match, that is, the fault tolerance rate ⁇ 1 is zero.
  • the position of the sliding-window sequence is the corresponding position of the seed sequence on the first reference sequence.
  • the so-called match is a fault-tolerant match, that is, the fault tolerance rate ⁇ 1 is greater than zero.
  • the position of the sliding-window sequence is the corresponding position of the seed sequence on the first reference sequence.
  • the corresponding position of the seed sequence on the first reference sequence is encoded, and the reference library is constructed with the encoded characters, such as numeric characters, as values.
  • the constructed reference library may use a seed as a key, or may use each seed template corresponding to the seed as a key; the keys are distinct from one another, and one key corresponds to at least one value.
  • When determining the corresponding positions of the seed sequences on the reference sequence, the step size of the sliding window over the first reference sequence is determined according to L and ε1.
  • In one example, the step size of the sliding window is not less than L*ε1.
  • In one example, the first reference sequence is at least a portion of the human genome, which comprises approximately 3 billion bases; the length of the reads to be processed is not less than 25 bp, L is 14 bp, ε1 is 0.2-0.3, and the step size of the sliding window is 3 bp-5 bp. Two adjacent windows can then span a run of consecutive errors permitted under ε1, which facilitates rapid positioning.
  • The distance between two adjacent entries of the constructed reference library is the step size of the sliding window.
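The library construction described above can be sketched as a hash table from window sequence to reference positions. This is a minimal illustration, not the patented implementation: it assumes strict matching (fault tolerance of zero), uses the raw window sequence as the key instead of an encoded seed or seed template, and the name `build_reference_library` is hypothetical.

```python
from collections import defaultdict


def build_reference_library(ref, L=14, step=4):
    """Slide a window of size L over ref with the given step.

    Assumption: strict matching, so each window sequence is its own key.
    Each key maps to at least one value (a 0-based start position).
    """
    library = defaultdict(list)
    for start in range(0, len(ref) - L + 1, step):
        library[ref[start:start + L]].append(start)
    return dict(library)


# Toy reference; a real first reference sequence would be far longer.
toy_library = build_reference_library("ACGTACGTAC", L=4, step=2)
```

With a step of 4 and L of 14, the step satisfies the stated lower bound L*ε1 for ε1 up to about 0.28.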
  • (a) comprises: sliding a window of size L over the read with a step size of 1 bp to obtain the set of short segments corresponding to the read.
  • Sliding a window of size L over a read of length K yields (K-L+1) short segments of length L. The read is thus converted into short segments, the reference library is queried at high speed to determine the positions of the short segments in the reference library, and the information of the read in the reference library is then obtained from its short segments.
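As a small sketch of step (a), the function below cuts a read into its (K-L+1) short segments with a 1 bp step. The function name and the choice to return the offset alongside each segment are illustrative assumptions.

```python
def short_segments(read, L):
    """Return (offset, segment) pairs for every window of size L in the read."""
    return [(i, read[i:i + L]) for i in range(len(read) - L + 1)]


# A 25 bp read with L = 14 gives 25 - 14 + 1 = 12 short segments.
segments = short_segments("ACGTTGCAACGTTGCAACGTTGCAA", 14)
```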
  • (b) comprises: matching the short segments against the seed sequences corresponding to the entries of the reference library to determine the positions of the short segments in the reference library, with a matching fault tolerance rate of ε2.
  • In one example, the match is a strict match, that is, the fault tolerance rate ε2 is zero.
  • In another example, the match is a fault-tolerant match, that is, the fault tolerance rate ε2 is greater than zero; when the proportion of bases of a short segment that do not match the seed or seed template corresponding to one or more entries of the reference library is less than the fault tolerance rate ε2, the position information of the short segment in the reference library is obtained.
  • In one example, ε2 = ε1 and neither is zero, which retains as much valid data as possible.
  • The distance on the reference sequence ref between the two seed sequences X1 and X2 corresponding to two adjacent entries of the reference library falls into the following two cases.
  • When the key and value of each of the two entries are unique, that is, each entry corresponds to a single [key, value] pair (see Figure 1a), which is equivalent to X1 and X2 each matching only one position of the reference sequence, the distance is the distance between the two positions on the reference sequence to which X1 and X2 correspond. When the key of at least one of the two entries corresponds to multiple values (see Figure 1b), which is equivalent to at least one of X1 and X2 matching multiple positions of the reference sequence, the distance is the distance between the two nearest positions on the reference sequence to which X1 and X2 correspond.
  • This embodiment does not limit how the distance between two sequences is expressed; for example, it can be expressed as the distance from either end of one sequence to either end of the other sequence, or as the distance from the center of one sequence to the center of the other.
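Both cases above reduce to taking the smallest pairwise distance between the position lists of the two entries. The sketch below assumes a start-to-start distance convention, which is one of the representations the text permits; `entry_distance` is a hypothetical name.

```python
def entry_distance(positions_x1, positions_x2):
    """Distance between two library entries on the reference sequence.

    With one position per entry this is the plain distance; with several
    positions it is the distance between the nearest pair.
    """
    return min(abs(a - b) for a in positions_x1 for b in positions_x2)


unique_case = entry_distance([10], [14])               # both seeds match once
multi_case = entry_distance([10, 500], [490, 900])     # nearest pair is 500/490
```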
  • (c) further includes: removing the short segments whose connected length is less than a predetermined threshold, and replacing the second positioning result with the result after removal, the connected length being the total length, as mapped onto the reference sequence, of the short segments in the second positioning result that come from the same read and are located to different entries of the reference library. This removes some excessively redundant and/or relatively low quality data, which helps increase the alignment speed.
  • The connected length can be expressed as the sum of the lengths of the short segments that come from the same read and are located to different entries of the reference library, minus the length of the overlap between those short segments as mapped onto the reference sequence.
  • In one example, the short segments all have length L, and the predetermined threshold is L. In this way, the alignment speed can be increased at the cost of losing some valid but relatively low quality data.
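The connected length defined above (sum of segment lengths minus their overlap on the reference) is the length of the union of the mapped intervals. A minimal sketch, assuming every short segment has the same length L and is identified by its 0-based start on the reference; the function name is hypothetical.

```python
def connected_length(starts, L):
    """Total reference length covered by equal-length segments of one read."""
    total = 0
    prev_end = None
    for start in sorted(set(starts)):
        end = start + L
        if prev_end is None or start >= prev_end:
            total += L                  # no overlap with the previous segment
        else:
            total += end - prev_end     # count only the newly covered bases
        prev_end = end
    return total


# Three 14 bp segments starting 1 bp apart cover 16 bp of the reference.
covered = connected_length([100, 101, 102], 14)
```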
  • (c) further comprises: evaluating the positioning result of each read according to the positioning results of the short segments from that read in the second positioning result, and removing the reads whose evaluation result does not meet predetermined requirements. Removing a read also removes the short segments corresponding to that read. In this way, while satisfying a certain sensitivity and accuracy, exact matching or fast local alignment can subsequently be performed directly on the second positioning result, which accelerates the alignment.
  • This embodiment does not limit the method of evaluation; for example, quantitative scoring can be used.
  • In one example, after the second positioning result is obtained, the positioning result of each read is scored according to the positioning results of the short segments from that read in the second positioning result, under the rule that sites matching the first reference sequence are deducted points and sites not matching the first reference sequence are awarded points; reads whose score is not greater than a first preset value are removed.
  • In one example, the read length is 25 bp, and sequence construction is performed on the short segments from the same read to obtain a reconstructed sequence. The base type at a site is the one supported by more short segments; if no short segment supports a site, that is, no short segment is aligned to that site, the base type of the site is undetermined and can be represented by N. The reconstructed sequence thus corresponds to the read, and its length is the read length. Each site of the reconstructed sequence that matches the first reference sequence (ref) is deducted one point, and each site that does not match the first reference sequence is awarded one point. The fault tolerance rate of the alignment, that is, the proportion of mismatches allowed for a read/reconstructed sequence, is 0.12, so the allowed error length is 3 bp (25*0.12); the initial score Scoreinit is the read length, and the first preset value is 22 (25-3).
  • In one example, bit operations and a dynamic programming algorithm [G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3): 395-415, 1999] are used.
  • In another example, after the second positioning result is obtained, the positioning result of each read is scored according to the positioning results of the short segments from that read in the second positioning result, under the rule that sites matching the first reference sequence are awarded points and sites not matching the first reference sequence are deducted points; reads whose score is not less than a second preset value are removed together with their corresponding short segments.
  • In one example, the read length is 25 bp, and sequence construction is performed on the short segments from the same read to obtain a reconstructed sequence. The base type at a site is the one supported by more short segments; if no short segment supports a site, that is, no short segment is aligned to that site, the base type of the site is undetermined and can be represented by N. The reconstructed sequence thus corresponds to the read, and its length is the read length. Each site of the reconstructed sequence that matches the first reference sequence (ref) is awarded one point, and each site that does not match the first reference sequence is deducted one point. The fault tolerance rate of the alignment, that is, the proportion of mismatches allowed for a read/reconstructed sequence, is 0.12, so the allowed error length is 3 bp (25*0.12); the initial score Scoreinit is -25, and the second preset value is -22 (-25-(-3)).
  • The extending in (d) based on the short segments from the same read in the second positioning result comprises: performing sequence construction based on the short segments from the same read to obtain a reconstructed sequence; and extending based on the common part of the reconstructed sequence and the reference sequence corresponding to the reconstructed sequence to obtain an extended sequence. In this way, the short segments and their positioning information are converted into positioning information of the read corresponding to the short segments (here embodied as the reconstructed sequence), which facilitates fast and accurate subsequent alignment processing.
  • The common part is the part shared by multiple sequences.
  • The common part is a common substring and/or a common subsequence.
  • A common substring is a contiguous portion shared by multiple sequences, whereas a common subsequence need not be contiguous.
  • In one example, the common subsequence is BCBA and the common substring is AB.
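The difference between a common substring and a common subsequence can be checked with two short dynamic programs. The source does not name the two strings behind its BCBA/AB example, so the pair ABCBDAB and BDCABA used here is an assumption (a common textbook pair that is consistent with those values); the function names are hypothetical.

```python
def longest_common_substring_len(x, y):
    """Length of the longest contiguous run shared by x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1    # extend the run ending here
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_subsequence_len(x, y):
    """Length of the longest (not necessarily contiguous) shared subsequence."""
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[len(y)]


# Assumed example pair: substring length 2 (e.g. AB), subsequence length 4 (e.g. BCBA).
sub_len = longest_common_substring_len("ABCBDAB", "BDCABA")
seq_len = longest_common_subsequence_len("ABCBDAB", "BDCABA")
```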
  • Sequence construction is performed based on the short segments from the same read to obtain the reconstructed sequence. For example, the base type at a site of the reconstructed sequence is the one supported by more short segments; if no short segment supports a site, that is, no short segment is aligned to that site of the reference sequence, the base type of the site is undetermined and can be represented by N. The reconstructed sequence thus corresponds to the read, and its length is the read length.
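The reconstruction rule (the base supported by more short segments wins, unsupported sites become N) can be sketched as a per-site vote. This is an illustration under assumptions: each short segment is given as its offset within the read plus the bases it contributes, and ties are broken by first occurrence, which the source does not specify.

```python
from collections import Counter


def reconstruct(read_len, segments):
    """Build a reconstructed sequence of length read_len by per-site majority vote.

    segments: iterable of (offset_in_read, bases) pairs from the same read.
    """
    votes = [Counter() for _ in range(read_len)]
    for offset, bases in segments:
        for k, base in enumerate(bases):
            votes[offset + k][base] += 1
    # Sites with no supporting segment are undetermined and written as N.
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)


rebuilt = reconstruct(8, [(0, "ACGT"), (1, "CGTA"), (2, "GAAC")])
```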
  • The reference sequence corresponding to the reconstructed sequence is the reference sequence that the reconstructed sequence matches, and its length is not less than the read length.
  • In one example, the reference sequence corresponding to the reconstructed sequence has the same length as the reconstructed sequence, namely the read length.
  • In another example, the reconstructed sequence is allowed to match its corresponding reference sequence with fault tolerance, and the length of the reference sequence corresponding to the reconstructed sequence is the length of the reconstructed sequence plus twice the fault-tolerant matching length. For example, if the length of the reconstructed sequence is the read length of 25 bp and matching between the reconstructed sequence and the reference sequence allows 12% mismatches, the reference sequence corresponding to the reconstructed sequence can be formed from the reference sequence to which the reconstructed sequence aligns together with the 3 bp (25*12%) flanking each of its ends.
  • In some examples, the common part is a common substring.
  • In some examples, the common part is a common subsequence.
  • The edit distance, also known as the Levenshtein distance, is the minimum number of editing operations required to transform one of two strings into the other. Editing operations include replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity between the two strings.
  • Finding the longest common substring of the reconstructed sequence and the reference sequence corresponding to the reconstructed sequence can be expressed as finding the common substring of two strings x1x2...xi and y1y2...yj, whose lengths are m and n respectively. The length c[i, j] of the common substring of the two strings is calculated, and the transfer equation can be obtained:
  • A gap means inserting or deleting a character; gap in the formula denotes the penalty for inserting or deleting a character (at the corresponding position in the sequence).
  • A match means that the two characters are the same; match in the formula denotes the score when the two characters are the same.
  • A mismatch means that the two characters are different; mismatch in the formula denotes the penalty when the two characters are different.
  • d[i,j] takes the smallest of the three values. In a specific example, a gap is penalized 3 points, each continuation of a gap increases the penalty by 1 point, a mismatched position is penalized 2 points, and a matched position scores 0 points. This facilitates efficient alignment of sequences containing gaps.
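For orientation, a weighted edit-distance recurrence consistent with the quantities named above (gap, match, mismatch, and a minimum over three candidates) can be written as follows. This is the standard textbook form and is offered as an assumption, not as the formula of the source; in particular it uses a single linear gap penalty and leaves out the extra 1-point charge for a continued gap.

```latex
d[i,0] = i \cdot \mathrm{gap}, \qquad d[0,j] = j \cdot \mathrm{gap},
\qquad
d[i,j] = \min
\begin{cases}
d[i-1,j] + \mathrm{gap} \\
d[i,j-1] + \mathrm{gap} \\
d[i-1,j-1] + s(x_i, y_j)
\end{cases}
\qquad
s(x_i, y_j) =
\begin{cases}
\mathrm{match} & x_i = y_j \\
\mathrm{mismatch} & x_i \neq y_j
\end{cases}
```

With the example values gap = 3, mismatch = 2, and match = 0, identical strings have distance 0 and every edit increases the distance.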
  • In some examples, the common part is a common subsequence.
  • (d) includes: finding the common subsequences of the short segments in the second positioning result that are located to the same entry of the reference library, and determining the longest common subsequence corresponding to each read; and extending the longest common subsequence based on the edit distance to obtain an extended sequence.
  • In one example, the longest common subsequence of the reconstructed sequence and the reference sequence corresponding to the reconstructed sequence is found; based on the longest common subsequence, the stretch of the reconstructed sequence corresponding to the longest common subsequence is transformed into the stretch of the reference sequence corresponding to the longest common subsequence. The Smith-Waterman algorithm can be used to find the edit distance between the two stretches; for two strings x1x2...xi and y1y2...yj, it can be obtained by the following formula:
  • where σ denotes the score function, σ(i,j) denotes the score for a mismatch or match between the characters (sites) xi and yj, σ(-,j) denotes the score for a vacancy (deletion) in xi or an insertion in yj, and σ(i,-) denotes the score for a deletion in yj or an insertion in xi.
  • In one example, a gap is penalized 3 points, a continued gap is penalized 1 point, a mismatch is penalized 2 points, and a match scores 4 points. In this way, sequences containing gaps can be aligned efficiently, and sequences that contain gaps but are otherwise highly accurate at their remaining sites can be retained.
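A compact Smith-Waterman local alignment using the example scores above might look like the sketch below. It is a simplification under stated assumptions: the penalties are applied as negative scores (match +4, mismatch -2, gap -3), the gap penalty is linear so the separate 1-point charge for a continued gap is not modelled, and only the best score and its end position are returned rather than a traceback.

```python
def smith_waterman(x, y, match=4, mismatch=-2, gap=-3):
    """Return (best_score, (i, j)) of the best local alignment of x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                0,                       # start a new local alignment
                H[i - 1][j - 1] + s,     # match or mismatch
                H[i - 1][j] + gap,       # gap in y
                H[i][j - 1] + gap,       # gap in x
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


score, end = smith_waterman("GGACGTGG", "TTACGTTT")   # shared core ACGT
```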
  • (d) further comprises: truncating the extended sequence from at least one end, calculating the proportion of incorrectly located sites in the truncated extended sequence, and stopping the truncation once the following condition is met: the proportion of incorrectly located sites in the truncated extended sequence is less than a third preset value.
  • In one example, the extended sequence is truncated as follows: (i) calculating a first error rate and a second error rate; if the first error rate is less than the second error rate, truncating the extended sequence from its first end, and if the first error rate is greater than the second error rate, truncating the extended sequence from its second end, to obtain a truncated extended sequence, where the first error rate is the proportion of incorrectly located sites in the extended sequence truncated from the first end and the second error rate is the proportion of incorrectly located sites in the extended sequence truncated from the second end; (ii) replacing the extended sequence with the truncated extended sequence and repeating (i) until the proportion of incorrectly located sites in the truncated extended sequence is less than a fourth preset value.
  • This double-ended truncation and culling better preserves the well-matched local sequence, which helps improve the effective use of the data.
  • In one example, the length of the extended sequence is 25 bp and the fourth preset value is 0.12.
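The double-ended truncation can be sketched as a loop that removes one base at a time from whichever end leaves the lower error rate. Several points are assumptions made for illustration: the extended sequence is represented as a list of flags marking incorrectly located sites, the first and second error rates are taken to be the error rates after removing one base from the first and second end respectively, ties are resolved toward the second end, and the step of 1 bp is chosen for simplicity.

```python
def error_rate(flags):
    """Proportion of incorrectly located sites; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def truncate(flags, threshold=0.12):
    """Return (lo, hi) so that flags[lo:hi] has an error rate below threshold."""
    lo, hi = 0, len(flags)
    while hi > lo and error_rate(flags[lo:hi]) >= threshold:
        first = error_rate(flags[lo + 1:hi])    # after cutting the first end
        second = error_rate(flags[lo:hi - 1])   # after cutting the second end
        if first < second:
            lo += 1
        else:
            hi -= 1                             # ties go to the second end (assumption)
    return lo, hi


# 25 bp extended sequence whose first four sites are incorrectly located.
kept = truncate([True] * 4 + [False] * 21)
```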
  • (d) further includes: sliding a window over the extended sequence starting from at least one of its ends, calculating the proportion of incorrectly located sites in each window sequence obtained, and truncating the extended sequence according to that proportion; the truncation is stopped when the proportion of incorrectly located sites in the window sequence obtained by the sliding is greater than a fifth preset value.
  • In one example, the extended sequence is truncated as follows: (i) calculating a third error rate and a fourth error rate; if the third error rate is less than the fourth error rate, truncating the extended sequence from its second end, and if the third error rate is greater than the fourth error rate, truncating the extended sequence from its first end, to obtain a truncated extended sequence, where the third error rate is the proportion of incorrectly located sites in the window sequence obtained by sliding the window from the first end of the extended sequence, and the fourth error rate is the proportion of incorrectly located sites in the window sequence obtained by sliding the window from the second end of the extended sequence; (ii) replacing the extended sequence with the truncated extended sequence and repeating (i) until the proportion of incorrectly located sites in the window sequence is greater than a sixth preset value.
  • The window of the sliding is no larger than the length of the extended sequence.
  • In one example, the length of the extended sequence is 25 bp, the window size is 10 bp, and the sixth preset value is 0.12.
  • In one example, the truncation size is 1 bp, that is, each truncation removes one base. In this way, alignment results containing more long sequences can be obtained efficiently.
  • The number of negative samples is not less than 20, preferably not less than 30.
  • A negative sample is a sample without a chromosome aneuploidy abnormality.
  • In one example, the target of the variation detection is a human, or the sample to be tested comes from a human body, and the negative samples are samples obtained from normal diploid individuals.
  • The sequencing results of the negative samples and of the sample to be tested may be obtained in any order, for example simultaneously or one after the other; preferably they are obtained simultaneously under the same experimental conditions, so as to minimize the impact of differences in experimental factors on the detection results.
  • The negative samples and the sample to be tested are samples of the same type; for example, when fetal genetic information is detected non-invasively through the mother, the negative samples and the sample to be tested may all be maternal blood samples.
  • Determining the amount of reads in the negative samples located to the corresponding first chromosome comprises: substituting a negative sample for the sample to be tested and performing steps (1)-(3) to determine the amount of reads located to the first chromosome of that negative sample; and taking the mean of the amounts of reads of the first chromosome over the plurality of negative samples as the amount of reads in the negative samples located to the corresponding first chromosome.
  • The amount of reads located to a chromosome can be an absolute quantity or a relative quantity, for example a value such as an integer or a ratio, or a range of values.
  • At least one, at least two, or all three of the following (i)-(iii) are performed before step (3): (i) removing the reads in the sequencing result whose length is not greater than a predetermined length; (ii) removing the reads in the alignment result that are not uniquely aligned to the first reference sequence, reads aligned/located to a unique position of the reference sequence being called unique reads; (iii) removing the reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases on the read that are at least one of insertions, deletions, and mismatches.
  • The error rate of a read is the proportion, after alignment, of inserted, deleted, and mismatched bases or sites shown on the read.
  • The predetermined error rate can be set according to the sequencing platform, the amount of off-machine data, the data quality, and the purpose of the detection. It can be understood that if the amount of off-machine data is small and/or the data quality is high, a larger value may be appropriate.
  • The predetermined error rate can also be set to a smaller value, so as to remove relatively low quality data while still meeting the needs of the detection, which facilitates rapid detection.
  • In one example, the sequencing results come from a single molecule sequencing platform and are filtered using all of (i)-(iii) for rapid detection.
  • In one example, the amount of off-machine data is 12.8 M and the predetermined error rate is set to 10%, that is, for a 10 bp read at most a 1 bp insertion, deletion, or mismatch is allowed; 3.4 M of data remains after filtering. It can be understood that if the filtering at the alignment stage is already relatively strict, (ii) may be skipped; for example, the predetermined error rate may be set to 100%.
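The three filters (i)-(iii) can be sketched as a single predicate over aligned reads. The record type, its field names, and the 25 bp default for the predetermined length are hypothetical; the comparisons follow the wording of (i) and (iii) ("not greater than", "not less than"), and how a read sitting exactly on a limit is handled is therefore an assumption.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    length: int     # read length in bp
    n_hits: int     # number of reference positions the read aligns to
    n_errors: int   # inserted + deleted + mismatched bases after alignment


def keep(read, min_length=25, max_error_rate=0.10):
    """Apply filters (i)-(iii); return True if the read survives all three."""
    if read.length <= min_length:                         # (i) too short
        return False
    if read.n_hits != 1:                                  # (ii) not a unique read
        return False
    if read.n_errors / read.length >= max_error_rate:     # (iii) too many errors
        return False
    return True


reads = [AlignedRead(30, 1, 1), AlignedRead(20, 1, 0), AlignedRead(30, 2, 0)]
filtered = [r for r in reads if keep(r)]
```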
  • Step (3) comprises: (a) sliding a window of size L3 over the first reference sequence to obtain a plurality of third windows, optionally with a step size of L3; (b) determining the sequencing depth of each third window based on the alignment result, the sequencing depth of a third window being the ratio of the number of reads located to the third window to the third window size L3; (c) determining the amount of reads located to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.
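Steps (a) and (b) amount to binning read positions into windows and dividing each count by the window size. A minimal sketch, assuming non-overlapping windows (step equal to L3) and that each read is assigned to the window containing its 0-based start position; the function name is hypothetical.

```python
from collections import Counter


def window_depths(read_starts, chrom_len, L3):
    """Sequencing depth per third window: reads in the window divided by L3."""
    n_windows = -(-chrom_len // L3)                 # ceiling division
    counts = Counter(pos // L3 for pos in read_starts)
    return [counts.get(w, 0) / L3 for w in range(n_windows)]


# Toy chromosome of 30 bp with 10 bp windows and four reads.
depths = window_depths([0, 5, 12, 29], chrom_len=30, L3=10)
```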
  • (b) comprises: correcting the sequencing depth of the third window based on the GC content of the third window, and using the corrected sequencing depth as the sequencing depth of the third window.
  • The size of the third window, that is, the setting of L3, generally needs to be able to reflect the influence of differences in GC content and distribution on the sequencing results of the regions (third windows).
  • In one example, L3 is less than 300 Kbp.
  • The inventors determine L3 according to the relationship between the coefficient of variation (CV) and window size shown in FIG. 3, choosing the third window size according to how strongly the CV value on the curve is affected by the window size. If L3 is set to 100 Kbp-200 Kbp, the window can reflect the influence of GC content and distribution on sequencing while still allowing fast alignment.
  • The coefficient of variation, also known as the dispersion coefficient, is a normalized measure of the degree of dispersion of a probability distribution, defined as the ratio of the standard deviation to the mean; here it reflects the absolute value of the degree of dispersion of the GC content of a set of windows/regions of a particular window size.
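The CV of window GC content for a given window size can be computed as below, which is how a curve of CV against window size could be produced. The sketch assumes non-overlapping windows, the population standard deviation, and a toy sequence; the function names are hypothetical.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_cv(genome, window):
    """Coefficient of variation (std / mean) of GC content across windows."""
    values = [
        gc_content(genome[i:i + window])
        for i in range(0, len(genome) - window + 1, window)
    ]
    return pstdev(values) / mean(values)


toy = "GGCCAATT" * 4
cv_small = gc_cv(toy, 4)   # windows alternate between all-GC and all-AT
cv_large = gc_cv(toy, 8)   # every window has the same GC content
```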
  • Two adjacent third windows may or may not overlap.
  • In one example, L3 is set to 150 Kbp and two adjacent third windows overlap by 100 bp, that is, the step size of the sliding is set to 149.9 Kbp.
  • The correction may be performed by establishing a relationship between the GC content of the third windows and their sequencing depth; in one example, this relationship is established by locally weighted regression.
  • (b) further comprises, before performing the above correction, normalizing the sequencing depth of the third window and using the normalized sequencing depth as the sequencing depth of the third window.
  • The normalization is a standardization process; for example, the sequencing depth of the third window can be normalized by the mean or the median of the sequencing depths.
  • The weight coefficient of the reads located to a third window is determined based on the sequencing depth of the third window, and the amount of reads located to the first chromosome is determined based on the weight coefficients.
  • In one example, the sequencing depth of the third window is normalized: the ratio of the sequencing depth of the third window to a specific value, namely the mean of the sequencing depths of the third windows, is used as the sequencing depth of the third window, so that the third-window sequencing depths are converted into a set of values fluctuating around 1. The relationship between the processed sequencing depth (relative sequencing depth) and the GC content is then determined. The weight coefficient of the reads of a third window is the relative sequencing depth of that window, and the amount of reads located to the first chromosome is a relative amount corrected by the weight coefficient. This eliminates or reduces the influence of differences in GC content and/or distribution on the detection result and improves the detection accuracy.
  • The inventors have found that the relative depth of a third window is inversely proportional to the GC content of the window, that is, third windows with low GC content have high relative depth and third windows with high GC content have low relative depth.
  • In one example, N reads are located to a third window of the first chromosome and the relative depth of that third window is w; the number of reads located to that third window after correction is then obtained from N and w.
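The normalization to relative depth and a weight-based correction can be sketched as follows. The division by the mean follows the text; the correction N / w is an assumption, since the source formula for the corrected count is not legible here, and it is shown only as one common way to down-weight over-represented windows. Both function names are hypothetical.

```python
from statistics import mean


def relative_depths(depths):
    """Divide each window depth by the mean so values fluctuate around 1."""
    avg = mean(depths)
    return [d / avg for d in depths]


def corrected_reads(n_reads, w):
    """Assumed correction: scale the read count by the inverse of the weight w."""
    return n_reads / w


weights = relative_depths([2.0, 4.0, 6.0])
adjusted = corrected_reads(120, 1.2)
```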
  • In one example, the amount of reads located to the first chromosome is a relative amount, namely the ratio of the amount of reads located to the first chromosome to the amount of reads located to all or at least a portion of the autosomes. The ratio is evaluated with a z-score test to determine whether the difference between this ratio and the corresponding ratio of the negative samples is statistically significant, and thereby whether the first chromosome of the sample to be tested has an aneuploidy abnormality.
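The z-score comparison against the negative samples can be sketched in a few lines. The use of the sample standard deviation and any decision cutoff applied to the resulting z value are assumptions, as the text does not fix them; the numbers are toy values, not measured ratios.

```python
from statistics import mean, stdev


def chromosome_z_score(sample_ratio, negative_ratios):
    """z-score of the sample's chromosome ratio relative to negative samples."""
    return (sample_ratio - mean(negative_ratios)) / stdev(negative_ratios)


# Toy values chosen so the arithmetic is easy to follow.
z = chromosome_z_score(5.0, [1.0, 2.0, 3.0])
```

In practice the negative set would contain at least the 20 to 30 samples mentioned earlier, and a ratio whose z-score lies far from zero would be flagged for the chromosome in question.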
  • The first chromosome is selected from at least one of chromosomes 13, 18 and 21. For example, fetal genetic information is obtained by detecting free nucleic acids in a peripheral blood sample of a pregnant woman, including screening for or assisting in the diagnosis of aneuploidy variation of chromosomes 13, 18, and/or 21 in the fetus.
  • Different chromosomes differ in GC content and distribution. For example, based on relative GC content, the chromosomes of the genome can be classified into a high GC content group, a medium GC content group, and a low GC content group, or into a relatively high, a medium-high, a medium, a medium-low, and a low GC content group.
  • Table 2 shows the GC content of human autosomes.
  • The inventors plotted the relationship between the normalized sequencing depth of each chromosome and its GC content based on sequencing data from multiple control samples. As shown in Fig. 4, the sequencing results of chromosomes with relatively high and relatively low GC content are significantly affected by GC content. Among chromosomes 21, 13 and 18, the sequencing results of chromosome 21 are least affected by GC content, followed by chromosome 18, while chromosome 13 is greatly affected by GC content.
  • In one example, the sample to be tested is a maternal blood sample. The amount of fetal free nucleic acid, including fetal free DNA (cffDNA), in maternal free nucleic acid samples fluctuates greatly between pregnant women and/or between gestational ages. If the detection sensitivity can be improved, detection becomes possible at an earlier gestational age for the same detection accuracy, and the earlier the pregnancy can be intervened in, the smaller the impact on the pregnant woman. If the accuracy can be improved, false positives and false negatives can be reduced, ultimately making the method applicable to diagnosis rather than only to screening or assisting diagnosis.
  • In one example, free DNA including cffDNA is extracted from a body fluid sample of a pregnant woman, built into a library, and sequenced on the machine to obtain off-machine sequencing data (for example, in fastq format). The off-machine data is aligned to the reference sequence to obtain an alignment result (for example, a sam file) that records, for each read, its position in the genome, its alignment score, whether it is uniquely aligned, its alignment errors, and so on. The number of reads on a given chromosome can be counted from the alignment result, and the proportion of reads attributed to that chromosome (hereinafter the chromosome proportion) is then calculated to determine whether the number of that chromosome is abnormal.
  • For non-invasive prenatal screening, a batch of body fluid samples containing free DNA from pregnant women confirmed to be normal (negative samples) can be obtained, and the chromosome proportion, for example of chromosome 21, 18, or 13, is calculated for each of these samples, thereby determining the range or boundary for a normal and/or abnormal number of the chromosome(s).
  • Positive samples can be used in the same way to determine the range or boundary for normal and/or abnormal chromosome numbers.
  • An embodiment of the present invention further provides a device for detecting chromosome aneuploidy, which is used to implement the method for detecting chromosome aneuploidy of any of the above embodiments of the present invention. The device comprises: a sequencing module for sequencing at least a portion of the nucleic acids in a sample to be tested to obtain a sequencing result including reads; an alignment module for aligning the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result including information on the chromosome to which each read is located, the first reference sequence being the set of regions on the reference genome whose alignment capability is one, a region with an alignment capability of one being a region that is located to a unique position on the reference genome; a quantification module for determining, for a first chromosome, the amount of reads located to the first chromosome based on the alignment result from the alignment module; and a determination module for comparing the amount of reads located to the first chromosome from the quantification module with the amount of reads in negative samples located to the corresponding first chromosome to determine the number of the first chromosome.
  • Determining the alignment capability of a region comprises: sliding a first window of size L1 over the reference genome to obtain a plurality of regions, where the step size of the sliding can be set, for example, to 1 bp; and aligning the regions to the reference genome and calculating the alignment capability of each region based on the number of positions on the reference genome to which the region aligns.
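One way to picture the alignment capability is as the reciprocal of the number of places a region occurs in the genome, so that a value of one marks a uniquely placed region. The sketch below rests on two assumptions: exact string matching stands in for a real aligner, and the reciprocal definition is a common mappability convention that the text does not spell out. The function name is hypothetical.

```python
from collections import Counter


def alignment_capability(genome, L1):
    """Per-window score 1 / (number of exact occurrences), with a 1 bp step."""
    windows = [genome[i:i + L1] for i in range(len(genome) - L1 + 1)]
    counts = Counter(windows)
    return [1 / counts[w] for w in windows]


# In AAAC the window AA occurs twice and AC once.
scores = alignment_capability("AAAC", 2)
unique_starts = [i for i, s in enumerate(scores) if s == 1]
```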
  • the number of negative samples is no less than 20, or preferably no less than 30.
  • The amount of reads in the negative samples located to the corresponding first chromosome is determined as follows: a negative sample is substituted for the sample to be tested and passed through the sequencing module, the alignment module, and the quantification module to determine the amount of reads located to the first chromosome of that negative sample; the mean of the amounts of reads of the first chromosome over the plurality of negative samples is taken as the amount of reads in the negative samples located to the corresponding first chromosome.
  • the first reference sequence is at least a portion of the human reference genome hg19 sequence from which the regions shown in Table 1 have been removed.
  • the first reference sequence is at least a portion of a reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the average sequencing depth of all second windows.
  • the first reference sequence is at least a portion of a reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 6 times the average sequencing depth of all second windows.
  • the first reference sequence is at least a portion of a reference genome in which the regions corresponding to second windows have been processed as follows: a second window whose sequencing depth lies above the 98th percentile is assigned the sequencing depth of the second window at the 98th percentile.
  • a second window whose sequencing depth lies above the 99th percentile is assigned the sequencing depth of the second window at the 99th percentile.
  • the second window is obtained by sliding a window of size L2 along the reference genome; in one example, the step size of the sliding window is also L2.
  • the sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the second window size L2.
  • the device further comprises a filtering module for performing at least one of (i)-(iii): (i) removing reads in the sequencing result whose length is not greater than a predetermined length; (ii) removing reads in the alignment result that are not located to a unique position on the first reference sequence; (iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases on the aligned read that are at least one of insertions, deletions, and mismatches.
  • the quantification module is configured to: (a) slide a window of size L3 along the first reference sequence to obtain a plurality of third windows; (b) determine the sequencing depth of each third window based on the alignment result,
  • the sequencing depth of a third window being the ratio of the number of reads aligned to that third window to the third window size L3; and (c) determine the amount of reads located to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.
  • (b) further includes standardizing the sequencing depth of the third window and using the standardized depth as the sequencing depth of the third window.
  • (b) further includes correcting the sequencing depth of the third window based on the GC content of the third window and using the corrected depth as the sequencing depth of the third window.
  • the sequencing depth of the third window before correction may be the standardized sequencing depth of the third window.
  • the correction uses the relationship between the GC content of the third window and its sequencing depth; in one example, this relationship is established by locally weighted regression.
  • (c) includes determining, based on the sequencing depth of the third window, a weight coefficient for the reads aligned to that third window, and determining the amount of reads located to the first chromosome based on the weight coefficients.
  • the sample to be tested is a maternal blood sample.
  • the first chromosome is at least one of fetal chromosomes 13, 18, and 21.
  • An embodiment of the present invention provides a computer-readable storage medium for storing/carrying a program for execution by a computer, execution of the program including carrying out the chromosomal aneuploidy detection method of any of the above embodiments or examples.
  • the above description of the technical features and effects of the method and/or device for detecting chromosomal aneuploidy in any embodiment or example of the present invention applies equally to the computer-readable storage medium of this embodiment and is not repeated here.
  • An embodiment of the present invention further provides a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the chromosomal aneuploidy detection method of any of the above embodiments or examples.
  • the reference sequence used is the set of regions of the human reference genome that exclude the regions shown in Table 1 and that satisfy both of the following: 1) the alignment capability is 1; 2) regions whose sequencing depth is not less than 6 times the average sequencing depth have been removed, or regions whose sequencing depth lies above the 99th percentile have been assigned the 99th-percentile sequencing depth value.
  • Sequencing: obtaining the off-machine data, that is, the set of reads; removing reads shorter than 25 bp;
  • the mean μ_i and the variance σ_i of the ratio values of chromosome i are determined;
  • if the Zscore of a chromosome in a maternal peripheral blood sample to be tested is ≥ 3, the difference is statistically significant, and the fetus carried by the mother can be considered to have three copies of that chromosome.
  • the distribution of the ratio values of chromosome i across the plurality of control samples follows, or approximately follows, a normal distribution, so the z value and the corresponding confidence level can be looked up in a z table (normal distribution table); for example, for a confidence level of 99.97% the corresponding z value is roughly 3, and a sample exceeding this z value has a 99.97% probability of not being a normal sample and can be judged abnormal.
  • those skilled in the art can set other confidence levels, and then use the corresponding z value as a threshold to determine whether there is an abnormality.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

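A method, device, and system for detecting chromosomal aneuploidy are provided. The method includes: sequencing at least a portion of the nucleic acids in a sample to be tested to obtain a sequencing result including reads; aligning the reads to a first reference sequence to obtain an alignment result that includes information on the specific chromosome to which each read is located; for a first chromosome, determining the amount of reads located to that first chromosome based on the alignment result; and comparing the amount of reads located to the first chromosome with the amount of reads located to the corresponding first chromosome in negative samples, to determine the number of the first chromosome.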

Description

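Method, Device, and System for Detecting Chromosomal Aneuploidy
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method, device, and system for detecting chromosomal aneuploidy.
Background
Down syndrome (trisomy 21), Edwards syndrome (trisomy 18), and Patau syndrome (trisomy 13) are the most common chromosomal aneuploidy disorders in newborns, with incidences of 1/700 [Papageorgiou, E.A. et al. Fetal-specific DNA methylation ratio permits noninvasive prenatal diagnosis of trisomy 21. Nat. Med. 17, 510–513 (2011).], 1/6,000, and 1/10,000 [Driscoll, D.A. & Gross, S. Prenatal Screening for Aneuploidy. N. Engl. J. Med. 360, 2556–2562 (2009).], respectively. These aneuploidies cause high morbidity and mortality. Amniocentesis and chorionic villus sampling are the standard methods for diagnosing fetal chromosomal abnormalities, but these procedures themselves carry a miscarriage rate of 0.6% to 1.9%. To avoid these risks, safer non-invasive methods for detecting fetal aneuploidy (NIPT) that can be used at earlier gestational ages are needed.
In 1997, Yuk-Ming Dennis Lo [Lo, Y.M.D. et al. Presence of fetal DNA in maternal plasma and serum. Lancet 350, 485–487 (1997).] first reported the detection of cell-free fetal DNA (cffDNA) in pregnant women, which made it possible to examine the genetic status of the fetus through maternal blood. cffDNA is reported to account for about 4-10% of maternal cell-free DNA in the first and second trimesters and 10-20% in the third trimester. In 2008, Lo [Chiu, R.W.K. et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc. Natl. Acad. Sci. 105, 20458–20463 (2008).] and Stephen Quake [Chitkara, U. et al. Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc. Natl. Acad. Sci. U.S.A. 105, 16266–71 (2008).] each reported detecting fetal chromosomal aneuploidy with next-generation sequencing (NGS). The number of sequencing platforms applicable to genetic testing continues to grow.
When chromosomal aneuploidy is detected from the off-machine data of these platforms, the sensitivity and/or accuracy of detection still needs improvement, and many factors affect it. For example, the length of the off-machine data produced by different sequencing platforms varies widely; off-machine data are also called reads, and the length of a read, the read length, ranges from tens of bp to thousands of bp and affects at least the confidence with which the data are matched (located). As another example, the sequencing error rate also affects the confidence of read localization: in general, the higher the error rate, the lower the confidence.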
发明内容
本发明实施方式旨在至少解决相关技术中存在的技术问题之一或者至少提供一种可选择的实用方案。
依据本发明的一个实施方式,提供一种检测染色体非整倍性变异的方法,包括:(1)对待测样本中的至少一部分核酸进行测序,获得包括读段的测序结果;(2)将读段比对到第一参考序列,获得比对结果,比对结果包括读段定位于具体染色体的信息,第一参考序列为参考基因组上的比对能力为1的区域的集合,比对能力为1的区域是指定位到参考基因组上唯一位置的区域;(3)对于第一染色体,基于比对结果,确定定位到该第一染色体的读段的量;(4)比较定位到该第一染色体的读段的量与阴性样本中的定位到相应第一染色体的读段的量,以判定该第一染色体的数目。
该方法包括利用特定的参考序列对读段进行筛选以及定位,能够快速简单地实现染色体非整倍性检测,获得准确的检测结果。适用于基于各种测序平台的下机数据的检测分析,特别适用于对包 含未能识别的碱基的读段的检测分析,即包含空缺(gap)的读段的处理分析。
依据本发明的另一实施方式,提供一种检测染色体非整倍性变异的装置,该装置用以实施上述本发明实施方式中的检测染色体非整倍性的方法,该装置包括:测序模块:该测序模块用于对待测样本中的至少一部分核酸进行测序,获得包括读段的测序结果;比对模块:该比对模块用于将来自测序模块的读段比对到第一参考序列,获得比对结果,比对结果包括读段定位于染色体的信息,第一参考序列为参考基因组上的比对能力为1的区域的集合,比对能力为1的区域是指定位到参考基因组上唯一位置的区域;定量模块:对于第一染色体,该定量模块用于基于来自比对模块的比对结果,确定定位到该第一染色体的读段的量;判断模块:该判断模块用于比较来自定量模块的定位到该第一染色体的读段的量与阴性样本中的定位到相应第一染色体的读段的量,以判定该第一染色体的数目。
依据本发明的另一实施方式,还提供一种计算机可读介质,用于存储/承载计算机可执行程序,在执行该程序时,通过指令相关硬件可完成上述本发明实施方式中的检测染色体非整倍性的方法的全部或部分步骤。所称介质包括但不限于:只读存储器、随机存储器、磁盘或光盘等。
依据本发明的又一实施方式,提供一种终端,一种检测染色体非整倍性变异的系统,该系统包含计算机可执行程序,该系统包括处理器,该处理器能够用于执行上述计算机可执行程序,执行计算机可执行程序包括完成上述本发明实施方式中的检测染色体非整倍性的方法。
上述任一实施方式提供的检测染色体非整倍性的方法、装置和/或系统可以用于染色体非整倍性变异检测,获得的检测结果具有较高灵敏度和准确性。适用于基于各种测序平台的下机数据的检测分析,特别适用于对包含未能识别的碱基的读段的检测分析,即包含空缺(gap)的读段的处理分析。
本发明实施方式的附加方面和优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本发明实施方式的实践了解到。
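Brief Description of the Drawings
Figure 1 is a schematic diagram of the distance between two adjacent entries of the reference library in the alignment approach used in specific embodiments of the present invention.
Figure 2 is a schematic diagram of the connectivity length in the alignment approach used in specific embodiments of the present invention.
Figure 3 is a schematic diagram of the relationship between the coefficient of variation and window size in specific embodiments of the present invention.
Figure 4 is a schematic diagram of the relationship between chromosome-normalized sequencing depth and chromosome GC content in specific embodiments of the present invention.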
具体实施方式
下面详细描述本发明的实施方式,所述实施方式的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施方式是示例性的,仅用于解释本发明,而不能理解为对本发明的限制。
在本发明的描述中,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量或者顺序。在本发明的描述中,“多个”的含义是两个或两个以上,除非另有明确具体的限定。
本发明实施方式所称的测序,也称为序列测定,指核酸序列测定,包括DNA测序和/或RNA测序,包括长片段测序和/或短片段测序。
测序可以通过测序平台进行,测序平台可选择但不限于Illumina公司的Hisq/Miseq/Nextseq测序平台、Thermo Fisher/Life Technologies公司的Ion Torrent平台、华大基因的BGISEQ平台和单分子测序平台;测序方式可以选择单端测序,也可以选择双末端测序;获得的测序结果/数据即测读出 来的片段,称为读段(reads)。读段的长度称为读长。
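An embodiment of the present invention provides a method for detecting chromosomal aneuploidy, where chromosomal aneuploidy includes an abnormal amount of a chromosome or of part of a chromosome. The method includes: (1) sequencing at least a portion of the nucleic acids in a sample to be tested to obtain a sequencing result including reads; (2) aligning the reads to a first reference sequence to obtain an alignment result that includes information on the specific chromosome to which each read is located, the first reference sequence being the set of regions on the reference genome whose alignment capability is 1, a region with an alignment capability of 1 being one that locates to a unique position on the reference genome; (3) for a first chromosome, determining the amount of reads located to that first chromosome based on the alignment result; (4) comparing the amount of reads located to the first chromosome with the amount of reads located to the corresponding first chromosome in negative samples, to determine the number of the first chromosome.
The method uses a specific reference sequence to screen and locate reads, detects chromosomal aneuploidy quickly and simply, and gives accurate results. It is suitable for analyzing the off-machine data of various sequencing platforms, and particularly for reads containing unidentified bases, that is, reads containing gaps.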
可以针对整个染色体组(基因组)或者几条染色体或者染色体的部分区域进行测序,一般地,这主要与目标染色体或区域的特点包括该目标染色体或区域与其它染色体或区域的关联有关系。
所称的“比对”指序列比对,包括将一条或多条序列定位到另一条或多条序列的过程以及获得的定位结果。例如,包括读段定位到参考序列上的过程,也包括获得读段定位/匹配结果的过程。
所称的参考序列(reference,ref)同参考染色体序列,为已确定的序列,可以是自己预先测定组装的DNA和/或RNA序列,也可以是他人测定公开的DNA和/或RNA序列,可以是预先获得的样本来源个体/目标个体所属生物类别中的任意的参考模板,例如,同一生物类别的已公开的基因组组装序列的全部或者至少一部分。若样本来源个体或者目标个体是人类,其基因组参考序列(也称为参考基因组或者参考染色体组)可选择UCSC、NCBI或ENSEMBL数据库提供的人类参考基因组,例如HG19、HG38、GRCh36、GRCh37、GRCh38等,本领域技术人员可以通过数据库的说明了解上述各参考基因组版本的对应关系,选择使用的版本。进一步地,也可以预先配置包含更多参考序列的资源库,例如在进行比对前,先依据目标个体的性别、人种、地域等因素选择或测定组装出更接近或更具某方面特性的序列来作为参考序列,有助于后续获得更准确的序列分析结果。所称参考序列包含染色体编号以及各个位点在染色体上的位置信息。
所称的第一参考序列为参考基因组的至少一部分,是发明人基于挖掘发现已公开下机数据集的特点结合所使用的测序平台特点、下机数据特性包括读长/错误率/数据质量等因素以及针对检测染色体非整倍性变异的目的进行尝试而构建出的版本,利用该第一参考序列进行读段的定位,有利于快速得到定位结果以及减少后续步骤所需处理的数据量。
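In certain embodiments, the alignment capability of a region is determined as follows: a first window of size L1 is slid along the reference genome to obtain a plurality of regions; each region is aligned to the reference genome, and the alignment capability of the region is calculated from the number of positions on the reference genome to which the region aligns.
Each region or window corresponds to a stretch of sequence on the reference genome. The size of the first window and/or the step size of the sliding window can be set according to the detection purpose, the variant-detection principle used, the read length, and the sequence characteristics of the reference genome. Preferably, the step size is set no larger than the size of the first window, so that as many regions with an alignment capability of 1 as possible are retained on the reference genome, which helps improve utilization of the off-machine data.
In general, L1 can be set according to the read length, for example to any integer within 0.5 to 2 times the read length or the average read length, and the step size can be set to any integer less than 0.5, 0.2, or 0.1 times the read length. In one example, the reference genome is HG19 from the UCSC database, the read length is 25 bp, L1 is set to 25 bp, and the step size is set to less than 10 bp, less than 5 bp, or less than 2 bp. For example, with a step size of 1 bp, two adjacent first windows overlap by (L1-1) bp, which makes it possible to obtain the full set of regions on the reference genome that satisfy the requirement, to make full use of the sequencing result for a more complete alignment result, and to improve data utilization.
Specifically, in one example, the alignment capability of a region is the reciprocal of the number of positions on the reference genome to which the region aligns. For example, a region that aligns to a unique position on the reference genome has an alignment capability of 1, whereas a region that can align to 5 positions on the reference genome has an alignment capability of 1/5.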
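The reciprocal definition of alignment capability can be illustrated with a minimal Python sketch. It assumes exact string matching in place of a real aligner and ignores the reverse strand; the function names `alignment_capability` and `unique_regions` are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def alignment_capability(reference: str, window: int, step: int = 1) -> dict:
    """Map each window start to 1 / (number of positions the window matches)."""
    # Index every window-sized substring so that occurrences can be counted.
    occurrences = defaultdict(list)
    for start in range(len(reference) - window + 1):
        occurrences[reference[start:start + window]].append(start)
    capability = {}
    for start in range(0, len(reference) - window + 1, step):
        region = reference[start:start + window]
        capability[start] = 1.0 / len(occurrences[region])
    return capability


def unique_regions(reference: str, window: int, step: int = 1) -> list:
    """Starts of the regions whose alignment capability is exactly 1."""
    capability = alignment_capability(reference, window, step)
    return [start for start, value in capability.items() if value == 1.0]


# Toy reference: the 4-mer ACGT occurs at positions 0 and 4.
toy_reference = "ACGTACGTTT"
print(alignment_capability(toy_reference, 4))
print(unique_regions(toy_reference, 4))
```

In this toy case the two ACGT windows get a capability of 1/2 and would be left out of the first reference sequence, while the remaining windows are kept.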
第一参考序列可以在进行检测目标样本时构建,也可以预先构建保存已备检测样本时调用。
在某些实施方式中,第一参考序列为去除了表1所示区域的人参考基因组的至少一部分。由于显示全部去除区域序列需要大量篇幅,表1以这些欲去除/屏蔽区域在人参考基因组HG19中的位置信息来表示这些区域,可以理解地,这些区域在不同版本人参考基因组中可能对应不同的染色体起始位置信息,但不妨碍本领域技术人员确定及屏蔽如下表的该些区域序列来获得第一参考序列。屏蔽/去除这些区域后的参考序列,利于后续步骤的快速进行和获得准确的检测结果。
表1
Figure PCTCN2018085865-appb-000001
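In other embodiments, the first reference sequence is at least a portion of a reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than (greater than or equal to) 4 times the average sequencing depth of all second windows, preferably not less than 6 times that average. In other words, second windows on the reference genome whose sequencing depth is far above the average sequencing depth are removed or masked.
Sequencing depth, also called depth, is the number of times a region is covered; it can be expressed as the ratio of the number of reads aligned to the region to the size of the region. The sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the size of that second window.
The second windows can be obtained by sliding a window of size L2 along the reference genome, yielding a series of second windows of size L2. Adjacent second windows may or may not overlap. In one example, the step size is set to L2, so that adjacent second windows neither overlap nor leave a gap (zero-base overlap and zero-base spacing); the reference genome is thereby converted into a series of second windows that cover it exactly once and can be used to represent the genome.
Removing or masking specific regions of the reference genome means that, when the processed reference sequence (the first reference sequence) is used for the alignment in step (2), the influence of some abnormal data on subsequent statistical analysis is eliminated.
According to specific embodiments of the present invention, for second windows whose sequencing depth deviates markedly from the mean or median sequencing depth, the depth of the window can instead be reassigned, giving a series of second windows with relatively balanced sequencing depth; using a first reference sequence containing such windows for the alignment in step (2) likewise eliminates the influence of some abnormal data on subsequent statistical analysis. In one example, second windows whose sequencing depth lies above the 98th percentile are assigned the sequencing depth of the second window at the 98th percentile, or those above the 99th percentile are assigned the sequencing depth of the second window at the 99th percentile; the first reference sequence obtained in this way helps eliminate the influence of abnormal data or regions and yields accurate detection results. For example, all second windows can be sorted by sequencing depth from low to high, and the sequencing depth of every second window ranked between the 99th and 100th percentiles reassigned, for example to the sequencing depth of the second window at the 99th percentile, thereby eliminating the influence of windows with abnormally high sequencing depth on subsequent detection.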
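The two ways of handling abnormally deep second windows (removal above a multiple of the mean, or reassignment at a percentile) might be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the nearest-rank percentile is one possible reading of "the second window at the 99th percentile".

```python
def remove_high_depth_windows(depths, fold=4):
    """Return indices of windows kept, i.e. with depth below fold * mean depth."""
    mean_depth = sum(depths) / len(depths)
    return [i for i, d in enumerate(depths) if d < fold * mean_depth]


def cap_depth_at_percentile(depths, percentile=99):
    """Reassign depths above the given percentile to the depth at that percentile."""
    ranked = sorted(depths)
    # Nearest-rank percentile with integer arithmetic: ceil(percentile * n / 100).
    rank = max(1, (percentile * len(ranked) + 99) // 100)
    cap = ranked[rank - 1]
    return [min(d, cap) for d in depths]


# Nine ordinary windows and one window with abnormally high depth.
toy_depths = [1.0] * 9 + [91.0]
print(remove_high_depth_windows(toy_depths, fold=4))
print(cap_depth_at_percentile(list(range(1, 101)), percentile=99)[-1])
```

Removal shrinks the first reference sequence, whereas capping keeps every window and only flattens the extreme values.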
第二窗口的大小L2可以根据需要以及测序结果调整确定,较佳地,希望第二窗口的大小与大部分测序深度异常高和/或低的区域/窗口的大小基本一致。在某些实施方式中,样本为人类样本,参考基因组为人参考基因组,基于对测序结果和/或比对结果的初步统计,L2可设为10-20Kbp,较佳的12-17Kbp;在一个示例中,发明人发现,当L2设定为15Kbp时,能够较全面找出异常区域/第二窗口。
发明人对测序结果初步统计发现,重复序列区域通常属于测序深度异常区域。去除这些测序深度异常区域/窗口或者对该些区域进行重新赋值,相较于对这些区域不作处理,检测结果的准确性和/或灵敏度有明显提升。
可以利用已知比对软件进行比对,例如SOAP、BWA、BLAST、MAPQ和TeraMap等,本实施方式对此不作限制。在比对过程中,可根据比对参数的设置,例如设置一对或一条读段最多允许有n个碱基错配(mismatch),例如设置n为1或2,若读段中有超过n个碱基发生错配,则视为该条或该对读段无法比对到第一参考序列,或者,若错配的n个碱基全部位于读段对中的一个读段,则视为该读段对中的该读段无法比对到该参考序列。
根据本发明的具体实施方式,步骤(2)中的比对包括:(a)将每条读段转化成与该读段对应的一组短片段,获得多组短片段;(b)确定短片段在参考库的对应位置,以获得第一定位结果,所称的参考库为基于第一参考序列构建的哈希表,参考库包含多个条目,参考库的一个条目对应一条种子序列,所称的种子序列能够与第一参考序列上的至少一段序列匹配,参考库的相邻两个条目对应的两条种子序列在第一参考序列上的距离小于短片段的长度;(c)去除第一定位结果中定位到参考库相邻条目中的任一条目上的短片段,获得第二定位结果;(d)基于所述第二定位结果中来自相同读段的短片段进行延伸,以获得读段的比对结果。利用上述比对方法,通过将读段转化成短片段以及将读段序列信息转化成位置信息,即将序列形态转成数字形态,利于快速准确地实现各种测序平台的下机数据的比对定位。特别是对于包含未能识别的碱基的读段,即包含gap或者N的读段的快速准确比对,比如由于测序质量不佳、碱基识别不佳等所获得的读段的比对分析,尤其适用。
所称的参考库实质为哈希表(hash table),可以直接以所称的种子序列为键(键名)、以所称的种子序列在参考序列上的位置(position)为值(键值)构建该参考库;也可以先将所称的种子序列转成数字或者整数字符串,以该数字或者整数字符串为键、以种子序列在参考序列上的位置为值建立该参考库。所称的以种子序列在参考序列上的位置为值,可以是该种子序列在参考序列/染色体上 对应的一个或多个位置,位置可直接以真实数值或者数值范围表示,也可以重新编码以自定义的字符和/或数字表示。
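In one example, the hash table is built with the C++ vector and can be expressed as Hash(seed)=Vector(position). A vector is an object that can hold many elements of the same type and is therefore also called a container. The table can be stored in binary form to build the reference library. Alternatively, the hash table can be stored in blocks, with a head key and a tail key set in the block header. For example, for the ordered block {5,6,7,8...,19,20}, the head and tail (or table head and table tail) are set to 5 and 20; for the number 3, since 3<5, 3 does not belong to this block; for the number 10, since 5<10<20, 10 belongs to this block. A query can therefore either use a global index or locate the relevant block quickly by comparing against the head and tail keys, without needing a global index.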
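The head-key/tail-key lookup can be sketched in a few lines of Python. The sketch assumes that blocks are sorted by head key and do not overlap, which the text does not state explicitly; `find_block` is a hypothetical name.

```python
from bisect import bisect_right


def find_block(blocks, key):
    """Return the index of the block whose [head, tail] range holds key, else None.

    blocks: list of (head_key, tail_key) tuples, sorted by head key.
    """
    heads = [head for head, _ in blocks]
    # The candidate is the last block whose head key does not exceed the key.
    i = bisect_right(heads, key) - 1
    if i >= 0 and blocks[i][0] <= key <= blocks[i][1]:
        return i
    return None


toy_blocks = [(5, 20), (21, 40)]
print(find_block(toy_blocks, 3))   # 3 < 5, so it belongs to no block
print(find_block(toy_blocks, 10))  # 5 <= 10 <= 20, so it falls in block 0
```

Only the block headers are compared, so a block can be located without consulting a global index of every key.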
所称的参考库可以在要进行序列比对时构建,也可以预先构建保存。根据本发明的具体实施方式,预先构建参考库保存备用,参考库的构建包括:依据参考序列的碱基总数totalBase,确定种子序列的长度L,L=μ*log(totalBase),
Figure PCTCN2018085865-appb-000002
且L小于待比对分析的读段的长度(读长);基于种子序列的长度,生成所有可能的种子序列,获得种子序列集;确定种子序列集中能够匹配到参考序列的种子序列以及该种子序列的匹配位置,以获得该参考库。该方法基于发明人多次假设试验验证而建立的种子序列长度与参考序列的关系,能够使构建得的参考库包含全面的种子序列以各种子序列在参考序列上对应的位置的关联信息,该参考库结构紧凑,内存占用小且能够用于序列定位分析中的高速访问查询。根据该实施方式获得的参考库的一个条目只包含一个键,一个键对应至少一个值。
本发明该具体实施方式,对生成所有可能的种子序列、获得种子序列集的方法不作限制,对于输入一个集合,可遍历该集合中的元素来获得特定长度的、所有可能的元素组合,例如可以利用递归算法和/或循环算法来实现。
在一个例子中,第一参考序列为人基因组的至少一部分,人基因组包含大约30亿个碱基,待处理的读段的长度为不小于25bp,L取[11,15]中的整数,利于高效比对。
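In one example, the first reference sequence is at least a portion of the human cDNA reference genome. The total number of bases totalBase of the reference sequence is counted, and the seed sequence length L is set from it: L(seed)=log(totalBase)*μ,
Figure PCTCN2018085865-appb-000003
Based on L and the four DNA base types A, T, C, and G, a recursive algorithm generates the set of all possible seed sequences, which can be expressed as seed = B_1 B_2 ... B_L, B ∈ {A, T, C, G}. The seed sequences in this set that can be matched to the reference sequence, and their matching positions, are then determined, and the reference library is built with each matching seed sequence as the key and its position(s) on the reference sequence as the value.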
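A minimal Python sketch of building such a seed-to-position hash table follows. Several things here are assumptions: the seed length is passed in directly because the range of μ is given only in a figure; `itertools.product` stands in for the recursive enumeration; and, for exact matching, the reference is indexed directly, which yields the same entries as enumerating all seeds and then keeping those that match. The function names are hypothetical.

```python
from itertools import product


def all_seeds(length: int, alphabet: str = "ATCG") -> list:
    """Every possible seed of the given length over the alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


def build_reference_library(reference: str, seed_len: int, step: int = 1) -> dict:
    """Hash table: seed sequence -> list of 0-based start positions on the reference."""
    library = {}
    for pos in range(0, len(reference) - seed_len + 1, step):
        library.setdefault(reference[pos:pos + seed_len], []).append(pos)
    return library


toy_library = build_reference_library("ATCGATCG", seed_len=4)
print(toy_library)
print(len(all_seeds(2)))
```

Each entry has one key and at least one value, so a seed that matches several places on the reference simply carries a longer position list.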
在一个例子中,第一参考序列为某物种的DNA基因组和转录组的至少一部分,统计该参考序列的碱基总数totalBase,基于碱基总数设定种子序列(seed)的长度L,L(seed)=log(totalBase)*μ,
Figure PCTCN2018085865-appb-000004
基于L、组成DNA序列的碱基种类包括A、T、C和G四种以及组成RNA序列的碱基种类包括A、U、C和G四种,利用递归算法,生成所有可能的种子序列的集合,获得种子序列集,该过程可表示为seed=B 1B 2...B L,B∈{ATCG}∪{AUCG};确定种子序列集中能够匹配到该参考序列的种子序列以及该种子序列的匹配位置,以能够匹配到该参考序列的种子序列为键、以该种子序列在参考序列上的位置位为值来构建获得该参考库。
在一个例子中,可将种子序列转化成由数字字符组成的字符串,以该字符串为键来建库,能够 提高访问查询所建参考库的速度。例如,在获得能够匹配到第一参考序列的种子序列后,对种子序列进行如下编码:
Figure PCTCN2018085865-appb-000005
又例如,在获得种子序列集后,对种子序列集中的种子序列进行编码,碱基编码规则可与上面相同,且对第一参考序列也可以进行同样规则的编码转换,利于快速获得种子序列在参考序列上对应的位置信息,也利于提高所建参考库的访问查询速度。
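According to specific embodiments of the present invention, determining the seed sequences in the seed set that can be matched to the first reference sequence, and their matching positions, includes: sliding a window of size L along the first reference sequence and matching the seed sequences in the seed set against the window sequences so obtained, the matching being performed with an error tolerance of ε1. In this way, the positions on the first reference sequence corresponding to the seed sequences are obtained quickly, which helps build the reference library quickly. The error tolerance is the permitted proportion of mismatched bases, a mismatch being at least one of a substitution, an insertion, and a deletion.
In one example, the matching is strict, that is, ε1 is zero: when a seed sequence is identical to one or more window sequences, the positions of those window sequences are the positions of the seed sequence on the first reference sequence. In another example, the matching is error-tolerant, with ε1 greater than zero: when the proportion of positions at which a seed sequence differs from one or more window sequences is less than the error tolerance, the positions of those window sequences are the positions of the seed sequence on the first reference sequence. In one example, the positions of the seed sequences on the first reference sequence are encoded, and the reference library is built with the encoded characters, for example numeric characters, as values.
Put another way, a non-zero ε1 amounts to converting a seed sequence into a group of seed templates permitted by ε1. For example, with seed=ATCG and ε1=0.25, that is, one error allowed in four bases, the seed templates can be ATCG, TTCG, CTCG, GTCG, AACG, ACCG, AGCG, and so on. Determining the positions of seed=ATCG on the reference sequence with ε1=0.25 then amounts to determining the positions on the first reference sequence of all seed templates of that seed: for ref=ATCG, all of the seed templates shown above match that position; for ref=TTCG, the seed templates ATCG, TTCG, CTCG, and GTCG match that position. The reference library can accordingly use one seed as a key, or use each of the seed templates of that seed as a key; keys are distinct, and each key corresponds to at least one value.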
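The expansion of a seed into its seed templates can be sketched as below. This illustration covers substitutions only (insertions and deletions are left out), and it allows floor(L*ε1) errors so that ε1=0.25 on a 4-base seed permits one error, as in the ATCG example; `seed_templates` is a hypothetical name.

```python
import math
from itertools import combinations, product


def seed_templates(seed: str, tolerance: float, alphabet: str = "ATCG") -> set:
    """All sequences within floor(len(seed) * tolerance) substitutions of the seed."""
    max_errors = math.floor(len(seed) * tolerance + 1e-9)
    templates = {seed}
    for n_errors in range(1, max_errors + 1):
        for sites in combinations(range(len(seed)), n_errors):
            for bases in product(alphabet, repeat=n_errors):
                # Skip choices that leave a selected site unchanged.
                if any(seed[i] == b for i, b in zip(sites, bases)):
                    continue
                template = list(seed)
                for i, b in zip(sites, bases):
                    template[i] = b
                templates.add("".join(template))
    return templates


print(sorted(seed_templates("ATCG", 0.25)))
```

For a 4-base seed with one substitution allowed, this gives the seed itself plus 4 x 3 single-base variants.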
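According to specific embodiments of the present invention, when determining the positions of seed sequences on the reference sequence, the step size for sliding along the first reference sequence is determined from L and ε1. In one example, the step size is not less than L*ε1. In one specific example, the first reference sequence is at least a portion of the human genome, which contains about 3 billion bases, the reads to be processed are not shorter than 25 bp, L is 14 bp, ε1 is 0.2-0.3, and the step size is 3 bp-5 bp, so that adjacent windows can step over the consecutive error combinations possible under ε1, which favors fast localization. In one example, the distance between two adjacent entries of the reference library so built is the step size of the sliding window.
According to specific embodiments of the present invention, (a) includes: sliding a window of size L along the read with a step size of 1 bp to obtain the group of short fragments corresponding to that read. Thus a read of length K yields (K-L+1) short fragments of length L. The read is converted into short fragments, the reference library is queried at high speed to determine the position of each short fragment in the library, and the information of the corresponding read in the reference library is thereby obtained.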
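Splitting a read into its (K-L+1) short fragments and looking them up can be sketched as follows. The library is assumed to be a plain dictionary from seed sequence to positions, with exact-match lookup; both function names are hypothetical.

```python
def read_to_fragments(read: str, frag_len: int, step: int = 1) -> list:
    """Return (offset, fragment) pairs obtained by sliding a window over the read."""
    return [(i, read[i:i + frag_len])
            for i in range(0, len(read) - frag_len + 1, step)]


def locate_fragments(fragments, library):
    """Keep fragments found in the library and attach their reference positions."""
    return [(offset, frag, library[frag])
            for offset, frag in fragments if frag in library]


# A 25 bp read with L = 14 gives 25 - 14 + 1 = 12 fragments.
toy_read = "ACGTTGCAACGTAGCTAGGATCCTA"
print(len(read_to_fragments(toy_read, 14)))
```

Keeping the offset of each fragment within the read makes it possible to relate fragment hits back to the read later on.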
根据本发明的具体实施方式,(b)包括:将短片段与参考库的条目对应的种子序列进行匹配, 以确定短片段在参考库的位置,进行匹配的容错率为ε 2
在一个例子中,所称的匹配为严格匹配,即容错率ε 2为零,当一条短片段与参考库的一个条目对应的seed或者seed template完全一致时,获得该短序列在参考库上的位置信息。
在另一个例子中,所称的匹配为容错匹配,容错率ε 2大于零,当短序列与参考库的一个或多个条目对应的seed或者seed template不匹配的碱基的比例小于容错率ε 2时,获得该短序列在参考库上的位置信息。
在一个例子中,ε 2=ε 1且不为零,使能够获得尽可能多的有效数据。
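According to specific embodiments of the present invention, and with reference to Figure 1, the distance in (b) on the reference sequence ref between the two seed sequences X1 and X2 corresponding to two adjacent entries of the reference library falls into two cases. When the key and the value of both entries are unique, that is, each entry corresponds to one [key, value] pair (Figure 1a), X1 and X2 each match the reference sequence uniquely (each matches only one position), and the distance is the distance between those two positions on the reference sequence. When the key of at least one of the two entries corresponds to multiple values (Figure 1b), at least one of X1 and X2 matches the reference sequence non-uniquely, that is, matches multiple positions, and the distance is the distance between the two closest positions of X1 and X2 on the reference sequence. This embodiment does not restrict how the distance between two sequences is expressed; for example, it can be the distance from either end of one sequence to either end of the other, or the distance from the center of one sequence to the center of the other.
According to specific embodiments of the present invention, after the second positioning result is obtained, (c) further includes: removing short fragments whose connectivity length is less than a predetermined threshold and replacing the second positioning result with the result after removal. The connectivity length is the total length, when mapped onto the reference sequence, of the short fragments in the second positioning result that come from the same read and locate to different entries of the reference library. This processing helps remove excessively redundant and/or relatively low-quality data and improves alignment speed.
The connectivity length can be expressed as the sum of the lengths of the short fragments that come from the same read and locate to different entries of the reference library, minus the length of the overlaps between those fragments when mapped onto the reference sequence. In one example, four short fragments from one read locate to different entries of the reference library, denoted Y1, Y2, Y3, and Y4, with lengths S1, S2, S3, and S4; Y1 and Y2 overlap where they map onto the first reference sequence, the overlap having length J, so the connectivity length is (S1+S2+S3+S4-J). In one example, all short fragments have length L and the predetermined threshold is L, which improves alignment speed at the cost of losing some valid but relatively low-quality data.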
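The connectivity length amounts to the total length covered by a read's fragments on the reference, counting overlapped bases once. A small Python sketch, assuming each fragment hit is given as a half-open (start, end) interval on the reference (a representation chosen here for illustration, with a hypothetical function name):

```python
def connectivity_length(intervals):
    """Total reference length covered by the intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # A new disjoint stretch begins; bank the previous one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent stretch: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Four 14 bp fragments; the first two overlap by J = 4 bp: 4 * 14 - 4 = 52.
toy_hits = [(0, 14), (10, 24), (40, 54), (80, 94)]
print(connectivity_length(toy_hits))
```

A read whose fragments cover less than the threshold, for example a single fragment length L, would then be dropped before the extension step.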
根据本发明的具体实施方式,在获得第二定位结果之后,(c)还包括:依据第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行评判,去除评判结果不符合预定要求的读段。去除读段同时也是去除了该读段对应的短片段。如此,在满足一定的敏感性和准确性的前提下,基于第二定位结果,直接进行精确匹配/局部快速比对,能够加速比对。
该实施方式对评判的方法不作限定,例如可以利用量化打分的方式。在一个例子中,对来自相同读段的短片短的定位结果进行打分,打分规则是:与第一参考序列匹配的位点作减分,与第一参考序列不匹配的位点作加分;在获得第二定位结果之后,依据第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不大于第一预设值的读段。
在一个具体示例中,读长为25bp,对来自相同读段的短片段进行序列构建,以获得重构序列,例如,可以根据具有更多短序列支持来确定某位点的碱基类型,若某位点没有支持的短片段即没有短片段比对到该位点,则该位点碱基类型不确定可以N来表示,以此来获得重构序列,可看出重构序列与读段是对应的,重构序列的长度为读长;重构序列与第一参考序列(ref)匹配的位点作减一分,与第一参考序列不匹配的位点作加一分,比对容错率即一条读段/重构序列允许的错配比例为0.12,比对容许错误的长度为3bp(25*0.12),初始分数Scoreinit为读长,第一预设值为22(25-3), 如此,去除得分小于22即匹配不上第一参考序列的位点占比超过比对容错率的重构序列,利于在允许丢失部分有效但质量相对低的数据的情况下,加速比对。
根据一个具体例子,使用位运算和动态规划算法[G.Myers.A fast bit-vector algorithm for approximate string matching based on dynamic progamming.Journal of the ACM,46(3):395-415,1999],对于每条重构序列,读入每个位点i的位置,利用64位的二进制掩模进行快速匹配计分,每个位点一分,初始分数Scoreinit为读长,可表示为Score init=length(read),匹配计分获得分数Score,可表示为:
Figure PCTCN2018085865-appb-000006
在一个例子中,对来自相同读段的短片短的定位结果进行打分,打分规则是:与第一参考序列匹配的位点作加分,与第一参考序列不匹配的位点作减分;在获得第二定位结果之后,依据第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不小于第二预设值的读段对应的短片段。
在一个具体示例中,读长为25bp,对来自相同读段的短片段进行序列构建,以获得重构序列,例如,可以根据具有更多短序列支持来确定某位点的碱基类型,若某位点没有支持的短片段即没有短片段比对到该位点,则该位点碱基类型不确定可以N来表示,以此来获得重构序列,可看出重构序列与读段是对应的,重构序列的长度为读长;重构序列与第一参考序列(ref)匹配的位点作加一分,与第一参考序列不匹配的位点作减一分,比对容错率即一条读段/重构序列允许的错配比例为0.12,比对容许错误的长度为3bp(25*0.12),初始分数Scoreinit为-25,第二预设值为-22(-25-3),如此,去除得分大于-22的重构序列,在允许丢失部分有效但相对低质量的数据的情况下,加速比对。
根据本发明的具体实施方式,(d)中的基于第二定位结果中来自相同读段的短片段进行延伸,包括:基于来自相同读段的短片段进行序列构建,获得重构序列;基于重构序列与该重构序列对应的参考序列的公共部分进行延伸,以获得延伸序列。如此,将短片段及短片段定位信息转化成短片段对应的读段(在此称为重构序列)的定位信息,利于后续比对处理快速准确的进行。
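The common part is the part shared by multiple sequences. According to specific embodiments of the present invention, the common part is a common substring and/or a common subsequence. A common substring is a contiguous part shared by the sequences, whereas a common subsequence need not be contiguous. For example, for ABCBDAB and BDCABA, BCBA is a common subsequence and AB is a common substring.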
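The difference between the two notions can be checked with the standard dynamic-programming recurrences. The sketch below computes only the lengths and uses the textbook formulation rather than anything specific to the patent; the function names are hypothetical.

```python
def longest_common_substring_len(x: str, y: str) -> int:
    """Length of the longest contiguous run shared by x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_subsequence_len(x: str, y: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[len(y)]


print(longest_common_substring_len("ABCBDAB", "BDCABA"))    # AB has length 2
print(longest_common_subsequence_len("ABCBDAB", "BDCABA"))  # BCBA has length 4
```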
所称的基于来自相同读段的短片段进行序列构建,获得重构序列,在一个例子中,可根据具有更多短片段支持来确定重构序列上某位点的碱基类型,若某位点没有支持的短片段即没有短片段比对到参考序列该位点,则该位点碱基类型不确定可以以N来表示,以此来获得所称的重构序列。可看出,重构序列与读段是对应的,重构序列的长度为读长。
所称的重构序列对应的参考序列,为与重构序列匹配的一段参考序列,该段参考序列的长度不小于读长。在一个例子中,重构序列对应的参考序列的长度与重构序列相同,均为读长。在另外一个例子中,允许重构序列与对应的参考序列容错匹配,重构序列对应的参考序列的长度为重构序列的长度加上两倍的容错匹配长度,例如,重构序列长度即读长为25bp,重构序列与参考序列的匹配允许错配12%,可以以重构序列对比上的那段参考序列以及该段参考序列两端各3bp(25*12%)序列来作为重构序列对应的参考序列。
根据本发明的一个具体例子,所称的公共部分为公共子串。(d)中的基于第二定位结果中来自相同读段的短片段进行延伸,包括:查找所述重构序列与所述重构序列对应的参考序列的公共子串,确定该重构序列和该重构序列对应的参考序列的最长公共子串;基于编辑距离,延伸该最长公共子串以获得延伸序列。如此,能够更加准确获得包含更长匹配序列的比对结果。
根据本发明的一个具体例子,所称的公共部分为公共子序列。(d)中的基于第二定位结果中来自相同读段的短片段进行延伸,包括:查找该重构序列与该重构序列对应的参考序列的公共子序列,确定该重构序列和该重构序列对应的参考序列的最长公共子序列;基于编辑距离,延伸所述最长公共子序列以获得延伸序列。
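The edit distance, also called the Levenshtein distance, is the minimum number of edit operations needed to convert one string into another. The edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings.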
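For reference, the unit-cost Levenshtein distance can be computed with the usual two-row dynamic program. This sketch uses a cost of 1 for every operation and is not the weighted gap/mismatch scoring described in the specific examples of the text; `edit_distance` is a hypothetical name.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


print(edit_distance("ATCG", "TTCG"))
```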
在一个例子中,对于一条重构序列/读段,查找该重构序列与该重构序列对应的参考序列的最长公共子串,可表示为求两个字符串x 1x 2...x i和y 1y 2...y j的公共子串,字符串的长度分别为m和n,计算这两字符串的公共子串的长度c[i,j],可以得到转移方程:
Figure PCTCN2018085865-appb-000007
解方程可得这两条序列的最长公共子串的长度为max(c[i,j]),i∈{1,...,m},j∈{1,...,n};接着利用编辑距离,将最长公共子串转化成对应的参考序列,可使最长公共子串两端不断生长,找出两个字符串之间需要的最小字符操作(替换,删除,插入)。可以使用动态规划算法确定编辑距离,该问题具备最优子结构,编辑距离d[i,j]的计算可表示为下列公式:
Figure PCTCN2018085865-appb-000008
其中,洞/空缺(gap)表示插入或者删除一个字符,公式中的gap表示插入或者删除一个字符(对应序列中的位点)所需的罚分,匹配(match)表示两个字符一样,公式中的match表示两个字符一样时的得分,错配(mismatch)表示两个字符不相等/不同,公式中的mismatch表示表示两个字符不相等/不同时的阀分。d[i,j]取三者中最小的一项。在一个具体例子中,一gap罚3分,连续gap增加阀1分,一个位点错配罚2分,位点匹配得0分。如此,利于含gap序列的高效比对。
根据本发明的具体实施方式,所称的公共部分为公共子序列。根据本发明的具体实施方式,(d)包括:查找第二定位结果中定位到参考库的相同条目的短片段的公共子序列,确定每条读段对应的最长公共子序列;基于编辑距离,延伸最长公共子序列以获得延伸序列。
在一个例子中,对于一条重构序列/读段,查找重构序列与该重构序列对应的参考序列的最长公共子序列,基于最长公共子序列,将最长公共子序列对应的那段重构序列转化为最长公共子序列对应的那段参考序列,可利用Smith Waterman算法找出这两段序列的编辑距离,对两个字符串x 1x 2...x i和y 1y 2...y j,可以通过以下公式求得:
Figure PCTCN2018085865-appb-000009
其中,
σ表示记分函数,σ(i,j)表示字符(位点)x i和y j错配或者匹配的得分,σ(-,j)表示x i空 缺(删除)或者y j插入的得分,σ(i,-)表示y j删除或者x i插入的得分;接着,可利用前面示例中的计算编辑距离的方法,将最长公共子序列对应的那段重构序列转化成重构序列对应的参考序列,可在最长公共子序列对应的那段重构序列的两端不断生长,找出最小字符操作(替换,删除,插入)。
在一个具体例子中,一gap罚3分,连续gap增加罚1分,一个位点错配罚2分,位点匹配得4分。如此,能够实现含gap的序列的高效比对且能够保留既含gap而其它位点准确度高的序列。
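According to specific embodiments of the present invention, (d) further includes: truncating the extended sequence from at least one end and calculating the proportion of mislocated sites in the truncated extended sequence, truncation stopping when that proportion is less than a third preset value. Truncating and discarding in this way retains well-matched local sequence and helps raise the proportion of usable data.
Specifically, according to specific embodiments of the present invention, the extended sequence is truncated as follows: i. A first error rate and a second error rate are calculated; if the first error rate is less than the second, the extended sequence is truncated from its first end, and if the first error rate is greater than the second, it is truncated from its second end, giving the truncated extended sequence. The first error rate is the proportion of mislocated sites in the extended sequence obtained by truncating from the first end, and the second error rate is the corresponding proportion obtained by truncating from the second end. ii. Step i is repeated with the truncated extended sequence in place of the extended sequence, until the proportion of mislocated sites in the truncated extended sequence is less than a fourth preset value. Truncating and discarding from both ends in this way retains well-matched local sequence and helps raise the proportion of usable data. In one specific example, the extended sequence is 25 bp long, and the fourth preset value equals the third preset value, 0.12.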
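The two-end truncation loop can be sketched as follows, trimming 1 bp at a time. The per-site error flags as input, the tie-breaking rule (trim the first end when both error rates are equal), and the name `truncate_extension` are assumptions made for this illustration; the text does not specify them.

```python
def truncate_extension(errors, threshold=0.12):
    """Trim 1 bp at a time from the end whose removal gives the lower error rate.

    errors: list of booleans, True where a site of the extended sequence is
    mislocated. Returns the (start, end) half-open range that is retained.
    """
    start, end = 0, len(errors)

    def rate(s, e):
        return sum(errors[s:e]) / (e - s)

    while end - start > 1 and rate(start, end) >= threshold:
        first = rate(start + 1, end)   # error rate after trimming the first end
        second = rate(start, end - 1)  # error rate after trimming the second end
        if first <= second:            # ties trim the first end (assumption)
            start += 1
        else:
            end -= 1
    return start, end


# Two mislocated sites at the first end of a 10 bp extended sequence.
flags = [True, True] + [False] * 8
print(truncate_extension(flags))
```

If a single remaining site is still mislocated, the loop stops with the threshold unmet, and a caller would need to discard that sequence.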
根据本发明的具体实施方式,(d)还包括:从延伸序列的至少一端对延伸序列进行滑窗,计算滑窗得的窗口序列的错误定位位点的比例,根据窗口序列的错误定位位点的比例对延伸序列进行截断,满足以下条件停止截断:滑窗得的窗口序列的错误定位位点的比例大于第五预设值。如此,采用截断和剔除的方式,可以较好的保留匹配良好的局部序列,利于提高数据的有效率。
具体地,根据本发明的具体实施方式,基于以下对延伸序列进行截断:i、计算第三错误率和第四错误率,若第三错误率小于第四错误率,则从延伸序列的第二端对延伸序列进行截断,若第三错误率大于第四错误率,则从延伸序列的第一端对延伸序列进行截断,以获得截断后的延伸序列,所称的第三错误率为从延伸序列的第一端对延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例,所称的第四错误率为从延伸序列的第二端对延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例;ii、以截断后的延伸序列替代延伸序列进行i,直至窗口序列的错误定位位点的比例大于第六预设值。如此,采用双端截断和剔除的方式,可以较好的保留匹配良好的局部序列,利于提高数据的有效率。
根据本发明的具体实施方式,滑窗的窗口不大于延伸序列的长度。根据一个具体例子,延伸序列的长度为25bp,滑窗的窗口大小为10bp,第六预设值为第五预设值为0.12。
根据本发明的具体实施方式,截断的大小为1bp,即一次截断为去掉1个碱基。如此,能够高效的获得包含更多长序列的比对结果。
为使差异比较结果具有统计意义,一般地,阴性样本为多个,例如阴性样本的数目不小于20个,较佳地,不小于30个。
所称的阴性样本为不具有染色体非整倍性异常的样本,例如变异检测的目标是人或者待测样本为来自人体的样本,则阴性样本为获自正常二倍体个体的样本。阴性样本测序结果的获得和待测样本测序结果的获得没有顺序限制,例如可同时获得,也可先后获得,较佳地,在相同试验条件下同时获得,以尽量降低试验因素差异对检测结果造成的影响。另外,较佳地,阴性样本和待测样本为 同一类型样本,例如为无创检测母体中的胎儿的遗传信息,阴性样本和待测样本可均为母体血液样本。
根据本发明的具体实施方式,确定阴性样本中定位到相应第一染色体的读段的量包括:以阴性样本替代待测样本进行步骤(1)-(3),以确定定位到该阴性样本的第一染色体的读段的量;以多个阴性样本的第一染色体的读段的量的均值作为阴性样本中定位到相应第一染色体的读段的量。
所称的定位到染色体的读段的量,可以是绝对的量,也可以是相对的量,例如表现为一个数值如整数、比例,或者表现为一个数值范围。
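According to specific embodiments of the present invention, before step (3), at least one, at least two, or all three of the following (i)-(iii) are performed: (i) removing reads in the sequencing result whose length is not greater than a predetermined length; (ii) removing reads in the alignment result that are not located to a unique position on the first reference sequence (reads aligned to a unique position on the reference sequence are called unique reads); (iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases on the aligned read that are at least one of insertions, deletions, and mismatches.
In one example, the error rate of a read is the proportion of bases, or positions, on the aligned read that appear as insertions, deletions, or mismatches.
The predetermined error rate can be set according to the sequencing platform, the amount of off-machine data, the data quality, the detection purpose, and so on. If the amount of off-machine data is small and/or the data quality is high, a larger predetermined error rate may be appropriate; otherwise a smaller one can be set, so that relatively low-quality data are removed while detection requirements are still met, which favors fast detection.
In one example, sequencing results from a single-molecule sequencing platform are filtered with all of (i)-(iii), which favors fast detection. Specifically, the off-machine data amount is 12.8M and the predetermined error rate is set to 10%, that is, a 10 bp read is allowed at most 1 bp of insertion, deletion, or mismatch; 3.4M of data remain after filtering. If filtering during alignment was already relatively strict, (iii) may be omitted, for example by setting the predetermined error rate to 100%.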
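Filters (i)-(iii) can be expressed as a single predicate. The sketch follows the general wording of the three items (length not greater than the predetermined length is removed; error rate not less than the predetermined rate is removed); the `AlignedRead` record and its fields are hypothetical and do not correspond to any particular aligner output.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    length: int        # read length in bp
    n_locations: int   # positions on the first reference sequence it aligns to
    n_errors: int      # inserted + deleted + mismatched bases after alignment


def passes_filters(read: AlignedRead, predetermined_length: int,
                   predetermined_error_rate: float) -> bool:
    """Apply filters (i)-(iii); True means the read is kept."""
    if read.length <= predetermined_length:      # (i) too short
        return False
    if read.n_locations != 1:                    # (ii) not a unique read
        return False
    if read.n_errors / read.length >= predetermined_error_rate:  # (iii)
        return False
    return True


toy_reads = [AlignedRead(30, 1, 2), AlignedRead(25, 1, 0), AlignedRead(30, 2, 0)]
print([passes_filters(r, 25, 0.10) for r in toy_reads])
```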
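According to specific embodiments of the present invention, step (3) includes: (a) sliding a window of size L3 along the first reference sequence to obtain a plurality of third windows, the step size optionally being L3; (b) determining the sequencing depth of each third window from the alignment result, the sequencing depth of a third window being the ratio of the number of reads aligned to that window to the window size L3; (c) determining the amount of reads located to the first chromosome from the sequencing depths of the third windows contained in the first chromosome.
According to specific embodiments of the present invention, (b) includes: correcting the sequencing depth of each third window based on its GC content and using the corrected depth as the sequencing depth of the third window.
The third window size L3 should in general be chosen so that differences in GC content and distribution show up as differences in the sequencing results of those regions (third windows). For the human genome, L3 is generally less than 300 Kbp. In one example, the inventors determined L3 from the relationship between the coefficient of variation (CV) and window size, shown in Figure 3: the window size at which the CV is clearly affected by window size is chosen as the third window size, for example L3 of 100 Kbp-200 Kbp, which reflects the influence of GC content and distribution on sequencing and also favors fast alignment. The coefficient of variation, also called the coefficient of dispersion, is a normalized measure of the dispersion of a probability distribution, equal to the ratio of the standard deviation to the mean; here it reflects the magnitude of the dispersion of the GC content of a group of windows of a given size.
Two adjacent third windows may or may not overlap. In one example, L3 is set to 150 Kbp and adjacent third windows overlap by 100 bp, that is, the step size is set to 149.9 Kbp.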
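Computing the sequencing depth of each third window from read positions can be sketched as follows. The sketch assumes that a read is counted in a window when its start coordinate falls inside that window, which is one possible convention and is not stated in the text; `third_window_depths` is a hypothetical name.

```python
from bisect import bisect_left


def third_window_depths(read_starts, ref_len, size, step=None):
    """Return [(window_start, depth)], with depth = reads in window / window size."""
    step = size if step is None else step
    starts = sorted(read_starts)
    depths = []
    for w_start in range(0, ref_len - size + 1, step):
        w_end = w_start + size
        # Count read starts in [w_start, w_end) with two binary searches.
        n_reads = bisect_left(starts, w_end) - bisect_left(starts, w_start)
        depths.append((w_start, n_reads / size))
    return depths


# Toy scale: a 30 bp reference, 10 bp windows, step equal to the window size.
print(third_window_depths([0, 5, 12, 19, 25], ref_len=30, size=10))
```

With the sizes from the text one would call it with size=150_000 and step=149_900, so that adjacent windows overlap by 100 bp.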
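Specifically, the correction can be made by establishing the relationship between the GC content of the third windows and their sequencing depth; in one example, this relationship is established by locally weighted regression.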
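A locally weighted regression of depth on GC content can be sketched in pure Python. This is a simplified stand-in under stated assumptions: a tricube-weighted local linear fit with a nearest-neighbour bandwidth, and correction by dividing each depth by its fitted value (consistent with the 1/w read weighting used in the Example, but not spelled out for this step). The function names and the `frac` parameter are hypothetical.

```python
import math


def loess_fit(x, y, x0, frac=0.5):
    """Locally weighted linear regression estimate of y at x0 (tricube weights)."""
    n = len(x)
    k = max(2, math.ceil(frac * n))
    dists = sorted(abs(xi - x0) for xi in x)
    # Bandwidth: distance to the k-th nearest point, nudged so weights stay positive.
    h = dists[min(k, n) - 1] * (1 + 1e-9) + 1e-12
    w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        return my  # all weighted points share one x value
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    return my + (sxy / sxx) * (x0 - mx)


def gc_correct(gc, depth, frac=0.5):
    """Divide each window's depth by the depth expected for its GC content.

    Assumes the fitted values are positive, as for normalized depths near 1.
    """
    return [d / loess_fit(gc, depth, g, frac) for g, d in zip(gc, depth)]


# Toy windows whose depth falls linearly with GC content.
toy_gc = [0.30 + 0.01 * i for i in range(21)]
toy_depth = [2.0 - 2.0 * g for g in toy_gc]
print(gc_correct(toy_gc, toy_depth)[:3])
```

After correction the toy depths are all close to 1, meaning the GC trend has been removed.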
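According to specific embodiments of the present invention, (b) further includes, before the above correction, standardizing the sequencing depth of the third windows and using the standardized depth as the sequencing depth of the third window.
In one example, the standardization is a normalization, for example based on the mean or the median sequencing depth.
According to specific embodiments of the present invention, in (c), a weight coefficient for the reads aligned to a third window is determined from the sequencing depth of that window, and the amount of reads located to the first chromosome is determined from the weight coefficients.
In one example, the sequencing depth of the third windows is standardized or normalized, for example by taking the ratio of each window's sequencing depth to a specific value, the mean sequencing depth of the third windows, as that window's sequencing depth, so that the depths become a set of values fluctuating around 1; the relationship between the processed (relative) sequencing depth and GC content is then determined.
The weight coefficient of the reads of a third window is the relative sequencing depth of that window. In one example, the amount of reads located to the first chromosome is a relative amount corrected by the weight coefficients, which eliminates or reduces the influence of differences in GC content and/or distribution on the detection result and improves detection accuracy.
In some examples, the inventors found that the relative sequencing depth of a third window is inversely related to its GC content: third windows with low GC content have high relative sequencing depth and those with high GC content have low relative sequencing depth. Accordingly, for the relative amount corrected by the weight coefficient: if, for example, N reads locate to a third window of the first chromosome and the relative sequencing depth of that window is w, then after correction the number of reads located to that third window of the first chromosome is
Figure PCTCN2018085865-appb-000010
reads.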
在一个示例中,所称的定位到该第一染色体的读段的量为相对量,为定位到该第一染色体的读段的量与定位到全部或至少一部分常染色体的读段的量的比值,通过z检验(z-score)比较该比值与阴性样本的相应比值的差异是否具有统计学意义,来判断待测样本的该第一染色体是否存为非整倍性异常。
根据本发明的具体实施方式,第一染色体选自13、18和21号染色体中的至少一种。例如,基于检测孕妇外周血样本中的游离核酸,以获得胎儿遗传信息,包括筛查或辅助诊断胎儿是否存在13、18和/或21号染色体非整倍性变异。
一般地,不同染色体的GC含量和分布具有不同特点,例如基于GC含量的相对高低,染色体组中的染色体可归至高GC含量组、中GC含量组和低GC含量组,或者可归至相对的高GC含量组、中高GC含量组、中GC含量组、中低GC含量组和低GC含量组。
表2显示了人常染色体的GC含量,发明人基于多个对照样本测序数据绘制染色体标准化的测序深度和染色体的GC含量的关系曲线,如图4所示,GC含量相对高和相对低的染色体的测序结果受GC含量影响较明显,对于21、13和18号染色体,相比较来看,21染色体测序结果受GC含量影响最小,18号染色体次之,13染色体受GC含量影响较大。
表2
Chr 4 5 6 3 18 8 2 7 12 21
GC含量 0.3825 0.3952 0.3961 0.3969 0.3979 0.4018 0.4024 0.4075 0.4081 0.4083
Chr 14 11 10 1 15 20 16 17 22 19
GC含量 0.4089 0.4157 0.4158 0.4174 0.4220 0.4413 0.4479 0.4554 0.4799 0.4836
根据本发明的具体实施方式,待测样本为孕妇血液样本。由于胎儿游离核酸包括胎儿游离DNA(cffDNA)在母体游离核酸样本中的含量在不同孕妇和/或在不同的孕周期内波动很大。若能提高检测灵敏度,在相同的检测准确性下可检测孕周期越早的样本,则妊娠可人工介入的时间就越早,对孕妇的影响越小;若能提高准确度,假阳性和假阴性都可降低,最终使得应用于诊断变为可能,而不仅是筛查用于辅助诊断。一般地,孕妇体液样本经过提取cffDNA、构建文库、上机测序、最后下 机得到测序数据(例如fastq格式),将下机数据与参考序列进行比对,得到包含了每条读段在基因组的位置、比对得分、是否唯一比对、比对错误等信息的比对结果(例如称为sam文件),可根据比对结果统计某一个染色体的读段数,最后计算该染色体的读段数在常染色体的读段数的占比(以下简称染色体占比),以判断该染色体是否存在数目异常。
根据本发明的具体实施方式,进行无创产前筛查(NIPT或者NIPD),可以先获得一批经过检测确认胎儿正常的包含游离DNA的孕妇体液样本(阴性样本),并计算这些孕妇体液样本中的染色体例如21/18/13染色体的占比,从而确定该(些)染色体数目正常和/或异常的范围或分界线。也可以利用阳性样本以同样的方法来确定染色体数目正常和/或异常的范围或分界线。
本发明实施方式还提供一种检测染色体非整倍性的装置,该装置用以实施上述本发明任一实施例或具体实施方式中的检测染色体非整倍性的方法,该装置包括:测序模块,用于对待测样本中的至少一部分核酸进行测序,获得包括读段的测序结果;比对模块,用于将来自测序模块的读段比对到第一参考序列,获得比对结果,所述比对结果包括所述读段定位于染色体的信息,所述第一参考序列为参考基因组上的比对能力为1的区域的集合,比对能力为1的区域为定位到参考基因组上唯一位置的区域;定量模块,用于对于第一染色体,基于来自比对模块的比对结果,确定定位到该第一染色体的读段的量;判断模块,用于比较来自定量模块的定位到该第一染色体的读段的量与阴性样本中的定位到相应第一染色体的读段的量,以判定该第一染色体的数目。
上述对本发明任一实施例或具体实施方式中的染色体非整倍性的检测方法的技术特征和效果的描述,同样适用本发明这一实施方式中的装置,在此不再赘述。
例如,在某些实施方式中,区域的比对能力的确定包括:以大小为L1的第一窗口对参考基因组进行滑窗,获得多个区域,滑窗的步长例如可设置为1bp;将区域比对到参考基因组,基于区域比对到参考基因组的位置的数目计算该区域的比对能力。
在某些实施方式中,阴性样本的数目不小于20个,或者较佳的不小于30个。
在某些实施方式中,以如下方式确定阴性样本中定位到相应第一染色体的读段的量:以阴性样本替代待测样本进入测序模块、比对模块和定量模块,以确定定位到该阴性样本的第一染色体的读段的量;以多个阴性样本的第一染色体的读段的量的均值作为阴性样本中定位到相应第一染色体的读段的量。
在某些实施方式中,所称第一参考序列为去除了表1所示区域的人参考基因组hg19序列中的至少一部分。
在一些实施方式中,第一参考序列为去除了符合以下条件的第二窗口对应的区域的参考基因组的至少一部分:该第二窗口的测序深度不小于所有第二窗口的测序深度的平均值的4倍。
在另一些实施方式中,第一参考序列为去除了符合以下条件的第二窗口对应的区域的参考基因组的至少一部分:该第二窗口的测序深度不小于所有第二窗口的测序深度的平均值的6倍。
在一些实施方式中,第一参考序列为对参考基因组上第二窗口对应的区域进行以下处理的参考基因组的至少一部分:对百分位数大于98的第二窗口的测序深度赋值为百分位数等于98的第二窗口的测序深度。
在另一些实施方式中,对百分位数大于99的第二窗口的测序深度赋值为百分位数等于99的第二窗口的测序深度。所称第二窗口通过利用大小为L2的窗口对参考基因组进行滑窗获得,在一个示例中,滑窗的步长也为L2。第二窗口的测序深度为比对上该第二窗口的读段的数目与该第二窗口大小L2的比值。
在一些实施方式中,装置还包括过滤模块,该过滤模块用于进行如下(i)-(iii)中至少一项:(i)去除测序结果中的长度不大于预定长度的读段;(ii)去除比对结果中非定位到第一参考序列唯一位置的读段;(iii)去除比对结果中错误率不小于预定错误率的读段,读段的错误率为比对后该读段上为插入、缺失和错配的碱基中的至少一种所占的比例。
在某些实施方式中,定量模块用于进行以下,(a)以大小为L3的窗口对第一参考序列进行滑窗,获得多个第三窗口;(b)基于比对结果,确定第三窗口的测序深度,第三窗口的测序深度为比对上该第三窗口的读段的数目与该第三窗口大小L3的比值;(c)基于第一染色体所包含的第三窗口的测序深度,确定定位到该第一染色体的读段的量。
在一些示例中,(b)还包括,对第三窗口的测序深度进行标准化处理,以标准化后的第三窗口的测序深度作为第三窗口的测序深度。
在另一些示例中,(b)还包括,基于第三窗口的GC含量对第三窗口的测序深度进行校正,以校正后的第三窗口的测序深度作为第三窗口的测序深度。校正前的第三窗口的测序深度可以是标准化的第三窗口的测序深度。
具体的,利用第三窗口的GC含量与第三窗口的测序深度的关系进行校正;在一个示例中,利用局部加权回归法建立第三窗口的GC含量与第三窗口的测序深度的关系。
在一些示例中,(c)包括,基于第三窗口的测序深度确定比对到该第三窗口的读段的权重系数,基于权重系数确定定位到该第一染色体的读段的量。
在某些实施方式中,待测样本为孕妇血液样本。
在某些实施方式中,第一染色体为胎儿13、18和21号染色体中的至少一个。
本发明实施方式提供一种计算机可读存储介质,用于存储/承载供计算机执行的程序,程序的执行包括完成上述任一实施例或具体实施方式中的染色体非整倍性检测方法。上述对本发明任一实施例或具体实施方式中的染色体非整倍性的检测方法和/或装置的技术特征和效果的描述,同样适用本发明这一实施方式中的计算机可读存储介质,在此不再赘述。
本发明实施方式还提供一种计算机程序产品,该包括指令,指令在计算机执行程序时,使计算机执行上述任一实施例或具体实施方式中的染色体非整倍性检测方法。
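Example
The reference sequence used is the set of regions of the human reference genome that exclude the regions shown in Table 1 and that satisfy both of the following: 1) the alignment capability is 1; 2) regions whose sequencing depth is not less than 6 times the average sequencing depth have been removed, or regions whose sequencing depth lies above the 99th percentile have been assigned the 99th-percentile sequencing depth.
Control samples and samples to be tested are all processed as follows:
1. Sequence to obtain the off-machine data, that is, the set of reads; remove reads shorter than 25 bp;
2. Obtain the alignment result (SAM file), including the unique reads (reads aligned to a unique position on the reference sequence) and their positions on the reference sequence/reference chromosomes;
3. Perform GC correction: cut the reference sequence into windows/regions of 150 Kbp (Bin=150K); calculate the sequencing depth of each Bin from the unique reads and normalize it to obtain the normalized sequencing depth; count the GC content of each Bin; establish the relationship between normalized sequencing depth and GC content, for example by taking the GC content of a Bin as x and its normalized sequencing depth as y and fitting an equation relating the two;
4. Use y as the weight coefficient w to correct the number of reads uniquely aligned to the window/region, expressed as each such read having a score, or contribution, of 1/w;
5. Determine the corrected number of unique reads of chromosome i, expressed as the sum of the scores of all unique reads of chromosome i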
Figure PCTCN2018085865-appb-000011
6. Calculate the ratio of the sum of the scores of the unique reads of chromosome i to that of all autosomes
Figure PCTCN2018085865-appb-000012
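Next, from the ratio values (Ratio_i) of chromosome i across multiple control samples, determine the mean μ_i and the variance σ_i of the ratio values of chromosome i.
Using the z-test formula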
Figure PCTCN2018085865-appb-000013
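calculate the Z score (Zscore) of chromosome i of the sample to be tested and compare it with a threshold to judge whether the number of chromosome i in that sample is abnormal.
If the Zscore of a chromosome in a maternal peripheral blood sample to be tested is ≥ 3, the difference is statistically significant, and the fetus carried by the mother can be considered to have three copies of that chromosome.
As for the threshold, the ratio values of chromosome i across the control samples follow, or approximately follow, a normal distribution, so the z value and the corresponding confidence level can be looked up in a z table (normal distribution table). For example, for a confidence level of 99.97% the corresponding z value is roughly 3; a sample exceeding this z value has a 99.97% probability of not being a normal sample and can be judged abnormal. Those skilled in the art can of course set other confidence levels as needed and use the corresponding z value as the threshold for judging whether an abnormality is present.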
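Steps 4 to 6 and the z-test can be tied together in a short Python sketch. The formulas themselves appear only as figures, so the sketch assumes the standard z statistic, (ratio - mean) / standard deviation, with the sample standard deviation of the control ratios; all function names and all numbers are made-up illustrations, not data from the patent.

```python
from statistics import mean, stdev


def chromosome_score(windows):
    """Sum of read contributions for one chromosome; a read in a window counts 1/w.

    windows: iterable of (n_unique_reads, weight_w) pairs, one per window.
    """
    return sum(n_reads / w for n_reads, w in windows)


def chromosome_ratio(scores, chrom):
    """Share of chromosome chrom among the scores of all autosomes supplied."""
    return scores[chrom] / sum(scores.values())


def z_score(test_ratio, control_ratios):
    """Standard z statistic of the test ratio against the control ratios."""
    return (test_ratio - mean(control_ratios)) / stdev(control_ratios)


# Illustrative control ratios for one chromosome (the text calls for 20 or more).
toy_controls = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132, 0.0128]
z = z_score(0.0136, toy_controls)
print(z >= 3)
```

A sample whose z value reaches the chosen threshold, 3 in the text, would be flagged as carrying an abnormal number of that chromosome.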
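Eleven samples whose positive type had been confirmed by karyotyping were tested with the above method; all were detected, and the results are shown in Table 3.
Table 3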
Figure PCTCN2018085865-appb-000014
在本说明书的描述中,一个实施方式、一些实施方式、一个或一些具体实施方式、一个或一些实施例、示例等的描述意指结合该实施方式或示例描述的具体特征、结构或者特点包含于本发明的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构等特点可以在任何的一个或多个实施例或示例中以合适的方式结合。
尽管已经示出和描述了本发明的实施例,本领域的普通技术人员可以理解:在不脱离本发明的原理和宗旨的情况下可以对这些实施例进行多种变化、修改、替换和变型,本发明的范围由权利要求及其等同限定。

Claims (32)

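  1. A method for detecting chromosome aneuploidy, characterized by comprising:
    (1) sequencing at least a portion of the nucleic acid in a test sample to obtain a sequencing result comprising reads;
    (2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result comprising information on the specific chromosome to which each read maps, wherein the first reference sequence is the set of regions of a reference genome having a mappability of 1, a region with a mappability of 1 being a region that maps to a unique position on the reference genome;
    (3) for a first chromosome, determining, based on the alignment result, the amount of reads mapped to the first chromosome;
    (4) comparing the amount of reads mapped to the first chromosome with the amount of reads mapped to the corresponding first chromosome in negative samples, so as to determine the number of copies of the first chromosome.
  2. The method of claim 1, characterized in that determining the mappability of a region comprises:
    sliding a first window of size L1 along the reference genome to obtain a plurality of the regions, optionally with a sliding step of 1 bp;
    aligning each region to the reference genome and calculating the mappability of the region from the number of positions on the reference genome to which the region aligns.
  3. The method of claim 1 or 2, characterized in that the number of negative samples is not less than 20, optionally not less than 30.
  4. The method of claim 3, characterized in that the amount of reads mapped to the corresponding first chromosome in negative samples is determined as follows:
    performing (1)-(3) with a negative sample in place of the test sample to determine the amount of reads mapped to the first chromosome of that negative sample;
    taking the mean of the amounts of reads of the first chromosome across multiple negative samples as the amount of reads mapped to the corresponding first chromosome in negative samples.
  5. The method of any one of claims 1-4, characterized in that the first reference sequence is at least a portion of the human reference genome hg19 sequence from which the regions shown in the following table have been removed: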
    Figure PCTCN2018085865-appb-100001
    Figure PCTCN2018085865-appb-100002
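  6. The method of any one of claims 1-5, characterized in that the first reference sequence is at least a portion of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the mean sequencing depth of all second windows, optionally not less than 6 times the mean sequencing depth of all second windows;
    the second windows are obtained by sliding a window of size L2 along the reference genome, optionally with a sliding step of L2,
    the sequencing depth of a second window being the ratio of the number of reads aligned to that second window to the size of that second window.
  7. The method of any one of claims 1-5, characterized in that the first reference sequence is at least a portion of the reference genome in which the regions corresponding to second windows have been processed as follows: the sequencing depth of second windows above the 98th percentile is set to the sequencing depth of the second window at the 98th percentile, optionally, the sequencing depth of second windows above the 99th percentile is set to the sequencing depth of the second window at the 99th percentile;
    the second windows are obtained by sliding a window of size L2 along the reference genome, optionally with a sliding step of L2,
    the sequencing depth of a second window being the ratio of the number of reads aligned to that second window to the size L2 of that second window.
  8. The method of any one of claims 1-7, characterized in that the method comprises, before performing step (3), performing at least one of the following (i)-(iii):
    (i) removing from the sequencing result reads whose length is not greater than a predetermined length;
    (ii) removing from the alignment result reads that do not map to a unique position on the first reference sequence;
    (iii) removing from the alignment result reads whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of insertions, deletions and mismatches.
  9. The method of any one of claims 1-8, characterized in that (3) comprises:
    (a) sliding a window of size L3 along the first reference sequence to obtain a plurality of third windows;
    (b) determining, based on the alignment result, the sequencing depth of each third window, the sequencing depth of a third window being the ratio of the number of reads aligned to that third window to the size L3 of that third window;
    (c) determining the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.
  10. The method of claim 9, characterized in that (b) further comprises:
    normalizing the sequencing depth of the third window and taking the normalized sequencing depth as the sequencing depth of the third window.
  11. The method of claim 10, characterized in that (b) further comprises:
    correcting the sequencing depth of the third window based on the GC content of the third window and taking the corrected sequencing depth as the sequencing depth of the third window.
  12. The method of claim 11, characterized in that the correction is performed using the relationship between the GC content of the third window and the sequencing depth of the third window;
    optionally, the relationship between the GC content of the third window and the sequencing depth of the third window is established by locally weighted regression.
  13. The method of claim 11 or 12, characterized in that (c) comprises:
    determining, based on the sequencing depth of the third window, a weight coefficient for the reads aligned to that third window,
    determining the amount of reads mapped to the first chromosome based on the weight coefficients.
  14. The method of any one of claims 1-13, characterized in that the test sample is a blood sample from a pregnant woman.
  15. The method of any one of claims 1-14, characterized in that the first chromosome is at least one of fetal chromosomes 13, 18 and 21.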
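  16. An apparatus for detecting chromosome aneuploidy, characterized by comprising:
    a sequencing module configured to sequence at least a portion of the nucleic acid in a test sample to obtain a sequencing result comprising reads;
    an alignment module configured to align the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result comprising information on the chromosome to which each read maps, wherein the first reference sequence is the set of regions of a reference genome having a mappability of 1, a region with a mappability of 1 being a region that maps to a unique position on the reference genome;
    a quantification module configured to determine, for a first chromosome and based on the alignment result from the alignment module, the amount of reads mapped to the first chromosome;
    a judgment module configured to compare the amount of reads mapped to the first chromosome, as obtained from the quantification module, with the amount of reads mapped to the corresponding first chromosome in negative samples, so as to determine the number of copies of the first chromosome.
  17. The apparatus of claim 16, characterized in that determining the mappability of a region comprises:
    sliding a first window of size L1 along the reference genome to obtain a plurality of the regions, optionally with a sliding step of 1 bp;
    aligning each region to the reference genome and calculating the mappability of the region from the number of positions on the reference genome to which the region aligns.
  18. The apparatus of claim 16 or 17, characterized in that the number of negative samples is not less than 20, optionally not less than 30.
  19. The apparatus of claim 18, characterized in that the amount of reads mapped to the corresponding first chromosome in negative samples is determined as follows:
    passing a negative sample, in place of the test sample, through the sequencing module, the alignment module and the quantification module to determine the amount of reads mapped to the first chromosome of that negative sample;
    taking the mean of the amounts of reads of the first chromosome across multiple negative samples as the amount of reads mapped to the corresponding first chromosome in negative samples.
  20. The apparatus of any one of claims 16-19, characterized in that the first reference sequence is at least a portion of the human reference genome hg19 sequence from which the regions shown in the following table have been removed: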
    Figure PCTCN2018085865-appb-100003
    Figure PCTCN2018085865-appb-100004
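  21. The apparatus of any one of claims 16-20, characterized in that the first reference sequence is at least a portion of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the mean sequencing depth of all second windows, optionally not less than 6 times the mean sequencing depth of all second windows;
    the second windows are obtained by sliding a window of size L2 along the reference genome, optionally with a sliding step of L2,
    the sequencing depth of a second window being the ratio of the number of reads aligned to that second window to the size of that second window.
  22. The apparatus of any one of claims 16-20, characterized in that the first reference sequence is at least a portion of the reference genome in which the regions corresponding to second windows have been processed as follows: the sequencing depth of second windows above the 98th percentile is set to the sequencing depth of the second window at the 98th percentile, optionally, the sequencing depth of second windows above the 99th percentile is set to the sequencing depth of the second window at the 99th percentile;
    the second windows are obtained by sliding a window of size L2 along the reference genome, optionally with a sliding step of L2,
    the sequencing depth of a second window being the ratio of the number of reads aligned to that second window to the size L2 of that second window.
  23. The apparatus of any one of claims 16-22, characterized by further comprising a filtering module configured to perform at least one of the following (i)-(iii):
    (i) removing from the sequencing result reads whose length is not greater than a predetermined length;
    (ii) removing from the alignment result reads that do not map to a unique position on the first reference sequence;
    (iii) removing from the alignment result reads whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of insertions, deletions and mismatches.
  24. The apparatus of any one of claims 16-23, characterized in that the quantification module is configured to perform the following:
    (a) sliding a window of size L3 along the first reference sequence to obtain a plurality of third windows;
    (b) determining, based on the alignment result, the sequencing depth of each third window, the sequencing depth of a third window being the ratio of the number of reads aligned to that third window to the size L3 of that third window;
    (c) determining the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.
  25. The apparatus of claim 24, characterized in that (b) further comprises:
    normalizing the sequencing depth of the third window and taking the normalized sequencing depth as the sequencing depth of the third window.
  26. The apparatus of claim 25, characterized in that (b) further comprises:
    correcting the sequencing depth of the third window based on the GC content of the third window and taking the corrected sequencing depth as the sequencing depth of the third window.
  27. The apparatus of claim 26, characterized in that the correction is performed using the relationship between the GC content of the third window and the sequencing depth of the third window;
    optionally, the relationship between the GC content of the third window and the sequencing depth of the third window is established by locally weighted regression.
  28. The apparatus of claim 26 or 27, characterized in that (c) comprises:
    determining, based on the sequencing depth of the third window, a weight coefficient for the reads aligned to that third window,
    determining the amount of reads mapped to the first chromosome based on the weight coefficients.
  29. The apparatus of any one of claims 16-28, characterized in that the test sample is a blood sample from a pregnant woman.
  30. The apparatus of any one of claims 16-29, characterized in that the first chromosome is at least one of fetal chromosomes 13, 18 and 21.
  31. A computer-readable storage medium, characterized in that it stores a program for execution by a computer, execution of the program comprising carrying out the method of any one of claims 1-15.
  32. A computer program product, characterized by comprising instructions that, when the program is executed by the computer, cause the computer to perform the method of any one of claims 1-15.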
PCT/CN2018/085865 2018-05-07 2018-05-07 Method, apparatus, and system for detecting chromosome aneuploidy WO2019213810A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/085865 WO2019213810A1 (zh) 2018-05-07 2018-05-07 Method, apparatus, and system for detecting chromosome aneuploidy
US17/053,054 US20210130888A1 (en) 2018-05-07 2018-05-07 Method, apparatus, and system for detecting chromosome aneuploidy
EP18917705.8A EP3795692A4 (en) 2018-05-07 2018-05-07 METHOD, DEVICE AND SYSTEM FOR DETECTION OF CHROMOSOMAL ANEUPLOIDY

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/085865 WO2019213810A1 (zh) 2018-05-07 2018-05-07 Method, apparatus, and system for detecting chromosome aneuploidy

Publications (1)

Publication Number Publication Date
WO2019213810A1 true WO2019213810A1 (zh) 2019-11-14

Family

ID=68468420

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/085865 WO2019213810A1 (zh) 2018-05-07 2018-05-07 Method, apparatus, and system for detecting chromosome aneuploidy

Country Status (3)

Country Link
US (1) US20210130888A1 (zh)
EP (1) EP3795692A4 (zh)
WO (1) WO2019213810A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
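CN113593629B (zh) * 2021-06-29 2024-02-13 广东博奥医学检验所有限公司 Method for reducing false positives and false negatives in non-invasive prenatal testing based on semiconductor sequencing
CN113990393B (zh) * 2021-12-28 2022-04-22 北京优迅医疗器械有限公司 Data processing method, apparatus and electronic device for genetic testing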

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015184404A1 (en) * 2014-05-30 2015-12-03 Verinata Health, Inc. Detecting fetal sub-chromosomal aneuploidies and copy number variations
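CN107133495A (zh) * 2017-05-04 2017-09-05 北京医院 Analysis method and analysis system for aneuploidy biological information
CN107949845A (zh) * 2015-08-06 2018-04-20 伊万基因诊断中心有限公司 Novel method capable of distinguishing fetal sex and fetal sex chromosome abnormalities on multiple platforms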

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2939547T3 (es) * 2013-04-03 2023-04-24 Sequenom Inc Methods and processes for non-invasive assessment of genetic variations
IL304949A (en) * 2013-10-04 2023-10-01 Sequenom Inc Methods and processes for non-invasive evaluation of genetic variations
AU2016293025A1 (en) * 2015-07-13 2017-11-02 Agilent Technologies Belgium Nv System and methodology for the analysis of genomic data obtained from a subject

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHITKARA, U. ET AL.: "Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood", PROC. NATL. ACAD. SCI. U. S. A, vol. 105, 2008, pages 16266 - 71, XP055523982, DOI: 10.1073/pnas.0808319105
CHIU, R. W. K. ET AL.: "Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma", PROC. NATL. ACAD. SCI., vol. 105, 2008, pages 20458 - 20463, XP002620454, DOI: 10.1073/pnas.0810641105
DRISCOLL, D. A.GROSS, S.: "Prenatal Screening for Aneuploidy", N. ENGL. J. MED., vol. 360, 2009, pages 2556 - 2562
G. MYERS: "A fast bit-vector algorithm for approximate string matching based on dynamic programming", JOURNAL OF THE ACM, vol. 46, no. 3, 1999, pages 395 - 415, XP058340666, DOI: 10.1145/316542.316550
LO, Y. M. D. ET AL.: "Presence of fetal DNA in maternal plasma and serum", LANCET, vol. 350, 1997, pages 485 - 487, XP005106839, DOI: 10.1016/S0140-6736(97)02174-0
PAPAGEORGIOU, E. A. ET AL.: "Fetal-specific DNA methylation ratio permits noninvasive prenatal diagnosis of trisomy 21", NAT. MED., vol. 17, 2011, pages 510 - 513, XP055358566, DOI: 10.1038/nm.2312
See also references of EP3795692A4

Also Published As

Publication number Publication date
US20210130888A1 (en) 2021-05-06
EP3795692A1 (en) 2021-03-24
EP3795692A4 (en) 2021-07-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18917705

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018917705

Country of ref document: EP

Effective date: 20201207