CN108629152A

CN108629152A - Method, apparatus and system for detecting chromosomal aneuploidy

Info

Publication number: CN108629152A
Application number: CN201810425695.6A
Authority: CN
Inventors: 曾立董; 吴增丁; 金欢; 徐伟彬; 李林森; 赵陆洋; 张萌; 颜钦
Original assignee: SHENZHEN HANHAI GENE BIOTECHNOLOGY CO Ltd
Current assignee: SHENZHEN HANHAI GENE BIOTECHNOLOGY CO Ltd
Priority date: 2018-05-07
Filing date: 2018-05-07
Publication date: 2018-10-09

Abstract

The invention discloses a method, apparatus and system for detecting chromosomal aneuploidy. The method includes: sequencing at least part of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read is located on; for a first chromosome, determining, based on the alignment result, the amount of reads located on the first chromosome; and comparing the amount of reads located on the first chromosome with the amount of reads located on the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome. Chromosomal aneuploidy detection performed with this method yields results with higher sensitivity and accuracy.

Description

Method, apparatus and system for detecting chromosomal aneuploidy

Technical field

The present invention relates to the field of bioinformatics and, in particular, to a method, apparatus and system for detecting chromosomal aneuploidy.

Background art

Down syndrome (trisomy 21), Edwards syndrome (trisomy 18) and Patau syndrome (trisomy 13) are the most common aneuploidies identified in prenatal diagnosis, with incidences of 1/700 [Papageorgiou, E.A. et al. Fetal-specific DNA methylation ratio permits noninvasive prenatal diagnosis of trisomy 21. Nat. Med. 17, 510-513 (2011)], 1/6,000 and 1/10,000 [Driscoll, D.A. & Gross, S. Prenatal Screening for Aneuploidy. N. Engl. J. Med. 360, 2556–2562 (2009)], respectively. These chromosomal aneuploidies cause very high morbidity and mortality. Amniocentesis and chorionic villus sampling are the standard methods for diagnosing fetal chromosomal abnormalities, but these diagnostic procedures themselves carry a miscarriage rate of 0.6% to 1.9%. To avoid these risks, a safer, non-invasive method that can detect fetal aneuploidy at an earlier gestational age (NIPT) needs to be developed.

In 1997, Yuk Ming Dennis Lo [Lo, Y.M.D. et al. Presence of fetal DNA in maternal plasma and serum. Lancet 350, 485-487 (1997)] first reported the detection of cell-free fetal DNA (cffDNA) in pregnant women, which made it possible to examine the genetic status of the fetus through maternal blood. cffDNA is reported to account for about 4-10% of maternal cell-free DNA in the first and second trimesters and to reach 10-20% in the third trimester. In 2008, Lo [Chiu, R.W.K. et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc. Natl. Acad. Sci. 105, 20458-20463 (2008)] and Stephen Quake [Chitkara, U. et al. Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc. Natl. Acad. Sci. U.S.A. 105, 16266-71 (2008)] each reported using next-generation sequencing (NGS) to detect fetal chromosomal aneuploidy. More and more sequencing platforms can now be applied to genetic testing.

For chromosomal aneuploidy detection based on the raw sequencing data (off-machine data) of any platform, the sensitivity and/or accuracy of detection still needs further improvement. Many factors affect detection sensitivity and/or accuracy. For example, the length of the raw sequencing data generated by different sequencing platforms differs greatly. Raw sequencing data are also called reads, and the length of a read is also called the read length; it ranges from tens of bp to thousands of bp, and read length affects at least the confidence with which raw sequencing data are matched (located). As another example, the sequencing error rate also affects the confidence of read location; in general, the higher the error rate, the lower the confidence.

Summary of the invention

Embodiments of the present invention aim to solve at least one of the technical problems in the related art, or at least to provide a practical alternative.

According to one embodiment of the invention, a method for detecting chromosomal aneuploidy variation is provided, including: (1) sequencing at least part of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; (2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read is located on, the first reference sequence being the set of regions in the reference genome whose alignment ability is 1, where a region with alignment ability 1 is a region that maps to a unique position in the reference genome; (3) for a first chromosome, determining, based on the alignment result, the amount of reads located on the first chromosome; (4) comparing the amount of reads located on the first chromosome with the amount of reads located on the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome.

Because the method uses a specific reference sequence to screen and locate reads, it can perform chromosomal aneuploidy detection quickly and simply and obtain accurate results. It is suitable for analyzing raw sequencing data from various sequencing platforms, and is especially suitable for analyzing reads that contain unidentified bases, that is, reads containing gaps.

According to another embodiment of the invention, an apparatus for detecting chromosomal aneuploidy variation is provided. The apparatus is used to implement the method for detecting chromosomal aneuploidy of the above embodiment and includes: a sequencing module, for sequencing at least part of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; an alignment module, for aligning the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read is located on, the first reference sequence being the set of regions in the reference genome whose alignment ability is 1, where a region with alignment ability 1 is a region that maps to a unique position in the reference genome; a quantification module, for determining, for a first chromosome and based on the alignment result from the alignment module, the amount of reads located on the first chromosome; and a judgment module, for comparing the amount of reads located on the first chromosome, as supplied by the quantification module, with the amount of reads located on the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome.

According to another embodiment of the invention, a computer-readable medium is also provided for storing/carrying a computer-executable program. When the program is executed, all or part of the steps of the method for detecting chromosomal aneuploidy of the above embodiment can be completed by instructing the relevant hardware. The medium includes but is not limited to read-only memory, random-access memory, a magnetic disk, an optical disc and the like.

According to another embodiment of the invention, a terminal is provided, namely a system for detecting chromosomal aneuploidy variation. The system includes a computer-executable program and a processor; the processor is used to execute the computer-executable program, and executing the program includes completing the method for detecting chromosomal aneuploidy of the above embodiment.

The method, apparatus and/or system for detecting chromosomal aneuploidy provided by any of the above embodiments can be used to detect chromosomal aneuploidy variation, and the detection results obtained have higher sensitivity and accuracy. They are suitable for analyzing raw sequencing data from various sequencing platforms, and are especially suitable for analyzing reads that contain unidentified bases, that is, reads containing gaps.

Additional aspects and advantages of embodiments of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the embodiments.

Brief description of the drawings

Fig. 1 is a schematic diagram of the distance between two adjacent entries of the reference library in the alignment method used in a specific embodiment of the invention.

Fig. 2 is a schematic diagram of the connection length in the alignment method used in a specific embodiment of the invention.

Fig. 3 is a schematic diagram of the relationship between the coefficient of variation and the window size in a specific embodiment of the invention.

Fig. 4 is a schematic diagram of the relationship between the normalized sequencing depth of a chromosome and the GC content of the chromosome in a specific embodiment of the invention.

Detailed description

Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the invention, and are not to be construed as limiting it.

In the description of the invention, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance, or as implicitly indicating the number or order of the technical features referred to. In the description of the invention, "multiple" means two or more, unless specifically defined otherwise.

Sequencing in the embodiments of the invention, also called sequence determination, refers to determining a nucleic acid sequence. It includes DNA sequencing and/or RNA sequencing, and long-fragment sequencing and/or short-fragment sequencing.

Sequencing can be carried out on a sequencing platform, which may be selected from, but is not limited to, the HiSeq/MiSeq/NextSeq platforms of Illumina, the Ion Torrent platform of Thermo Fisher/Life Technologies, the BGISEQ platform of BGI, and single-molecule sequencing platforms. The sequencing mode may be single-end or paired-end. The sequencing result/data obtained are the fragments read out by sequencing, called reads. The length of a read is called the read length.

An embodiment of the invention provides a method for detecting chromosomal aneuploidy, where chromosomal aneuploidy includes an abnormal quantity of a chromosome or of a partial region of a chromosome. The method includes: (1) sequencing at least part of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; (2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read is located on, the first reference sequence being the set of regions in the reference genome whose alignment ability is 1, where a region with alignment ability 1 is a region that maps to a unique position in the reference genome; (3) for a first chromosome, determining, based on the alignment result, the amount of reads located on the first chromosome; (4) comparing the amount of reads located on the first chromosome with the amount of reads located on the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome.
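As a rough illustration of steps (3) and (4), the sketch below counts reads per chromosome and compares the read fraction of one chromosome against negative samples. The z-score is an assumption made for illustration only; the text above requires only that the two amounts be compared. The function names and the toy numbers are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def chromosome_fractions(read_chromosomes):
    """Fraction of located reads that fall on each chromosome (step 3)."""
    counts = Counter(read_chromosomes)
    total = sum(counts.values())
    return {chrom: n / total for chrom, n in counts.items()}


def z_score(test_fraction, negative_fractions):
    """Deviation of the test sample from negative samples (step 4).

    Assumption: a z-score is one possible way to compare the amounts.
    """
    return (test_fraction - mean(negative_fractions)) / stdev(negative_fractions)


# Toy data: chromosome label of each uniquely located read in the test sample.
test_reads = ["chr21"] * 16 + ["chr1"] * 84
fractions = chromosome_fractions(test_reads)

# Toy chr21 fractions observed in three negative samples.
negative_chr21 = [0.13, 0.14, 0.15]
z = z_score(fractions["chr21"], negative_chr21)
```

A large positive value would suggest more reads on the chromosome than in the negative samples; any decision threshold would have to be chosen and validated separately.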

Sequencing may be performed on the whole genome, on several chromosomes, or on partial regions of chromosomes. In general, the choice depends mainly on the characteristics of the target chromosome or region and on the relationship between the target chromosome or region and other chromosomes or regions.

The term "alignment" refers to sequence alignment, and covers both the process of locating one or more sequences on one or more other sequences and the location result obtained. For example, it includes the process of locating reads on a reference sequence, and also the process of obtaining the read location/matching result.

A reference sequence (reference, ref), or reference chromosome sequence, is a predetermined sequence. It may be a DNA and/or RNA sequence determined and assembled in advance by the user, or a DNA and/or RNA sequence determined and published by others; it may be any reference template, obtained in advance, of the biological category to which the sample source individual/target individual belongs, for example all or at least part of a published genome assembly of the same biological category. If the sample source individual or target individual is human, the genome reference sequence (also called the reference genome) may be selected from the human reference genomes provided by the UCSC, NCBI or ENSEMBL databases, such as HG19, HG38, GRCh36, GRCh37 and GRCh38; a person skilled in the art can learn the correspondence between these reference genome versions from the database documentation and select the version to use. Further, a resource library containing more reference sequences may be configured in advance. For example, before alignment, a sequence that is closer to the target individual or that has certain characteristics may be selected, or determined and assembled, as the reference sequence according to factors such as the sex, ethnicity and region of the target individual, which helps obtain more accurate sequence analysis results subsequently. The reference sequence includes chromosome numbers and the position of each site on the chromosome.

The first reference sequence is at least a part of the reference genome. It is a version that the inventors constructed and tested for the purpose of detecting chromosomal aneuploidy variation, based on characteristics found by mining published raw sequencing data sets, combined with the features of the sequencing platform used and the characteristics of the raw sequencing data, including factors such as read length, error rate and data quality. Locating reads with the first reference sequence helps obtain location results quickly and reduces the amount of data that subsequent steps must process.

In some embodiments, the alignment ability of a region is determined as follows: a first window of size L1 is slid over the reference genome to obtain multiple regions; each region is aligned to the reference genome; and the alignment ability of the region is calculated from the number of positions in the reference genome to which the region aligns.

A region or window corresponds to a stretch of sequence in the reference genome. The size of the first window and/or the sliding step may be set according to the detection purpose, the variation detection principle used, the read length and the sequence characteristics of the reference genome. Preferably, the sliding step is set to be no larger than the size of the first window, so that as many regions with alignment ability 1 as possible are retained in the reference genome, which helps improve the utilization of raw sequencing data.

In general, L1 may be set according to the read length, for example to any integer between 0.5 and 2 times the read length or average read length, and the sliding step may be set to any integer less than 0.5 times, less than 0.2 times or less than 0.1 times the read length. In one example, the reference genome used is HG19 from the UCSC database, the read length is 25 bp, L1 is set to 25 bp, and the sliding step is set to less than 10 bp, less than 5 bp or less than 2 bp. For example, with a sliding step of 1 bp, two adjacent first windows overlap by (L1-1) bp. This helps obtain the set of all regions in the reference genome that meet the requirement, makes full use of the sequencing result to obtain a more comprehensive alignment result, and improves data utilization.

Specifically, in one example, the alignment ability of a region is calculated as the reciprocal of the number of positions in the reference genome to which the region aligns. For example, if a region aligns to a unique position in the reference genome, its alignment ability is 1; if a region aligns to 5 positions in the reference genome, its alignment ability is 1/5.
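The calculation described above can be sketched as follows. This is a minimal illustration that assumes exact, single-strand string matching in place of a real aligner; the function name and the toy reference are hypothetical.

```python
def alignment_ability(reference, window_size, step=1):
    """Return {window start: 1 / number of exact occurrences of that window}.

    Assumption: exact, single-strand matching stands in for a real aligner.
    """
    n_windows = len(reference) - window_size + 1
    # Count every occurrence of every window-sized substring (all offsets).
    occurrences = {}
    for start in range(n_windows):
        seq = reference[start:start + window_size]
        occurrences[seq] = occurrences.get(seq, 0) + 1
    # Score only the windows visited by the sliding step.
    return {
        start: 1 / occurrences[reference[start:start + window_size]]
        for start in range(0, n_windows, step)
    }


toy_reference = "ACGTACGTTT"
ability = alignment_ability(toy_reference, window_size=4)
# Window starts whose sequence occurs exactly once in the toy reference.
unique_starts = [start for start, value in ability.items() if value == 1]
```

In the toy reference, the window ACGT occurs at two positions and therefore scores 1/2, while the remaining windows occur once and would be kept in a reference built from regions of alignment ability 1.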

The first reference sequence may be built when a target sample is tested, or may be built and saved in advance and called when a sample is tested.

In some embodiments, the first reference sequence is at least a part of the reference genome from which the regions shown in Table 1 have been removed. Because displaying the sequences of all removed regions would take a great deal of space, Table 1 identifies these removed/masked regions by their positions in the reference genome HG19. These regions may have different chromosomal start and end positions in different versions of the human reference genome, but this does not prevent a person skilled in the art from determining and masking the regional sequences in the table below to obtain the first reference sequence. A reference sequence with these regions masked/removed helps subsequent steps run quickly and yields accurate detection results.

Table 1

In other embodiments, the first reference sequence is at least a part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than (greater than or equal to) 4 times the average sequencing depth of all second windows, preferably not less than 6 times that average. That is, second windows whose sequencing depth is far greater than the average sequencing depth are removed from, or masked in, the reference genome.
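A minimal sketch of this filter is given below, assuming the per-window sequencing depths are already available as a list. The function name and the toy depths are hypothetical.

```python
def high_depth_windows(depths, factor=4.0):
    """Indices of windows whose depth is >= factor times the mean depth."""
    mean_depth = sum(depths) / len(depths)
    return [i for i, depth in enumerate(depths) if depth >= factor * mean_depth]


# Toy data: nine ordinary windows and one outlier; the mean depth is 4.0.
example_depths = [1.0] * 9 + [31.0]
masked = high_depth_windows(example_depths)                      # threshold 16.0
masked_strict = high_depth_windows(example_depths, factor=6.0)   # threshold 24.0
```

Note that the outlier itself raises the mean, so with very few windows the threshold can be pulled upward; with genome-scale window counts this effect is small.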

Sequencing depth, also called depth, is the number of times a region is covered; it can be expressed as the ratio of the number of reads aligned to the region to the size of the region. The sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the size of the window.

Second windows can be obtained by sliding a window of size L2 over the reference genome, giving a series of second windows of size L2. Adjacent second windows may or may not overlap. In one example, the sliding step used to obtain the second windows is set to L2, so that adjacent second windows neither overlap nor leave a gap (zero bases of overlap and zero bases of spacing). The reference genome is thus converted into a series of second windows that covers the reference genome exactly once, and this series can be used to represent the genome.
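The non-overlapping case can be sketched as below: each read is assigned to a window and the count is divided by the window size. Assigning a read by its 0-based start position, and dividing a shorter final window by its actual size, are assumptions made for illustration; the function name is hypothetical.

```python
def window_depths(read_starts, chrom_length, window_size):
    """Depth per non-overlapping window: reads in the window / window size.

    Assumption: a read belongs to the window containing its 0-based start.
    """
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window_size] += 1
    depths = []
    for i, count in enumerate(counts):
        # The last window may be shorter than window_size.
        size = min(window_size, chrom_length - i * window_size)
        depths.append(count / size)
    return depths


# Toy data: five read start positions on a 30 bp "chromosome", L2 = 10.
toy_window_depths = window_depths([0, 5, 9, 10, 25], chrom_length=30, window_size=10)
```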

Removing/masking specific regions of the reference genome means that, after the alignment of step (2) is carried out with the processed reference sequence (the first reference sequence), the influence of some abnormal data on subsequent statistical analysis can be eliminated.

According to a specific embodiment of the invention, for second windows whose sequencing depth deviates markedly from the mean or median sequencing depth, the depth of the window may instead be reassigned, giving a series of second windows with relatively balanced sequencing depth. After the alignment of step (2) is carried out with a first reference sequence comprising such a series of second windows, the influence of some abnormal data on subsequent statistical analysis can likewise be eliminated. In one example, the sequencing depth of second windows above the 98th percentile is reassigned to the sequencing depth of the second window at the 98th percentile, or the sequencing depth of second windows above the 99th percentile is reassigned to the sequencing depth of the second window at the 99th percentile. The first reference sequence obtained in this way helps eliminate the effect of abnormal data/regions on the detection result and helps obtain accurate detection results. For example, all second windows may be sorted by sequencing depth from low to high, and the sequencing depth of all second windows ranked between the 99th and 100th percentiles reassigned, for example to the sequencing depth of the second window at the 99th percentile, to eliminate the influence of windows with abnormally high sequencing depth on subsequent detection.
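The reassignment can be sketched as a simple cap at a chosen percentile. The nearest-rank percentile definition used here is an assumption, since the text does not say how the percentile is computed; the function name and the toy depths are hypothetical.

```python
def cap_depths(depths, percentile=99):
    """Reassign depths above the given percentile to the depth at that percentile.

    Assumption: nearest-rank percentile on the sorted depths.
    """
    ranked = sorted(depths)
    # 1-based nearest rank, computed with integer ceiling division.
    rank = max(1, -(-percentile * len(ranked) // 100))
    cap = ranked[rank - 1]
    return [min(depth, cap) for depth in depths]


# Toy data: 99 ordinary windows (depths 1..99) and one extreme window.
raw_depths = [float(i) for i in range(1, 100)] + [1000.0]
capped_99 = cap_depths(raw_depths)        # the cap is the 99th-ranked depth
capped_98 = cap_depths(raw_depths, 98)    # the cap is the 98th-ranked depth
```

Unlike removal, this keeps every window in place, so window coordinates stay unchanged for downstream statistics.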

The size L2 of the second window may be adjusted as needed and according to the sequencing result. Preferably, the size of the second window is roughly the same as the size of most regions/windows with abnormally high and/or low sequencing depth. In some embodiments, the sample is a human sample and the reference genome is a human reference genome; based on preliminary statistics of the sequencing result and/or alignment result, L2 may be set to 10-20 Kbp, preferably 12-17 Kbp. In one example, the inventors found that abnormal regions/second windows can be identified more comprehensively when L2 is set to 15 Kbp.

From rough statistics of sequencing results, the inventors found that repetitive sequence regions generally fall within regions of abnormal sequencing depth. Removing these regions/windows of abnormal sequencing depth, or reassigning the depth of those regions, significantly improves the accuracy and/or sensitivity of the detection result compared with leaving these regions unprocessed.

Alignment may be carried out with known alignment software, such as SOAP, BWA, BLAST, MAPQ and TeraMap; this embodiment places no restriction on this. During alignment, alignment parameters may be set, for example allowing at most n mismatched bases in a pair of reads or in a single read, with n set to 1 or 2. If more than n bases in the read or read pair are mismatched, the read or read pair is considered unable to align to the first reference sequence; alternatively, if all n mismatched bases are located in one read of a read pair, that read of the pair is considered unable to align to the reference sequence.
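The mismatch limit for a single read can be illustrated as follows. This sketch counts substitutions only at a fixed candidate position and is not how any of the named aligners is configured; the function names are hypothetical.

```python
def count_mismatches(read, ref_segment):
    """Number of positions at which the read and the reference segment differ."""
    return sum(1 for a, b in zip(read, ref_segment) if a != b)


def accept_alignment(read, ref_segment, max_mismatches=1):
    """Accept the placement only if at most max_mismatches bases differ.

    Assumption: substitutions only, so both sequences have the same length.
    """
    if len(read) != len(ref_segment):
        return False
    return count_mismatches(read, ref_segment) <= max_mismatches


exact = accept_alignment("ACGTAC", "ACGTAC")                          # 0 mismatches
one_off = accept_alignment("ACGTAC", "ACCTAC")                        # 1 mismatch
two_off = accept_alignment("ACGTAC", "TCCTAC")                        # 2 mismatches, n = 1
two_off_n2 = accept_alignment("ACGTAC", "TCCTAC", max_mismatches=2)   # 2 mismatches, n = 2
```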

According to a specific embodiment of the invention, the alignment in step (2) includes: (a) converting each read into a group of short fragments corresponding to the read, to obtain multiple groups of short fragments; (b) determining the positions of the short fragments in a reference library, to obtain a first location result, where the reference library is a hash table built from the first reference sequence and includes multiple entries, one entry of the reference library corresponds to one seed sequence, a seed sequence can match at least one stretch of sequence on the first reference sequence, and the distance on the first reference sequence between the two seed sequences corresponding to two adjacent entries of the reference library is less than the length of a short fragment; (c) removing from the first location result the short fragments located on any entry among adjacent entries of the reference library, to obtain a second location result; (d) extending based on the short fragments in the second location result that come from the same read, to obtain the alignment result of the read. With this alignment method, reads are converted into short fragments and read sequence information is converted into position information, that is, sequence form is converted into numeric form, which helps align and locate raw sequencing data from various sequencing platforms quickly and accurately. The method is especially suitable for fast and accurate alignment of reads containing unidentified bases, that is, reads containing gaps or N, for example reads obtained with poor sequencing quality or poor base calling.

The reference library is essentially a hash table. It can be built directly with the seed sequence as the key and the position of the seed sequence on the reference sequence as the value. Alternatively, the seed sequence may first be converted into a number or an integer string, and the reference library built with that number or integer string as the key and the position of the seed sequence on the reference sequence as the value. The value, that is, the position of the seed sequence on the reference sequence, may be one or more positions of the seed sequence on the reference sequence/chromosome; a position may be expressed directly as an actual value or numeric range, or may be re-encoded and expressed with custom characters and/or numbers.
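A minimal sketch of such a hash table is shown below, using the seed sequence itself as the key and a list of 0-based positions as the value. It assumes exact matching and indexes only the seeds that actually occur in the reference; the function name and the toy reference are hypothetical.

```python
from collections import defaultdict


def build_reference_library(reference, seed_length):
    """Map each seed that occurs in the reference to all of its start positions."""
    library = defaultdict(list)
    for pos in range(len(reference) - seed_length + 1):
        library[reference[pos:pos + seed_length]].append(pos)
    return dict(library)


toy_library = build_reference_library("ACGTACGT", seed_length=4)
# One key per seed, one or more positions per key.
acgt_positions = toy_library["ACGT"]
```

Looking up a short fragment then reduces to a dictionary access, which is what turns sequence information into position information.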

In one example, the hash table is implemented with the C++ vector and can be expressed as Hash(seed) = Vector(position). A vector is an object entity that can hold many elements of the same type, and is therefore also called a container. It can be saved in binary form, and the reference library is built in this way. In addition, the hash table may be stored in blocks, with a block-head key and a block-tail key set for each block. For example, for the sequence block {5, 6, 7, 8, ..., 19, 20}, the block head and block tail (header and footer, in other words) are set to 5 and 20. Given the number 3, since 3 < 5, 3 does not belong to this sequence block; given the number 10, since 5 < 10 < 20, 10 belongs to this sequence block. At query time a global index may be used, or the relevant block may be located quickly by comparing the block-head and block-tail keys, in which case a global index is not needed.
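The block lookup can be sketched as below. Python is used here for illustration rather than C++; the blocks are assumed to be sorted by head key and non-overlapping, and the function name and the second block are hypothetical.

```python
from bisect import bisect_right


def find_block(blocks, key):
    """Return the index of the block whose [head, tail] range contains key.

    blocks is a list of (head_key, tail_key) pairs sorted by head_key.
    Returns None when no block contains the key.
    """
    heads = [head for head, _ in blocks]
    i = bisect_right(heads, key) - 1  # last block whose head is <= key
    if i >= 0 and blocks[i][0] <= key <= blocks[i][1]:
        return i
    return None


toy_blocks = [(5, 20), (21, 40)]
block_of_3 = find_block(toy_blocks, 3)     # 3 < 5, so it is in no block
block_of_10 = find_block(toy_blocks, 10)   # 5 <= 10 <= 20
block_of_25 = find_block(toy_blocks, 25)   # 21 <= 25 <= 40
```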

The reference library may be built when sequence alignment is to be performed, or built and saved in advance. According to a specific embodiment of the invention, the reference library is built in advance and saved for later use, and building the reference library includes: determining the length L of the seed sequence from the total number of bases totalBase of the reference sequence, L = μ * log(totalBase), where L is less than the length (read length) of the reads to be aligned; generating all possible seed sequences based on the seed length, to obtain a seed sequence set; and determining the seed sequences in the set that can match the reference sequence, together with their matching positions, to obtain the reference library. This method is based on a relationship between seed length and reference sequence that the inventors established and verified through repeated hypothesis testing. It allows the built reference library to contain a comprehensive set of seed sequences together with the association of each seed sequence with its corresponding positions on the reference sequence. The reference library is compact, occupies little memory and supports high-speed access and query in sequence location analysis. One entry of the reference library obtained according to this embodiment includes only one key, and one key corresponds to at least one value.

This embodiment places no restriction on how all possible seed sequences are generated to obtain the seed sequence set. For an input set, the elements of the set can be traversed to obtain all possible element combinations of a specific length, for example using a recursive algorithm and/or a loop algorithm.
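One loop-based way to enumerate the seed set is sketched below with the standard library's itertools.product; a recursive version would produce the same set. The function name is hypothetical.

```python
from itertools import product


def all_seeds(length, alphabet="ATCG"):
    """Every possible sequence of the given length over the alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


seeds_of_3 = all_seeds(3)
# The set grows as len(alphabet) ** length, e.g. 4 ** 11 for a seed length of 11.
n_seeds_of_11 = 4 ** 11
```

Because the number of candidates grows exponentially with the seed length, a generator may be preferable to a list for larger lengths.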

In one example, the first reference sequence is at least a part of the human genome, which includes about 3 billion bases; the length of the reads to be processed is not less than 25 bp, and L takes an integer in [11, 15], which helps efficient alignment.

In one example, the first reference sequence is at least a part of a human cDNA reference genome. The total number of bases totalBase of the reference sequence is counted, and the length L of the seed sequence (seed) is set based on that total. Based on L and the four DNA base types A, T, C and G, a recursive algorithm is used to generate the set of all possible seed sequences, giving the seed sequence set, which can be expressed as seed = B₁B₂...B_L, B ∈ {A, T, C, G}. The seed sequences in the set that can match the reference sequence, and their matching positions, are determined, and the reference library is built with the seed sequences that can match the reference sequence as keys and the positions of those seed sequences on the reference sequence as values.

In one example, the first reference sequence is at least a part of the DNA genome and transcriptome of a species. The total number of bases totalBase of the reference sequence is counted, and the length L of the seed sequence (seed) is set based on that total. Based on L, the four base types A, T, C and G that make up DNA sequences and the four base types A, U, C and G that make up RNA sequences, a recursive algorithm is used to generate the set of all possible seed sequences, giving the seed sequence set, which can be expressed as seed = B₁B₂...B_L, B ∈ {A, T, C, G} ∪ {A, U, C, G}. The seed sequences in the set that can match the reference sequence, and their matching positions, are determined, and the reference library is built with the seed sequences that can match the reference sequence as keys and the positions of those seed sequences on the reference sequence as values.

In one example, the seed sequence may be converted into a string of numeric characters and the library built with that string as the key, which can speed up access and query of the built reference library. For example, after the seed sequences that can match the first reference sequence are obtained, the seed sequences are encoded as follows:

In another example, after the seed sequence set is obtained, the seed sequences in the set are encoded; the base encoding rule can be the same as above. The first reference sequence can also be converted with the same rule, which makes it quick to obtain the position of a seed sequence on the reference sequence and also speeds up access queries against the reference library.

According to a specific embodiment of the present invention, determining the seed sequences in the seed sequence set that can be matched to the first reference sequence, and their matching positions, includes: sliding a window of size L along the first reference sequence, and matching the seed sequences in the set against the window sequences obtained, so as to determine which seed sequences can be matched to the first reference sequence and where; the fault tolerance rate of the matching is ε₁. In this way, the positions of the seed sequences on the first reference sequence are obtained quickly, which helps build the reference library quickly. The fault tolerance rate is the proportion of mismatched bases allowed, where a mismatch is at least one of substitution, insertion and deletion.

In one example, the matching is strict matching, i.e. the fault tolerance rate ε₁ is zero: when a seed sequence is identical to one or more window sequences, the positions of those window sequences are the positions of the seed sequence on the first reference sequence. In another example, the matching is fault-tolerant matching, with ε₁ greater than zero: when the proportion of bases at which a seed sequence differs from one or more window sequences at the same positions is less than the fault tolerance rate, the positions of those window sequences are the positions of the seed sequence on the first reference sequence. In one example, the position of a seed sequence on the first reference sequence is encoded, and the reference library is built with the encoded characters, for example numeric characters, as the value.

Viewed another way, a non-zero ε₁ is equivalent to turning one seed sequence into a group of seed template sequences (seed templates) allowed under ε₁. For example, with seed = ATCG and ε₁ = 0.25, i.e. one error allowed in four bases, the seed templates can be ATCG, TTCG, CTCG, GTCG, AACG, ACCG, AGCG and so on. With ε₁ = 0.25, determining the position of seed = ATCG on the reference sequence is equivalent to determining the positions on the reference sequence of all seed templates corresponding to that seed. For example, for ref = ATCG, all of the seed templates listed above can be matched to that position; for ref = TTCG, the seed templates ATCG, TTCG, CTCG and GTCG can be matched to that position. Accordingly, the reference library can be built with a seed as the key, or with each of the seed templates corresponding to that seed as a key; the keys differ from one another, and each key corresponds to at least one value.
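The expansion of a seed into its seed templates can be illustrated with a short sketch. It rests on two assumptions that are not stated in the text: only substitutions are generated (insertions and deletions are left out), and the number of errors allowed is taken as the integer part of L*ε₁. The function name `seed_templates` is hypothetical.

```python
from itertools import combinations, product

def seed_templates(seed, tolerance, alphabet="ATCG"):
    """Return every sequence within int(len(seed) * tolerance) substitutions of seed."""
    max_errors = int(len(seed) * tolerance)
    templates = {seed}
    for n_errors in range(1, max_errors + 1):
        for sites in combinations(range(len(seed)), n_errors):
            for bases in product(alphabet, repeat=n_errors):
                candidate = list(seed)
                for site, base in zip(sites, bases):
                    candidate[site] = base
                templates.add("".join(candidate))
    return templates

# One error allowed in four bases, as in the example above.
templates = seed_templates("ATCG", 0.25)
```

For seed = ATCG this gives the seed itself plus three alternatives at each of the four sites, including the templates TTCG, CTCG, GTCG, AACG, ACCG and AGCG named above.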

According to a specific embodiment of the present invention, when the positions of the seed sequences on the reference sequence are determined, the step size of the sliding window over the first reference sequence is determined according to L and ε₁. In one example, the step size is not less than L*ε₁. In a specific example, the first reference sequence is at least part of the human genome, which contains about 3 billion bases; the length of the reads to be processed is not less than 25 bp, L is 14 bp, ε₁ is 0.2-0.3, and the step size is 3 bp-5 bp, so that during sliding-window positioning two adjacent windows can span the consecutive-error combinations allowed under ε₁, which favors fast positioning. In one example, the distance between two adjacent entries of the reference library is the step size of the sliding window.

According to a specific embodiment of the present invention, (a) includes: sliding a window of size L along the read with a step size of 1 bp to obtain a group of short fragments corresponding to that read. In this way, a read of length K yields (K-L+1) short fragments of length L. With the reads converted into short fragments, the reference library is queried by high-speed access to determine the position of each short fragment in the reference library, and hence the reference library information of the read to which the short fragments belong.
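As a minimal sketch of step (a), the following cuts a read into its (K-L+1) short fragments and looks each one up in a library. The function names are hypothetical, the library is assumed to be a plain dictionary from seed to positions, and the lookup is an exact match.

```python
def read_to_fragments(read, length):
    """Slide a window of the given length along the read with a step of 1 bp."""
    return [(offset, read[offset:offset + length])
            for offset in range(len(read) - length + 1)]

def locate_fragments(read, library, length):
    """Look up each fragment in a seed -> positions dictionary (exact match only)."""
    return {offset: library[fragment]
            for offset, fragment in read_to_fragments(read, length)
            if fragment in library}

# A 25 bp read with L = 14 gives 25 - 14 + 1 = 12 short fragments.
read = "ACGT" * 6 + "A"
fragments = read_to_fragments(read, 14)
```

Keeping the offset of each fragment within the read makes it possible to relate the fragment positions back to the read later on.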

According to a specific embodiment of the present invention, (b) includes: matching the short fragments against the seed sequences corresponding to the entries of the reference library to determine the positions of the short fragments in the reference library; the fault tolerance rate of the matching is ε₂.

In one example, the matching is strict matching, i.e. the fault tolerance rate ε₂ is zero: when a short fragment is identical to the seed or seed template corresponding to an entry of the reference library, the position information of that short fragment in the reference library is obtained.

In another example, the matching is fault-tolerant matching, with ε₂ greater than zero: when the proportion of bases at which a short fragment fails to match the seed or seed template corresponding to one or more entries of the reference library is less than ε₂, the position information of that short fragment in the reference library is obtained.

In one example, ε₂ = ε₁ and both are non-zero, so that as much valid data as possible is obtained.

According to a specific embodiment of the present invention, with reference to Figure 1, in (b) the distance between two adjacent entries of the reference library is the distance on the reference sequence ref between the two corresponding seed sequences X1 and X2, and falls into one of the following two cases. When the key and value of both entries are unique, i.e. each entry corresponds to one [key, value] pair (see Figure 1a), which is equivalent to X1 and X2 each matching the reference sequence uniquely (each matches only one position on the reference sequence), the distance is the distance between the two positions of X1 and X2 on the reference sequence. When the key of at least one of the two entries corresponds to multiple values (see Figure 1b), which is equivalent to at least one of X1 and X2 matching the reference sequence non-uniquely (matching multiple positions on the reference sequence), the distance is the distance between the two closest positions of X1 and X2 on the reference sequence. This embodiment does not restrict how the distance between two sequences is expressed; for example, it can be expressed as the distance from either end of one sequence to either end of the other, or as the distance from the center of one sequence to the center of the other.
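Both cases of Figure 1 reduce to taking the smallest distance over all pairs of positions, as the following sketch shows. The function name `entry_distance` is hypothetical, and the distance is measured between start positions, which is only one of the representations the embodiment allows.

```python
def entry_distance(positions_x1, positions_x2):
    """Smallest distance between any position of X1 and any position of X2."""
    return min(abs(p1 - p2) for p1 in positions_x1 for p2 in positions_x2)

unique_case = entry_distance([100], [130])                 # Figure 1a: one position each
multi_case = entry_distance([100, 5000], [4990, 9000])     # Figure 1b: several positions
```

When each seed has a single position the minimum is taken over one pair, so the unique case needs no separate handling.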

According to a specific embodiment of the present invention, after the second positioning result is obtained, (c) further includes: removing the short fragments whose connected length is less than a predetermined threshold and replacing the second positioning result with the result after removal, where the connected length is the total length on the reference sequence covered by the short fragments in the second positioning result that come from the same read and are positioned to different entries of the reference library. This processing helps remove some redundant and/or relatively low-quality data and improves alignment speed.

The connected length can be expressed as the sum of the lengths of the short fragments that come from the same read and are positioned to different entries of the reference library, minus the length of the overlap between those short fragments when mapped onto the reference sequence. In one example, four short fragments from one read are positioned to different entries of the reference library; they are denoted Y1, Y2, Y3 and Y4, with lengths S1, S2, S3 and S4 respectively. The positions to which Y1 and Y2 among them are mapped on the first reference sequence overlap, the length of the overlap being J, so the connected length is (S1+S2+S3+S4-J). In one example, the short fragments all have length L and the predetermined threshold is L, which improves alignment speed at the cost of losing some data that is valid but of relatively low quality.
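The connected length can be computed as the length of the union of the intervals that the short fragments cover on the reference sequence. The sketch below assumes half-open [start, end) intervals and a hypothetical function name; it generalizes the single-overlap example to any number of overlaps.

```python
def connected_length(intervals):
    """Total reference length covered by half-open [start, end) intervals."""
    covered = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start >= current_end:
            covered += end - start          # no overlap with what is already covered
            current_end = end
        elif end > current_end:
            covered += end - current_end    # count only the part not yet covered
            current_end = end
    return covered

# Four fragments of length 14; the first two overlap by J = 4 bases.
total = connected_length([(0, 14), (10, 24), (40, 54), (60, 74)])
```

With S1 = S2 = S3 = S4 = 14 and J = 4 the result is 14*4 - 4 = 52, in line with (S1+S2+S3+S4-J).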

According to a specific embodiment of the present invention, after the second positioning result is obtained, (c) further includes: evaluating the positioning result of each read according to the positioning results of the short fragments from that read in the second positioning result, and removing the reads whose evaluation result does not meet a predetermined requirement. Removing a read also removes the short fragments corresponding to that read. In this way, provided a certain sensitivity and accuracy are met, exact matching/local fast alignment is carried out directly on the second positioning result, which speeds up alignment.

This embodiment does not restrict the evaluation method; for example, quantitative scoring can be used. In one example, the positioning results of the short fragments from the same read are scored under the following rule: sites that match the first reference sequence lose points, and sites that do not match the first reference sequence gain points. After the second positioning result is obtained, the positioning result of each read is scored according to the positioning results of the short fragments from that read in the second positioning result, and the reads whose score is not greater than a first preset value are removed.

In a specific example, the read length is 25 bp. Sequence construction is performed on the short fragments from the same read to obtain a reconstructed sequence; for example, the base type at a site can be determined from the base supported by the most short fragments, and if a site has no supporting short fragment, i.e. no short fragment is aligned to that site, its base type is undetermined and is denoted N. The reconstructed sequence thus corresponds to the read, and its length is the read length. A site at which the reconstructed sequence matches the first reference sequence (ref) loses one point, and a site at which it does not match the first reference sequence gains one point. The alignment fault tolerance rate, i.e. the mismatch proportion allowed for one read/reconstructed sequence, is 0.12, so the length of error allowed in the alignment is 3 bp (25*0.12). The initial score Scoreinit is the read length, and the first preset value is 22 (25-3). Reconstructed sequences with a score below 22, i.e. those in which the proportion of sites not matching the first reference sequence exceeds the alignment fault tolerance rate, are removed, which speeds up alignment at the cost of losing some data that is valid but of relatively low quality.

According to a specific example, bit operations and a dynamic programming algorithm are used [G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395-415, 1999]. For each reconstructed sequence, the position of each site i is read in and a 64-bit binary mask is used for fast match scoring, with one point per site. The initial score Scoreinit is the read length, which can be written Score_init = length(read); match scoring gives the score Score, which can be expressed as:

In one example, the positioning results of the short fragments from the same read are scored under the following rule: sites that match the first reference sequence gain points, and sites that do not match the first reference sequence lose points. After the second positioning result is obtained, the positioning result of each read is scored according to the positioning results of the short fragments from that read in the second positioning result, and the short fragments corresponding to reads whose score is not less than a second preset value are removed.

In a specific example, the read length is 25 bp. Sequence construction is performed on the short fragments from the same read to obtain a reconstructed sequence; for example, the base type at a site can be determined from the base supported by the most short fragments, and if a site has no supporting short fragment, i.e. no short fragment is aligned to that site, its base type is undetermined and is denoted N. The reconstructed sequence thus corresponds to the read, and its length is the read length. A site at which the reconstructed sequence matches the first reference sequence (ref) gains one point, and a site at which it does not match the first reference sequence loses one point. The alignment fault tolerance rate, i.e. the mismatch proportion allowed for one read/reconstructed sequence, is 0.12, so the length of error allowed in the alignment is 3 bp (25*0.12). The initial score Scoreinit is -25, and the second preset value is -22, i.e. -(25-3). Reconstructed sequences with a score greater than -22 are removed, which speeds up alignment at the cost of losing some data that is valid but of relatively low quality.

According to a specific embodiment of the present invention, in (d), extending based on the short fragments from the same read in the second positioning result includes: performing sequence construction based on the short fragments from the same read to obtain a reconstructed sequence; and extending based on the common part of the reconstructed sequence and the reference sequence corresponding to that reconstructed sequence, to obtain an extended sequence. In this way, the short fragments and their position information are converted into the position information of the read to which the short fragments belong (referred to here as the reconstructed sequence), which helps subsequent alignment proceed quickly and accurately.

The common part is the part shared by multiple sequences. According to a specific embodiment of the present invention, the common part is a common substring and/or a common subsequence. A common substring is a contiguous part shared by multiple sequences, whereas a common subsequence need not be contiguous. For example, for ABCBDAB and BDCABA, BCBA is a common subsequence and AB is a common substring.
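The difference between the two notions can be checked on the ABCBDAB/BDCABA example with two textbook dynamic programs. This is an illustrative sketch with hypothetical function names; when several common substrings share the maximum length, the one found first is returned.

```python
def longest_common_substring(a, b):
    """Longest contiguous run shared by a and b (first one found on ties)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

def lcs_length(a, b):
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch == other else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

substring = longest_common_substring("ABCBDAB", "BDCABA")
subsequence_length = lcs_length("ABCBDAB", "BDCABA")
```

For this pair the longest common substring has length 2 (AB), while the longest common subsequence has length 4 (for instance BCBA).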

As for performing sequence construction based on the short fragments from the same read to obtain a reconstructed sequence: in one example, the base type at a site of the reconstructed sequence can be determined from the base supported by the most short fragments, and if a site has no supporting short fragment, i.e. no short fragment is aligned to that site of the reference sequence, its base type is undetermined and is denoted N. The reconstructed sequence is obtained in this way. It can be seen that the reconstructed sequence corresponds to the read, and that its length is the read length.
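The majority-vote construction described above can be sketched as follows. The input format, a list of (offset in read, fragment sequence) pairs for the short fragments that were positioned, and the function name `reconstruct` are assumptions made for illustration; how ties between equally supported bases are broken is not specified in the text, and here the base seen first wins.

```python
from collections import Counter

def reconstruct(read_length, placed_fragments):
    """Build a consensus of read_length sites; unsupported sites become N."""
    votes = [Counter() for _ in range(read_length)]
    for offset, sequence in placed_fragments:
        for i, base in enumerate(sequence):
            votes[offset + i][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)

# Sites 6 and 7 have no supporting fragment; site 4 is supported by A twice and T once.
consensus = reconstruct(8, [(0, "ATCG"), (1, "TCGA"), (2, "CGAT"), (2, "CGTT")])
```

The result always has the read length, with N marking the sites that no short fragment covers.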

The reference sequence corresponding to a reconstructed sequence is the segment of the reference sequence that matches the reconstructed sequence; its length is not less than the read length. In one example, the corresponding reference sequence has the same length as the reconstructed sequence, namely the read length. In another example, fault-tolerant matching between the reconstructed sequence and its corresponding reference sequence is allowed, and the length of the corresponding reference sequence is the length of the reconstructed sequence plus twice the fault-tolerant matching length. For example, with a reconstructed sequence length, i.e. read length, of 25 bp and 12% mismatch allowed between the reconstructed sequence and the reference sequence, the segment of the reference sequence to which the reconstructed sequence aligns, together with 3 bp (25*12%) of sequence at each end of that segment, is taken as the reference sequence corresponding to the reconstructed sequence.

According to a specific example of the present invention, the common part is a common substring. In (d), extending based on the short fragments from the same read in the second positioning result includes: searching for the common substrings of the reconstructed sequence and the reference sequence corresponding to that reconstructed sequence, and determining their longest common substring; and, based on the edit distance, extending the longest common substring to obtain the extended sequence. In this way, alignment results containing longer matched sequences are obtained more accurately.

According to a specific example of the present invention, the common part is a common subsequence. In (d), extending based on the short fragments from the same read in the second positioning result includes: searching for the common subsequences of the reconstructed sequence and the reference sequence corresponding to that reconstructed sequence, and determining their longest common subsequence; and, based on the edit distance, extending the longest common subsequence to obtain the extended sequence.

The edit distance, also called the Levenshtein distance, is the minimum number of edit operations needed to turn one of two strings into the other. The edit operations are substituting one character for another, inserting a character and deleting a character. In general, the smaller the edit distance, the more similar the two strings.
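For reference, the unit-cost edit distance defined above can be computed with the standard row-by-row dynamic program. This is a generic textbook sketch, not the bit-vector variant cited elsewhere in this description.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))          # distance from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

distance = levenshtein("ATCG", "TTCG")
```

Identical strings have distance 0, and a single substituted base, as in ATCG versus TTCG, gives distance 1.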

In one example, for a reconstructed sequence/read, finding the longest common substring of the reconstructed sequence and the reference sequence corresponding to it can be expressed as finding the common substring of two strings x₁x₂...x_i and y₁y₂...y_j, whose lengths are m and n respectively. Calculating the length c[i, j] of the common substring of the two strings gives the transfer equation:
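The transfer equation itself is not reproduced in this text. The standard recurrence for the longest common substring, which is consistent with the definitions of c[i, j], x_i and y_j given here and is offered only as a likely form, reads:

```latex
c[i,j] =
\begin{cases}
0, & i = 0 \ \text{or} \ j = 0, \\
c[i-1,\,j-1] + 1, & i, j > 0 \ \text{and} \ x_i = y_j, \\
0, & i, j > 0 \ \text{and} \ x_i \neq y_j .
\end{cases}
```

Under this form c[i, j] is the length of the common substring that ends at x_i and y_j.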

Solving the equation gives the length of the longest common substring of the two sequences as max(c[i, j]), i ∈ {1, ..., m}, j ∈ {1, ..., n}. Then, using the edit distance, converting the longest common substring into the corresponding reference sequence lets both ends of the longest common substring grow continuously, finding the minimum character operations (substitution, deletion, insertion) needed between the two strings. The edit distance can be determined with a dynamic programming algorithm, since the problem has optimal substructure; the calculation of the edit distance d[i, j] can be expressed by the following equation:
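The equation for d[i, j] is likewise not reproduced in this text. A recurrence consistent with the gap, match and mismatch terms described in this passage would be the following, where the boundary conditions are an assumption and the additional penalty for consecutive gaps is not modeled:

```latex
d[i,j] = \min
\begin{cases}
d[i-1,\,j] + \mathrm{gap}, \\
d[i,\,j-1] + \mathrm{gap}, \\
d[i-1,\,j-1] + s(x_i, y_j),
\end{cases}
\qquad
s(x_i, y_j) =
\begin{cases}
\mathrm{match}, & x_i = y_j, \\
\mathrm{mismatch}, & x_i \neq y_j,
\end{cases}
```

```latex
d[i,0] = i \cdot \mathrm{gap}, \qquad d[0,j] = j \cdot \mathrm{gap}.
```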

Here, a gap denotes the insertion or deletion of a character, and gap in the formula is the penalty for inserting or deleting a character (a site in the sequence); a match means that two characters are the same, and match in the formula is the score when two characters are the same; a mismatch means that two characters are different, and mismatch in the formula is the penalty when two characters are different. d[i, j] takes the minimum of the three. In a specific example, a gap is penalized 3 points, each consecutive gap adds a penalty of 1 point, a site mismatch is penalized 2 points, and a site match scores 0 points. This favors efficient alignment of sequences containing gaps.

According to a specific embodiment of the present invention, the common part is a common subsequence. According to a specific embodiment of the present invention, (d) includes: searching for the common subsequences of the short fragments positioned to the same entry of the reference library in the second positioning result, and determining the longest common subsequence corresponding to each read; and, based on the edit distance, extending the longest common subsequence to obtain the extended sequence.

In one example, for a reconstructed sequence/read, the longest common subsequence of the reconstructed sequence and the reference sequence corresponding to it is found. Based on the longest common subsequence, the segment of the reconstructed sequence corresponding to the longest common subsequence is converted into the segment of the reference sequence corresponding to the longest common subsequence, and the edit distance between these two segments is found with the Smith-Waterman algorithm. For two strings x₁x₂...x_i and y₁y₂...y_j, it can be obtained by the following formula:

Wherein,

σ denotes the scoring function: σ(i, j) is the score for a mismatch or match between characters (sites) x_i and y_j, σ(-, j) is the score for a gap (deletion) at x_i or an insertion of y_j, and σ(i, -) is the score for a deletion of y_j or an insertion of x_i. Then, using the edit-distance calculation of the preceding example, converting the segment of the reconstructed sequence corresponding to the longest common subsequence into the reference sequence corresponding to the reconstructed sequence lets both ends of that segment grow continuously, finding the minimum character operations (substitution, deletion, insertion).

In a specific example, a gap is penalized 3 points, each consecutive gap adds a penalty of 1 point, a site mismatch is penalized 2 points, and a site match scores 4 points. In this way, sequences containing gaps are aligned efficiently, and sequences that contain gaps but are highly accurate at their other sites are retained.

According to a specific embodiment of the present invention, (d) further includes: truncating the extended sequence from at least one of its ends and calculating the proportion of mislocated sites in the truncated extended sequence; truncation stops when the following condition is met: the proportion of mislocated sites in the truncated extended sequence is less than a third preset value. Truncating and discarding in this way retains well-matched local sequences and improves the proportion of usable data.

Specifically, according to a specific embodiment of the present invention, the extended sequence is truncated as follows: (i) a first error rate and a second error rate are calculated; if the first error rate is less than the second error rate, the extended sequence is truncated from its first end, and if the first error rate is greater than the second error rate, the extended sequence is truncated from its second end, giving the truncated extended sequence. The first error rate is the proportion of mislocated sites in the extended sequence obtained by truncating from the first end, and the second error rate is the proportion of mislocated sites in the extended sequence obtained by truncating from the second end. (ii) The extended sequence is replaced with the truncated extended sequence and (i) is repeated until the proportion of mislocated sites in the truncated extended sequence is less than a fourth preset value. Truncating from both ends and discarding in this way retains well-matched local sequences and improves the proportion of usable data. According to a specific example, the length of the extended sequence is 25 bp, and the fourth preset value, i.e. the third preset value, is set to 0.12.
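Steps (i) and (ii) can be sketched as a loop that removes 1 bp at a time from whichever end leaves the lower error rate. The representation of the extended sequence as a list of per-site flags, the function names, and the handling of ties (the first end is trimmed when both error rates are equal) are assumptions for illustration; the text does not specify the tie case.

```python
def error_rate(flags):
    """Proportion of mislocated sites; an empty sequence counts as 0.0."""
    return sum(flags) / len(flags) if flags else 0.0

def truncate_extended(flags, threshold):
    """Return the half-open (start, end) range kept after truncation.

    flags[i] is True when site i of the extended sequence is mislocated.
    """
    start, end = 0, len(flags)
    while start < end and error_rate(flags[start:end]) >= threshold:
        first = error_rate(flags[start + 1:end])    # rate after trimming the first end
        second = error_rate(flags[start:end - 1])   # rate after trimming the second end
        if first <= second:   # ties go to the first end (assumption)
            start += 1
        else:
            end -= 1
    return start, end

# Two mislocated sites at the first end of a 10-site sequence, threshold 0.12.
kept = truncate_extended([True, True] + [False] * 8, 0.12)
```

In this toy case one truncation from the first end brings the error rate from 0.2 down to 1/9, which is below 0.12, so truncation stops with sites 1 to 9 retained.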

According to a specific embodiment of the present invention, (d) further includes: sliding a window along the extended sequence from at least one of its ends, calculating the proportion of mislocated sites in the window sequence obtained, and truncating the extended sequence according to that proportion; truncation stops when the following condition is met: the proportion of mislocated sites in the window sequence obtained is greater than a fifth preset value. Truncating and discarding in this way retains well-matched local sequences and improves the proportion of usable data.

Specifically, according to a specific embodiment of the present invention, the extended sequence is truncated as follows: (i) a third error rate and a fourth error rate are calculated; if the third error rate is less than the fourth error rate, the extended sequence is truncated from its second end, and if the third error rate is greater than the fourth error rate, the extended sequence is truncated from its first end, giving the truncated extended sequence. The third error rate is the proportion of mislocated sites in the window sequence obtained by sliding the window along the extended sequence from its first end, and the fourth error rate is the proportion of mislocated sites in the window sequence obtained by sliding the window along the extended sequence from its second end. (ii) The extended sequence is replaced with the truncated extended sequence and (i) is repeated until the proportion of mislocated sites in the window sequence is greater than a sixth preset value. Truncating from both ends and discarding in this way retains well-matched local sequences and improves the proportion of usable data.

According to a specific embodiment of the present invention, the window used for sliding is not larger than the length of the extended sequence. According to a specific example, the length of the extended sequence is 25 bp, the window size is 10 bp, and the sixth preset value, i.e. the fifth preset value, is 0.12.

According to a specific embodiment of the present invention, the truncation size is 1 bp, i.e. each truncation removes 1 base. In this way, alignment results containing more long sequences are obtained efficiently.

For the result of the difference comparison to have statistical significance, there are generally multiple negative samples; for example, the number of negative samples is not less than 20, preferably not less than 30.

A negative sample is a sample without chromosomal aneuploidy abnormality. For example, if the target of the variation detection is a human, or the sample under test comes from a human body, the negative sample is a sample obtained from a normal diploid individual. There is no restriction on the order in which the sequencing results of the negative samples and of the sample under test are obtained; they can be obtained simultaneously or one after the other, preferably simultaneously under the same experimental conditions, to minimize the effect of differences in experimental factors on the detection result. In addition, the negative samples and the sample under test are preferably of the same sample type; for example, for non-invasive detection of the genetic information of a fetus in the mother, both can be maternal blood samples.

According to a specific embodiment of the present invention, determining the amount of reads positioned to the corresponding first chromosome in the negative samples includes: performing steps (1)-(3) with a negative sample in place of the sample under test, to obtain the amount of reads positioned to the first chromosome of that negative sample; and taking the mean of the amounts of reads of the first chromosome over multiple negative samples as the amount of reads positioned to the corresponding first chromosome in the negative samples.

The amount of reads positioned to a chromosome can be an absolute amount or a relative amount, for example expressed as a single value such as an integer or a ratio, or as a numerical range.

According to a specific embodiment of the present invention, before step (3) is carried out, at least one, at least two or all three of the following (i)-(iii) are carried out: (i) removing the reads in the sequencing result whose length is not greater than a predetermined length; (ii) removing the reads in the comparison result that are not positioned to a unique position on the first reference sequence (reads aligned/positioned to a unique position on the reference sequence are called unique reads); (iii) removing the reads in the comparison result whose error rate is not less than a predetermined error rate, where the error rate of a read is the proportion of bases in the aligned read that are at least one of inserted, deleted and mismatched.

In one example, the error rate of a read is the proportion of the number of bases, or positions, in the aligned read that show an insertion, deletion or mismatch.

The predetermined error rate can be set according to the sequencing platform, the amount of raw output data, the data quality, the purpose of the detection and so on. It will be understood that if the amount of raw output data is small and/or the data quality is high, a larger predetermined error rate may be suitable; conversely, a smaller predetermined error rate can be set to remove lower-quality data while still meeting the needs of detection, which favors fast detection.

In one example, for a sequencing result from a single-molecule sequencing platform, all of (i)-(iii) are used to filter the sequencing result, which favors fast detection. Specifically, the raw output data is 12.8M and the predetermined error rate is set to 10%, i.e. a read of 10 bp may contain at most 1 bp of insertion, deletion or mismatch; 3.4M of data is obtained after filtering. It will be understood that if relatively strict filtering was applied during alignment, (iii) can be omitted, for example by setting the predetermined error rate to 100%.
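Filters (i)-(iii) can be sketched as a single predicate over aligned reads. The record type `AlignedRead`, its field names and the function `keep_read` are hypothetical, and the comparisons follow the wording of (i)-(iii) literally: a read is removed when its length is not greater than the predetermined length, when it is not uniquely positioned, or when its error rate is not less than the predetermined error rate.

```python
from dataclasses import dataclass

@dataclass
class AlignedRead:
    length: int        # read length in bp
    n_positions: int   # number of reference positions the read is positioned to
    n_errors: int      # inserted + deleted + mismatched bases after alignment

def keep_read(read, predetermined_length, predetermined_error_rate):
    """Return True when the read passes filters (i), (ii) and (iii)."""
    if read.length <= predetermined_length:                       # (i)
        return False
    if read.n_positions != 1:                                     # (ii) not a unique read
        return False
    if read.n_errors / read.length >= predetermined_error_rate:   # (iii)
        return False
    return True

reads = [AlignedRead(30, 1, 2), AlignedRead(25, 1, 0),
         AlignedRead(30, 2, 0), AlignedRead(30, 1, 6)]
kept_reads = [r for r in reads if keep_read(r, 25, 0.10)]
```

Of the four toy reads only the first survives: the second is too short, the third is positioned to two places, and the fourth has an error rate of 0.2.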

According to a specific embodiment of the present invention, step (3) includes: (a) sliding a window of size L3 along the first reference sequence to obtain multiple third windows, optionally with a step size of L3; (b) determining, based on the comparison result, the sequencing depth of each third window, which is the ratio of the number of reads aligned to that third window to the window size L3; (c) obtaining the amount of reads positioned to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.
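Steps (a)-(c) can be sketched for non-overlapping windows (step size equal to L3) as follows. Assigning each read to the window containing its start position, and summarizing the chromosome by the mean window depth in (c), are assumptions made for illustration; the function names are hypothetical.

```python
def window_depths(read_starts, chrom_length, window_size):
    """Depth of each non-overlapping window = reads starting in it / window size."""
    n_windows = -(-chrom_length // window_size)   # ceiling division
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window_size] += 1
    return [count / window_size for count in counts]

def chromosome_read_amount(depths):
    """One possible summary for step (c): mean depth over the chromosome's windows."""
    return sum(depths) / len(depths)

# Toy chromosome of 30 bp with windows of 10 bp and five reads.
depths = window_depths([0, 5, 12, 19, 25], 30, 10)
amount = chromosome_read_amount(depths)
```

The first two windows each hold two reads and the last holds one, giving depths of 0.2, 0.2 and 0.1.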

According to a specific embodiment of the present invention, (b) includes: correcting the sequencing depth of the third window based on the GC content of the third window, and taking the corrected sequencing depth as the sequencing depth of the third window.

The size of the third window, L3, should generally be chosen so that it can reflect the differences that GC content and its distribution bring to the sequencing results of those regions (third windows). For the human genome, L3 is generally less than 300 Kbp. In one example, the inventors determine L3 from the relationship between the coefficient of variation (CV) and window size, as shown in Figure 3: according to that curve, a window size at which the CV value is clearly affected by window size is selected as the third window size. For example, setting L3 to 100 Kbp-200 Kbp reflects the influence of GC content and its distribution on sequencing and also favors rapid alignment. The coefficient of variation, also called the coefficient of dispersion, is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation to the mean; here it reflects the magnitude of the dispersion of the GC content of a group of windows/regions of a given size.

Two adjacent third windows may or may not overlap. In one example, L3 is set to 150 Kbp and two adjacent third windows overlap by 100 bp, i.e., the sliding step is set to 149.9 Kbp.
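By way of non-limiting illustration, the coefficient of variation of GC content for a given window size can be computed as in the following Python sketch. The function names and the toy sequence are hypothetical, and the population standard deviation is used as an assumption.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc_cv(sequence, window_size):
    """CV (standard deviation / mean) of GC content over non-overlapping windows."""
    windows = [sequence[i:i + window_size]
               for i in range(0, len(sequence) - window_size + 1, window_size)]
    gc_values = [gc_fraction(w) for w in windows]
    return pstdev(gc_values) / mean(gc_values)


# Toy sequence in which GC content alternates every 4 bp.
toy = "AAAAGGGG" * 8
for size in (4, 8, 16):
    print(size, gc_cv(toy, size))
```

In this toy sequence the CV is large for 4 bp windows and falls to zero once the window spans a full repeat, which mirrors, in a highly simplified way, how a CV-versus-window-size curve can guide the choice of L3.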

Specifically, the correction can be performed by establishing the relationship between the GC content of the third windows and their sequencing depth; in one example, this relationship is established using locally weighted regression.
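By way of non-limiting illustration, a locally weighted linear regression can be sketched in plain Python as follows. The function name, the tricube weighting, the neighbourhood fraction `frac` and the toy values are assumptions of this sketch and are not taken from the specification; in practice an established statistics library would normally be used.

```python
def loess_fit(xs, ys, x0, frac=0.5):
    """Locally weighted linear regression estimate at x0 (tricube weights)."""
    k = max(2, int(round(frac * len(xs))))
    # The k nearest neighbours of x0 define the local neighbourhood.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    bandwidth = max(abs(xs[i] - x0) for i in nearest) or 1.0
    weights = [(1 - min(abs(xs[i] - x0) / bandwidth, 1.0) ** 3) ** 3
               for i in nearest]
    # Tiny floor so the farthest neighbour never makes the fit degenerate.
    weights = [max(w, 1e-12) for w in weights]
    sw = sum(weights)
    mx = sum(w * xs[i] for w, i in zip(weights, nearest)) / sw
    my = sum(w * ys[i] for w, i in zip(weights, nearest)) / sw
    sxx = sum(w * (xs[i] - mx) ** 2 for w, i in zip(weights, nearest))
    sxy = sum(w * (xs[i] - mx) * (ys[i] - my) for w, i in zip(weights, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


# Toy data: GC content per window and relative depth following y = 2 - 2x.
gc = [0.35, 0.38, 0.40, 0.42, 0.45, 0.48]
depth = [1.30, 1.24, 1.20, 1.16, 1.10, 1.04]
print(round(loess_fit(gc, depth, 0.41), 4))
```

The fitted value at a window's GC content gives the depth expected from GC content alone, which can then serve as the basis for correcting that window.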

According to a specific embodiment of the present invention, (b) further includes, before performing the above correction, standardizing the sequencing depth of the third window and taking the standardized sequencing depth as the sequencing depth of the third window.

In one example, the standardization is a normalization; for example, the depths of the third windows can be normalized by the mean or the median of the sequencing depths.
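By way of non-limiting illustration, normalization by the mean or the median can be sketched in Python as follows; the function name and the toy depths are hypothetical.

```python
from statistics import mean, median


def normalize_depths(depths, method="mean"):
    """Divide each window depth by the mean (or median) depth of all windows."""
    center = mean(depths) if method == "mean" else median(depths)
    return [d / center for d in depths]


raw = [0.8, 1.0, 1.2, 1.0]
print(normalize_depths(raw))
print(normalize_depths(raw, method="median"))
```

After normalization by the mean, the values fluctuate around 1 regardless of the overall sequencing yield of the sample.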

According to a specific embodiment of the present invention, in (c), a weight coefficient for the reads aligned to a third window is determined based on the sequencing depth of that window, and the amount of reads mapped to the first chromosome is determined based on the weight coefficients.

In one example, the sequencing depth of each third window is standardized or normalized, for example by taking the ratio of the window's sequencing depth to a particular value as the sequencing depth of that window, the particular value being the mean sequencing depth of the third windows, so that the third-window sequencing depths are converted into a set of values fluctuating around 1; the relationship between the processed sequencing depth (the relative sequencing depth) and the GC content is then determined.

The weight coefficient of the reads of a third window is the relative sequencing depth of that window. In one example, the amount of reads mapped to the first chromosome is a relative amount, namely an amount corrected by the weight coefficients; in this way the influence of differences in GC content and/or its distribution on the test result can be eliminated or reduced, improving detection accuracy.

In some examples, the inventors found that the relative sequencing depth of a third window is inversely proportional to the GC content of that window, i.e., third windows with low GC content have a high relative sequencing depth and third windows with high GC content have a low relative sequencing depth. Accordingly, for the relative amount corrected by the weight coefficient: if, for example, N reads map to a given third window of the first chromosome and the relative sequencing depth of that window is w, then after correction the number of reads mapped to that third window of the first chromosome is $N/w$.
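By way of non-limiting illustration, and assuming (consistently with the worked example later in this specification, where each read contributes 1/w) that a window with N reads and relative sequencing depth w contributes N/w, the corrected read amount of a chromosome can be sketched in Python as follows. The function name and the toy values are hypothetical.

```python
def corrected_read_count(windows):
    """Sum of N / w over windows (N = read count, w = relative sequencing depth)."""
    return sum(n_reads / weight for n_reads, weight in windows)


# (reads in window, relative sequencing depth w of that window): toy values
chrom_windows = [(120, 1.2), (100, 1.0), (80, 0.8)]
print(corrected_read_count(chrom_windows))
```

In this toy case the over-represented window (w = 1.2) and the under-represented window (w = 0.8) are both brought back to about 100 reads.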

In one example, the amount of reads mapped to the first chromosome is a relative amount, namely the ratio of the amount of reads mapped to the first chromosome to the amount of reads mapped to all or at least some of the autosomes. A z-test (z-score) is used to determine whether the difference between this ratio and the corresponding ratio in negative samples is statistically significant, so as to judge whether the first chromosome of the sample to be tested has an aneuploidy abnormality.

According to a specific embodiment of the present invention, the first chromosome is selected from at least one of chromosomes 13, 18 and 21. For example, cell-free nucleic acid in a maternal blood sample is tested to obtain fetal genetic information, including screening for, or assisting the diagnosis of, aneuploidy of chromosomes 13, 18 and/or 21 in the fetus.

Generally, the GC content and its distribution differ between chromosomes. For example, based on relative GC content, the chromosomes of the genome can be grouped into a high-GC group, a medium-GC group and a low-GC group, or into a high-GC group, a medium-high-GC group, a medium-GC group, a medium-low-GC group and a low-GC group.

Table 2 shows the GC content of human autosomes. Based on sequencing data from multiple control samples, the inventors plotted the relationship between the standardized sequencing depth of each chromosome and its GC content, as shown in Figure 4. The sequencing results of chromosomes with relatively high or relatively low GC content are more clearly affected by GC content. Of chromosomes 21, 13 and 18, chromosome 21 is comparatively the least affected by GC content, chromosome 18 is next, and chromosome 13 is affected more strongly.

Table 2

Chr	4	5	6	3	18	8	2	7	12	21
GC content	0.3825	0.3952	0.3961	0.3969	0.3979	0.4018	0.4024	0.4075	0.4081	0.4083
Chr	14	11	10	1	15	20	16	17	22	19
GC content	0.4089	0.4157	0.4158	0.4174	0.4220	0.4413	0.4479	0.4554	0.4799	0.4836

According to a specific embodiment of the present invention, the sample to be tested is a maternal blood sample. The content of fetal cell-free nucleic acid, including fetal cell-free DNA (cffDNA), in a maternal cell-free nucleic acid sample fluctuates greatly between pregnant women and/or between gestational weeks. If detection sensitivity can be improved, samples from earlier gestational weeks can be tested at the same detection accuracy, so that intervention in the pregnancy can take place earlier and the impact on the pregnant woman is smaller; if accuracy can be improved, both false positives and false negatives are reduced, which may ultimately make application to diagnosis possible rather than only to screening as an aid to diagnosis. Generally, a body-fluid sample from a pregnant woman goes through cffDNA extraction, library construction and on-machine sequencing, finally yielding off-machine sequencing data (for example in fastq format). The off-machine data are aligned to a reference sequence to obtain an alignment result (for example a sam file) containing, for each read, information such as its position in the genome, its alignment score, whether it aligns uniquely and its alignment errors. From the alignment result, the number of reads on a given chromosome can be counted, and finally the proportion of that chromosome's read count in the autosomal read count (hereinafter, the chromosome proportion) is calculated to judge whether the chromosome is numerically abnormal.

According to a specific embodiment of the present invention, to carry out non-invasive prenatal screening (NIPT or NIPD), a batch of body-fluid samples containing cell-free DNA can first be obtained from pregnant women whose fetuses have been confirmed normal by testing (negative samples), and the proportions of chromosomes such as chromosomes 21, 18 and 13 in these samples are calculated, thereby determining the range or boundary for a normal and/or abnormal number of the chromosome(s). Positive samples can be used in the same way to determine the range or boundary for a normal and/or abnormal chromosome number.
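By way of non-limiting illustration, the chromosome proportion described above can be computed as in the following Python sketch. The function name, the dictionary layout keyed by chromosome name and the toy counts are hypothetical.

```python
def chromosome_fraction(read_counts, chrom):
    """Share of autosomal reads that map to one chromosome."""
    autosomes = [str(i) for i in range(1, 23)]
    total = sum(read_counts.get(c, 0) for c in autosomes)
    return read_counts[chrom] / total


counts = {str(i): 1000 for i in range(1, 23)}   # toy counts, not real data
counts["21"] = 400
counts["X"] = 900                                # sex chromosomes are ignored
print(round(chromosome_fraction(counts, "21"), 5))
```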

An embodiment of the present invention also provides a device for detecting chromosomal aneuploidy, the device being used to implement the method for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments of the invention. The device includes: a sequencing module for sequencing at least part of the nucleic acid in a sample to be tested to obtain a sequencing result including reads; an alignment module for aligning the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result including information on the chromosome to which each read maps, the first reference sequence being the set of regions of the reference genome whose alignment ability (mappability) is 1, a region with mappability 1 being a region that maps to a unique position in the reference genome; a quantification module for determining, for a first chromosome and based on the alignment result from the alignment module, the amount of reads mapped to the first chromosome; and a judgment module for comparing the amount of reads mapped to the first chromosome, from the quantification module, with the amount of reads mapped to the corresponding first chromosome in negative samples, so as to judge the number of copies of the first chromosome.

The description of the technical features and effects of the method for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments of the invention applies equally to the device in this embodiment and is not repeated here.

For example, in some embodiments, determining the alignment ability (mappability) of a region includes: sliding a first window of size L1 along the reference genome to obtain multiple regions, where the step of the sliding window can be set to, for example, 1 bp; and aligning each region to the reference genome and calculating the mappability of the region from the number of positions in the reference genome to which it aligns.

In some embodiments, the number of negative samples is not less than 20, preferably not less than 30.

In some embodiments, the amount of reads mapped to the corresponding first chromosome in negative samples is determined as follows: a negative sample is substituted for the sample to be tested and passed through the sequencing module, the alignment module and the quantification module to determine the amount of reads mapped to the first chromosome of that negative sample; the mean of the amounts of reads of the first chromosome over multiple negative samples is taken as the amount of reads mapped to the corresponding first chromosome in negative samples.

In some embodiments, the first reference sequence is at least part of the reference genome hg19 sequence from which the regions shown in Table 1 have been removed.
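By way of non-limiting illustration, the sliding-window determination described above can be sketched in Python on a toy sequence. Defining the mappability of a window as the reciprocal of its number of exact, forward-strand occurrences is an assumption of this sketch; a real aligner would also consider mismatches and the reverse strand. The function name and the toy genome are hypothetical.

```python
from collections import Counter


def window_mappability(genome, window_size):
    """Mappability per window start: 1 / (number of exact occurrences in the genome)."""
    kmers = [genome[i:i + window_size]
             for i in range(len(genome) - window_size + 1)]   # step of 1 bp
    occurrences = Counter(kmers)
    return [1 / occurrences[k] for k in kmers]


toy_genome = "ACGTACGTTTGA"
scores = window_mappability(toy_genome, window_size=4)
unique_starts = [i for i, s in enumerate(scores) if s == 1]
print(unique_starts)
```

Windows starting at positions 0 and 4 both read ACGT, so their mappability is 0.5 and they would be excluded from a reference built only from regions with mappability 1.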

In some embodiments, the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the average sequencing depth of all second windows.

In other embodiments, the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 6 times the average sequencing depth of all second windows.
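By way of non-limiting illustration, removing second windows whose depth is not less than a given multiple of the average depth can be sketched in Python as follows; the function name and the toy depths are hypothetical.

```python
from statistics import mean


def keep_normal_depth_windows(depths, fold=4):
    """Indices of windows whose depth is below fold x the mean depth of all windows."""
    cutoff = fold * mean(depths)
    return [i for i, d in enumerate(depths) if d < cutoff]


window_depths_toy = [1.0] * 9 + [10.0]     # toy depths; the last window is an outlier
print(keep_normal_depth_windows(window_depths_toy, fold=4))
print(keep_normal_depth_windows(window_depths_toy, fold=6))
```

With these toy values the mean depth is 1.9, so the outlier window is removed at the 4-fold cutoff (7.6) but retained at the 6-fold cutoff (11.4), which shows that the 6-fold criterion is the more permissive of the two.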

In some embodiments, the first reference sequence is at least part of the reference genome in which the regions corresponding to the second windows have been processed as follows: the sequencing depth of any second window above the 98th percentile is set to the sequencing depth of the second window at the 98th percentile.

In other embodiments, the sequencing depth of any second window above the 99th percentile is set to the sequencing depth of the second window at the 99th percentile. The second windows are obtained by sliding a window of size L2 along the reference genome; in one example, the step of the sliding window is also L2. The sequencing depth of a second window is the ratio of the number of reads aligned to that window to the window size L2.
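By way of non-limiting illustration, capping depths at a percentile can be sketched in Python as follows. The nearest-rank definition of the percentile, the function names and the toy depths are assumptions of this sketch; the specification does not state which percentile definition is used.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cap_at_percentile(depths, pct=98):
    """Replace every depth above the pct-th percentile with the percentile value."""
    cap = percentile_nearest_rank(depths, pct)
    return [min(d, cap) for d in depths]


toy_depths = list(range(1, 101))          # toy depths 1..100
capped = cap_at_percentile(toy_depths, pct=98)
print(capped[-3:])
```

Unlike removal, capping keeps every window in the reference while limiting the influence of windows with extreme depth.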

In some embodiments, the device further includes a filtering module for performing at least one of the following (i)-(iii): (i) removing reads in the sequencing result whose length is not greater than a predetermined length; (ii) removing reads in the alignment result that do not map to a unique position of the first reference sequence; (iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of inserted, deleted and mismatched.

In some embodiments, the quantification module is used to perform the following: (a) sliding a window of size L3 along the first reference sequence to obtain multiple third windows; (b) determining, based on the alignment result, the sequencing depth of each third window, the sequencing depth of a third window being the ratio of the number of reads aligned to that window to the window size L3; (c) determining the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.

In some examples, (b) further includes standardizing the sequencing depth of the third window and taking the standardized sequencing depth as the sequencing depth of the third window.

In other examples, (b) further includes correcting the sequencing depth of the third window based on the GC content of the third window and taking the corrected sequencing depth as the sequencing depth of the third window. The sequencing depth of the third window before correction can be the standardized sequencing depth of the third window.

Specifically, the correction is performed using the relationship between the GC content of the third windows and their sequencing depth; in one example, this relationship is established using locally weighted regression.

In some examples, (c) includes determining, based on the sequencing depth of a third window, a weight coefficient for the reads aligned to that window, and determining the amount of reads mapped to the first chromosome based on the weight coefficients.

In some embodiments, the sample to be tested is a maternal blood sample.

In some embodiments, the first chromosome is at least one of fetal chromosomes 13, 18 and 21.

An embodiment of the present invention provides a computer-readable storage medium for storing/carrying a program for execution by a computer, execution of the program including carrying out the chromosomal aneuploidy detection method of any of the above embodiments or specific embodiments. The description of the technical features and effects of the method and/or device for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments of the invention applies equally to the computer-readable storage medium of this embodiment and is not repeated here.

An embodiment of the present invention also provides a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the chromosomal aneuploidy detection method of any of the above embodiments or specific embodiments.

Example

The reference sequence used is the set of regions of the reference genome that exclude the regions shown in Table 1 and also satisfy the following: 1) the alignment ability (mappability) is 1; 2) regions whose sequencing depth is not less than 6 times the average sequencing depth are removed, or the sequencing depth of regions above the 99th percentile of sequencing depth is set to the 99th-percentile sequencing depth value.

Control samples and samples to be tested undergo the following processing:

1. Sequencing is performed to obtain off-machine data, i.e., a set of reads; reads shorter than 25 bp are discarded.

2. The alignment result (sam file) is obtained, including the unique reads (reads aligned to a unique position of the reference sequence) and the positions of these reads on the reference sequence/reference chromosome.

3. GC correction is performed, including: cutting the reference sequence into windows/regions of 150 Kbp (Bin = 150K); calculating the sequencing depth of each Bin from the unique reads and normalizing it to obtain the normalized sequencing depth; counting the GC content of each Bin; and establishing the relationship between normalized sequencing depth and GC content, for example by fitting an equation relating the two with the GC content of a Bin as x and the normalized sequencing depth of the corresponding Bin as y.

4. Using y as the weight coefficient w, the number of reads uniquely aligned to the window/region in the alignment result is corrected, expressed as a score or contribution value of 1/w for each such read.

5. The corrected number of unique reads of chromosome i is determined, expressed as the sum of the scores of all unique reads of chromosome i: $S_i = \sum_{r \in \text{chr}\,i} 1/w_r$, where $w_r$ is the weight coefficient of the window containing read r.

6. The proportion of the sum of the scores of the unique reads of chromosome i among all autosomes is calculated: $\mathrm{Ratio}_i = S_i / \sum_{j=1}^{22} S_j$.

Then, based on the ratio values of chromosome i in multiple control samples ($\mathrm{Ratio}_i$), the mean $\mu_i$ and standard deviation $\sigma_i$ of the ratio value of chromosome i are determined;

The z-test formula $\mathrm{Zscore}_i = (\mathrm{Ratio}_i - \mu_i)/\sigma_i$ is used to calculate the Z score (Zscore) of chromosome i of the sample to be tested, and this value is compared with a threshold to judge whether chromosome i of the sample to be tested is numerically abnormal.

If the Zscore of a chromosome of a maternal peripheral blood sample to be tested is >= 3, the difference is statistically significant and the fetus carried by the mother is considered to have three copies of that chromosome.

As for the threshold, the ratio values of chromosome i in multiple control samples follow, or approximately follow, a normal distribution, so z values and their corresponding confidence levels can be looked up in a z table (normal distribution table). For example, taking a confidence level of 99.97%, the corresponding z value is approximately 3; a value above this z value indicates with 99.97% probability that the sample is not a normal sample, and it can be judged abnormal. Of course, those skilled in the art can set other confidence levels as needed and use the corresponding z value as the threshold to judge whether an abnormality is present.

Using the above method, eleven samples already confirmed positive by karyotyping were tested; all samples were detected, and the results are shown in Table 3.
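By way of non-limiting illustration, the z-score comparison against control samples can be sketched in Python as follows. The function name and the ratio values are hypothetical toy numbers rather than measured data, only ten controls are used for brevity, and the use of the sample standard deviation of the controls is an assumption of this sketch.

```python
from statistics import mean, stdev


def z_score(sample_ratio, control_ratios):
    """Z score of a sample's chromosome ratio against control (negative) samples."""
    mu = mean(control_ratios)
    sigma = stdev(control_ratios)      # sample standard deviation of the controls
    return (sample_ratio - mu) / sigma


# Toy chromosome-21 ratios for control samples (illustrative numbers only).
controls = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132,
            0.0128, 0.0130, 0.0131, 0.0129, 0.0130]
test_ratio = 0.0137
z = z_score(test_ratio, controls)
print("abnormal" if z >= 3 else "normal", round(z, 2))
```

A different threshold can be passed in place of 3 when another confidence level is chosen.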

Table 3

In the description of this specification, references to "an embodiment", "some embodiments", "one or some specific embodiments", "one or some examples", "an example" and the like mean that the specific features, structures or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures and the like described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and purpose of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for detecting chromosomal aneuploidy, characterized by comprising:

(1) sequencing at least part of the nucleic acid in a sample to be tested to obtain a sequencing result including reads;

(2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on the specific chromosome to which each read maps, the first reference sequence being the set of regions of a reference genome whose alignment ability (mappability) is 1, a region with mappability 1 being a region that maps to a unique position in the reference genome;

(3) for a first chromosome, determining, based on the alignment result, the amount of reads mapped to the first chromosome;

(4) comparing the amount of reads mapped to the first chromosome with the amount of reads mapped to the corresponding first chromosome in negative samples, so as to judge the number of copies of the first chromosome.

2. The method according to claim 1, characterized in that determining the alignment ability (mappability) of the region includes:

sliding a first window of size L1 along the reference genome to obtain multiple regions, where, optionally, the step of the sliding window is 1 bp;

aligning the region to the reference genome, and calculating the mappability of the region based on the number of positions in the reference genome to which the region aligns;

optionally,

the number of negative samples is not less than 20, optionally not less than 30;

optionally,

the amount of reads mapped to the corresponding first chromosome in negative samples is determined as follows:

performing (1)-(3) with a negative sample in place of the sample to be tested, to determine the amount of reads mapped to the first chromosome of the negative sample;

taking the mean of the amounts of reads of the first chromosome of multiple negative samples as the amount of reads mapped to the corresponding first chromosome in negative samples;

optionally,

the first reference sequence is at least part of the reference genome hg19 sequence from which the regions shown in the following table have been removed:

optionally,

the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the average sequencing depth of all second windows, or, optionally, the sequencing depth of the second window is not less than 6 times the average sequencing depth of all second windows;

the second windows are obtained by sliding a window of size L2 along the reference genome, where, optionally, the step of the sliding window is L2, and the sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the size of that second window;

optionally,

the first reference sequence is at least part of the reference genome in which the regions corresponding to the second windows have been processed as follows: the sequencing depth of any second window above the 98th percentile is set to the sequencing depth of the second window at the 98th percentile, or, optionally, the sequencing depth of any second window above the 99th percentile is set to the sequencing depth of the second window at the 99th percentile;

the second windows are obtained by sliding a window of size L2 along the reference genome, where, optionally, the step of the sliding window is L2, and the sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the second window size L2;

optionally,

the method includes, before step (3), performing at least one of the following (i)-(iii):

(i) removing reads in the sequencing result whose length is not greater than a predetermined length;

(ii) removing reads in the alignment result that do not map to a unique position of the first reference sequence;

(iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of inserted, deleted and mismatched.

3. The method according to claim 1 or 2, characterized in that step (3) includes:

(a) sliding a window of size L3 along the first reference sequence to obtain multiple third windows;

(b) determining, based on the alignment result, the sequencing depth of each third window, the sequencing depth of a third window being the ratio of the number of reads aligned to that third window to the third window size L3;

(c) determining the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome;

optionally,

(b) further includes standardizing the sequencing depth of the third window and taking the standardized sequencing depth of the third window as the sequencing depth of the third window;

optionally,

(b) further includes correcting the sequencing depth of the third window based on the GC content of the third window and taking the corrected sequencing depth of the third window as the sequencing depth of the third window;

optionally,

the correction is performed using the relationship between the GC content of the third windows and the sequencing depth of the third windows; optionally, the relationship between the GC content of the third windows and the sequencing depth of the third windows is established using locally weighted regression;

optionally,

(c) includes determining, based on the sequencing depth of the third window, a weight coefficient for the reads aligned to that third window,

and determining the amount of reads mapped to the first chromosome based on the weight coefficient.

4. The method according to any one of claims 1-3, characterized in that the sample to be tested is a maternal blood sample;

optionally, the first chromosome is at least one of fetal chromosomes 13, 18 and 21.

5. A device for detecting chromosomal aneuploidy, characterized by comprising:

a sequencing module for sequencing at least part of the nucleic acid in a sample to be tested to obtain a sequencing result including reads;

an alignment module for aligning the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result including information on the chromosome to which each read maps, the first reference sequence being the set of regions of a reference genome whose alignment ability (mappability) is 1, a region with mappability 1 being a region that maps to a unique position in the reference genome;

a quantification module for determining, for a first chromosome and based on the alignment result from the alignment module, the amount of reads mapped to the first chromosome;

a judgment module for comparing the amount of reads mapped to the first chromosome, from the quantification module, with the amount of reads mapped to the corresponding first chromosome in negative samples, so as to judge the number of copies of the first chromosome.

6. The device according to claim 5, characterized in that determining the alignment ability (mappability) of the region includes:

optionally,

a negative sample is substituted for the sample to be tested and passed through the sequencing module, the alignment module and the quantification module to determine the amount of reads mapped to the first chromosome of the negative sample;

optionally,

the first reference sequence is at least part of the reference genome in which the regions corresponding to the second windows have been processed as follows: the sequencing depth of any second window above the 98th percentile is set to the sequencing depth of the second window at the 98th percentile, or, optionally, the sequencing depth of any second window above the 99th percentile is set to the sequencing depth of the second window at the 99th percentile;

optionally,

the device further includes a filtering module for performing at least one of the following (i)-(iii):

7. The device according to claim 5 or 6, characterized in that the quantification module is used to perform the following,

optionally,

8. The device according to any one of claims 5-7, characterized in that the sample to be tested is a maternal blood sample;

9. A computer-readable storage medium, characterized in that it stores a program for execution by a computer, execution of the program including carrying out the method according to any one of claims 1-4.

10. A computer program product, characterized by comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to any one of claims 1-4.