CN108629152A - Method, apparatus and system for detecting chromosomal aneuploidy - Google Patents
- Publication number
- CN108629152A; CN201810425695.6A
- Authority
- CN
- China
- Prior art keywords
- window
- read
- chromosome
- sequencing depth
- optional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Abstract
The invention discloses a method, apparatus and system for detecting chromosomal aneuploidy. The method includes: sequencing at least a portion of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read maps to; for a first chromosome, determining, based on the alignment result, the amount of reads mapped to the first chromosome; and comparing the amount of reads mapped to the first chromosome with the amount of reads mapped to the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome. Chromosomal aneuploidy detection carried out with this method yields results with higher sensitivity and accuracy.
Description
Technical field
The present invention relates to the field of bioinformatics, and in particular to a method, apparatus and system for detecting chromosomal aneuploidy.
Background art
Down syndrome (trisomy 21), Edwards syndrome (trisomy 18) and Patau syndrome (trisomy 13) are the most common chromosomal aneuploidies in newborns, with incidences of about 1/700 [Papageorgiou, E.A. et al. Fetal-specific DNA methylation ratio permits noninvasive prenatal diagnosis of trisomy 21. Nat. Med. 17, 510-513 (2011)], 1/6,000 and 1/10,000 [Driscoll, D.A. & Gross, S. Prenatal Screening for Aneuploidy. N. Engl. J. Med. 360, 2556–2562 (2009)], respectively. These aneuploidies lead to very high morbidity and mortality. Amniocentesis and chorionic villus sampling are the standard methods for diagnosing fetal chromosomal abnormalities, but these procedures themselves carry a miscarriage rate of 0.6% to 1.9%. To avoid these risks, a safer, non-invasive method that can detect fetal aneuploidy earlier in gestation (non-invasive prenatal testing, NIPT) is needed.
In 1997, Dennis Lo [Lo, Y.M.D. et al. Presence of fetal DNA in the maternal plasma and serum. Lancet 350, 485-487 (1997)] first reported the detection of cell-free fetal DNA (cffDNA) in pregnant women, which made it possible to examine the genetic status of the fetus through maternal blood. cffDNA reportedly accounts for about 4-10% of maternal cell-free DNA in the first and second trimesters and reaches 10-20% in the third trimester. In 2008, Dennis Lo [Chiu, R.W.K. et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc. Natl. Acad. Sci. 105, 20458-20463 (2008)] and Stephen Quake [Chitkara, U. et al. Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc. Natl. Acad. Sci. U.S.A. 105, 16266-71 (2008)] separately reported the use of next-generation sequencing (NGS) to detect fetal chromosomal aneuploidy. More and more sequencing platforms are now available for genetic testing.
When chromosomal aneuploidy is detected from the raw output data of these platforms, the sensitivity and/or accuracy of detection still needs to be improved. Many factors affect sensitivity and/or accuracy. For example, the length of the raw data differs greatly between sequencing platforms. The raw data are also called reads, and the length of a read, also called the read length, ranges from tens of bp to thousands of bp; the read length at least affects the confidence with which the raw data can be matched (mapped). As another example, the sequencing error rate also affects the confidence of read mapping: in general, the higher the error rate, the lower the confidence.
Summary of the invention
Embodiments of the present invention aim to solve at least one of the technical problems in the related art, or at least to provide a useful alternative.
According to one embodiment of the present invention, a method for detecting chromosomal aneuploidy is provided, including: (1) sequencing at least a portion of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; (2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read maps to, where the first reference sequence is the set of regions in a reference genome whose mappability is 1, a region with a mappability of 1 being a region that maps to a unique position in the reference genome; (3) for a first chromosome, determining, based on the alignment result, the amount of reads mapped to the first chromosome; (4) comparing the amount of reads mapped to the first chromosome with the amount of reads mapped to the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome.
The method uses a specific reference sequence to filter and map the reads, so chromosomal aneuploidy can be detected quickly and simply and an accurate result obtained. It is suitable for analyzing raw data from various sequencing platforms, and is particularly suitable for reads containing unidentified bases, that is, reads containing gaps.
According to another embodiment of the present invention, an apparatus for detecting chromosomal aneuploidy is provided. The apparatus implements the method for detecting chromosomal aneuploidy of the above embodiment and includes: a sequencing module, for sequencing at least a portion of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; an alignment module, for aligning the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read maps to, where the first reference sequence is the set of regions in a reference genome whose mappability is 1, a region with a mappability of 1 being a region that maps to a unique position in the reference genome; a quantification module, for determining, for a first chromosome and based on the alignment result from the alignment module, the amount of reads mapped to the first chromosome; and a determination module, for comparing the amount of reads mapped to the first chromosome, as supplied by the quantification module, with the amount of reads mapped to the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome.
According to another embodiment of the present invention, a computer-readable medium is also provided for storing/carrying a computer-executable program. When the program is executed, all or part of the steps of the method for detecting chromosomal aneuploidy of the above embodiment can be completed by instructing the relevant hardware. The medium includes, but is not limited to, read-only memory, random-access memory, magnetic disk, optical disc, and the like.
According to another embodiment of the present invention, a terminal is provided, namely a system for detecting chromosomal aneuploidy. The system includes a computer-executable program and a processor that can execute that program; executing the program includes carrying out the method for detecting chromosomal aneuploidy of the above embodiment.
The method, apparatus and/or system for detecting chromosomal aneuploidy provided by any of the above embodiments can be used to detect chromosomal aneuploidy, and the results obtained have higher sensitivity and accuracy. They are suitable for analyzing raw data from various sequencing platforms, and are particularly suitable for reads containing unidentified bases, that is, reads containing gaps.
Additional aspects and advantages of embodiments of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the embodiments.
Description of the drawings
Fig. 1 is a schematic diagram of the distance between two adjacent entries of the reference library in the alignment method used in a specific embodiment of the present invention.
Fig. 2 is a schematic diagram of the connection length in the alignment method used in a specific embodiment of the present invention.
Fig. 3 is a schematic diagram of the relationship between the coefficient of variation and the window size in a specific embodiment of the present invention.
Fig. 4 is a schematic diagram of the relationship between the normalized sequencing depth of a chromosome and the GC content of the chromosome in a specific embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance, or as implicitly indicating the number or order of the technical features referred to. In the description of the present invention, "multiple" means two or more, unless specifically defined otherwise.
Sequencing in embodiments of the present invention, also called sequence determination, refers to determining a nucleic acid sequence. It includes DNA sequencing and/or RNA sequencing, and long-fragment sequencing and/or short-fragment sequencing.
Sequencing can be performed on a sequencing platform, which may be, but is not limited to, the HiSeq/MiSeq/NextSeq platforms of Illumina, the Ion Torrent platform of Thermo Fisher/Life Technologies, the BGISEQ platform of BGI, or a single-molecule sequencing platform. Either single-end or paired-end sequencing may be used. The sequencing result/data obtained are the fragments read out by sequencing, called reads; the length of a read is called the read length.
An embodiment of the present invention provides a method for detecting chromosomal aneuploidy, where chromosomal aneuploidy includes an abnormal amount of a chromosome or of a partial region of a chromosome. The method includes: (1) sequencing at least a portion of the nucleic acid in a sample to be tested to obtain a sequencing result comprising reads; (2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on which chromosome each read maps to, where the first reference sequence is the set of regions in a reference genome whose mappability is 1, a region with a mappability of 1 being a region that maps to a unique position in the reference genome; (3) for a first chromosome, determining, based on the alignment result, the amount of reads mapped to the first chromosome; (4) comparing the amount of reads mapped to the first chromosome with the amount of reads mapped to the corresponding first chromosome in a negative sample, to determine the number of copies of the first chromosome.
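Steps (3) and (4) can be illustrated with a minimal Python sketch. The text here does not state which statistic is used to compare the test sample with negative samples, so the z-score, the threshold of 3 and the function names below are assumptions made purely for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def chrom_fraction(read_chroms, chrom):
    """Step (3): fraction of mapped reads that map to `chrom`.

    `read_chroms` holds one chromosome name per uniquely mapped read.
    """
    counts = Counter(read_chroms)
    total = sum(counts.values())
    return counts[chrom] / total if total else 0.0


def z_score(test_fraction, negative_fractions):
    """Step (4), assumed form: standard score against negative samples."""
    mu = mean(negative_fractions)
    sigma = stdev(negative_fractions)
    return (test_fraction - mu) / sigma


def call_chromosome(z, threshold=3.0):
    """Assumed decision rule: |z| above the threshold flags a gain or loss."""
    if z > threshold:
        return "gain"
    if z < -threshold:
        return "loss"
    return "normal"
```

For example, `call_chromosome(z_score(chrom_fraction(reads, "chr21"), negatives))` would label chromosome 21 in a test sample, where `negatives` holds the same fraction computed for each negative sample.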
The method uses a specific reference sequence to filter and map the reads, so chromosomal aneuploidy can be detected quickly and simply and an accurate result obtained. It is suitable for analyzing raw data from various sequencing platforms, and is particularly suitable for reads containing unidentified bases, that is, reads containing gaps.
Sequencing may target the whole genome, or several chromosomes or partial regions of chromosomes. In general, what is sequenced mainly includes the target chromosome or region, together with other chromosomes or regions that are related to the target chromosome or region.
"Alignment" refers to sequence alignment, and covers both the process of mapping one or more sequences onto another sequence or sequences and the mapping result obtained. For example, it includes the process of mapping reads onto a reference sequence, as well as the process of obtaining the read mapping/matching result.
The reference sequence (reference, ref), like the reference chromosome sequence, is a predetermined sequence. It may be a DNA and/or RNA sequence that one has determined and assembled in advance, or one that others have determined and published. It may be any reference template, obtained in advance, of the category to which the source individual of the sample/the target individual belongs, for example all or at least part of a published genome assembly of the same species. If the source individual or target individual is human, the genome reference sequence (also called the reference genome) may be a human reference genome provided by the UCSC, NCBI or ENSEMBL databases, such as HG19, HG38, GRCh36, GRCh37 or GRCh38; a person skilled in the art can learn the correspondence between these reference genome versions from the database documentation and select a version to use. Further, a resource library containing more reference sequences may be configured in advance. For example, before alignment, a sequence that is closer to the target individual or has certain characteristics is selected, or determined and assembled, as the reference sequence according to factors such as the sex, ethnicity and region of the target individual, which helps to obtain more accurate sequencing analysis results later. The reference sequence includes chromosome numbers and the position of each site on the chromosome.
The first reference sequence is at least part of the reference genome. It is a version that the inventors constructed, through trial, for the purpose of detecting chromosomal aneuploidy, based on characteristics found by mining published raw sequencing datasets, combined with the features of the sequencing platform used and the characteristics of the raw data, including factors such as read length, error rate and data quality. Mapping reads with the first reference sequence helps to obtain the mapping result quickly and reduces the amount of data that subsequent steps need to process.
In some embodiments, the mappability of a region is determined as follows: sliding a first window of size L1 across the reference genome to obtain multiple regions; aligning each region to the reference genome; and calculating the mappability of the region from the number of positions in the reference genome to which the region aligns.
A region or window corresponds to a stretch of sequence in the reference genome. The size of the first window and/or the sliding step can be set in light of the detection purpose, the variant detection principle used, the read length and the sequence characteristics of the reference genome. Preferably, the sliding step is no larger than the size of the first window, so that as many regions with a mappability of 1 as possible are retained in the reference genome, which improves the utilization of the raw data.
In general, L1 can be set according to the read length, for example to any integer between 0.5 and 2 times the read length or average read length, and the sliding step can be set to any integer less than 0.5, less than 0.2 or less than 0.1 times the read length. In one example, the reference genome used is HG19 from the UCSC database and the read length is 25 bp; L1 is set to 25 bp and the sliding step is set to less than 10 bp, less than 5 bp or less than 2 bp. For example, with a sliding step of 1 bp, two adjacent first windows overlap by (L1-1) bp, which helps to obtain the set of all regions in the reference genome that meet the requirement, makes full use of the sequencing result to obtain a more comprehensive alignment result, and improves data utilization.
Specifically, in one example, the mappability of a region is calculated as the reciprocal of the number of positions in the reference genome to which the region aligns. For example, if a region aligns to a unique position in the reference genome, its mappability is 1; if a region aligns to 5 positions in the reference genome, its mappability is 1/5.
The first reference sequence can be built when the target sample is tested, or built and saved in advance and called when a sample is tested.
In some embodiments, the first reference sequence is at least part of the reference genome from which the regions shown in Table 1 have been removed. Because displaying the sequences of all removed regions would take a great deal of space, Table 1 identifies these removed/masked regions by their positions in the reference genome HG19. Understandably, these regions may correspond to different chromosomal start and end positions in different versions of the human reference genome, but this does not prevent a person skilled in the art from determining and masking the regions listed in the table to obtain the first reference sequence. A reference sequence with these regions masked/removed helps the subsequent steps run quickly and yield accurate results.
Table 1
In other embodiments, the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is at least 4 times (greater than or equal to), preferably at least 6 times, the mean sequencing depth of all second windows. That is, second windows whose sequencing depth is far above the average are removed from or masked in the reference genome.
Sequencing depth, also called depth, is the number of times a region is covered, and can be expressed as the ratio of the number of reads aligned to the region to the size of the region. The sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the size of the window.
Second windows can be obtained by sliding a window of size L2 across the reference genome, giving a series of second windows of size L2. Adjacent second windows may or may not overlap. In one example, the sliding step for obtaining the second windows is set to L2, so that two adjacent second windows neither overlap nor leave a gap (zero bases of overlap and zero bases of spacing). The reference genome is thereby converted into a series of second windows that cover it exactly once, and this series of second windows can represent the genome.
Removing/masking specific regions of the reference genome means that, after the alignment of step (2) is carried out with the processed reference sequence (the first reference sequence), the influence of some abnormal data on the subsequent statistical analysis is eliminated.
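The window depth and the masking rule described above can be sketched as follows. The sketch assumes non-overlapping second windows (step equal to L2), assigns each read to a window by its 0-based start position, and divides by the nominal window size even for a shorter final window; these choices and the function names are illustrative assumptions, not details given in the text.

```python
def window_depths(read_starts, genome_length, window_size):
    """Depth per non-overlapping window: read count / window size."""
    n_windows = (genome_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        # Assumption: a read belongs to the window containing its start.
        counts[pos // window_size] += 1
    return [count / window_size for count in counts]


def high_depth_windows(depths, fold=4.0):
    """Indices of windows whose depth is >= `fold` times the mean depth."""
    cutoff = fold * (sum(depths) / len(depths))
    return [i for i, depth in enumerate(depths) if depth >= cutoff]
```

The indices returned by `high_depth_windows` are the windows that would be removed or masked when building the first reference sequence; passing `fold=6.0` gives the stricter preferred cutoff.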
According to a specific embodiment of the present invention, for second windows whose sequencing depth deviates markedly from the mean or median sequencing depth, the depth of those windows can instead be reassigned, giving a series of second windows with relatively balanced sequencing depth. After the alignment of step (2) is carried out using a first reference sequence comprising such a series of second windows, the influence of some abnormal data on the subsequent statistical analysis is likewise eliminated. In one example, the sequencing depth of second windows above the 98th percentile is set to the sequencing depth of the second window at the 98th percentile, or the sequencing depth of second windows above the 99th percentile is set to that of the second window at the 99th percentile. The first reference sequence obtained in this way helps to eliminate the influence of abnormal data/regions on the result and to obtain an accurate result. For example, all second windows can be sorted from low to high by sequencing depth, and the sequencing depth of all second windows ranked between the 99th and 100th percentiles is reassigned, for example to the sequencing depth of the second window at the 99th percentile, to eliminate the influence of windows with abnormally high sequencing depth on subsequent detection.
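The percentile reassignment can be sketched as a simple capping step. The text does not say how the percentile is computed, so the nearest-rank definition used below is an assumption, as is the function name.

```python
def cap_depths(depths, percentile=99):
    """Cap every depth above the given percentile at the percentile value.

    Uses the nearest-rank percentile (an assumed definition): the value at
    1-based rank ceil(percentile / 100 * n) in the sorted depths.
    """
    if not depths:
        return []
    ranked = sorted(depths)
    # Integer ceiling division avoids floating-point rounding issues.
    rank = max(1, (percentile * len(ranked) + 99) // 100)
    cap = ranked[rank - 1]
    return [min(depth, cap) for depth in depths]
```

The windows keep their original order; only values above the cap change, so windows with abnormally high depth no longer dominate later statistics.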
The size L2 of the second window can be adjusted as needed and according to the sequencing result. Preferably, the second window is roughly the same size as most regions/windows with abnormally high and/or low sequencing depth. In some embodiments, the sample is a human sample and the reference genome is a human reference genome; based on preliminary statistics of the sequencing result and/or alignment result, L2 can be set to 10-20 Kbp, preferably 12-17 Kbp. In one example, the inventors found that abnormal regions/second windows are identified more completely when L2 is set to 15 Kbp.
From rough statistics of sequencing results, the inventors found that repetitive sequence regions generally fall within regions of abnormal sequencing depth. Removing these regions/windows of abnormal sequencing depth, or reassigning their depth, markedly improves the accuracy and/or sensitivity of the result compared with leaving them unprocessed.
Alignment can be performed with known alignment software, such as SOAP, BWA, BLAST, MAPQ and TeraMap; this embodiment places no restriction on the choice. During alignment, parameters can be set, for example allowing at most n mismatched bases in a read pair or a single read, with n set to, say, 1 or 2. If more than n bases in a read are mismatched, the read pair or read is considered unable to align to the first reference sequence; alternatively, if all n mismatched bases fall in one read of a read pair, that read of the pair is considered unable to align to the reference sequence.
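The mismatch threshold n can be illustrated with a small Python check. It is a sketch only: it compares a read against the reference at one candidate position using Hamming distance, ignores insertions and deletions, and does not reproduce the behaviour of any of the named alignment tools; the function names are hypothetical.

```python
def mismatches(read, ref_segment):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(read, ref_segment) if a != b)


def is_aligned(read, reference, position, max_mismatches=1):
    """True if `read` matches `reference` at `position` with <= n mismatches."""
    segment = reference[position:position + len(read)]
    if len(segment) < len(read):
        return False  # the read would run past the end of the reference
    return mismatches(read, segment) <= max_mismatches
```

Setting `max_mismatches` to 1 or 2 corresponds to the values of n mentioned in the text.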
According to a specific embodiment of the present invention, the alignment in step (2) includes: (a) converting each read into a set of short fragments corresponding to that read, obtaining multiple sets of short fragments; (b) determining the positions of the short fragments in a reference library to obtain a first mapping result, where the reference library is a hash table built from the first reference sequence, the reference library contains multiple entries, each entry corresponds to one seed sequence, a seed sequence can match at least one stretch of sequence on the first reference sequence, and the distance on the first reference sequence between the two seed sequences corresponding to two adjacent entries of the reference library is less than the length of a short fragment; (c) removing from the first mapping result the short fragments that map to either of two adjacent entries of the reference library, obtaining a second mapping result; (d) extending based on the short fragments in the second mapping result that come from the same read, to obtain the alignment result for the read. With this alignment method, reads are converted into short fragments and read sequence information is converted into position information, that is, from sequence form into numeric form, which helps to align and map raw data from various sequencing platforms quickly and accurately. It is especially suitable for quickly and accurately aligning reads that contain unidentified bases, that is, reads containing gaps or N, such as reads obtained with poor sequencing quality or poor base calling.
The reference library is essentially a hash table. It can be built directly with the seed sequence as the key and the position of the seed sequence on the reference sequence as the value. Alternatively, the seed sequence can first be converted into a number or integer string, which is then used as the key, with the position of the seed sequence on the reference sequence as the value. The value may be one or more positions on the reference sequence/chromosome to which the seed sequence corresponds; a position can be expressed directly as an actual value or numeric range, or re-encoded and expressed with custom characters and/or numbers.
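A minimal Python sketch of such a reference library follows, using a dictionary as the hash table. It indexes the seeds that actually occur in the reference, which yields the same seed-to-positions mapping for matching seeds as enumerating all candidate seeds first. The 2-bit integer encoding is one possible way to turn a seed into a numeric key and is an assumption, as are the function names.

```python
from collections import defaultdict


def build_reference_library(reference, seed_length):
    """Hash table: seed sequence (key) -> list of 0-based positions (value)."""
    library = defaultdict(list)
    for pos in range(len(reference) - seed_length + 1):
        library[reference[pos:pos + seed_length]].append(pos)
    return dict(library)


def encode_seed(seed):
    """Assumed numeric key: read the seed as a base-4 number (A, C, G, T)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    key = 0
    for base in seed:
        key = key * 4 + code[base]
    return key
```

A library keyed by integers can be obtained with `{encode_seed(s): p for s, p in library.items()}`, provided the seeds contain only A, C, G and T.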
In one example, the hash table is implemented with the C++ vector, which can be expressed as Hash(seed) = Vector(position). A vector is an object that can hold many elements of the same type, and is therefore also called a container. The reference library built in this way can be saved in binary form. The hash table can also be stored in blocks, with a block-head key and a block-tail key set for each block. For example, for the sorted block {5, 6, 7, 8, ..., 19, 20}, the block head and block tail (in other words, header and footer) are set to 5 and 20. Given the number 3, since 3 < 5, 3 does not belong to this block; given the number 10, since 5 < 10 < 20, 10 belongs to this block. A query can then use the global index, or quickly locate the relevant block by comparing against the block-head and block-tail keys, without necessarily needing the global index.
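The block head/tail lookup can be sketched in Python as follows. The text describes a C++ implementation; this sketch uses Python for brevity, and the class name, the fixed block size and the use of binary search over block heads are assumptions for illustration.

```python
import bisect


class BlockedIndex:
    """Sorted integer keys stored in blocks with head and tail keys."""

    def __init__(self, sorted_keys, block_size):
        self.blocks = [sorted_keys[i:i + block_size]
                       for i in range(0, len(sorted_keys), block_size)]
        self.heads = [block[0] for block in self.blocks]
        self.tails = [block[-1] for block in self.blocks]

    def find_block(self, key):
        """Index of the block whose head <= key <= tail, or None."""
        i = bisect.bisect_right(self.heads, key) - 1
        if i >= 0 and key <= self.tails[i]:
            return i
        return None

    def contains(self, key):
        """Search only inside the block located by its head and tail keys."""
        i = self.find_block(key)
        if i is None:
            return False
        block = self.blocks[i]
        j = bisect.bisect_left(block, key)
        return j < len(block) and block[j] == key
```

With the block {5, ..., 20} from the text, `find_block(3)` returns `None` because 3 is below the head key, while `find_block(10)` returns the block index because 10 lies between the head and tail keys.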
The reference library can be built at the time of sequence alignment, or built and saved in advance. According to a specific embodiment of the present invention, the reference library is built in advance and saved for later use. Building the reference library includes: determining the seed sequence length L from the total number of bases totalBase of the reference sequence, L = μ * log(totalBase), with L less than the length of the reads to be aligned (the read length); generating all possible seed sequences of that length to obtain a seed sequence set; and determining which seed sequences in the set can be matched to the reference sequence, and their matching positions, to obtain the reference library. This method rests on a relationship between seed sequence length and reference sequence that the inventors established and verified through repeated hypothesis testing. It allows the reference library to contain a comprehensive set of seed sequences together with the positions of each seed sequence on the reference sequence; the library is compact, occupies little memory, and supports fast lookup during sequence mapping analysis. An entry of the reference library obtained according to this embodiment contains only one key, and one key corresponds to at least one value.
The specific embodiments of the invention place no restriction on how all possible seed sequences are generated to obtain the seed sequence set. For a given input set, the elements of the set can be traversed to obtain all possible element combinations of a specified length, for example with a recursive and/or iterative algorithm.
In one example, the first reference sequence is at least part of the human genome, which contains about 3 billion bases; the reads to be processed are not shorter than 25 bp, and L is an integer in [11, 15], which favors efficient alignment.
In one example, the first reference sequence is at least part of the human cDNA reference genome. The total number of bases totalBase of the reference sequence is counted, and the length L of the seed sequence (seed) is set based on the total number of bases. Based on L and on the four base types A, T, C and G that make up DNA sequences, a recursive algorithm is used to generate the set of all possible seed sequences, giving the seed sequence set, which can be expressed as seed = B1B2...BL, B ∈ {A, T, C, G}. The seed sequences in the set that can be matched to the reference sequence, and their matching positions, are then determined; the reference library is built with each seed sequence that can be matched to the reference sequence as a key and the position of that seed sequence on the reference sequence as the value.
In one example, the first reference sequence is at least part of the DNA genome and transcriptome of a species. The total number of bases totalBase of the reference sequence is counted, and the length L of the seed sequence (seed) is set based on the total number of bases. Based on L, on the four base types A, T, C and G that make up DNA sequences, and on the four base types A, U, C and G that make up RNA sequences, a recursive algorithm is used to generate the set of all possible seed sequences, giving the seed sequence set, which can be expressed as seed = B1B2...BL, B ∈ {A, T, C, G} ∪ {A, U, C, G}. The seed sequences in the set that can be matched to the reference sequence, and their matching positions, are then determined; the reference library is built with each seed sequence that can be matched to the reference sequence as a key and the position of that seed sequence on the reference sequence as the value.
In one example, a seed sequence can be converted into a character string made up of numeric characters, and the library is built with that string as the key, which can speed up access queries on the resulting reference library. For example, after the seed sequences that can be matched to the first reference sequence are obtained, the seed sequences are encoded as follows:
In another example, after the seed sequence set is obtained, the seed sequences in the set are encoded; the base encoding rule can be the same as above, and the first reference sequence can also be converted with the same rule. This helps to obtain quickly the position information of a seed sequence on the reference sequence, and also helps to speed up access queries on the resulting reference library.
According to a specific embodiment of the invention, determining the seed sequences in the seed sequence set that can be matched to the first reference sequence, and their matching positions, includes: sliding a window of size L over the first reference sequence, and matching the seed sequences in the set against the window sequences obtained by the sliding, to determine the seed sequences in the set that can be matched to the first reference sequence and the matching positions of those seeds; the fault tolerance rate of the matching is ε1. In this way the position information of a seed sequence on the first reference sequence can be obtained quickly, which helps to build the reference library rapidly. The fault tolerance rate is the proportion of mismatched bases that is allowed, where a mismatch is at least one of substitution, insertion and deletion.
In one example, the matching is strict matching, i.e. the fault tolerance rate ε1 is zero: when a seed sequence is identical to one or more sliding-window sequences, the positions of those window sequences are the positions of the seed sequence on the first reference sequence. In another example, the matching is fault-tolerant matching, with ε1 greater than zero: when the proportion of bases that differ at the same positions between a seed sequence and one or more sliding-window sequences is less than the fault tolerance rate, the positions of those window sequences are the positions of the seed sequence on the first reference sequence. In one example, the positions of a seed sequence on the first reference sequence are encoded, and the reference library is built with the encoded characters, for example numeric characters, as the value.
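The strict-matching case (ε1 = 0) can be sketched as below. This is an illustrative shortcut, not the patent's procedure: instead of enumerating the full seed set and matching each seed against the windows, it indexes the window sequences directly, which yields the same key/value pairs when matching is exact. The name `build_library` and the 0-based positions are assumptions.

```python
from collections import defaultdict


def build_library(ref, L, step=1):
    """Map each length-L window of ref (the key) to its start positions (the value)."""
    library = defaultdict(list)
    for pos in range(0, len(ref) - L + 1, step):
        library[ref[pos:pos + L]].append(pos)  # one key, one or more positions
    return dict(library)
```

For instance, `build_library("ATCGATCG", 4)` gives one entry whose key maps to two positions and three entries whose keys each map to a single position.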
Viewed from another angle, a non-zero fault tolerance rate ε1 is equivalent to transforming one seed sequence into a group of seed template sequences (seed templates) allowed under ε1. For example, with seed = ATCG and ε1 = 0.25, i.e. one error allowed in four bases, the seed templates can be ATCG, TTCG, CTCG, GTCG, AACG, ACCG, AGCG and so on. With ε1 = 0.25, determining the positions of seed = ATCG on the reference sequence is equivalent to determining the positions on the first reference sequence of all seed templates corresponding to that seed. For example, if ref = ATCG, all of the seed templates listed above can be matched to that position; if ref = TTCG, the seed templates ATCG, TTCG, CTCG and GTCG can be matched to that position. The reference library can therefore be built with a seed as the key, or with each of the seed templates corresponding to that seed as a key; keys differ from one another, and each key corresponds to at least one value.
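The expansion of a seed into its templates can be sketched as follows, assuming for simplicity that only substitutions are allowed (the text also permits insertions and deletions, which this sketch omits). The name `seed_templates` and the `max_mismatches` parameter, taken here as the integer part of L * ε1, are assumptions.

```python
def seed_templates(seed, max_mismatches, alphabet="ATCG"):
    """Return every sequence within max_mismatches substitutions of seed."""
    templates = {seed}
    for _ in range(max_mismatches):
        expanded = set()
        for template in templates:
            for i in range(len(template)):
                for base in alphabet:
                    # Substitute one position; may reproduce an existing template.
                    expanded.add(template[:i] + base + template[i + 1:])
        templates |= expanded
    return templates
```

With seed = ATCG and one allowed substitution this yields the seed itself plus 4 positions × 3 alternative bases, 13 templates in total, which includes the ones listed in the text.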
According to a specific embodiment of the invention, when determining the positions of seed sequences on the reference sequence, the step size of the sliding window over the first reference sequence is determined from L and ε1. In one example, the step size of the sliding window is not less than L * ε1. In a specific example, the first reference sequence is at least part of the human genome, which contains about 3 billion bases; the reads to be processed are not shorter than 25 bp, L is 14 bp, ε1 is 0.2-0.3, and the step size of the sliding window is 3 bp-5 bp. This lets two adjacent windows in the sliding-window positioning process span a run of consecutive errors permitted under ε1, which favors fast positioning. In one example, the distance between two adjacent entries of the resulting reference library is the step size of the sliding window.
According to a specific embodiment of the invention, (a) includes: sliding a window of size L over the read to obtain a group of short fragments corresponding to that read, with a sliding step size of 1 bp. A read of length K thus yields (K-L+1) short fragments of length L. The reads are turned into short fragments, the reference library is queried with high-speed access to determine the position of each short fragment in the reference library, and the reference library information of the read to which the short fragments belong is thereby obtained.
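Step (a) can be sketched as below; the function name `read_to_fragments` and the choice to return each fragment with its offset in the read are assumptions for illustration.

```python
def read_to_fragments(read, L):
    """Slide a window of size L over the read with step 1 bp.

    Returns (offset, fragment) pairs; a read of length K gives K - L + 1 of them.
    """
    return [(i, read[i:i + L]) for i in range(len(read) - L + 1)]
```

Keeping the offset alongside each fragment makes it straightforward to map a fragment's library position back to the read it came from.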
According to a specific embodiment of the invention, (b) includes: matching the short fragments against the seed sequences corresponding to the entries of the reference library, to determine the positions of the short fragments in the reference library; the fault tolerance rate of the matching is ε2.
In one example, the matching is strict matching, i.e. the fault tolerance rate ε2 is zero: when a short fragment is identical to the seed/seed template corresponding to an entry of the reference library, the position information of that short fragment in the reference library is obtained.
In another example, the matching is fault-tolerant matching, with ε2 greater than zero: when the proportion of bases of a short fragment that do not match the seed/seed template corresponding to one or more entries of the reference library is less than ε2, the position information of that short fragment in the reference library is obtained.
In one example, ε2 = ε1 and is non-zero, so that as much valid data as possible is obtained.
According to a specific embodiment of the invention, and with reference to Figure 1, the distance in (b) between the two seed sequences X1 and X2, corresponding to two adjacent entries of the reference library, on the reference sequence ref falls into the following two cases. When the keys and values of both entries are unique, i.e. each entry corresponds to a single [key, value] pair (Figure 1a), which is equivalent to X1 and X2 each matching the reference sequence uniquely (each matches only one position of the reference sequence), the distance is the distance between the positions of X1 and X2 on the reference sequence. When the key of at least one of the two entries corresponds to multiple values (Figure 1b), which is equivalent to at least one of X1 and X2 matching the reference sequence non-uniquely (matching multiple positions of the reference sequence), the distance is the distance between the two closest positions of X1 and X2 on the reference sequence. This embodiment does not restrict how the distance between two sequences is expressed; for example, it can be expressed as the distance from either end of one sequence to either end of the other, or as the distance from the center of one sequence to the center of the other.
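Both cases above reduce to taking the smallest distance over all pairs of positions, as in this sketch. The name `entry_distance` is an assumption, and the distance is measured start to start, which is only one of the representations the text allows.

```python
def entry_distance(positions_x1, positions_x2):
    """Smallest start-to-start distance between any position of X1 and any of X2."""
    return min(abs(a - b) for a in positions_x1 for b in positions_x2)
```

With unique matches each list holds a single position and the result is simply the distance between them; with multiple values the closest pair decides.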
According to a specific embodiment of the invention, after the second positioning result is obtained, the method further includes (c): removing the short fragments whose connection length is less than a predetermined threshold, and substituting the result after removal for the second positioning result. The connection length is the total length mapped onto the reference sequence by the short fragments in the second positioning result that come from the same read and are positioned to different entries of the reference library. This processing helps to remove some redundant and/or relatively low-quality data, and helps to increase alignment speed.
The connection length can be expressed as the sum of the lengths of the short fragments that come from the same read and are positioned to different entries of the reference library, minus the length of the overlap between those short fragments as mapped onto the reference sequence. In one example, four short fragments from one read are positioned to different entries of the reference library; they are denoted Y1, Y2, Y3 and Y4 and have lengths S1, S2, S3 and S4 respectively. The positions to which Y1 and Y2 are mapped on the first reference sequence overlap, the length of the overlap is J, and the connection length is (S1+S2+S3+S4-J). In one example, the short fragments all have length L and the predetermined threshold is L; in this way, alignment speed can be increased at the cost of losing some data that is valid but of relatively low quality.
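A minimal sketch of the connection length, assuming each short fragment is given as a half-open interval (start, end) on the reference sequence. It computes the length of the union of the intervals, which equals the sum of the lengths minus the overlaps in cases like the one described above. The name `connection_length` and the interval representation are assumptions.

```python
def connection_length(intervals):
    """Total reference length covered by half-open (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # A new, disjoint stretch begins: bank the previous one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current stretch.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total
```

For four 14 bp fragments of which the first two overlap by 4 bp, the result is 14 * 4 - 4 = 52.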
According to a specific embodiment of the invention, after the second positioning result is obtained, the method further includes (c): evaluating the positioning result of each read according to the positioning results of the short fragments from the same read in the second positioning result, and removing the reads whose evaluation does not meet a predetermined requirement. Removing a read also removes the short fragments corresponding to that read. In this way, provided a certain sensitivity and accuracy are met, exact matching/fast local alignment can be carried out directly on the basis of the second positioning result, which speeds up alignment.
This embodiment does not limit the method of evaluation; for example, a quantitative scoring approach may be used. In one example, the positioning results of the short fragments from the same read are scored under the following rule: a site that matches the first reference sequence deducts points, and a site that does not match the first reference sequence adds points. After the second positioning result is obtained, the positioning result of each read is scored according to the positioning results of the short fragments from the same read in the second positioning result, and reads whose score is less than a first preset value are removed.
In a specific example, the read length is 25 bp. Sequence construction is performed on the short fragments from the same read to obtain a reconstructed sequence; for example, the base type at a site can be determined by which base is supported by more short fragments, and if a site has no supporting short fragment, i.e. no short fragment is aligned to that site, the base type of that site is undetermined and is denoted N. The reconstructed sequence obtained in this way corresponds to the read, and its length is the read length. A site where the reconstructed sequence matches the first reference sequence (ref) deducts one point, and a site that does not match the first reference sequence adds one point. The alignment fault tolerance rate, i.e. the proportion of mismatches allowed in one read/reconstructed sequence, is 0.12, so the alignment allows 3 bp of errors (25*0.12). The initial score Scoreinit is the read length, and the first preset value is 22 (25-3). Reconstructed sequences with a score less than 22, i.e. those whose proportion of sites not matching the first reference sequence exceeds the alignment fault tolerance rate, are thus removed, which speeds up alignment at the cost of losing some data that is valid but of relatively low quality.
According to a specific example, bit operations and a dynamic programming algorithm are used [G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3):395-415, 1999]: for each reconstructed sequence, the position of each site i is read in, and a 64-bit binary mask is used for fast match scoring, one point per site. The initial score Scoreinit is the read length, which can be expressed as Scoreinit = length(read); match scoring gives the score Score, which can be expressed as:
In one example, the positioning results of the short fragments from the same read are scored under the following rule: a site that matches the first reference sequence adds points, and a site that does not match the first reference sequence deducts points. After the second positioning result is obtained, the positioning result of each read is scored according to the positioning results of the short fragments from the same read in the second positioning result, and the short fragments corresponding to reads whose score is not less than a second preset value are removed.
In a specific example, the read length is 25 bp. Sequence construction is performed on the short fragments from the same read to obtain a reconstructed sequence; for example, the base type at a site can be determined by which base is supported by more short fragments, and if a site has no supporting short fragment, i.e. no short fragment is aligned to that site, the base type of that site is undetermined and is denoted N. The reconstructed sequence obtained in this way corresponds to the read, and its length is the read length. A site where the reconstructed sequence matches the first reference sequence (ref) adds one point, and a site that does not match the first reference sequence deducts one point. The alignment fault tolerance rate, i.e. the proportion of mismatches allowed in one read/reconstructed sequence, is 0.12, so the alignment allows 3 bp of errors (25*0.12). The initial score Scoreinit is -25, and the second preset value is -22, i.e. -(25-3). Reconstructed sequences with a score greater than -22 are thus removed, which speeds up alignment at the cost of losing some data that is valid but of relatively low quality.
According to a specific embodiment of the invention, extending in (d) on the basis of the short fragments from the same read in the second positioning result includes: performing sequence construction on the short fragments from the same read to obtain a reconstructed sequence; and extending on the basis of the common part of the reconstructed sequence and the reference sequence corresponding to it, to obtain an extended sequence. In this way the short fragments and their position information are converted into the position information of the read to which the short fragments belong (referred to here as the reconstructed sequence), which helps subsequent alignment processing to proceed quickly and accurately.
The common part is the part shared by multiple sequences. According to a specific embodiment of the invention, the common part is a common substring and/or a common subsequence. A common substring is a contiguous part shared by multiple sequences, whereas a common subsequence need not be contiguous. For example, for ABCBDAB and BDCABA, BCBA is a common subsequence and AB is a common substring.
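The distinction can be made concrete with two standard dynamic-programming routines, offered as a sketch; the function names are assumptions, and where several longest substrings exist the first one found is returned.

```python
def longest_common_substring(x, y):
    """Return one longest contiguous string shared by x and y."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]


def lcs_length(x, y):
    """Length of the longest common subsequence (not necessarily contiguous)."""
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(y)]
```

For ABCBDAB and BDCABA the longest common substring has length 2 (AB) and the longest common subsequence has length 4 (for example BCBA), in line with the example in the text.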
As for performing sequence construction on the short fragments from the same read to obtain a reconstructed sequence: in one example, the base type at a site of the reconstructed sequence can be determined by which base is supported by more short fragments, and if a site has no supporting short fragment, i.e. no short fragment is aligned to that site of the reference sequence, the base type of that site is undetermined and is denoted N. The reconstructed sequence is obtained in this way. The reconstructed sequence thus corresponds to the read, and its length is the read length.
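The majority-support rule can be sketched as follows, assuming each short fragment is given together with its offset inside the read-length coordinate frame. The name `reconstruct` and this input format are assumptions; ties between bases are resolved arbitrarily here because the text does not say how to break them.

```python
from collections import Counter


def reconstruct(read_length, fragments):
    """Build a reconstructed sequence from (offset, sequence) short fragments.

    Each site takes the base supported by the most fragments; sites with no
    supporting fragment are written as N.
    """
    votes = [Counter() for _ in range(read_length)]
    for offset, seq in fragments:
        for k, base in enumerate(seq):
            if 0 <= offset + k < read_length:
                votes[offset + k][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)
```

The output always has exactly `read_length` characters, matching the statement that the reconstructed sequence has the read length.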
The reference sequence corresponding to a reconstructed sequence is a segment of the reference sequence that matches the reconstructed sequence; the length of this segment is not less than the read length. In one example, the reference sequence corresponding to the reconstructed sequence has the same length as the reconstructed sequence, namely the read length. In another example, fault-tolerant matching between the reconstructed sequence and its corresponding reference sequence is allowed, and the length of the corresponding reference sequence is the length of the reconstructed sequence plus twice the fault-tolerant matching length. For example, if the length of the reconstructed sequence, i.e. the read length, is 25 bp and matching between the reconstructed sequence and the reference sequence allows 12% mismatches, then the segment of the reference sequence that aligns with the reconstructed sequence, together with 3 bp (25*12%) of sequence at each end of that segment, serves as the reference sequence corresponding to the reconstructed sequence.
According to a specific example of the invention, the common part is a common substring. Extending in (d) on the basis of the short fragments from the same read in the second positioning result includes: searching for the common substrings of the reconstructed sequence and the reference sequence corresponding to it, and determining their longest common substring; and, based on the edit distance, extending the longest common substring to obtain the extended sequence. In this way an alignment result containing a longer matched sequence can be obtained more accurately.
According to a specific example of the invention, the common part is a common subsequence. Extending in (d) on the basis of the short fragments from the same read in the second positioning result includes: searching for the common subsequences of the reconstructed sequence and the reference sequence corresponding to it, and determining their longest common subsequence; and, based on the edit distance, extending the longest common subsequence to obtain the extended sequence.
The edit distance, also called the Levenshtein distance, is the minimum number of edit operations needed to change one of two character strings into the other. The edit operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings.
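The definition above corresponds to the classic unit-cost dynamic program, sketched here with an assumed function name:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of the second string.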
In one example, for a reconstructed sequence/read, searching for the longest common substring of the reconstructed sequence and the reference sequence corresponding to it can be expressed as finding the common substrings of two character strings x1x2...xi and y1y2...yj, whose lengths are m and n respectively. Computing the length c[i, j] of the common substring of the two strings gives the transition equation:
Solving the equation gives the length of the longest common substring of the two sequences as max(c[i, j]), i ∈ {1, ..., m}, j ∈ {1, ..., n}. Then, using the edit distance, the longest common substring is converted to the corresponding reference sequence: the two ends of the longest common substring are grown continually, and the minimum character operations (substitution, deletion, insertion) needed between the two strings are found. The edit distance can be determined with a dynamic programming algorithm, since the problem has optimal substructure; the calculation of the edit distance d[i, j] can be expressed by the following equation:
where a gap denotes the insertion or deletion of a character, and gap in the formula is the penalty for inserting or deleting one character (corresponding to one site in the sequence); a match means the two characters are the same, and match in the formula is the score when the two characters are the same; a mismatch means the two characters are unequal/different, and mismatch in the formula is the penalty when the two characters are unequal/different. d[i, j] takes the minimum of the three. In a specific example, a gap is penalized 3 points, each additional consecutive gap adds a penalty of 1 point, a mismatched site is penalized 2 points, and a matched site scores 0 points. This favors efficient alignment of sequences that contain gaps.
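The transition equation for c[i, j] and the equation for d[i, j] are not reproduced in this text. The standard recurrences below are consistent with the surrounding description and are offered as an assumption rather than as the original formulas; in particular they use a single linear gap penalty and do not model the extra 1-point penalty for consecutive gaps.

```latex
% Longest common substring: c[i,j] is the length of the common substring
% ending at x_i and y_j (assumed standard form).
c[i,j] =
\begin{cases}
0, & i = 0 \text{ or } j = 0 \\
c[i-1,j-1] + 1, & x_i = y_j \\
0, & x_i \neq y_j
\end{cases}

% Edit distance with penalties gap, match and mismatch (assumed standard form),
% with boundary values d[i,0] = i \cdot \mathrm{gap} and d[0,j] = j \cdot \mathrm{gap}.
d[i,j] = \min
\begin{cases}
d[i-1,j] + \mathrm{gap} \\
d[i,j-1] + \mathrm{gap} \\
d[i-1,j-1] + s(x_i, y_j)
\end{cases}
\qquad
s(x_i, y_j) =
\begin{cases}
\mathrm{match}, & x_i = y_j \\
\mathrm{mismatch}, & x_i \neq y_j
\end{cases}
```

With the values of the specific example, gap = 3, mismatch = 2 and match = 0.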
According to a specific embodiment of the invention, the common part is a common subsequence. According to a specific embodiment of the invention, (d) includes: searching for the common subsequences of the short fragments in the second positioning result that are positioned to the same entry of the reference library, and determining the longest common subsequence corresponding to each read; and, based on the edit distance, extending the longest common subsequence to obtain the extended sequence.
In one example, for a reconstructed sequence/read, the longest common subsequence of the reconstructed sequence and the reference sequence corresponding to it is searched for. Based on the longest common subsequence, the segment of the reconstructed sequence corresponding to the longest common subsequence is converted into the segment of the reference sequence corresponding to the longest common subsequence, and the edit distance between these two segments is found with the Smith-Waterman algorithm. For two character strings x1x2...xi and y1y2...yj, it can be obtained from the following formula:
where σ denotes the scoring function: σ(i, j) is the score for characters (sites) xi and yj mismatching or matching, σ(-, j) is the score for a gap (deletion) at xi or an insertion of yj, and σ(i, -) is the score for a deletion of yj or an insertion of xi. Then, using the method of calculating the edit distance in the preceding example, the segment of the reconstructed sequence corresponding to the longest common subsequence is converted into the reference sequence corresponding to the reconstructed sequence: the two ends of that segment are grown continually, and the minimum character operations (substitution, deletion, insertion) are found.
In a specific example, a gap is penalized 3 points, each additional consecutive gap adds a penalty of 1 point, a mismatched site is penalized 2 points, and a matched site scores 4 points. In this way, sequences that contain gaps can be aligned efficiently, and sequences that contain gaps but are highly accurate at their other sites can be retained.
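A simplified Smith-Waterman scoring sketch with the point values from the specific example is given below. It is an assumption-laden illustration: it uses a linear gap penalty of 3 and leaves out the additional 1-point penalty for consecutive gaps, and it returns only the best local alignment score rather than the alignment itself. The function name and parameter names are hypothetical.

```python
def smith_waterman_score(x, y, match=4, mismatch=-2, gap=-3):
    """Best local alignment score between x and y with a linear gap penalty."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

Supporting the consecutive-gap penalty would need separate tables for gap opening and gap extension, which is omitted here for brevity.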
According to a specific embodiment of the invention, (d) further includes: truncating the extended sequence from at least one of its ends and calculating the proportion of mislocated sites in the truncated extended sequence, and stopping the truncation when the following condition is met: the proportion of mislocated sites in the truncated extended sequence is less than a third preset value. By truncating and discarding in this way, well-matched local sequences can be retained, which helps to increase the proportion of usable data.
Specifically, according to a specific embodiment of the invention, the extended sequence is truncated as follows. i. Calculate a first error rate and a second error rate; if the first error rate is less than the second error rate, truncate the extended sequence from its first end, and if the first error rate is greater than the second error rate, truncate the extended sequence from its second end, to obtain a truncated extended sequence. The first error rate is the proportion of mislocated sites in the truncated extended sequence obtained by truncating the extended sequence from its first end, and the second error rate is the proportion of mislocated sites in the truncated extended sequence obtained by truncating the extended sequence from its second end. ii. Substitute the truncated extended sequence for the extended sequence and repeat i, until the proportion of mislocated sites in the truncated extended sequence is less than a fourth preset value. By truncating from both ends and discarding in this way, well-matched local sequences can be retained, which helps to increase the proportion of usable data. According to a specific example, the length of the extended sequence is 25 bp, and the fourth preset value is the third preset value, set to 0.12.
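Steps i and ii can be sketched as below, assuming the extended sequence is represented as a list of booleans in which True marks a mislocated site, and assuming 1 bp is removed per truncation. The text does not say what to do when the two error rates are equal; this sketch then truncates from the second end. The name `truncate` is hypothetical.

```python
def truncate(errors, threshold=0.12):
    """Trim 1 bp at a time from whichever end leaves the lower error rate.

    errors: list of bools, True = mislocated site. Stops once the proportion
    of mislocated sites is below threshold (or nothing is left).
    """
    def rate(seq):
        return sum(seq) / len(seq) if seq else 0.0

    seq = list(errors)
    while seq and rate(seq) >= threshold:
        from_first = seq[1:]    # result of truncating the first end
        from_second = seq[:-1]  # result of truncating the second end
        seq = from_first if rate(from_first) < rate(from_second) else from_second
    return seq
```

For a 25 bp sequence with four mislocated sites at its first end, two truncations from that end bring the proportion down from 0.16 to 2/23, below 0.12.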
According to a specific embodiment of the invention, (d) further includes: sliding a window over the extended sequence from at least one of its ends, calculating the proportion of mislocated sites in the window sequences obtained by the sliding, and truncating the extended sequence according to that proportion, stopping the truncation when the following condition is met: the proportion of mislocated sites in the window sequence obtained by the sliding is greater than a fifth preset value. By truncating and discarding in this way, well-matched local sequences can be retained, which helps to increase the proportion of usable data.
Specifically, according to a specific embodiment of the invention, the extended sequence is truncated as follows. i. Calculate a third error rate and a fourth error rate; if the third error rate is less than the fourth error rate, truncate the extended sequence from its second end, and if the third error rate is greater than the fourth error rate, truncate the extended sequence from its first end, to obtain a truncated extended sequence. The third error rate is the proportion of mislocated sites in the window sequence obtained by sliding a window over the extended sequence from its first end, and the fourth error rate is the proportion of mislocated sites in the window sequence obtained by sliding a window over the extended sequence from its second end. ii. Substitute the truncated extended sequence for the extended sequence and repeat i, until the proportion of mislocated sites in the window sequence is greater than a sixth preset value. By truncating from both ends and discarding in this way, well-matched local sequences can be retained, which helps to increase the proportion of usable data.
According to a specific embodiment of the invention, the sliding window is not larger than the length of the extended sequence. According to a specific example, the length of the extended sequence is 25 bp, the window size is 10 bp, and the sixth preset value is the fifth preset value, 0.12.
According to a specific embodiment of the invention, the truncation size is 1 bp, i.e. each truncation removes one base. In this way, alignment results containing more long sequences can be obtained efficiently.
To give the result of the difference comparison statistical significance, there are generally multiple negative samples; for example, the number of negative samples is not less than 20, and preferably not less than 30.
A negative sample is a sample without chromosomal aneuploidy abnormality. For example, if the target of variant detection is a human, or the sample to be tested comes from a human body, a negative sample is a sample obtained from a normal diploid individual. No order is imposed between obtaining the sequencing results of the negative samples and obtaining those of the sample to be tested: they may be obtained at the same time or one after the other. Preferably, they are obtained at the same time under the same experimental conditions, to minimize the effect of differences in experimental factors on the detection result. In addition, the negative samples and the sample to be tested are preferably of the same sample type; for example, for non-invasive detection of the genetic information of a fetus carried by the mother, both the negative samples and the sample to be tested may be maternal blood samples.
According to a specific embodiment of the invention, determining the amount of reads mapped to the corresponding first chromosome in the negative samples includes: carrying out steps (1)-(3) with a negative sample in place of the test sample, to determine the amount of reads mapped to the first chromosome of that negative sample; and taking the mean of the amounts of reads of the first chromosome across multiple negative samples as the amount of reads mapped to the corresponding first chromosome in the negative samples.
The amount of reads mapped to a chromosome may be an absolute amount or a relative amount, expressed for example as a single value such as an integer or a ratio, or as a numerical range.
According to a specific embodiment of the invention, before step (3), at least one, at least two, or all three of the following (i)-(iii) are carried out: (i) removing reads in the sequencing result whose length is not more than a predetermined length; (ii) removing reads in the alignment result that do not map to a unique position of the first reference sequence (reads that align to a unique position of the reference sequence are called unique reads); (iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of insertions, deletions and mismatches.
In one example, the error rate of a read is the proportion of the bases, or positions, of the aligned read that show an insertion, deletion or mismatch.
The predetermined error rate can be set according to the sequencing platform, the amount of raw output data, the data quality, the purpose of the detection, and so on. It will be understood that if the amount of raw data is small and/or the data quality is high, a larger predetermined error rate may be appropriate; conversely, a smaller predetermined error rate can be set to remove lower-quality data while still meeting the needs of detection, which favors fast detection.
In one example, sequencing results from a single-molecule sequencing platform are filtered using all of (i)-(iii), which favors fast detection. Specifically, the raw output data amount to 12.8M and the predetermined error rate is set to 10%, i.e. a 10 bp read is allowed at most 1 bp of insertion, deletion or mismatch; 3.4M of data remain after filtering. It will be appreciated that if filtering during alignment is already relatively strict, (ii) may be skipped, for example by setting the predetermined error rate to 100%.
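The three filters (i)-(iii) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AlignedRead` record, its field names and the default thresholds are hypothetical, the error rate is taken as errors divided by read length, and the boundary cases follow the wording of (i) and (iii) ("not more than", "not less than").

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    """Hypothetical minimal record of one aligned read."""
    length: int    # read length in bases
    n_hits: int    # number of reference positions the read aligned to
    n_errors: int  # inserted + deleted + mismatched bases in the alignment


def keep_read(read, predetermined_length=24, predetermined_error_rate=0.10):
    """Return True if the read survives filters (i)-(iii)."""
    # (i) drop reads whose length is not more than the predetermined length
    if read.length <= predetermined_length:
        return False
    # (ii) drop reads that do not map to a unique position
    if read.n_hits != 1:
        return False
    # (iii) drop reads whose error rate is not less than the predetermined rate
    if read.n_errors / read.length >= predetermined_error_rate:
        return False
    return True


def filter_reads(reads, **thresholds):
    """Apply keep_read to a list of reads and return the survivors."""
    return [r for r in reads if keep_read(r, **thresholds)]
```

With the assumed default of 24, reads shorter than 25 bp are dropped; passing `predetermined_error_rate=1.01` would effectively switch filter (iii) off.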
According to a specific embodiment of the invention, step (3) includes: (a) sliding a window of size L3 over the first reference sequence to obtain multiple third windows, optionally with a sliding step of L3; (b) determining the sequencing depth of each third window based on the alignment result, the sequencing depth of a third window being the ratio of the number of reads aligned to that window to the window size L3; (c) determining the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.
According to a specific embodiment of the invention, (b) includes: correcting the sequencing depth of a third window based on its GC content, and using the corrected sequencing depth as the sequencing depth of that third window.
The size of the third window, L3, is generally chosen so that it reflects the differences in sequencing results that differences in GC content and distribution cause across regions (third windows). For the human genome, L3 is generally less than 300 Kbp. In one example, the inventors determine L3 from the relationship between the coefficient of variation (CV) and window size, as shown in Figure 3: based on that curve, a window size at which the CV is clearly affected by window size is chosen as the third window size. Setting L3 to 100 Kbp-200 Kbp, for instance, reflects the influence of GC content and distribution on sequencing and also favors fast alignment. The coefficient of variation, also called the coefficient of dispersion, is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation to the mean; here it reflects the absolute dispersion of the GC content of a set of windows/regions of a given size.
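As a rough illustration of the coefficient of variation described above, the following Python sketch computes the CV of GC content over non-overlapping windows of a given size. The function names are hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and a trailing partial window is simply dropped.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def window_gc_cv(sequence, window_size):
    """CV (standard deviation / mean) of GC content across windows."""
    starts = range(0, len(sequence) - window_size + 1, window_size)
    gcs = [gc_content(sequence[s:s + window_size]) for s in starts]
    m = mean(gcs)
    if m == 0:
        raise ValueError("mean GC content is zero")
    return pstdev(gcs) / m
```

Evaluating `window_gc_cv` for a series of window sizes gives the kind of CV-versus-window-size curve from which L3 is chosen.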
Adjacent third windows may or may not overlap. In one example, L3 is set to 150 Kbp and adjacent third windows overlap by 100 bp, i.e. the sliding step is set to 149.9 Kbp.
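The relationship between window size, overlap and step can be made concrete with a short Python sketch. The function name and the half-open coordinate convention are assumptions made for illustration; the defaults reproduce the example of 150 Kbp windows overlapping by 100 bp.

```python
def sliding_windows(chrom_length, window=150_000, overlap=100):
    """Yield (start, end) half-open windows; the step is window - overlap."""
    if chrom_length <= 0:
        raise ValueError("chromosome length must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be smaller than the window size")
    step = window - overlap  # 150,000 - 100 = 149,900 bp = 149.9 Kbp
    start = 0
    while True:
        end = min(start + window, chrom_length)
        yield start, end
        if end == chrom_length:  # the last window may be shorter
            break
        start += step
```

Setting `overlap=0` gives the non-overlapping case in which the step equals L3.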
Specifically, the correction can be made by establishing the relationship between the GC content of the third windows and their sequencing depth; in one example, this relationship is established by locally weighted regression.
According to a specific embodiment of the invention, (b) further includes, before the above correction, standardizing the sequencing depth of the third windows and using the standardized sequencing depth as the sequencing depth of the third windows.
In one example, the standardization is a normalization; for example, the third-window depths can be normalized by the mean or the median sequencing depth.
According to a specific embodiment of the invention, in (c), a weight coefficient for the reads aligned to a third window is determined from the sequencing depth of that window, and the amount of reads mapped to the first chromosome is determined based on the weight coefficients.
In one example, the sequencing depth of the third windows is standardized or normalized, for example by taking the ratio of a window's sequencing depth to a particular value as that window's sequencing depth, the particular value being the mean sequencing depth of the third windows; this turns the third-window depths into a set of values fluctuating around 1. The relationship between the processed sequencing depth (relative sequencing depth) and GC content is then determined.
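The normalization and the GC-versus-depth fit can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used by the inventors: the function names, the tricube kernel, the `frac` span parameter and the local linear fit are all assumptions standing in for "locally weighted regression".

```python
from statistics import mean


def normalize_depths(depths):
    """Divide each window depth by the mean depth, so values fluctuate around 1."""
    m = mean(depths)
    if m == 0:
        raise ValueError("mean depth is zero")
    return [d / m for d in depths]


def local_fit(gc, depth, gc0, frac=0.75):
    """Predict relative depth at GC content gc0 by a locally weighted linear fit."""
    n = len(gc)
    if n < 2 or n != len(depth):
        raise ValueError("need two or more (gc, depth) pairs")
    k = max(2, min(n, round(frac * n)))  # number of neighbours in the span
    dist = [abs(g - gc0) for g in gc]
    h = sorted(dist)[k - 1]              # bandwidth: k-th nearest distance
    if h == 0:                           # all neighbours sit exactly at gc0
        return mean(d for g, d in zip(gc, depth) if g == gc0)
    w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]  # tricube weights
    if sum(w) == 0:                      # neighbours all exactly at distance h
        w = [1.0 if d <= h else 0.0 for d in dist]
    sw = sum(w)
    mx = sum(wi * g for wi, g in zip(w, gc)) / sw
    my = sum(wi * d for wi, d in zip(w, depth)) / sw
    sxx = sum(wi * (g - mx) ** 2 for wi, g in zip(w, gc))
    if sxx == 0:
        return my
    sxy = sum(wi * (g - mx) * (d - my) for wi, g, d in zip(w, gc, depth))
    return my + sxy / sxx * (gc0 - mx)
```

Calling `local_fit` with a window's own GC content yields a fitted relative depth that can serve as that window's weight coefficient.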
The weight coefficient of the reads of a third window is the relative sequencing depth of that window. In one example, the amount of reads mapped to the first chromosome is a relative amount, namely one corrected by the weight coefficients; this eliminates or reduces the influence of differences in GC content and/or distribution on the detection result and improves detection accuracy.
In some examples, the inventors found that the relative sequencing depth of a third window is inversely proportional to the GC content of that window: third windows with low GC content have a high relative sequencing depth, and third windows with high GC content have a low one. Accordingly, as to the relative amount corrected by the weight coefficient: if, for example, N reads map to a given third window of the first chromosome and the relative sequencing depth of that window is w, then after correction the number of reads mapped to that third window of the first chromosome is N/w.
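The weighting just described amounts to summing N/w over the third windows of a chromosome. A minimal Python sketch, with a hypothetical function name and made-up numbers:

```python
def corrected_read_amount(window_counts, window_weights):
    """Sum N / w over the third windows of one chromosome.

    window_counts: reads mapped to each window (N)
    window_weights: relative sequencing depth of each window (w)
    """
    if len(window_counts) != len(window_weights):
        raise ValueError("counts and weights must have the same length")
    total = 0.0
    for n_reads, w in zip(window_counts, window_weights):
        if w <= 0:
            raise ValueError("weight coefficient must be positive")
        total += n_reads / w  # each read contributes 1 / w
    return total
```

A window that is over-represented (w above 1) is scaled down and an under-represented one (w below 1) is scaled up, so both move toward the count expected at average depth.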
In one example, the amount of reads mapped to the first chromosome is a relative amount, namely the ratio of the amount of reads mapped to the first chromosome to the amount of reads mapped to all, or at least some, of the autosomes. A z-test (z-score) is used to determine whether the difference between this ratio and the corresponding ratio in the negative samples is statistically significant, and thereby to judge whether the first chromosome of the test sample is aneuploid.
According to a specific embodiment of the invention, the first chromosome is selected from at least one of chromosomes 13, 18 and 21. For example, cell-free nucleic acid in a maternal blood sample is tested to obtain fetal genetic information, including screening for, or aiding the diagnosis of, fetal aneuploidy of chromosomes 13, 18 and/or 21.
In general, the GC content and distribution of different chromosomes have different characteristics. Based on relative GC content, for example, the chromosomes of a genome can be divided into a high GC group, a medium GC group and a low GC group, or into relatively high, medium-high, medium, medium-low and low GC groups.
Table 2 shows the GC content of human autosomes. Using sequencing data from multiple control samples, the inventors plotted the standardized sequencing depth of each chromosome against its GC content, as shown in Figure 4. The sequencing results of chromosomes with relatively high or relatively low GC content are more clearly affected by GC content. Among chromosomes 21, 13 and 18, the sequencing result of chromosome 21 is affected least by GC content, chromosome 18 comes next, and chromosome 13 is affected considerably.
Table 2
Chr | 4 | 5 | 6 | 3 | 18 | 8 | 2 | 7 | 12 | 21 |
G/C content | 0.3825 | 0.3952 | 0.3961 | 0.3969 | 0.3979 | 0.4018 | 0.4024 | 0.4075 | 0.4081 | 0.4083 |
Chr | 14 | 11 | 10 | 1 | 15 | 20 | 16 | 17 | 22 | 19 |
G/C content | 0.4089 | 0.4157 | 0.4158 | 0.4174 | 0.4220 | 0.4413 | 0.4479 | 0.4554 | 0.4799 | 0.4836 |
According to a specific embodiment of the invention, the test sample is a maternal blood sample. The content of fetal cell-free nucleic acid, including cell-free fetal DNA (cffDNA), in maternal cell-free nucleic acid samples fluctuates widely between pregnant women and/or across gestational weeks. If detection sensitivity can be improved, samples from earlier gestational weeks can be tested at the same accuracy, so intervention in the pregnancy can take place earlier and with less impact on the pregnant woman. If accuracy can be improved, both false positives and false negatives are reduced, which may ultimately allow the method to be used for diagnosis and not only as screening in support of diagnosis. In general, a maternal body fluid sample goes through cffDNA extraction, library construction and sequencing, and sequencing data (for example in fastq format) are finally output. The raw output data are aligned to a reference sequence, giving an alignment result (for example a sam file) that records, for each read, its position in the genome, its alignment score, whether it aligned uniquely, its alignment errors, and so on. From the alignment result, the number of reads of a given chromosome can be counted, and the proportion of that chromosome's read count among the autosomal read count (hereinafter the chromosome proportion) is finally calculated to judge whether the chromosome has a numerical abnormality.
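The counting step at the end of this pipeline can be illustrated with a small Python sketch that reads SAM-formatted text lines. It relies only on the standard SAM column layout (FLAG in column 2, RNAME in column 3, MAPQ in column 5, bit 0x4 meaning unmapped). The function names, the `chrN` naming of autosomes and the use of a MAPQ threshold as a stand-in for "uniquely aligned" are assumptions, since how uniqueness is encoded depends on the aligner.

```python
from collections import Counter

# Assumed chromosome naming; some references use "1".."22" instead.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def count_reads_per_chromosome(sam_lines, min_mapq=0):
    """Count mapped reads per reference name from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and header
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & 0x4 or mapq < min_mapq:  # unmapped or below the threshold
            continue
        counts[rname] += 1
    return counts


def chromosome_proportion(counts, chrom):
    """Read count of one chromosome divided by the total autosomal read count."""
    total = sum(n for name, n in counts.items() if name in AUTOSOMES)
    if total == 0:
        raise ValueError("no autosomal reads")
    return counts[chrom] / total
```

For real data, `sam_lines` could simply be an open file handle on the sam file.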
According to a specific embodiment of the invention, for non-invasive prenatal screening (NIPT or NIPD), a batch of maternal body fluid samples containing cell-free DNA, in which the fetus has been confirmed normal by testing (negative samples), can first be obtained. The proportion of a chromosome, such as chromosome 21, 18 or 13, is calculated in these samples to determine the range or cutoff for a normal and/or abnormal number of that chromosome (or those chromosomes). Positive samples can be used in the same way to determine the range or cutoff for a normal and/or abnormal chromosome number.
An embodiment of the invention also provides a device for detecting chromosomal aneuploidy, the device being used to carry out the method for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments. The device includes: a sequencing module, for sequencing at least part of the nucleic acid in a test sample to obtain a sequencing result comprising reads; an alignment module, for aligning the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result including information on the chromosome to which each read maps, the first reference sequence being the set of regions of the reference genome whose mappability (alignment ability) is 1, a region with mappability 1 being a region that maps to a unique position in the reference genome; a quantification module, for determining, for a first chromosome, the amount of reads mapped to the first chromosome based on the alignment result from the alignment module; and a judgment module, for comparing the amount of reads mapped to the first chromosome, from the quantification module, with the amount of reads mapped to the corresponding first chromosome in negative samples, to judge the number of the first chromosome.
The descriptions of the technical features and effects of the method for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments apply equally to the device of this embodiment and are not repeated here.
For example, in some embodiments, the mappability (alignment ability) of a region is determined as follows: a first window of size L1 is slid over the reference genome to obtain multiple regions, with the sliding step set, for example, to 1 bp; each region is aligned to the reference genome, and its mappability is calculated from the number of positions in the reference genome to which the region aligns.
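A toy version of this calculation can be written in Python. It is a simplified sketch under explicit assumptions: "alignment" is replaced by exact string matching on a single strand, and mappability is taken as the reciprocal of the number of matching positions, so that a value of 1 corresponds to a region with a unique position. The function name is hypothetical.

```python
from collections import Counter


def mappability_track(reference, l1):
    """Mappability of every L1-sized window, sliding along the reference in 1 bp steps.

    Returns 1 / (number of exact occurrences of the window in the reference).
    """
    if l1 <= 0 or l1 > len(reference):
        raise ValueError("window size must be between 1 and the reference length")
    windows = [reference[i:i + l1] for i in range(len(reference) - l1 + 1)]
    occurrences = Counter(windows)
    return [1.0 / occurrences[w] for w in windows]
```

Regions whose value equals 1 would be the ones retained in the first reference sequence; a real implementation would use an aligner and allow for mismatches and both strands.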
In some embodiments, the number of negative samples is not less than 20, preferably not less than 30.
In some embodiments, the amount of reads mapped to the corresponding first chromosome in the negative samples is determined as follows: a negative sample is passed through the sequencing module, alignment module and quantification module in place of the test sample, to determine the amount of reads mapped to the first chromosome of that negative sample; the mean of the amounts of reads of the first chromosome across multiple negative samples is taken as the amount of reads mapped to the corresponding first chromosome in the negative samples.
In some embodiments, the first reference sequence is at least part of the sequence of the reference genome hg19 from which the regions shown in Table 1 have been removed.
In some embodiments, the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the average sequencing depth of all second windows.
In other embodiments, the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 6 times the average sequencing depth of all second windows.
In some embodiments, the first reference sequence is at least part of the reference genome in which the regions corresponding to second windows have been processed as follows: second windows whose sequencing depth is above the 98th percentile are assigned the sequencing depth of the second window at the 98th percentile.
In other embodiments, second windows whose sequencing depth is above the 99th percentile are assigned the sequencing depth of the second window at the 99th percentile. The second windows are obtained by sliding a window of size L2 over the reference genome; in one example, the sliding step is also L2. The sequencing depth of a second window is the ratio of the number of reads aligned to that window to the window size L2.
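The two ways of handling unusually deep second windows, removal above a multiple of the mean and capping at a percentile, can be sketched in Python as follows. The function names are hypothetical and the nearest-rank definition of the percentile is an assumption, since the text does not say how percentiles are computed.

```python
from statistics import mean


def keep_windows_below_fold(depths, fold=4):
    """Indices of second windows whose depth is less than fold x the mean depth."""
    cutoff = fold * mean(depths)
    return [i for i, d in enumerate(depths) if d < cutoff]


def cap_at_percentile(depths, pct=98):
    """Assign the pct-th percentile depth to every window above it."""
    if not depths:
        raise ValueError("no windows")
    ranked = sorted(depths)
    # Nearest-rank percentile, computed with integers to avoid rounding issues.
    rank = max(1, (pct * len(ranked) + 99) // 100)
    cap = ranked[rank - 1]
    return [min(d, cap) for d in depths]
```

Using `fold=6` or `pct=99` gives the alternative thresholds mentioned in the text.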
In some embodiments, the device further includes a filtering module for carrying out at least one of the following (i)-(iii): (i) removing reads in the sequencing result whose length is not more than a predetermined length; (ii) removing reads in the alignment result that do not map to a unique position of the first reference sequence; (iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of insertions, deletions and mismatches.
In some embodiments, the quantification module is used to: (a) slide a window of size L3 over the first reference sequence to obtain multiple third windows; (b) determine the sequencing depth of each third window based on the alignment result, the sequencing depth of a third window being the ratio of the number of reads aligned to that window to the window size L3; (c) determine the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome.
In some examples, (b) further includes standardizing the sequencing depth of the third windows and using the standardized sequencing depth as the sequencing depth of the third windows.
In other examples, (b) further includes correcting the sequencing depth of a third window based on its GC content and using the corrected sequencing depth as the sequencing depth of that window. The sequencing depth of a third window before correction may be its standardized sequencing depth.
Specifically, the correction uses the relationship between the GC content of the third windows and their sequencing depth; in one example, this relationship is established by locally weighted regression.
In some examples, (c) includes determining, from the sequencing depth of a third window, a weight coefficient for the reads aligned to that window, and determining the amount of reads mapped to the first chromosome based on the weight coefficients.
In some embodiments, the test sample is a maternal blood sample.
In some embodiments, the first chromosome is at least one of fetal chromosomes 13, 18 and 21.
An embodiment of the invention provides a computer-readable storage medium for storing or carrying a program for execution by a computer; executing the program includes carrying out the method for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments. The descriptions of the technical features and effects of the method and/or device for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments apply equally to the computer-readable storage medium of this embodiment and are not repeated here.
An embodiment of the invention also provides a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method for detecting chromosomal aneuploidy in any of the above embodiments or specific embodiments.
Example
The reference sequence used is the set of regions of the reference genome that exclude the regions shown in Table 1 and also meet the following: 1) the mappability (alignment ability) is 1; 2) the sequencing depth is less than 6 times the average sequencing depth, regions at or above that depth having been removed, or else the sequencing depth of regions above the 99th percentile is assigned the 99th-percentile sequencing depth value.
Control samples and the test sample are processed as follows:
1. Sequencing is performed to obtain the raw output data, i.e. a set of reads; reads shorter than 25 bp are discarded.
2. The alignment result (sam file) is obtained, including the unique reads (reads aligned to a unique position of the reference sequence) and the positions of these reads on the reference sequence/reference chromosomes.
3. GC correction is carried out: the reference sequence is cut into windows/regions of 150 Kbp (Bin = 150K); the sequencing depth of each Bin is calculated from the unique reads and normalized to give a normalized sequencing depth; the GC content of each Bin is counted; and the relationship between normalized sequencing depth and GC content is established, for example by fitting an equation with the GC content of a Bin as x and the normalized sequencing depth of that Bin as y.
4. Using y as the weight coefficient w, the number of reads in the alignment result that align uniquely to the window/region is corrected, expressed as a score, or contribution, of 1/w for each such read.
5. The corrected number of unique reads of chromosome i is determined, expressed as the sum of the scores of all unique reads of chromosome i: $UR_i = \sum_{r \in \text{chr}\,i} 1/w_r$, where $w_r$ is the weight coefficient of the window to which unique read $r$ aligns.
6. The proportion of the summed scores of the unique reads of chromosome i among all autosomes is calculated: $Ratio_i = UR_i / \sum_{j \in \text{autosomes}} UR_j$.
Then, from the ratio values of chromosome i ($Ratio_i$) in multiple control samples, the mean $\mu_i$ and standard deviation $\sigma_i$ of the ratio value of chromosome i are determined.
The z-test formula $Zscore_i = (Ratio_i - \mu_i)/\sigma_i$ is used to calculate the Z score (Zscore) of chromosome i of the test sample, and this value is compared with a threshold to judge whether chromosome i of the test sample has a numerical abnormality.
If the Zscore of a chromosome in a maternal peripheral blood test sample is ≥ 3, the difference is statistically significant and the fetus carried by the mother is considered to have three copies of that chromosome.
As for the threshold: the ratio values of chromosome i across multiple control samples follow, or approximately follow, a normal distribution, so z values and the corresponding confidence levels can be looked up in a z table (normal distribution table). For example, at a confidence level of 99.97% the corresponding z value is approximately 3; a sample exceeding this z value has a 99.97% probability of not being a normal sample and can be judged abnormal. Of course, those skilled in the art can set other confidence levels as needed and use the corresponding z value as the threshold for judging whether an abnormality exists.
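The z-test step can be sketched in Python as follows. The function names and the example numbers are illustrative only, and the use of the sample standard deviation of the control ratios is an assumption; the threshold of 3 follows the example above.

```python
from statistics import mean, stdev


def z_score(test_ratio, control_ratios):
    """(Ratio_i - mu_i) / sigma_i, with mu_i and sigma_i taken from control samples."""
    mu = mean(control_ratios)
    sigma = stdev(control_ratios)  # sample standard deviation (assumed estimator)
    if sigma == 0:
        raise ValueError("control ratios have zero spread")
    return (test_ratio - mu) / sigma


def is_abnormal(test_ratio, control_ratios, threshold=3.0):
    """Flag the chromosome when its z-score reaches the threshold."""
    return z_score(test_ratio, control_ratios) >= threshold
```

For instance, with control ratios `[1.0, 1.1, 0.9]` (mean 1.0, standard deviation 0.1), a test ratio of 1.4 gives a z-score of about 4 and would be flagged, whereas a test ratio of 1.05 would not.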
Using the above method, eleven samples that had been confirmed positive by karyotyping were tested. All of them were detected; the results are shown in Table 3.
Table 3
In this specification, references to "an embodiment", "some embodiments", "one or some specific embodiments", "one or some examples", "an example" and the like mean that the specific feature, structure or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures and the like described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.
Claims (10)
1. A method for detecting chromosomal aneuploidy, characterized by comprising:
(1) sequencing at least part of the nucleic acid in a test sample to obtain a sequencing result comprising reads;
(2) aligning the reads to a first reference sequence to obtain an alignment result, the alignment result including information on the specific chromosome to which each read maps, the first reference sequence being the set of regions of a reference genome whose mappability (alignment ability) is 1, a region with mappability 1 being a region that maps to a unique position in the reference genome;
(3) for a first chromosome, determining the amount of reads mapped to the first chromosome based on the alignment result;
(4) comparing the amount of reads mapped to the first chromosome with the amount of reads mapped to the corresponding first chromosome in negative samples, to judge the number of the first chromosome.
2. The method of claim 1, characterized in that determining the mappability (alignment ability) of a region comprises:
sliding a first window of size L1 over the reference genome to obtain multiple regions, optionally with a sliding step of 1 bp;
aligning each region to the reference genome, and calculating the mappability of the region from the number of positions in the reference genome to which the region aligns;
optionally,
the number of negative samples is not less than 20, optionally not less than 30;
optionally,
the amount of reads mapped to the corresponding first chromosome in the negative samples is determined as follows:
carrying out (1)-(3) with a negative sample in place of the test sample, to determine the amount of reads mapped to the first chromosome of that negative sample;
taking the mean of the amounts of reads of the first chromosome across multiple negative samples as the amount of reads mapped to the corresponding first chromosome in the negative samples;
optionally,
the first reference sequence is at least part of the sequence of the reference genome hg19 from which the regions shown in the following table have been removed:
optionally,
the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the average sequencing depth of all second windows, optionally not less than 6 times the average sequencing depth of all second windows;
the second windows are obtained by sliding a window of size L2 over the reference genome, optionally with a sliding step of L2, the sequencing depth of a second window being the ratio of the number of reads aligned to that window to the window size;
optionally,
the first reference sequence is at least part of the reference genome in which the regions corresponding to second windows have been processed as follows: second windows whose sequencing depth is above the 98th percentile are assigned the sequencing depth of the second window at the 98th percentile; optionally, second windows whose sequencing depth is above the 99th percentile are assigned the sequencing depth of the second window at the 99th percentile;
the second windows are obtained by sliding a window of size L2 over the reference genome, optionally with a sliding step of L2, the sequencing depth of a second window being the ratio of the number of reads aligned to that window to the window size L2;
optionally,
the method comprises, before step (3), carrying out at least one of the following (i)-(iii):
(i) removing reads in the sequencing result whose length is not more than a predetermined length;
(ii) removing reads in the alignment result that do not map to a unique position of the first reference sequence;
(iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of insertions, deletions and mismatches.
3. The method of claim 1 or 2, characterized in that step (3) comprises:
(a) sliding a window of size L3 over the first reference sequence to obtain multiple third windows;
(b) determining the sequencing depth of each third window based on the alignment result, the sequencing depth of a third window being the ratio of the number of reads aligned to that window to the window size L3;
(c) determining the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome;
optionally,
(b) further comprises standardizing the sequencing depth of the third windows and using the standardized sequencing depth as the sequencing depth of the third windows;
optionally,
(b) further comprises correcting the sequencing depth of a third window based on its GC content and using the corrected sequencing depth as the sequencing depth of that third window;
optionally,
the correction uses the relationship between the GC content of the third windows and their sequencing depth; optionally, this relationship is established by locally weighted regression;
optionally,
(c) comprises determining, from the sequencing depth of a third window, a weight coefficient for the reads aligned to that window, and determining the amount of reads mapped to the first chromosome based on the weight coefficients.
4. The method of any one of claims 1-3, characterized in that the test sample is a maternal blood sample;
optionally, the first chromosome is at least one of fetal chromosomes 13, 18 and 21.
5. A device for detecting chromosomal aneuploidy, characterized by comprising:
a sequencing module, for sequencing at least a portion of the nucleic acids in a sample to be tested to obtain a sequencing result comprising reads;
an alignment module, for aligning the reads from the sequencing module to a first reference sequence to obtain an alignment result, the alignment result including information on the chromosome to which each read is mapped, the first reference sequence being the set of regions of a reference genome having a mappability of 1, a region with a mappability of 1 being a region that maps to a unique position in the reference genome;
a quantification module, for determining, for a first chromosome and based on the alignment result from the alignment module, the amount of reads mapped to the first chromosome;
a judgment module, for comparing the amount of reads mapped to the first chromosome from the quantification module with the amount of reads mapped to the corresponding first chromosome in negative samples, so as to determine the number of copies of the first chromosome.
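The comparison performed by the judgment module can be sketched as follows. The claim only says that the amount for the sample to be tested is compared with the amount in negative samples; the z-score statistic, the cutoff of 3 and the names `judge_chromosome`, `z_cutoff`, "gain", "loss" and "normal" are assumptions introduced for illustration, not the patent's stated decision rule.

```python
from statistics import mean, stdev


def judge_chromosome(test_amount, negative_amounts, z_cutoff=3.0):
    """Compare a test sample's read amount with negative samples (assumed z-score rule)."""
    mu = mean(negative_amounts)      # reference amount: mean over negative samples
    sd = stdev(negative_amounts)     # spread of the negative samples
    if sd == 0:
        raise ValueError("negative samples show no variation; z-score undefined")
    z = (test_amount - mu) / sd
    if z >= z_cutoff:
        call = "gain"      # more reads than expected, e.g. a possible trisomy
    elif z <= -z_cutoff:
        call = "loss"      # fewer reads than expected, e.g. a possible monosomy
    else:
        call = "normal"
    return z, call
```

Here each amount would typically be expressed on a comparable scale across samples, for example as the fraction of all mapped reads that fall on the first chromosome; that choice is also an assumption.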
6. The device of claim 5, characterized in that determining the mappability of a region comprises:
sliding a first window of size L1 over the reference genome to obtain a plurality of regions, optionally with a sliding step of 1 bp;
aligning each region to the reference genome, and calculating the mappability of the region based on the number of positions in the reference genome to which the region aligns;
Optionally,
the number of negative samples is not less than 20, optionally not less than 30;
Optionally,
the amount of reads mapped to the corresponding first chromosome in the negative samples is determined as follows:
replacing the sample to be tested with a negative sample and passing it through the sequencing module, the alignment module and the quantification module, to obtain the amount of reads mapped to the first chromosome of that negative sample;
taking the mean of the amounts of reads mapped to the first chromosome across a plurality of negative samples as the amount of reads mapped to the corresponding first chromosome in the negative samples;
Optionally,
the first reference sequence is at least part of the reference genome hg19 sequence from which the regions shown in the following table have been removed:
Optionally,
the first reference sequence is at least part of the reference genome from which the regions corresponding to second windows meeting the following condition have been removed: the sequencing depth of the second window is not less than 4 times the average sequencing depth of all second windows, optionally not less than 6 times the average sequencing depth of all second windows;
the second windows are obtained by sliding a window of size L2 over the reference genome, optionally with a sliding step of L2, and the sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the second window size;
Optionally,
the first reference sequence is at least part of the reference genome after the following processing of the regions corresponding to second windows in the reference genome: the sequencing depth of any second window above the 98th percentile is set to the sequencing depth of the second window at the 98th percentile, optionally the sequencing depth of any second window above the 99th percentile is set to the sequencing depth of the second window at the 99th percentile;
the second windows are obtained by sliding a window of size L2 over the reference genome, optionally with a sliding step of L2, and the sequencing depth of a second window is the ratio of the number of reads aligned to that second window to the second window size L2;
Optionally,
the device further comprises a filtering module for performing at least one of the following (i)-(iii):
(i) removing reads in the sequencing result whose length is not greater than a predetermined length;
(ii) removing reads in the alignment result that are not mapped to a unique position in the first reference sequence;
(iii) removing reads in the alignment result whose error rate is not less than a predetermined error rate, the error rate of a read being the proportion of bases in the aligned read that are at least one of insertions, deletions and mismatches.
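The reference-preparation steps of claim 6 can be sketched in a few lines. This is an illustration under stated assumptions rather than the patented implementation: exact string matching stands in for a real aligner, mappability is taken as the reciprocal of the number of matching positions, the percentile uses the nearest-rank definition, and the names `mappability`, `high_depth_windows` and `cap_at_percentile` are hypothetical.

```python
from collections import Counter


def mappability(genome, L1, step=1):
    """Mappability of each L1-sized region, assumed to be 1 / (number of matching positions)."""
    # Count how often every L1-mer occurs anywhere in the genome (exact matches only).
    counts = Counter(genome[i:i + L1] for i in range(len(genome) - L1 + 1))
    starts = range(0, len(genome) - L1 + 1, step)
    return [1.0 / counts[genome[i:i + L1]] for i in starts]


def high_depth_windows(read_counts, L2, factor=4.0):
    """Indices of second windows whose depth is at least `factor` times the mean depth."""
    depths = [count / L2 for count in read_counts]   # depth = reads / window size
    average = sum(depths) / len(depths)
    return [i for i, depth in enumerate(depths) if depth >= factor * average]


def cap_at_percentile(depths, pct=98):
    """Set depths above the pct-th percentile (nearest-rank, assumed) to that percentile."""
    ordered = sorted(depths)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    cap = ordered[rank - 1]
    return [min(depth, cap) for depth in depths]
```

Regions returned with a mappability of 1.0 are the ones that would be kept in the first reference sequence, and the windows flagged by `high_depth_windows` are the ones that would be removed from it.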
7. The device of claim 5 or 6, characterized in that the quantification module is configured to perform the following:
(a) sliding a window of size L3 over the first reference sequence to obtain a plurality of third windows;
(b) determining the sequencing depth of each third window based on the alignment result, the sequencing depth of a third window being the ratio of the number of reads aligned to that third window to the third window size L3;
(c) determining the amount of reads mapped to the first chromosome based on the sequencing depths of the third windows contained in the first chromosome;
Optionally,
(b) further comprises normalizing the sequencing depth of the third window, and taking the normalized sequencing depth as the sequencing depth of the third window;
Optionally,
(b) further comprises correcting the sequencing depth of the third window based on the GC content of the third window, and taking the corrected sequencing depth as the sequencing depth of the third window;
Optionally,
the correction is performed using the relationship between the GC content of the third window and the sequencing depth of the third window; optionally, this relationship is established by locally weighted regression;
Optionally,
(c) comprises determining, based on the sequencing depth of the third window, a weight coefficient for the reads aligned to the third window, and determining the amount of reads mapped to the first chromosome based on the weight coefficient.
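Steps (a) to (c) of the quantification module can be sketched as below. The claim does not define the weight coefficient, so this sketch assumes it is the ratio of a window's corrected depth to its raw depth, and it assumes non-overlapping windows (step equal to L3). The names `window_read_counts`, `window_depths`, `read_weights` and `weighted_read_amount` are hypothetical.

```python
def window_read_counts(read_starts, ref_length, L3):
    """(a) Split the reference into full windows of size L3 and count reads per window."""
    n_windows = ref_length // L3
    counts = [0] * n_windows
    for pos in read_starts:
        idx = pos // L3
        if 0 <= idx < n_windows:      # ignore reads outside the full windows
            counts[idx] += 1
    return counts


def window_depths(counts, L3):
    """(b) Sequencing depth of a window = number of aligned reads / window size."""
    return [count / L3 for count in counts]


def read_weights(raw_depths, corrected_depths):
    """Assumed weight per window: corrected depth divided by raw depth."""
    return [c / r if r > 0 else 0.0 for r, c in zip(raw_depths, corrected_depths)]


def weighted_read_amount(counts, weights):
    """(c) Amount of reads on the chromosome: each read counts with its window's weight."""
    return sum(count * weight for count, weight in zip(counts, weights))
```

A short usage example with made-up read positions on a 30 bp reference:

```python
counts = window_read_counts([0, 5, 12, 25, 27, 29], ref_length=30, L3=10)
depths = window_depths(counts, 10)
weights = read_weights(depths, [0.2, 0.2, 0.2])
amount = weighted_read_amount(counts, weights)
```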
8. The device of any one of claims 5-7, characterized in that the sample to be tested is a maternal blood sample;
Optionally, the first chromosome is at least one of fetal chromosomes 13, 18 and 21.
9. A computer-readable storage medium, characterized in that it stores a program for execution by a computer, execution of the program comprising performing the method of any one of claims 1-4.
10. A computer program product, characterized by comprising instructions which, when the program is executed by a computer, cause the computer to perform the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810425695.6A CN108629152A (en) | 2018-05-07 | 2018-05-07 | Detect the method, apparatus and system of chromosomal aneuploidy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810425695.6A CN108629152A (en) | 2018-05-07 | 2018-05-07 | Detect the method, apparatus and system of chromosomal aneuploidy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108629152A true CN108629152A (en) | 2018-10-09 |
Family
ID=63695783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810425695.6A Withdrawn CN108629152A (en) | 2018-05-07 | 2018-05-07 | Detect the method, apparatus and system of chromosomal aneuploidy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108629152A (en) |
-
2018
- 2018-05-07 CN CN201810425695.6A patent/CN108629152A/en not_active Withdrawn
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102273717B1 (en) | Deep learning-based variant classifier | |
CN107423578B (en) | Device for detecting somatic cell mutation | |
CN108595912A (en) | Detect the method, apparatus and system of chromosomal aneuploidy | |
EP2926288B1 (en) | Accurate and fast mapping of targeted sequencing reads | |
US20220101944A1 (en) | Methods for detecting copy-number variations in next-generation sequencing | |
US11842794B2 (en) | Variant calling in single molecule sequencing using a convolutional neural network | |
CN113160882A (en) | Pathogenic microorganism metagenome detection method based on third generation sequencing | |
US20190287646A1 (en) | Identifying copy number aberrations | |
CN115083521B (en) | Method and system for identifying tumor cell group in single cell transcriptome sequencing data | |
CN111919256A (en) | Method, device and system for detecting chromosome aneuploidy | |
CN108268752B (en) | A kind of chromosome abnormality detection device | |
CN112289376A (en) | Method and device for detecting somatic cell mutation | |
WO2019046804A1 (en) | Identifying false positive variants using a significance model | |
US20190362807A1 (en) | Genomic variant ranking system for clinical trial matching | |
CN109461473B (en) | Method and device for acquiring concentration of free DNA of fetus | |
CN113862351B (en) | Kit and method for identifying extracellular RNA biomarkers in body fluid sample | |
WO2019213810A1 (en) | Method, apparatus, and system for detecting chromosome aneuploidy | |
CN116469462A (en) | Ultra-low frequency DNA mutation identification method and device based on double sequencing | |
CN108629152A (en) | Detect the method, apparatus and system of chromosomal aneuploidy | |
US11535896B2 (en) | Method for analysing cell-free nucleic acids | |
CN117711487B (en) | Identification method and system for embryo SNV and InDel variation and readable storage medium | |
CN112562787B (en) | Gene large fragment rearrangement detection method based on NGS platform | |
Mölbert et al. | Adjustments to the reference dataset design improve cell type label transfer | |
US20220042091A1 (en) | Mitochondrial DNA Quality Control | |
Deshpande et al. | Reconstructing and characterizing focal amplifications in cancer using AmpliconArchitect |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20181009 |