CN103805689A - Characteristic kmer based metatypic chromosomal sequence assembly method and application thereof - Google Patents

Characteristic kmer based metatypic chromosomal sequence assembly method and application thereof

Publication number
CN103805689A
Authority
CN
China
Prior art keywords
kmer
data
sequence
atypia
special
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210460704.8A
Other languages
Chinese (zh)
Other versions
CN103805689B (en
Inventor
黄铨飞
李振宇
刘耿
刘兵行
王俊
汪建
杨焕明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Technology Solutions Co Ltd
Original Assignee
BGI Technology Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Technology Solutions Co Ltd filed Critical BGI Technology Solutions Co Ltd
Priority to CN201210460704.8A priority Critical patent/CN103805689B/en
Publication of CN103805689A publication Critical patent/CN103805689A/en
Application granted granted Critical
Publication of CN103805689B publication Critical patent/CN103805689B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a characteristic-kmer-based method for assembling the sequence of a metatypic chromosome, and to applications of the method. Whole-genome sequencing is performed on a homotypic sample (such as XX or ZZ) and on a metatypic sample (such as XY or ZW); the kmer differences between the two data sets are analysed to obtain the characteristic kmers of the metatypic chromosome; and the metatypic chromosome is then assembled from those characteristic kmers to obtain its complete sequence information. The invention also provides an assembly unit based on this method.

Description

Atypical chromosome sequence assembly method based on feature kmers, and application thereof
Technical field
The invention belongs to the field of bioinformatics. In particular, it relates to a method for assembling atypical chromosome sequences based on feature kmers, and to applications of that method.
Background
With the arrival of the new-generation sequencing technologies 454 (Roche), Solexa (Illumina) and SOLiD (ABI), sequencing throughput has risen rapidly and sequencing cost has fallen sharply, a breakthrough that has greatly advanced genomics.
The common sex-chromosome systems are the XY type and the ZW type. In the XY type, XX is female and XY is male (for example humans and fruit flies); in the ZW type, ZW is female and ZZ is male (for example chicken and Cynoglossus semilaevis). The Y and W chromosomes are called atypical chromosomes. The traditional way to assemble an atypical chromosome is to sequence and assemble the whole genome and then pick out the atypical chromosome sequences. For example, the female (ZW) sample of Cynoglossus semilaevis is first sequenced and assembled, and the W sequences are then picked out. However, the base coverage depth of the atypical chromosome is only half that of the autosomes, and the chromosome contains large duplicated regions, so it assembles poorly. Picking out the atypical chromosome sequences also requires further analysis or experimental verification, which is expensive and time-consuming.
New-generation sequencing technologies produce only short fragments of roughly 25 bp to 100 bp, each a part of some larger fragment in the sample. Assembling this mass of short sequences back into the large fragments of the sample poses a great challenge for downstream analysis of chromosomes, and of atypical chromosomes in particular. Because the fragments produced by sequencing are so short, the prior art needs a very large amount of computation to assemble large fragments.
In summary, the art currently lacks an effective and simple method for assembling atypical chromosome sequences, so there is an urgent need to develop such methods and products.
Summary of the invention
One object of the present invention is to provide an atypical chromosome sequence assembly method based on feature kmers.
Another object of the present invention is to provide an atypical chromosome sequence assembly unit based on feature kmers.
A first aspect of the present invention provides a method for assembling the sequence of an atypical chromosome, comprising the steps of:
(1) performing whole-genome sequencing on a homotype sample and a heterotype sample separately, to obtain the sequencing data and kmer data of both samples;
(2) comparing the kmer data of the two samples from step (1) to obtain the feature kmers of the atypical chromosome;
(3) assembling the atypical chromosome from the feature kmers of step (2) to obtain the assembled sequence of the atypical chromosome.
In another preferred embodiment, the homotype sample is of the XX or ZZ type.
In another preferred embodiment, the heterotype sample is of the XY or ZW type.
In another preferred embodiment, the atypical chromosome is the Y chromosome or the W chromosome.
In another preferred embodiment, the sequencing in step (1) is high-throughput sequencing.
In another preferred embodiment, the high-throughput sequencing uses a platform selected from the group consisting of the 454 FLX, Solexa and SOLiD sequencing platforms.
In another preferred embodiment, comparing the kmer data of the two samples in step (2) comprises the steps of:
(i) removing erroneous sequencing data and building the kmer set; and
(ii) screening for feature kmers according to the following principle: traverse every read in the sequencing data of the homotype sample and extract its kmers; if a kmer occurs in the kmer set of the heterotype sample data, delete it from that set; the kmers that remain undeleted in the kmer set of the heterotype sample data are the feature kmers.
In another preferred embodiment, conventional assembly software is used in step (3) to assemble the atypical chromosome sequence.
In another preferred embodiment, the SOAPdenovo software is used to assemble the atypical chromosome sequence.
In another preferred embodiment, assembling the atypical chromosome sequence with SOAPdenovo comprises the steps of:
(a) building a simplified kmer graph (de Bruijn graph) from the feature kmers;
(b) building contig sequences from the graph of step (a);
(c) aligning the read data of the heterotype sample against the contig sequences of step (b) to obtain scaffold sequences, and thereby the complete sequence of the atypical chromosome.
A second aspect of the present invention provides a method for screening the feature kmers of an atypical chromosome, comprising the steps of:
performing whole-genome sequencing on a homotype sample and a heterotype sample separately to obtain the sequencing data and kmer data of both samples, and building the kmer sets; and
screening for feature kmers according to the following principle: traverse every read in the sequencing data of the homotype sample and extract its kmers; if a kmer occurs in the kmer set of the heterotype sample data, delete it from that set; the kmers that remain undeleted in the kmer set of the heterotype sample data are the feature kmers.
A third aspect of the present invention provides a set of feature kmers of an atypical chromosome, prepared by the method of the second aspect.
A fourth aspect of the present invention provides a sequence assembly unit for an atypical chromosome, comprising modules selected from the group consisting of:
(A) a retrieval module for obtaining the sequencing data and kmer data of the homotype sample and the heterotype sample;
(B) a feature kmer screening module, which uses the sequencing data and kmer data obtained by the retrieval module to screen for the feature kmers of the atypical chromosome; and
(C) an atypical chromosome sequence assembly module for assembling the complete sequence of the atypical chromosome.
It should be understood that, within the scope of the present invention, the technical features described above and those described in detail below (for example in the embodiments) can be combined with one another to form new or preferred technical solutions. For reasons of space, these combinations are not enumerated here.
Brief description of the drawings
The following drawing illustrates a specific embodiment of the invention and does not limit the scope of the invention as defined by the claims.
Fig. 1 shows a flowchart of one technical solution of the present invention.
Detailed description
After extensive and in-depth research, the inventors have for the first time performed whole-genome sequencing separately on a homotype (such as XX or ZZ) sample and a heterotype (such as XY or ZW) sample, analysed the differences between the kmers of the two data sets to obtain the feature kmers of the atypical chromosome, and then assembled them with assembly software to obtain the complete sequence information of the atypical chromosome. The present invention was completed on this basis.
Atypical chromosome
As used herein, the term "atypical chromosome" refers to the chromosome that determines sex in a sex-chromosome system. A heterotype sample is typically XY or ZW, and the atypical chromosome is typically Y or W. The common sex-chromosome systems are the XY type and the ZW type. In the XY type, XX is female and XY is male (for example humans and fruit flies); in the ZW type, ZW is female and ZZ is male (for example chicken and Cynoglossus semilaevis). The Y and W chromosomes are called atypical chromosomes.
Paired-end sequencing
When a gene fragment (DNA or cDNA) is sequenced, the object being sequenced is a physically contiguous stretch of bases. This stretch is called the insert, and its length is the insert size (insertsize).
As used herein, the term "paired-end sequencing" means sequencing the insert from both of its ends inwards. Each sequence obtained is called a read, and its length is the read length (read-length). The reads obtained from the two ends come from the same insert, and the distance between their outer ends is the insert size, which establishes the pairing relationship between the two reads. Two such reads are called paired-end reads.
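The relationship between insert size and read length can be made concrete with a small calculation. The sketch below is illustrative only: the function name `inner_distance` and the library sizes are assumptions, not values from the source.

```python
def inner_distance(insert_size: int, read_length: int) -> int:
    """Unsequenced bases between the two reads of a pair.

    Both reads start at opposite ends of the same insert and run inwards,
    so together they cover 2 * read_length bases of it. A negative result
    means the two reads overlap in the middle of the insert.
    """
    return insert_size - 2 * read_length


# Hypothetical library: 500 bp inserts sequenced with 100 bp reads.
print(inner_distance(500, 100))  # 300
# A short 150 bp insert with the same reads: the mates overlap by 50 bp.
print(inner_distance(150, 100))  # -50
```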
High-throughput sequencing
High-throughput sequencing of genomes makes it possible to find abnormal variation in disease-related genes early, and supports in-depth research into the diagnosis and treatment of individual diseases. Those skilled in the art typically use one of three second-generation sequencing platforms for high-throughput sequencing: 454 FLX (Roche), Solexa Genome Analyzer (Illumina) and SOLiD (Applied Biosystems). Their common feature is very high throughput: compared with the 96 lanes of traditional capillary sequencing, a single high-throughput run can read 400,000 to 4,000,000 sequences. Read lengths range from 25 bp to 450 bp depending on the platform, so a single run yields between 1 G and 14 G bases, again depending on the platform.
Solexa high-throughput sequencing comprises two steps, DNA cluster formation and on-machine sequencing: the mixture of PCR amplification products is hybridised to sequencing probes fixed on a solid support and amplified by solid-phase bridge PCR to form sequencing clusters; the clusters are then sequenced by sequencing-by-synthesis to obtain the sequences of the nucleic acid molecules in the sample.
DNA clusters are formed on a sequencing chip (flow cell) whose surface carries a layer of single-stranded primers. Single-stranded DNA fragments are fixed to the chip surface by complementary base pairing between their adapter sequences and the surface primers. An amplification reaction converts each fixed single strand into double-stranded DNA, which is denatured back to single strands; one end of a strand stays anchored to the chip while the other end randomly pairs with a nearby primer and is anchored too, forming a "bridge". Tens of millions of single DNA molecules undergo this reaction on the chip at the same time. Each single-stranded bridge is amplified again on the chip surface using the surrounding primers, forming a double strand, which is denatured to single strands that form bridges again and serve as templates for the next round of amplification. After 30 rounds of amplification each single molecule has been amplified 1000-fold, forming a monoclonal DNA cluster.
The DNA clusters are sequenced by synthesis on the Solexa sequencer. In the sequencing reaction the four bases are labelled with different fluorophores and the end of each base is blocked by a protecting group, so that only one base can be added per reaction. After scanning to read the colour of that reaction, the protecting group is removed and the next reaction can proceed; repeating this cycle yields the exact base sequence. In Solexa multiplexed sequencing, an index (tag) is used to distinguish samples: after the regular sequencing is complete, the index portion is sequenced separately, and identifying the index allows up to 12 different samples to be distinguished in one sequencing lane.
Contigs and contig assembly
As used herein, the term "contig" means a contiguous sequence. After gene fragments containing STSs (sequence tagged sites) have been sequenced individually, overlap analysis yields the complete sequence; contigs are what that analysis works with.
The basic principle for obtaining contigs is to "break" a very large DNA molecule into pieces and then splice the pieces back together. A physical map is obtained using Mb, kb and bp as map distances and the STS sequences of DNA probes as landmarks. A central task in building a physical map is to join the cloned DNA sequences containing the corresponding STSs into overlapping fragments, the "contigs"; a library carrying the DNA fragments can be built into highly representative contigs with an overall coverage of 100%.
In a preferred embodiment of the present invention, the method further comprises a step of filtering the kmers, preferably comprising: deleting untrusted kmers, deleting low-depth kmers, removing tips shorter than twice the kmer length, or a combination thereof.
In another preferred embodiment, an untrusted kmer is defined as follows: within the set of kmers that make up the out-degree or the in-degree of a kmer, the depth of the deepest kmer is taken as the standard, and any kmer whose depth is less than 10% (preferably 5%) of that standard is untrusted. "Low depth" means below a given depth threshold, which defaults to 0 and can be set by the user through a program parameter.
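The untrusted-kmer rule above compares each neighbouring kmer with the deepest neighbour on the same side. A minimal sketch, assuming the neighbours of one kmer are given as a plain mapping from kmer string to depth; the function name `untrusted` and the example depths are hypothetical, not taken from the source or from SOAPdenovo.

```python
def untrusted(neighbour_depths, fraction=0.10):
    """Return the neighbour kmers whose depth is below fraction * max depth.

    neighbour_depths: dict mapping each kmer on one side (in or out) of a
    node to its sequencing depth. fraction is 0.10 by default; the text
    names 0.05 as the preferred value.
    """
    if not neighbour_depths:
        return set()
    cutoff = fraction * max(neighbour_depths.values())
    return {kmer for kmer, depth in neighbour_depths.items() if depth < cutoff}


# Hypothetical out-neighbours of one kmer: one deep branch, two shallow ones.
out_side = {"ACGTA": 40, "ACGTC": 3, "ACGTG": 5}
print(untrusted(out_side))        # {'ACGTC'}: 3 is below 10% of 40
print(untrusted(out_side, 0.05))  # set(): nothing is below 5% of 40
```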
Deleting untrusted contacts (arcs, or arc data) is selected from the group consisting of:
(i) for continuous sequences that themselves have high depth, deleting the low-weight arcs between continuous sequences;
(ii) for continuous sequences with multiple out-degrees whose out-degrees differ greatly, deleting the low-weight arcs between continuous sequences;
(iii) for continuous sequences that have both in-degrees and out-degrees, and whose in- and out-degrees differ greatly, deleting the arcs of relatively small weight;
(iv) or any combination of the above.
In another preferred embodiment, the high depth in (i) means that the depth of the continuous sequence is more than 25 times the weight of the arc between continuous sequences.
In another preferred embodiment, the low weight in (i) means a weight of less than 3 (preferably less than 2).
In another preferred embodiment, for a continuous sequence in (ii) with multiple out-degrees, which form a multiple-out-degree set, arcs whose weight is less than 3% of the highest arc weight between continuous sequences are relatively low-weight.
In another preferred embodiment, "out-degrees differ greatly" in (ii) means that the smaller out-degree is less than 5% of the larger out-degree, preferably less than 10% of it.
In another preferred embodiment, for a continuous sequence in (iii) that has both in- and out-degrees, the total weight of all arcs on the out side is calculated, and any arc on the in side whose weight is less than 2% of that total is deleted; likewise the total weight on the in side is calculated, and any arc on the out side whose weight is less than 2% of the in-side total is deleted.
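Rule (iii) can be read as a cross-check between the two sides of one continuous sequence: arcs on one side are compared with the total weight on the other side. The sketch below follows that reading, which is itself an interpretation of the translated text; the function name `weak_arcs`, the dictionary layout and the example weights are all assumptions.

```python
def weak_arcs(in_weights, out_weights, fraction=0.02):
    """Return (weak_in, weak_out) arc names for one continuous sequence.

    in_weights / out_weights: dicts mapping arc name to weight for the
    arcs entering and leaving the sequence. An in-arc is weak when its
    weight is below fraction * (sum of out-arc weights), and vice versa.
    """
    in_total = sum(in_weights.values())
    out_total = sum(out_weights.values())
    weak_in = {a for a, w in in_weights.items() if w < fraction * out_total}
    weak_out = {a for a, w in out_weights.items() if w < fraction * in_total}
    return weak_in, weak_out


# Hypothetical arcs: in-arc "b" carries 1 read against 150 on the out side.
weak_in, weak_out = weak_arcs({"a": 100, "b": 1}, {"c": 90, "d": 60})
print(weak_in)   # {'b'}
print(weak_out)  # set()
```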
In a preferred embodiment of the present invention, contig assembly comprises the steps of: building a kmer graph from the reads; filtering and linearising the kmer graph to form continuous sequences; obtaining the contacts (arcs) between continuous sequences and filtering the arcs; linearising the continuous sequences that have no forks; and repeating the arc filtering and linearisation steps until the sequences no longer change, giving the output contig sequences.
Scaffolds and scaffold assembly
As used herein, the term "scaffold" (also rendered "support") refers to a sequence fragment that remains to be assembled into a complete transcriptome or genome.
In a preferred embodiment of the present invention, scaffold assembly focuses on reconstructing genomes: the scaffold graph is divided into subgraphs, each subgraph corresponding to one genome.
In a preferred embodiment, the scaffold graph is divided into subgraphs as follows: contigs that are connected to one another in the scaffold graph are grouped into one class, which is a subgraph. For example, if contig1 is connected to contig3, contig3 is connected to contig5, and contig1, contig3 and contig5 have no other connections, then contig1, contig3, contig5 and their connections form one subgraph. Each subgraph is then assembled in turn, giving a complete, continuous genome sequence as output.
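Splitting the scaffold graph into subgraphs is a connected-components problem. The following is a minimal sketch using union-find, assuming the connections are given as pairs of contig names; the function name `subgraphs` and the extra contig2/contig4 link are illustrative additions, not part of the source.

```python
def subgraphs(links):
    """Group contigs into connected components given (contig, contig) links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), set()).add(contig)
    # Sort the components so the output order is deterministic.
    return sorted(groups.values(), key=lambda group: sorted(group))


links = [("contig1", "contig3"), ("contig3", "contig5"), ("contig2", "contig4")]
for group in subgraphs(links):
    print(sorted(group))
# ['contig1', 'contig3', 'contig5']
# ['contig2', 'contig4']
```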
In a preferred embodiment, scaffold assembly comprises the steps of: aligning the reads and paired-end reads against the output contig sequences to obtain the relationships between reads and contigs; establishing the connections between contigs, building a graph with contigs as vertices and connections as edges; dividing the resulting graph into independent subgraphs; and outputting the complete sequence for each subgraph.
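The "establish connections between contigs" step can be sketched as counting read pairs whose two mates align to different contigs. This is a simplified illustration: the input format, the function name `contig_links` and the `min_support` threshold are assumptions, and a real scaffolder also uses alignment positions, orientation and insert size.

```python
from collections import Counter


def contig_links(pair_hits, min_support=2):
    """Count read pairs whose two mates align to different contigs.

    pair_hits: iterable of (contig_of_mate1, contig_of_mate2); None marks
    an unaligned mate. min_support is a hypothetical noise threshold.
    """
    counts = Counter()
    for first, second in pair_hits:
        if first is None or second is None or first == second:
            continue  # unaligned mate, or both mates inside one contig
        counts[tuple(sorted((first, second)))] += 1
    return {edge: n for edge, n in counts.items() if n >= min_support}


hits = [
    ("contig1", "contig3"),
    ("contig3", "contig1"),
    ("contig3", "contig5"),
    ("contig2", "contig2"),
    ("contig4", None),
]
links = contig_links(hits)
print(links)  # {('contig1', 'contig3'): 2}
```

The surviving edges are the connections from which the contig graph and its subgraphs would then be built.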
kmers and de Bruijn graphs
As used herein, the terms "k-tuple" and "kmer" are used interchangeably and refer to a DNA sequence fragment of length k, where k is a positive integer. kmers have many uses: correcting sequencing errors, building contigs, and estimating genome size, heterozygosity, repeat content and so on.
As used herein, the terms "de Bruijn graph" and "kmer graph" are used interchangeably.
In a preferred embodiment, each fragment is first cut into kmer-sized pieces by sliding one base at a time. For example, for a 75 bp fragment and a kmer length of 50, the pieces are bases 1-50, 2-51, 3-52 and so on. These kmer-sized pieces are then matched against one another as units; if two pieces match, the two kmer fragments can be spliced together.
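The sliding-window cut described above can be written in a few lines. A minimal sketch; the function name `kmers` and the made-up 75 bp read are illustrative.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every length-k substring of read, sliding one base at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGT" * 18 + "ACG"  # a made-up 75 bp read
windows = kmers(read, 50)
print(len(windows))  # 26, i.e. 75 - 50 + 1 windows: bases 1-50, 2-51, ..., 26-75

# Adjacent windows share k - 1 = 49 bases, which is what lets them be spliced.
print(windows[0][1:] == windows[1][:-1])  # True
```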
Those skilled in the art can build the graph for sequence assembly by general methods. In a preferred embodiment, the method comprises the steps of: (i) receiving the sequencing reads; (ii) cutting each received read, sliding one base at a time, into short strings of fixed length, and obtaining the left and right connections of each short string; and (iii) storing the sequence value of each short string, its left and right connections and their connection counts as one node of the de Bruijn graph, thereby building the graph for short-read assembly.
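Steps (i) to (iii) can be sketched with a dictionary keyed by kmer. This is a simplified illustration, not SOAPdenovo's data structure: it stores neighbours as Python sets rather than packed per-base counters, ignores reverse complements, and the names `build_graph`, `count`, `left` and `right` are assumptions.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Map each kmer to its multiplicity and its left/right neighbour kmers."""
    nodes = defaultdict(lambda: {"count": 0, "left": set(), "right": set()})
    for read in reads:
        previous = None
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            nodes[kmer]["count"] += 1
            if previous is not None:
                # Consecutive kmers in a read overlap by k - 1 bases.
                nodes[previous]["right"].add(kmer)
                nodes[kmer]["left"].add(previous)
            previous = kmer
    return dict(nodes)


graph = build_graph(["ACGTAC", "CGTACG"], k=4)
print(graph["CGTA"]["count"])  # 2: the kmer occurs in both reads
print(graph["CGTA"]["left"])   # {'ACGT'}
print(graph["CGTA"]["right"])  # {'GTAC'}
```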
Contig
As used herein, the terms "contig" and "edge" are used interchangeably, and both refer to a group of short fragments that can be joined by their overlaps into a longer fragment. A contig record represents a continuous sequence built from multiple cloned sequences. Such records may contain draft or finished sequence, and may contain sequence gaps (within a single clone) or gaps between clones that no sequenced clone spans.
N50
Take the sum of all contig lengths as the reference, for example 500 Mb, with contigs of 100 to 500 bp. Starting from the longest contig (or from the shortest), remove contigs one by one while adding up the lengths of the contigs removed. When the total length of the removed (or retained) contigs reaches half the sum of all contig lengths, the length of the contig just removed is the N50 value.
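The N50 procedure, taken from the longest contig downwards, can be sketched as follows. The function name `n50` and the toy lengths are illustrative.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the summed length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig length")


# Toy example: total 20, half is 10; 6 alone is not enough, 6 + 5 = 11 is.
print(n50([2, 3, 4, 5, 6]))  # 5
```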
Assembly of the atypical chromosome sequence
The invention provides a method for assembling an atypical chromosome sequence (Fig. 1): a homotype (such as XX or ZZ) sample and a heterotype (such as XY or ZW) sample are each subjected to whole-genome sequencing, the differences between the kmers of the two data sets are analysed to obtain the feature kmers of the atypical chromosome, and assembly software is then used to assemble the atypical chromosome sequence.
In the present invention, conventional assembly software can be used to assemble the atypical chromosome sequence. A preferred assembler is SOAPdenovo (see Ruiqiang Li, Hongmei Zhu, Jue Ruan, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 2009, 20:265-272).
The basic principle of the method is that the atypical chromosome sequence is present only in the heterotype sample and absent from the homotype sample. Analysing the differences between the two samples therefore screens out the feature kmers of the heterotype sample, and these feature kmers belong to the atypical chromosome.
The first step of the method is to build the kmer set of the heterotype sample data. Sequencing errors are removed from the short reads of the heterotype sample, a kmer length is chosen, and the kmer set is built. Here a kmer is a subsequence of length N (the kmer length, an odd number) taken with a sliding window that advances by 1, so that two adjacent kmers overlap by N-1 characters.
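The source does not say why the kmer length is odd. A common reason in de Bruijn graph assemblers is that an odd-length kmer can never equal its own reverse complement, so each kmer and its reverse-strand copy can be stored under one unambiguous key. The sketch below illustrates that property; treating the two strands as one "canonical" kmer is an assumption here, not something the source states, and the function names are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative for a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("ACGTT"))  # AACGT
print(canonical("ACGTT"))           # AACGT
# An even-length kmer can be its own reverse complement ...
print(reverse_complement("ACGT") == "ACGT")  # True
# ... but no odd-length kmer can: the middle base would have to pair with itself.
print(any("".join(p) == reverse_complement("".join(p))
          for p in product("ACGT", repeat=3)))  # False
```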
The second step of the method is to screen for feature kmers. Sequencing errors are removed from the short reads of the homotype sample, then every read is traversed and its kmers extracted. If a kmer occurs in the kmer set of the heterotype sample data, it is deleted from that set. Once all reads of the homotype sample have been traversed, the kmers from the autosomes and from the non-atypical sex chromosome (X/Z) have been deleted from the kmer set of the heterotype sample data, so the kmers that remain in that set are the feature kmers of the atypical chromosome.
In a preferred embodiment of the present invention, the pseudocode is as follows:
Begin
Input: read set R of the homotype sample; kmer set K of the heterotype sample;
For each read r in R
For each kmer in r
If (kmer ∈ K)
Delete kmer from K;
End if
End for
End for
Output: K, the feature kmer set of the atypical chromosome;
End
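The screening pseudocode can be sketched in runnable form as follows. This is a toy, in-memory illustration rather than the modified SOAPdenovo pregraph step: it skips error correction, ignores reverse complements, and the names `kmers` and `feature_kmers` and the example reads are assumptions.

```python
def kmers(read, k):
    """Yield every length-k substring of read, sliding one base at a time."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def feature_kmers(hetero_reads, homo_reads, k):
    """Kmers seen in the heterotype (XY/ZW) reads but in no homotype read."""
    # Step 1: build the kmer set K of the heterotype sample.
    remaining = {km for read in hetero_reads for km in kmers(read, k)}
    # Step 2: traverse the homotype reads and delete every kmer found in K.
    for read in homo_reads:
        for km in kmers(read, k):
            remaining.discard(km)
    return remaining


# Made-up reads: the second heterotype read has no counterpart in the homotype data.
zw_reads = ["ACGTAC", "TTGGCC"]
zz_reads = ["ACGTAC"]
print(sorted(feature_kmers(zw_reads, zz_reads, k=4)))  # ['GGCC', 'TGGC', 'TTGG']
```

Using `discard` rather than `remove` makes the "If (kmer ∈ K)" test implicit: deleting a kmer that is not in the set is a no-op.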
The third step of the method is to assemble the atypical chromosome sequence from the feature kmer set obtained above. Those of ordinary skill in the art can do this by conventional methods (for example with the SOAPdenovo software).
First, a de Bruijn graph is built from the feature kmer set (see, for example, Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci 2001, 98:9748-9753) and simplified. Next, the graph is broken at repeat boundaries and contig sequences are built. Finally, the sequencing reads of the heterotype sample are aligned to the contig sequences, and the paired-end information of the reads is used to link short contigs into long scaffold sequences and to fill gaps.
The present invention also provides an assembly unit for atypical chromosome sequences, comprising:
(A) a retrieval module for obtaining the sequencing data and kmer data of the homotype sample and the heterotype sample;
(B) a feature kmer screening module for screening the feature kmers of the atypical chromosome;
(C) an atypical chromosome sequence assembly module for assembling the complete sequence of the atypical chromosome.
The main advantages of the present invention include:
1. By assembling from feature kmers, the invention solves the problem that similarity between the atypical chromosome and the other sex chromosome in the heterotype sample degrades assembly, and ensures that the assembly result is not contaminated with sequences from other chromosomes.
2. The feature kmer set can be screened quickly, with memory consumption comparable to a conventional SOAPdenovo assembly. This is achieved mainly by modifying pregraph, the first step of the SOAPdenovo software, and reusing SOAPdenovo's efficient data structures.
3. The invention removes the need to pick out the atypical chromosome after whole-genome assembly, so the atypical chromosome sequence can be assembled quickly and easily.
The present invention is further described below with reference to a specific embodiment. It should be understood that the embodiment only illustrates the invention and does not limit its scope. Experimental methods for which no specific conditions are given in the following embodiment generally follow conventional conditions, such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or the conditions recommended by the manufacturer.
Embodiment 1
Assembly of the Cynoglossus semilaevis W chromosome
1. The samples were blood cells from one female and one male Cynoglossus semilaevis.
2. Conventional whole-genome sequencing of the samples was performed on an Illumina HiSeq 2000.
Sequencing result: the male (ZZ) and female (ZW) data sets were 68 GB and 91 GB respectively.
3. Step one: filter the sequencing data and prepare the data configuration files.
Sequencing data filtering removed: reads with more than 10% N content; reads with a quality value below 40; reads containing primer sequence; reads from small inserts; and PCR duplicate reads.
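Two of the listed filters, N content and PCR duplicates, are simple enough to sketch directly. The code below is an illustration under stated assumptions: reads are plain strings, a duplicate is an exact repeat of an earlier read, and the names `n_fraction` and `filter_reads` are hypothetical. The quality, primer and insert-size filters need per-base qualities and alignment information and are not shown.

```python
def n_fraction(seq: str) -> float:
    """Fraction of bases called as N; an empty read counts as all N."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def filter_reads(reads, max_n_fraction=0.10):
    """Drop reads with more than max_n_fraction N bases and exact duplicates."""
    seen = set()
    kept = []
    for seq in reads:
        if n_fraction(seq) > max_n_fraction:
            continue  # too many uncalled bases
        if seq in seen:
            continue  # treated here as a PCR duplicate
        seen.add(seq)
        kept.append(seq)
    return kept


reads = ["ACGTACGTAC", "ACGTNNACGT", "ACGTACGTAC", "ACGTACGTAN"]
print(filter_reads(reads))  # ['ACGTACGTAC', 'ACGTACGTAN']
```

The last read has exactly 10% N and is kept, because the filter removes only reads with more than 10% N.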
For preparing the data configuration files, see the SOAPdenovo documentation.
After filtering, the male and female data sets were 47 GB and 64 GB respectively.
4. Step two:
1) Read in the read data of both samples, analyse the kmers to obtain the feature kmer set, and build the de Bruijn graph. The command is:
SOAPdenovo-63mer pregraph -s zw.reads.cfg -r zz.reads.cfg -K 27 -o Cs -p 16 > pregraph.log
In this embodiment the inventors screened the feature kmers with a modified SOAPdenovo program: the -s parameter is the configuration file of the female read data and the -r parameter is the configuration file of the male read data; kmers that occur in both the female and male read data are deleted from the kmer set of the female read data. The kmer length was 27.
2) Build contigs. The command is:
SOAPdenovo-63mer contig -g Cs > contig.log
3) Align the female read data to the contig sequences. The command is:
SOAPdenovo-63mer map -s zw.reads.cdf -g Cs -p 16 > map.log
4) Build scaffold sequences and fill gaps. The command is:
SOAPdenovo-63mer scaff -g Cs -F -p 16 > scaff.log
This completes the assembly of the Cynoglossus semilaevis W chromosome.
The modified SOAPdenovo required about 30 GB of memory.
5. result is shown:
Statistics of the assembled contig sequences:
Total sequence length including N: 13743913
Total sequence length excluding N: 13743913
Number of contig sequences: 82806
Mean length: 165
Median length: 129
Longest sequence length: 3152
Shortest sequence length: 100
N50 (number): 159 (25420)
Statistics of the scaffold sequences:
Total sequence length including N: 20776731
Total sequence length excluding N: 18172557
Number of scaffold sequences: 43894
Mean length: 473
Median length: 132
Longest sequence length: 27006
Shortest sequence length: 100
N50 (number): 1545 (2685)
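For readers unfamiliar with the N50 figure, the sketch below shows one common way to compute it from a list of sequence lengths: sort the lengths from longest to shortest and accumulate them until half of the total length is reached. It assumes that the number in parentheses is the count of longest sequences needed to reach that point, and it does not say whether the total includes N. The function name is hypothetical and this is not the statistics script used for the figures above.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach it).

    Sequences are taken from longest to shortest until their summed
    length reaches at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("n50() needs at least one sequence length")
```

For example, the lengths 2, 3, 4, 5 and 6 sum to 20; the two longest sequences (6 and 5) already cover half of that, so the N50 is 5 and the count is 2.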
All documents mentioned in the present invention are incorporated by reference in this application, as if each document were individually incorporated by reference. It should further be understood that, after reading the above teachings of the present invention, those skilled in the art can make various changes or modifications to the invention, and these equivalent forms likewise fall within the scope defined by the claims appended to this application.

Claims (10)

1. A method for assembling the sequence of a heteromorphic sex chromosome, characterized in that it comprises the steps of:
(1) performing whole-genome sequencing on a homogametic sample and a heterogametic sample separately, to obtain the sequencing data and kmer data of the two samples;
(2) comparing the kmer data of the two samples from step (1), to obtain the feature kmers of the heteromorphic sex chromosome;
(3) assembling the heteromorphic sex chromosome according to the feature kmers of step (2), to obtain the assembled sequence of the heteromorphic sex chromosome.
2. The method of claim 1, characterized in that the homogametic sample is of the XX type or the ZZ type; or the heterogametic sample is of the XY type or the ZW type.
3. The method of claim 1, characterized in that the heteromorphic sex chromosome is the Y chromosome or the W chromosome.
4. The method of claim 1, characterized in that, in step (1), the sequencing is high-throughput sequencing.
5. The method of claim 1, characterized in that, in step (2), comparing the kmer data of the two samples comprises the steps of:
(i) removing erroneous sequencing data and building the kmer sets; and
(ii) screening the feature kmers according to the following principle: each read in the sequencing data of the homogametic sample is traversed to obtain its kmers; if a kmer occurs in the kmer set of the heterogametic sample data, it is deleted from that set; the kmers that remain undeleted in the kmer set of the heterogametic sample data are the feature kmers.
6. The method of claim 1, characterized in that, in step (3), conventional assembly software is used to assemble the heteromorphic sex chromosome sequence.
7. The method of claim 6, characterized in that it comprises the steps of:
(a) building a simplified kmer graph (de Bruijn graph) from the feature kmers;
(b) building contig sequences from the graph of step (a);
(c) aligning the reads of the heterogametic sample to the contig sequences of step (b) to obtain scaffold sequences, thereby obtaining the complete sequence of the heteromorphic sex chromosome.
8. A method for screening the feature kmers of a heteromorphic sex chromosome, characterized in that it comprises the steps of:
performing whole-genome sequencing on a homogametic sample and a heterogametic sample separately, to obtain the sequencing data and kmer data of the two samples, and building the kmer sets; and
screening the feature kmers according to the following principle: each read in the sequencing data of the homogametic sample is traversed to obtain its kmers; if a kmer occurs in the kmer set of the heterogametic sample data, it is deleted from that set; the kmers that remain undeleted in the kmer set of the heterogametic sample data are the feature kmers.
9. A set of feature kmers of a heteromorphic sex chromosome, characterized in that it is prepared by the method of claim 8.
10. A sequence assembly unit for a heteromorphic sex chromosome, characterized in that it comprises modules selected from the following group:
(A) a retrieval module, for obtaining the sequencing data and kmer data of the homogametic sample and the heterogametic sample;
(B) a feature kmer screening module, which uses the sequencing data and kmer data obtained by the retrieval module to screen the feature kmers of the heteromorphic sex chromosome; and
(C) a heteromorphic sex chromosome sequence assembly module, for assembling the complete sequence of the heteromorphic sex chromosome.
CN201210460704.8A 2012-11-15 2012-11-15 Feature-kmer-based method for assembling heteromorphic sex chromosome sequences, and application thereof Active CN103805689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210460704.8A CN103805689B (en) 2012-11-15 2012-11-15 A kind of sex chromosome with heterotype sequence assembling method of feature based kmer and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210460704.8A CN103805689B (en) 2012-11-15 2012-11-15 A kind of sex chromosome with heterotype sequence assembling method of feature based kmer and application thereof

Publications (2)

Publication Number Publication Date
CN103805689A true CN103805689A (en) 2014-05-21
CN103805689B CN103805689B (en) 2015-08-19

Family

ID=50703074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210460704.8A Active CN103805689B (en) 2012-11-15 2012-11-15 Feature-kmer-based method for assembling heteromorphic sex chromosome sequences, and application thereof

Country Status (1)

Country Link
CN (1) CN103805689B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573407A (en) * 2015-02-10 2015-04-29 东南大学 Searching method for species-specific endogenous barcodes and application thereof in multi-sample mixed sequencing
CN105631464A (en) * 2015-12-18 2016-06-01 深圳先进技术研究院 Method and device for classifying chromosome sequences and plasmid sequences
CN107784201A (en) * 2016-08-26 2018-03-09 深圳华大基因科技服务有限公司 Joint gap-filling method and system using second-generation sequences and third-generation single-molecule real-time sequencing sequences
CN108875306A (en) * 2018-05-31 2018-11-23 福建农林大学 Method and system for searching for sex-determination sequences

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021986B (en) * 2016-05-24 2019-04-09 人和未来生物科技(长沙)有限公司 Ultralow frequency mutating molecule consensus sequence degeneracy algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
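Liu Xinxing et al.: "De Novo Assembly of Allotetraploid Arabidopsis suecica Transcriptome using Short Reads for Gene Discovery and Marker Identification", China Biotechnology (《中国生物工程杂志》) *
Wang Lei et al.: "Research on repeat-sequence algorithms in DNA fragment assembly" (DNA片段拼接中重复序列算法研究), Computer Science (《计算机科学》) *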

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573407A (en) * 2015-02-10 2015-04-29 东南大学 Searching method for species-specific endogenous barcodes and application thereof in multi-sample mixed sequencing
CN104573407B (en) * 2015-02-10 2017-05-24 东南大学 Searching method for species-specific endogenous barcodes and application thereof in multi-sample mixed sequencing
CN105631464A (en) * 2015-12-18 2016-06-01 深圳先进技术研究院 Method and device for classifying chromosome sequences and plasmid sequences
CN105631464B (en) * 2015-12-18 2019-03-01 深圳先进技术研究院 Method and device for classifying chromosome sequences and plasmid sequences
CN107784201A (en) * 2016-08-26 2018-03-09 深圳华大基因科技服务有限公司 Joint gap-filling method and system using second-generation sequences and third-generation single-molecule real-time sequencing sequences
CN108875306A (en) * 2018-05-31 2018-11-23 福建农林大学 Method and system for searching for sex-determination sequences

Also Published As

Publication number Publication date
CN103805689B (en) 2015-08-19

Similar Documents

Publication Publication Date Title
AU2018210188B2 (en) Methods and systems for generation and error-correction of unique molecular index sets with heterogeneous molecular lengths
US20230272483A1 (en) Systems and methods for analyzing circulating tumor dna
CA2869574C (en) Sequence assembly
US8428882B2 (en) Method of processing and/or genome mapping of diTag sequences
CN103805689B (en) Feature-kmer-based method for assembling heteromorphic sex chromosome sequences, and application thereof
US20120102054A1 (en) Systems and Methods for Annotating Biomolecule Data
WO2017143585A1 (en) Method and apparatus for assembling separated long fragment sequences
Scheibye-Alsing et al. Sequence assembly
US11862299B2 (en) Algorithms for sequence determinations
US20210375397A1 (en) Methods and systems for determining fusion events
US20200075123A1 (en) Genetic variant detection based on merged and unmerged reads
Goussarov et al. Introduction to the principles and methods underlying the recovery of metagenome‐assembled genomes from metagenomic data
CN111192636A (en) mRNA next-generation sequencing result analysis method suitable for oligodT enrichment
Goswami et al. RNA-Seq for revealing the function of the transcriptome
US20150120204A1 (en) Transcriptome assembly method and system
CN104428423A (en) Method and system for determining integration manner of foreign gene in human genome
Schulz Data structures and algorithms for analysis of alternative splicing with RNA-seq data
Chuang et al. GABOLA: A Reliable Gap-Filling Strategy for de novo Chromosome-Level Assembly
Adam et al. Nanopore guided assembly of segmental duplications near telomeres
Wong et al. GoldRush: A de novo long read genome assembler with linear time complexity
Thielen Nanopore DNA Sequencing to Expand Genetic Context in Non-Model Species Research
Jiang et al. The Bioinformatic Applications of Hi-C and Linked Reads
Meleshko Novel Synthetic Long-Read Methods for Structural Variant Discovery and Transcriptomic Assembly
Kakrana Integrated, scalable tools for small RNA genomics: novel algorithms and their application to characterize germline-associated sRNA pathways in diverse species
Shenker Leveraging High Throughout Sequencing To Characterize Alternative Polyadenylation Across Species

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant