WO2013097143A1 - Method and apparatus for estimating genomic heterozygosity rate - Google Patents
Method and apparatus for estimating genomic heterozygosity rate
- Publication number
- WO2013097143A1 (PCT/CN2011/084915)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequencing
- sequence
- heterozygous
- sequences
- genome
- Prior art date
Links
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Definitions
- The present invention relates to the field of bioinformatics, and in particular to a method and apparatus for estimating the heterozygosity rate of a genome.
- Heterozygosity of a genome refers to single-base differences at the same position on the homologous chromosomes of a diploid or polyploid individual.
- Second-generation DNA sequencing is a high-throughput, low-cost sequencing technology.
- Its basic principle is sequencing by synthesis. Taking the Solexa sequencing method as an example, the DNA strand is first randomly sheared by physical means, and specific adapters carrying amplification primer sequences are then added to both ends of each fragment.
- During sequencing, DNA polymerase synthesizes the complementary strand of the fragment to be sequenced, and the sequence of the fragment is read by detecting the fluorescent signal carried by each newly synthesized base. These sequences are called sequencing fragments or reads (http://www.illumina.com).
- Sequencing the DNA of a species and assembling it de novo generally requires a rough prior understanding of the species' sequence. Sequence assembly restores the genome sequence from the overlaps between sequencing fragments, so if the heterozygosity rate is too high, de novo assembly of whole-genome shotgun data gives poor results. A genome survey is therefore usually performed before de novo assembly to determine the heterozygosity rate of the genome.
- After the sequencing data are obtained, a k-mer frequency distribution is built from the reads and used to estimate the genomic heterozygosity rate.
- Specifically, assume that a complete contiguous sequence exists and take a fragment of length K from it at random; such a fragment is called a k-mer. Thus, when the read length is L and the k-mer length is K, each read yields L-K+1 k-mers. Counting how often each distinct k-mer occurs across all reads then gives the k-mer frequency distribution.
- The specific process is shown in Figure 1.
- According to Lander-Waterman statistics, the frequency distribution of genomic k-mers can be regarded as approximately following a Poisson distribution.
- According to Poisson theory, the sequencing depth at the peak is the average sequencing depth of the genome.
- For a diploid, if the genomic heterozygosity rate is relatively high, a heterozygous peak appears at one half of the depth of the main peak of the k-mer distribution.
- To estimate the heterozygosity rate, data from another genome are needed for simulation; for example, the Arabidopsis genome is used to simulate the heterozygosity rate of the target genome.
- The present invention is proposed in view of the above problems.
- A first aspect of the invention provides a method for estimating the heterozygosity rate of a genome, comprising: obtaining RAD single-end sequencing reads of an individual's genome; filtering the RAD single-end reads to remove unqualified reads; counting reads with identical sequences to obtain the depth of each distinct read sequence; filtering out read sequences with a sequencing depth of 1; performing ungapped pairwise alignment among the remaining read sequences to determine heterozygous sites; and obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites.
- Preferably, the number of mismatches allowed in the ungapped pairwise alignment is determined by the read length; that is, the alignment condition of the ungapped pairwise alignment is set according to the length of the reads.
- Preferably, performing ungapped pairwise alignment among the read sequences to determine heterozygous sites comprises: performing ungapped pairwise alignment between every pair of read sequences; clustering all read sequences that satisfy the alignment condition; and selecting the clusters that contain exactly two read sequences, whose positions carry heterozygous sites.
- Preferably, the method further comprises removing heterozygous sites located in repeat regions of the genome sequence.
- A site is treated as a heterozygous site in a repeat region when the read sequence has multiple copies in the genome and a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome.
- In one embodiment, the high sequencing depth refers to twice the average sequencing depth.
- Preferably, unqualified reads include: reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or reads in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or reads containing exogenous sequence; and/or reads whose first few bases are not the restriction-enzyme cut-end sequence.
- Preferably, obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites comprises: dividing the total number of heterozygous sites by the total length of the read sequences in non-repeat regions, which gives the heterozygosity rate at the RAD-sequenced positions of the individual and approximates the heterozygosity rate of the whole genome.
- Another aspect of the invention provides an apparatus for estimating the heterozygosity rate of a genome, comprising: a read acquisition device for obtaining RAD single-end sequencing reads of an individual's genome; a read filtering device for filtering the obtained RAD single-end reads to remove unqualified reads; a sequencing depth determination device for counting reads with identical sequences to obtain the depth of each distinct read sequence; a sequence depth filtering device for filtering out read sequences with a sequencing depth of 1; a heterozygous site determination device for performing ungapped pairwise alignment among the remaining read sequences to determine heterozygous sites; and a heterozygosity rate acquisition device for obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites.
- Preferably, the number of mismatches allowed in the ungapped pairwise alignment is determined by the read length; that is, the alignment condition of the ungapped pairwise alignment is set according to the length of the reads.
- Preferably, the heterozygous site determination device comprises: an alignment unit for performing ungapped pairwise alignment between every pair of read sequences; a clustering unit for clustering all read sequences that satisfy the alignment condition; and a heterozygous site determination unit for selecting the clusters that contain exactly two read sequences, whose positions carry heterozygous sites.
- Preferably, the apparatus further comprises a repeat-region heterozygous site removal device for removing heterozygous sites located in repeat regions of the genome sequence.
- Preferably, the repeat-region heterozygous site removal device treats a site as a heterozygous site in a repeat region when the read sequence has multiple copies in the genome and a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome.
- In one embodiment, the high sequencing depth refers to twice the average sequencing depth.
- Preferably, unqualified reads include: reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or reads in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or reads containing exogenous sequence; and/or reads whose first few bases are not the restriction-enzyme cut-end sequence.
- An advantage of the invention is that the heterozygosity rate of a genome can be estimated conveniently from partial sequencing of the genome, which reduces sequencing and computing-resource costs, requires no known genome data for simulation, and simplifies the processing steps.
- Figure 1 is a schematic flow chart of obtaining a k-mer frequency distribution from sequencing reads in the prior art.
- In the figure, the abscissa is the sequencing depth of the k-mers, and the ordinate is the number of distinct k-mers at a given sequencing depth as a percentage of the total number of distinct k-mers.
- Figure 2 is a schematic diagram of simulating the heterozygosity rate of a target genome with the Arabidopsis genome in the prior art.
- Figure 3 is a schematic diagram of the steps of the RAD sequencing technique.
- Figure 4 is a flow chart of one embodiment of the method of the invention for estimating the genomic heterozygosity rate.
- Figure 5 is a schematic diagram of an example of RAD single-end sequencing of a genome.
- Figure 6 is a schematic diagram of how the depth information of read sequences is counted.
- Figure 7 is a schematic diagram of how the depth information of read sequences is stored.
- Figure 8 is a flow chart of an example of the read alignment of the invention.
- Figure 9 is a schematic diagram of an example of a heterozygous site located in a repeat region.
- Figure 10 is a schematic diagram of an application example of the method of the invention for estimating the genomic heterozygosity rate.
- Figure 11 is a structural diagram of one embodiment of the apparatus of the invention for estimating the genomic heterozygosity rate.
- Figure 12 is a structural diagram of another embodiment of the apparatus of the invention for estimating the genomic heterozygosity rate.
- In view of the problems in the prior art, the present disclosure provides a new bioinformatics analysis scheme that processes RAD (Restriction-site Associated DNA) data to find heterozygous site information on RAD sequencing fragments and thereby calculate the heterozygosity rate, which simplifies the processing steps of the prior art and reduces sequencing and computing-resource costs.
- RAD sequencing uses a new library construction method.
- The specific sequencing process is shown in Figure 3.
- A restriction endonuclease cuts the DNA at specific sites, and the digested DNA molecules are then randomly sheared by physical methods.
- DNA molecules of a specific length are selected by agarose gel separation, and specific amplification adapters and sequencing adapters are added to the ends of the selected DNA to construct a library for high-throughput sequencing.
- The heterozygosity rate is the number of heterozygous sites in the non-repeat regions of the read sequences as a percentage of the total length of the read sequences in the non-repeat regions.
- Ungapped pairwise alignment means that no gap may be opened during alignment; alignments that require an opened gap are not considered.
- For example, an alignment of AATTCATCGAC against AA CATCGTC, which requires opening a gap, does not satisfy the ungapped pairwise alignment condition.
- The average sequencing depth is the total depth of the clusters divided by the number of clusters.
- Figure 4 is a flow chart of one embodiment of the method of the invention for estimating the genomic heterozygosity rate.
- Step 402: RAD single-end sequencing reads of an individual's genome are obtained.
- Figure 5 is a schematic diagram of an example of RAD single-end sequencing.
- The restriction endonuclease EcoRI recognizes the palindromic sequence "GAATTC" on the DNA molecule and cuts the DNA between G and A; the digested DNA is sheared into short fragments by physical methods, an adapter is added to the restriction-cut end, and the DNA fragments are sequenced single-end.
- The read length is generally 50 nt and may also be 100 nt.
- Step 404: the RAD single-end reads are filtered to remove unqualified reads. For example, after high-throughput RAD single-end reads are received, they are filtered to remove unqualified sequences.
- The high-throughput sequencing technology may be Illumina GA or any other existing high-throughput sequencing technology.
- Unqualified reads include, for example, reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the bases in the read.
- The low-quality threshold depends on the specific sequencing technology and sequencing environment; for example, a single-base sequencing quality below 20. A read in which the number of bases with an undetermined call (such as N in Illumina GA results) exceeds 10% of the bases in the read is also considered unqualified. Reads are further aligned against exogenous sequences introduced by the experiment other than the sample adapter sequence, such as various adapter sequences.
- A read containing such an exogenous sequence is considered unqualified. A read whose first few bases are not the restriction-enzyme cut-end sequence is also filtered out (for example, with the restriction endonuclease EcoRI, the whole read is filtered out if it does not start with "AATTC").
- Step 406: reads with identical sequences are counted to obtain the depth of each distinct read sequence. For example, identical reads are tallied and each distinct read sequence is collected into a stack, which gives the sequencing depth of every distinct read sequence.
- The specific process is shown in Figure 6.
- The stack information can be stored as shown in Figure 7.
- The first column is the RAD read sequence.
- The second column is the number of times the sequence was sequenced, i.e. its depth.
- The third column is the ID of the sequence.
- Step 408: read sequences with a sequencing depth of 1 are filtered out. Sequences with a depth of 1 are usually caused by sequencing errors; filtering them out reduces the number of SNP sites that arise from sequencing errors.
- Step 410: ungapped pairwise alignment is performed among the remaining read sequences to determine heterozygous sites.
- The number of mismatches allowed during alignment depends on the read length; for example, 1 mismatch is allowed when the read length is less than 50 nt, and 2 mismatches are allowed when the read length is 100 nt.
- Step 412: the heterozygosity rate of the individual's genome is obtained from the total number of heterozygous sites.
- In the above embodiment, heterozygous sites on RAD fragments are found by processing the RAD read data directly, and the heterozygosity rate follows from them without relying on data from a known genome, which overcomes some technical bottlenecks of traditional methods for obtaining the heterozygosity rate.
- RAD sequencing enriches and sequences specific regions of the genome, which reduces the amount of sequencing data; the different analysis method and the smaller data volume together reduce the computing resources and sequencing cost required for the analysis.
- One embodiment of the invention proposes a new alignment method whose basic idea is as follows: ungapped pairwise alignment is performed among the distinct read sequences, using any sequence alignment software such as blast or blat; all read sequences that satisfy the alignment condition with the allowed mismatches are clustered. A cluster containing only one read sequence indicates that no heterozygous site exists at that read's position, whereas a cluster containing exactly two read sequences indicates that a heterozygous site exists at that position; such a site is usually not in a repeat region.
- Step 802: ungapped pairwise alignment is performed between every pair of read sequences.
- Step 804: all read sequences that satisfy the alignment condition are clustered.
- Step 806: the clusters that contain exactly two read sequences are selected; the positions of those read sequences carry heterozygous sites.
- Sequence 1 has multiple copies in the genome and therefore a high sequencing depth; one of the copies carries a heterozygous site relative to the corresponding homologous chromosome, which produces the alignment result shown in Figure 9.
- In one embodiment, the high sequencing depth refers to twice the average sequencing depth.
- Heterozygous sites in repeat regions are filtered out during processing.
- Figure 10 is a schematic diagram of an application example of the method of the invention for estimating the genomic heterozygosity rate.
- The data in this example are RAD sequencing reads from wild jiaobai, flowering jiaobai, and common jiaobai (Zizania latifolia).
- RAD sequencing is a method well known in the art; see, for example, the references listed in the description.
- Step 1002: the reads of the three jiaobai samples are filtered by sequencing quality value, N content, and presence of the restriction-enzyme cut-end sequence, and unqualified reads are removed.
- The statistics of the valid data are given in Table 1 (valid RAD sequencing data for the three jiaobai samples).
- Step 1004: reads with identical sequences are counted to obtain the depth of each distinct read sequence, and read sequences with a sequencing depth of 1 are filtered out.
- Table 2: read statistics for the three jiaobai samples.
- Step 1006: the distinct read sequences are aligned pairwise to determine heterozygous sites.
- The number of mismatches allowed in the alignment is, for example, 1, i.e. at most one mismatch per read.
- The alignment condition is that two sequences differ at only one base, in which case they are placed in the same class. If sequence A and sequence B differ at only one base, and B and C differ at only one other base, the three sequences are placed in one class, and so on; by aligning all read sequences against each other, all read sequences that satisfy the alignment condition are clustered. Clusters containing only one read sequence or exactly two read sequences are then picked out.
- A cluster with only one read sequence indicates that no heterozygous site exists at that read's position. A cluster with exactly two read sequences indicates that a heterozygous site exists at that position; such a site is usually not in a repeat region.
- Step 1008: heterozygous sites in repeat regions are removed.
- Step 1010: the heterozygosity rate of the genome is calculated from the number of heterozygous sites.
- Figure 11 is a structural diagram of one embodiment of the apparatus of the invention for estimating the genomic heterozygosity rate.
- The apparatus comprises a read acquisition device 111 for obtaining RAD single-end sequencing reads of an individual's genome.
- A read filtering device 112 filters the obtained RAD single-end reads to remove unqualified reads.
- Unqualified reads include, for example: reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or reads in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or reads containing exogenous sequence; and/or reads whose first few bases are not the restriction-enzyme cut-end sequence.
- A sequencing depth determination device 113 counts reads with identical sequences to obtain the depth of each distinct read sequence.
- A sequence depth filtering device 114 filters out read sequences with a sequencing depth of 1.
- A heterozygous site determination device 115 performs ungapped pairwise alignment among the remaining read sequences to determine heterozygous sites.
- A heterozygosity rate acquisition device 117 obtains the heterozygosity rate of the individual's genome from the total number of heterozygous sites.
- Figure 12 is a structural diagram of another embodiment of the apparatus of the invention for estimating the genomic heterozygosity rate.
- This embodiment further includes a repeat-region heterozygous site removal device 126.
- The repeat-region removal device 126 removes heterozygous sites located in repeat regions of the genomic DNA sequence.
- The repeat-region removal device 126 treats a site as a heterozygous site in a repeat region of the DNA sequence when the following conditions are met:
- The read sequence has multiple copies in the genome and a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome.
- The high sequencing depth refers to twice the average sequencing depth.
- The heterozygous site determination device 115 comprises: an alignment unit 1151 for performing ungapped pairwise alignment between every pair of read sequences; a clustering unit 1152 for clustering all read sequences that satisfy the alignment condition; and a heterozygous site determination unit 1153 for selecting the clusters that contain exactly two read sequences, whose positions carry heterozygous sites.
- Code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements.
- The code can reside on a computer-readable medium.
- The computer-readable medium can include one or more storage devices, for example RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
- The computer-readable medium can also include a carrier wave that encodes a data signal.
- The method and apparatus provided for marking genomic SNP sites directly compare the RAD sequencing data of two individuals to determine the SNP site information on the RAD fragments, which overcomes the bottleneck that non-model organisms lack a reference sequence, simplifies the genome analysis workflow, and reduces sequencing cost.
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Physics & Mathematics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biophysics (AREA)
- Biotechnology (AREA)
- Theoretical Computer Science (AREA)
- Organic Chemistry (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Molecular Biology (AREA)
- Microbiology (AREA)
- Immunology (AREA)
- Biochemistry (AREA)
- General Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
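The present invention discloses a method and apparatus for estimating the heterozygosity rate of a genome. The method comprises: obtaining RAD single-end sequencing reads of an individual's genome; filtering the RAD single-end reads to remove unqualified reads; counting reads with identical sequences to obtain the depth of each distinct read sequence; filtering out read sequences with a sequencing depth of 1; performing ungapped pairwise alignment among the remaining read sequences to determine heterozygous sites; and obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites.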
Description
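Method and apparatus for estimating genomic heterozygosity rate
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method and apparatus for estimating the heterozygosity rate of a genome.
Background Art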
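Heterozygosity of a genome refers to single-base differences at the same position on the homologous chromosomes of a diploid or polyploid individual.
Second-generation DNA sequencing is a high-throughput, low-cost sequencing technology whose basic principle is sequencing by synthesis. Taking the Solexa sequencing method as an example, the DNA strand is first randomly sheared by physical means, and specific adapters carrying amplification primer sequences are then added to both ends of each fragment. During sequencing, DNA polymerase synthesizes the complementary strand of the fragment to be sequenced, and the sequence of the fragment is obtained by detecting the fluorescent signal carried by each newly synthesized base. These sequences are called sequencing fragments or reads (http://www.illumina.com).
Sequencing the DNA of a species and assembling it de novo generally requires a rough prior understanding of the species' sequence. Sequence assembly restores the genome sequence from the overlaps between sequencing fragments, so if the heterozygosity rate is too high, de novo assembly of whole-genome shotgun data gives poor results. A genome survey is therefore usually performed before de novo assembly to determine the heterozygosity rate of the genome.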
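The traditional way to survey a genome requires whole-genome sequencing at a depth of roughly 20-30x. After the sequencing data are obtained, a k-mer frequency distribution is built from the reads and used to estimate the genomic heterozygosity rate. Specifically, assume that a complete contiguous sequence exists and take a fragment of length K from it at random; such a fragment is called a k-mer. Thus, when the read length is L and the k-mer length is K, each read yields L-K+1 k-mers. Counting how often each distinct k-mer occurs across all reads then gives the k-mer frequency distribution. The specific process is shown in Figure 1.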
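The k-mer counting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in the prior art: the function names `kmer_counts` and `kmer_frequency_distribution` and the toy reads are assumptions introduced here. The second function follows the axes described for Figure 1 (depth against the share of distinct k-mers at that depth), expressed as a fraction rather than a percentage.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer over all reads; a read of length L yields L-k+1 k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_frequency_distribution(counts):
    """Map each depth to the fraction of distinct k-mers observed at that depth."""
    species_per_depth = Counter(counts.values())
    total_species = len(counts)
    return {depth: n / total_species
            for depth, n in sorted(species_per_depth.items())}

# Toy input (hypothetical): two overlapping 9-nt reads, K = 5
reads = ["AATTCATCG", "ATTCATCGA"]
counts = kmer_counts(reads, k=5)
dist = kmer_frequency_distribution(counts)
```

Real survey data contain billions of k-mers, so dedicated counting tools are normally used; the dictionary-based approach here only shows the bookkeeping.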
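According to Lander-Waterman statistics, the frequency distribution of genomic k-mers can be regarded as approximately following a Poisson distribution. According to Poisson theory, the sequencing depth at the peak is the average sequencing depth of the genome. For a diploid, if the genomic heterozygosity rate is relatively high, a heterozygous peak appears at one half of the depth of the main peak of the k-mer distribution. To estimate the heterozygosity rate of the genome, data from another genome are needed for simulation; for example, the Arabidopsis genome is used to simulate the heterozygosity rate of the target genome. In Arabidopsis, a specific heterozygosity rate is set artificially to generate simulated reads at the same sequencing depth as the target genome, and a k-mer frequency distribution is then obtained from the simulated reads. By setting different heterozygosity rates and comparing the simulated k-mer distribution with that of the target genome for consistency, the heterozygosity rate of the target genome is estimated, as shown in Figure 2.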
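The Poisson statement above can be written out explicitly. The symbols are assumptions introduced here for illustration: $d$ is the depth of a k-mer and $\lambda$ is the mean k-mer depth of the genome.

```latex
P(X = d) = \frac{\lambda^{d} e^{-\lambda}}{d!}, \qquad d = 0, 1, 2, \dots
```

This probability is largest near $d \approx \lambda$, which is why the main peak is read as the average depth. A k-mer that spans a heterozygous site is present on only one of the two homologous chromosomes of a diploid, so its expected depth is $\lambda/2$, which is where the heterozygous peak appears.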
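Because this traditional genome survey method requires whole-genome sequencing at a depth of roughly 20-30x, its cost is high; because the volume of sequencing data is large, processing it needs substantial computing resources; and because data from a known genome are required for simulation, the processing steps and data volume increase further. A new genome survey method is therefore urgently needed that can conveniently estimate the heterozygosity rate of a genome from a smaller amount of sequencing data, reducing the very high sequencing and computing-resource costs of the traditional method.
Summary of the Invention
The present invention is proposed in view of the above problems.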
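A first aspect of the invention provides a method for estimating the heterozygosity rate of a genome, comprising: obtaining RAD single-end sequencing reads of an individual's genome; filtering the RAD single-end reads to remove unqualified reads; counting reads with identical sequences to obtain the depth of each distinct read sequence; filtering out read sequences with a sequencing depth of 1; performing ungapped pairwise alignment among the remaining read sequences to determine heterozygous sites; and obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites.
Preferably, the number of mismatches allowed in the ungapped pairwise alignment is determined by the read length; that is, the alignment condition of the ungapped pairwise alignment is set according to the length of the reads.
Preferably, performing ungapped pairwise alignment among the read sequences to determine heterozygous sites comprises: performing ungapped pairwise alignment between every pair of read sequences; clustering all read sequences that satisfy the alignment condition; and selecting the clusters that contain exactly two read sequences, whose positions carry heterozygous sites.
Preferably, the method further comprises removing heterozygous sites located in repeat regions of the genome sequence.
A site is treated as a heterozygous site in a repeat region of the genome sequence when the read sequence has multiple copies in the genome and a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome. In one embodiment of the invention, the high sequencing depth refers to twice the average sequencing depth.
Preferably, unqualified reads include: reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or reads in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or reads containing exogenous sequence; and/or reads whose first few bases are not the restriction-enzyme cut-end sequence.
Preferably, obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites comprises: dividing the total number of heterozygous sites by the total length of the read sequences in non-repeat regions, which gives the heterozygosity rate at the RAD-sequenced positions of the individual and approximates the heterozygosity rate of the whole genome.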
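Another aspect of the invention provides an apparatus for estimating the heterozygosity rate of a genome, comprising: a read acquisition device for obtaining RAD single-end sequencing reads of an individual's genome; a read filtering device for filtering the obtained RAD single-end reads to remove unqualified reads; a sequencing depth determination device for counting reads with identical sequences to obtain the depth of each distinct read sequence; a sequence depth filtering device for filtering out read sequences with a sequencing depth of 1; a heterozygous site determination device for performing ungapped pairwise alignment among the remaining read sequences to determine heterozygous sites; and a heterozygosity rate acquisition device for obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites.
Preferably, the number of mismatches allowed in the ungapped pairwise alignment is determined by the read length; that is, the alignment condition of the ungapped pairwise alignment is set according to the length of the reads.
Preferably, the heterozygous site determination device comprises: an alignment unit for performing ungapped pairwise alignment between every pair of read sequences; a clustering unit for clustering all read sequences that satisfy the alignment condition; and a heterozygous site determination unit for selecting the clusters that contain exactly two read sequences, whose positions carry heterozygous sites.
Preferably, the apparatus further comprises a repeat-region heterozygous site removal device for removing heterozygous sites located in repeat regions of the genome sequence.
Preferably, the repeat-region heterozygous site removal device treats a site as a heterozygous site in a repeat region of the genome sequence when the read sequence has multiple copies in the genome and a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome. In one embodiment of the invention, the high sequencing depth refers to twice the average sequencing depth.
Preferably, unqualified reads include: reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or reads in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or reads containing exogenous sequence; and/or reads whose first few bases are not the restriction-enzyme cut-end sequence.
An advantage of the invention is that the heterozygosity rate of a genome can be estimated conveniently from partial sequencing of the genome, which reduces sequencing and computing-resource costs, requires no known genome data for simulation, and simplifies the processing steps.
Other features and advantages of the invention will become clear from the following detailed description of exemplary embodiments with reference to the drawings.
Brief Description of the Drawings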
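Figure 1 is a schematic flow chart of obtaining a k-mer frequency distribution from sequencing reads in the prior art;
in the figure, the abscissa is the sequencing depth of the k-mers, and the ordinate is the number of distinct k-mers at a given sequencing depth as a percentage of the total number of distinct k-mers;
Figure 2 is a schematic diagram of simulating the heterozygosity rate of a target genome with the Arabidopsis genome in the prior art;
Figure 3 is a schematic diagram of the steps of the RAD sequencing technique;
Figure 4 is a flow chart of one embodiment of the method of the invention for estimating the genomic heterozygosity rate;
Figure 5 is a schematic diagram of an example of RAD single-end sequencing of a genome;
Figure 6 is a schematic diagram of how the depth information of read sequences is counted;
Figure 7 is a schematic diagram of how the depth information of read sequences is stored;
Figure 8 is a flow chart of an example of the read alignment of the invention;
Figure 9 is a schematic diagram of an example of a heterozygous site located in a repeat region;
Figure 10 is a schematic diagram of an application example of the method of the invention for estimating the genomic heterozygosity rate;
Figure 11 is a structural diagram of one embodiment of the apparatus of the invention for estimating the genomic heterozygosity rate;
Figure 12 is a structural diagram of another embodiment of the apparatus of the invention for estimating the genomic heterozygosity rate.
Detailed Description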
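Various exemplary embodiments of the invention will now be described in detail with reference to the drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
It should also be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item has been defined in one drawing it need not be discussed further in subsequent drawings.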
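In view of the problems in the prior art, the present disclosure provides a new bioinformatics analysis scheme that processes RAD (Restriction-site Associated DNA) data to find heterozygous site information on RAD sequencing fragments and thereby calculate the heterozygosity rate, which simplifies the processing steps of the prior art and reduces sequencing and computing-resource costs.
Several concepts involved in the technical solution of the invention are introduced below.
RAD sequencing uses a new library construction method, and its sequencing process is shown in Figure 3. A restriction endonuclease cuts the DNA at specific sites, and the digested DNA molecules are then randomly sheared by physical methods. DNA molecules of a specific length are selected by agarose gel separation, and specific amplification adapters and sequencing adapters are added to the ends of the selected DNA to construct a library for high-throughput sequencing.
The heterozygosity rate is the number of heterozygous sites in the non-repeat regions of the read sequences as a percentage of the total length of the read sequences in the non-repeat regions.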
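As a worked illustration of this definition, the ratio can be computed as below. The function name `heterozygosity_rate` and the input numbers are hypothetical and chosen only to show the arithmetic; they are not results from the patent.

```python
def heterozygosity_rate(num_het_sites, nonrepeat_total_length):
    """Heterozygous sites divided by the total length (in bases) of the
    read sequences in non-repeat regions, returned as a fraction."""
    if nonrepeat_total_length <= 0:
        raise ValueError("total length must be positive")
    return num_het_sites / nonrepeat_total_length

# Hypothetical numbers: 1,200 heterozygous sites over 1,000,000 bases
rate = heterozygosity_rate(1200, 1_000_000)
percent = rate * 100  # the definition above is stated as a percentage
```

With these made-up inputs the fraction is 0.0012, i.e. about 0.12%.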
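Ungapped pairwise alignment means that no gap may be opened during alignment; alignments that require an opened gap are not considered. For example, the alignment of the following two sequences, in which the blank in sequence 2 is an opened gap, does not satisfy the ungapped pairwise alignment condition:
Sequence 1: AATTCATCGAC
Sequence 2: AA CATCGTC.
The average sequencing depth is the total depth of the clusters divided by the number of clusters.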
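Figure 4 is a flow chart of one embodiment of the method of the invention for estimating the genomic heterozygosity rate.
As shown in Figure 4, in step 402, RAD single-end sequencing reads of an individual's genome are obtained. Figure 5 is a schematic diagram of an example of RAD single-end sequencing. Figure 5 shows the restriction endonuclease EcoRI recognizing the palindromic sequence "GAATTC" on the DNA molecule and cutting the DNA between G and A; the digested DNA is sheared into short fragments by physical methods, an adapter is added to the restriction-cut end, and the DNA fragments are sequenced single-end. The read length is generally 50 nt and may also be 100 nt.
In step 404, the RAD single-end reads are filtered to remove unqualified reads. For example, after high-throughput RAD single-end reads are received, they are filtered to remove unqualified sequences. The high-throughput sequencing technology may be Illumina GA or any other existing high-throughput sequencing technology. Unqualified reads include, for example, reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the bases in the read. The low-quality threshold depends on the specific sequencing technology and sequencing environment; for example, it may be set to a single-base sequencing quality below 20. A read in which the number of bases with an undetermined call (such as N in Illumina GA results) exceeds 10% of the bases in the read is also considered unqualified. Reads are further aligned against exogenous sequences introduced by the experiment other than the sample adapter sequence, such as various adapter sequences; a read containing such an exogenous sequence is considered unqualified. Finally, a read whose first few bases are not the restriction-enzyme cut-end sequence is filtered out (for example, with the restriction endonuclease EcoRI, the whole read is filtered out if it does not start with "AATTC").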
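The four filtering criteria of step 404 can be sketched as a single predicate. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `is_qualified` and its parameters are hypothetical, quality scores are assumed to be already decoded into integers, and exogenous sequences are detected by exact substring match, whereas the text describes aligning reads against them.

```python
def is_qualified(seq, quals, adapters=(), site_prefix="AATTC",
                 low_q=20, max_low_q_frac=0.5, max_n_frac=0.1):
    """Return True if a read passes all four filters described in step 404."""
    n = len(seq)
    if n == 0 or len(quals) != n:
        return False
    # 1. More than 50% of bases below the low-quality threshold -> reject
    if sum(q < low_q for q in quals) > max_low_q_frac * n:
        return False
    # 2. More than 10% undetermined bases (N) -> reject
    if seq.count("N") > max_n_frac * n:
        return False
    # 3. Contains an exogenous (e.g. adapter) sequence -> reject
    #    (assumption: exact substring match stands in for alignment)
    if any(a in seq for a in adapters):
        return False
    # 4. Must start with the restriction-enzyme cut-end sequence
    return seq.startswith(site_prefix)
```

A read exactly at a threshold (for example, half of its bases below quality 20) is kept, because the text rejects reads only when the share exceeds the limit.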
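In step 406, identical reads are counted to obtain the depth information of each distinct read. For example, reads with identical sequences are counted, and each distinct read sequence is collected into a stack, which gives the sequencing depth of each distinct read. The procedure is shown in Fig. 6. The stack information may be stored in the manner shown in Fig. 7, in which the first column is the RAD read sequence, the second column is the number of times that sequence was sequenced (its depth), and the third column is the ID of the sequence.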
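The counting in step 406 can be sketched with a hash-based counter. The sketch below is illustrative only: `Stack` and `build_stacks` are hypothetical names, and the three fields mirror the three columns described for Fig. 7 (sequence, depth, ID). Assigning IDs in order of decreasing depth is an assumption; the text does not say how IDs are chosen.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Stack(NamedTuple):
    sequence: str   # the distinct read sequence
    depth: int      # number of times this sequence was observed
    stack_id: int   # identifier of the stack (assignment scheme is an assumption)


def build_stacks(reads: Iterable[str]) -> List[Stack]:
    """Collapse identical reads into stacks and record each stack's depth."""
    counts = Counter(reads)
    # most_common() sorts by decreasing depth; equal depths keep first-seen order.
    return [Stack(seq, depth, i)
            for i, (seq, depth) in enumerate(counts.most_common(), start=1)]
```

For example, `build_stacks(["AATTCA", "AATTCG", "AATTCA"])` yields one stack of depth 2 for `AATTCA` and one stack of depth 1 for `AATTCG`.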
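In step 408, reads with a sequencing depth of 1 are filtered out. Reads with a depth of 1 are usually the result of sequencing errors; filtering them out reduces the number of false SNP sites caused by sequencing errors.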
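In step 410, ungapped pairwise alignment is performed among the distinct reads obtained, to determine heterozygous sites. The number of mismatches allowed in the alignment depends on the read length; for example, 1 mismatch is allowed when the read length is less than 50 nt, and 2 mismatches are allowed when the read length is 100 nt.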
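For reads of equal length, an ungapped comparison amounts to counting position-by-position differences. The following sketch assumes that reading; the function names are hypothetical. The text gives only two example thresholds, so the cut-off used here (1 mismatch up to 50 nt, 2 mismatches for longer reads) is an assumption chosen to fit both examples.

```python
from typing import Optional


def allowed_mismatches(read_length: int) -> int:
    """Mismatch budget by read length (cut-off at 50 nt is an assumption)."""
    return 1 if read_length <= 50 else 2


def ungapped_mismatches(a: str, b: str) -> Optional[int]:
    """Count mismatches position by position.

    Returns None when the lengths differ, because the two sequences could
    then only be aligned end to end by opening a gap.
    """
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)


def satisfies_alignment_condition(a: str, b: str) -> bool:
    """True if a and b align without gaps within the mismatch budget."""
    d = ungapped_mismatches(a, b)
    return d is not None and d <= allowed_mismatches(len(a))
```

With these definitions, `AATTCATCGAC` and `AATTCATCGTC` differ at one position and satisfy the condition, whereas a pair of unequal length is rejected outright.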
In step 412, the heterozygosity rate of the individual's genome is obtained from the total number of heterozygous sites.
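Using the definition of the heterozygosity rate given earlier (heterozygous sites as a percentage of the total read length in non-repetitive regions), step 412 reduces to a single division. The sketch below is illustrative; `heterozygosity_rate` is a hypothetical name, and the numbers in the usage lines are invented for illustration and are not data from this document. Computing the total length as the number of loci times the read length is likewise an assumption.

```python
def heterozygosity_rate(num_het_sites: int, total_nonrepeat_length: int) -> float:
    """Heterozygous sites as a percentage of the non-repetitive sequenced length."""
    if total_nonrepeat_length <= 0:
        raise ValueError("total_nonrepeat_length must be positive")
    return 100.0 * num_het_sites / total_nonrepeat_length


# Hypothetical numbers, for illustration only:
# 20,000 non-repetitive loci of 50 nt each, carrying 1,200 heterozygous sites.
rate = heterozygosity_rate(1200, 20000 * 50)
print(f"{rate:.2f}%")
```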
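In the above embodiment, the RAD read data are processed directly to find heterozygous sites on the RAD fragments and then obtain the heterozygosity rate. The method does not rely on data from a known genome and overcomes some technical bottlenecks of conventional methods for obtaining the heterozygosity rate. RAD sequencing enriches specific regions of the genome, which reduces the amount of sequencing data; because of the different analysis method and the smaller data volume, both the computing resources required for the analysis and the sequencing cost are reduced.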
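According to the method of the present invention, one embodiment proposes a new alignment method whose basic idea is as follows. Ungapped pairwise alignment is performed among the distinct reads; any sequence alignment software, such as BLAST or BLAT, may be used. All reads that satisfy the alignment condition with the allowed number of mismatches are clustered. A cluster containing only one distinct read indicates that no heterozygous site exists at the position of the read; a cluster containing only two distinct reads indicates that a heterozygous site exists at the position of the reads, and such a heterozygous site is usually not located in a repetitive region.
The procedure is shown in Fig. 8:
Step 802: perform ungapped pairwise alignment among the distinct reads.
Step 804: cluster all reads that satisfy the alignment condition.
Step 806: select the clusters that contain only two distinct reads; a heterozygous site exists at the position of those reads.
The alignment method of the above embodiment requires little computation, is fast and efficient, and simplifies the processing steps of the conventional approach.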
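Steps 802 to 806 can be sketched as single-linkage clustering over the distinct reads. The source does not name a clustering algorithm; the union-find structure below is one plausible choice, and all function names are hypothetical. The all-pairs loop is quadratic in the number of distinct reads, so it is suitable only as an illustration on small inputs.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, Optional


def mismatch_positions(a: str, b: str) -> Optional[List[int]]:
    """Positions at which a and b differ; None if an ungapped comparison is impossible."""
    if len(a) != len(b):
        return None
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def cluster_reads(seqs: List[str], max_mismatches: int = 1) -> List[List[str]]:
    """Group distinct reads linked (directly or via a chain) by allowed mismatches."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Steps 802 and 804: compare every pair and merge pairs that satisfy the condition.
    for i, j in combinations(range(len(seqs)), 2):
        pos = mismatch_positions(seqs[i], seqs[j])
        if pos is not None and len(pos) <= max_mismatches:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, s in enumerate(seqs):
        groups[find(i)].append(s)
    return list(groups.values())


def heterozygous_clusters(clusters: List[List[str]]) -> List[List[str]]:
    """Step 806: keep only clusters made of exactly two distinct reads."""
    return [c for c in clusters if len(c) == 2]
```

For a selected two-read cluster, `mismatch_positions` applied to its two members gives the position of the heterozygous site within the read.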
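In one embodiment of the present invention, after the heterozygous sites have been determined by read alignment, heterozygous sites in repetitive regions must also be filtered out. Fig. 9 shows the case of a heterozygous site located in a repetitive region:
Sequence 1 has multiple copies in the genome and therefore a high sequencing depth; one of the copies carries a heterozygous site relative to the corresponding homologous chromosome, so the alignment result shown in Fig. 9 appears. In one embodiment of the present invention, the high sequencing depth means twice the average sequencing depth.
Heterozygous sites in repetitive regions are always filtered out during processing.
Through filtering of the RAD read data, alignment, and screening of repetitive regions, a set of heterozygous sites at the RAD-sequenced positions that is supported by sufficient depth information is finally obtained, and from it the heterozygosity rate of the RAD-sequenced positions.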
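The depth-based screening of repetitive regions can be sketched as follows, using the average sequencing depth as defined earlier (total depth of the clusters divided by the number of clusters). The function names are hypothetical. Treating a cluster whose depth is at least twice the average as repetitive is an assumption; the text does not say whether the comparison is strict.

```python
from typing import Dict


def average_depth(cluster_depths: Dict[str, int]) -> float:
    """Total depth of all clusters divided by the number of clusters."""
    return sum(cluster_depths.values()) / len(cluster_depths)


def drop_repeat_clusters(cluster_depths: Dict[str, int],
                         factor: float = 2.0) -> Dict[str, int]:
    """Keep clusters whose depth is below factor times the average depth.

    cluster_depths maps a cluster ID to the summed depth of its reads.
    Clusters at or above the threshold are treated as repetitive regions
    (inclusive comparison is an assumption).
    """
    if not cluster_depths:
        return {}
    threshold = factor * average_depth(cluster_depths)
    return {cid: d for cid, d in cluster_depths.items() if d < threshold}
```

For example, with cluster depths of 20, 22, 18, and 100, the average is 40 and the threshold is 80, so only the cluster of depth 100 is discarded. Note that high-depth clusters raise the average themselves, so a more robust statistic may be preferable in practice.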
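Because heterozygous sites are distributed fairly evenly along the genome sequence, RAD sequencing is equivalent to randomly sampling certain fragments of the genomic DNA sequence, and analysis of the RAD sequencing fragments gives the heterozygosity rate at all RAD-sequenced positions. Because RAD sequencing can cover three to six percent of the genome sequence, the sample size is large. The heterozygosity rate at the sequenced positions can therefore be used to approximate the heterozygosity rate of the whole genome.
Fig. 10 shows a schematic diagram of an application example of the method for estimating the genome heterozygosity rate according to the present invention. This example uses RAD sequencing data (that is, reads data) from wild, flowering, and common Zizania latifolia (jiaobai). RAD sequencing is a method well known in the art; see, for example, the following references: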
(1) Michael R. Miller, Tressa S. Atwood, B. Frank Eames, et al., RAD marker microarrays enable rapid mapping of zebrafish mutations, Genome Biology, 2007, 8(6): R105.1-R105.10;
(2) Michael R. Miller, Joseph P. Dunham, Angel Amores, et al., Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers, Genome Research, 2007, 17: 240-248;
(3) Nathan A. Baird, Paul D. Etter, Tressa S. Atwood, et al., Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers, PLoS ONE, 2008, 3(10): e3376, doi:10.1371/journal.pone.0003376.
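Conventional methods show that the heterozygosity rate of common Zizania latifolia is greater than that of wild Zizania latifolia, which in turn is greater than that of flowering Zizania latifolia.
The procedure of this example is shown in Fig. 10. In step 1002, the sequencing reads of the three types of Zizania latifolia are filtered according to sequencing quality values, N content, and whether they contain the restriction-site end sequence, and unqualified reads are removed. Statistics of the resulting valid data are shown in Table 1.
Table 1. Statistics of valid RAD sequencing data for the three types of Zizania latifolia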
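In step 1006, pairwise alignment is performed on the read data, in which identical sequences have been merged, to determine heterozygous sites. The number of mismatches allowed in the alignment is, for example, 1; that is, at most one heterozygous site is allowed on a read. Specifically, the alignment condition is that two sequences differ at only one base, in which case the two sequences are assigned to the same cluster. If sequence A and sequence B differ at only one base, and B and C differ at only one other base, the three sequences are assigned to one cluster, and so on. By aligning all reads against one another, all reads that satisfy the alignment condition can be clustered. Clusters containing only one read or only two reads are then selected. A cluster containing only one read indicates that no heterozygous site exists at the position of the read; a cluster containing only two reads indicates that a heterozygous site exists at the position of the reads, and such a heterozygous site is usually not located in a repetitive region.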
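In step 1008, heterozygous sites in repetitive regions are removed.
In step 1010, the heterozygosity rate of the genome is calculated from the number of heterozygous sites.
In summary, after processing through the above steps and calculating the heterozygosity rate, the results shown in Table 3 are obtained.
Table 3. Statistics of clustering results for reads in non-repetitive regions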
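It can be seen that the results of estimating the genome heterozygosity rate by sampling the genome with RAD sequencing agree with the results of conventional analysis methods.
Fig. 11 shows a structural diagram of one embodiment of the apparatus for estimating the genome heterozygosity rate according to the present invention. As shown in Fig. 11, the apparatus includes the following devices. A read acquisition device 111 obtains RAD single-end sequencing reads of the genome of an individual. A read filtering device 112 filters the obtained RAD single-end reads to remove unqualified reads. Unqualified reads include, for example: a read in which the number of bases whose sequencing quality is below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or a read in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or a read that contains an exogenous sequence; and/or a read whose first few bases are not the restriction-site end sequence. A sequencing depth determination device 113 counts identical reads to obtain the depth information of each distinct read. A sequence depth filtering device 114 filters out reads with a sequencing depth of 1. A heterozygous site determination device 115 performs ungapped pairwise alignment among the distinct reads obtained, to determine heterozygous sites. A heterozygosity rate acquisition device 117 obtains the heterozygosity rate of the individual's genome from the total number of heterozygous sites.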
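Fig. 12 shows a structural diagram of another embodiment of the apparatus for estimating the genome heterozygosity rate according to the present invention. Compared with Fig. 11, this embodiment further includes a repeat-region site removal device 126, which removes heterozygous sites located in repetitive regions of the genomic DNA sequence. For example, the repeat-region site removal device 126 regards a heterozygous site as located in a repetitive region of the DNA sequence when the following condition is met: the read has multiple copies in the genome and has a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome. In one embodiment of the present invention, the high sequencing depth means twice the average sequencing depth.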
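According to one embodiment of the present invention, the heterozygous site determination device 115 includes: an alignment unit 1151 for performing ungapped pairwise alignment among the distinct reads; a clustering unit 1152 for clustering all reads that satisfy the alignment condition; and a heterozygous site determination unit 1153 for selecting the clusters that contain only two distinct reads, a heterozygous site existing at the position of those reads.
For the functions of the devices and units in Figs. 11 and 12, reference may be made to the corresponding parts of the method embodiments described above; for brevity, they are not described again here.
Those skilled in the art will understand that each device in Figs. 11 and 12 may be implemented by a separate computing device, or the devices may be integrated into a single standalone device. They are shown as blocks in Figs. 11 and 12 to illustrate their functions. These functional blocks may be implemented in hardware, software, firmware, middleware, microcode, a hardware description language, or any combination thereof. For example, one or both functional blocks may be implemented by code running on a microprocessor, a digital signal processor (DSP), or any other suitable computing device. The code may represent a procedure, function, subprogram, program, routine, subroutine, module, or any combination of instructions, data structures, or program statements. The code may reside in a computer-readable medium. The computer-readable medium may include one or more storage devices, including, for example, RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. The computer-readable medium may also include a carrier wave that encodes a data signal.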
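The method and apparatus for marking genomic SNP sites provided by the present disclosure directly match the RAD sequencing data of two individuals to determine the SNP site information on the RAD fragments, which overcomes the bottleneck that non-model organisms lack a reference sequence, reduces the complexity of genome analysis, and also reduces the sequencing cost.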
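The method and apparatus for estimating the genome heterozygosity rate according to the present invention have thus been described in detail. Some details well known in the art have not been described, in order to avoid obscuring the concept of the present invention. From the above description, those skilled in the art will fully understand how to implement the technical solutions disclosed here.
Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present invention. Those skilled in the art should understand that the above embodiments may be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.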
Claims
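1. A method for estimating the heterozygosity rate of a genome, characterized by comprising:
obtaining RAD single-end sequencing reads of the genome of an individual;
filtering the RAD single-end reads to remove unqualified reads;
counting identical reads to obtain the depth information of each distinct read;
filtering out reads with a sequencing depth of 1;
performing ungapped pairwise alignment among the distinct reads obtained, to determine heterozygous sites;
obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites.
2. The method according to claim 1, characterized in that the number of mismatches allowed in the ungapped pairwise alignment is determined according to the length of the reads.
3. The method according to claim 1, characterized in that performing ungapped pairwise alignment among the distinct reads obtained, to determine heterozygous sites, comprises:
performing ungapped pairwise alignment among the distinct reads;
clustering all reads that satisfy the alignment condition;
selecting the clusters that contain only two distinct reads, a heterozygous site existing at the position of those reads.
4. The method according to claim 1, characterized by further comprising:
removing heterozygous sites located in repetitive regions of the genome sequence.
5. The method according to claim 4, characterized in that a heterozygous site is regarded as located in a repetitive region of the genome sequence when the following condition is met:
the read has multiple copies in the genome and has a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome; the high sequencing depth is, for example, twice the average sequencing depth.
6. The method according to claim 1, characterized in that the unqualified reads include:
a read in which the number of bases whose sequencing quality is below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or
a read in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or
a read that contains an exogenous sequence; and/or
a read whose first few bases are not the restriction-site end sequence.
7. The method according to claim 1, characterized in that obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites comprises: dividing the total number of heterozygous sites by the total length of the reads in non-repetitive regions to obtain the heterozygosity rate at the RAD-sequenced positions of the sequenced individual, and using it to approximate the heterozygosity rate of the whole genome.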
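8. An apparatus for estimating the heterozygosity rate of a genome, characterized by comprising:
a read acquisition device for obtaining RAD single-end sequencing reads of the genome of an individual;
a read filtering device for filtering the obtained RAD single-end reads to remove unqualified reads;
an identical-read counting device for counting identical reads to obtain the depth information of each distinct read;
a sequence depth filtering device for filtering out reads with a sequencing depth of 1;
a heterozygous site determination device for performing ungapped pairwise alignment among the distinct reads obtained, to determine heterozygous sites;
a heterozygosity rate acquisition device for obtaining the heterozygosity rate of the individual's genome from the total number of heterozygous sites.
9. The apparatus according to claim 8, characterized in that the number of mismatches allowed in the ungapped pairwise alignment is determined according to the length of the reads.
10. The apparatus according to claim 8, characterized in that the heterozygous site determination device comprises: an alignment unit for performing ungapped pairwise alignment among the distinct reads; a clustering unit for clustering all reads that satisfy the alignment condition; and a heterozygous site determination unit for selecting the clusters that contain only two distinct reads, a heterozygous site existing at the position of those reads.
11. The apparatus according to claim 8, characterized by further comprising:
a repeat-region heterozygous site removal device for removing heterozygous sites located in repetitive regions of the genome sequence.
12. The apparatus according to claim 11, characterized in that the repeat-region heterozygous site removal device regards a heterozygous site as located in a repetitive region of the genome sequence when the following condition is met:
the read has multiple copies in the genome and has a high sequencing depth, and one of the copies carries a heterozygous site relative to the corresponding homologous chromosome; the high sequencing depth is, for example, twice the average sequencing depth.
13. The apparatus according to claim 8, characterized in that the unqualified reads include:
a read in which the number of bases whose sequencing quality is below a predetermined low-quality threshold exceeds 50% of the bases in the read; and/or
a read in which the number of bases with an undetermined call exceeds 10% of the bases in the read; and/or
a read that contains an exogenous sequence; and/or
a read whose first few bases are not the restriction-site end sequence.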
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2011/084915 WO2013097143A1 (zh) | 2011-12-29 | 2011-12-29 | Method and apparatus for estimating genome heterozygosity rate |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2011/084915 WO2013097143A1 (zh) | 2011-12-29 | 2011-12-29 | Method and apparatus for estimating genome heterozygosity rate |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013097143A1 true WO2013097143A1 (zh) | 2013-07-04 |
Family
ID=48696223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2011/084915 WO2013097143A1 (zh) | Method and apparatus for estimating genome heterozygosity rate | 2011-12-29 | 2011-12-29 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2013097143A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
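CN108192979A (zh) * | 2017-07-20 | 2018-06-22 | Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences | Female-specific marker of the Chinese giant salamander and application thereof |
CN108192954A (zh) * | 2017-05-04 | 2018-06-22 | Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences | Method for screening female-specific fragments of the Chinese giant salamander by RAD sequencing and for genetic sex detection |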
Non-Patent Citations (3)
Title |
---|
CATCHEN, J.M. ET AL.: "Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences.", G3 (BETHESDA)., vol. 1, no. 3, August 2011 (2011-08-01), pages 171 - 82 * |
DAVEY, J.W. ET AL.: "RADSeq: next-generation population genetics.", BRIEFINGS IN FUNCTIONAL GENOMICS., vol. 9, no. 5-6, December 2010 (2010-12-01), pages 416 - 23 * |
HOHENLOHE, P.A. ET AL.: "Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout.", MOL ECOL RESOUR., vol. 1, March 2011 (2011-03-01), pages 117 - 22 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
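CN108192954A (zh) * | 2017-05-04 | 2018-06-22 | Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences | Method for screening female-specific fragments of the Chinese giant salamander by RAD sequencing and for genetic sex detection |
CN108192954B (zh) * | 2017-05-04 | 2021-03-05 | Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences | Method for screening female-specific fragments of the Chinese giant salamander by RAD sequencing and for genetic sex detection |
CN108192979A (zh) * | 2017-07-20 | 2018-06-22 | Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences | Female-specific marker of the Chinese giant salamander and application thereof |
CN108192979B (zh) * | 2017-07-20 | 2021-03-23 | Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences | Female-specific marker of the Chinese giant salamander and application thereof |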
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 11878735 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/11/2014) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 11878735 Country of ref document: EP Kind code of ref document: A1 |