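WO2017000859A1 - Leaping search algorithm for similar sub-sequences in character sequences and application thereof in searching in biological sequence database - Google Patents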

Leaping search algorithm for similar sub-sequences in character sequences and application thereof in searching in biological sequence database

Info

Publication number
WO2017000859A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
algorithm
database
interval
seed
Prior art date
Application number
PCT/CN2016/087300
Other languages
English (en)
French (fr)
Inventor
许跃生
陈颖
叶纬材
张永东
Original Assignee
中山大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中山大学
Priority to US15/740,038 (US20180174681A1)
Publication of WO2017000859A1 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/20 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24542 Plan optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the invention relates to the technical field of searching for character sequence similar substrings and character database searching, in particular to a leaping search algorithm for character sequence similar substrings and a search application thereof on a biological sequence database.
  • the algorithm is used to search for similar substrings between character sequences. By finding the seed of similar substrings, the purpose of quickly retrieving similar substrings is achieved. Seed-based similar character substring search algorithms have been widely used. For example, BLAST and BWA algorithms commonly used in biological sequence analysis are representative of them. The algorithm is described below by taking a biological sequence search as an example (but not limited to a biological sequence). To search for a perfectly matched seed with a length of at least w between the character sequence database and the query sequence, existing solutions can be classified into three categories.
  • the first type of algorithm constructs a lookup table for the query sequence.
  • the lookup table is a hash table, and each entry is a linear linked list that records all occurrences of a sequence of length k in the query sequence.
  • This type of algorithm then performs a leaping scan of the database sequence.
  • A leaping scan examines a subsequence of length k once every w-k+1 positions.
  • The examination uses the lookup table to find the positions at which this subsequence occurs in the query sequence, each position corresponding to a seed of length k (a k seed), and then compares the regions to the left and right of each k seed to check whether that k seed is contained in a w seed.
  • This seed finding algorithm is applied to MegaBLAST.
  • the second type of algorithm builds a lookup table for the database.
  • This type of lookup table is also a hash table. Each entry corresponds to a short sequence of length k. There are two typical solutions.
  • The first solution is represented by Indexed MegaBLAST.
  • A subsequence of length k is taken from the database every w-k+1 positions, and its position is added to the corresponding hash table entry.
  • The algorithm then examines all subsequences of length k in the query sequence to find k seeds, and checks whether these k seeds are contained in w seeds in the same way as the first type of algorithm.
  • The second solution is represented by BLAT: all non-overlapping subsequences of length w in the database sequence are taken and their positions recorded in the corresponding hash table entries; all subsequences of length w of the query sequence are then examined and their position lists looked up in the hash table, each position in a list corresponding to a w seed.
  • The third type of solution builds an FM index or an FMD index for the database and uses the index to find maximal matching regions of length at least w.
  • A maximal matching region is a perfectly matched region that cannot be extended further to the left or to the right.
  • The sequence alignment software Cushat uses the FM index: starting from the first character of the query sequence, it appends one character on the right at each step until the search result is the empty set. The algorithm then resumes from where the previous step stopped and continues this process.
  • The sequence alignment software BWA-MEM uses the FMD index to find super-maximal matches. A super-maximal match is also a maximal match, but its segment on the query sequence cannot be covered by the segment of any other maximal match on the query sequence.
  • The first type of solution above was the earliest to be proposed and is also the least efficient of the three for seed search.
  • In the second type, Indexed MegaBLAST runs very fast on small databases and short query sequences.
  • On large databases and long query sequences, however, its lookup table becomes very large and performance drops sharply, to the point of being less efficient than MegaBLAST.
  • BLAT is not an exact algorithm; it cannot guarantee finding all w seeds.
  • The third type of solution performs well but likewise cannot guarantee finding all seeds of length at least w, which reduces the final search accuracy.
  • the main object of the present invention is to overcome the shortcomings and deficiencies of the prior art, and to provide a leaping search algorithm for character sequence similar substrings and a search application thereof on a biological sequence database.
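  • S0: Build a lookup table for the FMD index of the database. The lookup table may be implemented in various ways; each entry corresponds to a short sequence of length k and stores the double interval obtained by searching for that short sequence in the FMD index;
  • S1: Compute the lookup value of a k subsequence in the query sequence, and retrieve from the lookup table the double interval of the corresponding seed of length k;
  • S2: Using the backward search algorithm, progressively extend the matching region on the left side of the k seed;
  • S3: Run the forward search algorithm on the interval as it was before shrinking in step S2, to find the matching region on the right side of the k seed;
  • S4: Check whether the current examination position is at the end of the query sequence; if so, the algorithm terminates; otherwise, go to step S5;
  • S5: Advance the current examination position by w-k+1 positions and repeat steps S2-S5, where w is the length of the seeds to be found.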
  • As a preferred technical solution, the FMD index is specifically as follows:
  • A subsequence of length k of a character sequence is called a k subsequence.
  • A perfectly matched segment of length w between the query sequence and the database sequence is called a w seed.
  • When a sequence P is searched for in the FMD index of the database, the result is expressed as a double interval, and a double interval is represented by three integers.
  • Given the double interval of a nucleic acid sequence P and a character a, where a is an element of the alphabet, the backward search algorithm yields the double interval of aP, and the forward search algorithm yields the double interval of Pa.
  • The number of elements in a double interval is called the size of the interval and indicates the number of times P occurs in the database. If the double interval of P is empty, P does not occur in the database.
  • The lookup table occupies little space and does not grow as the database grows.
  • In step S2, if during the backward search the double interval shrinks or becomes empty, some k seeds have encountered a mismatched character pair while extending to the left. For the interval before shrinking, the algorithm must also use the forward search algorithm to find the matching part on the right side of the corresponding seeds, in an attempt to find seeds of length w.
  • In step S3, if during the forward search the search interval becomes empty, this region of the query sequence does not exist in the database; otherwise, the double interval of a w subsequence of the query sequence is obtained, and the algorithm outputs it as a result.
  • In step S5, it is not necessary to examine every k subsequence in the query sequence; it suffices to examine one every w-k+1 positions.
  • Seed search is performed by combining the FMD index with the lookup table, and the lookup table may be implemented in various ways.
  • The invention also provides an application of the leaping search algorithm for similar substrings of character sequences to searching biological sequence databases.
  • When the character sequences being processed are biological sequences, the FMD index is built on the biological sequence database or on the set of biological sequence query data.
  • The lookup table is built on that FMD index.
  • The biological sequences include DNA, protein, or RNA.
  • the present invention has the following advantages and beneficial effects:
  • The lookup table proposed by the present invention differs from existing lookup tables based on linear linked lists in that the size of a linked-list-based lookup table increases as the database grows. In a large database such a lookup table occupies a very large amount of storage, and long linked lists also make the seed search very time-consuming.
  • Compared with a lookup table built on an FM index, the lookup table proposed by the present invention stores the double intervals of short sequences, whereas a lookup table built on an FM index stores suffix array intervals.
  • Compared with existing algorithms, the seed search algorithm proposed by the present invention has the advantages of high accuracy and high efficiency.
  • Some traditional seed search algorithms cannot find all w seeds and so cannot guarantee search accuracy. Others do use a leaping scan but perform poorly on large databases and long query sequences, because they can only check one k seed at a time for containment in a w seed.
  • The seed search algorithm proposed by the present invention likewise uses a leaping scan, but it can check a whole batch of k seeds at once for containment in w seeds, which gives it a strong performance advantage.
  • FIG. 1 is a flow diagram of a leaping seed lookup algorithm in accordance with the present invention in conjunction with an FMD index and lookup table.
  • This embodiment takes biological sequence search as an example and applies the present invention to accelerating the MegaBLAST algorithm, achieving a speedup of more than ten times while producing the same results.
  • This embodiment mainly includes two parts: a lookup table and a seed search algorithm combined with an FMD index.
  • The lookup table addresses the shortcomings of the lookup tables used by the second type of algorithm in the background art above.
  • In those algorithms each entry of the lookup table is a linear linked list, so the lookup table occupies an enormous amount of storage on large databases.
  • Some algorithms, such as BLAT, index only a subset of the short sequences in order to reduce the size of the lookup table, which sacrifices search accuracy.
  • In the lookup table of the present invention each entry stores one double interval, and each double interval needs only three numbers, so the size of the lookup table is fixed and does not change as the database grows.
  • The seed search algorithm addresses the shortcomings of the second and third types of algorithms described above.
  • Indexed MegaBLAST can find all w seeds, but it is inefficient on large databases because the linked lists become very long, so it must check very frequently whether k seeds are contained in w seeds.
  • Other algorithms sacrifice accuracy to improve efficiency and can find only some of the w seeds.
  • The seed search algorithm of the present invention uses the leaping scan of the first type of algorithm and can check a batch of k seeds for containment in w seeds, which makes the algorithm exact while also highly efficient.
  • FMD index is short for bidirectional FM-index; FM is formed from the names of the two authors who proposed the FM index, Paolo Ferragina and Giovanni Manzini.
  • a subsequence of length k of the nucleic acid sequence is referred to as a k subsequence.
  • a perfectly matched segment of length w between the query sequence and the database sequence is called a w seed.
  • the sequence P is searched in the FMD index of the database, the search results are represented in the form of double intervals, and the double intervals are represented by three integers.
  • Given the double interval of a nucleic acid sequence P and a character a (one of A, C, G, T), the backward search algorithm yields the double interval of aP, and the forward search algorithm yields the double interval of Pa.
  • The number of elements in a double interval is called the size of the interval and indicates the number of times P occurs in the database. If the double interval of P is empty, P does not occur in the database.
  • S0: Build a lookup table for the FMD index of the database. The lookup table may be implemented in various ways (including but not limited to a hash table); each entry corresponds to a short sequence of length k and stores the double interval obtained by searching for that short sequence in the FMD index;
  • S1: Compute the lookup value of a k subsequence in the query sequence, and retrieve from the lookup table the double interval of the corresponding seed of length k;
  • S2: Using the backward search algorithm, progressively extend the matching region on the left side of the k seed;
  • S3: Run the forward search algorithm on the interval as it was before shrinking in step S2, to find the matching region on the right side of the k seed;
  • S4: Check whether the current examination position is at the end of the query sequence; if so, the algorithm terminates; otherwise, go to step S5;
  • S5: Advance the current examination position by w-k+1 positions and repeat steps S2-S5, where w is the length of the seeds to be found.
  • The lookup table of the present invention is built on the FMD index of a nucleic acid sequence database. It is also a hash table. Each entry corresponds to a short sequence of length k and stores the double interval obtained by searching for that short sequence in the FMD index.
  • The size of this lookup table is independent of the database size. Since each character can only be one of A, C, G, and T, the lookup table has 4^k entries in total. With this lookup table, the double interval of a k subsequence of the query sequence can be obtained immediately.
  • the second part of the algorithm is the seed search algorithm.
  • the flow is shown in Figure 1.
  • the algorithm starts with the first k subsequence of the query sequence and performs five main steps step by step.
  • the first step of the algorithm is the S1 portion of Figure 1, which computes the hash value of the k subsequence and takes its corresponding double interval from the lookup table.
  • The lookup table not only makes the leaping scan possible but also yields the double interval of the k subsequence in a single step, without having to apply the forward or backward search algorithm character by character, which saves a great deal of time.
  • The second main step of the algorithm is part S2 in FIG. 1.
  • This step uses the backward search algorithm to progressively find the matching region to the left of the k seed.
  • If during the backward search the double interval shrinks or becomes empty, some k seeds have encountered a mismatched character pair while extending to the left.
  • For the interval before shrinking, the algorithm must also find the matching part to the right of the corresponding seeds, in an attempt to find seeds of length w.
  • The third main step of the algorithm corresponds to part S3 in FIG. 1: it runs the forward search algorithm on the interval before shrinking in step 2 to find the matching region to the right of the k seed.
  • If during the forward search the search interval becomes empty, this region of the query sequence does not exist in the database; otherwise, the double interval of a w subsequence of the query sequence is obtained, and the algorithm outputs it as a result.
  • The fourth step of the algorithm is part S4 in FIG. 1, which checks whether the current examination position is at the end of the query sequence. If so, the algorithm terminates; otherwise, step 5 is executed.
  • The fifth step of the algorithm is part S5 in FIG. 1: it advances the current examination position by w-k+1 positions and repeats steps 2 to 5. This is the leaping scan.
  • The leaping scan does not need to examine every k subsequence in the query sequence; it only needs to examine one every w-k+1 positions. This way of scanning is very efficient while still guaranteeing that all w seeds are found.
  • The embodiment further provides an application of the leaping search algorithm for similar substrings of character sequences to searching biological sequence databases.
  • When the character sequences being processed are biological sequences, the FMD index is built on the biological sequence database or on the set of biological sequence query data.
  • The lookup table is built on that FMD index. The biological sequences include, but are not limited to, DNA, protein, or RNA; other types of biological sequences are equally suited to the technical solution of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Operations Research (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

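The present invention discloses a leaping search algorithm for similar substrings of character sequences and its search application on biological sequence databases. The algorithm comprises: S0, building an FMD index of the database and its lookup table; S1, retrieving from the lookup table the double interval of a subsequence of length k in the query sequence; S2, using the backward search algorithm to progressively find the matching region to the left of the k seed; S3, running the forward search algorithm on the interval before shrinking in step S2 to find the matching region to the right of the k seed; S4, checking whether the current examination position is at the end of the query sequence and, if so, terminating, otherwise going to step S5; S5, advancing the current examination position by w-k+1 positions and repeating steps S2-S5. The lookup table proposed by the invention occupies little space and is efficient to access; by combining the lookup table with the FMD index, the invention can quickly find all w seeds. The invention has also been applied successfully to biological sequence alignment.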

Description

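Leaping search algorithm for similar substrings of character sequences and its search application on biological sequence databases
Technical Field
The present invention relates to the technical field of searching for similar substrings of character sequences and searching character databases, and in particular to a leaping search algorithm for similar substrings of character sequences and its search application on biological sequence databases.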
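Background Art
This algorithm is used to search for similar substrings between character sequences; by finding the seeds of similar substrings, it retrieves similar substrings quickly. Seed-based algorithms for finding similar character substrings are already widely used; the BLAST and BWA algorithms commonly used in biological sequence analysis are representative examples. The algorithm is described below using biological sequence search as an example (without being limited to biological sequences). To search for perfectly matched seeds of length at least w between a character sequence database and a query sequence, existing solutions can be grouped into three categories.
The first category of algorithms builds a lookup table for the query sequence. The lookup table is a hash table in which each entry is a linear linked list recording all positions at which a sequence of length k occurs in the query sequence. Such algorithms then perform a leaping scan of the database sequence. A leaping scan examines a subsequence of length k once every w-k+1 positions. The examination uses the lookup table to find the positions at which this subsequence occurs in the query sequence, each position corresponding to a seed of length k (a k seed), and then compares the regions to the left and right of each k seed to check whether that k seed is contained in a w seed. This seed search algorithm is used in MegaBLAST.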
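The stride of w-k+1 can be checked with a small sketch. The code below is an illustration only (the function names and the values of n, k, and w are assumptions, not taken from the source): it tests whether every window of length w fully contains at least one of the k-mers examined by the scan, which is the property that lets a leaping scan touch every w seed.

```python
def sampled_starts(n, k, step):
    """Start positions of the k-mers examined when leaping through a length-n sequence."""
    return list(range(0, n - k + 1, step))


def every_window_hit(n, k, w, step):
    """True if every length-w window fully contains at least one sampled k-mer."""
    starts = sampled_starts(n, k, step)
    return all(
        any(s <= p and p + k <= s + w for p in starts)
        for s in range(n - w + 1)
    )


n, k, w = 100, 8, 28  # illustrative values, not taken from the source
print(every_window_hit(n, k, w, step=w - k + 1))  # stride used by the leaping scan
print(every_window_hit(n, k, w, step=w - k + 2))  # stride one position wider
```

With the stride w-k+1 the property holds; widening the stride by a single position already leaves some windows without a fully contained sampled k-mer, so w-k+1 is the largest safe stride.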
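The second category of algorithms builds a lookup table for the database. This lookup table is also a hash table, with each entry corresponding to a short sequence of length k. There are two typical solutions. The first, represented by Indexed MegaBLAST, takes a subsequence of length k from the database every w-k+1 positions and adds its position to the corresponding hash table entry. The algorithm then examines all subsequences of length k in the query sequence to find k seeds, and checks, in the same way as the first category, whether these k seeds are contained in w seeds. The second, represented by BLAT, takes all non-overlapping subsequences of length w from the database sequence and records their positions in the corresponding hash table entries; it then examines all subsequences of length w of the query sequence and looks up their position lists in the hash table, each position in a list corresponding to a w seed.
The third category of solutions builds an FM index or FMD index for the database and uses it to find maximal matching regions of length at least w. A maximal matching region is a perfectly matched region that cannot be extended further to the left or to the right. The sequence alignment software Cushat uses an FM index: starting from the first character of the query sequence, it appends one character on the right at each step until the search result is the empty set. The algorithm then resumes from where the previous step stopped and continues this process. The sequence alignment software BWA-MEM uses the FMD index to find super-maximal matches; a super-maximal match is also a maximal match, but its segment on the query sequence cannot be covered by the segment of any other maximal match on the query sequence.
The first category above was the earliest to be proposed and is also the least efficient of the three for seed search. In the second category, Indexed MegaBLAST runs very fast on small databases and short query sequences; on large databases and long query sequences, however, the lookup table becomes very large and performance drops sharply, to the point of being less efficient than MegaBLAST. BLAT is not an exact algorithm and cannot guarantee finding all w seeds. The third category of solutions performs well but likewise cannot guarantee finding all seeds of length at least w, which reduces the final search accuracy.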
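Summary of the Invention
The main object of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide a leaping search algorithm for similar substrings of character sequences and its search application on biological sequence databases.
To achieve the above object, the present invention adopts the following technical solutions:
The leaping search algorithm for similar substrings of character sequences according to the present invention comprises the following steps:
S0. Build a lookup table for the FMD index of the database. The lookup table may be implemented in various ways; each entry corresponds to a short sequence of length k and stores the double interval obtained by searching for that short sequence in the FMD index;
S1. Compute the lookup value of a k subsequence in the query sequence, and retrieve from the lookup table the double interval of the corresponding seed of length k;
S2. Using the backward search algorithm, progressively extend the matching region on the left side of the k seed;
S3. Run the forward search algorithm on the interval as it was before shrinking in step S2, to find the matching region on the right side of the k seed;
S4. Check whether the current examination position is at the end of the query sequence; if so, the algorithm terminates; otherwise, go to step S5;
S5. Advance the current examination position by w-k+1 positions and repeat steps S2-S5, where w is the length of the seeds to be found.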
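As a preferred technical solution, the FMD index is specifically as follows:
A subsequence of length k of a character sequence is called a k subsequence, and a perfectly matched segment of length w between the query sequence and the database sequence is called a w seed. When a sequence P is searched for in the FMD index of the database, the result is expressed as a double interval, which is represented by three integers. Given the double interval of a nucleic acid sequence P and a character a, where a is an element of the alphabet, the backward search algorithm yields the double interval of aP and the forward search algorithm yields the double interval of Pa. The number of elements in a double interval is called the size of the interval and indicates the number of times P occurs in the database; if the double interval of P is empty, P does not occur in the database.
As a preferred technical solution, the lookup table occupies little space and does not grow as the database grows.
As a preferred technical solution, in step S2, if during the backward search the double interval shrinks or becomes empty, some k seeds have encountered a mismatched character pair while extending to the left; for the interval before shrinking, the algorithm must also use the forward search algorithm to find the matching part on the right side of the corresponding seeds, in an attempt to find seeds of length w.
As a preferred technical solution, in step S3, if during the forward search the search interval becomes empty, this region of the query sequence does not exist in the database; otherwise, the double interval of a w subsequence of the query sequence is obtained, and the algorithm outputs it as a result.
As a preferred technical solution, in step S5, it is not necessary to examine every k subsequence in the query sequence; it suffices to examine one every w-k+1 positions.
As a preferred technical solution, step S0 is characterized in that seed search is performed by combining the FMD index with the lookup table, and the lookup table may be implemented in various ways.
The present invention also provides an application of the leaping search algorithm for similar substrings of character sequences to searching biological sequence databases: when the character sequences being processed are biological sequences, the FMD index is built on the biological sequence database or on the set of biological sequence query data, and the lookup table is built on that FMD index.
As a preferred technical solution, the biological sequences include DNA, protein, or RNA.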
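Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The lookup table proposed by the present invention differs from existing lookup tables based on linear linked lists in that the size of a linked-list-based lookup table increases as the database grows; in a large database such a lookup table occupies a very large amount of storage, and long linked lists also make the seed search very time-consuming. Compared with a lookup table built on an FM index, the lookup table proposed by the present invention stores the double intervals of short sequences, whereas a lookup table built on an FM index stores suffix array intervals.
2. Compared with existing algorithms, the seed search algorithm proposed by the present invention has the advantages of high accuracy and high efficiency. Some traditional seed search algorithms cannot find all w seeds and so cannot guarantee search accuracy. Others do use a leaping scan but perform poorly on large databases and long query sequences, because they can only check one k seed at a time for containment in a w seed.
3. The seed search algorithm proposed by the present invention likewise uses a leaping scan, but it can check a whole batch of k seeds at once for containment in w seeds, which gives it a strong performance advantage.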
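Brief Description of the Drawings
FIG. 1 is a flowchart of the leaping seed search algorithm of the present invention, which combines an FMD index with a lookup table.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to an embodiment and the accompanying drawing, but the embodiments of the present invention are not limited thereto.
Embodiment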
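This embodiment takes biological sequence search as an example and applies the present invention to accelerating the MegaBLAST algorithm, achieving a speedup of more than ten times while producing the same results. The embodiment consists of two main parts: a lookup table combined with the FMD index, and a seed search algorithm. The lookup table addresses the shortcomings of the lookup tables used by the second category of algorithms in the Background Art. In those algorithms each entry of the lookup table is a linear linked list, so the lookup table occupies an enormous amount of storage on large databases. Some algorithms, such as BLAT, index only a subset of the short sequences in order to reduce the size of the lookup table, which sacrifices search accuracy. In the lookup table of the present invention each entry stores one double interval, and each double interval needs only three numbers, so the size of the lookup table is fixed and does not change as the database grows.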
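The storage argument can be made concrete with a rough estimate. The sketch below is an assumption-laden illustration: the integer widths, the values of k and w, and the decision to ignore pointer and bucket overhead in the linked-list table are choices made here, not figures from the source. It contrasts a table with one three-integer double interval per k-mer against a table that stores one position per sampled database k-mer.

```python
def interval_table_bytes(k, bytes_per_int=8, alphabet=4):
    """Size of a table holding one three-integer double interval per possible k-mer."""
    return alphabet ** k * 3 * bytes_per_int


def position_list_bytes(db_len, k, w, bytes_per_pos=4):
    """Lower bound for a position-list table sampled every w-k+1 database positions.

    Linked-list pointers and the bucket array are ignored, so real tables are larger.
    """
    step = w - k + 1
    sampled = max(0, (db_len - k) // step + 1)
    return sampled * bytes_per_pos


k, w = 11, 28  # illustrative parameters, not taken from the source
for db_len in (10**6, 10**8, 10**9):
    print(db_len, interval_table_bytes(k), position_list_bytes(db_len, k, w))
```

The first figure stays constant across database sizes, while the second grows linearly with the database length, which is the contrast the paragraph above describes.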
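The seed search algorithm addresses the shortcomings of the second and third categories of algorithms above. Among these, Indexed MegaBLAST can find all w seeds, but it is inefficient on large databases because the linked lists become very long, so it must check very frequently whether k seeds are contained in w seeds. Other algorithms sacrifice accuracy to improve efficiency and can find only some of the w seeds. The seed search algorithm of the present invention uses the leaping scan of the first category of algorithms and can check a batch of k seeds for containment in w seeds, which makes the algorithm exact while also highly efficient.
FMD index is short for bidirectional FM-index; FM is formed from the names of the two authors who proposed the FM index, Paolo Ferragina and Giovanni Manzini.
Before describing these two parts in detail, this embodiment first explains some background on the FMD index. A subsequence of length k of a nucleic acid sequence is called a k subsequence. A perfectly matched segment of length w between the query sequence and the database sequence is called a w seed. When a sequence P is searched for in the FMD index of the database, the result is expressed as a double interval, which is represented by three integers. Given the double interval of a nucleic acid sequence P and a character a (one of A, C, G, T), the backward search algorithm yields the double interval of aP, and the forward search algorithm yields the double interval of Pa. The number of elements in a double interval is called the size of the interval and indicates the number of times P occurs in the database. If the double interval of P is empty, P does not occur in the database.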
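To make the three-integer double interval tangible, here is a deliberately naive stand-in, not an FMD index. It sorts all suffixes of the database concatenated with its reverse complement and finds ranges by binary search, whereas a real FMD index works on a compressed BWT and extends intervals in constant time per character. The class name, method names, and the choice of triple (start of P's range, start of the reverse complement's range, size) are assumptions made for illustration. In this model the size counts occurrences on both strands.

```python
from bisect import bisect_left

COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(s):
    """Reverse complement of a DNA string."""
    return "".join(COMP[c] for c in reversed(s))


class NaiveBiIndex:
    """Sorted-suffix stand-in that mimics the interface of a bidirectional index."""

    def __init__(self, seq):
        text = seq + "$" + revcomp(seq) + "$"
        self.suffixes = sorted(text[i:] for i in range(len(text)))

    def _range(self, p):
        # Suffixes starting with p sort between p and p + "~" ("~" sorts after A, C, G, T, $).
        lo = bisect_left(self.suffixes, p)
        hi = bisect_left(self.suffixes, p + "~")
        return lo, hi - lo

    def interval(self, p):
        """Double interval of p as (start of p's range, start of revcomp(p)'s range, size)."""
        lo, size = self._range(p)
        rc_lo, rc_size = self._range(revcomp(p))
        assert size == rc_size  # holds because the text contains both strands
        return (lo, rc_lo, size)

    def backward_ext(self, p, a):
        """Double interval of aP (recomputed from scratch in this naive model)."""
        return self.interval(a + p)

    def forward_ext(self, p, a):
        """Double interval of Pa (recomputed from scratch in this naive model)."""
        return self.interval(p + a)


idx = NaiveBiIndex("ACGTACGA")
print(idx.interval("ACG"))           # size field = occurrences on both strands
print(idx.backward_ext("ACG", "T"))  # interval of TACG: size can only stay equal or shrink
print(idx.interval("GGG"))           # size 0: the pattern does not occur
```

An extension never enlarges the interval, so a shrinking size during backward search signals that some occurrences stopped matching at that character.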
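The basic flow of the present invention is as follows:
S0. Build a lookup table for the FMD index of the database. The lookup table may be implemented in various ways (including but not limited to a hash table); each entry corresponds to a short sequence of length k and stores the double interval obtained by searching for that short sequence in the FMD index;
S1. Compute the lookup value of a k subsequence in the query sequence, and retrieve from the lookup table the double interval of the corresponding seed of length k;
S2. Using the backward search algorithm, progressively extend the matching region on the left side of the k seed;
S3. Run the forward search algorithm on the interval as it was before shrinking in step S2, to find the matching region on the right side of the k seed;
S4. Check whether the current examination position is at the end of the query sequence; if so, the algorithm terminates; otherwise, go to step S5;
S5. Advance the current examination position by w-k+1 positions and repeat steps S2-S5, where w is the length of the seeds to be found.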
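The lookup table of the present invention is built on the FMD index of a nucleic acid sequence database. It is also a hash table; each entry corresponds to a short sequence of length k and stores the double interval obtained by searching for that short sequence in the FMD index. The size of this lookup table is independent of the database size. Since each character can only be one of A, C, G, and T, the lookup table has 4^k entries in total. With this lookup table, the double interval of a k subsequence of the query sequence can be obtained immediately.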
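A direct-addressed table with 4^k slots can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-bit encoding, the function names, and the `interval_of` callback are hypothetical, and the demo callback returns a dummy triple whose only meaningful field is the occurrence count, where a real entry would hold the double interval from the FMD index.

```python
from itertools import product

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer):
    """Pack a DNA k-mer into an integer in [0, 4**k) using 2 bits per base."""
    h = 0
    for c in kmer:
        h = (h << 2) | CODE[c]
    return h


def build_lookup_table(k, interval_of):
    """Precompute one entry per possible k-mer; interval_of is a hypothetical callback."""
    table = [None] * (4 ** k)
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        table[kmer_hash(kmer)] = interval_of(kmer)
    return table


def demo_interval(db, kmer):
    # Stand-in only: the first two fields are dummies, the third is the occurrence count.
    size = sum(db.startswith(kmer, i) for i in range(len(db)))
    return (0, 0, size)


db = "ACGTACGT"
table = build_lookup_table(3, lambda kmer: demo_interval(db, kmer))
print(len(table))                # 4**3 slots regardless of the database length
print(table[kmer_hash("ACG")])   # one array access replaces k search steps
```

Because the slot count depends only on k and the alphabet, rebuilding the table for a longer database changes the stored values but not the table size.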
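The second part of the embodiment is the seed search algorithm, whose flow is shown in FIG. 1. The algorithm starts from the first k subsequence of the query sequence and carries out five main steps in turn.
The first step of the algorithm is part S1 in FIG. 1: it computes the hash value of the k subsequence and retrieves the corresponding double interval from the lookup table. The lookup table not only makes the leaping scan possible but also yields the double interval of the k subsequence in a single step, without having to apply the forward or backward search algorithm character by character, which saves a great deal of time.
The second main step of the algorithm is part S2 in FIG. 1. This step uses the backward search algorithm to progressively find the matching region to the left of the k seed. If during the backward search the double interval shrinks or becomes empty, some k seeds have encountered a mismatched character pair while extending to the left. For the interval before shrinking, the algorithm must also find the matching part to the right of the corresponding seeds, in an attempt to find seeds of length w.
The third main step of the algorithm corresponds to part S3 in FIG. 1: it runs the forward search algorithm on the interval before shrinking in step 2 to find the matching region to the right of the k seed. If during the forward search the search interval becomes empty, this region of the query sequence does not exist in the database; otherwise, the double interval of a w subsequence of the query sequence is obtained, and the algorithm outputs it as a result.
The fourth step of the algorithm is part S4 in FIG. 1, which checks whether the current examination position is at the end of the query sequence. If so, the algorithm terminates; otherwise, step 5 is executed.
The fifth step of the algorithm is part S5 in FIG. 1: it advances the current examination position by w-k+1 positions and repeats steps 2 to 5. This is the leaping scan. The leaping scan does not need to examine every k subsequence in the query sequence; it only needs to examine one every w-k+1 positions. This way of scanning is very efficient while still guaranteeing that all w seeds are found.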
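The five steps can be traced with a simplified sketch. It is not the patented implementation: plain sets of database positions stand in for double intervals, a dictionary stands in for the FMD-based lookup table, and the right extension is done hit by hit rather than on a whole interval. The function name and the toy sequences are assumptions. The sketch reports every maximal exact match of length at least w as (query start, database start, length).

```python
def leaping_seed_search(query, db, k, w):
    """Return maximal exact matches of length >= w as (query_start, db_start, length)."""
    assert w >= k, "the stride w-k+1 must be at least 1"
    step = w - k + 1
    # S0 stand-in: map each k-mer of the database to its start positions.
    table = {}
    for j in range(len(db) - k + 1):
        table.setdefault(db[j:j + k], []).append(j)

    seeds = set()
    i = 0
    while i + k <= len(query):                      # S4: stop at the end of the query
        batch = set(table.get(query[i:i + k], []))  # S1: all k seeds for this k-mer
        left = 0                                    # characters matched to the left so far
        while batch:
            # S2: try to extend the whole batch one character to the left.
            qi = i - left - 1
            survivors = {j for j in batch
                         if qi >= 0 and j - left - 1 >= 0
                         and query[qi] == db[j - left - 1]}
            # S3: hits that just stopped on the left are extended to the right.
            for j in batch - survivors:
                right = k
                while (i + right < len(query) and j + right < len(db)
                       and query[i + right] == db[j + right]):
                    right += 1
                if left + right >= w:
                    seeds.add((i - left, j - left, left + right))
            batch = survivors
            left += 1
        i += step                                   # S5: leap by w-k+1 positions
    return sorted(seeds)


print(leaping_seed_search("TTACGTACGTAAGG", "GGACGTACGTAACC", k=3, w=8))
```

In this toy input the two sequences share the 10-character region ACGTACGTAA starting at position 2 in both, and only the k-mer examined at query position 6 is needed to recover it. Shrinking of the batch during left extension plays the role of the shrinking double interval in step S2.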
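This embodiment also provides an application of the leaping search algorithm for similar substrings of character sequences to searching biological sequence databases: when the character sequences being processed are biological sequences, the FMD index is built on the biological sequence database or on the set of biological sequence query data, and the lookup table is built on that FMD index. The biological sequences include, but are not limited to, DNA, protein, or RNA; other types of biological sequences are equally suited to the technical solution of the present invention.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the scope of protection of the present invention.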

Claims (9)

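  1. A leaping search algorithm for similar substrings of character sequences, characterized by comprising the following steps:
    S0. Build a lookup table for the FMD index of the database, each entry corresponding to a short sequence of length k and storing the double interval obtained by searching for that short sequence in the FMD index;
    S1. Compute the lookup value of a k subsequence in the query sequence, and retrieve from the lookup table the double interval of the corresponding seed of length k;
    S2. Using the backward search algorithm, progressively extend the matching region on the left side of the k seed;
    S3. Run the forward search algorithm on the interval as it was before shrinking in step S2, to find the matching region on the right side of the k seed;
    S4. Check whether the current examination position is at the end of the query sequence; if so, the algorithm terminates; otherwise, go to step S5;
    S5. Advance the current examination position by w-k+1 positions and repeat steps S2-S5, where w is the length of the seeds to be found.
  2. The leaping search algorithm for similar substrings of character sequences according to claim 1, characterized in that the FMD index is specifically as follows:
    A subsequence of length k of a character sequence is called a k subsequence, and a perfectly matched segment of length w between the query sequence and the database sequence is called a w seed. When a sequence P is searched for in the FMD index of the database, the result is expressed as a double interval, which is represented by three integers. Given the double interval of a nucleic acid sequence P and a character a, where a is an element of the alphabet, the backward search algorithm yields the double interval of aP and the forward search algorithm yields the double interval of Pa. The number of elements in a double interval is called the size of the interval and indicates the number of times P occurs in the database; if the double interval of P is empty, P does not occur in the database.
  3. The leaping search algorithm for similar substrings of character sequences according to claim 1, characterized in that the lookup table occupies little space and does not grow as the database grows.
  4. The leaping search algorithm for similar substrings of character sequences according to claim 1, characterized in that, in step S2, if during the backward search the double interval shrinks or becomes empty, some k seeds have encountered a mismatched character pair while extending to the left; for the interval before shrinking, the algorithm must also use the forward search algorithm to find the matching part on the right side of the corresponding seeds, in an attempt to find seeds of length w.
  5. The leaping search algorithm for similar substrings of character sequences according to claim 1, characterized in that, in step S3, if during the forward search the search interval becomes empty, this region of the query sequence does not exist in the database; otherwise, the double interval of a w subsequence of the query sequence is obtained, and the algorithm outputs it as a result.
  6. The leaping search algorithm for similar substrings of character sequences according to claim 1, characterized in that, in step S5, it is not necessary to examine every k subsequence in the query sequence; it suffices to examine one every w-k+1 positions.
  7. The leaping search algorithm for similar substrings of character sequences according to claim 1, characterized in that seed search is performed by combining the FMD index with the lookup table, and the lookup table may be implemented in various ways.
  8. An application of the leaping search algorithm for similar substrings of character sequences to searching biological sequence databases, characterized in that, when the character sequences being processed are biological sequences, the FMD index is built on the biological sequence database or on the set of biological sequence query data, and the lookup table is built on that FMD index.
  9. The application of the leaping search algorithm for similar substrings of character sequences to searching biological sequence databases according to claim 7, characterized in that the biological sequences include DNA, protein, or RNA.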
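PCT/CN2016/087300 2015-06-29 2016-06-27 Leaping search algorithm for similar sub-sequences in character sequences and application thereof in searching in biological sequence database WO2017000859A1 (zh)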

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/740,038 US20180174681A1 (en) 2015-06-29 2016-06-27 Leaping search algorithm for similar sub-sequences in character sequences and application thereof in searching in biological sequence database

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510373462.2A CN105138534B (zh) 2015-06-29 2015-06-29 Leaping seed search algorithm based on FMD index and fast table
CN201510373462.2 2015-06-29

Publications (1)

Publication Number Publication Date
WO2017000859A1 (zh) 2017-01-05

Family

ID=54723884

Family Applications (1)

Application Number Title Priority Date Filing Date
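PCT/CN2016/087300 WO2017000859A1 (zh) 2015-06-29 2016-06-27 Leaping search algorithm for similar sub-sequences in character sequences and application thereof in searching in biological sequence database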

Country Status (3)

Country Link
US (1) US20180174681A1 (zh)
CN (1) CN105138534B (zh)
WO (1) WO2017000859A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111063394A (zh) * 2019-12-13 2020-04-24 人和未来生物科技(长沙)有限公司 Method, system and medium for rapid species search and database construction based on gene sequences

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138534B (zh) * 2015-06-29 2018-08-03 中山大学 Leaping seed search algorithm based on FMD index and fast table
EP3364314B1 (en) * 2017-02-15 2022-10-19 QlikTech International AB Methods and systems for indexing using indexlets
CN114090840A (zh) * 2020-08-24 2022-02-25 华为技术有限公司 Sequence search method, apparatus, device and medium
CN112488526B (zh) * 2020-12-01 2022-12-27 广东电网有限责任公司佛山供电局 Method and apparatus for verifying the correctness of locations where work-ticket safety measures are deployed

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5490258A (en) * 1991-07-29 1996-02-06 Fenner; Peter R. Associative memory for very large key spaces
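CN101441620A (zh) * 2008-11-27 2009-05-27 温州大学 Method for identifying plagiarism in electronic text documents based on approximate string matching distance
CN101675430A (zh) * 2007-05-01 2010-03-17 国际商业机器公司 Method and system for approximate string matching
CN101763405A (zh) * 2009-11-16 2010-06-30 陆嘉恒 Approximate string search technique based on synonym rules
CN105138534A (zh) * 2015-06-29 2015-12-09 中山大学 Leaping seed search algorithm based on FMD index and fast table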

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8527549B2 (en) * 2010-02-22 2013-09-03 Sookasa Inc. Cloud based operating and virtual file system
US20130091121A1 (en) * 2011-08-09 2013-04-11 Vitaly L. GALINSKY Method for rapid assessment of similarity between sequences

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5490258A (en) * 1991-07-29 1996-02-06 Fenner; Peter R. Associative memory for very large key spaces
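CN101675430A (zh) * 2007-05-01 2010-03-17 国际商业机器公司 Method and system for approximate string matching
CN101441620A (zh) * 2008-11-27 2009-05-27 温州大学 Method for identifying plagiarism in electronic text documents based on approximate string matching distance
CN101763405A (zh) * 2009-11-16 2010-06-30 陆嘉恒 Approximate string search technique based on synonym rules
CN105138534A (zh) * 2015-06-29 2015-12-09 中山大学 Leaping seed search algorithm based on FMD index and fast table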

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIU, BOREN ET AL.: "Bioindex: an Efficient Index for Similarity Queries of Biological Sequences", COMPUTER APPLICATIONS AND SOFTWARE, vol. 26, no. 10, 31 October 2009 (2009-10-31), pages 1 - 4, XP055342381 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
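CN111063394A (zh) * 2019-12-13 2020-04-24 人和未来生物科技(长沙)有限公司 Method, system and medium for rapid species search and database construction based on gene sequences
CN111063394B (zh) * 2019-12-13 2023-07-11 人和未来生物科技(长沙)有限公司 Method, system and medium for rapid species search and database construction based on gene sequences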

Also Published As

Publication number Publication date
CN105138534A (zh) 2015-12-09
US20180174681A1 (en) 2018-06-21
CN105138534B (zh) 2018-08-03

Similar Documents

Publication Publication Date Title
WO2017000859A1 (zh) Leaping search algorithm for similar sub-sequences in character sequences and application thereof in searching in biological sequence database
Jiang et al. String similarity joins: An experimental evaluation
KR101313087B1 (ko) Sequence recombination method and apparatus for NGS
US7640256B2 (en) Data collection cataloguing and searching method and system
US20130091121A1 (en) Method for rapid assessment of similarity between sequences
EP3072076B1 (en) A method of generating a reference index data structure and method for finding a position of a data pattern in a reference data structure
CN111445952B (zh) Method and system for rapid similarity comparison of ultra-long gene sequences
WO2017128763A1 (zh) Data compression apparatus and method
US20100287165A1 (en) Indexing a reference sequence for oligomer sequence mapping
US9330159B2 (en) Techniques for finding a column with column partitioning
CN106599097B (zh) Matching method and apparatus for massive sets of feature strings
US8615365B2 (en) Oligomer sequences mapping
CN109545283B (zh) Phylogenetic tree construction method based on a sequential pattern mining algorithm
US20220005546A1 (en) Non-redundant gene set clustering method and system, and electronic device
Chakraborty et al. conLSH: context based locality sensitive hashing for mapping of noisy SMRT reads
Comin et al. Fast entropic profiler: An information theoretic approach for the discovery of patterns in genomes
CN111402958B (zh) Method, system, device and medium for building a gene alignment table
CN105447135A (zh) Data search method and apparatus
WO2011073680A1 (en) Improvements relating to hash tables
US20130041593A1 (en) Method for fast and accurate alignment of sequences
CN102591941B (zh) Method and apparatus for parsing free-list nodes of SQLite
Esmat et al. A parallel hash‐based method for local sequence alignment
Chen et al. CGAP-align: a high performance DNA short read alignment tool
US20060155479A1 (en) Efficiently calculating scores for chains of sequence alignments
Greenstein et al. Short read error correction using an FM-index

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16817223

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15740038

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16817223

Country of ref document: EP

Kind code of ref document: A1