WO2022041881A1 - Sequence search method, apparatus, device, and medium - Google Patents

Sequence search method, apparatus, device, and medium

Publication number
WO2022041881A1
Authority
WO
WIPO (PCT)
Prior art keywords
subsequence
sequence
character
starting
length
Prior art date
Application number
PCT/CN2021/095825
Other languages
English (en)
French (fr)
Inventor
王正
杨德志
陈亮宇
王龙
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022041881A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • the present application relates to the field of computing technology, and in particular, to a sequence search method, apparatus, device, and computer-readable storage medium.
  • a sequence refers to a string formed by multiple characters in an order relationship. Based on the type differences of the characters that make up the sequence, the sequence can be divided into number sequence, letter sequence, Chinese character sequence, and mixed sequence composed of multiple types of characters.
  • the sequence of numbers may include phone numbers, bank card numbers, etc.
  • the sequence of letters may include gene sequences (usually including the letters A, C, G, T, used to characterize different types of bases), and the like.
  • the industry mainly uses the Burrows-Wheeler transform and full-text index in minute space (BWT-FM) algorithm to search.
  • the reference genome undergoes the BW transform, which outputs the index BWT (a string composed of the last characters of the sorted circular strings) and the suffix array (SA).
  • a two-dimensional array, the occurrence array (OCC), can also be generated from the index BWT.
  • the target sequence can be found by accessing the OCC.
  • the present application provides a sequence search method.
  • the method divides a target sequence to be searched into segments, and then accelerates the search for subsequences obtained by the segmentation based on a pre-built acceleration library, so as to avoid searching for the characters of the target sequence one by one, and improve the search efficiency.
  • the present application also provides apparatuses, devices, computer-readable storage media, and computer program products corresponding to the above methods.
  • in a first aspect, the present application provides a sequence search method.
  • the method can be performed by any processing device with data processing capabilities.
  • the processing device may determine, from the target sequence, at least one subsequence whose length is a set length value, where the subsequence takes a character in the target sequence as its starting point.
  • the processing device then searches for the subsequence in an acceleration library dedicated to speeding up the search for sequences of the set length value, and obtains the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
  • the method divides the target sequence into segments according to the set length value, and accelerates the search for the subsequences obtained by the segmentation based on the pre-built acceleration library, so as to avoid searching for the characters of the target sequence one by one, improve the search efficiency, and thus improve the search performance.
  • the acceleration library includes at least one information structure, and the information structure is used to indicate a sample sequence or a range of maximum exact matching of the sample sequence starting from the first character.
  • the processing device can directly obtain the subsequence or the maximum exact matching range starting from the first character of the subsequence according to the information indicated by the information structure in the acceleration library, which improves the search efficiency.
  • the information structure includes a range field and at least one of a presence field and a length field.
  • the presence field is used to represent whether a sample sequence exists in the reference sequence.
  • the range field is used to represent the range of the sample sequence or of the maximum exact match of the sample sequence starting from the first character.
  • the length field is used to represent the length of the sample sequence or of the maximum exact match of the sample sequence.
  • in some embodiments, the information structure may include a presence field and a range field. In other embodiments, the information structure may include a range field and a length field. Of course, the information structure may also include a presence field, a range field and a length field.
  • the processing device can obtain the range of the subsequence, or of the maximum exact match starting from the first character of the subsequence, according to the range field and at least one of the presence field and the length field in the information structure, so that there is no need to compare characters one by one, which improves search efficiency.
  • the information structure of the sample sequence in the acceleration library may be stored in the corresponding storage address according to the mapping relationship between the sequence and the storage address.
  • the processing device can determine the storage address corresponding to the subsequence according to the mapping relationship between the sequence and the storage address, and then access the acceleration library according to the storage address to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
  • in this way, when searching for the target sequence, the processing device can obtain the search result for the subsequence by accessing the memory only once, which reduces the number of memory accesses and improves search efficiency and search performance.
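The text does not say which mapping between sequence and storage address is used. One possible mapping, shown here purely as an assumption, reads the subsequence as a base-m number and multiplies it by the record size. `ALPHABET`, `sequence_to_index`, and `storage_address` are hypothetical names, and the 25-byte record size is only an illustrative default.

```python
ALPHABET = "ACGT"  # assumed value space of each character, so m = 4
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def sequence_to_index(subsequence: str) -> int:
    """Interpret the subsequence as a base-m number (A=0, C=1, G=2, T=3)."""
    index = 0
    for ch in subsequence:
        index = index * len(ALPHABET) + CODE[ch]
    return index


def storage_address(subsequence: str, base_address: int = 0, struct_size: int = 25) -> int:
    """Byte offset of the information structure for this subsequence."""
    return base_address + sequence_to_index(subsequence) * struct_size


address = storage_address("ACGT")  # index 27, so byte offset 27 * 25
```

With such a mapping, every subsequence of the set length has exactly one slot, and the lookup needs no comparison of characters against the reference.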
  • the acceleration library includes a first acceleration library located in a memory, and the set length value is a first length value.
  • the memory is also called internal memory; it temporarily stores the operation data of the processor and exchanges data with external storage (also called external memory) such as a disk.
  • the first acceleration library is located in the memory, and the processing device does not need to load the first acceleration library into the memory, which saves time for loading the first acceleration library and improves search efficiency.
  • the first length value is determined according to the size of the memory.
  • the first acceleration library is located in the memory, therefore, the storage space occupied by the information structure of the sample sequence in the first acceleration library should not be larger than the storage space of the memory. That is, the first length value should satisfy the following formula:
  • P represents the size of the memory.
  • m represents the number of possible values included in the value space of each character in the target sequence.
  • m can be 4.
  • w represents the size of the space occupied by each information structure. For example, the existence field occupies 1 byte, the range field occupies 8+8 bytes, and the length field occupies 8 bytes, so the value of w is 25.
  • in this way, exhausting the memory, which would affect the sequence search, can be avoided, and the search performance can be guaranteed.
  • the acceleration library includes a second acceleration library located in an external memory, and the set length value is a second length value.
  • the external memory refers to a device other than the memory in the storage device.
  • the external storage includes any one or more of magnetic disks, solid state drives (SSDs), flash memory, and the like.
  • because the storage space of the external memory is generally larger than the storage space of the internal memory, a longer subsequence can be searched in the second acceleration library, which improves search efficiency and search performance.
  • the second length value is determined according to the size of the external memory.
  • the second length value can satisfy the following formula:
  • Q represents the size of the external memory, such as the size of the disk.
  • m represents the number of possible values included in the value space of each character in the target sequence.
  • w represents the size of the space occupied by each information structure.
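The two formulas referred to above are missing from this text. A reading consistent with the stated constraint (the information structures for all sample sequences must fit in the available storage) is $m^{len} \cdot w \le P$ for the memory and $m^{len'} \cdot w \le Q$ for the external memory; this is an assumption, not the application's own formula. Under that assumption, the largest admissible length can be computed as follows, where `max_sequence_length` is a hypothetical helper.

```python
def max_sequence_length(capacity_bytes: int, alphabet_size: int, struct_bytes: int) -> int:
    """Largest length such that alphabet_size**length * struct_bytes <= capacity_bytes.

    Assumes one information structure per possible sequence of that length.
    """
    length = 0
    entries = 1  # equals alphabet_size ** length
    while entries * alphabet_size * struct_bytes <= capacity_bytes:
        entries *= alphabet_size
        length += 1
    return length


# Illustrative sizes only: 16 GiB of memory, 1 TiB of external memory, m = 4, w = 25.
first_length_bound = max_sequence_length(16 * 2**30, 4, 25)
second_length_bound = max_sequence_length(2**40, 4, 25)
```

Because the number of records grows as a power of m, a much larger store only buys a few extra characters of subsequence length.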
  • the time for the processing device to randomly access the external memory once is ⁇ times as long as the time for one random access to the memory, that is, the time-consuming ratio of external memory access is ⁇ , and the third length value len E can be set, which satisfies the following formula:
  • the processing device may set $len'$ to be greater than $len_C + len_E$.
  • $len'$ can be set as:
  • $len' = len_C + len_E + len_F$
  • $len_F$ is the fourth length value.
  • the second length value is equal to the sum of the first length value, the third length value and the fourth length value.
  • the processing device may substitute the above formula into the formula that the second length value should satisfy, and solve it to obtain $len_F$.
  • the query time can be greatly shortened, the query efficiency and the query performance can be improved.
  • the second information structure further includes a comparison field.
  • the comparison field is used to represent whether the length value of the maximum exact match is greater than the preset length threshold.
  • the preset length threshold is determined according to the size of the memory and the time-consuming ratio of external-memory access. For example, the preset length threshold may be $len_C + len_E$. In this way, the processing device can quickly determine from the comparison field how the length of the maximum exact match compares with the preset length threshold, and the comparison result can help the subsequent search process.
  • the processing device can perform sequence search by combining the memory breakpoint search method and the external memory breakpoint search method, so that the advantages of both methods are integrated and the search efficiency is further improved.
  • the processing device may determine at least one first subsequence and at least one second subsequence from the target sequence, where the at least one first subsequence and the at least one second subsequence start from a character in the target sequence and the second subsequence is longer than the first subsequence. The processing device may look up the first subsequence in a first acceleration library located in the memory, and look up the second subsequence in a second acceleration library located in the external memory.
  • the method combines the memory breakpoint search method and the external memory breakpoint search method, and can query the maximum exact match of any length, and is not limited to the maximum exact match within a limited length. Moreover, the method can realize asynchronous parallel search of multiple branches, which improves search efficiency.
  • the search in the first acceleration library is stopped.
  • when the maximum exact match of the first subsequence starting from the one character is found in the first acceleration library, the search for the second subsequence in the second acceleration library is stopped.
  • in this way, once one branch has found its result, the branch running in parallel with it can stop searching, which avoids wasting resources.
  • the first length value is determined according to the size of the memory, or is determined according to the time-consuming ratio of external-memory access.
  • the first length value may be $len_C$.
  • the first length value may be $len_E$. In this way, even if the length of the maximum exact match is small, the subsequence can be found quickly through the memory breakpoint search branch, which improves search efficiency.
  • the processing device may search the reference sequence for a sample sequence and obtain a search result, where the search result is used to characterize the position in the reference sequence of the sample sequence or of the maximum exact match of the sample sequence starting from the first character, and then construct the acceleration library from the search result. In this way, subsequent sequence searches are supported and search efficiency is improved.
  • the processing device may search for a sample sequence in the reference sequence through a BWT algorithm according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC to obtain a search result.
  • the search result is used to characterize whether the sample sequence exists in the reference sequence, the range in the two-dimensional array of the sample sequence or of the maximum exact match of the sample sequence starting from the first character, and the length of the sample sequence or of the maximum exact match of the sample sequence starting from the first character.
  • the processing device can speed up the search for the sample sequence, speed up the construction process of the acceleration library, and improve the efficiency of the construction of the acceleration library through the above method.
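The construction step can be sketched as follows. This is a simplified illustration, not the application's method: it replaces the BWT-based search with a brute-force scan, stores plain positions in the reference instead of a range in the OCC array, and keys the library by the sample sequence itself. `longest_prefix_match` and `build_acceleration_library` are hypothetical names.

```python
from itertools import product


def longest_prefix_match(reference: str, sample: str):
    """Longest prefix of `sample` that occurs in `reference`, with its start positions."""
    for length in range(len(sample), 0, -1):
        prefix = sample[:length]
        positions = [i for i in range(len(reference) - length + 1)
                     if reference.startswith(prefix, i)]
        if positions:
            return length, positions
    return 0, []


def build_acceleration_library(reference: str, alphabet: str, set_length: int) -> dict:
    """Precompute one record for every possible sample sequence of the set length."""
    library = {}
    for chars in product(alphabet, repeat=set_length):
        sample = "".join(chars)
        length, positions = longest_prefix_match(reference, sample)
        library[sample] = {
            "present": length == set_length,  # presence field
            "length": length,                 # length field
            "positions": positions,           # stand-in for the range field
        }
    return library


library = build_acceleration_library("GGGCCAACTACC", "ACGT", 3)
hit = library["CCA"]    # occurs in the reference
miss = library["ACG"]   # only its prefix "AC" occurs
```

After construction, looking up a subsequence of the set length is a single access to `library`, whether or not the subsequence occurs in the reference.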
  • the sequence is a gene sequence.
  • the position of the gene sequence in the genome can be quickly located in the gene sequencing scenario, and the efficiency of gene sequence search can be improved.
  • the present application provides a sequence search apparatus.
  • the device includes:
  • a determination module for determining at least one subsequence from the target sequence, and the subsequence takes a character in the target sequence as a starting point;
  • a search module configured to search for the subsequence in an acceleration library, and obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character, where the acceleration library is used for accelerating the search for sequences of a set length value, and the length of the subsequence is the set length value.
  • the acceleration library includes at least one information structure, and the information structure is used to indicate a sample sequence or a range of maximum exact matching of the sample sequence starting from the first character.
  • the information structure includes a range field and at least one of a presence field and a length field, where the presence field is used to represent whether a sample sequence exists in the reference sequence, the range field is used to characterize the range of the sample sequence or of the maximum exact match of the sample sequence starting from the first character, and the length field is used to characterize the length of the sample sequence or of the maximum exact match of the sample sequence.
  • the search module is specifically used for:
  • accessing the acceleration library according to the storage address to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
  • the acceleration library includes a first acceleration library located in a memory, and the set length value is a first length value.
  • the first length value is determined according to the size of the memory.
  • the acceleration library includes a second acceleration library located in an external memory, and the set length value is a second length value.
  • the second length value is determined according to the size of the external memory.
  • the second information structure further includes a comparison field, where the comparison field is used to represent whether the length value of the maximum exact match is greater than a preset length threshold, and the preset length threshold is determined according to the size of the memory and the time-consuming ratio of external-memory access.
  • the determining module is specifically used for:
  • determining at least one first subsequence and at least one second subsequence from the target sequence, where the at least one first subsequence and the at least one second subsequence start with a character in the target sequence, and the second subsequence is longer than the first subsequence;
  • the acceleration library includes a first acceleration library located in a memory and a second acceleration library located in an external memory;
  • the search module is specifically used for:
  • the first subsequence is looked up in the first accelerated library, and the second subsequence is looked up in the second accelerated library.
  • the search module is specifically used for:
  • the first length value is determined according to the size of the memory, or is determined according to the time-consuming ratio of external memory memory access.
  • the apparatus further includes:
  • the construction module is specifically used to:
  • according to the index BWT of the reference sequence, the suffix array SA and the two-dimensional array OCC, search for the sample sequence in the reference sequence by the BWT algorithm and obtain the search result, where the search result is used to characterize whether the sample sequence exists in the reference sequence, the range in the two-dimensional array of the sample sequence or of the maximum exact match of the sample sequence starting from the first character, and the length of the sample sequence or of the maximum exact match of the sample sequence starting from the first character.
  • the sequence is a gene sequence.
  • the present application provides a computing device including a processor and a memory.
  • the processor and the memory communicate with each other.
  • the processor is configured to execute the instructions stored in the memory to cause the computing device to perform the sequence finding method as in the first aspect or any one of the implementations of the first aspect.
  • the present application provides a computer-readable storage medium, where an instruction is stored in the computer-readable storage medium, and the instruction instructs a computing device to execute the sequence search method described in the first aspect or any one of the implementation manners of the first aspect.
  • the present application provides a computer program product containing instructions, which, when run on a computing device, enables the computing device to execute the sequence search method described in the first aspect or any implementation manner of the first aspect.
  • the present application may further combine to provide more implementation manners.
  • FIG. 1 is a scene architecture diagram of a sequence search method provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a processing device according to an embodiment of the present application.
  • FIG. 3 is a flowchart of a sequence search method provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a sequence search method provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a sequence search apparatus according to an embodiment of the present application.
  • first and second in the embodiments of the present application are only used for the purpose of description, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature defined as “first” or “second” may expressly or implicitly include one or more of that feature.
  • a sequence refers to a string formed by multiple characters in an order relationship. Based on the type differences of the characters that make up the sequence, the sequence can be divided into number sequence, letter sequence, Chinese character sequence, and mixed sequence composed of multiple types of characters.
  • the sequence of numbers may include phone numbers, bank card numbers, etc., for example, a sequence of numbers may be 132xxxx2323.
  • the sequence of letters may include gene sequences, e.g., GGGCCAACTACC. The letters A, C, G, and T in the gene sequence are used to characterize different types of bases.
  • Sequence lookup refers to finding a short sequence within a long sequence.
  • the long sequence may also be referred to as a reference sequence, and the short sequence may be referred to as a target sequence.
  • Sequence search is to find the target sequence in the reference sequence. If the target sequence exists in the reference sequence, the position of the target sequence in the reference sequence is returned. If the target sequence does not exist in the reference sequence, the maximum exact match of the target sequence in the reference sequence is returned, specifically, the maximum exact match of the target sequence starting from a specified position (specified character).
  • the reference sequence can be a long string R and the target sequence can be a short string s. Starting from the character at position c in the short string s, the longest of all substrings of s that are exactly matched in R is called the maximum exact match of s starting at c.
  • maximum exact matching is described below with reference to specific examples.
  • the long string R is "addsdfyihadsdk" and the short string s is "dsdfyask". Starting from the character d at position 1 in the short string s, the substrings that successfully match exactly in R include "dsdfy" and "dfy". The longest of these is "dsdfy", so "dsdfy" is the maximum exact match of s starting from the character d.
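The example above can be checked with a brute-force sketch. It treats the maximum exact match starting at a position as the longest substring of s that begins at that position and occurs somewhere in R. The function name `maximum_exact_match` is hypothetical, and the code uses 0-based indices, so position 1 in the text corresponds to `start=0`.

```python
def maximum_exact_match(reference: str, target: str, start: int):
    """Longest substring of `target` beginning at `start` that occurs in `reference`.

    Returns the matched substring and every position in `reference` where it occurs.
    """
    best, positions = "", []
    for end in range(start + 1, len(target) + 1):
        candidate = target[start:end]
        found = [i for i in range(len(reference) - len(candidate) + 1)
                 if reference.startswith(candidate, i)]
        if not found:
            break  # a longer candidate cannot match once a shorter one fails
        best, positions = candidate, found
    return best, positions


match, where = maximum_exact_match("addsdfyihadsdk", "dsdfyask", 0)
```

This scan compares characters one by one against the whole reference, which is exactly the cost the acceleration library is meant to avoid.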
  • the industry mainly uses the BWT-FM algorithm for sequence search. Specifically, after the reference sequence undergoes the BW transform, the index BWT and the suffix array SA can be output. A two-dimensional array OCC can also be generated according to the index BWT. When searching for the target sequence, it is usually necessary to access the memory (specifically, the two-dimensional array OCC in the memory) multiple times, which leads to low search efficiency and reduced search performance.
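To make the cost of repeated OCC accesses concrete, here is a textbook-style sketch of FM-index backward search, written for illustration and not taken from the application. It builds the suffix array by naive sorting, which is only practical for short strings, and counts OCC lookups. `build_fm_index` and `backward_search` are hypothetical names, and `$` is assumed not to occur in the reference.

```python
def build_fm_index(reference: str):
    """Build SA, BWT, the C table and the OCC table for a short reference string."""
    text = reference + "$"  # unique terminator, smaller than every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    c_table, total = {}, 0
    for ch in alphabet:  # c_table[ch]: number of characters smaller than ch
        c_table[ch] = total
        total += text.count(ch)
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):  # occ[ch][i]: occurrences of ch in bwt[:i]
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, bwt, c_table, occ


def backward_search(pattern: str, sa, c_table, occ):
    """Return sorted match positions and the number of OCC lookups performed."""
    lo, hi = 0, len(sa)  # half-open interval of suffix-array rows
    lookups = 0
    for ch in reversed(pattern):
        if ch not in c_table:
            return [], lookups
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        lookups += 2
        if lo >= hi:
            return [], lookups
    return sorted(sa[i] for i in range(lo, hi)), lookups


sa, bwt, c_table, occ = build_fm_index("addsdfyihadsdk")
positions, lookups = backward_search("dsd", sa, c_table, occ)
```

Each character of the pattern costs two OCC lookups at effectively random rows, so the number of memory accesses grows with the pattern length.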
  • the embodiments of the present application provide an efficient sequence search method.
  • the method may be performed by a processing device having data processing capabilities.
  • the processing device may be a server or a terminal, where the terminal includes but is not limited to a desktop computer, a notebook computer, a tablet computer, and a smart phone.
  • the processing device may also be a cluster.
  • the processing device may determine at least one subsequence from the target sequence, and the subsequence specifically starts with a character in the target sequence.
  • the processing device looks up the subsequence in a pre-built acceleration library.
  • the acceleration library is used to speed up the search for sequences of set length values.
  • the length of the subsequence is the set length value. In this way, the processing device can directly search for the subsequence according to the acceleration library without comparing the characters included in the subsequence one by one, and obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from one character.
  • the target sequence to be searched is segmented, and the search for the subsequences obtained by the segmentation is then accelerated based on a pre-built acceleration library, so as to avoid searching for the characters of the target sequence one by one and to improve the search efficiency.
  • the method can also directly access the storage address corresponding to the subsequence according to the mapping relationship between the sequence and the storage address, and obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from one character, which reduces the number of times the processing device accesses the memory. In particular, when the length of the maximum exact match is less than the length of the subsequence, the memory only needs to be randomly accessed once, which can improve the search efficiency, reduce the search cost, and improve the search performance.
  • the acceleration library may include at least one information structure.
  • the information structure is used to indicate the sample sequence or the maximum exact matching range of the sample sequence starting from the first character.
  • in some embodiments, the information structure includes a presence field and a range field. The presence field is used to characterize whether a sample sequence of the same length as the subsequence exists in the reference sequence, and the range field is used to characterize the range of the subsequence (when the subsequence exists in the reference sequence) or the range of the maximum exact match of the subsequence starting from the above one character (when the subsequence does not exist in the reference sequence).
  • the information structure includes a range field and a length field.
  • the length field is used to characterize the length of the subsequence, or the length of the maximum exact match of the subsequence starting with one character.
  • the information structure includes the above-mentioned presence field, range field and length field. In this way, the processing device can obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
  • the processing device can directly determine whether the subsequence exists in the reference sequence based on the acceleration library. If so, it returns the position of the subsequence in the reference sequence and the length of the subsequence. If not, it returns the length of the maximum exact match of the subsequence starting from the first character of the subsequence.
  • sequence search method is described below in combination with a gene sequencing scenario.
  • the scenario includes a detection device 100 , a processing device 200 and a user terminal 300 .
  • a communication connection is established between the detection device 100 and the user terminal 300
  • a communication connection is established between the processing device 200 and the user terminal 300 .
  • FIG. 1 uses the processing device 200 as a server for illustration. In other implementation manners, the processing device 200 may be a device such as a terminal or a cluster.
  • the detection device 100 is used to detect biological tissues such as blood and saliva to obtain the target sequence.
  • the detection device 100 may send the target sequence to the user terminal 300 , and the user terminal 300 may submit the target sequence to the processing device 200 .
  • after the processing device 200 receives the target sequence, it determines at least one subsequence from the target sequence, where the subsequence starts with a character of the target sequence, then searches the acceleration library for the subsequence, and obtains the position in the reference genome (reference sequence) of the subsequence or of the maximum exact match of the subsequence starting from that character.
  • the processing device 200 can obtain the search result by accessing the memory only once for the subsequence, which reduces the number of times of accessing the memory, improves the search efficiency, and improves the search performance.
  • FIG. 2 shows a schematic structural diagram of the processing device 200. It should be understood that FIG. 2 only shows part of the hardware structure and part of the software modules in the above-mentioned processing device 200. In a specific implementation, the processing device 200 may also include more hardware structures, such as indicator lights and buzzers, and more software modules, such as various applications.
  • the processing device 200 includes a bus 201 , a processor 202 , a communication interface 203 and a memory 204 .
  • the processor 202 , the memory 204 and the communication interface 203 communicate through the bus 201 .
  • the bus 201 may be a peripheral component interconnect (PCI) bus, a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 2, but it does not mean that there is only one bus or one type of bus.
  • the processor 202 can be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), and similar devices.
  • the communication interface 203 is used for external communication, such as receiving the target sequence sent by the user terminal 300, and returning to the user terminal 300 the position of the subsequence in the reference sequence or the position in the reference sequence of the maximum exact match of the subsequence starting from one character etc.
  • Memory 204 may include volatile memory, such as random access memory (RAM).
  • the memory 204 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • RAM and ROM are called memory
  • HDD and SSD are called external memory.
  • Programs or instructions are stored in the memory 204, for example, programs or instructions required to implement the sequence search method provided by the embodiments of the present application.
  • the processor 202 executes the program or instructions to perform the aforementioned sequence finding method.
  • the method includes:
  • S302 The processing device 200 determines at least one subsequence from the target sequence.
  • the subsequence starts with a character in the target sequence.
  • The processing device 200 may determine a set of subsequences by using, as starting points, a set of characters in the target sequence that are spaced apart by a preset length value.
  • the set of subsequences includes at least one subsequence.
  • the lengths of the multiple subsequences are equal.
  • The special characters refer to characters other than the normal characters that make up the sequence.
  • For example, in a gene sequence a base can be marked as N, that is, a character other than A, C, G, and T.
  • The processing device 200 may first determine a character in the target sequence as the starting point, use the character spaced from the starting point by a preset length value as the ending point, and then determine whether any special character lies between the starting point and the ending point.
  • If so, the starting point is updated to the character after the special character and the above steps are re-executed, that is, the ending point is re-determined and the check for special characters is repeated, until no special character lies between the starting point and the ending point.
  • The processing device 200 can then determine a subsequence from the characters between the starting point and the ending point. Further, the processing device 200 may update the starting point and perform the above steps again to determine the next subsequence.
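  • As an illustration only, the starting-point and ending-point procedure can be sketched in Python. The function name `extract_subsequences`, the `step` parameter used to advance to the next starting point, and the choice of {A, C, G, T} as the normal characters are assumptions of this sketch rather than details fixed by the embodiment.

```python
NORMAL = set("ACGT")  # assumed normal characters; anything else (e.g. N) is special


def extract_subsequences(target, length, step):
    """Return (start, subsequence) pairs of the given length that contain
    no special characters."""
    result = []
    start = 0
    while start + length <= len(target):
        window = target[start:start + length]
        # Offset of the first special character inside the window, if any.
        bad = next((i for i, ch in enumerate(window) if ch not in NORMAL), None)
        if bad is not None:
            # Restart from the character after the special character.
            start += bad + 1
            continue
        result.append((start, window))
        start += step  # hypothetical rule for choosing the next starting point
    return result


print(extract_subsequences("ACGTNACGTAC", 4, 4))
```

  • In this sketch a window that runs past the end of the target sequence is simply dropped; an implementation could equally well keep a shorter final subsequence.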
  • The processing device 200 searches for the subsequence in the acceleration library and obtains the position in the reference sequence of either the subsequence itself or the maximum exact match of the subsequence starting from the above-mentioned character.
  • the acceleration library is used to speed up the search for sequences of set length values.
  • the length of the subsequence is the set length value.
  • The processing device 200 can directly search for the subsequence in the acceleration library and obtain the position in the reference sequence of either the subsequence or the maximum exact match of the subsequence starting from its first character.
  • The acceleration library includes at least one information structure, which includes an existence field, a range field, and a length field.
  • the existence field is used to identify whether a sample sequence with the same length as the subsequence is in the reference sequence
  • the range field is used to identify the sample sequence or the maximum exact matching range of the sample sequence starting from the first character of the sample sequence
  • the length field is used to identify the length of the sample sequence or the maximum exact match of the sample sequence starting from the first character of the sample sequence.
  • The value of the existence field can be a Boolean value, that is, true or false.
  • The value can also be 1 or 0, representing true or false respectively.
  • The range field may specifically include a start identifier and an end identifier, which may be denoted start and end.
  • The length field may be denoted length.
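  • A minimal sketch of such an information structure, written as a Python dataclass, is shown below. The class name `InfoStructure` and the attribute name `exists` are illustrative assumptions; only the three fields and the names start, end and length come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoStructure:
    exists: bool  # existence field: is the whole sample sequence in the reference?
    start: int    # range field, start identifier
    end: int      # range field, end identifier
    length: int   # length of the sample sequence or of its maximum exact match


# A sample sequence that is present in the reference, with an assumed range.
hit = InfoStructure(exists=True, start=2, end=3, length=3)
# A sample sequence that is absent; the fields then describe its maximum exact match.
miss = InfoStructure(exists=False, start=2, end=3, length=2)
print(hit, miss)
```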
  • A sample sequence can be obtained by randomly sampling each character of the sequence from its value space.
  • The value space of each character is {A, C, G, T}.
  • the sample sequence may include AACT, GATT, CAGG and so on.
  • the processing device 200 may, for at least one sample sequence, search for the sample sequence in the reference sequence to obtain a search result.
  • the search result is used to characterize the sample sequence or the position of the largest exact match of the sample sequence starting from the first character.
  • the processing device 200 may construct an acceleration library according to the above search results, so that when a subsequence is searched subsequently, it can directly return the subsequence or the position of the maximum exact match of the subsequence with the first character as the starting point in the reference sequence.
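  • The construction step can be illustrated with a small Python sketch. It is deliberately naive: it enumerates every sequence of the given length instead of sampling, scans the reference directly instead of using an index, and stores a list of positions instead of an OCC range. The names `longest_prefix_match` and `build_acceleration_library` are hypothetical.

```python
from itertools import product


def longest_prefix_match(reference, sample):
    """Return (length, positions) of the longest prefix of `sample`
    that occurs in `reference`."""
    for length in range(len(sample), 0, -1):
        prefix = sample[:length]
        positions = [i for i in range(len(reference) - length + 1)
                     if reference.startswith(prefix, i)]
        if positions:
            return length, positions
    return 0, []


def build_acceleration_library(reference, length, alphabet="ACGT"):
    """Precompute one entry per sequence of `length` over `alphabet`."""
    library = {}
    for chars in product(alphabet, repeat=length):
        sample = "".join(chars)
        match_len, positions = longest_prefix_match(reference, sample)
        library[sample] = {
            "exists": match_len == length,  # existence field
            "positions": positions,         # stands in for the range field
            "length": match_len,            # length field
        }
    return library


lib = build_acceleration_library("ACGTACGA", 3)
print(lib["ACG"], lib["ACT"])
```

  • Because every entry is computed once in advance, a later query for a subsequence of the same length reduces to a single lookup.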
  • the processing device 200 may search for the sample sequence in the reference sequence through a hash search method or a BWT-FM method to obtain a search result.
  • the processing device 200 searches for the sample sequence in the reference sequence through the BWT algorithm according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, and obtains the search result.
  • The search result characterizes whether the sample sequence exists in the reference sequence, the range in the two-dimensional array OCC of the sample sequence or of its maximum exact match starting from the first character (this range can be used to determine the position in the reference sequence), and the length of the sample sequence or of that maximum exact match.
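  • For readers unfamiliar with BWT-based search, the following Python sketch shows the textbook FM-index backward search over a BWT, a suffix array and an occurrence table. It covers only the lookup of a whole pattern, not the maximum-exact-match extension, and it builds the index naively; it is not presented as the embodiment's implementation. The names `build_fm_index`, `backward_search` and `first` are assumptions.

```python
def build_fm_index(reference):
    text = reference + "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is "$" when i == 0
    alphabet = sorted(set(text))
    # first[ch]: number of characters in text strictly smaller than ch
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, first, occ


def backward_search(pattern, sa, first, occ):
    """Return the sorted start positions of `pattern` in the reference."""
    lo, hi = 0, len(sa)  # half-open range [lo, hi) of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, first, occ = build_fm_index("ACGTACGA")
print(backward_search("ACG", sa, first, occ))
```

  • The pair (lo, hi) maintained by the loop plays the role of the range mentioned above: it identifies the rows of the suffix array whose suffixes begin with the pattern.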
  • the processing device can obtain an information structure for at least one sample sequence according to the search result, and an acceleration library can be constructed based on the above-mentioned information structure.
  • the processing device 200 may also establish a mapping relationship between sequences and storage addresses.
  • For example, if the information structure corresponding to a gene sequence occupies K bytes, the 7 values 0 to 6 occupy 7*K bytes, and the storage address (specifically, the starting address) of ACGT can be 0x00+7*K.
  • the processing device 200 may store the information structure of the sequence at the corresponding storage address according to the mapping relationship.
  • The processing device 200 can determine the storage address corresponding to the subsequence according to the correspondence between sequences and storage addresses, and then access the acceleration library at that storage address to obtain the position in the reference sequence of either the subsequence or the maximum exact match of the subsequence starting from the one character. The sequence search efficiency can thereby be further improved.
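  • One possible way to realize such a mapping is to read the sequence as a number and multiply it by the size of an information structure. The Python sketch below assumes a base-4 encoding with A=0, C=1, G=2, T=3; this encoding, the function names, and the default structure size are assumptions, and the index it assigns to a given sequence need not coincide with the numbering used in the example above.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding, not from the source


def sequence_value(seq):
    """Interpret the sequence as a base-4 number."""
    value = 0
    for ch in seq:
        value = value * 4 + CODE[ch]
    return value


def storage_address(seq, base=0x00, k=25):
    """Starting address of the information structure for `seq`,
    with k bytes per structure."""
    return base + sequence_value(seq) * k


print(storage_address("AAAC"))
```

  • With a mapping of this kind no search inside the library is needed: the address is computed from the subsequence itself.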
  • the acceleration library can be stored in memory and/or external memory for use when looking up sequences.
  • the size of the memory is usually smaller than that of the external memory. Therefore, the scale of the acceleration library stored in the memory is generally smaller than that of the acceleration library stored in the external memory.
  • the embodiment of the present application refers to the acceleration library stored in the memory as the first acceleration library, and the acceleration library stored in the external memory as the second acceleration library.
  • the processing device 200 may search for the subsequence through the first acceleration library to improve search efficiency.
  • the first acceleration library includes at least one first information structure.
  • The existence field of each first information structure is used to represent whether a sample sequence whose length is the first length value exists in the reference sequence.
  • the range field is used to characterize the sample sequence or the maximum exact matching range of the sample sequence starting from the first character of the sample sequence.
  • the length field is used to represent the sample sequence or the maximum exact matching length value of the sample sequence starting from the first character of the sample sequence.
  • the processing device 200 determines a sample sequence matching the subsequence from the first acceleration library, obtains a first information structure of the sample sequence, and determines whether the subsequence exists in the reference sequence according to the existence field in the first information structure . If the value of the existence field is true or 1, it indicates that the subsequence exists in the reference sequence. The processing device 200 determines the range of the subsequence in the reference sequence according to the value of the range field, and determines the length of the subsequence according to the value of the length field. If the value of the existence field is false or 0, it indicates that the subsequence does not exist in the reference sequence.
  • In that case, the processing device 200 determines, according to the value of the range field, the range in the reference sequence of the maximum exact match of the subsequence starting from its first character, and determines, according to the value of the length field, the length of that maximum exact match.
  • The above method of searching for a sequence using the first acceleration library may also be referred to as the memory breakpoint search method.
  • the length of the subsequence is the first length value, and the first length value can be recorded as len C . Since the length of the subsequence is equal to the length of the sample sequence, len C satisfies the following formula:
  • P represents the size of the memory.
  • m represents the number of possible values included in the value space of each character in the target sequence.
  • m can be 4.
  • w represents the size of the space occupied by each information structure. For example, the existence field occupies 1 byte, the range field occupies 8+8 bytes, and the length field occupies 8 bytes, so the value of w is 25.
  • the first length value can be determined by the size of the memory. Specifically, the processing device 200 substitutes the size of the memory into the above formula (1), and then solves len C .
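  • Formula (1) is not reproduced in this text. One reading consistent with the definitions of P, m and w is that all m^(len C) information structures of w bytes each must fit into P bytes, that is, m^(len C) · w ≤ P. Under that assumption, which is not confirmed by the source, the largest admissible length can be computed as in the following Python sketch; the function name is hypothetical.

```python
def max_sequence_length(capacity_bytes, m=4, w=25):
    """Largest length L with m**L * w <= capacity_bytes (assumes m >= 2)."""
    length = 0
    while (m ** (length + 1)) * w <= capacity_bytes:
        length += 1
    return length


# With 16 GiB of memory, m = 4 and w = 25 bytes:
print(max_sequence_length(16 * 2**30))
```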
  • The processing device 200 determines from the target sequence s a subsequence s[c:c+len C ] that starts at c and has a length of len C , and then looks up s[c:c+len C ] in the first acceleration library. If s[c:c+len C ] is in the reference sequence R, the position of s[c:c+len C ] in R is returned; if s[c:c+len C ] is not in R, the maximum exact match in R of s starting at c, together with the position of that match in R, is returned.
  • When the processing device 200 adopts the memory breakpoint search method and the maximum exact match length is less than len C , the query needs only a single random access to the memory, so the query cost is negligible, which greatly improves query efficiency and performance.
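  • The lookup itself can be sketched as a single table access. In the Python example below the library is a small hand-written dictionary for the reference ACGTACGA, positions stand in for the range field, and the function name `memory_breakpoint_search` is an assumption.

```python
# Two hand-written entries of a length-3 library for the reference "ACGTACGA".
LIBRARY = {
    "ACG": {"exists": True, "length": 3, "positions": [0, 4]},
    "ACT": {"exists": False, "length": 2, "positions": [0, 4]},
}


def memory_breakpoint_search(library, target, c, len_c):
    """Look up target[c:c+len_c]; assumes the key is present in the library."""
    info = library[target[c:c + len_c]]  # the single random access
    matched = target[c:c + info["length"]]
    return info["exists"], matched, info["positions"]


print(memory_breakpoint_search(LIBRARY, "TTACGG", 2, 3))
print(memory_breakpoint_search(LIBRARY, "ACTT", 0, 3))
```

  • In the second call the subsequence ACT is absent from the reference, so the entry describes its maximum exact match AC instead.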
  • the processing device 200 may search the sequence through the second acceleration library to improve search efficiency.
  • the second acceleration library includes at least one second information structure.
  • The existence field of each second information structure is used to represent whether a sample sequence whose length is the second length value exists in the reference sequence.
  • the range field is used to characterize the sample sequence or the maximum exact matching range of the sample sequence starting from the first character of the sample sequence.
  • the length field is used to represent the sample sequence or the maximum exact matching length value of the sample sequence starting from the first character of the sample sequence.
  • the processing device 200 determines a sample sequence matching the subsequence from the second acceleration library, obtains a second information structure of the sample sequence, and determines whether the subsequence exists in the reference sequence according to the existence field in the second information structure . If the value of the existence field is true or 1, it indicates that the subsequence exists in the reference sequence. The processing device 200 determines the range of the subsequence in the reference sequence according to the value of the range field, and determines the length of the subsequence according to the value of the length field. If the value of the existence field is false or 0, it indicates that the subsequence does not exist in the reference sequence.
  • In that case, the processing device 200 determines, according to the value of the range field, the range in the reference sequence of the maximum exact match of the subsequence starting from its first character, and determines, according to the value of the length field, the length of that maximum exact match.
  • The above method of searching for a sequence using the second acceleration library may also be referred to as the external memory breakpoint search method.
  • When the second acceleration library is stored on a disk serving as the external memory, the method may be called the disk breakpoint search method.
  • When the external memory breakpoint search method is adopted, the length of the subsequence is the second length value, which may be greater than the above-mentioned first length value.
  • the second length value can be denoted as len'. Since the length of the subsequence is equal to the length of the sample sequence, len' satisfies the following formula:
  • Q represents the size of the external memory, such as the size of the disk.
  • m represents the number of possible values included in the value space of each character in the target sequence.
  • w represents the size of the space occupied by each information structure.
  • the size of the second length value may be determined according to the size of an external memory (eg, a disk). Specifically, the processing device 200 substitutes the size of the external memory into the above formula (2), and then solves len'.
  • the time for the processing device 200 to randomly access the external memory once is ⁇ times as long as the time for the random access to the memory once, that is, the time-consuming ratio of external memory access is ⁇ , and a third length value len E can be set, which satisfies the following formula:
  • processing device 200 may set len' to be greater than len C +len E .
  • len' can be set as:
  • $len' = len_C + len_E + len_F$ (4)
  • len F is the fourth length value
  • the second length value is equal to the sum of the first length value, the third length value and the fourth length value.
  • the processing device 200 may substitute the above formula (4) into the above formula (2), so as to obtain len F by solving.
  • The processing device 200 determines from the target sequence s a subsequence s[c:c+len C +len E +len F ] that starts at c and has a length of len C +len E +len F , and then looks up s[c:c+len C +len E +len F ] in the second acceleration library. If s[c:c+len C +len E +len F ] is in the reference sequence R, its position in R is returned; if it is not in R, the maximum exact match in R of s starting at c, together with the position of that match in R, is returned.
  • the second information structure may further include a comparison field.
  • the comparison field is used to represent whether the length value of the maximum exact match is greater than the preset length threshold.
  • The preset length threshold is determined according to the size of the memory and the time-consuming ratio of external memory access.
  • the preset length threshold may be len C +len E in one example.
  • The first information structure may include b 1 , start 1 , end 1 and length 1 , and the second information structure may include b 2 , b 3 , start 2 , end 2 and length 2 .
  • b 1 and b 2 represent the value of the existence field in the first information structure and the second information structure, respectively.
  • b 3 represents the value of the comparison field in the second information structure.
  • start 1 , end 1 and start 2 , end 2 represent the value of the range field in the first information structure and the second information structure, respectively.
  • length 1 and length 2 represent the value of the length field in the first information structure and the second information structure, respectively.
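  • To make the two layouts concrete, the Python sketch below packs them with the standard `struct` module. The field widths for the first structure follow the earlier example (1 byte, 8+8 bytes, 8 bytes); giving the comparison field b 3 one byte, the little-endian unpadded layout, and the helper names are assumptions.

```python
import struct

# "<" selects little-endian with no padding; "?" is a 1-byte bool, "Q" an 8-byte unsigned int.
FIRST = struct.Struct("<?QQQ")    # b1, start1, end1, length1
SECOND = struct.Struct("<??QQQ")  # b2, b3, start2, end2, length2


def pack_first(b1, start1, end1, length1):
    return FIRST.pack(b1, start1, end1, length1)


def pack_second(b2, b3, start2, end2, length2):
    return SECOND.pack(b2, b3, start2, end2, length2)


record = pack_first(True, 3, 7, 12)
print(FIRST.size, SECOND.size, FIRST.unpack(record))
```

  • Fixed-size records of this kind are what make it possible to compute a storage address directly from a sequence.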
  • The processing device 200 may also combine the memory breakpoint search method and the external memory breakpoint search method to exploit the advantages of each breakpoint search method and further improve search efficiency. Further, the processing device 200 can also combine the memory breakpoint search method, the external memory breakpoint search method, and the BWT-FM method to search for the sequence.
  • the method includes:
  • S402 The processing device 200 acquires the target sequence.
  • the processing device 200 may receive the target sequence sent by the user terminal 300, so as to search for the target sequence in the reference sequence, and determine the position of the target sequence or the maximum exact match of the target sequence in the reference sequence. In some possible implementations, the processing device 200 may also directly receive the target sequence sent by the detection device 100, so as to search for the target sequence in the reference sequence.
  • S404 The processing device 200 determines at least one first subsequence from the target sequence.
  • the first subsequence starts with a character in the target sequence.
  • the length of the first subsequence may be len C .
  • The first subsequence determined by the processing device 200 from the target sequence may be s[c:c+len C ]. It should be noted that when a special character is included in s[c:c+len C ], the processing device 200 may skip the special character and re-determine the first subsequence s[c:c+len C ].
  • S406 The processing device 200 determines at least one second subsequence from the target sequence.
  • the second subsequence starts with a character in the target sequence.
  • the starting point of the second subsequence may be the same as the starting point of the first subsequence.
  • the length of the second subsequence may be len C + len E + len F .
  • The second subsequence determined by the processing device 200 from the target sequence may be s[c:c+len C +len E +len F ].
  • When a special character is included in s[c:c+len C +len E +len F ], the processing device 200 may skip the special character and re-determine the second subsequence s[c:c+len C +len E +len F ].
  • S408 The processing device 200 searches the first acceleration library for the first subsequence. When the first subsequence is in the reference sequence, the processing device 200 performs S410. When the first subsequence is not in the reference sequence, the processing device 200 performs S414.
  • S410 The processing device 200 obtains the position of the first subsequence in the reference sequence.
  • The processing device 200 can obtain the range of the first subsequence in the two-dimensional array OCC according to the first information structure of the sample sequence in the first acceleration library that matches the first subsequence, specifically the range field of the first information structure. The processing device 200 may then determine the position of the first subsequence in the reference sequence based on the range.
  • S412 The processing device 200 uses the BWT-FM method to search the characters, up to the third length value in number, that follow the first subsequence.
  • The processing device 200 uses the BWT-FM algorithm to search, character by character, the len E characters that follow s[c:c+len C ]. If the maximum exact match ends between positions c+len C and c+len C +len E , the maximum exact match of the target sequence starting at c (that is, starting from the first character of the subsequence) and the position of that match in the reference sequence R are returned.
  • In this case, the length of the maximum exact match is greater than or equal to len C and less than or equal to len C +len E .
  • S414 The processing device 200 obtains the maximum exact match of the first subsequence starting from the first character, and the position of the maximum exact match in the reference sequence.
  • The processing device 200 obtains, according to the first information structure of the sample sequence in the first acceleration library that matches the first subsequence s[c:c+len C ], specifically the range field of the first information structure, the range in the two-dimensional array OCC of the maximum exact match of the first subsequence (the maximum exact match of the target sequence starting at c). The processing device 200 then determines the position of the maximum exact match in the reference sequence based on the range.
  • S416 The processing device 200 searches the second acceleration library for the second subsequence. When the second subsequence is in the reference sequence, the processing device 200 performs S418. When the second subsequence is not in the reference sequence, the processing device 200 performs S422.
  • S418 The processing device 200 obtains the position of the second subsequence in the reference sequence.
  • The processing device 200 can obtain the range of the second subsequence in the two-dimensional array OCC according to the second information structure of the sample sequence in the second acceleration library that matches the second subsequence, specifically the range field of the second information structure. The processing device 200 may then determine the position of the second subsequence in the reference sequence based on the range.
  • S420 The processing device 200 searches for characters following the second subsequence using the BWT-FM method.
  • The processing device 200 uses the BWT-FM algorithm to search, character by character, the characters that follow s[c:c+len C +len E +len F ] until the maximum exact match of the target sequence is found, and returns the maximum exact match and its position in the reference sequence. The length of the maximum exact match is greater than or equal to len C +len E +len F .
  • S422 The processing device 200 obtains the maximum exact match of the second subsequence starting from the first character and the position of the maximum exact match in the reference sequence.
  • The processing device 200 obtains, according to the second information structure of the sample sequence in the second acceleration library that matches the second subsequence s[c:c+len C +len E +len F ], specifically the range field of the second information structure, the range in the two-dimensional array OCC of the maximum exact match of the second subsequence (the maximum exact match of the target sequence starting at c). The processing device 200 then determines the position of the maximum exact match in the reference sequence based on the range.
  • The processing device 200 may further acquire the value of the comparison field.
  • When the value of the comparison field is true or 1, the processing device 200 can return the maximum exact match in the reference sequence R of the target sequence s starting at c, together with the position of that match in R.
  • When the value of the comparison field is false or 0, the processing device 200 can end the current operation, and the maximum exact match and its position in the reference sequence are returned through S412.
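  • The branch logic of S408 to S422 can be summarized in a sequential Python sketch. It is a simplification under several assumptions: both libraries are plain dictionaries built by brute force, positions replace OCC ranges, the BWT-FM step is replaced by a naive character-by-character extension, and the branches run one after another rather than in parallel. All function and key names are hypothetical.

```python
from itertools import product


def occurrences(reference, segment):
    return [i for i in range(len(reference) - len(segment) + 1)
            if reference.startswith(segment, i)]


def build_library(reference, length, threshold=0):
    """One entry per sequence of `length` over A, C, G, T."""
    library = {}
    for chars in product("ACGT", repeat=length):
        sample = "".join(chars)
        match = length
        while match > 0 and sample[:match] not in reference:
            match -= 1
        library[sample] = {
            "exists": match == length,
            "length": match,
            "positions": occurrences(reference, sample[:match]) if match else [],
            "above_threshold": match > threshold,  # comparison field
        }
    return library


def extend(reference, target, c, matched, limit):
    """Naive stand-in for the BWT-FM step: grow the match one character at a time."""
    stop = len(target) - c if limit is None else min(limit, len(target) - c)
    while matched < stop and target[c:c + matched + 1] in reference:
        matched += 1
    return matched, occurrences(reference, target[c:c + matched])


def combined_search(reference, first_lib, second_lib, target, c, len_c, len_e, len_f):
    """Return (length, positions) of the maximum exact match of target starting at c."""
    second = second_lib[target[c:c + len_c + len_e + len_f]]
    if second["exists"]:
        return extend(reference, target, c, second["length"], None)  # S418, S420
    if second["above_threshold"]:
        return second["length"], second["positions"]                 # S422
    first = first_lib[target[c:c + len_c]]
    if not first["exists"]:
        return first["length"], first["positions"]                   # S414
    return extend(reference, target, c, len_c, len_c + len_e)        # S410, S412


REF = "ACGTACGATTGC"
FIRST_LIB = build_library(REF, 2)
SECOND_LIB = build_library(REF, 5, threshold=3)  # threshold = len_c + len_e
print(combined_search(REF, FIRST_LIB, SECOND_LIB, "ACGTACGG", 0, 2, 1, 2))
```

  • The sketch assumes that at least len C +len E +len F characters remain in the target sequence from position c; shorter tails would need separate handling.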
  • S404 and S408 may be executed in parallel with S406 and S416, or may be executed sequentially according to the set order.
  • When S408 and S416 are executed in parallel, if the processing device 200 first finds, in the second acceleration library, the maximum exact match of the second subsequence starting from the one character (S422 is executed first), the processing device 200 stops searching for the first subsequence in the first acceleration library (S408). Similarly, if the processing device 200 first finds, in the first acceleration library, the maximum exact match of the first subsequence starting from the one character (S414 is executed first), the processing device 200 stops searching for the second subsequence in the second acceleration library (S416).
  • If S410 is completed first among S410, S414, S418, and S422, the processing device 200 continues to execute S412. While S412 is being executed, if S422 completes first, the execution of S412 is stopped; if S422 has not completed, the execution of S412 and S418 continues. If S418 is completed first among S410, S414, S418, and S422, the processing device 200 continues to execute S420. If S422 is completed first among S410, S414, S418, and S422 and the length of the maximum exact match is greater than len C +len E , the execution of S410, S412, S414, and S418 may be stopped. By searching multiple branches in parallel, the search result can be obtained in a shorter time. When one branch finds the result first, the branches running in parallel with it can stop searching, which avoids wasting resources.
  • the length of the first subsequence in the memory breakpoint search method may also be set equal to len E .
  • len E ⁇ len C .
  • The length of the sample sequence in the first acceleration library is then also equal to len E , which ensures that the search time is no longer than that of the BWT-FM method, thereby ensuring sequence search efficiency.
  • the embodiments of the present application combine the memory breakpoint search method, the external memory breakpoint search method and BWT-FM to query the maximum exact match of any length, not limited to the maximum exact match within a limited length. Moreover, the method can realize asynchronous parallel search of multiple branches, which improves search efficiency.
  • this method can greatly improve the search performance. Based on the difference in the maximum exact match length, there are certain differences in the improvement of search performance, as follows:
  • the query time is a fixed value (corresponding to the time for BWT-FM algorithm to query the length of len C ), and the average performance is improved by at least 4 times;
  • sequence search method provided in the embodiment of the present application will be described in detail below in the context of gene sequencing. Referring to the flowchart of the sequence search method shown in FIG. 5, the method includes:
  • S604 Determine whether there are special characters in the base string starting from position c and having a length of len C , if so, jump to S6041, otherwise, jump to S605.
  • the first information structure includes Boolean value b 1 , OCC query range start 1 , end 1 and length length 1 .
  • S606 Determine whether b 1 in the first information structure is True, and if so, jump to S607, otherwise, jump to S6061.
  • the relevant information of the maximum exact match includes the position of the maximum exact match in the reference and the length of the maximum exact match length 1 .
  • the position of the maximum exact match in the reference can be determined according to the range start 1 and end 1 of the OCC at the maximum exact match.
  • the suffix array SA is searched according to the interval [start 1 , end 1 ] to obtain the matching value of each integer in the interval in the SA, and the matching value is the starting position of the maximum exact match in the reference.
  • the position of the maximum exact match in the reference can be determined based on the starting position and the maximum exact match length.
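  • The conversion from an interval to positions can be sketched as follows in Python. The interval [start, end] is treated as inclusive, the suffix array is built naively with a "$" sentinel, and the function names are assumptions of this sketch.

```python
def suffix_array(text):
    """Naive suffix array: start offsets of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def match_positions(sa, start, end, match_length):
    """Map the inclusive SA interval [start, end] to (begin, end_exclusive)
    spans of the match in the reference."""
    return sorted((sa[i], sa[i] + match_length) for i in range(start, end + 1))


SA = suffix_array("ACGTACGA$")
# Rows 2 and 3 of this suffix array hold the suffixes that begin with "ACG".
print(match_positions(SA, 2, 3, 3))
```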
  • S607 Determine whether there are special characters in the base string starting from position c and having a length of Len C + Len E , and if so, skip to S6071; otherwise, skip to S608.
  • S608 Use the BWT-FM method, and use OCC and SA to continue querying until the length reaches Len C +Len E , or determine the maximum exact match before the length reaches Len C +Len E .
  • Len max represents the length of the maximum exact match.
  • S704 Determine whether the base string whose position c is the starting point and whose length is Len C + Len E + Len F contains special characters, and if so, skip to S7041; otherwise, skip to S705.
  • the second information structure includes Boolean values b 2 , b 3 and OCC query ranges start 2 , end 2 and length length 2 .
  • S706 Determine whether b 2 in the second information structure is True; if so, jump to S708, otherwise jump to S707.
  • S707 Determine whether b 3 in the second information structure is True, and if so, jump to S7071, otherwise end the current process.
  • S708 Using the BWT-FM method, continue the query by using OCC and SA to determine the maximum exact match. Then jump to S7081.
  • the relevant information of the maximum exact match between the original starting position and the updated starting position can also be determined. For example, in S6071, c to c+Len C have been matched, and BWT-FM can continue to search for the maximum exact match until the updated starting position is found.
  • the above sequence search method provided in the embodiment of the present application may be provided to the user in the form of a cloud service.
  • the cloud service provider can run the code corresponding to the sequence search method in the cloud environment, so as to provide the sequence search service in the form of a cloud service.
  • The cloud server provided by the cloud service provider can present a sequence search interface to the user, such as a graphical user interface (GUI) for sequence search, and the cloud server then receives the target sequence to be searched that the user enters through the GUI.
  • The cloud server in the background may determine at least one subsequence from the target sequence, where the subsequence starts with a character in the target sequence. The cloud server then searches for the subsequence in the acceleration library and obtains the position in the reference sequence of either the subsequence or the maximum exact match of the subsequence starting from the one character.
  • the above sequence search method can be implemented by code, and the code can be packaged as a software package.
  • When the software package runs on a terminal computing device (terminal for short), such as a desktop computer, a notebook, or a smart phone, or on a server, the terminal or server can execute the above-mentioned sequence search method.
  • a hardware vendor may also release (eg open source) an acceleration package for the hardware.
  • the acceleration package is specifically used to speed up the process of finding a target sequence in a reference sequence.
  • the CPU or GPU may receive the user's selection information, where the selection information indicates whether to enable the acceleration package, and if so, executes the sequence search method shown in the embodiment of the present application to improve search efficiency.
  • the apparatus 600 includes:
  • a determination module 602 configured to determine at least one subsequence from the target sequence, and the subsequence takes a character in the target sequence as a starting point;
  • a search module 604, configured to search for the subsequence in the acceleration library and obtain the position in the reference sequence of either the subsequence or the maximum exact match of the subsequence starting from the one character, where the acceleration library is used to accelerate the search for sequences of a set length value, and the length of the subsequence is the set length value.
  • the acceleration library includes at least one information structure, and the information structure is used to indicate a sample sequence or a range of maximum exact matching of the sample sequence starting from the first character.
  • the information structure includes at least one of a presence field and a length field, and a range field, where the presence field is used to represent whether a sample sequence exists in the reference sequence, and the range The field is used to characterize the sample sequence or the range of the maximum exact match of the sample sequence starting from the first character, and the length field is used to characterize the length of the sample sequence or the maximum exact match of the sample sequence.
  • the search module 604 is specifically used for:
  • the acceleration library is accessed according to the storage address to obtain the subsequence or the position in the reference sequence of the maximum exact match of the subsequence starting from the one character.
  • the acceleration library includes a first acceleration library located in a memory, and the set length value is a first length value.
  • the first length value is determined according to the size of the memory.
  • the acceleration library includes a second acceleration library located in an external memory, and the set length value is a second length value.
  • the second length value is determined according to the size of the external memory.
  • the second information structure further includes a comparison field, where the comparison field represents whether the length of the maximum exact match is greater than a preset length threshold, and the preset length threshold is determined according to the size of the memory and the ratio of external-memory access time to memory access time.
  • the determining module 602 is specifically configured to:
  • At least one first subsequence and at least one second subsequence are determined from the target sequence, the at least one first subsequence and the at least one second subsequence start at a character in the target sequence, and the second subsequence is longer than the first subsequence;
  • the acceleration library includes a first acceleration library located in a memory and a second acceleration library located in an external memory;
  • the search module 604 is specifically used for:
  • the first subsequence is looked up in the first acceleration library, and the second subsequence is looked up in the second acceleration library.
  • the search module 604 is specifically used for:
  • the first length value is determined according to the size of the memory, or according to the ratio of external-memory access time to memory access time.
  • the apparatus 600 further includes:
  • the construction module is specifically configured to:
  • According to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, the sample sequence is searched for in the reference sequence by the BWT algorithm to obtain a search result, where the search result represents whether the sample sequence exists in the reference sequence,
  • the range, in the two-dimensional array, of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
  • the sequence is a gene sequence.
  • the sequence search apparatus 600 may correspond to executing the methods described in the embodiments of the present application, and the above-mentioned and other operations and/or functions of the respective modules/units of the sequence search apparatus 600 are intended to implement the corresponding flows of the methods in the embodiments shown in FIG. 3, FIG. 4, and FIG. 5, respectively; for brevity, they are not repeated here.
  • the embodiment of the present application further provides a processing device 200 for implementing the function of the sequence search apparatus 600 in the embodiment shown in FIG. 6 .
  • the specific implementation of the processing device 200 may refer to the description of the related content in FIG. 2 , which will not be repeated here.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium includes instructions, the instructions instruct a computer to execute the above sequence search method applied to the sequence search apparatus 600 .
  • An embodiment of the present application further provides a computer program product, and when the computer program product is executed by a computer, the computer executes any one of the foregoing sequence search methods.
  • the computer program product can be a software installation package, which can be downloaded and executed on a computer if any one of the aforementioned sequence finding methods needs to be used.

Abstract

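A sequence search method includes: determining, from a target sequence, at least one subsequence whose length is a preset length value, where the subsequence starts at a character in the target sequence; and then searching for the subsequence in an acceleration library used to accelerate the search for sequences of the set length value, to obtain the position in a reference sequence of the subsequence or of the maximum exact match of the subsequence starting from that character. The method segments the target sequence to be searched and then uses a pre-built acceleration library to accelerate the search for the resulting subsequences, which avoids searching the characters of the target sequence one by one.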

Description

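Sequence search method, apparatus, device, and medium
This application claims priority to Chinese Patent Application No. 202010856456.3, filed with the China Patent Office on August 24, 2020 and entitled "Sequence search method, apparatus, device, and medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computing technologies, and in particular, to a sequence search method, apparatus, device, and computer-readable storage medium.
Background
A sequence is a character string formed by multiple characters in an ordered relationship. Depending on the type of characters that make up a sequence, sequences can be divided into numeric sequences, letter sequences, Chinese-character sequences, and mixed sequences composed of several character types. In some examples, numeric sequences include telephone numbers, bank card numbers, and the like, and letter sequences include gene sequences (which usually consist of the letters A, C, G, and T, representing different types of bases).
In many scenarios, it is necessary to find out whether a target sequence exists in a given reference sequence. Taking gene sequencing as an example, the target sequences obtained by testing a number of samples (referred to as reads for ease of description) usually need to be searched for in a reference genome (referred to as the reference for ease of description), to obtain the position in the reference genome of the target sequence or of the maximum exact match of the target sequence.
At present, the industry mainly uses the Burrows-Wheeler transform and full-text index (burrows wheeler transform-full text index in minute space, BWT-FM) algorithm for the search. Specifically, in the data preparation stage, the reference genome undergoes the BW transform, which outputs the index BWT (the string formed by the last characters of the sorted cyclic rotations) and the suffix array (SA). A two-dimensional array (occurrence, OCC) can further be determined from the index BWT. In the query stage, the target sequence is searched for by accessing OCC.
However, the above method has low search efficiency, which degrades search performance. The industry urgently needs an efficient sequence search method.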
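Summary
This application provides a sequence search method. The method segments the target sequence to be searched and then uses a pre-built acceleration library to accelerate the search for the resulting subsequences, which avoids searching the characters of the target sequence one by one and improves search efficiency. This application further provides an apparatus, a device, a computer-readable storage medium, and a computer program product corresponding to the method.
According to a first aspect, this application provides a sequence search method. The method may be performed by any processing device with data processing capability. The processing device may determine, from a target sequence, at least one subsequence whose length is a set length value, where the subsequence starts at a character in the target sequence. The processing device then searches for the subsequence in an acceleration library dedicated to accelerating the search for sequences of the set length value, to obtain the position in a reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
The method segments the target sequence according to the set length value and uses the pre-built acceleration library to accelerate the search for the resulting subsequences, which avoids searching the characters of the target sequence one by one, improves search efficiency, and thereby improves search performance.
In some possible implementations, the acceleration library includes at least one information structure, and the information structure is used to indicate the range of a sample sequence or of the maximum exact match of the sample sequence starting from its first character. In this way, the processing device can directly obtain the range of the subsequence, or of the maximum exact match of the subsequence starting from its first character, from the information indicated by the information structure in the acceleration library, which improves search efficiency.
In some possible implementations, the information structure includes a range field and at least one of a presence field and a length field. The presence field represents whether a sample sequence exists in the reference sequence, the range field represents the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and the length field represents the length of the sample sequence or of the maximum exact match of the sample sequence.
In some embodiments, the information structure may include the presence field and the range field. In other embodiments, the information structure may include the range field and the length field. The information structure may also include the presence field, the range field, and the length field.
In this way, the processing device can obtain the range of the subsequence, or of the maximum exact match of the subsequence starting from its first character, from the range field and at least one of the presence field and the length field in the information structure, without comparing characters one by one, which improves search efficiency.
In some possible implementations, the information structure of a sample sequence in the acceleration library may be stored at a storage address given by a mapping between sequences and storage addresses. When searching for a subsequence, the processing device may determine the storage address corresponding to the subsequence according to that mapping, and then access the acceleration library at that storage address to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
In this way, when searching for the target sequence, the processing device obtains the search result for the subsequence portion with a single memory access, which reduces the number of memory accesses, improves search efficiency, and improves search performance.
In some possible implementations, the acceleration library includes a first acceleration library located in memory, and the set length value is a first length value. Memory, also called internal memory, temporarily holds the processor's operation data and exchanges data with external memory (also called external storage) such as a disk.
Because the first acceleration library is located in memory, the processing device does not need to load it into memory, which saves loading time and improves search efficiency.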
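In some possible implementations, the first length value is determined according to the size of the memory. Because the first acceleration library is located in memory, the storage space occupied by the information structures of the sample sequences in the first acceleration library should not exceed the storage space of the memory. That is, the first length value len_C should satisfy the following formula:
$m^{len_C} \cdot w \le P$   [formula image PCTCN2021095825-appb-000001]
Here, P is the size of the memory; m is the number of possible values in the value space of each character of the target sequence (for example, m may be 4 in a gene sequencing scenario); and w is the space occupied by each information structure. For example, if the presence field occupies 1 byte, the range field occupies 8+8 bytes, and the length field occupies 8 bytes, then w is 25.
This prevents the memory from being exhausted, which would affect the sequence search, and thus safeguards search performance.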
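In some possible implementations, the acceleration library includes a second acceleration library located in external memory, and the set length value is a second length value. External memory refers to storage devices other than memory. In some embodiments, the external memory includes any one or more of a disk, a solid state drive (SSD), flash memory, and the like.
Because the storage space of external memory is generally larger than that of memory, longer subsequences can be looked up in the second acceleration library, which improves efficiency and search performance.
In some possible implementations, the second length value is determined according to the size of the external memory. The second length value len' may satisfy the following formula:
$m^{len'} \cdot w \le Q$
Here, Q is the size of the external memory, for example the size of a disk; m is the number of possible values in the value space of each character of the target sequence; and w is the space occupied by each information structure.
The time for the processing device to randomly access the external memory once is δ times the time to randomly access the memory once; that is, the ratio of external-memory access time to memory access time is δ. A third length value len_E can be set that satisfies the following formula:
[formula image PCTCN2021095825-appb-000002: the relation defining len_E in terms of δ; not reproduced in this text]
When the subsequence length is len_E, searching for the subsequence by accessing memory takes about as long as searching for it by accessing external memory. For a subsequence of length len_C, the memory access time is negligible. Therefore, when the subsequence length is len_C + len_E, searching by accessing memory and searching by accessing external memory take about the same time. For this reason, the processing device may set len' to be greater than len_C + len_E. Specifically, len' may be set to:
$len' = len_C + len_E + len_F$
Here, len_F is a fourth length value; the second length value equals the sum of the first, third, and fourth length values. Specifically, the processing device may substitute this formula into the formula that the second length value should satisfy, and solve for len_F.
When the length of the maximum exact match is less than len_C + len_E + len_F, the query time can be shortened substantially, which improves query efficiency and query performance.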
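In some possible implementations, the second information structure further includes a comparison field. The comparison field represents whether the length of the maximum exact match is greater than a preset length threshold. The preset length threshold is determined according to the size of the memory and the ratio of external-memory access time to memory access time; for example, the preset length threshold may be len_C + len_E. The processing device can thus quickly learn from the comparison field how the length of the maximum exact match compares with the preset length threshold, and this comparison result can assist the subsequent search.
In some possible implementations, the processing device may combine the in-memory breakpoint search method and the external-memory breakpoint search method, so as to draw on the advantages of both and further improve search efficiency.
Specifically, the processing device may determine at least one first subsequence and at least one second subsequence from the target sequence, where the at least one first subsequence and the at least one second subsequence start at a character in the target sequence and the second subsequence is longer than the first subsequence. The processing device may search for the first subsequence in the first acceleration library located in memory and search for the second subsequence in the second acceleration library located in external memory.
By combining the in-memory breakpoint search method and the external-memory breakpoint search method, the method can query maximum exact matches of any length and is not limited to maximum exact matches within a bounded length. In addition, the method allows multiple branches to search asynchronously in parallel, which improves search efficiency.
In some possible implementations, when the maximum exact match of the second subsequence starting from the one character is found in the second acceleration library, the search for the first subsequence in the first acceleration library is stopped; and when the maximum exact match of the first subsequence starting from the one character is found in the first acceleration library, the search for the second subsequence in the second acceleration library is stopped.
In this implementation, when one branch finds a result first, the branches running in parallel with it can stop searching, which avoids wasting resources.
In some possible implementations, the first length value is determined according to the size of the memory, or according to the ratio of external-memory access time to memory access time. In some embodiments, the first length value may be len_C. In other embodiments, the first length value may be len_E. In this way, even if the maximum exact match is short, the subsequence can be found quickly through the in-memory breakpoint search branch, which improves search efficiency.
In some possible implementations, the processing device may search for a sample sequence in the reference sequence to obtain a search result, where the search result represents the position in the reference sequence of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and then build the acceleration library from the search result. This assists subsequent sequence searches and improves search efficiency.
In some possible implementations, the processing device may search for the sample sequence in the reference sequence by the BWT algorithm, according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, to obtain the search result. The search result represents whether the sample sequence exists in the reference sequence; the range, in the two-dimensional array, of the sample sequence or of the maximum exact match of the sample sequence starting from its first character; and the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
In this way, the processing device can speed up the search for sample sequences, which speeds up and improves the efficiency of building the acceleration library.
In some possible implementations, the sequence is a gene sequence. The position of a gene sequence in the genome can thus be located quickly in a gene sequencing scenario, which improves the efficiency of gene sequence search.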
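According to a second aspect, this application provides a sequence search apparatus. The apparatus includes:
a determining module, configured to determine at least one subsequence from a target sequence, where the subsequence starts at a character in the target sequence; and
a search module, configured to search for the subsequence in an acceleration library, to obtain the position in a reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character, where the acceleration library is used to accelerate the search for sequences of a set length value, and the length of the subsequence is the set length value.
In some possible implementations, the acceleration library includes at least one information structure, and the information structure is used to indicate the range of a sample sequence or of the maximum exact match of the sample sequence starting from its first character.
In some possible implementations, the information structure includes a range field and at least one of a presence field and a length field, where the presence field represents whether a sample sequence exists in the reference sequence, the range field represents the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and the length field represents the length of the sample sequence or of the maximum exact match of the sample sequence.
In some possible implementations, the search module is specifically configured to:
determine the storage address corresponding to the subsequence according to a mapping between sequences and storage addresses; and
access the acceleration library according to the storage address, to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
In some possible implementations, the acceleration library includes a first acceleration library located in memory, and the set length value is a first length value.
In some possible implementations, the first length value is determined according to the size of the memory.
In some possible implementations, the acceleration library includes a second acceleration library located in external memory, and the set length value is a second length value.
In some possible implementations, the second length value is determined according to the size of the external memory.
In some possible implementations, the second information structure further includes a comparison field, where the comparison field represents whether the length of the maximum exact match is greater than a preset length threshold, and the preset length threshold is determined according to the size of the memory and the ratio of external-memory access time to memory access time.
In some possible implementations, the determining module is specifically configured to:
determine at least one first subsequence and at least one second subsequence from the target sequence, where the at least one first subsequence and the at least one second subsequence start at a character in the target sequence, and the second subsequence is longer than the first subsequence;
the acceleration library includes a first acceleration library located in memory and a second acceleration library located in external memory; and
the search module is specifically configured to:
search for the first subsequence in the first acceleration library, and search for the second subsequence in the second acceleration library.
In some possible implementations, the search module is specifically configured to:
when the maximum exact match of the second subsequence starting from the one character is found in the second acceleration library, stop searching for the first subsequence in the first acceleration library; and when the maximum exact match of the first subsequence starting from the one character is found in the first acceleration library, stop searching for the second subsequence in the second acceleration library.
In some possible implementations, the first length value is determined according to the size of the memory, or according to the ratio of external-memory access time to memory access time.
In some possible implementations, the apparatus further includes:
a construction module, configured to search for a sample sequence in the reference sequence to obtain a search result, where the search result represents the position in the reference sequence of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and to build the acceleration library according to the search result.
In some possible implementations, the construction module is specifically configured to:
search for the sample sequence in the reference sequence by the BWT algorithm, according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, to obtain the search result, where the search result represents whether the sample sequence exists in the reference sequence; the range, in the two-dimensional array, of the sample sequence or of the maximum exact match of the sample sequence starting from its first character; and the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
In some possible implementations, the sequence is a gene sequence.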
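According to a third aspect, this application provides a computing device that includes a processor and a memory. The processor and the memory communicate with each other. The processor is configured to execute instructions stored in the memory, so that the computing device performs the sequence search method according to the first aspect or any implementation of the first aspect.
According to a fourth aspect, this application provides a computer-readable storage medium that stores instructions, where the instructions instruct a computing device to perform the sequence search method according to the first aspect or any implementation of the first aspect.
According to a fifth aspect, this application provides a computer program product containing instructions that, when run on a computing device, cause the computing device to perform the sequence search method according to the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the foregoing aspects, this application may further combine them to provide more implementations.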
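Brief Description of the Drawings
To describe the technical methods of the embodiments of this application more clearly, the drawings used in the embodiments are briefly introduced below.
FIG. 1 is a scenario architecture diagram of a sequence search method according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of a processing device according to an embodiment of this application;
FIG. 3 is a flowchart of a sequence search method according to an embodiment of this application;
FIG. 4 is a flowchart of a sequence search method according to an embodiment of this application;
FIG. 5 is a flowchart of a sequence search method according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a sequence search apparatus according to an embodiment of this application.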
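Detailed Description
The terms "first" and "second" in the embodiments of this application are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features.
Some technical terms used in the embodiments of this application are introduced first.
A sequence is a character string formed by multiple characters in an ordered relationship. Depending on the type of characters that make up a sequence, sequences can be divided into numeric sequences, letter sequences, Chinese-character sequences, and mixed sequences composed of several character types. Numeric sequences include telephone numbers, bank card numbers, and the like; for example, a numeric sequence may be 132xxxx2323. Letter sequences include gene sequences, for example GGGCCAACTACC, in which the letters A, C, G, and T represent different types of bases.
Sequence search means searching for a short sequence in a long sequence. The long sequence is also called the reference sequence, and the short sequence is called the target sequence. Sequence search is thus the search for the target sequence in the reference sequence: if the target sequence exists in the reference sequence, the position of the target sequence in the reference sequence is returned; if it does not, the maximum exact match of the target sequence in the reference sequence is returned, specifically the maximum exact match of the target sequence starting from a specified position (a specified character).
Here, the reference sequence may be a long string R and the target sequence a short string s. Taking the character at position c of s as the starting point, the longest of all substrings of s that match exactly in R is called the maximum exact match of s starting from c. For ease of understanding, the maximum exact match is illustrated with a concrete example. In this example, the long string R is "addsdfyihadsdk" and the short string s is "dsdfyask". Taking the character d at position 1 of s as the starting point, the substrings that match exactly in R include "dsdfy" and "dfy". The longest of these is "dsdfy", so "dsdfy" is the maximum exact match of s starting from the character d.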
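The definition of the maximum exact match can be illustrated with a brute-force sketch. The function name `maximum_exact_match` and the 0-based indexing are assumptions made for illustration (the text above counts positions from 1); this is a direct reading of the definition, not the accelerated search procedure of this application.

```python
def maximum_exact_match(reference, target, start):
    """Return (match, position) for the longest prefix of target[start:]
    that occurs in reference; position is the first occurrence, or -1."""
    best_len, best_pos = 0, -1
    length = 1
    while start + length <= len(target):
        pos = reference.find(target[start:start + length])
        if pos < 0:
            break  # a longer prefix cannot match once a shorter one fails
        best_len, best_pos = length, pos
        length += 1
    return target[start:start + best_len], best_pos


# Example from the text: R = "addsdfyihadsdk", s = "dsdfyask"
print(maximum_exact_match("addsdfyihadsdk", "dsdfyask", 0))
```

With the strings from the example, the call returns the match "dsdfy" together with its 0-based position in R.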
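At present, the industry mainly uses the BWT-FM algorithm for sequence search. Specifically, after the BW transform, the reference sequence yields the index BWT and the suffix array SA, and the two-dimensional array OCC can further be generated from the index BWT. Searching for a target sequence usually requires many memory accesses (specifically, accesses to the two-dimensional array OCC held in memory), which results in low search efficiency and degraded search performance.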
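To make the cost of the baseline concrete, the following is a minimal textbook-style sketch of FM-index backward search. It is an illustration under assumptions, not the implementation referred to in this application: the names `build_fm_index` and `backward_search`, the `$` sentinel, and the naive suffix sort are all choices made here. The point it shows is that each character of the pattern triggers two lookups in the OCC table.

```python
def build_fm_index(reference):
    """Build a naive suffix array, the C table, and the OCC table."""
    text = reference + "$"                      # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)      # last column of sorted rotations
    alphabet = sorted(set(text))
    counts, total = {}, 0                       # counts[ch]: characters smaller than ch
    for ch in alphabet:
        counts[ch] = total
        total += text.count(ch)
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):                 # occ[ch][i]: occurrences of ch in bwt[:i]
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, counts, occ


def backward_search(pattern, sa, counts, occ):
    """Return sorted positions of pattern in the reference (empty if absent)."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):                # one step per character
        if ch not in counts:
            return []
        lo = counts[ch] + occ[ch][lo]           # OCC access 1
        hi = counts[ch] + occ[ch][hi]           # OCC access 2
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, counts, occ = build_fm_index("GGGCCAACTACC")
print(backward_search("CC", sa, counts, occ))
```

Because `lo` and `hi` jump around the table unpredictably, these two accesses per character are effectively random memory accesses, which is the overhead a precomputed table for fixed-length prefixes is meant to avoid.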
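In view of this, the embodiments of this application provide an efficient sequence search method. The method may be performed by a processing device with data processing capability. The processing device may be a server or a terminal, where terminals include but are not limited to desktop computers, notebook computers, tablet computers, and smartphones. In some possible implementations, the processing device may also be a cluster.
Specifically, the processing device may determine at least one subsequence from the target sequence, where the subsequence starts at a character in the target sequence. The processing device then searches for the subsequence in a pre-built acceleration library. The acceleration library is used to accelerate the search for sequences of a set length value, and the length of the subsequence is the set length value. The processing device therefore does not need to compare the characters of the subsequence one by one; it looks the subsequence up directly in the acceleration library and obtains the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
The method segments the target sequence to be searched and then uses the pre-built acceleration library to accelerate the search for the resulting subsequences, which avoids searching the characters of the target sequence one by one and improves search efficiency. Furthermore, the method can use the mapping between sequences and storage addresses to access the storage address of a subsequence directly and obtain the position in the reference sequence of the subsequence or of its maximum exact match starting from the one character. This reduces the number of memory accesses by the processing device; in particular, when the maximum exact match is shorter than the subsequence, only one random memory access is needed. Search efficiency is improved, search cost is reduced, and search performance is improved.
The acceleration library may include at least one information structure. The information structure indicates the range of a sample sequence or of the maximum exact match of the sample sequence starting from its first character. In some implementations, the information structure includes a presence field and a range field. The presence field represents whether a sample sequence of the same length as the subsequence exists in the reference sequence. The range field represents the range of the subsequence (when the subsequence exists in the reference sequence) or the range of the maximum exact match of the subsequence starting from the one character (when the subsequence does not exist in the reference sequence). In other implementations, the information structure includes the range field and a length field. The length field represents the length of the subsequence or of the maximum exact match of the subsequence starting from the one character. Further, the information structure may include the presence field, the range field, and the length field. In this way, the processing device can obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
The processing device can thus use the acceleration library to determine directly whether the subsequence exists in the reference sequence. If it does, the position of the subsequence in the reference sequence and the length of the subsequence are returned; if it does not, the length of the maximum exact match of the subsequence starting from its first character is returned.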
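For ease of understanding the technical solutions of this application, the sequence search method is described below in a gene sequencing scenario.
Referring to the schematic diagram of the application scenario of the sequence search method shown in FIG. 1, the scenario includes a detection device 100, a processing device 200, and a user terminal 300. A communication connection is established between the detection device 100 and the user terminal 300, and between the processing device 200 and the user terminal 300. FIG. 1 uses a server as an example of the processing device 200; in other implementations, the processing device 200 may be a terminal, a cluster, or a similar device.
Specifically, the detection device 100 is configured to test biological tissue such as blood or saliva to obtain a target sequence. The detection device 100 may send the target sequence to the user terminal 300, and the user terminal 300 may submit the target sequence to the processing device 200. On receiving the target sequence, the processing device 200 determines at least one subsequence from the target sequence, where the subsequence starts at a character of the target sequence, and then searches for the subsequence in the acceleration library to obtain the position in the reference genome (the reference sequence) of the subsequence or of the maximum exact match of the subsequence starting from the one character.
In this way, when searching for the target sequence, the processing device 200 obtains the search result for the subsequence portion with a single memory access, which reduces the number of memory accesses, improves search efficiency, and improves search performance.
The system architecture of the sequence search method has been described above. Next, the processing device 200 in the system is described from the perspective of its hardware.
FIG. 2 is a schematic structural diagram of the processing device 200. It should be understood that FIG. 2 shows only part of the hardware structure and some of the software modules of the processing device 200. In a specific implementation, the processing device 200 may include more hardware, such as indicator lights and a buzzer, and more software modules, such as various applications.
As shown in FIG. 2, the processing device 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. The processor 202, the memory 204, and the communication interface 203 communicate through the bus 201.
The bus 201 may be a peripheral component interconnect (PCI) bus, a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of representation, only one thick line is used in FIG. 2, but this does not mean that there is only one bus or one type of bus.
The processor 202 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a micro processor (MP), a digital signal processor (DSP), and the like.
The communication interface 203 is used to communicate with the outside, for example to receive the target sequence sent by the user terminal 300 and to return to the user terminal 300 the position of the subsequence in the reference sequence or the position in the reference sequence of the maximum exact match of the subsequence starting from the one character.
The memory 204 may include volatile memory, for example random access memory (RAM). The memory 204 may also include non-volatile memory, for example read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD). RAM and ROM are referred to as memory, and HDDs and SSDs are referred to as external memory.
The memory 204 stores programs or instructions, for example those required to implement the sequence search method provided in the embodiments of this application. The processor 202 executes the programs or instructions to perform the foregoing sequence search method.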
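To make the technical solutions of this application clearer and easier to understand, the sequence search method provided in the embodiments of this application is described in detail below with reference to the drawings.
Referring to the flowchart of the sequence search method shown in FIG. 3, the method includes:
S302: The processing device 200 determines at least one subsequence from the target sequence.
The subsequence starts at a character in the target sequence. For example, the processing device 200 may take a group of characters in the target sequence that are spaced apart by a preset length value as starting points and determine a group of subsequences. The group includes at least one subsequence. When the group includes multiple subsequences, they are of equal length.
The target sequence may contain special characters, that is, characters other than the normal characters that make up a sequence. For example, in a gene sequencing scenario, when the detection device 100 cannot determine the type of a base, it may mark the base as N (a character other than A, C, G, and T). The processing device 200 may first take a character in the target sequence as the starting point, then take the character that is the preset length value away from the starting point as the end point, and then determine whether there is a special character between the starting point and the end point.
If there is a special character between the starting point and the end point, the starting point is updated to the character after the special character and the above steps are repeated: the end point is determined again, and the span between the starting point and the end point is checked for special characters, until the span contains none. The processing device 200 can then determine a subsequence from the characters between the starting point and the end point. The processing device 200 may further update the starting point and perform the above steps again to determine the next subsequence.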
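The segmentation in S302 can be sketched as follows. This is a minimal illustration under assumptions: the function name `split_into_subsequences`, the choice of `N` as the only special character, and the policy of placing consecutive subsequences back to back are all hypothetical details not fixed by the text.

```python
def split_into_subsequences(target, length, special="N"):
    """Return (start, subsequence) pairs of fixed length that contain
    no special character, restarting just after any special character."""
    segments = []
    start = 0
    while start + length <= len(target):
        window = target[start:start + length]
        bad = window.find(special)
        if bad >= 0:
            start += bad + 1        # move the start past the special character
            continue
        segments.append((start, window))
        start += length             # next start: one subsequence length later
    return segments


print(split_into_subsequences("ACGTNACGTACGT", 4))
```

A trailing window shorter than `length` is simply not emitted in this sketch; a real implementation would need to decide how to handle it.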
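S304: The processing device 200 searches for the subsequence in the acceleration library, to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
The acceleration library is used to accelerate the search for sequences of a set length value, and the length of the subsequence is the set length value. The processing device 200 can therefore look the subsequence up directly in the acceleration library and obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from its first character.
In some possible implementations, the acceleration library includes at least one information structure, and the information structure includes a presence field, a range field, and a length field. The presence field identifies whether a sample sequence of the same length as the subsequence is in the reference sequence; the range field identifies the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character; and the length field identifies the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
The value of the presence field may be a Boolean (bool) value, either true or false. In some embodiments, the field value may also be the number 1 or 0, representing true or false. The range field may specifically include a start identifier and an end identifier, which may be denoted start and end. The length field may be denoted length.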
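One possible in-memory layout for such an information structure is sketched below. The class name `InfoStruct` and the little-endian packed format are assumptions; the field sizes (1 byte for the presence field, 8+8 bytes for the range field, 8 bytes for the length field) follow the example sizes the description gives when it computes w = 25.

```python
import struct
from dataclasses import dataclass

# '<' disables padding: 1-byte bool + three unsigned 64-bit integers = 25 bytes
RECORD = struct.Struct("<?QQQ")


@dataclass
class InfoStruct:
    exists: bool   # presence field: is the sample sequence in the reference?
    start: int     # range field, start identifier
    end: int       # range field, end identifier
    length: int    # length field: length of the (maximum exact) match

    def pack(self) -> bytes:
        return RECORD.pack(self.exists, self.start, self.end, self.length)

    @classmethod
    def unpack(cls, raw: bytes) -> "InfoStruct":
        return cls(*RECORD.unpack(raw))


record = InfoStruct(exists=True, start=3, end=10, length=4)
print(RECORD.size, InfoStruct.unpack(record.pack()))
```

A fixed record size is what makes it possible to compute the storage address of a record directly from the sequence it describes.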
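A sample sequence can be generated by randomly sampling each character of the sequence from its value space. In a gene sequencing scenario, for example, the value space of each character is {A, C, G, T}. If the sample sequence length is 4, the sample sequences include AACT, GATT, CAGG, and so on.
For at least one sample sequence, the processing device 200 may search for the sample sequence in the reference sequence to obtain a search result. The search result represents the position of the sample sequence or of the maximum exact match of the sample sequence starting from its first character. The processing device 200 may then build the acceleration library from the search results, so that a later search for a subsequence can directly return the position in the reference sequence of the subsequence or of its maximum exact match starting from its first character.
In some possible implementations, the processing device 200 may search for the sample sequence in the reference sequence using a hash search method or the BWT-FM method to obtain the search result. For ease of understanding, searching for sample sequences using the BWT-FM method in a gene sequencing scenario is used as an example below.
Specifically, the processing device 200 searches for the sample sequence in the reference sequence by the BWT algorithm, according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, to obtain the search result. The search result represents whether the sample sequence exists in the reference sequence; the range, in the two-dimensional array OCC, of the sample sequence or of the maximum exact match of the sample sequence starting from its first character (this range can be used to determine the position in the reference sequence); and the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character. The processing device can thus obtain the information structure for at least one sample sequence from the search result and build the acceleration library from these information structures.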
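How a stored range can be turned into positions in the reference sequence may be sketched with a naive suffix array. This is an illustrative assumption about the meaning of the range: here it is taken to be an inclusive interval of ranks in the sorted suffix order, and the helper names `suffix_array`, `sa_interval`, and `positions` are hypothetical. The interval is found by a linear scan purely for clarity.

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Inclusive rank interval [start, end] of suffixes that begin with pattern."""
    hits = [rank for rank, pos in enumerate(sa) if text.startswith(pattern, pos)]
    return (hits[0], hits[-1]) if hits else None


def positions(sa, start, end):
    """Map every rank in [start, end] to its position in the reference."""
    return sorted(sa[start:end + 1])


reference = "GGGCCAACTACC"
sa = suffix_array(reference)
start, end = sa_interval(reference, sa, "CC")
print(positions(sa, start, end))
```

Suffixes sharing a prefix are adjacent in sorted order, so two integers are enough to record every occurrence of a sample sequence, however many there are.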
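In some possible implementations, the processing device 200 may also establish a mapping between sequences and storage addresses. For example, gene sequences can be counted in base 4, with the bases A, C, G, and T representing 0, 1, 2, and 3 respectively. AACT then represents the base-4 value 0013, which converts to 0+0+1*4+3=7. Assuming that the information structure for one gene sequence occupies K bytes, the 7 values 0 to 6 occupy 7*K bytes, and the storage address (specifically, the start address) of AACT may be 0x00+7*K. The processing device 200 may store the information structure of each sequence at the corresponding storage address according to this mapping.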
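The base-4 address mapping can be written out directly. The function names `sequence_index` and `record_offset` are assumptions for illustration; the encoding (A, C, G, T as 0, 1, 2, 3) and the AACT example come from the paragraph above.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def sequence_index(sequence):
    """Interpret a gene sequence as a base-4 number (A=0, C=1, G=2, T=3)."""
    index = 0
    for base in sequence:
        index = index * 4 + BASE_CODE[base]
    return index


def record_offset(sequence, record_size, base_address=0):
    """Start address of the information structure for the given sequence."""
    return base_address + sequence_index(sequence) * record_size


print(sequence_index("AACT"), record_offset("AACT", record_size=25))
```

Because the offset is computed arithmetically, a lookup needs no hashing and no comparison of stored keys; it is a single read at a known address.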
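In this way, when searching for a subsequence, the processing device 200 can determine the storage address corresponding to the subsequence according to the mapping between sequences and storage addresses, and then access the acceleration library at that storage address to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character. This further improves sequence search efficiency.
Further, the acceleration library may be stored in memory and/or external memory for use when searching for sequences. Memory is usually smaller than external memory, so an acceleration library stored in memory is generally smaller than one stored in external memory. For ease of description, the embodiments of this application refer to the acceleration library stored in memory as the first acceleration library and the one stored in external memory as the second acceleration library.
In some possible implementations, the processing device 200 may search for the subsequence through the first acceleration library to improve search efficiency. Specifically, the first acceleration library includes at least one first information structure. The presence field of each first information structure represents whether a sample sequence whose length is the first length value exists in the reference sequence. The range field represents the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character. The length field represents the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
The processing device 200 determines, in the first acceleration library, the sample sequence that matches the subsequence, obtains the first information structure of that sample sequence, and determines from its presence field whether the subsequence exists in the reference sequence. If the value of the presence field is true or 1, the subsequence exists in the reference sequence. The processing device 200 determines the range of the subsequence in the reference sequence from the value of the range field, and the length of the subsequence from the value of the length field. If the value of the presence field is false or 0, the subsequence does not exist in the reference sequence. The processing device 200 determines, from the value of the range field, the range in the reference sequence of the maximum exact match of the subsequence starting from its first character, and determines the length of that maximum exact match from the value of the length field.
The above method of searching for sequences using the first acceleration library may also be called the in-memory breakpoint search method. With this method, the length of the subsequence is the first length value, denoted len_C. Because the length of the subsequence equals the length of the sample sequence, len_C satisfies the following formula:
$m^{len_C} \cdot w \le P$   (1)   [formula image PCTCN2021095825-appb-000003]
Here, P is the size of the memory; m is the number of possible values in the value space of each character of the target sequence (for example, m may be 4 in a gene sequencing scenario); and w is the space occupied by each information structure. For example, if the presence field occupies 1 byte, the range field occupies 8+8 bytes, and the length field occupies 8 bytes, then w is 25.
On this basis, the first length value can be determined from the size of the memory. Specifically, the processing device 200 substitutes the size of the memory into formula (1) and solves for len_C.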
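A small sketch of solving the capacity constraint for the largest admissible length follows. It assumes the constraint has the form m^len · w ≤ capacity, the same form as formula (2) for external memory; the function name `max_sample_length` and the 16 GiB budget are illustrative assumptions.

```python
def max_sample_length(capacity_bytes, alphabet_size=4, record_size=25):
    """Largest length such that alphabet_size**length * record_size
    still fits within capacity_bytes."""
    length = 0
    while alphabet_size ** (length + 1) * record_size <= capacity_bytes:
        length += 1
    return length


# Hypothetical budget: 16 GiB of memory reserved for the first acceleration library
print(max_sample_length(16 * 2**30))
```

Since the table grows by a factor of m with every additional character, each extra character of lookup length costs four times the storage in the gene sequencing case.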
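In the in-memory breakpoint search method, for a reference sequence R and a target sequence s, the processing device 200 determines from the target sequence the subsequence s[c : c+len_C] that starts at c and has length len_C, and then searches for s[c : c+len_C] in the first acceleration library. If s[c : c+len_C] is in R, its position in R is returned; if it is not in R, the maximum exact match of s in R starting from c and the position of that match in R are returned.
When the processing device 200 uses the in-memory breakpoint search method and the length of the maximum exact match is less than len_C, the query needs only one random memory access, so the query cost is negligible, which greatly improves query efficiency and query performance.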
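The in-memory breakpoint lookup can be sketched end to end for a toy reference. This is a simplification under assumptions: the library is built by brute-force string search rather than BWT-FM, each record stores a single position in the reference instead of an OCC range, and the names `build_library` and `lookup` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def build_library(reference, k):
    """Precompute, for every length-k sample, (exists, position, match_length).
    Records are ordered by the base-4 value of the sample (A=0 ... T=3)."""
    library = []
    for chars in product(ALPHABET, repeat=k):
        sample = "".join(chars)
        length = k
        while length > 0 and reference.find(sample[:length]) < 0:
            length -= 1                      # shrink to the longest matching prefix
        position = reference.find(sample[:length]) if length else -1
        library.append((length == k, position, length))
    return library


def lookup(library, subsequence):
    """Single indexed access: no character-by-character comparison."""
    index = 0
    for base in subsequence:
        index = index * 4 + ALPHABET.index(base)
    return library[index]


library = build_library("GGGCCAACTACC", 3)
print(lookup(library, "CCA"), lookup(library, "CCT"))
```

All of the search work is paid once, when the library is built; at query time a present subsequence and an absent one cost the same single access.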
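In other possible implementations, the processing device 200 may search for sequences through the second acceleration library to improve search efficiency. Specifically, the second acceleration library includes at least one second information structure. The presence field of each second information structure represents whether a sample sequence whose length is the second length value exists in the reference sequence. The range field represents the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character. The length field represents the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
The processing device 200 determines, in the second acceleration library, the sample sequence that matches the subsequence, obtains the second information structure of that sample sequence, and determines from its presence field whether the subsequence exists in the reference sequence. If the value of the presence field is true or 1, the subsequence exists in the reference sequence. The processing device 200 determines the range of the subsequence in the reference sequence from the value of the range field, and the length of the subsequence from the value of the length field. If the value of the presence field is false or 0, the subsequence does not exist in the reference sequence. The processing device 200 determines, from the value of the range field, the range in the reference sequence of the maximum exact match of the subsequence starting from its first character, and determines the length of that maximum exact match from the value of the length field.
The above method of searching for sequences using the second acceleration library may also be called the external-memory breakpoint search method. When the second acceleration library is stored on a disk in external memory, it may be called the disk breakpoint search method. With this method, the length of the subsequence is the second length value, which may be greater than the first length value. For ease of description, the second length value is denoted len'. Because the length of the subsequence equals the length of the sample sequence, len' satisfies the following formula:
$m^{len'} \cdot w \le Q$   (2)
Here, Q is the size of the external memory, for example the size of a disk; m is the number of possible values in the value space of each character of the target sequence; and w is the space occupied by each information structure.
On this basis, the second length value can be determined from the size of the external memory (for example, a disk). Specifically, the processing device 200 substitutes the size of the external memory into formula (2) and solves for len'.
The time for the processing device 200 to randomly access the external memory once is δ times the time to randomly access the memory once; that is, the ratio of external-memory access time to memory access time is δ. A third length value len_E can be set that satisfies the following formula:
[formula image PCTCN2021095825-appb-000004, formula (3): the relation defining len_E in terms of δ; not reproduced in this text]
When the subsequence length is len_E, searching for the subsequence by accessing memory takes about as long as searching for it by accessing external memory. For a subsequence of length len_C, the memory access time is negligible. Therefore, when the subsequence length is len_C + len_E, searching by accessing memory and searching by accessing external memory take about the same time. For this reason, the processing device 200 may set len' to be greater than len_C + len_E. Specifically, len' may be set to:
$len' = len_C + len_E + len_F$   (4)
Here, len_F is a fourth length value; the second length value equals the sum of the first, third, and fourth length values. Specifically, the processing device 200 may substitute formula (4) into formula (2) and solve for len_F.
In the external-memory breakpoint search method, for a reference sequence R and a target sequence s, the processing device 200 determines from the target sequence s the subsequence s[c : c+len_C+len_E+len_F] that starts at c and has length len_C+len_E+len_F, and then searches for it in the second acceleration library. If s[c : c+len_C+len_E+len_F] is in R, its position in R is returned; if it is not in R, the maximum exact match of s in R starting from c and the position of that match in R are returned.
It should be noted that the second information structure may further include a comparison field. The comparison field represents whether the length of the maximum exact match is greater than a preset length threshold, where the preset length threshold is determined according to the size of the memory and the ratio of external-memory access time to memory access time. In one example, the preset length threshold may be len_C + len_E.
On this basis, the first information structure may include b_1, start_1, end_1, and length_1, and the second information structure may include b_2, b_3, start_2, end_2, and length_2. Here, b_1 and b_2 are the values of the presence field in the first and second information structures respectively; b_3 is the value of the comparison field in the second information structure; start_1, end_1 and start_2, end_2 are the values of the range field in the first and second information structures respectively; and length_1 and length_2 are the values of the length field in the first and second information structures respectively.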
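The in-memory and external-memory breakpoint search methods have been described in detail above. In some possible implementations, the processing device 200 may combine the two methods so as to draw on the advantages of each and further improve search efficiency. Further, the processing device 200 may combine the in-memory breakpoint search method, the external-memory breakpoint search method, and the BWT-FM method to search for sequences and improve search efficiency.
Referring to the flowchart of the sequence search method shown in FIG. 4, the method includes:
S402: The processing device 200 obtains a target sequence.
Specifically, the processing device 200 may receive the target sequence sent by the user terminal 300, so as to search for it in the reference sequence and determine the position in the reference sequence of the target sequence or of the maximum exact match of the target sequence. In some possible implementations, the processing device 200 may also receive the target sequence directly from the detection device 100.
S404: The processing device 200 determines at least one first subsequence from the target sequence.
The first subsequence starts at a character in the target sequence, and its length may be len_C. For a target sequence s, the first subsequence determined by the processing device 200 may be s[c : c+len_C]. It should be noted that when s[c : c+len_C] contains a special character, the processing device 200 may skip the special character, take the position after it as c, and determine the first subsequence s[c : c+len_C] again.
S406: The processing device 200 determines at least one second subsequence from the target sequence.
The second subsequence starts at a character in the target sequence, and its starting point may be the same as that of the first subsequence. The length of the second subsequence may be len_C+len_E+len_F. For a target sequence s, the second subsequence determined by the processing device 200 may be s[c : c+len_C+len_E+len_F]. It should be noted that when s[c : c+len_C+len_E+len_F] contains a special character, the processing device 200 may skip the special character, take the position after it as c, and determine the second subsequence s[c : c+len_C+len_E+len_F] again.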
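S408: The processing device 200 searches for the first subsequence in the first acceleration library. If the first subsequence is in the reference sequence, the processing device 200 performs S410. If it is not, the processing device 200 performs S414.
S410: The processing device 200 obtains the position of the first subsequence in the reference sequence.
The processing device 200 may obtain the range of the first subsequence in the two-dimensional array OCC from the first information structure, specifically its range field, of the sample sequence in the first acceleration library that matches the first subsequence. The processing device 200 may then determine the position of the first subsequence in the reference sequence from that range.
S412: The processing device 200 uses the BWT-FM method to search the characters of the third length value that follow the first subsequence.
Specifically, the processing device 200 uses the BWT-FM algorithm to search sequentially through the len_E characters after s[c : c+len_C]. If the maximum exact match is found between positions c+len_C and c+len_C+len_E, the maximum exact match of the target sequence starting at position c (that is, starting from the first character of the subsequence) and its position in the reference sequence R are returned. The length of this maximum exact match is greater than or equal to len_C and less than or equal to len_C+len_E.
S414: The processing device 200 obtains the maximum exact match of the first subsequence starting from its first character, and the position of that maximum exact match in the reference sequence.
Because s[c : c+len_C] is not in the reference sequence, the maximum exact match of s[c : c+len_C] starting from its first character is also the maximum exact match of the target sequence starting at position c. The processing device 200 obtains the range in the two-dimensional array OCC of the maximum exact match of the first subsequence (the maximum exact match of the target sequence starting at position c) from the first information structure, specifically its range field, of the sample sequence in the first acceleration library that matches the first subsequence s[c : c+len_C]. The processing device 200 then determines the position of that maximum exact match in the reference sequence from the range.
S416: The processing device 200 searches for the second subsequence in the second acceleration library. If the second subsequence is in the reference sequence, the processing device 200 performs S418. If it is not, the processing device 200 performs S422.
S418: The processing device 200 obtains the position of the second subsequence in the reference sequence.
Similar to S410, the processing device 200 may obtain the range of the second subsequence in the two-dimensional array OCC from the second information structure, specifically its range field, of the sample sequence in the second acceleration library that matches the second subsequence. The processing device 200 may then determine the position of the second subsequence in the reference sequence from that range.
S420: The processing device 200 uses the BWT-FM method to search the characters that follow the second subsequence.
Specifically, the processing device 200 uses the BWT-FM algorithm to search sequentially through the characters after s[c : c+len_C+len_E+len_F] until the maximum exact match of the target sequence is found, and returns the maximum exact match and its position in the reference sequence. The length of this maximum exact match is greater than or equal to len_C+len_E+len_F.
S422: The processing device 200 obtains the maximum exact match of the second subsequence starting from its first character, and the position of that maximum exact match in the reference sequence.
Similar to S414, because s[c : c+len_C+len_E+len_F] is not in the reference sequence, the maximum exact match of s[c : c+len_C+len_E+len_F] starting from its first character is also the maximum exact match of the target sequence starting at position c. The processing device 200 obtains the range in the two-dimensional array OCC of the maximum exact match of the second subsequence (the maximum exact match of the target sequence starting at position c) from the second information structure, specifically its range field, of the sample sequence in the second acceleration library that matches the second subsequence. The processing device 200 then determines the position of that maximum exact match in the reference sequence from the range.
When the second information structure also includes the comparison field, the processing device 200 may also obtain the value of the comparison field. If the value is true or 1, the length of the maximum exact match of the target sequence starting at position c is greater than the preset length threshold, for example greater than len_C+len_E, and the processing device 200 may return the maximum exact match of the target sequence s in the reference sequence R starting from c, together with its position in R. If the value is false or 0, the length of the maximum exact match of the target sequence starting at position c is less than or equal to the preset length threshold, for example less than or equal to len_C+len_E, and the processing device 200 may end the current operation and return the maximum exact match and its position in the reference sequence through S412.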
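In the above embodiment, S404 and S408 may be performed in parallel with S406 and S416, or they may be performed one after another in a set order. When S408 and S416 are performed in parallel, if the processing device 200 first finds, in the second acceleration library, the maximum exact match of the second subsequence starting from the one character (S422 completes first), it stops searching for the first subsequence in the first acceleration library (S408). Similarly, if the processing device 200 first finds, in the first acceleration library, the maximum exact match of the first subsequence starting from the one character (S414 completes first), it stops searching for the second subsequence in the second acceleration library (S416).
If S410 is the first of S410, S414, S418, and S422 to complete, the processing device 200 proceeds to S412. While S412 is executing, if S422 completes first, S412 is stopped; if S422 has not completed, S412 and S418 continue. If S418 is the first of S410, S414, S418, and S422 to complete, the processing device 200 proceeds to S420. If S422 is the first of S410, S414, S418, and S422 to complete and the length of the maximum exact match is greater than len_C+len_E, S410, S412, S414, and S418 may be stopped. Searching with multiple branches in parallel makes it possible to obtain the search result in a shorter time. When one branch finds a result first, the branches running in parallel with it can stop searching, which avoids wasting resources.
During S412, if the maximum exact match is found between positions c+len_C and c+len_C+len_E, S422 is stopped. If s[c+len_C : c+len_C+len_E] is also in the reference sequence, S422 continues, to obtain the maximum exact match of the target sequence starting at position c and its position in the reference sequence.
During S422, if the length of the maximum exact match is less than or equal to len_C+len_E, S422 is stopped and S412 continues, to obtain the maximum exact match of the target sequence starting at position c and its position in the reference sequence. If the length of the maximum exact match is greater than len_C+len_E, S412 is stopped, and the maximum exact match of the target sequence starting at position c and its position in the reference sequence are obtained through S422.
In the embodiment shown in FIG. 4, to safeguard search efficiency when the maximum exact match is shorter than len_E, the length of the first subsequence in the in-memory breakpoint search method may also be set equal to len_E, where len_E is less than len_C. Correspondingly, the length of the sample sequences in the first acceleration library also equals len_E. This ensures that the search time is at least no longer than that of the BWT-FM method, which safeguards sequence search efficiency.
Based on the above description, the embodiments of this application combine the in-memory breakpoint search method, the external-memory breakpoint search method, and BWT-FM, so that maximum exact matches of any length can be queried, not only those within a bounded length. In addition, the method allows multiple branches to search asynchronously in parallel, which improves search efficiency.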
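The way two precomputed libraries and a character-by-character fallback fit together can be sketched as below. This is a deliberately simplified, sequential illustration: the description runs the branches in parallel with early stopping, whereas this sketch just consults the longer library first. Plain substring tests stand in for BWT-FM, each library maps a sample to its match length only, and every name (`prefix_match_len`, `build_library`, `extend`, `match_length`) is an assumption.

```python
from itertools import product


def prefix_match_len(reference, s):
    """Length of the longest prefix of s that occurs in reference."""
    n = len(s)
    while n > 0 and s[:n] not in reference:
        n -= 1
    return n


def build_library(reference, k):
    """Map every length-k sample over ACGT to its longest matching prefix length."""
    return {"".join(p): prefix_match_len(reference, "".join(p))
            for p in product("ACGT", repeat=k)}


def extend(reference, target, c, matched):
    """Stand-in for the BWT-FM step: extend the match one character at a time."""
    while c + matched < len(target) and target[c:c + matched + 1] in reference:
        matched += 1
    return matched


def match_length(reference, target, c, short_lib, long_lib, k_short, k_long):
    """Length of the maximum exact match of target starting at position c."""
    if c + k_long <= len(target):                 # external-memory branch
        n = long_lib[target[c:c + k_long]]
        return n if n < k_long else extend(reference, target, c, n)
    if c + k_short <= len(target):                # in-memory branch
        n = short_lib[target[c:c + k_short]]
        return n if n < k_short else extend(reference, target, c, n)
    return extend(reference, target, c, 0)        # too short for either library


ref, tgt = "GGGCCAACTACC", "CCAACTGG"
short_lib, long_lib = build_library(ref, 2), build_library(ref, 4)
print(match_length(ref, tgt, 0, short_lib, long_lib, 2, 4))
```

In either branch, a lookup that reports a match shorter than the sample length already gives the final answer; the fallback extension runs only when the whole sample is present in the reference.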
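Compared with searching with the entire two-dimensional array stored in memory, this method can substantially improve search performance. The degree of improvement varies with the length of the maximum exact match, as follows:
(1) When the length of the maximum exact match is less than len_C, the query time is negligible;
(2) When the length of the maximum exact match is less than len_C+len_E, the number of random memory accesses is reduced by 2*len_C, and average performance improves by at least 2.5 times;
(3) When the length of the maximum exact match is between len_C+len_E and len_C+len_E+len_F, the query time is a constant (corresponding to the time the BWT-FM algorithm takes to query a length of len_C), and average performance improves by at least 4 times;
(4) When the length of the exact match is greater than len_C+len_E+len_F, the number of random memory accesses is reduced by about 2*(len_C+len_F), and average performance in a gene sequencing scenario can improve by 3 times.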
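The sequence search method provided in the embodiments of this application is described in detail below for a gene sequencing scenario. Referring to the flowchart of the sequence search method shown in FIG. 5, the method includes:
S501: Mark the positions of special characters in the read, set c=0, and go to S502.
S502: Set position c of the read as the starting position, and go to S503.
S503: Determine whether position c is within the range of the read. If position c exceeds the length of the read, go to the End step and finish the query; otherwise go to both S604 and S704.
S604: Determine whether the base string that starts at position c and has length len_C contains a special character. If so, go to S6041; otherwise go to S605.
S6041: Update position c to the position after the special character.
S605: Use the base string that starts at position c and has length len_C as an index to query the first acceleration library and obtain the first information structure corresponding to the base string. Then go to S606.
The first information structure includes the Boolean value b_1, the OCC query range start_1 and end_1, and the length length_1.
S606: Determine whether b_1 in the first information structure is True. If so, go to S607; otherwise go to S6061.
S6061: Determine the information about the maximum exact match from the first information structure; set c += length_1, and go to S502.
The information about the maximum exact match includes its position in the reference and its length length_1. The position of the maximum exact match in the reference can be determined from its range start_1 and end_1 in OCC. Specifically, the suffix array SA is looked up over the interval [start_1, end_1] to obtain the value in SA for each integer in the interval; each such value is a starting position of the maximum exact match in the reference. The position of the maximum exact match in the reference can be determined from the starting position and the length of the maximum exact match.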
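S607: Determine whether the base string that starts at position c and has length len_C+len_E contains a special character. If so, go to S6071; otherwise go to S608.
S6071: Update position c to the position after the special character.
S608: Use the BWT-FM method to continue the query with OCC and SA until the length reaches len_C+len_E or the maximum exact match is determined before the length reaches len_C+len_E.
S609: Determine whether the maximum exact match was determined before the length reached len_C+len_E. If so, go to S6091.
S6091: Determine the information about the maximum exact match, set c += Len_max+1, and go to S502.
Here, Len_max is the length of the maximum exact match.
S704: Determine whether the base string that starts at position c and has length len_C+len_E+len_F contains a special character. If so, go to S7041; otherwise go to S705.
S7041: Update position c to the position after the special character.
S705: Use the base string that starts at position c and has length len_C+len_E+len_F as an index to query the second acceleration library and obtain the second information structure corresponding to the base string. Then go to S706.
The second information structure includes the Boolean values b_2 and b_3, the OCC query range start_2 and end_2, and the length length_2.
S706: Determine whether b_2 in the second information structure is True. If so, go to S708; otherwise go to S707.
S707: Determine whether b_3 in the second information structure is True. If so, go to S7071; otherwise end the current flow.
S7071: Determine the information about the maximum exact match from the second information structure; set c += length_2, and go to S502.
S708: Use the BWT-FM method to continue the query with OCC and SA until the maximum exact match is determined. Then go to S7081.
S7081: Determine the information about the maximum exact match. Set c += Len_max+1. Then go to S502.
In the above embodiment, when position c is updated to the position after the special character in S6041, S6071, and S7041, the information about the maximum exact match between the original starting position and the updated starting position may also be determined. For example, in S6071, c to c+len_C has already been matched, so the search for the maximum exact match can continue with BWT-FM until the updated starting position is reached.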
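The outer loop of FIG. 5, which advances the starting position c through the read, can be sketched as follows. This is a simplified illustration under assumptions: substring tests replace both acceleration libraries and BWT-FM, `N` is taken as the only special character, the start always advances by the match length plus one (as in S6091 and S7081), and the function name `scan_read` is hypothetical.

```python
def scan_read(reference, read, special="N"):
    """Walk through the read and collect (start, length, position) for the
    maximum exact match found at each visited starting position."""
    results = []
    c = 0
    while c < len(read):
        if read[c] == special:
            c += 1                      # restart just after a special character
            continue
        length = 0
        while (c + length < len(read)
               and read[c + length] != special
               and read[c:c + length + 1] in reference):
            length += 1                 # extend the match while it still occurs
        if length:
            results.append((c, length, reference.find(read[c:c + length])))
        c += length + 1                 # advance past the match and the mismatch
    return results


print(scan_read("GGGCCAACTACC", "CCAANGGGT"))
```

Each tuple gives where a match begins in the read, how long it is, and the first position at which it occurs in the reference.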
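The sequence search method provided in the embodiments of this application may be offered to users as a cloud service. Specifically, a cloud service provider may run the code corresponding to the sequence search method in a cloud environment and thereby provide a sequence search service. A cloud server of the provider may present a sequence search interface to the user, such as a graphical user interface (GUI) for sequence search, and then receive the target sequence to be searched that the user enters through the GUI. The cloud server in the back end may determine at least one subsequence from the target sequence, where the subsequence starts at a character in the target sequence. The cloud server then searches for the subsequence in the acceleration library to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
In some possible implementations, the sequence search method may be implemented in code, and the code may be packaged as a software package. A terminal computing device (terminal for short) such as a desktop computer, notebook, or smartphone, or a server, may obtain and install the software package. When the software package runs, the terminal or server can perform the sequence search method.
In other possible implementations, when a hardware vendor releases hardware such as a CPU or GPU, it may also release (for example, as open source) an acceleration package for that hardware. The acceleration package is specifically used to accelerate the process of searching for a target sequence in a reference sequence. The CPU or GPU may receive the user's selection information, which indicates whether to enable the acceleration package; if it is enabled, the sequence search method shown in the embodiments of this application is performed to improve search efficiency.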
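The sequence search method provided in the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 5. The apparatus and device provided in the embodiments of this application are described below with reference to the drawings.
Referring to the schematic structural diagram of the sequence search apparatus shown in FIG. 6, the apparatus 600 includes:
a determining module 602, configured to determine at least one subsequence from a target sequence, where the subsequence starts at a character in the target sequence; and
a search module 604, configured to search for the subsequence in an acceleration library, to obtain the position in a reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character, where the acceleration library is used to accelerate the search for sequences of a set length value, and the length of the subsequence is the set length value.
In some possible implementations, the acceleration library includes at least one information structure, and the information structure is used to indicate the range of a sample sequence or of the maximum exact match of the sample sequence starting from its first character.
In some possible implementations, the information structure includes a range field and at least one of a presence field and a length field, where the presence field represents whether a sample sequence exists in the reference sequence, the range field represents the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and the length field represents the length of the sample sequence or of the maximum exact match of the sample sequence.
In some possible implementations, the search module 604 is specifically configured to:
determine the storage address corresponding to the subsequence according to a mapping between sequences and storage addresses; and
access the acceleration library according to the storage address, to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
In some possible implementations, the acceleration library includes a first acceleration library located in memory, and the set length value is a first length value.
In some possible implementations, the first length value is determined according to the size of the memory.
In some possible implementations, the acceleration library includes a second acceleration library located in external memory, and the set length value is a second length value.
In some possible implementations, the second length value is determined according to the size of the external memory.
In some possible implementations, the second information structure further includes a comparison field, where the comparison field represents whether the length of the maximum exact match is greater than a preset length threshold, and the preset length threshold is determined according to the size of the memory and the ratio of external-memory access time to memory access time.
In some possible implementations, the determining module 602 is specifically configured to:
determine at least one first subsequence and at least one second subsequence from the target sequence, where the at least one first subsequence and the at least one second subsequence start at a character in the target sequence, and the second subsequence is longer than the first subsequence;
the acceleration library includes a first acceleration library located in memory and a second acceleration library located in external memory; and
the search module 604 is specifically configured to:
search for the first subsequence in the first acceleration library, and search for the second subsequence in the second acceleration library.
In some possible implementations, the search module 604 is specifically configured to:
when the maximum exact match of the second subsequence starting from the one character is found in the second acceleration library, stop searching for the first subsequence in the first acceleration library; and when the maximum exact match of the first subsequence starting from the one character is found in the first acceleration library, stop searching for the second subsequence in the second acceleration library.
In some possible implementations, the first length value is determined according to the size of the memory, or according to the ratio of external-memory access time to memory access time.
In some possible implementations, the apparatus 600 further includes:
a construction module, configured to search for a sample sequence in the reference sequence to obtain a search result, where the search result represents the position in the reference sequence of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and to build the acceleration library according to the search result.
In some possible implementations, the construction module is specifically configured to:
search for the sample sequence in the reference sequence by the BWT algorithm, according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, to obtain the search result, where the search result represents whether the sample sequence exists in the reference sequence; the range, in the two-dimensional array, of the sample sequence or of the maximum exact match of the sample sequence starting from its first character; and the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
In some possible implementations, the sequence is a gene sequence.
The sequence search apparatus 600 according to the embodiments of this application may correspond to performing the methods described in the embodiments of this application, and the foregoing and other operations and/or functions of the modules/units of the sequence search apparatus 600 are intended to implement the corresponding flows of the methods in the embodiments shown in FIG. 3, FIG. 4, and FIG. 5. For brevity, details are not repeated here.
An embodiment of this application further provides a processing device 200, configured to implement the functions of the sequence search apparatus 600 in the embodiment shown in FIG. 6. For the specific implementation of the processing device 200, refer to the description related to FIG. 2; details are not repeated here.
An embodiment of this application further provides a computer-readable storage medium that includes instructions, where the instructions instruct a computer to perform the foregoing sequence search method applied to the sequence search apparatus 600.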
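An embodiment of this application further provides a computer program product. When the computer program product is executed by a computer, the computer performs any one of the foregoing sequence search methods. The computer program product may be a software installation package, which can be downloaded and executed on a computer when any one of the foregoing sequence search methods needs to be used.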

Claims (32)

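  1. A sequence search method, characterized in that the method comprises:
    determining at least one subsequence from a target sequence, wherein the subsequence starts at a character in the target sequence; and
    searching for the subsequence in an acceleration library, to obtain the position in a reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character, wherein the acceleration library is used to accelerate the search for sequences of a set length value, and the length of the subsequence is the set length value.
  2. The method according to claim 1, characterized in that the acceleration library comprises at least one information structure, and the information structure is used to indicate the range of a sample sequence or of the maximum exact match of the sample sequence starting from its first character.
  3. The method according to claim 2, characterized in that the information structure comprises a range field and at least one of a presence field and a length field, wherein the presence field represents whether a sample sequence exists in the reference sequence, the range field represents the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and the length field represents the length of the sample sequence or of the maximum exact match of the sample sequence.
  4. The method according to any one of claims 1 to 3, characterized in that the searching for the subsequence in an acceleration library, to obtain the position in a reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character, comprises:
    determining the storage address corresponding to the subsequence according to a mapping between sequences and storage addresses; and
    accessing the acceleration library according to the storage address, to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
  5. The method according to any one of claims 1 to 4, characterized in that the acceleration library comprises a first acceleration library located in memory, and the set length value is a first length value.
  6. The method according to claim 5, characterized in that the first length value is determined according to the size of the memory.
  7. The method according to any one of claims 1 to 4, characterized in that the acceleration library comprises a second acceleration library located in external memory, and the set length value is a second length value.
  8. The method according to claim 7, characterized in that the second length value is determined according to the size of the external memory.
  9. The method according to claim 7 or 8, characterized in that the second information structure further comprises a comparison field, wherein the comparison field represents whether the length of the maximum exact match is greater than a preset length threshold, and the preset length threshold is determined according to the size of the memory and the ratio of external-memory access time to memory access time.
  10. The method according to any one of claims 1 to 4, characterized in that the determining at least one subsequence from a target sequence comprises:
    determining at least one first subsequence and at least one second subsequence from the target sequence, wherein the at least one first subsequence and the at least one second subsequence start at a character in the target sequence, and the second subsequence is longer than the first subsequence;
    the acceleration library comprises a first acceleration library located in memory and a second acceleration library located in external memory; and
    the searching for the subsequence in an acceleration library comprises:
    searching for the first subsequence in the first acceleration library, and searching for the second subsequence in the second acceleration library.
  11. The method according to claim 10, characterized in that when the maximum exact match of the second subsequence starting from the one character is found in the second acceleration library, the search for the first subsequence in the first acceleration library is stopped; and when the maximum exact match of the first subsequence starting from the one character is found in the first acceleration library, the search for the second subsequence in the second acceleration library is stopped.
  12. The method according to claim 10 or 11, characterized in that the first length value is determined according to the size of the memory, or according to the ratio of external-memory access time to memory access time.
  13. The method according to any one of claims 1 to 12, characterized in that the method further comprises:
    searching for a sample sequence in the reference sequence to obtain a search result, wherein the search result represents the position in the reference sequence of the sample sequence or of the maximum exact match of the sample sequence starting from its first character; and
    building the acceleration library according to the search result.
  14. The method according to claim 13, characterized in that the searching for a sample sequence in the reference sequence to obtain a search result comprises:
    searching for the sample sequence in the reference sequence by the BWT algorithm, according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, to obtain the search result, wherein the search result represents whether the sample sequence exists in the reference sequence; the range, in the two-dimensional array, of the sample sequence or of the maximum exact match of the sample sequence starting from its first character; and the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
  15. The method according to any one of claims 1 to 14, characterized in that the sequence is a gene sequence.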
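  16. A sequence search apparatus, characterized in that the apparatus comprises:
    a determining module, configured to determine at least one subsequence from a target sequence, wherein the subsequence starts at a character in the target sequence; and
    a search module, configured to search for the subsequence in an acceleration library, to obtain the position in a reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character, wherein the acceleration library is used to accelerate the search for sequences of a set length value, and the length of the subsequence is the set length value.
  17. The apparatus according to claim 16, characterized in that the acceleration library comprises at least one information structure, and the information structure is used to indicate the range of a sample sequence or of the maximum exact match of the sample sequence starting from its first character.
  18. The apparatus according to claim 17, characterized in that the information structure comprises a range field and at least one of a presence field and a length field, wherein the presence field represents whether a sample sequence exists in the reference sequence, the range field represents the range of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and the length field represents the length of the sample sequence or of the maximum exact match of the sample sequence.
  19. The apparatus according to any one of claims 16 to 18, characterized in that the search module is specifically configured to:
    determine the storage address corresponding to the subsequence according to a mapping between sequences and storage addresses; and
    access the acceleration library according to the storage address, to obtain the position in the reference sequence of the subsequence or of the maximum exact match of the subsequence starting from the one character.
  20. The apparatus according to any one of claims 16 to 19, characterized in that the acceleration library comprises a first acceleration library located in memory, and the set length value is a first length value.
  21. The apparatus according to claim 20, characterized in that the first length value is determined according to the size of the memory.
  22. The apparatus according to any one of claims 16 to 19, characterized in that the acceleration library comprises a second acceleration library located in external memory, and the set length value is a second length value.
  23. The apparatus according to claim 22, characterized in that the second length value is determined according to the size of the external memory.
  24. The apparatus according to claim 22 or 23, characterized in that the second information structure further comprises a comparison field, wherein the comparison field represents whether the length of the maximum exact match is greater than a preset length threshold, and the preset length threshold is determined according to the size of the memory and the ratio of external-memory access time to memory access time.
  25. The apparatus according to any one of claims 16 to 19, characterized in that the determining module is specifically configured to:
    determine at least one first subsequence and at least one second subsequence from the target sequence, wherein the at least one first subsequence and the at least one second subsequence start at a character in the target sequence, and the second subsequence is longer than the first subsequence;
    the acceleration library comprises a first acceleration library located in memory and a second acceleration library located in external memory; and
    the search module is specifically configured to:
    search for the first subsequence in the first acceleration library, and search for the second subsequence in the second acceleration library.
  26. The apparatus according to claim 25, characterized in that the search module is specifically configured to:
    when the maximum exact match of the second subsequence starting from the one character is found in the second acceleration library, stop searching for the first subsequence in the first acceleration library; and when the maximum exact match of the first subsequence starting from the one character is found in the first acceleration library, stop searching for the second subsequence in the second acceleration library.
  27. The apparatus according to claim 25 or 26, characterized in that the first length value is determined according to the size of the memory, or according to the ratio of external-memory access time to memory access time.
  28. The apparatus according to any one of claims 16 to 27, characterized in that the apparatus further comprises:
    a construction module, configured to search for a sample sequence in the reference sequence to obtain a search result, wherein the search result represents the position in the reference sequence of the sample sequence or of the maximum exact match of the sample sequence starting from its first character, and to build the acceleration library according to the search result.
  29. The apparatus according to claim 28, characterized in that the construction module is specifically configured to:
    search for the sample sequence in the reference sequence by the BWT algorithm, according to the index BWT of the reference sequence, the suffix array SA, and the two-dimensional array OCC, to obtain the search result, wherein the search result represents whether the sample sequence exists in the reference sequence; the range, in the two-dimensional array, of the sample sequence or of the maximum exact match of the sample sequence starting from its first character; and the length of the sample sequence or of the maximum exact match of the sample sequence starting from its first character.
  30. The apparatus according to any one of claims 16 to 29, characterized in that the sequence is a gene sequence.
  31. A computing device, characterized in that the computing device comprises a processor and a memory;
    the processor is configured to execute instructions stored in the memory, so that the device performs the method according to any one of claims 1 to 15.
  32. A computer-readable storage medium, characterized by comprising instructions, wherein the instructions instruct a computing device to perform the method according to any one of claims 1 to 15.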
PCT/CN2021/095825 2020-08-24 2021-05-25 Sequence search method, apparatus, device, and medium WO2022041881A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010856456.3 2020-08-24
CN202010856456.3A CN114090840A (zh) 2020-08-24 2020-08-24 Sequence search method, apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2022041881A1 true WO2022041881A1 (zh) 2022-03-03

Family

ID=80295447

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095825 WO2022041881A1 (zh) 2020-08-24 2021-05-25 Sequence search method, apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN114090840A (zh)
WO (1) WO2022041881A1 (zh)

Also Published As

Publication number Publication date
CN114090840A (zh) 2022-02-25

Similar Documents

Publication Publication Date Title
TWI720491B (zh) Reducing probabilistic filter query latency
JP5816198B2 (ja) System and method for sharing the results of computing operations among related computing systems
US10678654B2 (en) Systems and methods for data backup using data binning and deduplication
US10210191B2 (en) Accelerated access to objects in an object store implemented utilizing a file storage system
JP5744216B2 (ja) Indexing and searching method based on language locale
WO2014000517A1 (zh) Recommendation system and method for search input
BR112014002425B1 (pt) Method for scanning files
KR20080024156A (ko) Back-off mechanism for search
TW201033896A (en) Systems, methods, and devices for configuring a device
WO2015007224A1 (zh) Cloud-security-based method, apparatus, and server for detecting and removing malicious programs
JP2009193203A (ja) Pattern detection device, pattern detection system, pattern detection program, and pattern detection method
US11392545B1 (en) Tracking access pattern of inodes and pre-fetching inodes
US8984267B2 (en) Pinning boot data for faster boot
US8914377B2 (en) Methods for prefix indexing
US20130276117A1 (en) Method and apparatus for detecting a malware in files
US10795821B2 (en) Memory efficient key-value store
JP2019537097A (ja) Tracking access patterns of inodes and prefetching inodes
CN111638925A (zh) Interface method table generation method, and function pointer query method and apparatus
WO2022041881A1 (zh) Sequence search method, apparatus, device, and medium
WO2016029441A1 (zh) File scanning method and apparatus
WO2021042542A1 (zh) Directory storage method and apparatus, computer device, and storage medium
US20200135298A1 (en) Systems and methods for grouping and collapsing sequencing reads
CN114518841A (zh) Processor in memory and method for outputting instructions using the processor in memory
RU2628922C1 (ru) Method for determining the similarity of composite files
JP4540556B2 (ja) Data access method and program therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21859726

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21859726

Country of ref document: EP

Kind code of ref document: A1