CN111292805B - Third generation sequencing data overlap detection method and system - Google Patents

Publication number
CN111292805B
Authority
CN
China
Prior art keywords
minimizer
subsequence
hash
window
sequencing data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010195494.9A
Other languages
Chinese (zh)
Other versions
CN111292805A (en
Inventor
刘卫国
槐敏涵
产院东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010195494.9A priority Critical patent/CN111292805B/en
Publication of CN111292805A publication Critical patent/CN111292805A/en
Application granted granted Critical
Publication of CN111292805B publication Critical patent/CN111292805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G16B30/10 Sequence alignment; Homology search
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the load
    • G16B50/30 Data warehousing; Computing architectures
    • G06F2209/5018 Thread allocation
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a third-generation sequencing data overlap detection method and system. The method comprises: receiving all DNA sequences of the third-generation sequencing data and sorting them by length; distributing all DNA sequences among a preset number of parallel threads such that the total amount of DNA data processed by each thread is equal; in each thread, computing, for each window of every DNA sequence, the subsequence with the minimum hash value and taking it as a minimizer; indexing all minimizers by hash value to construct a reference gene hash index table based on a double-array structure, in which the index array stores, for each hash value, the position of the corresponding minimizers in the structure array, and the structure array stores the position information of the minimizers; and performing DNA sequence overlap detection according to the reference gene hash index table based on the double-array structure. The method can improve the efficiency of sequencing data overlap detection.

Description

Third generation sequencing data overlap detection method and system
Technical Field
The invention belongs to the field of sequencing data processing, and particularly relates to a third-generation sequencing data overlap detection method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Third-generation sequencing is a new generation of DNA sequencing technology in which the average length of a DNA sequence increases from 200 to 10000; the longer DNA sequences carry richer genetic information and accelerate the subsequent gene assembly process. For third-generation sequencing data, DNA overlap detection is an important sequence analysis step. An overlap is the portion over which the characters of two sequences match, where a DNA sequence can be regarded as a string over the characters A, C, G, and T.
The core of existing DNA sequence overlap detection algorithms is to find identical subsequences of length k, called seeds, in two DNA sequences; the existing overlap algorithm Minimap finds the overlapping part of two DNA sequences based on such k-long subsequences. Before building its index, the Minimap algorithm finds all subsequences of length k in a DNA sequence of length n; every k adjacent characters form one subsequence, called a k-mer. Each k-mer can be mapped to an integer by a hash function; among adjacent k-mers, the subsequence with the smallest integer value is selected as the minimizer, and the minimizers are indexed. A commonly used hash function $\phi$ converts a k-long subsequence into an integer of 2k bits. For a single character, $\phi(A)=0$, $\phi(C)=1$, $\phi(G)=2$, and $\phi(T)=3$. For a k-long sequence $s = a_1 a_2 \ldots a_k$, $\phi(s) = \phi(a_1)\times 4^{k-1} + \phi(a_2)\times 4^{k-2} + \cdots + \phi(a_k)$. Minimap then applies an invertible integer hash function to this result and uses the new value as the basis for selecting minimizers.
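The encoding described above can be sketched in a few lines of Python. This is only an illustration of the 2-bit-per-character mapping; the names CODE and kmer_codes are hypothetical and do not come from Minimap or from the invention.

```python
# Hypothetical helper names; only the A/C/G/T -> 0/1/2/3 mapping is from the text.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_codes(seq, k):
    """Return phi(k-mer) for every k-mer of seq, in order of position."""
    mask = (1 << (2 * k)) - 1          # keep only the lowest 2k bits
    code, out = 0, []
    for i, base in enumerate(seq):
        # Shifting by 2 bits multiplies by 4, so this accumulates
        # phi(a_1)*4^(k-1) + ... + phi(a_k) for the last k characters.
        code = ((code << 2) | CODE[base]) & mask
        if i >= k - 1:
            out.append(code)
    return out

print(kmer_codes("ACGT", 2))  # codes of AC, CG, GT
```

For the sequence ACGT with k = 4, the formula gives 0×64 + 1×16 + 2×4 + 3 = 27, which is the single value the function returns in that case.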
With the development of third-generation sequencing technology, the scale of DNA sequence data is growing explosively on the one hand; on the other hand, the length of DNA sequences has increased to the order of 10k while the error rate has risen to 10%-15%. This poses new challenges for DNA sequence analysis. Gene assembly is an important part of DNA sequence analysis: through steps such as sequence alignment and merging, the short fragments obtained by DNA sequencing are restored into longer contiguous sequences, thereby reconstructing the original DNA molecule. DNA overlap detection is the most computationally intensive step of gene assembly. For DNA overlap detection, existing research mostly adopts a "seed-chain-align" strategy, and the algorithms differ mainly in how seeds are defined and how similar seeds are found.
Among the many DNA overlap detection algorithms, Minimap balances computational efficiency and overlap detection accuracy. We therefore start from the Minimap algorithm and aim to improve the performance of its index-building part through parallelization and optimization techniques. Through in-depth investigation and analysis of the Minimap algorithm, the inventors found that the existing implementation has the following performance shortcomings. First, Minimap supports multithreading, but tests with different thread counts showed that it does not have good thread scalability. Thread scalability means that when a program is run with different numbers of threads, its running time decreases linearly as the number of threads increases linearly; the tests showed that the running time of Minimap does not decrease linearly as the thread count increases linearly. Second, the Minimap algorithm uses a sorting algorithm when building the index, but this part is not parallelized. Finally, the hash function accounts for most of the computation time of the program and consists of operations that can be vectorized, yet Minimap currently does not use vector processor resources to accelerate it at a low level. Here, vectorization refers to using the underlying hardware, i.e., the vector processor, to perform the same operation on multiple operands in parallel.
Disclosure of Invention
In order to solve the problems, the invention provides a third generation sequencing data overlap detection method and a third generation sequencing data overlap detection system, which can improve sequencing data overlap detection efficiency.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A first aspect of the invention provides a third-generation sequencing data overlap detection method, comprising the following steps:
receiving all DNA sequences of the third-generation sequencing data and sorting the DNA sequences by length;
distributing all DNA sequences among a preset number of parallel threads such that the total amount of DNA data processed by each thread is equal;
in each thread, computing, for each window of every DNA sequence, the subsequence with the minimum hash value and taking it as a minimizer;
indexing all minimizers by hash value to construct a reference gene hash index table based on a double-array structure, wherein the reference gene hash index table is divided into an index array and a structure array, the index array stores, for each hash value, the storage position of the corresponding minimizers in the structure array, and the structure array stores the minimizer position information arranged in ascending order of hash value;
and performing DNA sequence overlap detection according to the reference gene hash index table based on the double-array structure.
A second aspect of the invention provides a third generation sequencing data overlap detection system comprising:
a sequencing data preprocessing module, configured to receive all DNA sequences of the third-generation sequencing data and sort the DNA sequences by length;
a parallel thread allocation module, configured to distribute all DNA sequences among a preset number of parallel threads such that the total amount of DNA data processed by each thread is equal;
a minimizer obtaining module, configured to compute, in each thread, the subsequence with the minimum hash value for each window of every DNA sequence and take it as a minimizer;
a hash index table construction module, configured to index all minimizers by hash value and construct a reference gene hash index table based on a double-array structure, wherein the reference gene hash index table is divided into an index array and a structure array, the index array stores, for each hash value, the storage position of the corresponding minimizers in the structure array, and the structure array stores the minimizer position information arranged in ascending order of hash value;
and an overlap detection module, configured to perform DNA sequence overlap detection according to the reference gene hash index table based on the double-array structure.
A third aspect of the invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps in a third generation sequencing data overlap detection method as described above.
A fourth aspect of the invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a third generation sequencing data overlap detection method as described above when the program is executed.
The beneficial effects of the invention are as follows:
(1) All DNA sequences are distributed among a preset number of parallel threads such that the total amount of DNA data processed by each thread is equal, which balances the load across threads and ensures the speedup of the multithreaded parallel implementation;
(2) The invention constructs a reference gene hash index table based on a double-array structure; the reference gene hash index table is divided into two arrays, the index arrays store the positions of minimizer corresponding to different hash values in the structure arrays, and the structure arrays store the position information of the minimizer; the invention improves the parallel computing speed and the thread expansibility by utilizing the reference gene hash index table based on the double-array structure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a third generation sequencing data overlap detection method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a DNA sequence allocation thread according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a memory storage manner conversion according to an embodiment of the invention.
Fig. 4 is a reference gene hash index table based on a double array structure according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of a third generation sequencing data overlap detection system according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
FIG. 1 is a flow chart of a third generation sequencing data overlap detection method according to this embodiment.
The following provides a specific implementation procedure of the third generation sequencing data overlap detection method of the present embodiment in combination with fig. 1.
As shown in fig. 1, the present embodiment provides a third generation sequencing data overlap detection method, including:
Step S101: receive all DNA sequences of the third-generation sequencing data and sort the DNA sequences by length.
The benefit of sorting is that the difference in computational workload between two adjacent sequences is reduced. Because the parallel optimization includes vectorization, if the lengths of adjacent sequences differ too much, most of the lanes in the vector registers are left idle. Sorting is therefore critical for maintaining load balance in the parallel implementation.
Step S102: distribute all DNA sequences among a preset number of parallel threads such that the total amount of DNA data processed by each thread is equal.
As shown in fig. 2, DNA sequencing data is divided into different data blocks and then assigned to different threads for processing.
The sorted DNA sequences must then be partitioned. One common strategy is to partition by the number of DNA sequences; another is to partition by the size of the DNA data blocks. Because the reads produced by third-generation sequencing differ greatly in length, if the sorted DNA sequences are partitioned by sequence count, the earlier data blocks contain significantly less computation than the later ones. We therefore partition so that the DNA data blocks are of equal size: when the DNA sequences are distributed to different threads, they are divided so that the total number of characters per thread is equal. For example, with 2 threads and 4 sequences of lengths 5000, 3000, 1000, and 1000, the sequence of length 5000 is assigned to one thread and the remaining three sequences to the other.
The main time-consuming step is the minimizer computation, i.e., finding the subsequence with the minimum hash value in each window of every DNA sequence, and its cost is proportional to the number of characters. This partitioning therefore ensures load balance among threads. Overall, the multithreaded parallel scheme scales across threads and remains applicable as the number of cores increases. In addition, the balanced load ensures the speedup of the multithreaded parallel implementation.
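One way to obtain such a character-balanced split is sketched below in Python. The greedy rule used here (give the next-longest sequence to the currently least-loaded thread) is an assumption chosen for illustration; the description only requires that the total number of characters per thread be equal, and the function name partition_by_total_length is hypothetical.

```python
import heapq

def partition_by_total_length(lengths, n_threads):
    """Assign sequence lengths to threads so that per-thread totals are balanced.

    Greedy heuristic (an illustrative assumption, not taken from the text):
    process sequences from longest to shortest and always give the next one
    to the thread with the smallest total so far.
    """
    heap = [(0, t) for t in range(n_threads)]   # (current load, thread id)
    heapq.heapify(heap)
    blocks = [[] for _ in range(n_threads)]
    for length in sorted(lengths, reverse=True):
        load, t = heapq.heappop(heap)
        blocks[t].append(length)
        heapq.heappush(heap, (load + length, t))
    return blocks

blocks = partition_by_total_length([5000, 3000, 1000, 1000], 2)
print(blocks, [sum(b) for b in blocks])
```

On the example from the text, one thread receives the sequence of length 5000 and the other receives the three remaining sequences, so both threads process 5000 characters.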
Step S103: for each thread, the subsequence with the smallest hash value of each window of all DNA sequences is found and used as a minimizer.
In a specific implementation, the subsequence with the minimum hash value in each window of every DNA sequence is found and taken as a minimizer as follows:
Step S1031: find all subsequences of equal length in each DNA sequence;
Step S1032: compute the hash value of each subsequence of each DNA sequence with the hash function;
Step S1033: within each window of the preset size, find the subsequence with the minimum hash value and take it as a minimizer.
The process of successively computing the hash values of the subsequences in one window, comparing them, and finding the subsequence with the minimum hash value is treated as one task, and each task is assigned to one lane of a vector register;
for the multiple windows of a DNA sequence that occupy consecutive vector lanes, the corresponding subsequences are stored adjacently in memory, so that memory can be accessed in parallel with vectorized memory access instructions.
Specifically, the minimizer computation has two important parameters: the window size w and the seed length k (a seed is a k-long subsequence of the DNA sequence, also called a k-mer). The window size w means that, out of every w seeds, the seed with the smallest hash value is selected as a minimizer. In summary, computing a minimizer consists of successively computing the hash values of the w seeds in one window, comparing them, and finding the subsequence with the smallest hash value; this computation is called a task, and each task is assigned to one lane of a vector register. Here, w adjacent k-long subsequences are referred to as a window, i.e., a window is a set of subsequences of a DNA sequence, and w and k are positive integers greater than or equal to 2.
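The task described above can be sketched in scalar Python as follows. For brevity the sketch uses the plain 2-bit encoding as the hash value and omits the second, invertible hash; reporting a minimizer only once when consecutive windows select the same k-mer, and breaking ties by the leftmost position, are assumptions of this sketch. The names phi and minimizers are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def phi(kmer):
    """2-bit-per-character encoding of a k-mer (A=0, C=1, G=2, T=3)."""
    value = 0
    for base in kmer:
        value = value * 4 + CODE[base]
    return value

def minimizers(seq, k, w):
    """Return (hash, position) of the minimum-hash k-mer of every window of w k-mers."""
    hashes = [phi(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(hashes) - w + 1):
        # One "task": compare the w hash values of this window.
        best = min((hashes[j], j) for j in range(start, start + w))
        # Adjacent windows often pick the same k-mer; keep it once (assumption).
        if not picked or picked[-1] != best:
            picked.append(best)
    return picked

print(minimizers("GACGT", k=2, w=3))
```

In a vectorized implementation, each iteration of the outer loop would correspond to one task placed in one lane of a vector register.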
A DNA sequence tens of thousands of characters long yields many tasks, which gives the vector units sufficient parallelism and ensures 100% utilization of the vector registers. In this mode, however, the values that a vector loads from memory are not physically contiguous. A gather instruction is therefore needed to read the non-contiguous memory, and the poor memory access locality reduces the efficiency of vectorized memory access to some extent. For the vectorization of sequence alignment problems, memory optimization has long been a focus of vectorization work. Inspired by the memory organization schemes of other algorithms, this embodiment proposes a new memory conversion scheme: for the windows of a DNA sequence that occupy consecutive vector lanes, the corresponding characters are rearranged from the order (first subsequence of the first window, second subsequence of the first window, ..., w-th subsequence of the first window, first subsequence of the second window, ...) into the order (first subsequence of the first window, first subsequence of the second window, ..., first subsequence of the last window, second subsequence of the first window, ...). With the new memory layout, the data loaded by a vector are physically adjacent. Fig. 3 shows this conversion of the memory storage layout, where a1, a2, etc. denote k-long subsequences.
The previous discrete (gather) memory access instructions can then be replaced by contiguous vectorized memory access instructions, which improves the efficiency of parallel memory access.
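The layout conversion amounts to a transposition from window-major to lane-interleaved order. A minimal Python sketch follows, assuming that the number of windows passed in equals the number of vector lanes; the function name interleave_windows and the labels b1, b2, ... for the second window are hypothetical.

```python
def interleave_windows(windows):
    """Reorder window-major data so that the j-th subsequences of all windows are adjacent.

    windows: list of equally long lists, one list per vector lane.
    Before: [w0[0], w0[1], ..., w1[0], w1[1], ...]
    After:  [w0[0], w1[0], ..., w0[1], w1[1], ...]
    """
    return [item for column in zip(*windows) for item in column]

layout = interleave_windows([["a1", "a2", "a3"], ["b1", "b2", "b3"]])
print(layout)
```

After the conversion, the values needed by all lanes in the same step sit next to each other, so one contiguous load can fill the whole vector.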
Step S104: index all minimizers by hash value to construct a reference gene hash index table based on a double-array structure. The reference gene hash index table is divided into an index array and a structure array: the index array stores, for each hash value, the storage position of the corresponding minimizers in the structure array, and the structure array stores the minimizer position information arranged in ascending order of hash value.
In a specific implementation, the core computation in Minimap's index construction, the invertible integer hash function, is vectorized and optimized. This embodiment parallelizes the invertible integer hash function with 25 vectorized instructions. These are mainly shift, bitwise, and addition operations, all of which are very efficient, so the whole function is computed efficiently. In addition, a scalable vectorized parallel scheme is designed with the number of vector register lanes as a variable. The scheme has two main parts. The first is the memory design: following the memory organization scheme described above and with the number of vector register lanes as a variable, the DNA sequences of as many windows as there are lanes are fetched each time and the data are rearranged. The second is platform-portable vectorization: for the SSE, AVX2, and AVX512 computing platforms, the algorithm can be ported across platforms simply by providing, for each meta-operation of the algorithm, a concrete implementation at the corresponding vector instruction level.
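To illustrate why such a hash vectorizes well, the following Python sketch applies an invertible shift/add/xor mix to a whole list of keys, one operation at a time across all "lanes". The constants follow the invertible integer mix found in the open-source Minimap code as far as recalled here, and that correspondence should be treated as an assumption; the lists merely emulate vector registers, and the name hash64_lanes is hypothetical.

```python
def hash64_lanes(keys, mask):
    """Apply an invertible integer mix to every key, step by step across all lanes.

    keys: non-negative integers not exceeding mask; mask: 2**p - 1.
    Each step is a shift, add, xor, or mask, which is why the function maps
    naturally onto SIMD instructions. Constants are an assumption of this sketch.
    """
    keys = [(~x + (x << 21)) & mask for x in keys]
    keys = [x ^ (x >> 24) for x in keys]
    keys = [(x + (x << 3) + (x << 8)) & mask for x in keys]   # multiply by 265
    keys = [x ^ (x >> 14) for x in keys]
    keys = [(x + (x << 2) + (x << 4)) & mask for x in keys]   # multiply by 21
    keys = [x ^ (x >> 28) for x in keys]
    keys = [(x + (x << 31)) & mask for x in keys]
    return keys

k = 4                                   # 2k = 8 bits per k-mer code
mask = (1 << (2 * k)) - 1
hashed = hash64_lanes(list(range(mask + 1)), mask)
print(len(set(hashed)) == mask + 1)     # distinct inputs give distinct outputs
```

Because every step is a bijection on the masked range, distinct k-mer codes never collide, which is the property that makes the hash invertible.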
Index construction requires inserting into a hash table: the position and hash value of each minimizer are inserted into the table. Insertion is subject to write conflicts, so a parallel implementation would have to synchronize with locks. Because insertions are very frequent, a lock-based parallel algorithm cannot achieve good efficiency or thread scalability. A reference gene hash index based on a double-array structure is therefore used here. The hash table is split into two arrays: the index array stores, for each hash value, the position of the corresponding minimizers in the structure array, and the structure array stores the position information of the minimizers. In detail, after the structure array has been sorted by hash value, entry h of the index array stores the starting position in the structure array of the minimizers whose hash value is h, and the structure array holds the minimizer position information in ascending order of hash value. Based on this new data structure, the embodiment provides a parallel indexing algorithm for constructing the minimizer hash index. The structures containing the minimizer hash values and position information are sorted in parallel by hash value using an Intel TBB library function, and this step exhibits nearly linear speedup.
As shown in FIG. 4, each position of the right-hand array stores the position information of one minimizer, where (t, i, r) denotes the sequence number, the position within the sequence, and whether the minimizer lies on the forward or reverse strand, respectively. The minimizers are sorted by hash value, so minimizers with the same hash value are stored together. For example, if entry h of the left-hand array stores 413 and entry h+1 stores 452, the minimizers with hash value h start at subscript 413 of the right-hand array and those with hash value h+1 start at subscript 452.
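The double-array layout can be sketched as a sort followed by a prefix sum. The Python sketch below is sequential (the parallel sort is not reproduced), and the extra sentinel entry at the end of the index array, which marks where the last hash value's run ends, is an assumption added for convenience. The names build_index and lookup are hypothetical.

```python
def build_index(minimizers, hash_space):
    """Build (index_array, struct_array) from records (h, t, i, r).

    h: hash value in [0, hash_space); (t, i, r): sequence number,
    position within the sequence, and strand flag.
    """
    ordered = sorted(minimizers, key=lambda m: m[0])        # ascending hash value
    struct_array = [(t, i, r) for _, t, i, r in ordered]
    index_array = [0] * (hash_space + 1)                    # +1 sentinel (assumption)
    for h, _, _, _ in ordered:
        index_array[h + 1] += 1                             # count entries per hash
    for h in range(hash_space):
        index_array[h + 1] += index_array[h]                # prefix sum -> start offsets
    return index_array, struct_array

def lookup(index_array, struct_array, h):
    """All (t, i, r) entries whose hash value is h."""
    return struct_array[index_array[h]:index_array[h + 1]]

records = [(3, 1, 9, 1), (1, 0, 5, 0), (3, 2, 0, 0), (1, 1, 2, 0)]
index_array, struct_array = build_index(records, hash_space=4)
print(index_array, lookup(index_array, struct_array, 3))
```

Since no element is inserted into a shared table while the index is being built, the structure needs no locks; the sort and the prefix sum are the only steps that would have to be parallelized.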
Step S105: perform DNA sequence overlap detection according to the reference gene hash index table based on the double-array structure.
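The text does not detail how the index is queried in this step. As a rough sketch, assuming that candidate overlaps are found by counting minimizer hash values shared between a query sequence and each indexed sequence, a lookup could look like the following Python code. The threshold min_shared and the function name candidate_targets are hypothetical, and a real overlapper would additionally check that the shared positions are collinear.

```python
from collections import Counter

def candidate_targets(query_hashes, index_array, struct_array, min_shared=2):
    """Return sequence numbers that share at least min_shared minimizer hits with the query."""
    shared = Counter()
    for h in query_hashes:
        # Entries with hash value h occupy one contiguous slice of the structure array.
        for t, _i, _r in struct_array[index_array[h]:index_array[h + 1]]:
            shared[t] += 1
    return sorted(t for t, n in shared.items() if n >= min_shared)

# Index with hash values 0..3: hash 1 -> two entries, hash 3 -> two entries.
index_array = [0, 0, 2, 2, 4]
struct_array = [(0, 5, 0), (1, 2, 0), (1, 9, 1), (2, 0, 0)]
print(candidate_targets([1, 3], index_array, struct_array))
```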
Tests were performed with different parameters. They showed that the detection method of this embodiment achieves the best performance when the subsequence length k is 10 to 12. The larger the window size w, the fewer minimizers are stored in the index, and correspondingly the less frequently the program writes computed minimizers into the index structure. Because memory access is expensive compared with computation, the running time decreases as the window grows. Overall, the new algorithm achieves good performance across the window sizes required by practical applications.
With the same parameters, different thread counts were tested: the running time decreases essentially linearly as the number of threads increases, a clear performance improvement over the original program.
In this embodiment, all DNA sequences are distributed among a preset number of parallel threads such that the total amount of DNA data processed by each thread is equal, which balances the load across threads and ensures the speedup of the multithreaded parallel implementation.
This embodiment also constructs a reference gene hash index table based on a double-array structure. The reference gene hash index table is divided into two arrays: the index array stores, for each hash value, the position of the corresponding minimizers in the structure array, and the structure array stores the position information of the minimizers. Using the reference gene hash index table based on the double-array structure improves the parallel computing speed and the thread scalability.
Example 2
Fig. 5 shows a schematic structural diagram of a third generation sequencing data overlap detection system according to this embodiment.
The following provides the structural principle of the third generation sequencing data overlap detection system according to this embodiment with reference to fig. 5:
as shown in fig. 5, the third generation sequencing data overlap detection system of the present embodiment includes:
(1) A sequencing data preprocessing module, configured to receive all DNA sequences of the third-generation sequencing data and sort the DNA sequences by length.
The benefit of sorting is that the difference in computational workload between two adjacent sequences is reduced. Because the parallel optimization includes vectorization, if the lengths of adjacent sequences differ too much, most of the lanes in the vector registers are left idle. Sorting is therefore critical for maintaining load balance in the parallel implementation.
(2) A parallel thread allocation module, configured to distribute all DNA sequences among a preset number of parallel threads such that the total amount of DNA data processed by each thread is equal.
For example, when the DNA sequences are distributed to different threads, they are divided so that the total number of characters per thread is equal: with 2 threads and 4 sequences of lengths 5000, 3000, 1000, and 1000, the sequence of length 5000 is assigned to one thread and the remaining three sequences to the other.
The main time-consuming step is the minimizer computation, i.e., finding the subsequence with the minimum hash value in each window of every DNA sequence, and its cost is proportional to the number of characters. This partitioning therefore ensures load balance among threads. Overall, the multithreaded parallel scheme scales across threads and remains applicable as the number of cores increases. In addition, the balanced load ensures the speedup of the multithreaded parallel implementation.
(3) A Minimizer obtaining module for finding, in each thread, the subsequence with the minimum hash value in each window of every DNA sequence and taking it as a minimizer;
In a specific implementation, the Minimizer obtaining module includes:
a subsequence search module for finding all subsequences of equal length in each DNA sequence;
a hash value calculation module for calculating the hash value of each subsequence of each DNA sequence according to a hash function;
and a Minimizer searching module for finding, within a window of preset size, the subsequence with the smallest hash value as a minimizer.
In the Minimizer obtaining module, the process of successively calculating the hash values of the subsequences in one window, comparing them, and finding the subsequence with the minimum hash value is treated as one task, and each task is assigned to one channel of the vector register;
for the multiple windows handled by consecutive vector channels in a DNA sequence, the subsequences corresponding to these windows are stored adjacently, so that memory access is parallelized with vectorized memory access instructions.
Specifically, the minimizer calculation has two important parameters: the window size w and the seed length k, where a seed is a subsequence of length k in the DNA sequence, also called a k-mer. The w adjacent subsequences of length k form a window, so a window is a set of subsequences of a DNA sequence; w and k are positive integers greater than or equal to 2. The window is used to select, from every w seeds, the seed with the smallest hash value as a minimizer. The minimizer calculation can thus be summarized as successively calculating the hash values of the w seeds in one window, comparing them, and finding the subsequence with the smallest hash value. This process is called a task, and each task is assigned to one channel of the vector register.
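A minimal serial sketch of the task described above is given below. It assumes that windows slide by one k-mer at a time, as in the usual minimizer scheme, and it uses a toy 2-bit encoding (`kmer_hash`) in place of the real hash function; both are assumptions of this sketch, and reverse-strand handling is left out.

```python
def kmer_hash(kmer):
    """Toy stand-in hash: pack the k-mer into an integer, 2 bits per base."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in kmer:
        value = (value << 2) | code[base]
    return value

def minimizers(seq, w, k, hash_fn=kmer_hash):
    """Return (hash, position) minimizers of seq, ordered by position.

    A window is w consecutive k-mers; the k-mer with the smallest hash in
    each window is that window's minimizer (ties: leftmost k-mer wins).
    """
    hashes = [hash_fn(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]      # one "task" in the text
        offset = window.index(min(window))
        picked.add((window[offset], start + offset))
    return sorted(picked, key=lambda m: m[1])

result = minimizers("ACGTTGCA", w=3, k=3)
print(result)
```

In the vectorized scheme of this embodiment, each iteration of the `for start` loop would correspond to one task running in its own channel of the vector register.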
A DNA sequence tens of thousands of characters long yields many tasks, which gives the vector parallelism a sufficient degree of parallelism and ensures 100% utilization of the vector register. In this mode, however, the values that one vector loads from memory are not located in contiguous memory. A gather instruction is therefore needed to read the non-contiguous locations, and the resulting poor memory access locality reduces the efficiency of vectorized memory access to some extent. Memory optimization has long been the focus of vectorizing sequence alignment problems. Inspired by the memory organization schemes of other algorithms, this embodiment proposes a new memory conversion scheme: for the multiple windows handled by consecutive vector channels in the DNA sequence, the corresponding characters are rearranged. The original order is the first subsequence of the first window, the second subsequence of the first window, ……, the w-th subsequence of the first window, the first subsequence of the second window, ……; the converted order is the first subsequence of the first window, the first subsequence of the second window, ……, the first subsequence of the window of the last channel, the second subsequence of the first window, ……. With the new memory structure, the data loaded by one vector are physically adjacent. Fig. 3 shows this conversion of the memory layout, where a1, a2, etc. denote subsequences of length k.
The earlier discrete (gather) memory access instructions can thus be replaced by contiguous vectorized memory access instructions, which improves the efficiency of parallel memory access.
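The layout conversion can be pictured as a transpose, sketched below under simplifying assumptions: each channel's window is given as its own list, `to_lane_major` is a hypothetical helper name, and the labels b, c and d for the other windows are invented for the example (the text only names a1, a2, etc.).

```python
def to_lane_major(windows):
    """Transpose window-major storage into channel-major storage.

    windows[j][s] is the s-th subsequence of the window handled by
    channel j. In the result, the s-th subsequences of all channels are
    contiguous, so step s of every task is one contiguous vector load.
    """
    w = len(windows[0])
    assert all(len(win) == w for win in windows)
    return [windows[j][s] for s in range(w) for j in range(len(windows))]

# Four channels, window size w = 3.
windows = [["a1", "a2", "a3"], ["b1", "b2", "b3"],
           ["c1", "c2", "c3"], ["d1", "d2", "d3"]]
flat = to_lane_major(windows)
lanes = len(windows)
step0 = flat[0:lanes]          # contiguous slice: first subsequence of every window
step1 = flat[lanes:2 * lanes]  # contiguous slice: second subsequence of every window
print(flat)
```

Before the conversion, the first subsequences of the four windows sit w elements apart and need a gather; after it, they form one contiguous slice.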
(4) A hash index table construction module for indexing all minimizers by hash value and constructing a reference gene hash index table based on a double-array structure; the reference gene hash index table is divided into an index array and a structure array, wherein the index array stores, for each hash value, the storage position in the structure array of the minimizers with that hash value; the structure array stores the minimizer position information in ascending order of hash value;
In a specific implementation, the invertible integer hash function, which is the core computation when Minimap builds its index, is vectorized. This embodiment parallelizes the invertible integer hash function with 25 vectorized instructions. These are mainly shift, bitwise and addition operations, which are very efficient vector instructions, so the function as a whole is computed efficiently. In addition, an extensible vectorized parallel scheme is designed with the number of vector register channels as a variable. The scheme has two parts. The first is the memory design: following the memory organization scheme above, with the number of channels as a variable, the DNA sequence data of as many windows as there are channels are fetched each time and rearranged. The second is the platform-extensible vectorized implementation: for the SSE, AVX2 and AVX512 computing platforms, only the meta-operations of the algorithm need a platform-specific implementation in the corresponding vector instruction set, which is enough to port the algorithm to different platforms.
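The text does not list the hash function itself. The sketch below assumes a form like the Thomas Wang-style 64-bit integer mix commonly used in minimizer sketching code; the constants, the names `hash64` and `hash_lanes`, and the `mask` parameter are assumptions of this sketch, and no claim is made that it matches the 25 instructions mentioned above. It shows that the function consists only of shifts, bitwise operations and additions, and that every channel runs the same instruction sequence.

```python
def hash64(key, mask):
    """Invertible integer mix built from shifts, xors and additions,
    truncated to the bits selected by mask (assumed form, see lead-in)."""
    key = (~key + (key << 21)) & mask
    key = key ^ (key >> 24)
    key = (key + (key << 3) + (key << 8)) & mask
    key = key ^ (key >> 14)
    key = (key + (key << 2) + (key << 4)) & mask
    key = key ^ (key >> 28)
    key = (key + (key << 31)) & mask
    return key

def hash_lanes(keys, mask):
    """Apply the same mix to every channel. A SIMD build would execute
    each line of hash64 as vector instructions over all channels at once."""
    return [hash64(k, mask) for k in keys]

mask = (1 << 10) - 1                      # e.g. k = 5 bases at 2 bits each
images = {hash64(x, mask) for x in range(1 << 10)}
print(len(images))                        # every input maps to a distinct output
print(hash_lanes([1, 2, 3, 4], mask))
```

Because each step is either a multiplication by an odd constant modulo a power of two or an xor with a right shift, the mix is a bijection on the masked range, which is what "invertible" means here.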
Index construction requires insertions into the hash table: the position and hash value of each minimizer are inserted into the table. The insertions cause write conflicts, so a parallel implementation would need lock-based synchronization. Because insertions are very frequent, a lock-based parallel algorithm achieves neither good efficiency nor good thread scalability. A reference gene hash index based on a double-array structure is therefore used here. The hash table is divided into two arrays: the index array stores the positions in the structure array of the minimizers corresponding to different hash values, and the structure array stores the position information of the minimizers. In detail, after the structure array has been sorted by hash value, the entry of the index array at position h stores the starting position in the structure array of the minimizers whose hash value is h. The structure array holds the minimizer position information stored in ascending order of hash value after sorting. Based on this data structure, this embodiment provides a parallel indexing algorithm for constructing the hash index of the minimizers. The structures containing the minimizer hash values and position information are sorted in parallel by hash value using a function of the Intel TBB library, and this step exhibits nearly linear speedup.
As shown in Fig. 4, each position of the right array stores the position information of one minimizer, where (t, i, r) denotes the sequence number, the position within the sequence, and whether the sequence is on the forward or reverse strand, respectively. The minimizers are sorted by hash value, so minimizers with the same hash value are stored together. For example, if subscript h of the left array stores 413 and subscript h+1 stores 452, then the minimizers with hash value h begin at subscript 413 of the right array and those with hash value h+1 begin at subscript 452.
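The double-array index can be sketched as a sort followed by a prefix sum, as below. This is a serial illustration under stated assumptions: Python's `sorted` stands in for the parallel sort, the hash space is small enough to index directly, and the extra final entry of `index` as well as the names `build_index` and `lookup` are conveniences of this sketch rather than details taken from the embodiment.

```python
from itertools import accumulate

def build_index(records, num_hashes):
    """Build the two arrays from (hash, t, i, r) records.

    structs holds (t, i, r) in ascending order of hash value; index[h] is
    the first slot of structs with hash h and index[h + 1] is one past the
    last, so index has num_hashes + 1 entries.
    """
    ordered = sorted(records, key=lambda m: m[0])   # serial stand-in for a parallel sort
    counts = [0] * num_hashes
    for h, _, _, _ in ordered:
        counts[h] += 1
    index = [0] + list(accumulate(counts))          # prefix sums give start offsets
    structs = [(t, i, r) for _, t, i, r in ordered]
    return index, structs

def lookup(index, structs, h):
    """Return the position records of all minimizers with hash value h."""
    return structs[index[h]:index[h + 1]]

records = [(2, 0, 5, 0), (0, 1, 3, 1), (2, 1, 9, 0), (3, 0, 12, 1)]
index, structs = build_index(records, num_hashes=4)
print(index)
print(lookup(index, structs, 2))
```

No locking is needed in this construction: sorting and prefix sums replace the conflicting insertions, and a query reads one contiguous slice of the structure array.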
(5) An overlap detection module for performing DNA sequence overlap detection according to the reference gene hash index table based on the double-array structure.
In this embodiment, all DNA sequences are distributed among a preset number of parallel threads under a strategy in which every thread processes the same total amount of DNA data, which balances the load across threads and ensures the speedup of the multithreaded implementation;
this embodiment also constructs a reference gene hash index table based on a double-array structure; the table is divided into two arrays, the index array stores the positions in the structure array of the minimizers corresponding to different hash values, and the structure array stores the position information of the minimizers; using this table improves the parallel computing speed and thread scalability.
Example 3
The present embodiment is a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the third generation sequencing data overlap detection method shown in fig. 1.
In this embodiment, all DNA sequences are distributed among a preset number of parallel threads under a strategy in which every thread processes the same total amount of DNA data, which balances the load across threads and ensures the speedup of the multithreaded implementation;
this embodiment also constructs a reference gene hash index table based on a double-array structure; the table is divided into two arrays, the index array stores the positions in the structure array of the minimizers corresponding to different hash values, and the structure array stores the position information of the minimizers; using this table improves the parallel computing speed and thread scalability.
Example 4
The present embodiment provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed implements the steps in the third generation sequencing data overlap detection method as shown in fig. 1.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above describes only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art can make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A third generation sequencing data overlap detection method, comprising:
receiving all DNA sequences of the third generation sequencing data, and sorting the DNA sequences by length;
distributing all DNA sequences to a preset number of parallel threads according to a strategy in which the total amount of DNA data processed by each thread is equal;
for each thread, obtaining the subsequence with the minimum hash value in each window of every DNA sequence and taking the subsequence as a minimizer;
indexing all minimizers by hash value to construct a reference gene hash index table based on a double-array structure; the reference gene hash index table is divided into two arrays, the index array stores the positions in the structure array of the minimizers corresponding to different hash values, and the structure array stores the position information of the minimizers;
performing DNA sequence overlap detection according to the reference gene hash index table based on the double-array structure;
wherein obtaining the subsequence with the minimum hash value in each window of every DNA sequence and taking the subsequence as a minimizer comprises:
treating as one task the process of successively calculating the hash values of the subsequences in one window, comparing them, and finding the subsequence with the minimum hash value, each task being assigned to one channel of a vector register;
for the multiple windows handled by consecutive vector channels in the DNA sequence, storing the subsequences corresponding to the windows adjacently, and parallelizing memory access with vectorized memory access instructions;
wherein, for the vectorized memory access, the characters corresponding to the multiple windows handled by consecutive vector channels in the DNA sequence are rearranged from the order: first subsequence of the first window, second subsequence of the first window, ……, w-th subsequence of the first window, first subsequence of the second window, ……; to the order: first subsequence of the first window, first subsequence of the second window, ……, first subsequence of the window of the last channel, second subsequence of the first window, ……; so that, with the new memory structure, the data loaded by one vector are physically adjacent.
2. The third generation sequencing data overlap detection method of claim 1, wherein obtaining the subsequence with the minimum hash value in each window of every DNA sequence and taking the subsequence as a minimizer comprises:
finding all subsequences of equal length in each DNA sequence;
calculating the hash value of each subsequence of each DNA sequence according to a hash function;
finding, within a window of preset size, the subsequence with the smallest hash value as a minimizer.
3. The third generation sequencing data overlap detection method of claim 1, wherein, after the structure array has been sorted by hash value, each position of the index array stores the starting position in the structure array of the minimizers whose hash value equals that position of the index array;
the structure array holds the minimizer position information stored in ascending order of hash value after sorting.
4. The third generation sequencing data overlap detection method of claim 1, wherein, in constructing the reference gene hash index table based on the double-array structure, the structures containing the minimizer hash values and position information are sorted in parallel by hash value using a function of the Intel TBB library.
5. A third generation sequencing data overlap detection system for performing the third generation sequencing data overlap detection method of claim 1, comprising:
a sequencing data preprocessing module for receiving all DNA sequences of the third generation sequencing data and sorting the DNA sequences by length;
the parallel thread allocation module is used for allocating all DNA sequences to a preset number of parallel threads according to a strategy that the sizes of the total DNA data processed by each thread are equal;
a Minimizer obtaining module, which is used for obtaining the subsequence with the minimum hash value of each window of all DNA sequences for each thread and taking the subsequence as a Minimizer;
the hash index table construction module is used for indexing all minimizers by hash value and constructing a reference gene hash index table based on a double-array structure; the reference gene hash index table is divided into two arrays, the index array stores the positions in the structure array of the minimizers corresponding to different hash values, and the structure array stores the position information of the minimizers;
and the overlap detection module is used for carrying out DNA sequence overlap detection according to the reference gene hash index table based on the double-array structure.
6. The third generation sequencing data overlap detection system of claim 5, wherein the Minimizer obtaining module comprises:
a subsequence search module for finding all subsequences of equal length in each DNA sequence;
a hash value calculation module for calculating the hash value of each subsequence of each DNA sequence according to a hash function;
and a Minimizer searching module for finding, within a window of preset size, the subsequence with the smallest hash value as a minimizer.
7. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the third generation sequencing data overlap detection method according to any of claims 1 to 4.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps in the third generation sequencing data overlap detection method of any of claims 1-4 when the program is executed.
Application number: CN202010195494.9A; priority date: 2020-03-19; filing date: 2020-03-19; title: Third generation sequencing data overlap detection method and system; status: Active; publication: CN111292805B (en)

Publications (2)

Publication Number Publication Date
CN111292805A CN111292805A (en) 2020-06-16
CN111292805B true CN111292805B (en) 2023-08-18

Family

ID=71025002

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782097B (en) * 2021-09-07 2022-06-24 National University of Defense Technology Anchor point screening method and device based on bloom filter and computer equipment
CN114564306B (en) * 2022-02-28 2024-05-03 Guilin University of Electronic Technology Third generation sequencing RNA-seq comparison method based on GPU parallel computing
CN114489518B (en) * 2022-03-28 2022-09-09 Shandong University Sequencing data quality control method and system
CN115641911B (en) * 2022-10-19 2023-05-23 Harbin Institute of Technology Method for detecting overlapping between sequences

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104531848A (en) * 2014-12-11 2015-04-22 Hangzhou Heyi Gene Technology Co., Ltd. Method and system for assembling genome sequence
CN106096332A (en) * 2016-06-28 2016-11-09 Shenzhen University Parallel fast matching method and system for stored DNA sequences
CN109416928A (en) * 2016-06-07 2019-03-01 Illumina, Inc. Bioinformatics systems, apparatuses and methods for performing secondary and/or tertiary processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809510B2 (en) * 2002-02-27 2010-10-05 Ip Genesis, Inc. Positional hashing method for performing DNA sequence similarity search

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
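Parallel DNA sequence alignment algorithms based on multi-core and many-core platforms. China Doctoral Dissertations Full-text Database, Basic Sciences. 2019, full text. *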

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant