CN114064551B - CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method

CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method

Info

Publication number
CN114064551B
CN114064551B (application CN202210046617.1A)
Authority
CN
China
Prior art keywords
gpu
data
thread
sequence
algorithm
Prior art date
Legal status
Active
Application number
CN202210046617.1A
Other languages
Chinese (zh)
Other versions
CN114064551A (en)
Inventor
张巍
林超宁
张崇
Current Assignee
Guangzhou Jiajian Medical Testing Co ltd
Original Assignee
Guangzhou Jiajian Medical Testing Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Jiajian Medical Testing Co ltd filed Critical Guangzhou Jiajian Medical Testing Co ltd
Priority to CN202210046617.1A
Publication of CN114064551A
Application granted
Publication of CN114064551B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a CPU + GPU heterogeneity-based high-concurrency sequence alignment calculation acceleration method, which comprises the following steps: reconstructing the BWA-MEM algorithm code; concurrent task processing on the CPU: completing the division of the sequence set and forming a plurality of concurrent tasks for the first time; running the BWA-MEM algorithm after code reconstruction to complete concurrent data processing on the GPU; concurrent task processing on the GPU: for the seed sets and chains generated during sequence alignment, dividing seed sets with the same or adjacent length, position and number into the same data block, treating the chains in the same way, thereby completing the division of seed sets and chains and forming a plurality of concurrent tasks for the second time. By designing a task-parallel plus data-parallel mode, the invention closely combines the characteristics of the BWA-MEM algorithm with the characteristics of GPU acceleration devices, makes full use of the strong concurrent computing capability of the GPU, provides excellent performance for the sequence alignment algorithm, and achieves high alignment and concurrent-processing efficiency.

Description

CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method
Technical Field
The invention relates to the field of biological sequence alignment, and in particular to a CPU + GPU heterogeneity-based high-concurrency sequence alignment calculation acceleration method.
Background
Biological sequence alignment is the application of the classic text-matching problem from computer science to the biological field. With the development of new molecular biology technologies, the ensuing studies of genetic variation, RNA expression, protein-gene interactions and the like require researchers to adopt high-throughput methods to interpret them. This poses new challenges for high-performance computing: in the era of high-throughput sequencing, the challenge is no longer generating data, but storing, processing and analyzing it. Third-generation sequencing further accelerates the sequencing process and produces longer sequencing fragments, placing higher demands on the development of sequence alignment technology. Meanwhile, computer hardware platforms have developed rapidly in recent years and are constantly being upgraded: new multi-core and many-core platforms keep appearing, many-core accelerators such as the GPU have rapidly raised the performance of high-performance computers, and the GPU heterogeneous platform is becoming an important way of building them. This also makes the computer architecture more complex and brings new challenges to program optimization.
Meanwhile, in the field of biological sequence alignment, accelerating the algorithm on a CPU + GPU heterogeneous platform has become a common approach.
For example, the paper "GPGPU-based rapid alignment of biological sequences" (vol. 38, issue 4, February 2012) proposes an efficient biological sequence alignment scheme on a CPU-GPU heterogeneous platform. The scheme exploits the parallel processing capability of the GPU and reconstructs the Smith-Waterman algorithm under the OpenCL framework by optimizing read latency, write latency, recombination functions and data transfer, thereby accelerating biological sequence alignment. In other words, that work mainly optimizes the SW algorithm (i.e. the Smith-Waterman algorithm).
As another example, the engineering master's thesis "Research on key technologies of CPU/GPU heterogeneous parallel optimization for biological sequence analysis algorithms" (National University of Defense Technology, March 2012) is centered on a heterogeneous system built from CPUs and GPUs and likewise mainly optimizes the SW algorithm (i.e. the Smith-Waterman algorithm).
The above technical solutions all optimize or improve a specific algorithm (e.g. the SW algorithm or the FM-index algorithm): they focus on the algorithm itself, use the characteristics of the GPU to address data parallelism, and accelerate biological sequence alignment by improving that specific algorithm. They have the following disadvantage. During alignment, exact-match sub-segments are first found as seeds, and the length, position and number of the SMEMs contained in different reads differ greatly. When the GPU platform uses a task division in which each thread processes one read, the threads become severely unsynchronized: reads containing shorter SMEMs must wait for reads containing longer SMEMs to finish searching, while reads containing shorter SMEMs usually contain more SMEMs, so reads containing longer SMEMs must in turn wait for them. This mutual waiting leads to extremely low utilization of computing resources and a very limited acceleration effect. The above solutions focus on data parallelism (further increasing the data-parallel capability) and cannot overcome this disadvantage.
Similarly, because the seeding step of BWA-MEM is highly dependent on the input data, the number and length of the seeds contained in each read differ greatly, which in turn affects the subsequent module that generates chains from the seeds, so the threads of the chain-generation module are also poorly synchronized, further limiting the acceleration effect. Since the runtime behavior of the algorithm is highly correlated with the input data, these methods cannot adapt to input data with different characteristics at the same time. The GPU platform's requirement of instruction consistency within a warp severely limits the acceleration of the BWA-MEM algorithm, so the above problems cannot be solved for the existing BWA-MEM algorithm either.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and to provide a CPU + GPU heterogeneity-based high-concurrency sequence alignment calculation acceleration method.
The object of the invention is achieved by the following technical solution:
a high-concurrency sequence alignment calculation acceleration method based on CPU + GPU isomerism comprises the following steps:
BWA-MEM algorithm code reconstruction step: for the BWA-MEM algorithm, the data structure is simplified, and partial circulation and logic judgment statements are optimized to be suitable for running on a GPU architecture;
and task concurrent processing on the CPU: on a CPU, for a sequence set to be compared, firstly setting the size of a data block of the sequence according to the number of processing threads of a GPU, completing the division of the sequence set, and forming a plurality of concurrent tasks for the first time; then reading the sequence data by the CPU data thread in blocks, and then comparing the sequence data;
and a GPU data concurrent processing step: on the GPU, a BWA-MEM algorithm after code reconstruction is operated to finish data concurrency of sequence data comparison;
and (3) concurrent processing of tasks on the GPU: on the GPU, for a seed set and a chain generated in the data concurrent processing process of sequence data comparison, the seed sets with the same or adjacent length, position and number are divided into the same data block, the chain is processed in the same way, so that the division of the seed set and the chain is completed, and a plurality of concurrent tasks are formed for the second time.
The seed sets and chains generated during the concurrent data processing of sequence alignment are obtained as follows: during sequence alignment, all identical common substrings of the sequence fragment to be aligned and the reference sequence are searched for; identical common substrings whose length is not less than a set value are taken as seeds, and a collection of such seeds forms a seed set; seeds located close to each other on the reference sequence are then stored together as a "chain", and the seeds on each chain are sorted in decreasing order of seed length.
Finding all identical common substrings of the sequence fragments to be aligned and the reference sequence is accomplished with the FM-Index algorithm and the backward search algorithm of the BWA-MEM algorithm.
Sequence data alignment is processed in a pipelined fashion by starting two scheduling threads: one host thread is initialized to wait, and the other host thread works through steps one and two; when the working host thread moves on to step two, the waiting thread is activated and starts step one;
step one: after the in-memory data are obtained, the GPU is scheduled to perform the seed-finding and seed-extension tasks, and the seed-extension result sets obtained by the GPU are copied, as intermediate data, from GPU memory to host memory;
step two: sam data are generated and the file is output.
The BWA-MEM algorithm after code reconstruction supports asynchronous execution through a global work list; the GPU allocates a local work list for each sequence alignment analysis task, and the concurrent alignment analysis tasks share a remote work list; during system operation, the GPU periodically reports the work items generated and consumed; once the total number of work items is zero, the process terminates; the GPU comprises three threads: a receiving thread, a sending thread and a worker thread, the first two being used for communication between GPUs and the last for processing local work items; each GPU receives remote work items from the previous device and hands them to the receiving thread, which completes the dispatch of the work items; both the worker thread and the receiving thread submit GPU kernels to complete their jobs, and the kernel of the receiving thread is submitted to a separate stream and assigned a higher priority.
Simplifying the BWA-MEM algorithm data structures means rebuilding the data structures in the CUDA language and removing complex constructs from them, including multi-level pointers and structures.
Optimizing some of the loops and logic-judgment statements of the BWA-MEM algorithm means equivalently unrolling some loops and rewriting the logic-judgment statements.
The plurality of concurrent tasks correspondingly form data partition blocks, and the data partition blocks that most need to be accessed are preferentially loaded into the cache by a data scheduler according to set scheduling rules.
The sequence data comprise single-end or paired-end DNA sequencing fragment data.
The acceleration method achieves efficient asynchronous communication through pipelined data transfer: in a multi-GPU node, the CPU and the GPUs are connected through the front-side bus; the front-side bus is connected to a connector to support CPU-GPU and GPU-GPU communication; data can be transferred directly between GPU1 and GPU2 and between GPU3 and GPU4, while data transfer between GPU2 and GPU3 goes through the CPU and is therefore divided into two stages: GPU2 first sends the data to main memory, and GPU3 then pulls the information from main memory; the GPU comprises an input memory-copy engine, an output memory-copy engine and an execution engine, so two-way memory copying and code execution can proceed concurrently; during execution of the sequence alignment algorithm, a plurality of receive buffers are designed at the receiving end of the GPU, implementing a pipelined receiver; after one segment of data has been transmitted, the GPU can immediately start transmitting the next segment while processing the data; the asynchronous program guarantees correctness through fine-grained synchronization points; the asynchronous communication mechanism employed is either the future/promise mechanism or the event mechanism.
Forming a plurality of concurrent tasks for the second time involves Grid and Block dimension division and the division of sequences among threads; wherein
the Grid and Block dimension division is performed in the GPU; according to the CUDA programming model, each Block in a Grid can be assigned to a streaming multiprocessor of the GPU for execution; in dimension selection, the Block dimension design is considered first, and in general the larger the Grid the better; the Block dimension design must be determined from the registers and amount of shared memory used on each SM in the actual computation and the amount of resources available on each SM of the hardware;
the division of sequences among threads means that each sequence is assigned one thread for its computation; for the processing of each data partition block, different analysis tasks may have different workloads, which is solved through the thread configuration of the GPU kernel: the computation in the GPU is completed by a grid composed of a plurality of thread blocks, each thread block being composed of a plurality of threads; the more data an analysis task has to process, the more threads are configured for it.
The BWA-MEM algorithm comprises the FM-Index search algorithm, the backward search algorithm, the Smith-Waterman algorithm and dynamic programming; the search for SMEMs and the generation of chains are completed by the mem_chain module, and the chain2aln module is responsible for extending both ends of each chain with the Smith-Waterman algorithm to find the optimal alignment position; the mem_chain module further comprises a mem_collect_intv module, which finds the maximal exact matches in a read, and a generate_chain module, which forms chains from the MEMs whose distances satisfy the condition; the BWA-MEM algorithm adopts a multi-kernel parallel framework.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention studies the CPU + GPU heterogeneous system in depth. By designing a task-parallel plus data-parallel mode, it closely combines the characteristics of the BWA-MEM algorithm with the characteristics of the GPU acceleration device, makes full use of the strong concurrent computing capability of the GPU, provides excellent performance for the BWA-MEM sequence alignment algorithm, and achieves high sequence alignment and concurrent-processing efficiency.
Task parallelism + data parallelism specifically means: first task parallelism + second task parallelism + data parallelism, where the first task parallelism occurs on the CPU, the second task parallelism occurs on the GPU, the data parallelism occurs on the GPU, and the second task parallelism takes place within the task streams of the data parallelism.
2. The invention designs an asynchronous data-access model, achieves efficient asynchronous communication through pipelined data transfer, and supports asynchronous execution of the concurrent alignment algorithm through a global work list.
3. The invention designs an alignment algorithm and task concurrency strategy. The strategy comprises high-concurrency strategies for the FM-index and Smith-Waterman algorithms and for Grid and Block dimensions, sequences and threads.
4. The invention designs a data-driven concurrent execution mechanism: data are asynchronously loaded into the GPU cache according to a data partitioning strategy and serve a plurality of computing tasks, so the memory-access overhead is amortized over all threads and the utilization of GPU resources is improved.
Drawings
FIG. 1 is a diagram of a two-thread two-step pipeline architecture.
Fig. 2 is a diagram of a multi-GPU node tree topology.
FIG. 3 is a schematic diagram of the operation of a distributed worklist.
Fig. 4 is a schematic diagram of an optimized chain storage format.
FIG. 5 shows a multithread parallelization mode after reconstruction of the data structure in the BWA-MEM algorithm.
FIG. 6 is a flow chart of the BWA-MEM algorithm.
FIG. 7 is a schematic diagram of the alignment algorithm data flow execution mechanism.
FIG. 8 is a diagram of a data concurrency enforcement architecture.
FIG. 9 is a schematic diagram of seed expansion.
FIG. 10 is a schematic diagram of parallel computation of the anti-diagonal elements of the alignment score matrix.
FIG. 11 is a graph comparing the time spent running the original version bwa and the heterogeneous version bwa.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The CPU + GPU heterogeneity-based high-concurrency sequence alignment calculation acceleration method comprises the following parts:
asynchronous data-access model design: efficient asynchronous communication is achieved through pipelined data transfer, and asynchronous execution of the concurrent alignment algorithm is then supported through a global work list;
alignment algorithm and task concurrency strategy: the strategy comprises high-concurrency strategies for the FM-index and Smith-Waterman algorithms and for Grid and Block dimensions, sequences and threads;
data-driven concurrent execution mechanism: data are asynchronously loaded into the GPU cache according to a data partitioning strategy and serve a plurality of computing tasks, so the memory-access overhead is amortized over all threads and the utilization of GPU resources is improved.
The method specifically comprises the following steps:
1. data access asynchronous model design
1.1 asynchronous threading
As shown in FIG. 1, large-scale sequence data must be divided for parallel processing. The current sequence partitioning strategy is that a single CPU data thread reads the data (single-end or paired-end DNA sequencing fragment data) block by block into host memory, the size of the data blocks being set according to the number of processing threads of the GPU. Assuming a Tesla V100 GPU with 80 compute units, each able to run at most 1024 threads, each card can process 80 x 1024 = 81920 computing tasks at a time. The host then starts two scheduling threads to process the sequence alignment computation tasks in a pipelined fashion: one host thread is initialized to wait; after the other host thread obtains the in-memory data, it schedules the GPU to perform the seed-finding and seed-extension tasks, copies the intermediate data obtained by the GPU (the seed-extension result sets) from GPU memory to host memory, and starts the second processing step, in which sam data are generated (by multiple CPU worker threads) and the file is output. When the previous thread enters step two, the waiting thread is activated and begins processing step one. In this way the heterogeneous system runs as a pipeline, the host and the GPU are always kept busy, and the best performance is achieved.
Pseudo-code implementation of Algorithm 1: CPU pipeline execution mechanism

Procedure EXECUTOR1(R, S)
  while R has an unprocessed partition Rn for some sample do
    // the data thread reads a sequence partition block into host memory
    for each s ∈ SS do
      Rn ← Scheduler(R, S)
    end for
  end while
End procedure

Procedure EXECUTOR2(Rn, G)
  while R has an unprocessed partition Rn for some GPU do
    // scheduling thread, pipeline step one
    GS ← GetGPUs(Rn, G)          // obtain the set of GPUs that will process Rn
    for each g ∈ GS do
      ParallelProcess(g, Rn, D)  // CUDA kernels asynchronously process the sequence task
    end for
  end while
End procedure

Procedure EXECUTOR3(D, J)
  while S has an unprocessed result set Sn for some job do
    // scheduling thread, pipeline step two
    JS ← GetDatas(D, J)          // obtain the set of tasks that process D
    for each j ∈ JS do
      ParallelProcess(j, D, S)   // worker threads process the task in parallel
    end for
    for each j ∈ JS do
      output(j, Snew)            // synchronously output the sam file of the task
    end for
  end while
End procedure
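The hand-off between the two host scheduling threads of FIG. 1 can be sketched as follows (a minimal illustration under assumptions: the mutex/condition-variable gate and the step1/step2 helper callbacks are illustrative choices, not the patent's implementation; the patent only specifies that one scheduling thread runs step one on the GPU while the other runs step two on the host, and that the waiting thread is woken as soon as the working thread moves on to step two):

#include <condition_variable>
#include <functional>
#include <mutex>

struct PipelineGate {
    std::mutex m;
    std::condition_variable cv;
    bool gpu_busy = false;   // true while some scheduler thread is inside step one
};

void schedulerThread(PipelineGate& gate,
                     const std::function<bool()>& step1_seed_and_extend_on_gpu,  // assumed helper: returns false when input is exhausted
                     const std::function<void()>& step2_generate_and_write_sam)  // assumed helper: CPU worker threads + sam output
{
    while (true) {
        {   // step one: exclusive use of the GPU for seeding + extension of one batch
            std::unique_lock<std::mutex> lk(gate.m);
            gate.cv.wait(lk, [&] { return !gate.gpu_busy; });
            gate.gpu_busy = true;
        }
        bool more = step1_seed_and_extend_on_gpu();
        {   // moving on to step two: wake the waiting scheduler so it can start step one
            std::lock_guard<std::mutex> lk(gate.m);
            gate.gpu_busy = false;
        }
        gate.cv.notify_one();
        if (!more) break;
        step2_generate_and_write_sam();   // overlaps with the other thread's step one
    }
}

Two threads running schedulerThread on the same PipelineGate reproduce the ping-pong of FIG. 1: while one thread generates and writes sam data for batch k, the other already drives the GPU for batch k+1.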
1.2 asynchronous communication
In a multi-GPU node, the CPU and the GPUs are connected through the front-side bus. The front-side bus is connected to a PCI-Express or NVLink connector or the like to support CPU-GPU and GPU-GPU communication. As shown in FIG. 2, data can be transferred directly between GPU1 and GPU2 and between GPU3 and GPU4. Data transfer between GPU2 and GPU3 goes through the CPU and is therefore divided into two stages: GPU2 first sends the data to main memory, and GPU3 then pulls the information from main memory. The GPU contains two memory-copy engines (input and output) and one execution engine, so two-way memory copying and code execution can proceed concurrently. During execution of the sequence alignment algorithm, in order to further improve communication efficiency, the system designs several receive buffers at the receiving end of the GPU, implementing a pipelined receiver. After one segment of data has been transmitted, the GPU can immediately start the transfer of the next segment while processing the data, which increases the GPU's computation-communication overlap.
Asynchronous programs require fine-grained synchronization points to guarantee correctness. To minimize synchronization overhead, the system uses two low-latency asynchronous communication mechanisms. One is the future/promise mechanism: a request to transfer a data segment, for example, returns a future object indicating that the transfer has not yet completed, and subsequent operations that depend on it are blocked; once the data transfer is complete, the data can be processed immediately. The other is the event mechanism: by inserting an event into a CUDA stream, all operations submitted before the event can be synchronized.
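As a concrete illustration of the event mechanism and the pipelined receiver described above, the following sketch (an example under assumptions: the double-buffer scheme, kernel name, chunk size and launch dimensions are illustrative, and the host buffer is assumed to be pinned via cudaHostAlloc/cudaHostRegister so that cudaMemcpyAsync really overlaps) copies each data segment on a copy stream, records an event after the copy, and lets the execution stream wait on that event before launching the kernel for the segment, so the transfer of segment i+1 overlaps with the processing of segment i:

#include <cuda_runtime.h>
#include <cstddef>

__global__ void processChunk(const char* d_buf, size_t n) { /* placeholder processing kernel */ }

// h_data is assumed to be pinned host memory; otherwise cudaMemcpyAsync degrades to a synchronous copy.
void pipelinedTransfer(const char* h_data, size_t total, size_t chunk) {
    cudaStream_t copyStream, execStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&execStream);

    char* d_buf[2];
    cudaEvent_t copied[2], consumed[2];                 // double buffering on the device
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk);
        cudaEventCreateWithFlags(&copied[b],   cudaEventDisableTiming);
        cudaEventCreateWithFlags(&consumed[b], cudaEventDisableTiming);
    }

    int slot = 0;
    for (size_t off = 0; off < total; off += chunk, slot ^= 1) {
        size_t n = (off + chunk < total) ? chunk : total - off;
        // Do not overwrite a buffer whose previous kernel has not finished yet.
        cudaStreamWaitEvent(copyStream, consumed[slot], 0);
        cudaMemcpyAsync(d_buf[slot], h_data + off, n, cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(copied[slot], copyStream);
        // The kernel for this segment starts only after the segment has fully arrived.
        cudaStreamWaitEvent(execStream, copied[slot], 0);
        processChunk<<<64, 256, 0, execStream>>>(d_buf[slot], n);
        cudaEventRecord(consumed[slot], execStream);
    }
    cudaStreamSynchronize(execStream);
    for (int b = 0; b < 2; ++b) {
        cudaFree(d_buf[b]);
        cudaEventDestroy(copied[b]);
        cudaEventDestroy(consumed[b]);
    }
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(execStream);
}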
1.3 Asynchronous execution
Since each GPU stores only a subset of the sequences in global memory, each GPU can only process the work items associated with its local sequences. For each GPU, the work items are therefore divided into local work and remote work, represented in the system by localwork and remotework objects respectively. The attributes of a work item include a sequence ID and a sequence attribute, and a task ID attribute is used to distinguish different alignment analysis tasks. The system implements a distributed work list to assist the asynchronous execution of the sequence alignment algorithm. The distributed work list maintains a global list of pending work items. Each work item may generate new work items to be added to the work list; for example, the worker thread traverses the seed set of each sequence and creates new work items for the seed-set groups. In the implementation of the distributed work list, global coordination and the work-item count are managed centrally on the host, while the work items are stored in a distributed manner in the global memory of each GPU. The GPU allocates a local work list for each alignment analysis task, and the concurrent alignment analysis tasks share a remote work list. During system operation, the GPU periodically reports the work items generated and consumed. Once the total number of work items is zero, the process terminates. As shown in FIG. 3, the implementation of the distributed work list is described from the perspective of a single GPU. The GPU comprises three threads: a receiving thread, a sending thread and a worker thread. The first two are used for communication between GPUs and the last for processing local work items. The workflow of the distributed work list is as follows. Each GPU receives remote work items from the previous device and hands them to the receiving thread, which dispatches the work items. Both the worker thread and the receiving thread submit GPU kernels to complete their jobs, and the kernel of the receiving thread is submitted to a separate stream and assigned a higher priority. The purpose is to prevent local computing tasks from blocking the asynchronous communication tasks, improving the performance and responsiveness of the system.
The local work list is a circular queue realized by an array, and the expense of dynamically allocating the memory during the operation is avoided. Each local work list contains two producers and one consumer as shown in fig. 3. The receiving thread and the worker thread add workitems to the local worklist, and only the worker thread consumes the workitems. The local work list comprises a memory array and three domains: start, end and pending. The consumption of work items is indicated by increasing the value of start. In order to avoid consuming data items which are not ready, the value of pending is increased through atomic operation, and space is reserved for the work items to be added. After the producer adds a work item, the value of end is synchronized with pending. To avoid write conflicts for multiple producers, new workitems generated by the worker thread are added to the local worklist by reducing the start value. When adding a work item to the local work list, multiple threads in the kernel perform write operation on the local work list, and the overhead of atomic operation is very large. The system uses warp-aggregated atomic to solve this problem. Specifically, the threads in each warp compute the total number of work items added and the respective offsets, then reserve space for writing data with atomic operations, and finally perform the write operation. Thus, the atomic operation between threads becomes the atomic operation between warps, and the overhead of the atomic operation is greatly reduced.
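The warp-aggregated append just described can be sketched as follows (a minimal illustration under assumptions: the WorkItem fields, the worklist layout and the use of cooperative groups are not taken from the patent). The active threads of a warp elect a leader, the leader reserves a contiguous slice of the pending region with a single atomicAdd, the reserved base index is broadcast to the warp, and each thread writes its item at base + rank, so one atomic per warp replaces one atomic per thread:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

struct WorkItem { unsigned seq_id; unsigned task_id; };   // illustrative fields

struct LocalWorklist {
    WorkItem*     items;      // circular buffer backing array
    unsigned      capacity;   // power of two so that (idx & (capacity - 1)) wraps around
    unsigned int* pending;    // next free slot; reserved before the data is actually written
};

// Callers invoke this only from threads that really have an item to push,
// so cg::coalesced_threads() contains exactly the pushing threads of the warp.
__device__ void pushWarpAggregated(LocalWorklist wl, WorkItem item) {
    cg::coalesced_group active = cg::coalesced_threads();
    unsigned rank = active.thread_rank();
    unsigned base = 0;
    if (rank == 0)                                    // the warp leader reserves space once
        base = atomicAdd(wl.pending, active.size());
    base = active.shfl(base, 0);                      // broadcast the reserved base slot
    wl.items[(base + rank) & (wl.capacity - 1)] = item;
}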
1.4 Memory design and optimization
When a CUDA device exchanges data with the host, only data that supports bit copying can be transferred; that is, a structure containing pointer members, or a data structure such as a pointer to a pointer, cannot be transferred directly between the device and the host. The data structures therefore have to be adjusted: the data that actually needs to be transferred between the device and the host is selected and allocated separately in one contiguous storage space, which also speeds up the data transfer. Taking the data structure that stores the generated chains as an example, the optimized chain storage format is shown in FIG. 4.
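A concrete sketch of this flattening follows (an illustration under assumptions: the Seed/Chain field names and the offset-plus-count layout are chosen here for clarity and are not the exact FIG. 4 format). All seeds are packed into one contiguous buffer and each chain stores an offset and a count into that buffer, so both arrays are bit-copyable and can each be moved to the device with a single cudaMemcpy:

#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

struct Seed  { int64_t ref_pos; int32_t read_pos; int32_t len; };
struct Chain { uint32_t seed_offset; uint32_t seed_count; uint32_t read_id; int32_t score; };

struct ChainBlockDev {           // device-side view: plain pointers into two flat buffers
    Seed*    seeds;
    Chain*   chains;
    uint32_t n_seeds, n_chains;
};

ChainBlockDev uploadChains(const std::vector<Seed>& seeds, const std::vector<Chain>& chains) {
    ChainBlockDev d{};
    d.n_seeds  = (uint32_t)seeds.size();
    d.n_chains = (uint32_t)chains.size();
    cudaMalloc(&d.seeds,  d.n_seeds  * sizeof(Seed));
    cudaMalloc(&d.chains, d.n_chains * sizeof(Chain));
    // Two contiguous copies replace many small transfers of pointer-linked per-chain data.
    cudaMemcpy(d.seeds,  seeds.data(),  d.n_seeds  * sizeof(Seed),  cudaMemcpyHostToDevice);
    cudaMemcpy(d.chains, chains.data(), d.n_chains * sizeof(Chain), cudaMemcpyHostToDevice);
    return d;
}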
Based on the six kinds of memory space inside the GPU, the storage location of each piece of data is designed and optimized according to its volume and access characteristics. The data used in the BWA-MEM kernels are listed in Table 1.
TABLE 1 (provided as an image in the original publication; it lists the data used in the BWA-MEM kernels and their assigned GPU memory spaces)
2. Strategy design and implementation of comparison algorithm and task concurrency
2.1 BWA-MEM Algorithm parallelization
The reconstructed BWA-MEM keeps the algorithms of the original program, which guarantees the high accuracy of the output results. In the process of reconstructing the CPU code into GPU kernel code, the CUDA language is used, the data structures are rebuilt, complex constructs such as multi-level pointers and structures are removed, some loops are equivalently unrolled, and the logic-judgment statements are rewritten so that the code is better suited to running on the GPU architecture.
The main algorithms of BWA-MEM are the FM-Index search algorithm, the backward search algorithm, the Smith-Waterman algorithm and dynamic programming. The two modules mem_chain and chain2aln are the two most important hot spots of the whole BWA-MEM sequence alignment. The mem_chain module mainly completes the SMEM search and chain generation, and the chain2aln module is responsible for extending both ends of each chain with the Smith-Waterman algorithm to find the optimal alignment position. The mem_chain module contains two important sub-modules: mem_collect_intv and generate_chain. The mem_collect_intv module finds the maximal exact matches (MEMs) in a read, and the generate_chain module forms chains from the MEMs whose distances satisfy the condition. Since the whole algorithm flow passes through several modules, the multi-kernel parallel framework shown in FIG. 5 is designed so that the program performs well on the GPU. The flow of the BWA-MEM algorithm is shown in FIG. 6.
2.2 Search algorithm
If the string W is a substring of X, every occurrence of W in X corresponds to a contiguous range of the suffix array, because all suffixes that have W as a prefix are sorted next to each other. Based on this observation we define:
R_min(W) = min{ k : W is a prefix of X_S(k) }    (1)
R_max(W) = max{ k : W is a prefix of X_S(k) }    (2)
(An example of the BWT transform and the suffix array is given as an image in the original publication.)
If W is the empty string, then R_min(W) = 1 and R_max(W) = n - 1. The interval [R_min(W), R_max(W)] is called the SA interval of W, and the set of all positions at which W occurs in X is { S(k) : R_min(W) ≤ k ≤ R_max(W) }. For example, in the BWT transform and suffix array of the example, the SA interval of the string "ac" is [2, 3]; the suffix-array values in this interval are 0 and 3, giving the positions of all occurrences of "ac" in the original string. Knowing an interval of the suffix array, we therefore know the corresponding positions in the original string. Aligning a sequence is thus equivalent to searching for the substrings of X whose SA intervals match the query. For the exact-match problem at most one such SA interval can be found; for the inexact-match problem there may be many SA intervals.
For the backward search algorithm, let C(a) be the number of symbols in X[0, n-2] that are lexicographically smaller than a ∈ Σ, and let O(a, i) be the number of occurrences of a in B[0, i]. Ferragina and Manzini (2000) showed that, for a ∈ Σ and a string W,
R_min(aW) = C(a) + O(a, R_min(W) - 1) + 1    (3)
R_max(aW) = C(a) + O(a, R_max(W))            (4)
and that R_min(aW) ≤ R_max(aW) if and only if aW is a substring of X. This result makes it possible to test whether W is a substring of X and to count its occurrences in O(|W|) time by iteratively computing R_min and R_max starting from the last character of W, a procedure called backward search. Note that equations (3) and (4) in effect traverse the prefix tree of X top-down: if the SA interval of a parent node is known, the SA interval of a child node can be computed in constant time. In this sense, backward search is equivalent to exactly matching the string against the prefix tree, but without explicitly putting the prefix tree into memory.
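A minimal sketch of backward search follows (under assumptions: the FMIndex layout with a full occurrence table, the 0..3 base encoding and the function names are illustrative; real BWA packs the BWT and samples O(a, i) at checkpoints rather than storing a full table). The loop narrows the SA interval one query character at a time using equations (3) and (4):

#include <cstdint>

struct FMIndex {
    const int64_t* O;    // full occurrence table: O[a*(n+1) + (i+1)] = # of a in B[0..i]; O[a*(n+1)] = 0
    int64_t        C[5]; // C[a] = # of symbols in X[0, n-2] lexicographically smaller than a
    int64_t        n;    // length of X including the '$' sentinel
};

__host__ __device__ inline int64_t occ(const FMIndex& fm, int a, int64_t i) {
    return fm.O[(int64_t)a * (fm.n + 1) + i + 1];   // defined for i = -1 .. n-1
}

// Returns the number of exact occurrences of query[0..qlen-1] in X, and its SA interval.
__host__ __device__ int64_t backwardSearch(const FMIndex& fm, const uint8_t* query, int qlen,
                                           int64_t* sa_lo, int64_t* sa_hi) {
    int64_t lo = 1, hi = fm.n - 1;                  // SA interval of the empty string
    for (int i = qlen - 1; i >= 0 && lo <= hi; --i) {
        int a = query[i];                           // 0=A, 1=C, 2=G, 3=T
        lo = fm.C[a] + occ(fm, a, lo - 1) + 1;      // equation (3)
        hi = fm.C[a] + occ(fm, a, hi);              // equation (4)
    }
    if (lo > hi) return 0;                          // the query does not occur in X
    *sa_lo = lo;
    *sa_hi = hi;
    return hi - lo + 1;
}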
2.3 Smith-Waterman algorithm and dynamic programming
After the algorithm has found all qualifying "seeds" (substrings of the DNA sequencing fragment that are super-maximal exact matches on the reference sequence), the "seed extension" stage is entered. In this stage, the alignment score matrix of two strings must be computed, using the Smith-Waterman algorithm and a dynamic-programming process.
Steps of the Smith-Waterman algorithm:
(1) initialize the score matrix H so that row i represents the character a_i (of the sequence to be aligned) and column j represents the character b_j (of the reference);
(2) compute each entry H(i, j) of the matrix:
H(i, j) = max{ 0,
               H(i-1, j-1) + S(i, j),
               H(i-1, j) + pen_gap(a_i, -),
               H(i, j-1) + pen_gap(-, b_j) }
where S(i, j) = pen_match if a_i = b_j and S(i, j) = pen_mismatch otherwise; pen_match is the reward when the two characters are identical, pen_mismatch is the penalty when they differ, pen_gap(a_i, -) is the penalty for an insertion or deletion in the sequence to be aligned (taking the reference sequence as the horizontal sequence), and pen_gap(-, b_j) is the penalty for an insertion or deletion in the reference sequence.
(3) compute the score matrix: an element H(i, j) can only be computed after the three elements to its left, upper left and top are available, so filling the score matrix is itself a dynamic-programming process: the larger, more complex original problem is divided into smaller, more easily solved sub-problems, and after the optimal solutions of the sub-problems are obtained they are combined into the optimal solution of the original problem.
2.4 task partitioning parallelization
Task partitioning involves two aspects: Grid/Block dimension division, and the division of sequences among threads. For Grid and Block dimension division, according to the CUDA programming model each Block in a Grid is assigned to a streaming multiprocessor (SM) of the GPU for execution. In dimension selection, the Block dimension design is considered first, and in general a larger Grid is better. The Block dimension design must be determined from the registers and the amount of shared memory used on each SM in the actual computation and the amount of resources available on each SM of the hardware. Next is the division of reads among threads. During alignment, reads are naturally independent: different reads are completely independent of one another, needing only the reference-sequence information and no information from other reads. The most intuitive processing scheme is therefore adopted: each read is assigned one thread for its computation. For the processing of each data partition block, different analysis tasks may have different workloads; this is solved through the thread configuration of the GPU kernel: the computation in the GPU is completed by a grid composed of several thread blocks, each thread block being composed of several threads, and the more data an analysis task has to process, the more threads are configured for it.
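A minimal sketch of this "one read per thread" launch configuration follows (assumptions: the kernel name, the block size of 256 and the flattened argument list are illustrative choices, not values fixed by the patent); the grid dimension simply grows with the number of reads in the partition block:

#include <cuda_runtime.h>

__global__ void seedReadsKernel(int n_reads /*, flattened read batch, FM-index, output seed buffers */) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id = read index
    if (r >= n_reads) return;
    // ... each thread seeds exactly one read, independently of all other reads ...
}

void launchSeeding(int n_reads, cudaStream_t stream) {
    const int block = 256;                               // chosen from the register / shared-memory budget per SM
    const int grid  = (n_reads + block - 1) / block;     // one thread per read
    seedReadsKernel<<<grid, block, 0, stream>>>(n_reads);
}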
3. Data-driven concurrent execution mechanism
3.1 data flow model design
The design and implementation of the data-driven concurrent execution mechanism are illustrated with a formalized algorithm that describes the data-driven concurrent computation on the GPU and the output of the computation results. The data-loading part mainly comprises the data partition-block strategy and the loading of task state, and the concurrent-processing part comprises the processing of the concurrent analysis tasks on the same partition and the synchronized output of data.
When a receiving thread of a GPU receives remote workitems from a CPU, the workitems are added into a local worklist of each task according to task IDs, and then the workthreads of each task are triggered to process. In order to fully utilize the relevance of concurrent task data access, the system provides a data-driven concurrent task execution model, and the core idea is as follows: the concurrent analysis tasks share the same access sequence of the data blocks, and each time one data block group and a plurality of task specific data are loaded into the GPU cache, a plurality of calculation tasks are triggered to be concurrently processed. The next data block is loaded only after all relevant tasks have been processed.
Taking FIG. 7 as an example, if the current GPU has two computation tasks, FM-index and Smith-Waterman, which both need to access data partition block 1 and partition block 2 to complete their computation, the system first loads a data partition block into the cache, triggers the two tasks to process it, and then loads the next data partition block. The concurrent analysis tasks are thus completed with the analysis data being accessed only once, which reduces the total data-access overhead.
Pseudo-code implementation of Algorithm 2: GPU high-concurrency execution mechanism

Procedure EXECUTOR(R, J)
  while R has an unprocessed partition Ri for some job do
    // schedule and load a data partition block
    Ri ← Scheduler(R)
    // obtain the set of tasks that process Ri
    JS ← GetJobs(Ri, J)
    // process the data set in parallel
    for each j ∈ JS do
      ParallelProcess(j, Ri, Sj)
    end for
    // synchronize the result sets of the tasks
    for each j ∈ JS do
      datas(j, Snew)
    end for
  end while
End procedure
Taking a sequence alignment algorithm as an example, the data-driven concurrent alignment-task execution model is explained in detail. Let the data the alignment algorithm needs to access be denoted D = (T, R, N, Q), where T is the set of sequence IDs, R is the set of read sequences, N is the set of sequence names and Q is the set of read qualities. In this model, each sequence alignment algorithm has its sequence data D = (T, R, N, Q), J is the set of existing alignment algorithms, the sequence partition loaded for processing is Ri, the i-th sequence partition of R, and JS denotes a subset of the alignment algorithms.
The execution model is shown in Algorithm 2. While unprocessed sequence partition blocks remain, a CPU scheduling thread loads one partition block into the GPU cache, the alignment analysis tasks that need to process it are obtained, and they are triggered to run in parallel. After the partition block has been processed, the computed alignment sets are output, and the next cycle begins, until all sequence partition blocks have been processed.
3.2 Data partitioning and loading
Large-scale data must be divided for parallel processing. The current data partitioning strategy is mainly based on an overall partitioning, which gives better load balancing by distributing the load across multiple devices. Different data sets use different partitioning methods, including the division of the sequence set and the division of the chain and seed sets. The basic partitioning rules are: the data block size of the sequences is set according to the number of processing threads of the GPU, and the data blocks of the chains and seed sets are divided by equal or adjacent length, position and number. This mitigates the bucket effect of parallel computing, accelerates convergence, reduces synchronization overhead and reduces the storage overhead of message data.
The order of data partitioning and processing does not affect the correctness of the algorithm. To maximize the temporal and spatial locality of the data accesses of concurrent analysis tasks, the system also provides a data scheduler that preferentially loads the partition blocks that most need to be accessed into the cache for processing. For each data partition block Gi to be accessed, the system dynamically assigns a priority Pi, and the partition block with the highest priority is loaded into the cache for processing, improving cache utilization. The basic scheduling rules are: first, when a data partition block needs to be processed by an associated task, it is given the highest priority and loaded into the cache first; second, a data partition block has a higher priority if it is the last data segment of a sample, because the threads waiting to produce that sample's final result file then finish sooner and less cache space is consumed.
3.3 Concurrent processing of data
As shown in fig. 8, each time one or more data partition blocks are loaded into the cache, the associated concurrent analysis task is triggered to process. The newly submitted task only needs to send the initial work item to the corresponding equipment to wait for the triggered processing. When the number of concurrent analysis tasks is so large that threads of a single GPU are not enough, they are batched. Firstly, loading a first batch of data blocks into a cache, and triggering corresponding analysis tasks to process; after the step is finished, storing the results output by the partition blocks into a cache, and triggering the next batch of analysis tasks to be processed; only after all concurrent analysis tasks have been computed will the next data partition or partitions be loaded. For the processing of each multiple data partition, different analysis tasks may have different workloads, which results in low utilization of hardware resources. The system design is solved through the thread configuration of the GPU kernel, the calculation in the GPU is completed by grid composed of a plurality of thread blocks, and the thread blocks are composed of a plurality of threads. The more analysis tasks of the data to be processed, the more threads are configured for parallelization processing.
The specific operating example is as follows:
4.1 Test platform and data set
The test platform runs CentOS Linux release 7.8 with kernel version 3.10.0-1127. The CPUs are two Intel(R) Xeon(R) Gold 6240 processors @ 2.60 GHz, each with 18 cores and 2 threads per core, giving 72 CPU threads in total, and the host memory is 125 GiB. The host carries two NVIDIA Tesla V100 GPU heterogeneous processors with 32 GiB of on-board GPU memory; the GPU driver version is 410.48 and the CUDA version is 10.0.
The test data used randomly sampled clinical exogenic paired-end DNA sequencing data, and two fastq data files to be aligned, each of size 11.79GiB, for a total of 23.58 GiB.
4.2 Data calculation flow
The host side reads the input command-line parameters and parses them to obtain the input file name, output file name, reference-sequence file name, the GPUs to use, the number of threads and other information, and then reads the reference-sequence data, including the SA array, the BWT array, the reference-sequence index and so on. After this work is completed, a read_input thread is started that is dedicated to reading the input file. Since reading a file is generally much faster than the computation, the strategy adopted is to read at most four batches of data ahead (one batch being the maximum amount of data the GPU can process at a time). Specifically, after reading a batch of data, g_seqIndex (the index of the batch read) is incremented by one and compared with g_currIndex (the index of the batch currently being processed); when g_seqIndex is ahead of g_currIndex by 4, the read_input thread sleeps for 100 milliseconds (freeing CPU resources for other tasks) and checks again, until the task scheduling thread has copied the data to be processed out of host memory and incremented g_currIndex, at which point the check ends and the thread continues reading the following data batches of the file. In this way the read_input thread is guaranteed never to occupy too much host memory.
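A minimal sketch of this throttling loop follows (assumptions: the variable names g_seqIndex and g_currIndex come from the description above, but their types, the use of std::atomic and the readNextBatch callback are illustrative). The reader stays at most four batches ahead of the batch currently being processed, sleeping 100 milliseconds between checks so the host memory footprint stays bounded:

#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

std::atomic<long> g_seqIndex{0};    // index of the last batch read from the fastq files
std::atomic<long> g_currIndex{0};   // index of the batch currently being processed

// readNextBatch is an assumed helper: it reads one GPU-sized batch into host memory
// and returns false when the input files are exhausted.
void readInputThread(const std::function<bool()>& readNextBatch) {
    while (true) {
        // Throttle: never run more than four batches ahead of the consumer.
        while (g_seqIndex.load() >= g_currIndex.load() + 4)
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        if (!readNextBatch())
            break;                  // no more input data
        g_seqIndex.fetch_add(1);    // publish the newly read batch
    }
}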
Second, the host side initializes the CUDA resources, uploads the reference-sequence data and the alignment parameters to the GPU, then allocates host memory for receiving the data returned by the GPU, and finally starts the two host pipeline task scheduling threads. One task scheduling thread enters the first pipeline step and takes a batch of data to be aligned from memory; as a safeguard (in case the computation outruns the reading), g_currIndex is compared with g_seqIndex, and while g_currIndex is larger than g_seqIndex the scheduling thread sleeps for 50 milliseconds (freeing CPU resources for other tasks) and checks again, until the read_input thread has read a file batch and incremented g_seqIndex; the check then ends, g_currIndex is incremented by one, the data to be aligned are copied to GPU memory, and the GPU kernels are launched to compute seeds, seed-extension scores and other asynchronous tasks. At this time the other task scheduling thread is in the waiting state.
The GPU kernels comprise two functions: seeding and extending. Seeding is essentially a memory-access operation: using the FM-Index algorithm (looking up the position indices of substrings of the original string through the SA array, the BWT array and so on) and the backward search algorithm (equations (3) and (4)), all identical common substrings of the DNA fragment to be aligned and the reference sequence with a length of at least 19 (the default value) are found and taken as "seeds"; seeds located close to each other on the reference sequence are then stored together as "chains", and the seeds on each chain are sorted in decreasing order of seed length.
In the end, each read obtains a batch of seed chains, each chain containing several seeds. If no grouping were done and all the seeds of one read were processed by a single subsequent GPU thread, the requirement of instruction consistency within a warp on the GPU would make the processing speed depend on the read that is processed most slowly, exactly the shortest-board (bucket) effect. To solve this problem, after careful analysis, the total length of the portion of each read that needs extension is computed: the length each seed needs to be extended is obtained by subtracting the seed length from the read length, and these lengths are summed. After the total extension length of every read has been obtained, the largest one (the read that plays the role of the bucket's shortest board, i.e. the slowest read) is found and used as the reference: reads whose summed extension totals come close to this value are divided into the same group. For example, if read1, read2 and read3 have total extension lengths of 10000, 13000 and 30000 respectively, read1 and read2 are put into one group and read3 into another; GPU processing threads are then allocated per group to perform the extension.
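A minimal sketch of this grouping follows (assumptions: the greedy sequential packing and the struct and function names are illustrative; only the rule itself, taking the largest total extension length as the per-group budget and packing reads whose summed totals stay close to it, e.g. totals 10000/13000/30000 becoming the groups {read1, read2} and {read3}, comes from the description above):

#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadExt { uint32_t read_id; uint64_t ext_total; };   // total bases this read must extend

std::vector<std::vector<uint32_t>> groupByExtensionLength(const std::vector<ReadExt>& reads) {
    std::vector<std::vector<uint32_t>> groups;
    if (reads.empty()) return groups;
    // The longest total extension defines the per-group budget (the slowest read).
    uint64_t budget = std::max_element(reads.begin(), reads.end(),
        [](const ReadExt& a, const ReadExt& b) { return a.ext_total < b.ext_total; })->ext_total;

    uint64_t acc = 0;
    std::vector<uint32_t> current;
    for (const ReadExt& r : reads) {
        if (!current.empty() && acc + r.ext_total > budget) {   // group is "full"
            groups.push_back(current);
            current.clear();
            acc = 0;
        }
        current.push_back(r.read_id);
        acc += r.ext_total;
    }
    if (!current.empty()) groups.push_back(current);
    return groups;      // GPU processing threads are later assigned per group
}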
Extending involves more computation and is a compute-intensive task. It mainly uses the Smith-Waterman algorithm and dynamic programming to "extend" the seeds on the seed chains found before, i.e. to find the real matching situation. As shown in FIG. 9, when the seed ACGTTAA of the sequence ACCACGTTAAGCGA is extended, it is only necessary to (1) find the portions of the reference sequence and the aligned sequence that lie outside the seed, and (2) perform local alignment and scoring of these to-be-extended portions with the Smith-Waterman algorithm.
Since the Smith-Waterman algorithm is very computation-intensive (in actual operation, three matrices have to be computed if no optimization is applied), in order to fully exploit the characteristics of GPU parallel computing, the order in which the elements of the score matrix are computed was examined: the computation of each matrix element depends only on the values of the three elements to its left, upper left and top, and on no other elements, so it is again a dynamic-programming process, and the elements on an anti-diagonal of the matrix have no dependency on one another. From this characteristic it follows that the elements on an anti-diagonal of the matrix can be computed in parallel by the GPU, as shown in FIG. 10. That is, a sufficiently large number of GPU threads is allocated (for example, the same number as the length of the aligned target sequence); in the first clock cycle the GPU threads compute all elements on the first anti-diagonal in parallel, in the second clock cycle all elements on the second anti-diagonal, and so on until all elements of the matrix have been computed. During the computation, the maximum score, the position on the reference sequence where it is obtained and the position on the aligned target sequence are continuously recorded and updated, and finally the final score and the corresponding position information are stored in GPU memory.
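A minimal sketch of this anti-diagonal scheme follows (assumptions: the single gap penalty, the one-block-per-extension mapping, the global-memory scratch matrix and the packed best-score trick are illustrative simplifications; real BWA-MEM extension keeps separate gap-open/gap-extend matrices). Thread j owns column j, and on step d all cells (i, j) with i + j = d are computed together, since each cell depends only on cells of earlier anti-diagonals:

#include <cstdint>

// query: rows 1..m, target: columns 1..n; H is a zero-initialised (m+1) x (n+1) scratch matrix.
// Launch with one block per extension job and blockDim.x >= n, e.g. swAntiDiagonal<<<1, n>>>(...),
// with n <= 1024 and m, n < 65536 (positions are packed into 16 bits each).
__global__ void swAntiDiagonal(const uint8_t* query, int m,
                               const uint8_t* target, int n,
                               int* H, int* best_score, int* best_i, int* best_j) {
    const int pen_match = 1, pen_mismatch = -4, pen_gap = -6;      // illustrative scores
    const int j = threadIdx.x + 1;                                 // this thread's column
    __shared__ unsigned long long s_best;                          // (score << 32) | (i << 16) | j
    if (threadIdx.x == 0) s_best = 0;
    __syncthreads();

    for (int d = 2; d <= m + n; ++d) {                             // one anti-diagonal per step
        const int i = d - j;
        if (j <= n && i >= 1 && i <= m) {
            const int sub  = (query[i - 1] == target[j - 1]) ? pen_match : pen_mismatch;
            const int diag = H[(i - 1) * (n + 1) + (j - 1)] + sub;
            const int up   = H[(i - 1) * (n + 1) + j] + pen_gap;
            const int left = H[i * (n + 1) + (j - 1)] + pen_gap;
            const int h    = max(0, max(diag, max(up, left)));
            H[i * (n + 1) + j] = h;
            // Track the best cell; packing score and position into 64 bits makes one atomicMax enough.
            atomicMax(&s_best, ((unsigned long long)h << 32) |
                               ((unsigned long long)i << 16) | (unsigned long long)j);
        }
        __syncthreads();                                           // the next diagonal may now read this one
    }
    if (threadIdx.x == 0) {
        *best_score = (int)(s_best >> 32);
        *best_i     = (int)((s_best >> 16) & 0xffff);
        *best_j     = (int)(s_best & 0xffff);
    }
}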
When the seeds of a chain are extended, not all of them need to be extended, because after some seeds have been extended the extension result may already cover seeds that have not been extended. Before each seed is extended, a check is therefore made against the extension results already in GPU memory: if the existing results do not cover the current seed, it is extended; otherwise it is skipped and the remaining seeds of the chain are processed.
After a batch of alignment sequences has been completely processed, all the alignment result sets have been written to GPU memory and the current GPU computation task is finished. The task scheduling thread copies the computation results from GPU memory to host memory and starts the second pipeline step; at the same time the other, waiting task scheduling thread is activated and starts processing the next batch of alignment sequence data. From this moment on, the two steps of the pipeline are executed simultaneously.
In the second pipeline step, two tasks are mainly performed. The first is to start several host worker threads that generate the final alignment results from the result sets obtained in the first pipeline step; each result is a string in sam format, containing the optimal (or sub-optimal) matching position of the aligned sequence on the reference sequence, the mapping quality, the specific alignment (CIGAR value) and other information. The second task is to output the sam data to the file after the first task is completed.
4.3 Test results and analysis
In this test, the host side ran the original program and the heterogeneous program under the same conditions with 5, 10, 20 and 30 threads respectively; the resulting running-time comparison is shown in FIG. 11. As can be seen, as the number of host threads increases, the total time of both the original version and the heterogeneous version decreases. Up to 20 host threads, the time of the heterogeneous version drops significantly, but beyond 20 host threads it no longer decreases and instead fluctuates around a certain value; for example, when the number of host threads is increased from 20 to 30, the time of the heterogeneous version does not decrease noticeably but increases slightly, which may be caused by other factors such as small fluctuations in reading the reference file during resource initialization or in writing output files along the way. The reason the heterogeneous version's time does not keep falling as host threads are added is that the GPU is at full load regardless of the number of host threads: with few host threads, the first pipeline step computes faster than the second; adding host threads can only speed up the generation of sam data in the second step, and at around 20 host threads the speeds of the two pipeline steps become comparable, so the overall processing speed is optimal. The original version, by contrast, keeps getting faster because adding host threads proportionally increases the amount of data processed and it is essentially not constrained by other factors. Under the premise of low overall time and low host-resource usage, with 20 host threads for the heterogeneous version, the original version with 20 threads takes 13 times as long as the heterogeneous version, and the original version with 30 threads takes 9 to 10 times as long, a clear speed-up.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A CPU + GPU heterogeneous high-concurrency sequence alignment computation acceleration method, characterized by comprising the following steps:
a BWA-MEM algorithm code reconstruction step: for the BWA-MEM algorithm, simplifying its data structures and optimizing some of its loops and logic judgment statements so that it is suitable for running on a GPU architecture;
a task-level concurrent processing step on the CPU: on the CPU, for the set of sequences to be aligned, first setting the data block size of the sequences according to the number of GPU processing threads and dividing the sequence set accordingly, thereby forming a plurality of concurrent tasks for the first time; then having the CPU data threads read the sequence data in blocks and then aligning the sequence data;
a data-level concurrent processing step on the GPU: on the GPU, running the code-reconstructed BWA-MEM algorithm to achieve data-level concurrency of the sequence alignment;
a task-level concurrent processing step on the GPU: on the GPU, for the seed sets and chains generated during the data-concurrent sequence alignment, dividing seed sets with the same or adjacent lengths, positions, and numbers into the same data block, and treating the chains in the same way, thereby completing the division of the seed sets and chains and forming a plurality of concurrent tasks for the second time;
wherein the sequence alignment is processed in a pipelined fashion by starting two scheduling threads: one host thread is initialized to a waiting state while the other host thread works through pipeline steps one and two; when the working host thread enters the second step, the waiting thread is activated and begins the first step;
the first step being: after memory data are obtained, scheduling the GPU to perform the seed search and seed extension tasks, and copying the seed-extension result sets obtained by the GPU, as intermediate data, from GPU memory to host memory;
the second step being: generating SAM data and writing it to an output file;
wherein the code-reconstructed BWA-MEM algorithm supports asynchronous execution of the code through a global work list: the GPU allocates a local work list for each sequence alignment analysis task, and concurrent alignment analysis tasks share a remote work list; during system operation, the GPU periodically reports the work items it has produced and consumed, and the process terminates once the total number of work items reaches zero; the GPU comprises three threads, namely a receiving thread, a sending thread, and a working thread, the first two being used for communication between GPUs and the last for processing local work items; each GPU receives remote work items from the preceding device and hands them to its receiving thread to complete the distribution of the work items; both the working thread and the receiving thread may submit GPU kernels to complete their work, and the kernels of the receiving thread are submitted to a separate stream and given a higher priority;
wherein the acceleration method achieves efficient asynchronous communication through pipelined data transfer: within a multi-GPU node, the CPU and GPUs are connected through a front-side bus, and the front-side bus is connected to a connector so as to support CPU-GPU and GPU-GPU communication; data transfer between GPU1 and GPU2, and between GPU3 and GPU4, can be direct, whereas data transfer between GPU2 and GPU3 goes through the CPU and is therefore divided into two stages: GPU2 first sends the data to main memory, and GPU3 then pulls it from main memory; the GPU contains an input memory copy engine, an output memory copy engine, and an execution engine, so that two memory copies and kernel execution can proceed concurrently; during execution of the sequence alignment algorithm, a plurality of receive buffers are provided at the receiving end of the GPU to implement a pipelined receiver; once one segment of data has been transferred, the GPU can immediately start transferring the next segment while processing the current one; the asynchronous program guarantees correctness through fine-grained synchronization points; and the asynchronous communication mechanism employed is either a future/promise mechanism or an event mechanism.
2. The CPU + GPU heterogeneous high-concurrency sequence alignment computation acceleration method according to claim 1, wherein simplifying the data structures of the BWA-MEM algorithm means reconstructing the data structures in the CUDA language and removing complex constructs from them, the complex constructs including multi-level pointers and structs.
3. The CPU + GPU heterogeneous high-concurrency sequence alignment computation acceleration method according to claim 1, wherein optimizing some of the loops and logic judgment statements of the BWA-MEM algorithm means equivalently unrolling those loops and restructuring the logic judgment statements.
4. The CPU + GPU heterogeneous high-concurrency sequence alignment computation acceleration method according to claim 1, wherein the plurality of concurrent tasks correspond to data partition blocks, and the data partition blocks that most need to be accessed are preferentially loaded into the cache by a data scheduler according to a set scheduling rule.
5. The CPU + GPU heterogeneous high-concurrency sequence alignment computation acceleration method according to claim 1, wherein the sequence data comprise single-end or paired-end DNA sequencing read data.
6. The CPU + GPU heterogeneous high-concurrency sequence alignment computation acceleration method, wherein forming a plurality of concurrent tasks for the second time includes Grid and Block dimension division and the division of sequences among threads; wherein
the Grid and Block dimension division is carried out on the GPU; according to the CUDA programming model, each Block in a Grid can be distributed to a streaming multiprocessor (SM) of the GPU for execution; when selecting dimensions, the Block dimension design is considered first, and the larger the Grid the better; the Block dimensions must be determined from the number of registers and the amount of shared memory used per SM in the actual computation and from the amount of resources available on each SM of the hardware;
the division of sequences among threads means that each sequence is assigned one thread for processing; for the processing of each data partition block, different analysis tasks have different workloads, which is handled through the thread configuration of the GPU kernel; computation on the GPU is completed by a grid consisting of a plurality of thread blocks, each thread block consisting of a plurality of threads; the more vertices an analysis task has to process, the more threads are configured for it.
7. The CPU + GPU heterogeneous high-concurrency sequence alignment computation acceleration method according to claim 1, wherein the BWA-MEM algorithm comprises an FM-index search algorithm, a backward search algorithm, the Smith-Waterman algorithm, and dynamic programming; SMEM searching and chain generation are completed by the mem_chain module, and the chain2aln module is responsible for extending both ends of each chain with the Smith-Waterman algorithm to find the optimal alignment position; the mem_chain module further comprises a mem_collection_intv module and a generation_chain module, the mem_collection_intv module finding the maximal exact matches in the read and the generation_chain module forming chains from MEMs whose distances satisfy the condition; and the BWA-MEM algorithm employs a multi-kernel parallel framework.
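As an illustration of the pipelined receiver and fine-grained synchronization points recited in claim 1, the following CUDA sketch shows one conventional way such a scheme can be written. It is an assumption-laden example, not the patented code: process_segment is an empty stand-in for the alignment kernels, the host buffer h_src is assumed to be pinned (cudaMallocHost) so the copies are truly asynchronous, and the number of receive buffers is fixed at two. Segment i+1 is copied in on one stream while segment i is still being processed on another, and CUDA events keep the copy engine and the execution engine from racing on a buffer.

#include <algorithm>
#include <cuda_runtime.h>

__global__ void process_segment(const char* seg, size_t n) {
    (void)seg; (void)n;                              // stand-in for the alignment kernels
}

void pipelined_receive(const char* h_src, size_t total, size_t seg_bytes) {
    const int NBUF = 2;                              // number of receive buffers
    char* d_buf[NBUF];
    cudaEvent_t ready[NBUF], done[NBUF];
    cudaStream_t copy_s, exec_s;
    for (int i = 0; i < NBUF; ++i) {
        cudaMalloc(&d_buf[i], seg_bytes);
        cudaEventCreate(&ready[i]);
        cudaEventCreate(&done[i]);
    }
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&exec_s);

    size_t nseg = (total + seg_bytes - 1) / seg_bytes;
    for (size_t i = 0; i < nseg; ++i) {
        int b = i % NBUF;                            // which receive buffer to reuse
        size_t off = i * seg_bytes;
        size_t len = std::min(seg_bytes, total - off);
        cudaStreamWaitEvent(copy_s, done[b], 0);     // don't overwrite a buffer still in use
        cudaMemcpyAsync(d_buf[b], h_src + off, len,  // copy engine brings in segment i
                        cudaMemcpyHostToDevice, copy_s);
        cudaEventRecord(ready[b], copy_s);           // fine-grained synchronization point
        cudaStreamWaitEvent(exec_s, ready[b], 0);    // execution engine waits only for its data
        process_segment<<<256, 256, 0, exec_s>>>(d_buf[b], len);
        cudaEventRecord(done[b], exec_s);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NBUF; ++i) {
        cudaFree(d_buf[i]);
        cudaEventDestroy(ready[i]);
        cudaEventDestroy(done[i]);
    }
    cudaStreamDestroy(copy_s);
    cudaStreamDestroy(exec_s);
}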
CN202210046617.1A 2022-01-17 2022-01-17 CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method Active CN114064551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210046617.1A CN114064551B (en) 2022-01-17 2022-01-17 CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210046617.1A CN114064551B (en) 2022-01-17 2022-01-17 CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method

Publications (2)

Publication Number Publication Date
CN114064551A (en) 2022-02-18
CN114064551B (en) 2022-05-17

Family

ID=80230994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210046617.1A Active CN114064551B (en) 2022-01-17 2022-01-17 CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method

Country Status (1)

Country Link
CN (1) CN114064551B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994567B (en) * 2022-12-28 2024-03-22 兰州交通大学 Asynchronous scheduling method for parallel computing tasks of deep neural network model
CN116092587B (en) * 2023-04-11 2023-08-18 山东大学 Biological sequence analysis system and method based on producer-consumer model
CN116450364B (en) * 2023-06-15 2023-08-22 药融云数字科技(成都)有限公司 Sequence comparison method, system, storage medium and terminal based on CPU parallel computation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663270B (en) * 2012-03-08 2015-06-17 华中科技大学 Method for processing alignment results of sequence alignment algorithm based on GPU
US10230390B2 (en) * 2014-08-29 2019-03-12 Bonnie Berger Leighton Compressively-accelerated read mapping framework for next-generation sequencing
CN104504303B (en) * 2014-09-29 2018-09-28 肇庆学院 Sequence alignment method based on CPU+GPU heterogeneous systems
WO2018000174A1 (en) * 2016-06-28 2018-01-04 深圳大学 Rapid and parallelstorage-oriented dna sequence matching method and system thereof
CN110942809B (en) * 2019-11-08 2022-06-10 浪潮电子信息产业股份有限公司 Sequence comparison Seed processing method, system, device and readable storage medium
WO2021092634A2 (en) * 2021-03-05 2021-05-14 Futurewei Technologies, Inc. Acceleration of gpus in cloud computing

Also Published As

Publication number Publication date
CN114064551A (en) 2022-02-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant