CN111415708B - Method and system for realizing large-scale database clustering by double buffer model - Google Patents

Method and system for realizing large-scale database clustering by double buffer model Download PDF

Info

Publication number
CN111415708B
CN111415708B (application CN202010213789.4A)
Authority
CN
China
Prior art keywords
buffer
sequence
matching
double
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010213789.4A
Other languages
Chinese (zh)
Other versions
CN111415708A
Inventor
刘卫国
徐晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010213789.4A priority Critical patent/CN111415708B/en
Publication of CN111415708A publication Critical patent/CN111415708A/en
Application granted granted Critical
Publication of CN111415708B publication Critical patent/CN111415708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for realizing large-scale database clustering with a double-buffer model. The gene sequence database is sorted by decreasing length. A matching dictionary is then built: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for optimization; once the computed matching value reaches a threshold, the sequence is judged redundant. Clustering of the biological gene sequences of a large-scale database and removal of redundant gene sequences are performed with exact matching operations on the gene sequences, while I/O on the large-scale data files is performed with double-buffered multithreaded parallel operation, so that data under these conditions can be processed quickly.

Description

Method and system for realizing large-scale database clustering by double buffer model
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method and a system for realizing large-scale database clustering by a double-buffer model.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Gene sequencing data is doubling at ever greater speed, so the most prominent core problem in biological big-data studies that analyze genes and health data is the sheer volume of data.
Genome processing is, at bottom, a matching operation over genome sequences. For these data, performing genome matching is not an especially difficult algorithmic problem. However, the data that now need to be processed are at the TB level or beyond; in other words, such data simply cannot be held in the memory of an ordinary machine. Moreover, for huge genetic data, many algorithms and related processing steps are also based on operations that remove redundant sequences.
Clustering algorithms for biological gene sequences operate by computing distances between gene sequences, using maximal exact match algorithms (Maximal Exact Matches, MEMs) for similar genes.
Biological gene sequence processing operations are largely motivated by the many redundant sequences that do not represent genetic diversity or the diversity of species and proteins. In other words, not all of this huge amount of data is useful for representing genes. When the exact-match similarity of two sequences reaches a certain threshold, the two sequences are considered to belong to the same class; one of them is the representative sequence and the other is a redundant sequence. In a biological sense, the longest sequence in a class is by default the most representative and serves as the representative sequence.
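As an illustrative sketch only (the similarity function below is a hypothetical stand-in, not the MEMs-based measure used by the invention), the representative/redundant rule described above can be expressed as a greedy pass over length-sorted sequences:

```python
def classify(sequences, similarity, threshold):
    """Greedy clustering by the rule above: after sorting by decreasing
    length, the first sequence of each class is its representative, and any
    later sequence whose similarity to a representative reaches the
    threshold is marked redundant."""
    ordered = sorted(sequences, key=len, reverse=True)
    representatives, redundant = [], []
    for seq in ordered:
        for rep in representatives:
            if similarity(rep, seq) >= threshold:
                redundant.append(seq)
                break
        else:
            representatives.append(seq)
    return representatives, redundant

def toy_similarity(a, b):
    """Placeholder similarity: fraction of matching positions over the
    shorter sequence (the invention computes similarity via exact matching)."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return sum(1 for i, c in enumerate(short) if long_[i] == c) / len(short)
```

For example, `classify(["ACGTACGT", "ACGT", "TTTT"], toy_similarity, 0.9)` keeps "ACGTACGT" and "TTTT" as representatives and marks "ACGT" as redundant.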
The technical problem to be solved by this application is as follows: when the data volume is too large, reading the data to be processed is time-consuming. A double-buffering method is therefore adopted so that, in the program that computes similarity by exact matching, computation and data reading proceed in parallel and the data-reading time is covered by computation.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for realizing large-scale database clustering with a double-buffer model, in which I/O operations on large-scale data files are realized with double-buffered multithreading.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
A method for realizing large-scale database clustering with a double-buffer model comprises the following steps:
performing length decreasing sorting on the gene sequence database;
establishing two buffers in memory, each buffer equal in size to one block, and preloading the first blocks of the gene sequence file into the two buffers;
building a matching dictionary: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for processing, realizing clustering of the biological gene sequences; a sequence whose similarity obtained in matching reaches the threshold is judged redundant.
In a further technical scheme, when two sequences are matched for similarity, the longer one is by default the representative sequence and the shorter one the redundant sequence; the first sequence after sorting is necessarily a representative sequence, and any later sequence whose similarity to it reaches the threshold is marked as redundant.
According to a further technical scheme, the whole gene sequence file is divided into a plurality of blocks;
the first blocks of the gene sequence file are preloaded into the two buffers; then, while computation proceeds on the data in one buffer, the next block of the file is loaded into the other buffer, with a corresponding synchronization strategy, so that when the computation time far exceeds the I/O time, the computation covers the I/O time.
Further, the buffer size and boundaries are set such that the actual size of each buffer's data is slightly larger than the configured size limit, by less than the length of one sequence.
In a further technical scheme, the I/O operations and the MEM-check computation are synchronized as follows: one thread is created to read the file into a buffer, a group of threads is created to perform the MEM-check computation, and synchronization is realized by setting the semaphores full and empty.
In a further technical scheme, K threads are created to compute N sequences, with N far greater than K; each thread computes the sequences whose index is congruent to its thread number modulo K, so the indices handled by each thread are fixed.
In a further technical scheme, among the multiple query sequences in the same buffer block, the longest computation time covers the next-longest computation time, reducing the waiting time at the whole-block level.
In a further technical scheme, the query source file is sorted in advance so that its data sequences are arranged in descending order of length. The data lengths assigned to the threads are then similar; since the computation time of the maximal exact matching algorithm is positively correlated with query length, the running times of the threads differ little, and the overall running time tends toward the average time rather than the worst-case time.
In a further technical scheme, the block size is dynamically scheduled: each block size is only a reference value rather than a fixed size. When the read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the number of threads; if not, reading continues until it is.
The invention also discloses a system for realizing large-scale database clustering with a double-buffer model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor realizing the following steps when executing the program:
performing length decreasing sorting on the gene sequence database;
building a matching dictionary: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for optimization; once the computed matching value reaches the threshold, the sequence is judged redundant.
The one or more of the above technical solutions have the following beneficial effects:
Clustering of the biological gene sequences of a large-scale database and removal of redundant gene sequences can be performed with exact matching operations on the gene sequences, while I/O on the large-scale data files is performed with double-buffered multithreaded parallel operation, so that data under these conditions can be processed rapidly.
In the processing of large-scale data, the double-buffer model masks the data I/O time through parallelism.
Pre-sorting and the modulo-remainder assignment of sequences can realize load balancing between the computing threads.
The block size is dynamically scheduled so that the number of sequences allocated to each computing thread is equal and the sequence lengths are similar; the tasks of the threads are then roughly the same, achieving load balancing. Load balancing in this application mainly means keeping the workload of each thread as similar as possible, so that parallelism is maximized, rather than one thread being very busy while the others are idle; in the latter case most threads wait for the busy thread to finish, parallelism is poor, utilization of computing resources is low, and more time is consumed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of external-memory direct data access (read-file) according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of reading the file into memory in advance (load-all) according to an embodiment of the present invention;
FIGS. 3 (a) -3 (b) are diagrams illustrating double buffer according to embodiments of the present invention;
FIG. 4 is a schematic diagram illustrating an inter-process synchronization process according to an embodiment of the present invention;
FIG. 5 is a diagram showing different thread runtimes in a block according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a problem of uneven load in an out-of-order condition in a block according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a load balancing process according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating the configuration of the number of threads per buffer block and the number of query sequences according to an embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating analysis of time results for load non-uniformity according to an embodiment of the present invention;
FIG. 10 is a comparative run-time diagram under three models of an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
This embodiment discloses the general flow of a large-scale database clustering algorithm realized with a double-buffer model and the MEMs maximal exact matching algorithm:
Length-decreasing sorting is performed on the gene sequence database. When two sequences are matched for similarity, the longer one is by default the representative sequence and the shorter one the redundant sequence. The first sequence after sorting is therefore necessarily a representative sequence, and any later sequence whose similarity to it reaches the threshold is marked as redundant.
The implementation of the algorithm first needs to construct a matching dictionary. Specifically, a sparse suffix array (Sparse Suffix Array, SSA) is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against this dictionary suffix array. During matching, binary search is used at each position of the query sequence, and an inverse suffix array (Inverse Suffix Array, ISA), a longest common prefix array (Longest Common Prefix, LCP) and suffix links are used for optimization; once the computed matching value reaches the threshold, the sequence is judged redundant.
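A minimal sketch of the dictionary idea, assuming only the sparse suffix array and plain binary search (the ISA, LCP array and suffix-link accelerations of the actual method are omitted, and the names are illustrative):

```python
def build_sparse_suffix_array(text, k=2):
    """Sparse suffix array: the indices of every k-th suffix of `text`,
    sorted lexicographically by suffix."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])

def longest_match_at(text, ssa, query):
    """Binary-search the sampled suffixes for the longest prefix of `query`
    occurring at a sampled position of `text` (the neighbors of the
    insertion point are the best candidates)."""
    lo, hi = 0, len(ssa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[ssa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    for cand in (lo - 1, lo):
        if 0 <= cand < len(ssa):
            suf = text[ssa[cand]:]
            m = 0
            while m < min(len(suf), len(query)) and suf[m] == query[m]:
                m += 1
            best = max(best, m)
    return best
```

For `text = "ACGTACGT"` with every second suffix sampled, the array is `[4, 0, 6, 2]`, and a query position starting with "ACG" yields a match of length 3.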
The basic operation of the gene sequence clustering algorithm is that one sequence is searched against a very large sequence database. The former sequence is located in the reference file (ref file); the latter database is the query file. During execution of the algorithm, both sequences of a matching computation are placed in memory. The ref file contains one or a few sequences and is smaller than the query file, so the usual approach is to read it directly into memory and fetch it from memory while processing the source data. For the query file, if memory is large enough it can be loaded at once, using the load-all method. When the query file is too large to fit in memory, other methods must be used (external-memory direct data access, or double buffering).
Referring to fig. 1, external-memory direct data access (read-file): when the required source file cannot be fully loaded into memory, the source data can be read directly from the source file without first bringing it into memory. Each sequence is loaded from the file as needed, so each sequence incurs one I/O operation, and the MEM-check computation can continue even when the source data file is very large. But this also presents a significant problem: a file-read operation is performed every time a piece of data is fetched, so frequent I/O calls generate great expense.
Referring to fig. 2, reading into memory in advance (load-all): in the initial stage of the experiments, when the source query data file is not particularly large and fits in memory, the source data file can be read into memory at once (load-all); data extraction during computation then happens directly in memory, avoiding high-frequency I/O. However, this incurs the overhead of reading the file into memory once, which is not a significant problem compared with frequent I/O. The biggest bottleneck of reading the source data file into memory in advance is that it is limited to files smaller than memory; otherwise the entire file cannot be read in. So although this approach is more efficient, its bottleneck is more pronounced, and it is not suitable for all processing situations.
Origin of the double-buffer concept: based on the above two cases, a double-buffer mode is proposed, which is a compromise on how much data is brought into memory and is more efficient to implement.
Referring to figs. 3(a)-3(b), the essence of double buffering is this: the entire file is divided into multiple blocks, and two buffers, each equal in size to a block, are built in memory. The first blocks of the file are preloaded into the two buffers; then, while the data in one buffer are computed, the next block of the file is loaded into the other buffer, with a corresponding synchronization strategy, so that when the computation time far exceeds the I/O time, computation covers the I/O time.
It should be noted that a file contains many pieces of gene sequence data; the count can reach millions or even hundreds of millions.
Specifically, the two buffers hold only two blocks (the first two), each block containing multiple gene sequences.
When the file is preloaded into these two buffers, it must first be clear that all the data required for a computation must be in memory. The first block is loaded into a buffer (in memory) and the data in this buffer (the first block's data) are computed; during that computation the second block is loaded into the second buffer; a thread-synchronization mechanism then waits for the computation on the first buffer to finish before computing the data in the second buffer, while the third block is loaded into the first buffer, and so on alternately. Except for the first block, every read into memory runs in parallel with the computation on the previous block, so the corresponding file-read time is masked by computation.
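The time saving this alternation is meant to achieve can be illustrated with a simple cost model (an illustration of the principle, not a measurement): with double buffering, only the first block's read is exposed, and every later read overlaps the previous block's computation.

```python
def serial_time(io, compute):
    """Cost if every block is first read and then computed, serially."""
    return sum(io) + sum(compute)

def double_buffer_time(io, compute):
    """Cost with two buffers: block 0 is read up front; thereafter reading
    block i overlaps computing block i-1, so each step costs the larger of
    the two; the last block's computation is then exposed at the end."""
    t = io[0]
    for i in range(1, len(io)):
        t += max(io[i], compute[i - 1])
    return t + compute[-1]
```

When computation dominates (every compute time at least as large as the next read), `double_buffer_time` reduces to `io[0] + sum(compute)`: all reads except the first are fully covered.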
Buffer block size and boundary handling: the size of the buffer block has a significant impact on performance. When the block is set large, equal to the file size, the double-buffer model degenerates to the load-all model; when the block is set small, close to the size of a single sequence, it degenerates to the external-memory direct-access model. Different block sizes change the number of blocks and therefore the overall overhead, since the block is the basic unit of operation.
The size of each buffer block is fixed, and the corresponding boundary problem must be handled when reading into the buffer. For example, with the block size set to 20 MB, when the last sequence is being read its first half may already be in the buffer while adding the second half would exceed 20 MB; in that case the whole sequence is still read into the buffer, so that it can be operated on effectively in later computation. The actual size of each buffer's data is therefore slightly larger than the configured limit (20 MB in the example), by less than the length of one sequence.
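A sketch of this boundary rule, assuming the file has already been split into individual sequences (function and variable names are illustrative): reading stops only after the sequence that crosses the limit has been taken whole.

```python
def read_block(sequences, start, block_size):
    """Fill one block: keep appending whole sequences until the configured
    size is reached; the sequence that crosses the limit is still included,
    so a block exceeds block_size by less than one sequence length."""
    block, total, i = [], 0, start
    while i < len(sequences) and total < block_size:
        block.append(sequences[i])
        total += len(sequences[i])
        i += 1
    return block, i  # i is where the next block starts
```

With a 6-character limit and sequences of lengths 4, 4 and 2, the first block takes the first two sequences (8 characters, exceeding the limit by less than one sequence).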
Referring to fig. 4, the synchronization process: besides handling boundary problems, another important issue is synchronization between the file-reading operations and the computing operations. For computation time to cover I/O time well, the synchronization between computing on one buffer block and performing I/O on the other must be handled properly, and parallel execution of the two must be guaranteed. If they run serially there is no performance gain; the computation just waits for the I/O, and the various additional overheads make performance worse instead.
In terms of the synchronization between the simple I/O operations and the MEM-check computation, this is a fairly typical producer-consumer synchronization model. One thread is created to read the file into a buffer, a group of threads is created to perform the MEM-check computation, and synchronization is realized by setting the semaphores full and empty.
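A minimal sketch of this producer-consumer scheme with full/empty semaphores (a single consumer is shown for brevity; the invention uses a group of MEM-check threads, and `compute` here is a hypothetical stand-in for the MEM-check work):

```python
import threading

def run_pipeline(blocks, compute, num_buffers=2):
    """One reader thread fills the buffers, guarded by counting semaphores:
    `empty` counts free buffer slots, `full` counts loaded, unprocessed
    ones. `compute` is applied to each block; results come back in order."""
    buffers = [None] * num_buffers
    empty = threading.Semaphore(num_buffers)  # free buffer slots
    full = threading.Semaphore(0)             # loaded, unprocessed slots
    results = []

    def reader():
        for i, blk in enumerate(blocks):
            empty.acquire()                   # wait for a free buffer
            buffers[i % num_buffers] = blk    # "I/O": load block into buffer
            full.release()                    # signal a block is ready

    t = threading.Thread(target=reader)
    t.start()
    for i in range(len(blocks)):
        full.acquire()                        # wait for a loaded buffer
        results.append(compute(buffers[i % num_buffers]))
        empty.release()                       # hand the buffer back
    t.join()
    return results
```

Because `empty` starts at 2, the reader can stay one block ahead of the computation, which is exactly the overlap the double-buffer model relies on.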
FIG. 5 shows the running times of different threads within a block under the multithread model. K threads are created for N sequences (indices 0 to N-1), with N far greater than K; each thread computes the sequences whose index is congruent to its thread number modulo K, so the indices handled by each thread are fixed. The basic unit of operation is the block: over the N pieces of data in a block, some threads finish early while others are assigned relatively long sequences, and the other K-1 threads all wait for the slowest thread to finish (as shown in fig. 6) before the operation on the whole block ends, the semaphore changes, and the next block is processed. The overall running time therefore depends not on the average time but on the slowest thread, a typical load-imbalance problem. FIG. 6 shows that computing in block units after partitioning amplifies this imbalance.
Referring to fig. 7, data preprocessing is proposed against load imbalance: the query source file is sorted in advance so that its data sequences are arranged in descending order of length. The data lengths assigned to the threads are then similar; since the computation time of the maximal exact matching algorithm is positively correlated with query length, the running times of the threads differ little, and the overall running time tends toward the average time rather than the worst-case time. (Simply put, among the multiple query sequences in the same buffer block, the longest computation covers the next longest, reducing the waiting time at the whole-block level.)
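The combined effect of pre-sorting and modulo assignment can be sketched as follows (an illustration with hypothetical names): thread t takes the sequences at indices congruent to t, so after descending sorting the per-thread total lengths stay close.

```python
def assign_round_robin(sequences, k):
    """Sort by decreasing length, then give thread t every k-th sequence
    starting at index t (index mod k == t), balancing total work."""
    ordered = sorted(sequences, key=len, reverse=True)
    return [ordered[t::k] for t in range(k)]
```

With sequence lengths 8, 7, 6, 5, 4 and 3 and k = 2, the two threads receive total lengths 18 and 15, rather than 21 and 12 as a contiguous split would give.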
Referring to fig. 8, setting the block by thread count and query-sequence count: besides preprocessing the data to reduce load imbalance, another source of imbalance remains: the sequences in a block may not divide evenly among the threads. This problem is particularly acute when a single sequence is relatively long. For example, suppose an initial block size of 500 MB and a group of 24 threads created to compute on one block; if the first sequence itself is 500 MB long, only thread 0 performs computation while all other threads wait. This is close to serial operation and severely impacts performance. For this case the block size is designed to be dynamically scheduled: each block size is only a reference value rather than a fixed size. When the read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the thread count; if not, reading continues until it is.
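The dynamic scheduling rule can be sketched like this (illustrative only; real file I/O is replaced by a list of pre-split sequences): the reference size acts as a floor, and reading continues until the sequence count divides evenly among the threads.

```python
def read_dynamic_block(sequences, start, block_size, num_threads):
    """Fill one block using block_size as a reference value only: after the
    size is reached, keep reading until the number of sequences in the
    block is an integer multiple of num_threads, so every thread in the
    group receives the same number of sequences."""
    block, total, i = [], 0, start
    while i < len(sequences):
        block.append(sequences[i])
        total += len(sequences[i])
        i += 1
        if total >= block_size and len(block) % num_threads == 0:
            break
    return block, i
```

With 4-character sequences, a reference size of 14 and 3 threads, the size is first exceeded at 4 sequences, but reading continues to 6 so that each thread gets exactly 2.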
Referring to fig. 9, time results for load imbalance: experiments measuring the running time, before and after sorting, of files with the same number of sequences per thread under different block sizes show that load imbalance greatly affects program performance. The results show that, within the limits of thread-creation overhead and the corresponding hardware resources, the running time after sorting the source data barely fluctuates as the block size changes, while the running times of unsorted source files differ greatly.
Referring to fig. 10, experimental results and performance analysis: the double-buffer implementation was tested on different data. At the I/O level the improvement is obvious, and the double-buffer model clearly outperforms direct external-memory access (see fig. 5: for the same source data file, the horizontal axis shows different file numbers and the vertical axis the running time). Moreover, when a large number of threads is created, reading sequences directly from the same external-memory file causes processing time to increase rapidly, while the double-buffer model has no such problem.
In the final experimental results, the inherent overheads of double buffering (creating many threads and recycling buffer space) show up when the file is small, but these overheads are much smaller than the bottleneck that the pre-read model's file must fit in memory. In a sense, pre-reading into memory can be regarded as a special double-buffer model whose block size is the entire file. Even with memory large enough to hold any file, reading the whole file at once is serial with the computation, and that one-time read costs much more than the double-buffer model's only serial part, the read of the first block (the reads of all other blocks are covered by computation). Hence, in terms of I/O overhead the double-buffer model is superior to direct pre-reading into memory. And as the last data point shows, even when the file fits in memory, once the file size approaches the memory size the various computations also need memory space; when the data set occupies most of the memory, performance drops sharply and the running time even exceeds that of the external-memory direct-access model.
Example two
It is an object of this embodiment to provide a system for realizing large-scale database clustering with a double-buffer model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor realizing the following steps when executing the program:
performing length decreasing sorting on the gene sequence database;
building a matching dictionary: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for optimization; once the computed matching value reaches the threshold, the sequence is judged redundant.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. A method for realizing large-scale database clustering with a double buffer model, characterized by comprising the following steps:
performing length-decreasing sorting on the gene sequence database;
establishing two buffer areas, and preloading the whole length-sorted gene sequence file into the two buffer areas;
building a matching dictionary: constructing a sparse suffix array from one gene sequence in the buffer as the dictionary, matching the other gene sequences against the dictionary suffix array, performing binary search at each matching position of the query sequence, and using the inverse suffix array, the longest common prefix array, and suffix links for processing, so as to realize clustering of the biological gene sequences.
2. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, when two sequences are matched for similarity, the longer one is taken by default as the representative sequence and the shorter one as the redundant sequence; after sorting, the first sequence is necessarily a representative sequence, and any subsequent sequence whose similarity to it reaches the threshold is marked as redundant.
3. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the whole gene sequence file is divided into a plurality of blocks, and two buffer areas are then established in memory, the size of each buffer area being equal to the block size;
the whole gene sequence file is preloaded into the two buffer areas; thereafter, while computation proceeds on the data in one buffer, the file is loaded into the other buffer under a corresponding synchronization strategy, so that, when the computation time far exceeds the I/O time, the I/O time is covered by the computation time.
4. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the size and boundary of each buffer are set such that the actual amount of data in a buffer may slightly exceed the set size limit, by less than the length of one sequence.
5. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, when the I/O operation and the MEM-check computing operation are synchronized, one thread is created to read the file into a buffer and a group of threads is created to perform the MEM-check computation, synchronization being realized by setting the synchronization semaphores full and empty.
6. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, for N sequences, K threads are created to compute in parallel, N being far greater than K; each thread processes the sequences whose index is congruent to its thread number modulo K, so that the set of indices handled by each thread is fixed.
7. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, among the multiple query sequences in the same buffer block, the longest computation time masks the shorter computation times, so that the waiting time is reduced at the overall level.
8. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the query source file is sorted in advance so that its sequences are arranged in decreasing length; the amounts of data assigned to the threads are then similar, and since the running time of the maximal exact match algorithm is positively correlated with query length, the running times of the threads differ little and the overall running time tends toward the average rather than the worst-case time.
9. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the block size is scheduled dynamically: the size of each block is only a reference value rather than a fixed size, and when a read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the number of threads; if not, reading continues until it is.
10. A large-scale database clustering system based on the double buffer model, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the following steps:
performing length-decreasing sorting on the gene sequence database;
establishing two buffer areas, and preloading the whole length-sorted gene sequence file into the two buffer areas;
building a matching dictionary: constructing a sparse suffix array from one gene sequence in the buffer as the dictionary, matching the other gene sequences against the dictionary suffix array, performing binary search at each matching position of the query sequence, and using the inverse suffix array, the longest common prefix array, and suffix links for processing, so as to realize clustering of the biological gene sequences.
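The synchronization scheme of claims 5 and 6 — one reader thread, a group of compute threads, full/empty semaphores, and a fixed index-modulo-K work split — can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `cluster_pipeline` and `process` are hypothetical, and a trivial per-sequence function stands in for the MEM-check computation.

```python
import threading

def cluster_pipeline(blocks, n_workers, process):
    """Double-buffer pipeline: one reader thread, K compute threads per block."""
    empty = threading.Semaphore(2)       # two buffer slots, both empty at start
    full = threading.Semaphore(0)        # no filled buffer yet
    queue, lock, results = [], threading.Lock(), []

    def reader():                        # stands in for the file-reading thread
        for block in blocks:
            empty.acquire()              # wait for a free buffer slot
            with lock:
                queue.append(block)
            full.release()               # signal one filled buffer

    def compute():
        for _ in blocks:
            full.acquire()               # wait for a filled buffer
            with lock:
                block = queue.pop(0)
            out = [None] * len(block)

            def work(tid):               # thread tid takes indices i with i % K == tid
                for i in range(tid, len(block), n_workers):
                    out[i] = process(block[i])

            workers = [threading.Thread(target=work, args=(t,))
                       for t in range(n_workers)]
            for w in workers:
                w.start()
            for w in workers:
                w.join()
            results.extend(out)
            empty.release()              # recycle the buffer slot

    r, c = threading.Thread(target=reader), threading.Thread(target=compute)
    r.start(); c.start()
    r.join(); c.join()
    return results
```

Because each worker's index set is fixed by the modulo rule, per-block results come back in input order without any reordering step.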
CN202010213789.4A 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model Active CN111415708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010213789.4A CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010213789.4A CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Publications (2)

Publication Number Publication Date
CN111415708A CN111415708A (en) 2020-07-14
CN111415708B true CN111415708B (en) 2023-05-05

Family

ID=71493217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010213789.4A Active CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Country Status (1)

Country Link
CN (1) CN111415708B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317114A (en) * 1998-07-10 2001-10-10 快速检索及传递公司 Search system and method for retrieval of data, and use thereof in search engine
CN103686077A (en) * 2013-11-29 2014-03-26 成都亿盟恒信科技有限公司 Double buffering method applied to realtime audio-video data transmission of 3G wireless network
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
WO2018054496A1 (en) * 2016-09-23 2018-03-29 Huawei Technologies Co., Ltd. Binary image differential patching


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Patrick Flick et al. Parallel distributed memory construction of suffix and longest common prefix arrays. SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, full text. *

Also Published As

Publication number Publication date
CN111415708A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
US10521239B2 (en) Microprocessor accelerated code optimizer
EP2783280B1 (en) An accelerated code optimizer for a multiengine microprocessor
JP4292198B2 (en) Method for grouping execution threads
EP2783282B1 (en) A microprocessor accelerated code optimizer and dependency reordering method
US6216220B1 (en) Multithreaded data processing method with long latency subinstructions
Liu et al. CUDA-BLASTP: accelerating BLASTP on CUDA-enabled graphics hardware
EP2656229B1 (en) Mechanism for conflict detection using simd
US9268595B2 (en) Scheduling thread execution based on thread affinity
US11308171B2 (en) Apparatus and method for searching linked lists
EP2866138A1 (en) Processor core with multiple heterogenous pipelines for emulated shared memory architectures
US7617494B2 (en) Process for running programs with selectable instruction length processors and corresponding processor system
CN111415708B (en) Method and system for realizing large-scale database clustering by double buffer model
CN111045800A (en) Method and system for optimizing GPU (graphics processing Unit) performance based on short job priority
CN116092587B (en) Biological sequence analysis system and method based on producer-consumer model
Chen et al. An exact matching approach for high throughput sequencing based on bwt and gpus
Feng et al. Accelerating Smith-Waterman alignment of species-based protein sequences on GPU
CN109584967B (en) Parallel acceleration method for protein identification
KR102210765B1 (en) A method and apparatus for long latency hiding based warp scheduling
US7093260B1 (en) Method, system, and program for saving a state of a task and executing the task by a processor in a multiprocessor system
Jiang et al. Fine-grained acceleration of hmmer 3.0 via architecture-aware optimization on massively parallel processors
KR100861701B1 (en) Register renaming system and method based on value similarity
US20040128476A1 (en) Scheme to simplify instruction buffer logic supporting multiple strands
CN116821008B (en) Processing device with improved cache hit rate and cache device thereof
US20230289185A1 (en) Data processing apparatus, method and virtual machine
US20240111526A1 (en) Methods and apparatus for providing mask register optimization for vector operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant