CN111415708A - Method and system for realizing large-scale database clustering by double-buffer model - Google Patents

Method and system for realizing large-scale database clustering by double-buffer model

Info

Publication number
CN111415708A
CN111415708A (application CN202010213789.4A; granted publication CN111415708B)
Authority
CN
China
Prior art keywords
buffer
sequence
double
matching
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010213789.4A
Other languages
Chinese (zh)
Other versions
CN111415708B (en
Inventor
刘卫国
徐晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010213789.4A priority Critical patent/CN111415708B/en
Publication of CN111415708A publication Critical patent/CN111415708A/en
Application granted granted Critical
Publication of CN111415708B publication Critical patent/CN111415708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30: Unsupervised data analysis
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. ICT AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for realizing large-scale database clustering with a double-buffer model. The gene sequence database is sorted in descending order of sequence length. A matching dictionary is constructed: a sparse suffix array is built for one gene sequence and serves as the dictionary, and the other gene sequences are matched against this dictionary suffix array. During matching, a binary search is performed at a given position of the query sequence, and the inverse suffix array, the longest common prefix (LCP) array, and suffix links are used for optimization; once the computed match value reaches a threshold, the sequence is judged to be redundant. Both the clustering operation and the removal of redundant gene sequences in a large-scale database rest on exact matching of the gene sequences, and double buffering with multi-threaded parallel operation handles the I/O of large-scale data files under these conditions.

Description

Method and system for realizing large-scale database clustering by double-buffer model
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method and a system for realizing large-scale database clustering by using a double-buffer model.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Gene sequencing data is growing at an ever-increasing rate, so the most prominent core problem in big biological data research analyzing genes and health data is the enormous amount of data.
Genome processing is based on matching operations over genome sequences. For these data, genome matching is not algorithmically difficult. However, the data that need to be processed now reach the TB scale or larger; in other words, ordinary machine memory simply cannot hold that much data. Moreover, for large genetic data, many algorithms and associated pipelines are based on removing redundant sequences.
Clustering algorithm operations on biological gene sequences are based on the maximum exact match algorithm (MEMs) of similar genes to calculate the distance between gene sequences.
Biological gene sequence processing must deal with many redundant sequences that do not represent genetic diversity, or species and protein diversity. In other words, not all of this massive data is useful for representing genes. When the exact-match similarity of two sequences reaches a certain threshold, the two sequences are considered to belong to the same category, where one is a representative sequence and the other is a redundant sequence. In a biological sense, the longest sequence in a class is by default the most representative, and is taken as the representative sequence.
The technical problem this application solves is: when the data volume is very large, reading the data to be processed is time-consuming; the question is how to use a double-buffering method so that, in a program that computes similarity by exact matching, computation and data reading run in parallel and the data-reading time is covered by computation.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for realizing large-scale database clustering by using a double-buffer model, which is realized by using double-buffer multithreading aiming at the I/O operation of large-scale data files.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the method for realizing large-scale database clustering by using the double-buffer model comprises the following steps:
sorting the gene sequence database in descending order;
establishing two buffers in memory, each buffer equal in size to a block, and preloading the gene sequence file block by block into the two buffers;
constructing a matching dictionary: building a sparse suffix array for one gene sequence to serve as the dictionary, matching the other gene sequences against this dictionary suffix array, performing a binary search at a given position of the query sequence during matching, and using the inverse suffix array, the longest common prefix (LCP) array, and suffix links for acceleration, thereby realizing clustering of the biological gene sequences; when the similarity obtained during matching reaches a threshold, the sequence is judged to be redundant.
According to a further technical scheme, when similarity matching is performed between two sequences, the longer one is by default the representative sequence and the shorter one the redundant sequence; after sorting, the first sequence is necessarily a representative sequence, and any later sequence whose similarity to it reaches the threshold is marked as its redundant sequence.
In a further technical scheme, the whole gene sequence file is divided into a plurality of blocks;
the gene sequence file is preloaded into the two buffers; computation then proceeds on the data in one buffer while the file is loaded into the other, with a corresponding synchronization strategy; when the computation time is much longer than the I/O time, the computation covers the I/O time.
In a further technical scheme, the buffer size and boundary are set so that the actual amount of data in each buffer may exceed the set size limit, by at most the length of one sequence.
According to a further technical scheme, the I/O operation and the MEM-check computation are synchronized as follows: one thread is created to read the file into a buffer, then a group of threads is created to perform the MEM-check computation, and synchronization is concretely realized by setting the semaphores full and empty.
According to a further technical scheme, K threads are created to compute N sequences respectively, where N is much larger than K; each thread computes the sequences whose index leaves that thread's remainder under modulo-K arithmetic, so the set of index numbers handled by each thread is fixed.
According to a further technical scheme, among the multiple query sequences in the same buffer block, the longest computation time covers the next-longest computation times, reducing the overall waiting time.
According to a further technical scheme, the query source file is sorted in advance so that the data sequences in it are arranged in descending order of length. The data lengths handled by the threads are then similar; since the computation time of the maximal exact match algorithm is positively correlated with query length, the running times of the threads do not differ much, and the overall running time tends toward the average time rather than the worst-case time.
The block sizes are dynamically scheduled: each block is given a reference value instead of a fixed size. When the read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the number of threads; if not, reading continues until it is.
The invention also discloses a system for realizing large-scale database clustering by using the double-buffer model, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the following steps:
sorting the gene sequence database in descending order;
constructing a matching dictionary: building a sparse suffix array for one gene sequence to serve as the dictionary, matching the other gene sequences against this dictionary suffix array, performing a binary search at a given position of the query sequence during matching, and using the inverse suffix array, the longest common prefix (LCP) array, and suffix links for optimization; after the computed match value reaches a threshold, the sequence is judged to be redundant.
The above one or more technical solutions have the following beneficial effects:
Both the clustering operation and the removal of redundant gene sequences in a large-scale database rest on exact matching of the gene sequences, and double buffering with multi-threaded parallel operation realizes rapid data processing of the I/O of large-scale data files under these conditions.
The double-buffer model aims, during the processing of large-scale data, to cover the data I/O time with computation time in a parallel manner.
The data is pre-sorted, and the modulo remainder assignment balances the load among the computing threads.
The block size is dynamically scheduled so that the number of sequences distributed to each computing thread is equal and the sequence lengths are similar; the tasks of the threads are then approximately the same, achieving load balance. Load balancing in this application mainly means: making the workload of each thread as similar as possible, so that parallelism is maximized, rather than having one busy thread while the others idle, in which case most threads wait for the busy thread to finish, parallelism is poor, compute-resource utilization is low, and more time is consumed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, not to limit it.
FIG. 1 is a diagram illustrating direct external access to data (read-file) according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the read-ahead (load-all) operation of reading everything into memory according to an embodiment of the present invention;
FIGS. 3(a) -3 (b) are schematic diagrams of a double-buffer according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating synchronization among processes according to an embodiment of the present invention;
FIG. 5 is a diagram showing the run times of different threads in a block of an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating the problem of uneven loading in an out-of-order state in a block according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of load balancing processing according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating the configuration of the number of threads per buffer block and the number of query sequences according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of time result analysis for load imbalance according to an embodiment of the present invention;
FIG. 10 is a schematic diagram showing the comparison of the running times of three models according to the embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise; and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
This embodiment discloses the general flow of a double-buffer model for realizing a large-scale database clustering algorithm together with the maximal exact match (MEMs) algorithm:
The gene sequence database is sorted in descending order. When similarity matching is performed between two sequences, the longer one is by default the representative sequence and the shorter one the redundant sequence. So the first sequence after sorting must be a representative sequence, and any sequence below it whose similarity to it reaches the threshold is marked as its redundant sequence.
The implementation of the algorithm first requires constructing a matching dictionary; the concrete realization is a Sparse Suffix Array (SSA). A sparse suffix array is built for one gene sequence and serves as the dictionary, and the other gene sequences are matched against this dictionary suffix array. Binary search is used at a given position of the query sequence during matching, and structures such as the Inverse Suffix Array (ISA), the Longest Common Prefix (LCP) array, and suffix links are used for optimization; after the computed match value reaches a threshold, the sequence is determined to be redundant.
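The dictionary lookup described above can be sketched as follows. This is a minimal illustration assuming a plain (non-sparse) suffix array built naively; the sparse variant and the ISA/LCP/suffix-link optimizations are omitted, and all function names are illustrative rather than taken from the patent.

```python
# Minimal sketch: build a suffix array for the reference sequence and
# binary-search it for the longest match starting at a query position.

def build_suffix_array(ref: str) -> list[int]:
    """Naive construction: suffix start positions sorted lexicographically."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def longest_match_at(ref: str, sa: list[int], query: str, pos: int) -> int:
    """Length of the longest prefix of query[pos:] occurring anywhere in ref."""
    q = query[pos:]
    lo, hi = 0, len(sa)
    while lo < hi:                      # binary search for q among the suffixes
        mid = (lo + hi) // 2
        if ref[sa[mid]:] < q:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    for cand in (lo - 1, lo):           # longest common prefix is at a neighbor
        if 0 <= cand < len(sa):
            s = ref[sa[cand]:]
            k = 0
            while k < len(s) and k < len(q) and s[k] == q[k]:
                k += 1
            best = max(best, k)
    return best
```

With ref = "ACGTACGT", querying "TACGA" at position 0 yields a match of length 4 ("TACG"); in the patent's scheme, accumulated match values reaching the similarity threshold would mark the query as redundant.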
The basic operation of a gene sequence clustering algorithm is to search a very large sequence database for a sequence. The former sequence resides in a reference file (ref file); the latter database is the query file. Both sequences used in a matching computation are placed in memory during execution. The ref file generally contains one or a few sequences and is small compared with the query file, so during source-data processing it is usually read directly into memory and accessed there. For the query file, if memory is large enough, it too can be loaded at once (the load-all method). But when the query file is too large to fit in memory, other methods (externally accessed direct data, or double buffering) must be considered.
Referring to fig. 1, externally accessed direct data (read-file): when the required source file cannot be fully loaded into memory, source data can be read directly from the source file without first staging it in memory. Each sequence then incurs one I/O operation as it is loaded from the file, and the MEM-check computation can continue even when the source data file is large. But this also presents a significant problem: a file-read operation is performed every time a piece of data is fetched, so frequent I/O calls produce large overhead.
Referring to FIG. 2, the read-ahead (load-all) operation: in the initial stage of the experiments, when the source query data file is not especially large and fits in memory, the source data file can be read into memory at once (load-all); data extraction during computation then happens directly in memory, avoiding high-frequency I/O. This, however, incurs the overhead of reading the file into memory in one pass, though compared with frequent I/O operations that overhead is not large. The biggest bottleneck of reading the source data file into memory in advance is that it is limited to files smaller than memory; otherwise the whole file cannot be read in. So although this method is efficient, the bottleneck is significant, and it does not suit all processing situations.
Generation of the double-buffer concept: based on the two situations above, a double-buffer mode is proposed, a compromise in the amount of data called into memory and more efficient to realize.
Referring to figs. 3(a)-3(b), the essence of double buffering is: the whole file is divided into several blocks, and two buffers, each equal in size to a block, are built in memory. The first blocks of the file are preloaded into the two buffers; computation then proceeds on the data in one buffer while the file is loaded into the other, with a corresponding synchronization strategy. When the computation time is much longer than the I/O time, the computation covers the I/O time.
It should be noted that a file contains many pieces of gene sequence data; the count can reach millions or even hundreds of millions.
Specifically, the two buffers are loaded with only two blocks at a time (initially the first two), each block containing a plurality of gene sequences.
When preloading the file into both buffers, one point must first be made explicit: the data required for a computation must be in memory. So the first block is loaded into one buffer (in memory), and the data of that first block are computed; the second block is loaded into the second buffer during that computation. A thread synchronization mechanism then waits for the computation on the first buffer to finish, after which the data in the second buffer are computed while the third block is loaded into the first buffer, and so on, alternating. Thus, from the second block onward, the memory-read operation runs in parallel with the computation on the previous block, and the file-read time is covered by computation.
Buffer block size and boundary processing: the size of the buffer block has a significant impact on performance. When the buffer block is set large, approaching the size of the whole file, the double-buffer model degenerates into the load-all model; when the buffer block is set small, close to the size of a single sequence, the double-buffer model approaches the externally accessed direct-data model. Different block sizes yield different block counts and therefore different overall overhead, since the block is the basic unit of operation.
The boundary size of each buffer block is fixed, and the boundary must be handled when filling the buffer. For example, with the block size set to 20 MB, the first half of the last sequence may already have been read into the buffer while adding its second half would exceed 20 MB. In that case the read operation takes the whole sequence into the buffer anyway, so that the subsequent computation remains valid. In fact, the actual size of the data in each buffer is a little larger than the set size limit (20 MB in the example), by at most the length of one sequence.
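The boundary handling just described, always finishing the current sequence even when it crosses the size limit, can be sketched as follows. This is a simplified in-memory illustration under assumed names; real code would read from a file stream.

```python
def fill_block(sequences: list[str], start: int, block_size: int):
    """Collect whole sequences into one block: stop only after the block-size
    threshold is crossed, so a sequence is never split across two buffers.
    The block may therefore exceed block_size by at most one sequence length."""
    block, total, i = [], 0, start
    while i < len(sequences) and total < block_size:
        block.append(sequences[i])   # always append the whole sequence
        total += len(sequences[i])
        i += 1
    return block, i                  # i is where the next block starts
```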
Referring to fig. 4, synchronization between processes: besides the boundary problem, another important issue is the synchronization between the file-reading operation and the computation. To achieve good coverage of I/O time by computation time, the synchronization between the computation on one buffer block and the I/O on the other buffer block must be handled well, and parallel execution of the two must be guaranteed. If they run serially, not only is there no performance gain, but the computation waits for the I/O to finish, and the various additional overheads further reduce performance.
In terms of synchronization between the simple I/O operation and the MEM-check computation, this is a fairly typical producer-consumer synchronization model. One thread is created to read the file into a buffer, then a group of threads is created to perform the MEM-check computation; synchronization is concretely realized by setting the semaphores full and empty.
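A minimal runnable sketch of this producer-consumer scheme, with Python threading standing in for the patent's implementation, list slices standing in for file blocks, and the semaphores named full and empty as in the text; a single consumer replaces the thread group for brevity:

```python
import threading

def double_buffer_run(blocks, compute):
    """One reader thread fills two buffers alternately; the caller computes
    on the other buffer. 'empty' and 'full' semaphores implement the
    producer-consumer synchronization."""
    buffers = [None, None]
    empty = threading.Semaphore(2)    # both buffers start empty
    full = threading.Semaphore(0)

    def reader():
        for i, blk in enumerate(blocks):
            empty.acquire()           # wait for a free buffer
            buffers[i % 2] = blk      # simulate the file -> buffer I/O
            full.release()            # signal that data is ready

    t = threading.Thread(target=reader)
    t.start()
    results = []
    for i in range(len(blocks)):
        full.acquire()                # wait for the next loaded buffer
        results.append(compute(buffers[i % 2]))
        empty.release()               # hand the buffer back to the reader
    t.join()
    return results
```

While block i is being computed, block i+1 is already being loaded into the other buffer, so the read time of every block after the first is overlapped with computation.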
Referring to fig. 5, in the multi-threaded model, for N sequences (index numbers 0 to N-1), K threads are created to compute them (N much larger than K); each thread computes the sequences whose index leaves that thread's remainder under modulo-K arithmetic, so the set of indices assigned to each thread is fixed. The basic unit of operation is a block. When computing the N pieces of data in a block, some threads may finish early while others happen to receive sequences that are all long, so the other K-1 threads end up waiting for the slowest thread (as shown in fig. 6) before the whole block is finished and the semaphore is changed to proceed to the next block. Hence the overall run time depends not on the average run time but on the slowest thread, a typical load-imbalance problem. Fig. 6 shows that computing in units of blocks after the blocking operation amplifies this load unevenness.
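The fixed modulo assignment of sequence indices to threads can be illustrated with a hypothetical helper (not the patent's code):

```python
def assign_indices(n_sequences: int, n_threads: int, thread_id: int) -> list[int]:
    """Cyclic assignment: thread k computes every sequence whose index
    modulo the thread count equals k, so each thread's index set is fixed."""
    return [i for i in range(n_sequences) if i % n_threads == thread_id]
```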
Referring to fig. 7, data preprocessing is proposed for load imbalance: the query source file is sorted in advance so that the data sequences in it are arranged in descending order of length. The data lengths handled by the threads are then similar; since the computation time of the maximal exact match algorithm is positively correlated with query length, the running times of the threads do not differ much, and the overall run time tends toward the average time rather than the worst-case time. (Simply put, among multiple query sequences in the same buffer block, the longest computation time covers the next-longest ones, reducing the overall waiting time.)
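A small illustration, on assumed toy data, of why descending-order presorting balances the cyclic per-thread load (sequence length stands in for computation cost):

```python
def presort_descending(seqs: list[str]) -> list[str]:
    """The preprocessing step: sort query sequences by decreasing length."""
    return sorted(seqs, key=len, reverse=True)

def per_thread_load(seqs: list[str], k: int) -> list[int]:
    """Total characters assigned to each of k threads under cyclic assignment."""
    loads = [0] * k
    for i, s in enumerate(seqs):
        loads[i % k] += len(s)
    return loads
```

On the toy input below, cyclic assignment over the unsorted file gives loads of 15 vs 3 for two threads, while the descending-sorted file gives 10 vs 8, so the threads finish at similar times.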
Referring to FIG. 8, setting the number of threads per buffer block and the number of query sequences: besides preprocessing and sorting the data to reduce load imbalance, another source of imbalance remains: the number of sequences assigned to each thread in a block may differ, producing uneven load. This inequality is especially acute when a single sequence is relatively long. For example, with an initial block size of 500 MB and a group of 24 threads created to compute that one block, the first sequence may itself reach 500 MB, so only thread 0 computes while the other threads wait. This is similar to serial operation and severely hurts performance. For this case we design the block size as dynamically scheduled: the size of each block is only a reference value instead of a fixed size, and when the read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the number of threads; if not, reading continues until it is.
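The dynamic block sizing, reading past the reference size until the sequence count is a multiple of the thread count, might look like the following sketch (illustrative names; in-memory lists stand in for file reads):

```python
def read_block_dynamic(sequences: list[str], start: int,
                       block_size: int, n_threads: int):
    """Read until the byte threshold is reached, then keep reading until the
    number of sequences in the block is a multiple of the thread count, so
    every thread in the group receives the same number of sequences."""
    total, i = 0, start
    while i < len(sequences) and total < block_size:
        total += len(sequences[i])
        i += 1
    # pad the block up to a multiple of n_threads
    while i < len(sequences) and (i - start) % n_threads != 0:
        i += 1
    return sequences[start:i], i
```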
Referring to fig. 9, analysis of timing results for load imbalance: experiments that create the same number of threads for different block sizes and compare the run times on the file before and after sorting show that uneven load greatly affects program performance. The results show that after the source data are sorted, the run time does not fluctuate much as the block size changes, while the run times for the unsorted source file differ greatly, beyond the floating range attributable to thread-creation overhead and the corresponding hardware resources.
Referring to fig. 10, experimental results and performance analysis: double buffering was implemented and tested on different data. The improvement at the I/O level is clear in the results, and the performance of the double-buffer model is obviously superior to directly accessing external data (see figure 5: for the same source data file, the horizontal axis is the file number and the vertical axis is the running time). Moreover, with a larger number of threads created, directly reading sequences from the same external file causes the processing time to rise rapidly, while the double-buffer model has no such problem.
In the final experimental results, when the file is small, the proportion of double buffering's inherent overhead (creating and recycling many threads, reclaiming buffer space) becomes visible, but this overhead is much smaller than the read-ahead model's bottleneck of requiring the file to be smaller than memory. In a sense, read-ahead can be regarded as a special double-buffer model whose block size is the whole file. Even when memory is large enough to hold a file of any size, reading the whole file at once is serial with the computation, and that one-time read cost is much higher than the serial overhead of the double-buffer model, which reads only the first block serially (the reads of the remaining blocks are covered by computation). So in terms of I/O overhead the double-buffer model is superior to reading directly into memory in advance. And as the last data point shows, even when the file fits in memory, once its size approaches the memory size, the computation itself also needs memory space; when the data set occupies most of the memory, performance drops sharply and the running time exceeds that of the external-memory direct-data model.
Example two
The present embodiment provides a system for implementing large-scale database clustering with a double-buffer model, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the following steps are implemented:
sorting the gene sequence database in descending order;
constructing a matching dictionary: building a sparse suffix array for one gene sequence to serve as the dictionary, matching the other gene sequences against this dictionary suffix array, performing a binary search at a given position of the query sequence during matching, and using the inverse suffix array, the longest common prefix (LCP) array, and suffix links for optimization; after the computed match value reaches a threshold, the sequence is judged to be redundant.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they do not limit the scope of the present invention; it should be understood that various modifications and variations can be made by those skilled in the art, without inventive effort, based on the technical solution of the present invention.

Claims (10)

1. A method for implementing large-scale database clustering with a double-buffer model, characterized by comprising the following steps:
sorting the gene sequence database in descending order;
establishing two buffer areas, and preloading the descending-sorted gene sequence file into the two buffer areas;
constructing a matching dictionary: a sparse suffix array is built for one gene sequence in the buffer area to serve as the dictionary, and the other gene sequences are matched against this dictionary suffix array; during matching, a binary search is performed at a given position of the query sequence, processed with the inverse suffix array, the longest common prefix (LCP) array, and suffix links, thereby realizing clustering of the biological gene sequences.
2. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein, when similarity matching is performed between two sequences, the longer one is by default the representative sequence and the shorter one the redundant sequence; the first sequence after sorting must be a representative sequence, and a sequence arranged below it whose similarity to the representative sequence reaches the threshold is marked as redundant.
3. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein the whole gene sequence file is divided into a plurality of blocks, and two buffer areas, each equal in size to one block, are established in memory;
the blocks of the gene sequence file are loaded into the two buffer areas in turn: computation is performed on the data in one buffer area while the file is read into the other buffer area, with a corresponding synchronization strategy, so that when the computation time is much longer than the I/O time, the I/O time is covered by the computation time.
4. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein the size and boundary of each buffer area are set such that the actual amount of data in a buffer area exceeds the set size limit by less than the length of one sequence.
5. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein the I/O operation and the MEM-check computation operation are synchronized as follows: one thread is created to read the file into a buffer area, then a group of threads is created to perform the MEM-check computation, and synchronization is implemented by setting the semaphores full and empty.
6. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein, for N sequences, K threads are created for the computation, N being much larger than K; each thread processes the sequences whose index is congruent to the thread's number modulo K, so that the set of indices assigned to each thread is fixed.
7. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein, for the plurality of query sequences in the same buffer block, the longest computation time covers the next-longest computation times, thereby reducing the overall waiting time.
8. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein the query source file is sorted in advance so that the sequences are arranged in descending order of length; since the computation time of the maximal exact matching algorithm is positively correlated with query length, the amount of data handled by each thread is similar, the difference between thread running times is small, and the overall running time tends toward the average rather than the worst-case running time.
9. The method for implementing large-scale database clustering with a double-buffer model according to claim 1, wherein the block size is dynamically scheduled: the block size is only a reference value rather than a fixed size, and when the read operation reaches the block size, it is judged whether the number of sequences read is an integer multiple of the number of threads; if not, reading continues until that condition is met.
10. A system for implementing large-scale database clustering with a double-buffer model, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the following steps:
sorting the gene sequence database in descending order;
establishing two buffer areas, and preloading the descending-sorted gene sequence file into the two buffer areas;
constructing a matching dictionary: a sparse suffix array is built for one gene sequence in the buffer area to serve as the dictionary, and the other gene sequences are matched against this dictionary suffix array; during matching, a binary search is performed at a given position of the query sequence, processed with the inverse suffix array, the longest common prefix (LCP) array, and suffix links, thereby realizing clustering of the biological gene sequences.
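The synchronization described in claims 3 and 5 can be sketched as a producer/consumer pipeline with counting semaphores (a minimal model, not the claimed implementation; `read_block` behavior is simulated by iterating over in-memory blocks, and `compute` stands in for the MEM-check computation):

```python
# Double-buffer synchronization sketch using `full`/`empty` semaphores.
# One reader thread fills the two buffers alternately; the main thread
# drains them with the computation step, so reads of later blocks
# overlap with computation on earlier ones.

import threading

def run_pipeline(blocks, compute):
    buffers = [None, None]
    empty = threading.Semaphore(2)   # both buffers start empty
    full = threading.Semaphore(0)    # no buffer is filled yet
    results = []

    def reader():
        for i, block in enumerate(blocks):
            empty.acquire()          # wait for a free buffer
            buffers[i % 2] = block   # simulated file read
            full.release()           # announce a filled buffer

    t = threading.Thread(target=reader)
    t.start()
    for i in range(len(blocks)):
        full.acquire()               # wait until a buffer holds data
        results.append(compute(buffers[i % 2]))
        empty.release()              # hand the buffer back to the reader
    t.join()
    return results

if __name__ == "__main__":
    out = run_pipeline([[1, 2], [3], [4, 5, 6]], compute=sum)
    print(out)   # [3, 3, 15]
```

The `empty`/`full` pair enforces exactly the claimed invariant: the reader never overwrites an unprocessed buffer, and the computation never reads a buffer that has not been filled.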
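Claim 6's fixed modulo index assignment and claim 9's block-boundary adjustment can likewise be sketched (the helper names below are illustrative, not from the patent):

```python
# Sketch of claim 6 (fixed modulo index assignment) and claim 9
# (extending a block so its sequence count is a multiple of the
# thread count).

def indices_for_thread(k, n_sequences, n_threads):
    # Thread k processes every sequence i with i % n_threads == k,
    # so the assignment is fixed and needs no runtime coordination.
    return [i for i in range(n_sequences) if i % n_threads == k]

def adjusted_block_count(count_at_limit, n_threads):
    # The block size is only a reference: keep reading sequences until
    # their number is an integer multiple of the thread count.
    extra = (-count_at_limit) % n_threads
    return count_at_limit + extra

if __name__ == "__main__":
    print(indices_for_thread(1, 10, 4))   # [1, 5, 9]
    print(adjusted_block_count(10, 4))    # 12
```

Fixing each thread's index set avoids work-queue contention, and rounding the block up to a thread-count multiple keeps the per-thread load balanced at block boundaries.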
CN202010213789.4A 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model Active CN111415708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010213789.4A CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010213789.4A CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Publications (2)

Publication Number Publication Date
CN111415708A true CN111415708A (en) 2020-07-14
CN111415708B CN111415708B (en) 2023-05-05

Family

ID=71493217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010213789.4A Active CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Country Status (1)

Country Link
CN (1) CN111415708B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317114A (en) * 1998-07-10 2001-10-10 快速检索及传递公司 Search system and method for retrieval of data, and use thereof in search engine
CN103686077A (en) * 2013-11-29 2014-03-26 成都亿盟恒信科技有限公司 Double buffering method applied to realtime audio-video data transmission of 3G wireless network
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
WO2018054496A1 (en) * 2016-09-23 2018-03-29 Huawei Technologies Co., Ltd. Binary image differential patching


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PATRICK FLICK et al.: "Parallel distributed memory construction of suffix and longest common prefix arrays" *

Also Published As

Publication number Publication date
CN111415708B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
US5872985A (en) Switching multi-context processor and method overcoming pipeline vacancies
US6216220B1 (en) Multithreaded data processing method with long latency subinstructions
EP2656229B1 (en) Mechanism for conflict detection using simd
US7447868B2 (en) Using vector processors to accelerate cache lookups
US7865898B2 (en) Repartitioning parallel SVM computations using dynamic timeout
US9712646B2 (en) Automated client/server operation partitioning
US9268595B2 (en) Scheduling thread execution based on thread affinity
KR20100013257A (en) Method and apparatus for partitioning and sorting a data set on a multi-processor system
WO2005119429B1 (en) Multiple branch predictions
CN106503235B (en) The distributed treatment implementation method of XP-EHH algorithm based on Spark platform
CN1196997C (en) Load/load detection and reorder method
US11526960B2 (en) GPU-based data join
WO2014139140A1 (en) Co-processor-based array-oriented database processing
CN111125769B (en) Mass data desensitization method based on ORACLE database
US6516462B1 (en) Cache miss saving for speculation load operation
US7461211B2 (en) System, apparatus and method for generating nonsequential predictions to access a memory
US5770894A (en) Parallel processing method having arithmetical conditions code based instructions substituted for conventional branches
CN116092587B (en) Biological sequence analysis system and method based on producer-consumer model
CN111415708A (en) Method and system for realizing large-scale database clustering by double-buffer model
Böhm et al. Index-supported similarity join on graphics processors
CN114489518B (en) Sequencing data quality control method and system
Chen et al. An exact matching approach for high throughput sequencing based on bwt and gpus
CN111782609B (en) Method for rapidly and uniformly slicing fastq file
Xu et al. Rabbittclust: enabling fast clustering analysis of millions of bacteria genomes with minhash sketches
CN109584967B (en) Parallel acceleration method for protein identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant