CN111415708B - Method and system for realizing large-scale database clustering by double buffer model - Google Patents

Method and system for realizing large-scale database clustering by double buffer model Download PDF

Info

Publication number
CN111415708B
CN111415708B (application CN202010213789.4A)
Authority
CN
China
Prior art keywords
buffer
sequence
matching
double
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010213789.4A
Other languages
Chinese (zh)
Other versions
CN111415708A
Inventor
刘卫国
徐晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010213789.4A priority Critical patent/CN111415708B/en
Publication of CN111415708A publication Critical patent/CN111415708A/en
Application granted granted Critical
Publication of CN111415708B publication Critical patent/CN111415708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for realizing large-scale database clustering with a double-buffer model. The gene sequence database is sorted by decreasing length. A matching dictionary is then built: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for optimization; once the computed matching value reaches a threshold, the sequence is judged redundant. Clustering of the biological gene sequences of a large-scale database and removal of redundant gene sequences are performed with exact matching operations on the gene sequences, while I/O on the large-scale data files is performed with double-buffered multithreaded parallel operation, so that data under these conditions can be processed quickly.

Description

Method and system for realizing large-scale database clustering by double buffer model
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method and a system for realizing large-scale database clustering by a double-buffer model.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Gene sequencing data is doubling at ever greater speed, so the most prominent core problem in biological big-data studies that analyze genes and health data is the sheer volume of data.
Genome processing is, at bottom, a matching operation over genome sequences. For these data, performing genome matching is not an especially difficult algorithmic problem. However, the data that now need to be processed are at the TB level or beyond; in other words, such data simply cannot be held in the memory of an ordinary machine. Moreover, for huge genetic data, many algorithms and related processing steps are also based on operations that remove redundant sequences.
Clustering algorithms for biological gene sequences operate by computing distances between gene sequences, using maximal exact match algorithms (Maximal Exact Matches, MEMs) for similar genes.
Biological gene sequence processing operations are largely motivated by the many redundant sequences that do not represent genetic diversity or the diversity of species and proteins. In other words, not all of this huge amount of data is useful for representing genes. When the exact-match similarity of two sequences reaches a certain threshold, the two sequences are considered to belong to the same class; one of them is the representative sequence and the other is a redundant sequence. In a biological sense, the longest sequence in a class is by default the most representative and serves as the representative sequence.
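As an illustrative sketch only (the similarity function below is a hypothetical stand-in, not the MEMs-based measure used by the invention), the representative/redundant rule described above can be expressed as a greedy pass over length-sorted sequences:

```python
def classify(sequences, similarity, threshold):
    """Greedy clustering by the rule above: after sorting by decreasing
    length, the first sequence of each class is its representative, and any
    later sequence whose similarity to a representative reaches the
    threshold is marked redundant."""
    ordered = sorted(sequences, key=len, reverse=True)
    representatives, redundant = [], []
    for seq in ordered:
        for rep in representatives:
            if similarity(rep, seq) >= threshold:
                redundant.append(seq)
                break
        else:
            representatives.append(seq)
    return representatives, redundant

def toy_similarity(a, b):
    """Placeholder similarity: fraction of matching positions over the
    shorter sequence (the invention computes similarity via exact matching)."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return sum(1 for i, c in enumerate(short) if long_[i] == c) / len(short)
```

For example, `classify(["ACGTACGT", "ACGT", "TTTT"], toy_similarity, 0.9)` keeps "ACGTACGT" and "TTTT" as representatives and marks "ACGT" as redundant.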
The technical problem to be solved by this application is as follows: when the data volume is too large, reading the data to be processed is time-consuming. A double-buffering method is therefore adopted so that, in the program that computes similarity by exact matching, computation and data reading proceed in parallel and the data-reading time is covered by computation.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for realizing large-scale database clustering with a double-buffer model, in which I/O operations on large-scale data files are realized with double-buffered multithreading.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
A method for realizing large-scale database clustering with a double-buffer model comprises the following steps:
performing length decreasing sorting on the gene sequence database;
establishing two buffers in memory, each buffer equal in size to one block, and preloading the first blocks of the gene sequence file into the two buffers;
building a matching dictionary: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for processing, realizing clustering of the biological gene sequences; a sequence whose similarity obtained in matching reaches the threshold is judged redundant.
In a further technical scheme, when two sequences are matched for similarity, the longer one is by default the representative sequence and the shorter one the redundant sequence; the first sequence after sorting is necessarily a representative sequence, and any later sequence whose similarity to it reaches the threshold is marked as redundant.
According to a further technical scheme, the whole gene sequence file is divided into a plurality of blocks;
the first blocks of the gene sequence file are preloaded into the two buffers; then, while computation proceeds on the data in one buffer, the next block of the file is loaded into the other buffer, with a corresponding synchronization strategy, so that when the computation time far exceeds the I/O time, the computation covers the I/O time.
Further, the buffer size and boundaries are set such that the actual size of each buffer's data is slightly larger than the configured size limit, by less than the length of one sequence.
In a further technical scheme, the I/O operations and the MEM-check computation are synchronized as follows: one thread is created to read the file into a buffer, a group of threads is created to perform the MEM-check computation, and synchronization is realized by setting the semaphores full and empty.
In a further technical scheme, K threads are created to compute N sequences, with N far greater than K; each thread computes the sequences whose index is congruent to its thread number modulo K, so the indices handled by each thread are fixed.
In a further technical scheme, among the multiple query sequences in the same buffer block, the longest computation time covers the next-longest computation time, reducing the waiting time at the whole-block level.
In a further technical scheme, the query source file is sorted in advance so that its data sequences are arranged in descending order of length. The data lengths assigned to the threads are then similar; since the computation time of the maximal exact matching algorithm is positively correlated with query length, the running times of the threads differ little, and the overall running time tends toward the average time rather than the worst-case time.
In a further technical scheme, the block size is dynamically scheduled: each block size is only a reference value rather than a fixed size. When the read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the number of threads; if not, reading continues until it is.
The invention also discloses a system for realizing large-scale database clustering with a double-buffer model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor realizing the following steps when executing the program:
performing length decreasing sorting on the gene sequence database;
building a matching dictionary: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for optimization; once the computed matching value reaches the threshold, the sequence is judged redundant.
The one or more of the above technical solutions have the following beneficial effects:
Clustering of the biological gene sequences of a large-scale database and removal of redundant gene sequences can be performed with exact matching operations on the gene sequences, while I/O on the large-scale data files is performed with double-buffered multithreaded parallel operation, so that data under these conditions can be processed rapidly.
In the processing of large-scale data, the double-buffer model masks the data I/O time through parallelism.
Pre-sorting and the modulo-remainder assignment of sequences can realize load balancing between the computing threads.
The block size is dynamically scheduled so that the number of sequences allocated to each computing thread is equal and the sequence lengths are similar; the tasks of the threads are then roughly the same, achieving load balancing. Load balancing in this application mainly means keeping the workload of each thread as similar as possible, so that parallelism is maximized, rather than one thread being very busy while the others are idle; in the latter case most threads wait for the busy thread to finish, parallelism is poor, utilization of computing resources is low, and more time is consumed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of external-memory direct data access (read-file) according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of reading the file into memory in advance (load-all) according to an embodiment of the present invention;
FIGS. 3 (a) -3 (b) are diagrams illustrating double buffer according to embodiments of the present invention;
FIG. 4 is a schematic diagram illustrating an inter-process synchronization process according to an embodiment of the present invention;
FIG. 5 is a diagram showing different thread runtimes in a block according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a problem of uneven load in an out-of-order condition in a block according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a load balancing process according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating the configuration of the number of threads per buffer block and the number of query sequences according to an embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating analysis of time results for load non-uniformity according to an embodiment of the present invention;
FIG. 10 is a comparative run-time diagram under three models of an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
This embodiment discloses the general flow of a large-scale database clustering algorithm realized with a double-buffer model and the MEMs maximal exact matching algorithm:
Length-decreasing sorting is performed on the gene sequence database. When two sequences are matched for similarity, the longer one is by default the representative sequence and the shorter one the redundant sequence. The first sequence after sorting is therefore necessarily a representative sequence, and any later sequence whose similarity to it reaches the threshold is marked as redundant.
The implementation of the algorithm first needs to construct a matching dictionary. Specifically, a sparse suffix array (Sparse Suffix Array, SSA) is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against this dictionary suffix array. During matching, binary search is used at each position of the query sequence, and an inverse suffix array (Inverse Suffix Array, ISA), a longest common prefix array (Longest Common Prefix, LCP) and suffix links are used for optimization; once the computed matching value reaches the threshold, the sequence is judged redundant.
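A minimal sketch of the dictionary idea, assuming only the sparse suffix array and plain binary search (the ISA, LCP array and suffix-link accelerations of the actual method are omitted, and the names are illustrative):

```python
def build_sparse_suffix_array(text, k=2):
    """Sparse suffix array: the indices of every k-th suffix of `text`,
    sorted lexicographically by suffix."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])

def longest_match_at(text, ssa, query):
    """Binary-search the sampled suffixes for the longest prefix of `query`
    occurring at a sampled position of `text` (the neighbors of the
    insertion point are the best candidates)."""
    lo, hi = 0, len(ssa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[ssa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    best = 0
    for cand in (lo - 1, lo):
        if 0 <= cand < len(ssa):
            suf = text[ssa[cand]:]
            m = 0
            while m < min(len(suf), len(query)) and suf[m] == query[m]:
                m += 1
            best = max(best, m)
    return best
```

For `text = "ACGTACGT"` with every second suffix sampled, the array is `[4, 0, 6, 2]`, and a query position starting with "ACG" yields a match of length 3.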
The basic operation of the gene sequence clustering algorithm is that one sequence is searched against a very large sequence database. The former sequence is located in the reference file (ref file); the latter database is the query file. During execution of the algorithm, both sequences of a matching computation are placed in memory. The ref file contains one or a few sequences and is smaller than the query file, so the usual approach is to read it directly into memory and fetch it from memory while processing the source data. For the query file, if memory is large enough it can be loaded at once, using the load-all method. When the query file is too large to fit in memory, other methods must be used (external-memory direct data access, or double buffering).
Referring to fig. 1, external-memory direct data access (read-file): when the required source file cannot be fully loaded into memory, the source data can be read directly from the source file without first bringing it into memory. Each sequence is loaded from the file as needed, so each sequence incurs one I/O operation, and the MEM-check computation can continue even when the source data file is very large. But this also presents a significant problem: a file-read operation is performed every time a piece of data is fetched, so frequent I/O calls generate great expense.
Referring to fig. 2, reading into memory in advance (load-all): in the initial stage of the experiments, when the source query data file is not particularly large and fits in memory, the source data file can be read into memory at once (load-all); data extraction during computation then happens directly in memory, avoiding high-frequency I/O. However, this incurs the overhead of reading the file into memory once, which is not a significant problem compared with frequent I/O. The biggest bottleneck of reading the source data file into memory in advance is that it is limited to files smaller than memory; otherwise the entire file cannot be read in. So although this approach is more efficient, its bottleneck is more pronounced, and it is not suitable for all processing situations.
Origin of the double-buffer concept: based on the above two cases, a double-buffer mode is proposed, which is a compromise on how much data is brought into memory and is more efficient to implement.
Referring to figs. 3(a)-3(b), the essence of double buffering is this: the entire file is divided into multiple blocks, and two buffers, each equal in size to a block, are built in memory. The first blocks of the file are preloaded into the two buffers; then, while the data in one buffer are computed, the next block of the file is loaded into the other buffer, with a corresponding synchronization strategy, so that when the computation time far exceeds the I/O time, computation covers the I/O time.
It should be noted that a file contains many pieces of gene sequence data; the count can reach millions or even hundreds of millions.
Specifically, the two buffers hold only two blocks (the first two), each block containing multiple gene sequences.
When the file is preloaded into these two buffers, it must first be clear that all the data required for a computation must be in memory. The first block is loaded into a buffer (in memory) and the data in this buffer (the first block's data) are computed; during that computation the second block is loaded into the second buffer; a thread-synchronization mechanism then waits for the computation on the first buffer to finish before computing the data in the second buffer, while the third block is loaded into the first buffer, and so on alternately. Except for the first block, every read into memory runs in parallel with the computation on the previous block, so the corresponding file-read time is masked by computation.
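The time saving this alternation is meant to achieve can be illustrated with a simple cost model (an illustration of the principle, not a measurement): with double buffering, only the first block's read is exposed, and every later read overlaps the previous block's computation.

```python
def serial_time(io, compute):
    """Cost if every block is first read and then computed, serially."""
    return sum(io) + sum(compute)

def double_buffer_time(io, compute):
    """Cost with two buffers: block 0 is read up front; thereafter reading
    block i overlaps computing block i-1, so each step costs the larger of
    the two; the last block's computation is then exposed at the end."""
    t = io[0]
    for i in range(1, len(io)):
        t += max(io[i], compute[i - 1])
    return t + compute[-1]
```

When computation dominates (every compute time at least as large as the next read), `double_buffer_time` reduces to `io[0] + sum(compute)`: all reads except the first are fully covered.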
Buffer block size and boundary handling: the size of the buffer block has a significant impact on performance. When the block is set large, equal to the file size, the double-buffer model degenerates to the load-all model; when the block is set small, close to the size of a single sequence, it degenerates to the external-memory direct-access model. Different block sizes change the number of blocks and therefore the overall overhead, since the block is the basic unit of operation.
The size of each buffer block is fixed, and the corresponding boundary problem must be handled when reading into the buffer. For example, with the block size set to 20 MB, when the last sequence is being read its first half may already be in the buffer while adding the second half would exceed 20 MB; in that case the whole sequence is still read into the buffer, so that it can be operated on effectively in later computation. The actual size of each buffer's data is therefore slightly larger than the configured limit (20 MB in the example), by less than the length of one sequence.
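A sketch of this boundary rule, assuming the file has already been split into individual sequences (function and variable names are illustrative): reading stops only after the sequence that crosses the limit has been taken whole.

```python
def read_block(sequences, start, block_size):
    """Fill one block: keep appending whole sequences until the configured
    size is reached; the sequence that crosses the limit is still included,
    so a block exceeds block_size by less than one sequence length."""
    block, total, i = [], 0, start
    while i < len(sequences) and total < block_size:
        block.append(sequences[i])
        total += len(sequences[i])
        i += 1
    return block, i  # i is where the next block starts
```

With a 6-character limit and sequences of lengths 4, 4 and 2, the first block takes the first two sequences (8 characters, exceeding the limit by less than one sequence).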
Referring to fig. 4, the synchronization process: besides handling boundary problems, another important issue is synchronization between the file-reading operations and the computing operations. For computation time to cover I/O time well, the synchronization between computing on one buffer block and performing I/O on the other must be handled properly, and parallel execution of the two must be guaranteed. If they run serially there is no performance gain; the computation just waits for the I/O, and the various additional overheads make performance worse instead.
In terms of the synchronization between the simple I/O operations and the MEM-check computation, this is a fairly typical producer-consumer synchronization model. One thread is created to read the file into a buffer, a group of threads is created to perform the MEM-check computation, and synchronization is realized by setting the semaphores full and empty.
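A minimal sketch of this producer-consumer scheme with full/empty semaphores (a single consumer is shown for brevity; the invention uses a group of MEM-check threads, and `compute` here is a hypothetical stand-in for the MEM-check work):

```python
import threading

def run_pipeline(blocks, compute, num_buffers=2):
    """One reader thread fills the buffers, guarded by counting semaphores:
    `empty` counts free buffer slots, `full` counts loaded, unprocessed
    ones. `compute` is applied to each block; results come back in order."""
    buffers = [None] * num_buffers
    empty = threading.Semaphore(num_buffers)  # free buffer slots
    full = threading.Semaphore(0)             # loaded, unprocessed slots
    results = []

    def reader():
        for i, blk in enumerate(blocks):
            empty.acquire()                   # wait for a free buffer
            buffers[i % num_buffers] = blk    # "I/O": load block into buffer
            full.release()                    # signal a block is ready

    t = threading.Thread(target=reader)
    t.start()
    for i in range(len(blocks)):
        full.acquire()                        # wait for a loaded buffer
        results.append(compute(buffers[i % num_buffers]))
        empty.release()                       # hand the buffer back
    t.join()
    return results
```

Because `empty` starts at 2, the reader can stay one block ahead of the computation, which is exactly the overlap the double-buffer model relies on.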
FIG. 5 shows the running times of different threads within a block under the multithread model. K threads are created for N sequences (indices 0 to N-1), with N far greater than K; each thread computes the sequences whose index is congruent to its thread number modulo K, so the indices handled by each thread are fixed. The basic unit of operation is the block: over the N pieces of data in a block, some threads finish early while others are assigned relatively long sequences, and the other K-1 threads all wait for the slowest thread to finish (as shown in fig. 6) before the operation on the whole block ends, the semaphore changes, and the next block is processed. The overall running time therefore depends not on the average time but on the slowest thread, a typical load-imbalance problem. FIG. 6 shows that computing in block units after partitioning amplifies this imbalance.
Referring to fig. 7, data preprocessing is proposed against load imbalance: the query source file is sorted in advance so that its data sequences are arranged in descending order of length. The data lengths assigned to the threads are then similar; since the computation time of the maximal exact matching algorithm is positively correlated with query length, the running times of the threads differ little, and the overall running time tends toward the average time rather than the worst-case time. (Simply put, among the multiple query sequences in the same buffer block, the longest computation covers the next longest, reducing the waiting time at the whole-block level.)
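The combined effect of pre-sorting and modulo assignment can be sketched as follows (an illustration with hypothetical names): thread t takes the sequences at indices congruent to t, so after descending sorting the per-thread total lengths stay close.

```python
def assign_round_robin(sequences, k):
    """Sort by decreasing length, then give thread t every k-th sequence
    starting at index t (index mod k == t), balancing total work."""
    ordered = sorted(sequences, key=len, reverse=True)
    return [ordered[t::k] for t in range(k)]
```

With sequence lengths 8, 7, 6, 5, 4 and 3 and k = 2, the two threads receive total lengths 18 and 15, rather than 21 and 12 as a contiguous split would give.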
Referring to fig. 8, setting the block by thread count and query-sequence count: besides preprocessing the data to reduce load imbalance, another source of imbalance remains: the sequences in a block may not divide evenly among the threads. This problem is particularly acute when a single sequence is relatively long. For example, suppose an initial block size of 500 MB and a group of 24 threads created to compute on one block; if the first sequence itself is 500 MB long, only thread 0 performs computation while all other threads wait. This is close to serial operation and severely impacts performance. For this case the block size is designed to be dynamically scheduled: each block size is only a reference value rather than a fixed size. When the read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the thread count; if not, reading continues until it is.
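The dynamic scheduling rule can be sketched like this (illustrative only; real file I/O is replaced by a list of pre-split sequences): the reference size acts as a floor, and reading continues until the sequence count divides evenly among the threads.

```python
def read_dynamic_block(sequences, start, block_size, num_threads):
    """Fill one block using block_size as a reference value only: after the
    size is reached, keep reading until the number of sequences in the
    block is an integer multiple of num_threads, so every thread in the
    group receives the same number of sequences."""
    block, total, i = [], 0, start
    while i < len(sequences):
        block.append(sequences[i])
        total += len(sequences[i])
        i += 1
        if total >= block_size and len(block) % num_threads == 0:
            break
    return block, i
```

With 4-character sequences, a reference size of 14 and 3 threads, the size is first exceeded at 4 sequences, but reading continues to 6 so that each thread gets exactly 2.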
Referring to fig. 9, time results for load imbalance: experiments measuring the running time, before and after sorting, of files with the same number of sequences per thread under different block sizes show that load imbalance greatly affects program performance. The results show that, within the limits of thread-creation overhead and the corresponding hardware resources, the running time after sorting the source data barely fluctuates as the block size changes, while the running times of unsorted source files differ greatly.
Referring to fig. 10, experimental results and performance analysis: the double-buffer implementation was tested on different data. At the I/O level the improvement is obvious, and the double-buffer model clearly outperforms direct external-memory access (see fig. 5: for the same source data file, the horizontal axis shows different file numbers and the vertical axis the running time). Moreover, when a large number of threads is created, reading sequences directly from the same external-memory file causes processing time to increase rapidly, while the double-buffer model has no such problem.
In the final experimental results, the inherent overheads of double buffering (creating many threads and recycling buffer space) show up when the file is small, but these overheads are much smaller than the bottleneck that the pre-read model's file must fit in memory. In a sense, pre-reading into memory can be regarded as a special double-buffer model whose block size is the entire file. Even with memory large enough to hold any file, reading the whole file at once is serial with the computation, and that one-time read costs much more than the double-buffer model's only serial part, the read of the first block (the reads of all other blocks are covered by computation). Hence, in terms of I/O overhead the double-buffer model is superior to direct pre-reading into memory. And as the last data point shows, even when the file fits in memory, once the file size approaches the memory size the various computations also need memory space; when the data set occupies most of the memory, performance drops sharply and the running time even exceeds that of the external-memory direct-access model.
Example two
It is an object of this embodiment to provide a system for realizing large-scale database clustering with a double-buffer model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor realizing the following steps when executing the program:
performing length decreasing sorting on the gene sequence database;
building a matching dictionary: a sparse suffix array is constructed from one gene sequence and used as the dictionary, and the other gene sequences are matched against the dictionary suffix array; during matching, binary search is used at each position of the query sequence, and an inverse suffix array, a longest common prefix array and suffix links are used for optimization; once the computed matching value reaches the threshold, the sequence is judged redundant.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. A method for realizing large-scale database clustering with a double buffer model, characterized by comprising the following steps:
performing length-decreasing sorting on the gene sequence database;
establishing two buffer areas, and preloading the whole length-sorted gene sequence file into the two buffer areas;
building a matching dictionary: constructing a sparse suffix array from one gene sequence in the buffer as the dictionary, matching the other gene sequences against the dictionary suffix array, performing binary search at each matching position of the query sequence, and using the inverse suffix array, the longest common prefix array, and suffix links for processing, so as to realize clustering of the biological gene sequences.
2. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, when two sequences are matched for similarity, the longer one is taken by default as the representative sequence and the shorter one as the redundant sequence; after sorting, the first sequence is necessarily a representative sequence, and any subsequent sequence whose similarity to it reaches the threshold is marked as redundant.
3. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the whole gene sequence file is divided into a plurality of blocks, and two buffer areas are then established in memory, the size of each buffer area being equal to the block size;
the whole gene sequence file is preloaded into the two buffer areas; thereafter, while computation proceeds on the data in one buffer, the file is loaded into the other buffer under a corresponding synchronization strategy, so that, when the computation time far exceeds the I/O time, the I/O time is covered by the computation time.
4. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the size and boundary of each buffer are set such that the actual amount of data in a buffer may slightly exceed the set size limit, by less than the length of one sequence.
5. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, when the I/O operation and the MEM-check computing operation are synchronized, one thread is created to read the file into a buffer and a group of threads is created to perform the MEM-check computation, synchronization being realized by setting the synchronization semaphores full and empty.
6. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, for N sequences, K threads are created to compute in parallel, N being far greater than K; each thread processes the sequences whose index is congruent to its thread number modulo K, so that the set of indices handled by each thread is fixed.
7. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein, among the multiple query sequences in the same buffer block, the longest computation time masks the shorter computation times, so that the waiting time is reduced at the overall level.
8. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the query source file is sorted in advance so that its sequences are arranged in decreasing length; the amounts of data assigned to the threads are then similar, and since the running time of the maximal exact match algorithm is positively correlated with query length, the running times of the threads differ little and the overall running time tends toward the average rather than the worst-case time.
9. The method for realizing large-scale database clustering with a double buffer model according to claim 1, wherein the block size is scheduled dynamically: the size of each block is only a reference value rather than a fixed size, and when a read operation reaches the block size, it is checked whether the number of sequences read is an integer multiple of the number of threads; if not, reading continues until it is.
10. A large-scale database clustering system based on the double buffer model, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the following steps:
performing length-decreasing sorting on the gene sequence database;
establishing two buffer areas, and preloading the whole length-sorted gene sequence file into the two buffer areas;
building a matching dictionary: constructing a sparse suffix array from one gene sequence in the buffer as the dictionary, matching the other gene sequences against the dictionary suffix array, performing binary search at each matching position of the query sequence, and using the inverse suffix array, the longest common prefix array, and suffix links for processing, so as to realize clustering of the biological gene sequences.
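The synchronization scheme of claims 5 and 6 — one reader thread, a group of compute threads, full/empty semaphores, and a fixed index-modulo-K work split — can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `cluster_pipeline` and `process` are hypothetical, and a trivial per-sequence function stands in for the MEM-check computation.

```python
import threading

def cluster_pipeline(blocks, n_workers, process):
    """Double-buffer pipeline: one reader thread, K compute threads per block."""
    empty = threading.Semaphore(2)       # two buffer slots, both empty at start
    full = threading.Semaphore(0)        # no filled buffer yet
    queue, lock, results = [], threading.Lock(), []

    def reader():                        # stands in for the file-reading thread
        for block in blocks:
            empty.acquire()              # wait for a free buffer slot
            with lock:
                queue.append(block)
            full.release()               # signal one filled buffer

    def compute():
        for _ in blocks:
            full.acquire()               # wait for a filled buffer
            with lock:
                block = queue.pop(0)
            out = [None] * len(block)

            def work(tid):               # thread tid takes indices i with i % K == tid
                for i in range(tid, len(block), n_workers):
                    out[i] = process(block[i])

            workers = [threading.Thread(target=work, args=(t,))
                       for t in range(n_workers)]
            for w in workers:
                w.start()
            for w in workers:
                w.join()
            results.extend(out)
            empty.release()              # recycle the buffer slot

    r, c = threading.Thread(target=reader), threading.Thread(target=compute)
    r.start(); c.start()
    r.join(); c.join()
    return results
```

Because each worker's index set is fixed by the modulo rule, per-block results come back in input order without any reordering step.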
CN202010213789.4A 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model Active CN111415708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010213789.4A CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010213789.4A CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Publications (2)

Publication Number Publication Date
CN111415708A CN111415708A (en) 2020-07-14
CN111415708B true CN111415708B (en) 2023-05-05

Family

ID=71493217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010213789.4A Active CN111415708B (en) 2020-03-24 2020-03-24 Method and system for realizing large-scale database clustering by double buffer model

Country Status (1)

Country Link
CN (1) CN111415708B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317114A (en) * 1998-07-10 2001-10-10 快速检索及传递公司 Search system and method for retrieval of data, and use thereof in search engine
CN103686077A (en) * 2013-11-29 2014-03-26 成都亿盟恒信科技有限公司 Double buffering method applied to realtime audio-video data transmission of 3G wireless network
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
WO2018054496A1 (en) * 2016-09-23 2018-03-29 Huawei Technologies Co., Ltd. Binary image differential patching


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Patrick Flick et al. Parallel distributed memory construction of suffix and longest common prefix arrays. SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, full text. *

Also Published As

Publication number Publication date
CN111415708A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
US10521239B2 (en) Microprocessor accelerated code optimizer
EP2783280B1 (en) An accelerated code optimizer for a multiengine microprocessor
JP4292198B2 (en) Method for grouping execution threads
EP2783282B1 (en) A microprocessor accelerated code optimizer and dependency reordering method
US6216220B1 (en) Multithreaded data processing method with long latency subinstructions
Liu et al. CUDA-BLASTP: accelerating BLASTP on CUDA-enabled graphics hardware
EP2656229B1 (en) Mechanism for conflict detection using simd
US9268595B2 (en) Scheduling thread execution based on thread affinity
US11308171B2 (en) Apparatus and method for searching linked lists
EP2866138A1 (en) Processor core with multiple heterogenous pipelines for emulated shared memory architectures
US7617494B2 (en) Process for running programs with selectable instruction length processors and corresponding processor system
CN111415708B (en) Method and system for realizing large-scale database clustering by double buffer model
CN111045800A (en) Method and system for optimizing GPU (graphics processing Unit) performance based on short job priority
CN116092587B (en) Biological sequence analysis system and method based on producer-consumer model
Chen et al. An exact matching approach for high throughput sequencing based on bwt and gpus
Feng et al. Accelerating Smith-Waterman alignment of species-based protein sequences on GPU
CN109584967B (en) Parallel acceleration method for protein identification
KR102210765B1 (en) A method and apparatus for long latency hiding based warp scheduling
US7093260B1 (en) Method, system, and program for saving a state of a task and executing the task by a processor in a multiprocessor system
Jiang et al. Fine-grained acceleration of hmmer 3.0 via architecture-aware optimization on massively parallel processors
KR100861701B1 (en) Register renaming system and method based on value similarity
US20040128476A1 (en) Scheme to simplify instruction buffer logic supporting multiple strands
CN116821008B (en) Processing device with improved cache hit rate and cache device thereof
US20230289185A1 (en) Data processing apparatus, method and virtual machine
US20240111526A1 (en) Methods and apparatus for providing mask register optimization for vector operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant