CN116092587B - Biological sequence analysis system and method based on producer-consumer model

Info

Publication number
CN116092587B
Authority
CN
China
Prior art keywords
data
module
biological sequence
processing module
producer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310375440.4A
Other languages
Chinese (zh)
Other versions
CN116092587A (en
Inventor
刘卫国
孙伟豪
殷泽坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202310375440.4A
Publication of CN116092587A
Application granted
Publication of CN116092587B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885: Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887: Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505: Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00: Indexing scheme relating to G06F9/00
    • G06F2209/50: Indexing scheme relating to G06F9/50
    • G06F2209/5018: Thread allocation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the technical field of biological sequence analysis and discloses a biological sequence analysis system and method based on a producer-consumer model. The system comprises an input module, a processing module and an output module. A producer-consumer model is arranged between the input module and the processing module and between the processing module and the output module. The input module and the output module each process data with a single thread, while the processing module processes data with multiple threads using dynamic partitioning. The producer in each producer-consumer model requests an empty block from the data pool, reads data into the block, and places the filled block into a data queue. The consumer takes a filled block from the data queue, uses the data, releases the data when finished, and returns the empty block to the data pool. The system has good thread scalability and achieves good load balancing.

Description

Biological sequence analysis system and method based on producer-consumer model
Technical Field
The invention relates to the technical field of biological sequence analysis, in particular to a biological sequence analysis system and a biological sequence analysis method based on a producer-consumer model.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Traditional methods for analyzing biological base sequences do not scale to large volumes of base data. Since the release of the Mash tool, minimum hash sketches (MinHash sketches) and related techniques have played an important role in comparative genomics. They are used to cluster genomes from large databases, search datasets for specific sequence content, speed up the overlap step in genome assemblers, map sequencing reads, and find the similarity thresholds that characterize species-level differences. Although MinHash was originally developed for finding similar web pages, it is used here to summarize large sets of genomic sequences, such as reference genomes or sequencing datasets: a set is reduced to a set of representative k-mers (substrings of length k) and finally stored as a list of integers. The sketch is much smaller than the original data but can be used to estimate relevant set cardinalities, e.g., the size of the union or intersection of the k-mer content of two genomes. From these cardinalities the Jaccard coefficient or the Mash distance can be obtained, the latter serving as a proxy for average nucleotide identity. These make it possible to cluster sequences and to solve a range of genome nearest-neighbor problems.
MinHash can be regarded as a form of locality-sensitive hashing (LSH), which involves hash functions intended to map similar inputs to the same value. LSH is also used elsewhere in bioinformatics, including homology search and metagenomic classification. Some studies have pointed out a weakness of MinHash: its cardinality estimates can degrade when the sets differ greatly in size. This is not uncommon, for example when computing the distance between two genomes of very different lengths, or the similarity between a short sequence (e.g., a bacterial genome) and a large set (e.g., a deeply sequenced metagenomic dataset).
The HyperLogLog (HLL) sketch is an alternative to MinHash that shows excellent accuracy and speed in a range of scenarios, including when the input sets differ greatly in size and when the sketch data structure is very small. HLL has been applied in other areas of bioinformatics, for example to count the distinct k-mers in a genome or data collection. In addition, the HyperLogLog software uses recent theoretical improvements in cardinality estimation for set unions and intersections, i.e., for estimating the Jaccard coefficient J and the other components required by similarity measures.
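For reference, the quantities named above are related by standard identities. The block below is a summary under the usual definitions, where A and B denote the k-mer sets of two genomes; the symbols A, B and D are notation introduced here for illustration, and the HyperLogLog software discussed may use a more refined intersection estimator than plain inclusion-exclusion.

```latex
J(A,B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
|A \cap B| = |A| + |B| - |A \cup B|, \qquad
D(A,B) = -\frac{1}{k}\,\ln\frac{2\,J(A,B)}{1 + J(A,B)}
```

Here D is the Mash distance for k-mer length k. Because a sketch of the union can be obtained by merging two sketches, only cardinality estimates of A, B and their union are needed to approximate J.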
Existing HyperLogLog software analyzes data inefficiently, so genome comparison takes more time. Its main problems in comparative genomics are currently: (1) The read thread is under heavy pressure. The HyperLogLog software uses only one thread to read biological sequence data, and that thread must not only read the sequence strings, which is its proper job, but also perform additional format parsing. The parsing work greatly increases the load on the read thread and lowers its efficiency. (2) The HyperLogLog software is inefficient, runs slowly and scales relatively poorly with thread count, which severely limits its practical value. (3) The HyperLogLog software uses memory inefficiently and consumes a large amount of it, so it cannot process large-scale biological sequences, only small-scale data, and does not make good use of memory resources.
Disclosure of Invention
To solve the above problems, the invention provides a biological sequence analysis system and method based on a producer-consumer model. The system is built on two producer-consumer models, and in addition the processing module adopts a multithreaded dynamic partitioning strategy, so that the system has good thread scalability, achieves good load balancing and realizes highly optimized HyperLogLog biological sequence analysis.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a first aspect of the present invention provides a biological sequence analysis system based on a producer-consumer model, comprising an input module for acquiring a biological sequence, a processing module for analyzing the biological sequence, and an output module for outputting an analysis result;
the producer-consumer model is arranged between the input module and the processing module and between the processing module and the output module; the input module and the output module both adopt single threads to process data, and the processing module adopts multithreading to dynamically divide and process the data;
the producer in the producer-consumer model applies for empty blocks from the data pool, reads data and puts the data into the blocks, generates blocks containing the data, and puts the blocks containing the data into a data queue;
and the consumer in the producer-consumer model acquires the block containing the data from the data queue for self use, releases the data after the use is finished, and returns the empty block to the data pool.
Further, a sketch construction module in the processing module promotes variables whose frequency of use during data processing exceeds a threshold to register variables stored in registers.
Further, when converting uppercase letters to lowercase letters, the sketch construction module in the processing module uses vector registers to pack multiple data items that undergo the same operation into one vector and completes the operation with a single instruction.
Further, when constructing the reverse complementary sequence, the sketch construction module in the processing module constructs it using a lookup table;
the lookup table is used as follows: the ASCII code of a base is divided by a given integer to obtain the remainder, and the complementary base corresponding to that remainder is found in the lookup table and stored in an array.
Further, the sketch construction module in the processing module uses manual vectorization when determining the maximum number of leading zeros for each bucket index.
Further, when the distance calculation module in the processing module calculates the distance between two base sequences, a Boolean variable is used to determine whether the distance between the two base sequences has already been calculated; if so, the stored distance result is obtained directly; otherwise, the distance is calculated and the result is stored.
Further, the distance calculation module in the processing module calculates the intersection of two gene sequences using a combination of automatic vectorization and manual vectorization.
Further, the processing module performs formatting parsing of the biological sequence.
Further, after the output module writes the content in the block out to the disk, the empty block is released back to the data pool.
A second aspect of the present invention provides a biological sequence analysis method based on the biological sequence analysis system of the first aspect, comprising the steps of:
the input module acquires a biological sequence;
the processing module analyzes the biological sequence;
the output module outputs the analysis result.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a biological sequence analysis system based on a producer-consumer model. The system is built on two producer-consumer models, and in addition the processing module adopts a multithreaded dynamic partitioning strategy, so that the system has better thread scalability and achieves better load balancing.
The invention provides a biological sequence analysis system based on a producer-consumer model that achieves highly optimized HyperLogLog biological sequence analysis by combining a series of optimization techniques, such as the lightweight input-processing-output IO framework, optimization of the update module and optimization of the distance module, with knowledge of bioinformatics data analysis.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a diagram of the input-processing-output lightweight IO framework according to a first embodiment of the present invention;
FIG. 2 is a diagram of the operation of an input module and a processing module according to a first embodiment of the present invention;
FIG. 3 is a diagram of the operation of a processing module and an output module according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of vectorization according to a first embodiment of the present invention;
FIG. 5 is a diagram illustrating thread scalability according to a first embodiment of the present invention;
FIG. 6 is a schematic diagram of the performance improvement according to a first embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments of the present invention and features of the embodiments may be combined with each other without conflict, and the present invention will be further described with reference to the drawings and embodiments.
Term interpretation:
SSE2: Streaming SIMD Extensions 2, the second version of the streaming single-instruction, multiple-data extension instruction set.
AVX: Advanced Vector Extensions. AVX2 and AVX512 are both extensions of the AVX instruction set; AVX512 mainly widens the 256-bit vectors to 512 bits, improving data-level parallelism.
IO: Input/Output.
OpenMP: short for Open Multi-Processing, an application programming interface (API) that can be used to explicitly direct multithreaded, shared-memory parallelism.
Sketch: a compact, approximate summary derived from the base sequences of a genome that can represent or summarize the raw data.
Example 1
The first embodiment provides a biological sequence analysis system based on a producer-consumer model, obtained by optimizing the existing HyperLogLog software. Single-thread performance is first optimized through techniques such as tuning compiler options and parameters, eliminating redundant memory copies, eliminating redundant computation, optimizing memory access, optimizing computation, and vectorizing with the SSE2 and AVX512 instruction sets. Once single-thread performance is close to its peak, multithreaded optimization is carried out to ensure load balance across threads and a good multithreaded speedup. The optimized HyperLogLog software (the biological sequence analysis system based on a producer-consumer model) is thereby highly optimized.
The biological sequence analysis system based on the producer-consumer model is oriented to a multi-core platform.
The biological sequence analysis system based on the producer-consumer model provided by this embodiment comprises an input module for acquiring a biological sequence, a processing module for analyzing the biological sequence and an output module for outputting an analysis result.
The analysis result is the similarity or distance between biological sequences.
The processing module comprises a sketch construction (update) module and a distance calculation (distance) module.
The optimization of the existing HyperLogLog software in this embodiment consists mainly of three parts: constructing an input-processing-output lightweight IO framework, optimizing the update module, and optimizing the distance module.
(1) An input-processing-output lightweight IO framework is constructed.
(101) The producer-consumer model is arranged between the input module and the processing module and between the processing module and the output module. A producer in the producer-consumer model requests an empty block from the data pool, reads data into the block, and places the filled block into a data queue; the consumer takes a filled block from the queue, uses the data, releases the data when finished, and returns the empty block to the data pool.
As shown in FIG. 1, this embodiment uses two producer-consumer models to build an input-processing-output lightweight IO framework that decouples the input, processing and output modules: one producer-consumer model sits between the input module and the processing module, and another sits between the processing module and the output module. The modules can execute asynchronously and concurrently, and the entire data flow is streamed.
In the producer-consumer model between the input module and the processing module, the input module is the producer and the processing module is the consumer. To guarantee the correctness and ordering of the data read, there can be only one producer; because each processing thread works on different data, there can be multiple consumers. In the original HyperLogLog software, the input module had to perform not only the reading of the biological sequences but also their format parsing, which is much slower than reading. The format parsing is therefore moved into the processing module, which greatly improves the read efficiency of the input module. Because the processing module usually runs many threads concurrently, the added parsing work has little effect on its execution. The load between the input module and the processing module thus becomes more balanced, which is a natural advantage of the producer-consumer model.
In this embodiment, data is stored in blocks (chunks). The input module reads biological sequences into a chunk and, once the chunk is full, places it in a queue for the processing module to use. The idea of a data pool is also used: only a fixed number of chunks are created at startup and they are reused throughout the run, which greatly reduces the overhead of repeatedly allocating and freeing chunk memory.
The interaction among the input module, the chunks and the processing module is shown in FIG. 2. The input module requests an empty chunk from the data pool, reads data (biological sequences) into it, and places the filled chunk into a data queue; the processing module takes a filled chunk from the queue, uses the data, releases the data when finished, and returns the empty chunk to the data pool.
In the producer-consumer model between the processing module and the output module, the processing module acts as the producer and the output module acts as the consumer. To guarantee the ordering and correctness of the data written back, only one write thread is used. The processing module processes data with multiple threads; when processing finishes, a thread obtains an empty chunk, places the results to be written into it, and puts the packed chunk into a queue. The output module takes a chunk from the queue, writes its contents to disk, and then releases the empty chunk back to the data pool. Better load balancing is thereby achieved between the processing module and the output module.
The interaction among the processing module, the chunks and the output module is shown in FIG. 3. The processing module requests an empty chunk from the data pool, places data (analysis results) into it, and puts the filled chunk into a data queue; the output module takes a filled chunk from the queue, uses the data, releases the data when finished, and returns the empty chunk to the data pool.
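The two-stage arrangement described above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the patented implementation: the names `make_pool`, `reader`, `worker`, `writer` and `run_pipeline`, the pool size, the worker count, the use of one pool per stage, and the stand-in "analysis" (sequence length) are all assumptions made for the example.

```python
import queue
import threading

NUM_CHUNKS = 4    # fixed number of reusable chunks per pool (assumed value)
NUM_WORKERS = 3   # number of processing threads (assumed value)
STOP = None       # sentinel marking the end of a stream


def make_pool(n):
    """Data pool: a fixed set of empty chunks that are reused, never reallocated."""
    pool = queue.Queue()
    for _ in range(n):
        pool.put([])
    return pool


def reader(sequences, pool, data_q, chunk_size):
    """Single producer: take an empty chunk, fill it, enqueue it."""
    for start in range(0, len(sequences), chunk_size):
        chunk = pool.get()                      # blocks while no chunk is free
        chunk.extend(sequences[start:start + chunk_size])
        data_q.put(chunk)
    for _ in range(NUM_WORKERS):                # one stop marker per worker
        data_q.put(STOP)


def worker(in_pool, data_q, out_pool, result_q):
    """Consumer of input chunks and producer of result chunks."""
    while True:
        chunk = data_q.get()
        if chunk is STOP:
            result_q.put(STOP)
            return
        out = out_pool.get()
        out.extend((seq, len(seq)) for seq in chunk)  # stand-in for real analysis
        chunk.clear()
        in_pool.put(chunk)                      # return the empty chunk to its pool
        result_q.put(out)


def writer(out_pool, result_q, sink):
    """Single consumer: write out each result chunk, then recycle it."""
    finished = 0
    while finished < NUM_WORKERS:
        chunk = result_q.get()
        if chunk is STOP:
            finished += 1
            continue
        sink.extend(chunk)                      # stand-in for writing to disk
        chunk.clear()
        out_pool.put(chunk)


def run_pipeline(sequences, chunk_size=2):
    in_pool, out_pool = make_pool(NUM_CHUNKS), make_pool(NUM_CHUNKS)
    data_q, result_q = queue.Queue(), queue.Queue()
    sink = []
    threads = [threading.Thread(target=reader,
                                args=(sequences, in_pool, data_q, chunk_size))]
    threads += [threading.Thread(target=worker,
                                 args=(in_pool, data_q, out_pool, result_q))
                for _ in range(NUM_WORKERS)]
    threads.append(threading.Thread(target=writer,
                                    args=(out_pool, result_q, sink)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

Because several workers run concurrently, results may reach the writer in a different order from the input; a caller that needs input order would have to tag each chunk with a sequence number, which this sketch omits.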
(102) The input module and the output module both adopt single threads to process data, and the processing module adopts multithreading to dynamically divide and process data.
The processing module is multithreaded. Investigation shows that, in both the update module and the distance module, the HyperLogLog software has strong dependencies between successive data items, loop-carried dependencies between loops and a relatively large amount of branching, so task-level multithreading cannot be realized and only data-level multithreading is possible.
The processing module uses OpenMP for multithreaded optimization. The performance of three data allocation strategies was compared: block partitioning, cyclic partitioning and dynamic partitioning.
Block partitioning has the lowest overhead and is the simplest for the system when assigning work to threads; the division of loop iterations can be completed at compile time. However, because base sequences vary in size, the workload of the loop blocks also varies: some threads are assigned blocks with little work while others are assigned blocks with a great deal, so the load is unbalanced. Block partitioning therefore suits scenarios in which every loop iteration has nearly the same workload, and it is not suitable here.
Cyclic partitioning also has low overhead and is simple for the system when assigning work to threads; the division of loop iterations can be completed at compile time. Of all static partitioning strategies, cyclic partitioning comes closest in expectation to the load balance of dynamic partitioning. In extreme cases, however, some threads may still be assigned little work and others a great deal, resulting in load imbalance. Cyclic partitioning suits loops whose workload rises or falls steadily with the iteration number, so it is not suitable for this scenario either.
With dynamic partitioning, the compiler cannot determine at compile time which iterations a thread will receive; this is decided only at run time. For a given loop, multithreaded dynamic partitioning works as follows: if a thread is idle, the current iteration is assigned to it; if all threads are busy, the assignment of iterations waits until a thread becomes idle again; once all iterations have been assigned, all threads wait for the work assigned to them to complete. A dynamic partitioning strategy requires the system to keep handing loop iterations to idle threads while the program runs and to keep monitoring whether any thread is idle. Its overhead is therefore relatively high and the system's task allocation is relatively complex. In the present scenario, however, biological sequence data is frequently uneven, so the performance gained from better-balanced threads usually outweighs the overhead of task allocation. This scenario therefore adopts the dynamic partitioning strategy.
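The trade-off among the three strategies can be illustrated with a small deterministic simulation. This is a sketch in Python rather than OpenMP itself; in OpenMP the three strategies correspond roughly to the `schedule(static)`, `schedule(static, 1)` and `schedule(dynamic)` clauses, though the text does not state which clauses were actually used. The function names and the per-iteration cost list are hypothetical, and scheduling overhead is ignored.

```python
import heapq


def block_partition(costs, n_threads):
    """Static block partitioning: each thread gets one contiguous slice."""
    size = -(-len(costs) // n_threads)  # ceiling division
    return [sum(costs[t * size:(t + 1) * size]) for t in range(n_threads)]


def cyclic_partition(costs, n_threads):
    """Static cyclic partitioning: iteration i goes to thread i % n_threads."""
    return [sum(costs[t::n_threads]) for t in range(n_threads)]


def dynamic_partition(costs, n_threads):
    """Dynamic partitioning: the next iteration goes to whichever thread
    becomes idle first, simulated with a min-heap of finish times."""
    finish = [(0, t) for t in range(n_threads)]
    heapq.heapify(finish)
    for cost in costs:
        busy_until, t = heapq.heappop(finish)
        heapq.heappush(finish, (busy_until + cost, t))
    loads = [0] * n_threads
    for busy_until, t in finish:
        loads[t] = busy_until
    return loads


# Hypothetical per-iteration workloads, uneven like sequences of mixed length.
costs = [8, 7, 1, 1, 6, 1, 1, 1]
block = block_partition(costs, 2)      # per-thread load [17, 9]
cyclic = cyclic_partition(costs, 2)    # per-thread load [16, 10]
dynamic = dynamic_partition(costs, 2)  # per-thread load [12, 14]
```

For this input the slowest thread finishes at 17, 16 and 14 time units respectively, which shows why dynamic partitioning can pay off when iteration costs are uneven.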
In this embodiment, two lightweight IO frameworks based on the producer-consumer model are built on a data pool; the input module and the output module each process data with a single thread, and the processing module processes data with multithreaded dynamic partitioning. Decoupling, asynchronous execution and load balancing are achieved among the input, processing and output modules, and the whole data flow forms a streaming pipeline, which solves the low IO efficiency of the original HyperLogLog software.
(2) Optimization of update module.
(201) The update module promotes variables whose frequency of use during data processing exceeds a threshold to register variables stored in registers.
Ordinary variables in the HyperLogLog software are stored in memory, where reads and writes are relatively expensive. If a variable is used frequently, such as the loop variable of a for loop, the program incurs considerable overhead on variable access alone, which hurts performance. Register variables are stored in CPU registers, so declaring hotspot variables with the register keyword can make the code more efficient.
This embodiment adds the register keyword to frequently used variables. Note, however, that register is only a suggestion to the compiler to promote an ordinary variable to a register variable; whether the variable is actually promoted depends on the compiler.
(202) When converting uppercase letters to lowercase letters, the update module uses vector registers to pack multiple data items that undergo the same operation into one vector and completes the operation with a single instruction; that is, the uppercase-to-lowercase conversion is optimized with vectorization.
Vectorization means using dedicated vector registers so that a single instruction applies the same operation to multiple data items, which can greatly speed up data processing. Take the addition of eight pairs of data shown in FIG. 4. With conventional SISD (single instruction, single data), eight computations are required, i.e., 8 data pairs and 8 instructions: A0+B0=C0, A1+B1=C1, A2+B2=C2, A3+B3=C3, A4+B4=C4, A5+B5=C5, A6+B6=C6, A7+B7=C7. With vectorized SIMD (single instruction, multiple data), the eight data items are packed into one vector held in a vector register and the calculation is completed with a single instruction, i.e., 8 data pairs and 1 instruction: (A0, A1, A2, A3, A4, A5, A6, A7) + (B0, B1, B2, B3, B4, B5, B6, B7) = (C0, C1, C2, C3, C4, C5, C6, C7).
In this embodiment, three vectorized implementations are provided: SSE2, AVX2 and AVX512. Different machines support different vector register widths: 128 bits, 256 bits or 512 bits. SSE2 provides 128-bit vector instructions, AVX2 provides 256-bit vector instructions, and AVX512 provides 512-bit vector instructions. One of SSE2, AVX2 and AVX512 is selected dynamically according to the vector register width supported by the user's machine.
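The idea of packing several characters into one wide value and converting them with a single operation can be imitated without SIMD intrinsics. The sketch below packs eight ASCII bytes into one 64-bit integer and sets bit 0x20 in every byte with one OR, which maps uppercase ASCII letters to lowercase. It is only an analogy to the SSE2/AVX code described above: the function name `to_lower_packed` and the lane count are assumptions, and the trick is valid only when the input consists solely of ASCII letters, as base sequences such as A, C, G, T, N do.

```python
LANES = 8                                          # bytes handled per operation
MASK = int.from_bytes(b"\x20" * LANES, "little")   # 0x20 repeated in every lane


def to_lower_packed(seq: bytes) -> bytes:
    """Lowercase an all-letter ASCII byte string, eight bytes per OR."""
    out = bytearray()
    full = len(seq) - len(seq) % LANES
    for i in range(0, full, LANES):
        word = int.from_bytes(seq[i:i + LANES], "little")    # pack 8 lanes
        out += (word | MASK).to_bytes(LANES, "little")       # one OR for all lanes
    out += bytes(b | 0x20 for b in seq[full:])               # scalar tail
    return bytes(out)
```

A real vectorized implementation would use 16, 32 or 64 lanes per instruction depending on whether SSE2, AVX2 or AVX512 is available.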
(203) And when the update module constructs the reverse complementary sequence, constructing by using a lookup table. Specifically, 8 residues are divided by the ASCII code of a certain base, and the complementary base corresponding to the remainder is found in the lookup table and stored in an array.
When the reverse complementary sequence is constructed, the reverse complementary base conversion is carried out by utilizing a lookup table mode, and compared with the data calculation, the efficiency of the lookup table mode is obviously higher.
As shown in table 1, since the ASCII codes of five bases of A, T, C, G, N are not equal in number by dividing 8 by the remainder, it is only necessary to construct an array of length 8 (the result of dividing ASCII of five base letters by 8 is 7 at maximum, the array index range of length 8 is 0-7 and can be covered perfectly, the requirement is satisfied and space is not wasted), and complementary bases are stored at the corresponding index positions index. And counting once and looking up the table once to obtain the complementary base of one base.
Table 1, look-up table
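The lookup can be sketched in Python as follows. The remainders follow directly from the ASCII codes (A=65, C=67, G=71, N=78, T=84 give 1, 3, 7, 6 and 4), so the complements are stored at indices 1, 3, 4, 6 and 7 of an 8-entry array. The names COMPLEMENT, TABLE and reverse_complement are hypothetical, and the exact layout of the original table 1 is not reproduced here.

```python
# Complement pairs; N (unknown base) is assumed to map to itself.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}

# Build the 8-entry table indexed by (ASCII code mod 8).
TABLE = [None] * 8
for base, comp in COMPLEMENT.items():
    TABLE[ord(base) % 8] = comp

def reverse_complement(seq):
    """One remainder and one table lookup per base, walking the sequence backwards."""
    return "".join(TABLE[ord(base) % 8] for base in reversed(seq))
```

Because uppercase and lowercase ASCII letters differ by 32, which is a multiple of 8, lowercase bases leave the same remainders and hit the same table entries; in this sketch the output is always uppercase.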
(204) The update module passes pointer references when extracting sliding-window segments from the biological sequence.
Extracting sliding-window segments by copying arrays is inefficient: memory utilization is low and redundant copies are made. It was found that the copied content is never modified, only read. Therefore, this embodiment passes pointer references instead, which eliminates the redundant memory copies and greatly improves memory utilization.
(205) The update module uses manual vectorization when obtaining, from the 64-bit binary value produced by Murmurhash3, the maximum leading-zero count for the corresponding bucket index.
Obtaining the maximum leading-zero count for the corresponding bucket index from the 64-bit binary value produced by Murmurhash3 involves no data dependence and no loop-carried dependence. This embodiment therefore vectorizes this step manually, processing several values at a time and obtaining the maximum leading-zero counts for several bucket indices at once, which improves efficiency.
(3) Optimizing the distance module.
(301) When calculating the distance between two base sequences, the distance module uses a Boolean variable to determine whether the distance between the two base sequences has already been calculated and its result stored. If so, the stored result is used directly; otherwise, the distance is calculated and the result is stored.
When distances between different base sequences are calculated, redundant calculation occurs for base sequences that are involved repeatedly. In this embodiment, a Boolean variable records whether the calculation was already performed, and a result obtained, during an earlier distance calculation. If so, the stored result is used directly; otherwise, the result is calculated and saved, and the value of the Boolean variable is changed. In this way, redundant computation in the distance module can be greatly reduced.
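The difference between copying and referencing can be sketched in Python with memoryview, which plays the role of the pointer reference here: each window is a view onto the original buffer, not a copy. The function name windows and the window length k are hypothetical.

```python
def windows(seq, k):
    """Yield every length-k window of seq as a zero-copy view.

    seq is assumed to be a bytes-like object; slicing a memoryview
    creates a new view onto the same buffer without copying bytes.
    """
    view = memoryview(seq)
    for start in range(len(view) - k + 1):
        yield view[start:start + k]
```

Since the windows are only read, sharing the underlying buffer is safe; a copy is made only if a caller explicitly converts a window with bytes(window).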
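A scalar Python sketch of the per-value computation is given below, under assumptions that the text does not state: the top p bits of the 64-bit hash are taken as the bucket index, and the leading zeros are counted in the remaining 64 - p bits. The hash values are passed in directly rather than computed with Murmurhash3, and the names update_sketch and build_sketch are hypothetical. Standard HyperLogLog registers usually store this count plus one; the sketch follows the wording of the text and stores the count itself.

```python
def update_sketch(sketch, hash64, p):
    """Fold one 64-bit hash value into the sketch (assumed bit layout)."""
    rest_bits = 64 - p
    index = hash64 >> rest_bits                  # top p bits: bucket index
    rest = hash64 & ((1 << rest_bits) - 1)       # remaining bits
    leading_zeros = rest_bits - rest.bit_length()
    if leading_zeros > sketch[index]:            # keep the maximum per bucket
        sketch[index] = leading_zeros

def build_sketch(hashes, p):
    """Build a sketch of 2**p buckets from an iterable of 64-bit hash values."""
    sketch = [0] * (1 << p)
    for h in hashes:
        update_sketch(sketch, h, p)
    return sketch
```

The index and the leading-zero count of one hash value do not depend on any other hash value, which is what allows several values to be processed side by side in vector lanes.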
The initial value of the Boolean variable is false. When the distance between two base sequences is calculated for the first time, the variable is found to be false, so the distance is calculated, the result is stored, and the variable is set to true. When the distance between the two base sequences is requested again, the variable is found to be true, so the stored result is used directly and no recalculation is needed.
(302) The distance module uses statistics arrays to count how many times each maximum leading-zero count in one biological sequence sketch is larger than, smaller than, or equal to the maximum leading-zero count at the same position in the other biological sequence sketch.
The update module produces the sketch arrays sketch1 and sketch2 of two biological sequences, each of which stores a number of maximum leading-zero counts. The arrays c1g and c1l count the cases in which the values in sketch1 are larger than and smaller than the values in sketch2, respectively; the arrays c2g and c2l count the cases in which the values in sketch2 are larger than and smaller than the values in sketch1, respectively.
For the sketches constructed by the update module, the number of times each maximum leading-zero count in one biological sequence sketch is larger than, smaller than, or equal to the count at the same position in the other sketch is tallied. Collecting these statistics is extremely time-consuming, so this embodiment vectorizes it. To keep the results correct, it must be taken into account that data conflicts can arise when adjacent data are counted together, for example when the values at two adjacent positions point to the same counter. For this reason, this embodiment adds multiple statistics arrays so that each position in the vector has a statistics array entirely of its own (before vectorization, a single statistics array handles one count at a time; after vectorization, 8 counts are handled at a time, so each of the 8 vectorized positions must have its own statistics array). During vectorization, each position in the vector is given an offset, which locates that position's own statistics array and ensures that the statistics arrays never conflict. After counting is complete, the multiple statistics arrays are summed and merged, again using vectorization, to obtain the final correct result. This speeds up data processing while guaranteeing a correct result.
(303) The distance module combines automatic vectorization and manual vectorization to calculate the intersection of gene sequence A and gene sequence B.
When calculating the intersection of gene sequence A and gene sequence B, an intersection sketch must be calculated first. The calculation of the arrays countsA×Bhalf (gene sequence A plus half of gene sequence B) and countsB×Ahalf (gene sequence B plus half of gene sequence A) was found to involve no data dependence and no loop-carried dependence, so vectorization is possible here as well. With the compiler's auto-vectorization option enabled and the auto-vectorization report printed, it was found that this loop (the loop that calculates the countsA×Bhalf and countsB×Ahalf arrays) is not auto-vectorized, so this embodiment vectorizes it manually. Using array expansion with padding, shifting and similar methods, efficient vectorization is achieved with a vector instruction set (SSE2, AVX2 or AVX512).
Existing Hyperloglog software scales poorly with thread count because work is divided and distributed unreasonably among the input module, the processing module and the output module. As shown in FIG. 5, with the lightweight IO framework built from two producer-consumer models, and with the processing module additionally adopting a multithreaded dynamic partitioning strategy, the optimized Hyperloglog software scales better with thread count.
This embodiment relies on a series of optimization techniques, such as the lightweight IO framework for the input, processing and output modules, the optimization of the update module and the optimization of the distance module, together with knowledge of bioinformatics data analysis. As shown in fig. 6, this embodiment applies code optimization techniques from high-performance computing and computer architecture, along with basic knowledge of bioinformatics data analysis, to the existing Hyperloglog comparative genomics software, and finally implements a highly optimized biological sequence analysis system.
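The flag logic described above can be sketched in Python. The class PairDistance, its attribute names and the placeholder metric mismatch_fraction are hypothetical; the actual distance used by the system is not specified in this passage.

```python
class PairDistance:
    """Distance between two sequences, computed at most once."""

    def __init__(self, seq_a, seq_b, distance_fn):
        self.seq_a = seq_a
        self.seq_b = seq_b
        self.distance_fn = distance_fn
        self.computed = False      # the Boolean variable; initially false
        self.result = None         # stored distance calculation result
        self.evaluations = 0       # bookkeeping for the example only

    def get(self):
        if not self.computed:
            # First request: calculate, store, and flip the flag to true.
            self.result = self.distance_fn(self.seq_a, self.seq_b)
            self.evaluations += 1
            self.computed = True
        # Later requests return the stored result directly.
        return self.result

def mismatch_fraction(a, b):
    """Placeholder metric (an assumption): fraction of differing positions."""
    return sum(x != y for x, y in zip(a, b)) / len(a)
```

For example, calling get() twice on PairDistance("ACGT", "ACGA", mismatch_fraction) evaluates the metric only once.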
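The private-array-plus-offset idea can be simulated in plain Python. The sketch below assumes that each statistics array is a histogram indexed by register value, that 8 lanes are used, and that register values do not exceed 64; none of these details is stated in the text. The function name compare_counts and the extra array name ceq (for equal values) are hypothetical. The loop runs one element at a time, but each element only ever touches the slice belonging to its own lane, which is the property that makes a real vectorized version conflict-free.

```python
LANES = 8        # assumed number of vector lanes
MAX_VALUE = 64   # assumed upper bound on a register value

def compare_counts(sketch1, sketch2):
    """Histogram the >, < and == cases using one private array per lane."""
    size = MAX_VALUE + 1
    names = ("c1g", "c1l", "c2g", "c2l", "ceq")
    # One flat array per statistic, holding LANES private histograms back to back.
    private = {name: [0] * (LANES * size) for name in names}
    for i, (a, b) in enumerate(zip(sketch1, sketch2)):
        offset = (i % LANES) * size       # start of this lane's own histogram
        if a > b:
            private["c1g"][offset + a] += 1
            private["c2l"][offset + b] += 1
        elif a < b:
            private["c1l"][offset + a] += 1
            private["c2g"][offset + b] += 1
        else:
            private["ceq"][offset + a] += 1
    # Merge step: add the LANES private histograms into one final histogram.
    return {
        name: [sum(arr[lane * size + v] for lane in range(LANES))
               for v in range(size)]
        for name, arr in private.items()
    }
```

The merged histograms are identical to what a single shared array would produce, because addition is independent of the order in which counts are accumulated.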
Example two
An objective of the second embodiment is to provide a biological sequence analysis method based on the biological sequence analysis system of the first embodiment, which includes the following steps:
the input module acquires a biological sequence;
the processing module analyzes the biological sequence;
the output module outputs the analysis result.
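The three steps, connected by producer-consumer queues, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a single pool of reusable blocks serves the input stage, results are passed to the output stage directly, and the "analysis" is a placeholder that returns the sequence length. The names run_pipeline, pool, in_q and out_q are hypothetical.

```python
import queue
import threading

def run_pipeline(sequences, n_workers=4, pool_size=8):
    pool = queue.Queue()                 # data pool of empty blocks
    for _ in range(pool_size):
        pool.put([])
    in_q = queue.Queue()                 # input module -> processing module
    out_q = queue.Queue()                # processing module -> output module
    results = []
    STOP = None                          # sentinel marking end of data

    def input_module():                  # single thread, producer
        for seq in sequences:
            block = pool.get()           # apply for an empty block
            block.append(seq)            # fill it with data
            in_q.put(block)              # enqueue the filled block
        for _ in range(n_workers):
            in_q.put(STOP)

    def processing_module():             # several threads pull blocks dynamically
        while True:
            block = in_q.get()
            if block is STOP:
                out_q.put(STOP)
                return
            seq = block.pop()            # use the data, leaving the block empty
            pool.put(block)              # return the empty block to the pool
            out_q.put((seq, len(seq)))   # placeholder analysis result

    def output_module():                 # single thread, consumer
        finished = 0
        while finished < n_workers:
            item = out_q.get()
            if item is STOP:
                finished += 1
            else:
                results.append(item)

    threads = [threading.Thread(target=input_module),
               threading.Thread(target=output_module)]
    threads += [threading.Thread(target=processing_module)
                for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Because each worker takes the next available block as soon as it is free, the work is divided dynamically rather than assigned in fixed shares, and the bounded pool limits how much data is in flight at once.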
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (4)

1. A biological sequence analysis system based on a producer-consumer model, comprising an input module for acquiring a biological sequence, a processing module for analyzing the biological sequence, and an output module for outputting an analysis result, characterized in that: the producer-consumer model is arranged between the input module and the processing module and between the processing module and the output module; the input module and the output module both adopt single threads to process data, and the processing module adopts multithreading to dynamically divide and process the data;
the producer in the producer-consumer model applies for empty blocks from the data pool, reads data and puts the data into the blocks, generates blocks containing the data, and puts the blocks containing the data into a data queue;
the consumer in the producer-consumer model obtains the block containing the data from the data queue for self use, releases the data after the use is finished, and returns the empty block to the data pool;
the sketch construction module in the processing module promotes variables whose use frequency during data processing exceeds a threshold to register variables stored in a register; when converting uppercase letters to lowercase, the sketch construction module in the processing module uses a vector register to pack a plurality of data items that undergo the same operation into a vector, and completes the operation with a single instruction; the sketch construction module in the processing module uses a lookup table when constructing the reverse complementary sequence; the lookup table is used as follows: the ASCII code of a base is divided by a certain integer and the remainder is taken, and the complementary base corresponding to the remainder is found in the lookup table and stored in an array; the sketch construction module in the processing module passes pointer references when extracting sliding-window segments from the biological sequence; the sketch construction module in the processing module uses manual vectorization when obtaining the maximum leading-zero count for the bucket index;
when the distance calculation module in the processing module calculates the distance between two base sequences, a Boolean variable is used to determine whether the distance between the two base sequences has already been calculated and a distance calculation result obtained; if so, the distance calculation result is used directly; otherwise, the distance is calculated and the distance calculation result is stored; the distance calculation module in the processing module uses statistics arrays to count the number of times each maximum leading-zero count of one biological sequence sketch is larger than, smaller than, and equal to the maximum leading-zero count at the same position of the other biological sequence sketch; the distance calculation module in the processing module calculates the intersection of two gene sequences by combining automatic vectorization and manual vectorization.
2. A producer-consumer model based biological sequence analysis system according to claim 1, wherein: the processing module performs formatting and parsing of the biological sequence.
3. A producer-consumer model based biological sequence analysis system according to claim 1, wherein: after the output module writes the content of a block out to disk, it releases the empty block back to the data pool.
4. A biological sequence analysis method based on the biological sequence analysis system according to any one of claims 1 to 3, characterized in that: the method comprises the following steps:
the input module acquires a biological sequence;
the processing module analyzes the biological sequence;
the output module outputs the analysis result.
CN202310375440.4A 2023-04-11 2023-04-11 Biological sequence analysis system and method based on producer-consumer model Active CN116092587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310375440.4A CN116092587B (en) 2023-04-11 2023-04-11 Biological sequence analysis system and method based on producer-consumer model

Publications (2)

Publication Number Publication Date
CN116092587A CN116092587A (en) 2023-05-09
CN116092587B true CN116092587B (en) 2023-08-18

Family

ID=86212335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310375440.4A Active CN116092587B (en) 2023-04-11 2023-04-11 Biological sequence analysis system and method based on producer-consumer model

Country Status (1)

Country Link
CN (1) CN116092587B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117393046B (en) * 2023-12-11 2024-03-19 山东大学 Space transcriptome sequencing method, system, medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496762A (en) * 2021-05-20 2021-10-12 山东大学 Biological gene sequence summary data generation method and system
CN114064551A (en) * 2022-01-17 2022-02-18 广州嘉检医学检测有限公司 CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method
CN114420210A (en) * 2022-03-28 2022-04-29 山东大学 Rapid trimming method and system for biological sequencing sequence
CN114420215A (en) * 2022-03-28 2022-04-29 山东大学 Large-scale biological data clustering method and system based on spanning tree
WO2022267867A1 (en) * 2021-06-23 2022-12-29 深圳华大基因股份有限公司 Gene sequencing analysis method and apparatus, and storage medium and computer device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
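Application of a producer-consumer two-dimensional queue model in a public opinion monitoring system; Lei Longyan; Wan Yaping; Xu Qiang; Yang Xiaohua; Journal of University of South China (Natural Science Edition), (03); pp. 61-65 *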

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant