CN114420210A - Rapid trimming method and system for biological sequencing sequence - Google Patents

Rapid trimming method and system for biological sequencing sequence

Info

Publication number
CN114420210A
Authority
CN
China
Prior art keywords
data
sequence
thread
sequencing sequence
biological sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210308606.6A
Other languages
Chinese (zh)
Other versions
CN114420210B (en)
Inventor
刘卫国
王明凯
殷泽坤
张�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202210308606.6A
Publication of CN114420210A
Application granted
Publication of CN114420210B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5011 Pool
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5018 Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a rapid trimming method and system for biological sequencing sequences, belonging to the technical field of bioinformatics. The scheme comprises: obtaining a biological sequencing sequence to be trimmed; and performing a read operation, a trim operation, and a write operation on the biological sequencing sequence, wherein the read operation, the trim operation, and the write operation are decoupled based on a producer-consumer model so that they execute asynchronously, and the formatting of the biological sequencing sequence is moved from the read operation to the trim operation.

Description

Rapid trimming method and system for biological sequencing sequence
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a rapid trimming method and system for biological sequencing sequences.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In next-generation sequencing (NGS), the nucleic acid fragment to be sequenced is ligated to an adapter sequence (Adapter) so that the sequencer can recognize it. However, when the fragment is shorter than the read length of the sequencing run, the sequenced fragment (called a Read) contains both the nucleic acid sequence to be sequenced and all or part of the adapter sequence. In addition, the reliability of NGS results decreases in the tail cycles, so some low-quality sequence data is produced. Reads contaminated with sequencing adapters or low-quality bases often lead to unsatisfactory results in downstream analysis (e.g., gene sequence alignment), so trimming (Trim) the adapters and low-quality data from the reads is an indispensable step before downstream analysis.
As modern sequencers advance, throughput and read length keep increasing, which poses new challenges for trimming. The performance (e.g., speed and accuracy) of some current tools can hardly keep up with sequencing data of this scale, so the preprocessing step becomes a bottleneck of data analysis, and an ultrafast, accurate adapter- and quality-trimming tool for NGS data preprocessing is still urgently needed.
Many tools are currently available for trimming sequencing adapters and low-quality data in next-generation sequencing data. These tools adopt different Trim algorithms to implement Adapter-Trim and Quality-Trim on NGS sequencing data.
The inventors found that Ktrim performs better on NGS short-read sequencing data than other tools. However, with the development and spread of solid-state drives and the improvement of disk array technology, the processing rate of Ktrim now falls far short of the peak read/write rate of the disk and cannot meet current performance requirements for biological data preprocessing. Tests also show that Ktrim scales poorly with thread count: its processing speed hardly improves beyond four threads.
Disclosure of Invention
To solve the above problems, the invention provides a rapid trimming method and system for biological sequencing sequences. The scheme uses a lightweight I/O framework to greatly increase the I/O rate and, on that basis, uses vectorization to optimize the Adapter search, effectively improving the processing efficiency of rapid trimming of biological sequencing sequences.
According to a first aspect of the embodiments of the present invention, there is provided a method for rapid trimming of a biological sequencing sequence, comprising:
obtaining a biological sequencing sequence to be trimmed;
performing a read operation, a trim operation, and a write operation on the biological sequencing sequence; the read operation, the trim operation and the write operation are decoupled based on a producer-consumer model, and asynchronous execution is realized; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trim operation.
Furthermore, the read operation, the trim operation, and the write operation are each implemented by independent threads, with one read thread, one write thread, and one or more trim threads.
Further, in the read operation, a read thread reads the biological sequencing sequence in blocks and stores each block object that has been read into a first data queue.
Further, in the trim operation, a trim thread obtains data from the first data queue, formats the biological sequencing sequence, removes low-quality base sequences and adapter sequences from it, and stores the processed sequence into a second data queue.
Further, locating the adapter sequence in the trim thread comprises: treating each base in the biological sequencing sequence as a character; and, using a vector register, obtaining the position of the adapter sequence within sequence data of a preset length through several bit operations.
Further, in the write operation, a write thread obtains the processed biological sequencing sequence from the second data queue and stores it.
According to a second aspect of the embodiments of the present invention, there is provided a rapid trimming system for a biological sequencing sequence, comprising:
a data acquisition unit for acquiring a biological sequencing sequence to be trimmed;
a data processing unit for performing a read operation, a trim operation, and a write operation on the biological sequencing sequence; the read operation, the trim operation and the write operation are decoupled based on a producer-consumer model, and asynchronous execution is realized; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trim operation.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a rapid trimming method and system for biological sequencing sequences. The scheme uses a lightweight I/O framework to greatly increase the I/O rate and, on that basis, uses vectorization to optimize the Adapter search, effectively improving the processing efficiency of rapid trimming of biological sequencing sequences.
(2) The scheme solves the low I/O efficiency of Ktrim and raises the input and output rate of file data to, or close to, the peak read/write performance of the disk. Decoupling, asynchronous execution, and speed balancing among the read thread, the worker threads, and the write thread are achieved with a producer-consumer model. Meanwhile, data formatting is moved from the read thread to the worker threads, and preparation of the data to be written is moved from the write thread to the worker threads, so that the read thread and the write thread run as fast as possible.
(3) The scheme uses the data pool DataPool to reuse chunk objects, which reduces object creation and destruction and therefore the cost of creating objects. Meanwhile, when data is passed among the read thread, the worker threads, and the write thread, pointers to the data are used wherever possible instead of copying the data, which effectively reduces memory-copy overhead.
(4) The read thread, the worker threads (i.e., the trim threads), and the write thread are each created only once for the whole workflow and are destroyed only after all of their tasks are completed, which effectively reduces thread-creation overhead.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.
Fig. 1 is a schematic diagram of the flow of chunk objects between the read thread and the worker thread (i.e., the Trim thread) according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the relationship between the worker thread (i.e., the Trim thread) and the write thread according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the lightweight I/O framework according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the overall framework according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Interpretation of terms:
Adapter: the adapter sequence, i.e., an additional sequence fragment that the sequencing technology or sequencing platform reads out in the course of sequencing a biological sequence.
Trim: an important step in the quality control of sequencing data, which mainly removes low-quality base sequences and Adapter sequences.
The first embodiment is as follows:
The purpose of this embodiment is to provide a rapid trimming method for biological sequencing sequences.
First, it should be noted that the present invention is an improvement on the Ktrim tool, which performs better on NGS short-read sequencing data than several other existing tools (such as Trimmomatic and fastp). Tests were run on a compute server equipped with an Intel Xeon CPU and a standard 64-bit CentOS system. With single-end test data (ten million reads in FASTQ format, about 2.1 GB), Ktrim reaches its best performance with four threads: a processing time of 7.26 seconds, i.e., a processing rate of about 300 MB/s. With paired-end test data (two single-end sequencing data files totaling twenty million reads in FASTQ format, about 4.2 GB), it also reaches its best performance with four threads: a processing time of 11.29 seconds, i.e., a processing rate of about 370 MB/s. The author of Ktrim reports in the paper (Sun K. Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data) that Ktrim, while maintaining high accuracy, is 2-18 times faster than several other tools (Trim Galore, Trimmomatic, SeqPurge).
Although Ktrim performs better than other tools, with the development and spread of solid-state drives and the improvement of disk array technology, its processing rate now falls far short of the peak read/write rate of the disk and cannot meet current performance requirements for biological data preprocessing. Tests also show that Ktrim scales poorly with thread count: its processing speed hardly improves beyond four threads. Ktrim therefore still has considerable room for improvement in computing performance.
Analysis and testing show that the performance bottleneck of Ktrim lies mainly in parsing the FASTQ file: an inefficient I/O scheme limits the throughput of the program, which is a typical I/O-intensive application. In addition, the Adapter search in Ktrim is relatively time-consuming. To address these issues, a rapid trimming method for biological sequencing sequences (hereinafter referred to as the RabbitTrim tool) is provided. Building on Ktrim, the tool uses a lightweight FASTQ data I/O framework (Fig. 3 shows the structure of the lightweight I/O framework), which greatly increases the I/O rate, and uses vectorization to optimize the search for the Adapter (adapter sequence). Together these markedly improve the processing efficiency of the program, which can approach the peak disk read/write rate of the test platform (about 2 GB/s).
Specifically, the performance bottlenecks of the Ktrim tool lie mainly in the I/O of the data file and include:
(1) The read thread is overloaded. In Ktrim, one read thread reads the input data file; it must not only read the string data but also parse it according to the FASTQ format. This formatting work greatly increases the burden on the read thread, so FASTQ-formatted data is produced very slowly.
(2) Too little data is prefetched. Ktrim supports a multi-threaded mode, but in that mode only one extra block of data (one BATCH of data) is prefetched, so the supply of input data still falls behind when the computation runs faster.
(3) The stages of the workflow depend on one another. The read operation and the trim operation (i.e., the Trim operation) are partially parallel, but when their execution rates do not match, one waits for the other. The Trim operation and the write operation run serially: a block of data is processed and output before the next block is processed. The read, Trim, and write operations are therefore strongly interdependent.
(4) The adapter search is slow. Ktrim uses the strstr function in the <string.h> library to find the "seed" of an adapter, which is relatively slow.
In order to solve the above problems, the present invention provides a method for rapid trimming of a biological sequencing sequence, comprising:
obtaining a biological sequencing sequence to be trimmed;
performing a read operation, a trim operation, and a write operation on the biological sequencing sequence; the read operation, the trim operation and the write operation are decoupled based on a producer-consumer model, and asynchronous execution is realized; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trim operation.
Furthermore, the read operation, the trim operation, and the write operation are each implemented by independent threads, with one read thread, one write thread, and one or more trim threads.
Further, in the read operation, a read thread reads the biological sequencing sequence in blocks and stores each block object that has been read into a first data queue.
Further, the creation of block objects follows the data pool concept: only a preset number of block objects are created, and they are reused.
Further, in the trim operation, a trim thread obtains data from the first data queue, formats the biological sequencing sequence, removes low-quality base sequences and adapter sequences from it, and stores the processed sequence into a second data queue.
Further, locating the adapter sequence in the trim thread comprises: treating each base in the biological sequencing sequence as a character; and, using a vector register, obtaining the position of the adapter sequence within sequence data of a preset length through several bit operations.
Further, in the write operation, a write thread obtains the processed biological sequencing sequence from the second data queue and stores it.
Furthermore, the threads corresponding to the read operation, the trim operation, and the write operation are each created only once and are destroyed only after the processing task is completed.
Further, the formatting specifically consists of parsing the data according to the FASTQ format.
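As an illustration of the formatting step that is moved into the trim thread, the following minimal Python sketch parses one raw text chunk into FASTQ records. The names FastqRecord and parse_fastq_chunk are hypothetical, and the sketch assumes that every record occupies exactly four lines and that a chunk ends on a record boundary; it is not taken from the RabbitTrim implementation.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    name: str  # header line without the leading '@'
    seq: str   # base sequence
    qual: str  # per-base quality string, same length as seq


def parse_fastq_chunk(chunk: str) -> List[FastqRecord]:
    """Format one raw text chunk into FASTQ records (four lines each)."""
    lines = chunk.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("chunk does not end on a record boundary")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 2}")
        records.append(FastqRecord(header[1:], seq, qual))
    return records
```

Because this parsing is pure computation on data already in memory, it can run in any number of trim threads while the read thread only copies bytes from the file.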
Specifically, for ease of understanding, the embodiments of the present invention are described in detail below with reference to the accompanying drawings:
The scheme of the invention mainly addresses the problems of the Ktrim tool in two respects, namely the input and output process and the Adapter query process, as follows:
Improvement of the I/O process
For the input and output process, the RabbitTrim tool provided by the invention uses a lightweight I/O framework to handle the input, parsing, and output of data. The framework is based on a producer-consumer model and decouples the read operation, the Trim operation, and the write operation so that they execute asynchronously. Specifically, the process is as follows:
First, a producer-consumer model is established between the read operation and the Trim operation, with the read thread as the producer and the Trim threads as consumers. To keep the read data ordered and correct, there is exactly one producer, while there may be one or more consumers. As mentioned above, the read thread in Ktrim is responsible not only for reading data but also for formatting it, which places a heavy load on the read thread. Because constructing data objects during formatting is slower than reading a string of a given size from the file, the invention moves the formatting work to the consumers (i.e., the Trim threads), which effectively improves the efficiency of the producer (i.e., the read thread). Meanwhile, the speeds of the producer and the consumers can be balanced by increasing the number of Trim threads.
The scheme of the invention reads data in blocks (chunks): the read thread reads data into a chunk object and then stores the chunk object in a queue for the Trim threads to use. The flow of chunk objects between the read thread and the Trim threads is shown in Fig. 1.
Second, a producer-consumer model is also established between the Trim threads and the write thread, with the Trim threads as producers and the write thread as the consumer. To keep the written data correct, the program uses only one write thread. Because writing to disk is slow, the conversion of FASTQ object data into string data is placed in the Trim threads. This relationship is shown in Fig. 2.
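The two producer-consumer stages described above can be sketched with two bounded queues. This is a minimal Python illustration of the structure only: run_pipeline, SENTINEL, the queue sizes, and the use of a chunk index to restore output order are assumptions made for the sketch, and Python threads here show the decoupling rather than the performance of a native implementation.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def run_pipeline(chunks, trim_fn, n_trim_threads=2):
    """Read -> trim -> write, decoupled by two queues."""
    q_in = queue.Queue(maxsize=8)   # first data queue (reader -> trimmers)
    q_out = queue.Queue(maxsize=8)  # second data queue (trimmers -> writer)
    written = {}

    def reader():
        for idx, chunk in enumerate(chunks):
            q_in.put((idx, chunk))  # raw, unformatted chunk
        for _ in range(n_trim_threads):
            q_in.put(SENTINEL)      # one stop marker per trim thread

    def trimmer():
        while True:
            item = q_in.get()
            if item is SENTINEL:
                q_out.put(SENTINEL)
                return
            idx, chunk = item
            q_out.put((idx, trim_fn(chunk)))  # formatting and trimming happen here

    def writer():
        finished = 0
        while finished < n_trim_threads:
            item = q_out.get()
            if item is SENTINEL:
                finished += 1
                continue
            idx, data = item
            written[idx] = data  # stands in for writing to the output file

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=trimmer) for _ in range(n_trim_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [written[i] for i in sorted(written)]
```

A call such as run_pipeline(["acgt", "ttaa"], str.upper, n_trim_threads=3) pushes each chunk through whichever trim thread is free. The bounded queues make a fast stage block rather than run arbitrarily far ahead of a slow one, which is one simple way to obtain the speed balancing mentioned above.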
In summary, for the input and output process, the scheme of the invention improves on the existing Ktrim tool in the following ways:
(1) the low I/O efficiency of Ktrim is effectively solved, and the input and output rate of file data is raised to, or close to, the peak read/write performance of the disk;
(2) decoupling, asynchronous execution, and speed balancing among the read thread, the worker threads, and the write thread are achieved with a producer-consumer model;
(3) data formatting is moved from the read thread to the worker threads, and preparation of the data to be written is moved from the write thread to the worker threads, so that the read thread and the write thread run as fast as possible;
(4) chunk objects are reused through the data pool DataPool, which reduces object creation and destruction and therefore the cost of creating objects;
(5) when data is passed among the read thread, the worker threads (i.e., the Trim threads), and the write thread, pointers to the data are used wherever possible instead of copying the data, which effectively reduces memory-copy overhead;
(6) the read thread, each worker thread, and the write thread are created only once for the whole workflow and are destroyed only after all of their tasks are completed, which reduces thread-creation overhead.
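The chunk reuse in item (4) can be sketched as a small object pool. Only the name DataPool comes from the text; the Chunk class, the acquire/release methods, and fill_from are hypothetical names introduced for this Python illustration, which assumes a fixed chunk capacity.

```python
import io
import queue


class Chunk:
    """Reusable buffer holding one block of raw file data."""

    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)
        self.size = 0  # number of valid bytes currently in buf


class DataPool:
    """Fixed set of Chunk objects that are acquired, filled, and released."""

    def __init__(self, n_chunks: int, capacity: int):
        self._free = queue.Queue()
        for _ in range(n_chunks):
            self._free.put(Chunk(capacity))

    def acquire(self) -> Chunk:
        return self._free.get()  # blocks while every chunk is in use

    def release(self, chunk: Chunk) -> None:
        chunk.size = 0
        self._free.put(chunk)


def fill_from(stream, chunk: Chunk) -> int:
    """Read the next block of the stream into an existing chunk buffer."""
    chunk.size = stream.readinto(chunk.buf)
    return chunk.size


# Usage: a 6-byte stream is read in 4-byte blocks through a pool of one chunk.
pool = DataPool(n_chunks=1, capacity=4)
src = io.BytesIO(b"ACGTAC")
first = pool.acquire()
fill_from(src, first)    # first.buf[:first.size] now holds b"ACGT"
pool.release(first)      # hand the chunk back instead of destroying it
second = pool.acquire()  # the same object is reused for the next block
fill_from(src, second)
```

Passing the Chunk object itself through the queues, rather than a copy of its bytes, corresponds to the pointer passing in item (5).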
Improvement of the Adapter query process
The core of the Ktrim tool is finding the possible positions of the Adapter, i.e., the "seed" of the Adapter, which is a nucleotide sequence of length 3. Finding the "seed" can be abstracted as searching for a substring of length 3 in a string that is tens to hundreds of characters long. The existing Ktrim does this with the strstr(haystack, needle) function in the <string.h> library; once the I/O module has been optimized, this simple approach is inefficient and becomes a hot spot at run time.
The scheme of the invention relies on the vector units of modern computers and finds the positions of all "seeds" in a vectorized manner. Specifically, each base in a sequencing sequence is treated as a character, so one AVX512 vector register processes 64 bytes of data, i.e., 64 bases, at a time. Using the AVX512 vector registers and instruction set, the positions of the "seed" within sequence data of length 64 are obtained with a few bit operations (three XOR operations, two left-shift operations, and two OR operations). Sequencing sequence data of length L is divided into N groups of 64 bases each, and applying these bit operations to every group yields all positions of the "seed" in the whole sequence, where
$$N = \left\lceil \frac{L-2}{62} \right\rceil$$
and the symbol $\lceil \cdot \rceil$ denotes rounding up. The specific process is shown in the following pseudocode:
Algorithm 1 VECTORIZED-FIND-SEED
Input: seq, the sequencing sequence; index1, seed 1; index2, seed 2; seed1, all positions where seed 1 occurs; seed2, all positions where seed 2 occurs; seedtable, the seed position lookup table
Output: seed1 and seed2
function VECTORIZED-FIND-SEED(seq, index1, index2, seed1, seed2, seedtable)
    len ← length(seq)                  ▷ sequence length
    num ← ceil((len - 2) / 62)         ▷ number of iterations required
    i ← 0
    v11 ← init(index1[0])              ▷ initialize an AVX512 register with one base of the "seed"
    v12 ← init(index1[1])
    v13 ← init(index1[2])
    v21 ← init(index2[0])
    v22 ← init(index2[1])
    v23 ← init(index2[2])
    while i < num do
        v1 ← load(seq + 62 * i)        ▷ load the sequence data for the i-th iteration into AVX512 register v1
        res11 ← XOR(v1, v11)           ▷ XOR operation
        res21 ← XOR(v1, v21)
        v1 ← leftShift(v1, 8)          ▷ shift the data in register v1 left by 8 bits
        res12 ← XOR(v1, v12)
        res22 ← XOR(v1, v22)
        res12 ← OR(res11, res12)       ▷ OR the results of the two XORs
        res22 ← OR(res21, res22)
        v1 ← leftShift(v1, 8)
        res13 ← XOR(v1, v13)
        res23 ← XOR(v1, v23)
        res13 ← OR(res12, res13)
        res23 ← OR(res22, res23)
        res1 ← mask(res13)             ▷ convert the result of the OR operation into a mask64
        res2 ← mask(res23)
        j ← 0
        while j < 8 do
            t1 ← res1 AND 255          ▷ lowest 8 bits of the mask64
            t2 ← res2 AND 255
            res1 ← rightShift(res1, 8) ▷ shift the mask64 right by 8 bits
            res2 ← rightShift(res2, 8)
            put seedtable[i][j][t1] into seed1   ▷ use the lookup table seedtable to look up the positions of the "seed"
            put seedtable[i][j][t2] into seed2
            j ← j + 1
        end while
        i ← i + 1
    end while
end function
The divisor is 62 rather than 64 because the search target is a base sequence of length 3. When the last two bases (the 63rd and 64th) of a 64-base group match the first two bases of the "seed" but the next base has not been read in, that position is not reported as the start of a "seed"; yet if the next base does match the third base of the "seed", the position is in fact a "seed" position. The last two bases of each 64-base group must therefore be examined again: they are read redundantly as the first two bases of the next 64-base group. Because of these two redundant bases, only the first read brings in 64 genuinely new bases; every later read brings in only 62 new bases, so the sequence is divided in steps of 62 bases. The subtraction in L-2 accounts for the two extra bases of the first 64-base read. Rounding up is needed because L-2 is generally not an exact multiple of 62: if fewer than 62 bases remain at the end, one more round is still required to process them.
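The windowing arithmetic above can be checked with a small scalar emulation. The sketch below is plain Python, not AVX512 code: it steps through the sequence in 64-byte windows that advance by 62 bytes, and for each lane ORs together three byte-wise XOR results so that a zero marks the start of a "seed". The function name find_seed and the zero padding of the last window are assumptions made for illustration. For a read of length L = 150, the formula gives N = ceil(148/62) = 3 iterations.

```python
import math

WIDTH = 64        # bytes per emulated vector register
STEP = WIDTH - 2  # 62 new start positions per iteration


def find_seed(seq: bytes, seed: bytes) -> list:
    """Return every start index of the 3-base seed in seq."""
    assert len(seed) == 3
    n = len(seq)
    if n < 3:
        return []
    num = math.ceil((n - 2) / STEP)  # N = ceil((L - 2) / 62)
    hits = []
    for i in range(num):
        base = STEP * i
        # Zero-pad the last window; bases are ASCII letters, so padding never matches.
        win = seq[base:base + WIDTH].ljust(WIDTH, b"\0")
        for k in range(STEP):  # lane k of the emulated register
            # XOR each of the three bytes with the seed, then OR the results:
            # the value is zero only if all three bytes match.
            diff = (win[k] ^ seed[0]) | (win[k + 1] ^ seed[1]) | (win[k + 2] ^ seed[2])
            if diff == 0:
                hits.append(base + k)
    return hits
```

Each window reports start positions only for its first 62 lanes, so a "seed" that begins at byte 62 or 63 of one window is found at lane 0 or 1 of the next window, which is the overlap the paragraph describes.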
Further, the AVX512 vector register is used above for ease of explanation; it is understood that vector registers of other widths may be used in an implementation.
Further, to demonstrate the effectiveness of the solution of the present invention (FIG. 4 is a schematic diagram of its overall framework), the following experimental tests were performed.
In FIG. 4, Low-Quality base-Trim denotes the trimming of low-quality bases. Based on the quality scores in the FASTQ format data, a sliding window of fixed size scans the sequencing data from the 3' end toward the 5' end. While the average quality score of the bases in the window is below the user-specified threshold, the window moves left by one base and is evaluated again. Once the average quality score of the bases in the window meets the threshold, the window stops sliding, and the bases at the window position and behind it (toward the 3' end) are cut off.
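The sliding-window scan can be sketched as follows. This Python example is a minimal illustration under several assumptions that the text does not fix: quality characters use the Phred+33 encoding, a read whose 3'-most window already meets the threshold is left unchanged, a read in which no window meets the threshold is removed entirely, and the hypothetical flag keep_window selects whether the window at which the scan stops is cut (the literal reading of the description, used as the default) or kept. The function name and parameters are invented for this example.

```python
def quality_trim(seq, qual, window=5, threshold=20, offset=33, keep_window=False):
    """Trim low-quality bases from the 3' end with a fixed-size sliding window."""
    scores = [ord(c) - offset for c in qual]   # assumed Phred+33 encoding
    start = len(seq) - window                  # window starts at the 3' end
    if start < 0:
        return seq, qual                       # assumption: too short to scan
    moved = False
    # Comparing the sum with threshold * window is the same as comparing the
    # average with the threshold, without a division.
    while start >= 0 and sum(scores[start:start + window]) < threshold * window:
        start -= 1                             # slide one base toward the 5' end
        moved = True
    if not moved:
        return seq, qual                       # 3' window is already good
    if start < 0:
        return "", ""                          # no window met the threshold
    cut = start + window if keep_window else start
    return seq[:cut], qual[:cut]


print(quality_trim("ACGTACGTAC", "IIIIII####", window=4))  # ('ACGT', 'IIII')
```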
In FIG. 4, Adapter-Trim denotes the trimming of the adapter from the sequencing sequence; the adapter is the linker used in sequencing and is itself a nucleotide sequence. First, the first three bases of the user-specified adapter sequence are taken as a "seed", and all positions where the "seed" occurs in the sequencing sequence are found and used as candidate positions. The candidate positions are then traversed from left to right: the sequence at each candidate position is compared with the user-specified adapter sequence and the Hamming distance is calculated. The first candidate position whose distance satisfies the user-specified threshold is taken as the cutting position, and the bases at and behind that position (toward the 3' end) are cut off. The search for candidate positions is vectorized using Algorithm 1 above, which speeds up the search and improves the processing efficiency of the worker threads.
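The seed-then-verify procedure can be sketched in scalar form. The Python code below is a hedged illustration, not the RabbitTrim or Ktrim source: the seed search uses str.find in place of the vectorized search, the threshold is expressed as a hypothetical mismatch rate max_mismatch_rate, and an adapter that runs past the end of the read is compared only over the overlapping part. All of these choices, and the function names, are assumptions made for this example.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def adapter_trim(read: str, adapter: str, max_mismatch_rate: float = 0.125) -> str:
    """Cut the read at the first seed position that verifies as the adapter."""
    seed = adapter[:3]                      # first three bases of the adapter
    pos = read.find(seed)                   # candidates, visited left to right
    while pos != -1:
        overlap = min(len(adapter), len(read) - pos)
        dist = hamming(read[pos:pos + overlap], adapter[:overlap])
        if dist <= max_mismatch_rate * overlap:
            return read[:pos]               # drop the adapter and what follows
        pos = read.find(seed, pos + 1)      # next candidate position
    return read                             # no adapter found


adapter = "AGATCGGAAGAGC"
read = "TTAGACCGTGCA" + adapter + "ACGT"
print(adapter_trim(read, adapter))  # TTAGACCGTGCA
```

In this example the seed also occurs at position 2 of the read, but that candidate is rejected because the following bases differ from the adapter in too many positions; the cut is made at the second candidate, position 12.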
To evaluate the correctness and performance of the RabbitTrim program, we performed a series of experiments and analyses on different simulated and real data sets. The specific experiments are as follows:
(1) Correctness testing
In this embodiment, simulated data is used for correctness verification. We generated single-end and paired-end data at different sequencing error rates using simulation data generation software, and the experiments confirmed that the results of RabbitTrim are consistent with those of the original Ktrim.
(2) Performance testing
In this embodiment, performance tests are run on several simulated and real data sets, and the running time and thread scalability of RabbitTrim, Ktrim and Trimmomatic are recorded. As an example of real data, one data set comes from GEO (GSE81178) in the NCBI (National Center for Biotechnology Information) database, using GSM2144218 to GSM2144224. The files with the name suffix read1.fastq were merged into position.read1.fq and the files with the name suffix read2.fastq were merged into position.read2.fq; each merged file is 19 GB in size.
Table 1. Single-end data run time (seconds) by thread count:
Tool name     1        2        4        6        8
RabbitTrim    42.68    21.4     11.14    9.52     9.58
Ktrim         51.86    38.76    26.58    26.86    26.69
Trimmomatic   167.23   86.52    87.51    84.25    86.64
Table 2. Single-end data speed-up ratio by thread count:
Tool name     1        2        4        6        8
RabbitTrim    1        1.99     3.83     4.49     4.46
Ktrim         1        1.34     1.95     1.93     1.94
Trimmomatic   1        1.93     1.91     1.98     1.93
Table 3. Paired-end data run time (seconds) by thread count:
Tool name     1        2        4        6        8
RabbitTrim    62.88    32.77    22.69    22.73    22.51
Ktrim         96.04    68.56    69.05    66.81    69.32
Trimmomatic   364.38   149.72   124.76   124.6    123.19
Table 4. Paired-end data speed-up ratio by thread count:
Tool name     1        2        4        6        8
RabbitTrim    1        1.92     2.77     2.77     2.79
Ktrim         1        1.4      1.39     1.44     1.39
Trimmomatic   1        2.43     2.92     2.92     2.96
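The speed-up ratios can be recomputed from the run times. The short Python sketch below assumes that the speed-up ratio at n threads is the single-thread run time divided by the n-thread run time, an interpretation that is consistent with the tabulated values; the function name speedups is hypothetical. It uses the Ktrim row of Table 1 as input.

```python
def speedups(run_times):
    """Speed-up relative to the first (single-thread) run time, to 2 decimals."""
    base = run_times[0]
    return [round(base / t, 2) for t in run_times]


threads = [1, 2, 4, 6, 8]
ktrim_single_end = [51.86, 38.76, 26.58, 26.86, 26.69]   # Table 1, seconds
print(dict(zip(threads, speedups(ktrim_single_end))))
# {1: 1.0, 2: 1.34, 4: 1.95, 6: 1.93, 8: 1.94}
```

Values recomputed this way may differ from a tabulated entry in the last digit, depending on how the original figures were rounded.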
The experimental results in Tables 1 to 4 show that RabbitTrim performs considerably better than Ktrim: its single-thread execution is faster and its thread scalability is better. Its best processing speed is about 3 times that of Ktrim, at about 2 GB/s (Ktrim reaches about 730 MB/s on the same platform), which reaches the peak disk read/write performance (IO bound) of the test platform.
The scheme of the invention addresses the problem of low IO efficiency by optimizing the input/output module and the vectorization module, and increases the speed of the adapter search. Experimental tests show that the correctness of RabbitTrim is consistent with that of Ktrim while its performance is greatly improved, with a data processing speed close to the peak disk read/write performance (about 2 GB/s). On the same experimental platform, the peak data processing performance of RabbitTrim is about 3 times that of the original Ktrim. This greatly speeds up the preprocessing of next-generation sequencing data, which is important for accelerating downstream analysis tasks, and provides a faster quality and adapter trimming tool for workers in the field of bioinformatics.
Example two:
The purpose of this example is to provide a rapid trimming system for biological sequencing sequences.
A rapid biological sequencing sequence trimming system, comprising:
a data acquisition unit for acquiring a biological sequencing sequence to be trimmed;
a data processing unit for performing a read operation, a trim operation, and a write operation on the biological sequencing sequence; the read operation, the trim operation and the write operation are decoupled based on a producer-consumer model, and asynchronous execution is realized; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trim operation.
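The decoupling of the three operations can be illustrated with a small producer-consumer sketch. The Python code below is a structural illustration only, not the system of this example: the worker functions, the DONE sentinel, the queue sizes and the placeholder trimming rule (stripping trailing N bases) are all assumptions made for this example, and Python threads show the data flow rather than the performance. The read thread passes raw blocks on without parsing them, and the FASTQ formatting is done in the trim threads.

```python
import queue
import threading

DONE = None  # sentinel marking the end of a stream


def read_worker(raw_lines, records_per_block, q1, n_trimmers):
    """Read thread: enqueue raw blocks without parsing them."""
    step = 4 * records_per_block               # a FASTQ record spans 4 lines
    for i in range(0, len(raw_lines), step):
        q1.put(raw_lines[i:i + step])
    for _ in range(n_trimmers):
        q1.put(DONE)                           # one sentinel per trim thread


def trim_worker(q1, q2):
    """Trim thread: format (parse) each block, trim it, pass it on."""
    while True:
        block = q1.get()
        if block is DONE:
            q2.put(DONE)
            return
        out = []
        for i in range(0, len(block), 4):
            name, seq, _plus, qual = block[i:i + 4]   # formatting happens here
            keep = len(seq.rstrip("N"))               # placeholder trim rule
            out.append((name, seq[:keep], qual[:keep]))
        q2.put(out)


def write_worker(q2, n_trimmers, sink):
    """Write thread: collect processed records until all trimmers finish."""
    finished = 0
    while finished < n_trimmers:
        item = q2.get()
        if item is DONE:
            finished += 1
        else:
            sink.extend(item)


def run_pipeline(raw_lines, n_trimmers=2, records_per_block=2):
    q1, q2, sink = queue.Queue(maxsize=8), queue.Queue(maxsize=8), []
    workers = [
        threading.Thread(target=read_worker,
                         args=(raw_lines, records_per_block, q1, n_trimmers)),
        threading.Thread(target=write_worker, args=(q2, n_trimmers, sink)),
    ]
    workers += [threading.Thread(target=trim_worker, args=(q1, q2))
                for _ in range(n_trimmers)]
    for w in workers:        # each thread is created and started exactly once
        w.start()
    for w in workers:
        w.join()
    return sink


lines = ["@r1", "ACGTNN", "+", "IIIIII", "@r2", "NNNN", "+", "####",
         "@r3", "ACGT", "+", "IIII"]
print(sorted(run_pipeline(lines)))
```

With more than one trim thread the order of the output records is not deterministic in this sketch, which is why the usage line sorts the result.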
In further embodiments, there is also provided:
an electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when executed by the processor, perform the method of embodiment one. For brevity, no further description is provided here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include both read-only memory and random access memory and provides instructions and data to the processor; a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method in embodiment one may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.
Those of ordinary skill in the art will appreciate that the illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The method and system for rapid trimming of biological sequencing sequences can be implemented and have broad application prospects.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for rapidly trimming a biological sequencing sequence, characterized by comprising the following steps:
obtaining a biological sequencing sequence to be trimmed;
performing a read operation, a trim operation, and a write operation on the biological sequencing sequence; the read operation, the trim operation and the write operation are decoupled based on a producer-consumer model, and asynchronous execution is realized; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trim operation.
2. The rapid biological sequencing sequence trimming method according to claim 1, wherein the read operation, the trim operation and the write operation are each implemented by independent threads, wherein one read thread and one write thread are provided, and one or more trim threads are provided.
3. The method of claim 1, wherein the read operation is used to read the biological sequencing sequence in blocks through a read thread, and to store the block objects that have been read in a first data queue.
4. The method for rapid trimming of biological sequencing sequences according to claim 3, wherein the block objects are created using a data pool, in which only a preset number of block objects are created and then reused.
5. The method of claim 1, wherein the trim operation is used to obtain data from the first data queue through a trim thread, format the biological sequencing sequence, and remove low-quality base sequences and linker sequences from the biological sequencing sequence; and to store the processed sequence in a second data queue.
6. The method of claim 5, wherein obtaining the linker sequence in the trim thread comprises: treating each base in the biological sequencing sequence as a character; and, based on a vector register, obtaining the positions of the linker sequence in sequence data of a preset length using multiple bit operations.
7. The method of claim 1, wherein the write operation is used to obtain the processed biological sequencing sequence from the second data queue through a write thread and to store the biological sequencing sequence.
8. The method according to claim 1, wherein the threads corresponding to the read operation, the trim operation and the write operation are each created only once and are destroyed only after the processing task is completed.
9. The method of claim 1, wherein the formatting comprises parsing the data according to FASTQ format.
10. A rapid biological sequencing sequence trimming system, comprising:
a data acquisition unit for acquiring a biological sequencing sequence to be trimmed;
a data processing unit for performing a read operation, a trim operation, and a write operation on the biological sequencing sequence; the read operation, the trim operation and the write operation are decoupled based on a producer-consumer model, and asynchronous execution is realized; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trim operation.
CN202210308606.6A 2022-03-28 2022-03-28 Rapid trimming method and system for biological sequencing sequence Active CN114420210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210308606.6A CN114420210B (en) 2022-03-28 2022-03-28 Rapid trimming method and system for biological sequencing sequence

Publications (2)

Publication Number Publication Date
CN114420210A true CN114420210A (en) 2022-04-29
CN114420210B CN114420210B (en) 2022-09-20

Family

ID=81263884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210308606.6A Active CN114420210B (en) 2022-03-28 2022-03-28 Rapid trimming method and system for biological sequencing sequence

Country Status (1)

Country Link
CN (1) CN114420210B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070073804A1 (en) * 2005-09-03 2007-03-29 International Business Machines Corporation Pruning method
US20150039614A1 (en) * 2013-07-25 2015-02-05 Kbiobox Inc. Method and system for rapid searching of genomic data and uses thereof
US20160147814A1 (en) * 2014-11-25 2016-05-26 Anil Kumar Goel In-Memory Database System Providing Lockless Read and Write Operations for OLAP and OLTP Transactions
CN111278995A (en) * 2017-08-28 2020-06-12 普梭梅根公司 Method and system for characterizing conditions related to the female reproductive system associated with a microbial organism
WO2019133928A1 (en) * 2017-12-30 2019-07-04 Uda, Llc Hierarchical, parallel models for extracting in real-time high-value information from data streams and system and method for creation of same
US20200058379A1 (en) * 2018-08-20 2020-02-20 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Compressing Genetic Sequencing Data and Uses Thereof
CN113192558A (en) * 2021-05-26 2021-07-30 北京自由猫科技有限公司 Reading and writing method for third-generation gene sequencing data and distributed file system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cantu VA et al.: "PRINSEQ++, a multi-threaded tool for fast and efficient quality control and preprocessing of sequencing datasets", PeerJ Preprints *
Kun Sun et al.: "Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data", Bioinformatics *
Xu Kai (许凯): "Optimization of hash-based high-throughput biological gene sequencing data processing algorithms", China Doctoral and Master's Dissertations Full-text Database (Doctoral), Basic Sciences *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092587A (en) * 2023-04-11 2023-05-09 山东大学 Biological sequence analysis system and method based on producer-consumer model
CN116092587B (en) * 2023-04-11 2023-08-18 山东大学 Biological sequence analysis system and method based on producer-consumer model

Also Published As

Publication number Publication date
CN114420210B (en) 2022-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant