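WO2022082879A1 - Gene sequencing data processing method and gene sequencing data processing device - Google Patents
Gene sequencing data processing method and gene sequencing data processing device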
- Publication number
- WO2022082879A1 (PCT/CN2020/127101)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- algorithm
- sequencing data
- gene sequencing
- idle state
- gpu
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
- G06F13/4204—Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus
- G06F13/4221—Bus transfer protocol, e.g. handshake; Synchronisation on a parallel bus being an input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- The invention relates to the technical field of data processing, and in particular to a gene sequencing data processing method and a gene sequencing data processing device.
- The traditional alignment tool BWA uses the BWT algorithm together with the Smith-Waterman algorithm for inexact alignment.
- The algorithm is also implemented based on the SSE2 instructions of the x86 architecture.
- Although the x86-based BWT alignment algorithm runs fast on CPUs of the x86 architecture, it cannot process large batches at the same time, and the BWT algorithm cannot adapt to the SIMT operation mode of the GPU. As a result, BWT runs much less efficiently on the GPU, which reduces the efficiency of the entire alignment process.
- The existing Smith-Waterman algorithm only runs on the x86 architecture; on the ARM platform it lacks SSE2 acceleration and runs slowly, and the algorithm is also not suited to computing on the GPU architecture.
- The present invention provides a gene sequencing data processing device and a gene sequencing data processing method, so as to solve the problem that the steps of the existing gene sequencing data analysis process can only run on the x86 framework and run slowly on the GPU, which makes gene sequencing data processing inefficient.
- An embodiment of the present invention provides a gene sequencing data processing method, which is applied to a gene sequencing data processing device, wherein the gene sequencing data processing device is a heterogeneous multi-core architecture including an ARM architecture, a GPU architecture, and a PCI bus; the ARM architecture is connected to the GPU architecture through the PCI bus; the ARM architecture includes at least one CPU module; the GPU architecture includes at least one GPU module; and the method comprises the following steps:
- Step S1: the CPU module in the idle state reads the gene sequencing data in batches to obtain the batched gene sequencing data;
- Step S2: the CPU module in the idle state divides the gene analysis method to obtain the first algorithm and the second algorithm;
- Step S3: the CPU module in the idle state divides the batched gene sequencing data according to the first algorithm to obtain each short sequence, and sends each of the short sequences and the second algorithm to the GPU module in the idle state;
- Step S4: the GPU module in the idle state calculates each of the short sequences according to the second algorithm, and sends the calculation result to the CPU module in the idle state;
- Step S5: the CPU module in the idle state obtains a batch processing result according to the calculation result and the first algorithm.
- Steps S1 to S5 are repeated until the processing of the gene sequencing data is completed, and the CPU module in the idle state performs an integration operation on each of the batch processing results to obtain a final processing result.
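Steps S1 to S5 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a thread pool stands in for the idle GPU modules, and the function names, the fixed-length cut, and the GC-count "second algorithm" are hypothetical placeholders rather than the algorithms of the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def read_batches(data, batch_size):
    """S1: yield the sequencing data batch by batch."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]


def first_algorithm_split(batch, length):
    """S3 (CPU side): cut a batch into short sequences (hypothetical fixed-length cut)."""
    return [batch[i:i + length] for i in range(0, len(batch), length)]


def second_algorithm(short_seq):
    """S4 (GPU stand-in): an independent per-sequence computation (here: G/C count)."""
    return short_seq.count("G") + short_seq.count("C")


def first_algorithm_merge(results):
    """S5 (CPU side): combine per-sequence results into one batch result."""
    return sum(results)


def process(data, batch_size=8, length=4, workers=2):
    """Run S1..S5 for every batch, then integrate the batch results."""
    batch_results = []
    # S2 is represented by the split into first_algorithm_* and second_algorithm.
    with ThreadPoolExecutor(max_workers=workers) as gpu_pool:  # stand-in for idle GPU modules
        for batch in read_batches(data, batch_size):                 # S1
            shorts = first_algorithm_split(batch, length)            # S3
            results = list(gpu_pool.map(second_algorithm, shorts))   # S4
            batch_results.append(first_algorithm_merge(results))     # S5
    return sum(batch_results)  # final integration of all batch results
```

Calling `process("ACGTGGCCAATTGCGC")` walks two batches of eight bases through the five steps and sums the per-batch results.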
- The CPU module in the idle state scans each of the GPU modules, determines the number of GPU modules in the idle state and the data processing capacity of those GPU modules, and reads the gene sequencing data in batches according to that number and capacity.
- The gene analysis algorithms include a gene alignment algorithm, the Dotplot algorithm, the BLAST algorithm, the PAM algorithm, the HMM algorithm and an AI inference algorithm.
- The gene alignment algorithm includes a BWT algorithm, and the first algorithm includes an anchor cutting algorithm;
- the CPU module in the idle state uses the anchor cutting algorithm to locate anchor points in the batched gene sequencing data, extends N bp forward and N bp backward from each anchor point as the center, and uses NEON instructions to cut the batched gene sequencing data into segments of length 2N+1 bp to obtain each of the short sequences, where N is any positive integer.
- The step of obtaining each of the short sequences includes calculating each of the short sequences using the following formula:
- x represents the number of anchor points
- N represents the number of extended bp
- L represents the length of the batch gene sequencing data
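The anchor cutting step described above (a window of 2N+1 bp centered on each anchor point) can be illustrated with a minimal sketch. It is written in plain Python without NEON or any SIMD; the function name, the 0-based anchor indices, and the rule of skipping anchors too close to either end are assumptions made for illustration only.

```python
def cut_at_anchors(batch, anchors, n):
    """Return one window of length 2n+1 centered on each anchor index.

    Assumption: anchors whose full window would run past either end of
    the batch are skipped rather than padded.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    shorts = []
    for a in anchors:
        start, end = a - n, a + n + 1  # n bp before, the anchor, n bp after
        if start >= 0 and end <= len(batch):
            shorts.append(batch[start:end])
    return shorts
```

For example, `cut_at_anchors("ACGTACGTAC", [2, 5, 9], 2)` keeps the windows around indices 2 and 5 and drops the one around index 9, which has no room for two bases after it.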
- The second algorithm is a Hash algorithm; the GPU module in the idle state performs a Hash operation on each of the short sequences according to the Hash algorithm, obtains a Hash calculation result, and sends the Hash calculation result to the CPU module in the idle state, wherein the Hash calculation result is the value of the BWT algorithm matrix and is used for the calculation of the BWT algorithm matrix.
- The first algorithm also includes a BWT matrix transformation algorithm;
- the CPU module in the idle state uses the BWT matrix transformation algorithm to transform the BWT algorithm matrix to obtain a BWT transformation result of the short sequence.
- The gene alignment algorithm includes the Smith-Waterman algorithm, and the second algorithm includes a scoring matrix algorithm;
- the GPU module in the idle state calculates the Smith-Waterman scoring matrix according to the scoring matrix algorithm, each of the short sequences and the reference species sequence, and sends the Smith-Waterman scoring matrix to the CPU module in the idle state.
- the Smith-Waterman scoring matrix is calculated using the following formula:
- M represents the Smith-Waterman scoring matrix
- R represents the length of the candidate interval sequence of the reference species
- C represents the length of the short sequence formed by screening and splicing each short sequence received from the CPU module in the idle state
- L represents the length of the short sequence.
- a and b represent constants.
- An embodiment of the present invention provides a gene sequencing data processing device, which is a heterogeneous multi-core framework and which executes the gene sequencing data processing method.
- In the gene sequencing data processing device and the gene sequencing data processing method of the embodiments of the present invention, the method is applied to the device, and the device is a heterogeneous multi-core framework including an ARM framework, a GPU framework and a PCI bus. The ARM framework includes at least one CPU module and the GPU framework includes at least one GPU module; the CPU module is connected to the GPU module through the PCI bus, and information can be transmitted between the two.
- In the method, the CPU module in the idle state is mainly used to read the gene sequencing data in batches and to divide the gene analysis method, so as to obtain the batched gene sequencing data, the first algorithm (the algorithm best suited to running on the CPU module) and the second algorithm (the algorithm best suited to running on the GPU module). The first algorithm is then used to segment the batched gene sequencing data into a series of short sequences, and these short sequences and the second algorithm are transmitted through the PCI bus to the GPU module in the idle state. The GPU module calculates these short sequences according to the second algorithm and returns the calculation result to the CPU module in the idle state, and the CPU module in the idle state obtains a batch processing result according to the calculation result and the first algorithm.
- The gene sequencing data processing device and the gene sequencing data processing method split the analysis method (i.e. the analysis process) of the gene sequencing data and run the parts on the CPU module and the GPU module respectively according to their characteristics, which greatly improves the efficiency of gene sequencing data analysis.
- the gene sequencing data processing device can be provided with multiple CPU modules and GPU modules, and multiple GPU modules can simultaneously calculate short sequences of different lengths, which can solve the problem of low GPU parallel efficiency.
- FIG. 1 is a schematic structural diagram of a gene sequencing data processing device in an embodiment of the present invention.
- FIG. 2 is a schematic diagram of a data processing process of a gene sequencing data processing device in an embodiment of the present invention.
- FIG. 3 is a schematic diagram of anchor cutting performed by a CPU module on batched gene sequencing data in an embodiment of the present invention.
- FIG. 4 is a schematic diagram of a GPU module using a Hash algorithm to perform a Hash operation on a short sequence in an embodiment of the present invention.
- FIG. 5 is a schematic flowchart of a method for processing gene sequencing data in an embodiment of the present invention.
- Gene refers to a DNA or RNA sequence that carries genetic information (that is, a gene is a DNA or RNA segment with genetic effects), also known as a genetic factor, and is the basic genetic unit that controls traits. Genes express the genetic information they carry by directing the synthesis of proteins, thereby controlling the performance of individual organisms.
- Gene sequencing is a new type of gene detection technology, which analyzes and determines the entire gene sequence from blood or saliva, so as to predict the possibility of suffering from various diseases, and the behavioral characteristics and reasonable behavior of individuals.
- Short sequence (read): a short sequence fragment, which is the sequencing data generated by a high-throughput sequencer. Sequencing the entire genome generates tens of millions of reads, and splicing these reads together yields the full sequence of the genome.
- The short sequences (reads) sequenced by NGS are stored in a FASTQ file. Although they originally come from an ordered genome, after DNA library construction and sequencing, the order relationship between different reads in the file has been lost. There is therefore no positional relationship between two adjacent reads in the FASTQ file; they are just short sequences randomly derived from some position in the original genome. We therefore need to take this large pile of short sequences, compare them one by one with the reference genome of the species, find the position of each read on the reference genome, and then arrange them in order. This process is called the alignment of sequencing data.
- Alignment algorithms: computational methods for sequence alignment are generally divided into two categories, global alignments and local alignments. Computing a global alignment is a form of global optimization that forces the alignment to span the entire length of all query sequences. In contrast, local alignments only identify regions of local similarity within long sequences that are often very different overall. Local alignments are often preferable, but can be more difficult to compute because of the additional challenge of identifying the regions of similarity.
- Various computational algorithms have been applied to sequence alignment problems, including slow but formal optimization methods like dynamic programming, efficient but incomplete heuristics, or probabilistic methods designed to search large databases.
- ARM: the ARM architecture (Advanced RISC Machine, earlier known as Acorn RISC Machine) is a family of reduced instruction set (RISC) processor architectures that is widely used in embedded system designs. Because of its energy efficiency, it is also widely used in other fields.
- The ARM processor is very suitable for the field of mobile communication, and its main design goals are low cost, high performance, and low power consumption. In addition, supercomputers consume a lot of power, and ARM is seen as a more power-efficient choice there as well.
- ARM Holdings developed this architecture and licenses it to other companies, which implement one of the ARM architectures and develop their own system-on-chip (SoC).
- GPU: Graphics Processing Unit, also known as display core, visual processor, display device, or graphics device.
- The graphics processor reduces the dependence of the graphics card on the central processing unit (CPU) and takes over part of the work originally performed by the central processing unit; the effect is most obvious when performing 3D graphics operations.
- CUDA: Compute Unified Device Architecture, a unified computing architecture. It is NVIDIA's official name for GPGPU.
- With this technology, NVIDIA GeForce 8 and later GPUs and newer Quadro GPUs can be used for computing.
- It was also the first time that a GPU could be used as a development environment for a C compiler.
- In its marketing, NVIDIA tends to mix the compiler and the architecture together, which causes confusion.
- CUDA is compatible with OpenCL or its own C compiler. Whether CUDA C or OpenCL is used, the instructions are eventually converted into PTX code by the driver and then executed by the display core.
- BWT (Burrows–Wheeler Transform, also called block-sorting compression) is an algorithm applied in data compression technology (such as bzip2).
- The algorithm was invented in 1994 by Michael Burrows and David Wheeler at the DEC Systems Research Center in Palo Alto, California. It is based on an unpublished transform previously invented by Wheeler in 1983.
- When a string is transformed with this algorithm, only the order of the characters in the string changes; the characters themselves do not. If the original string has several substrings that appear multiple times, the transformed string will contain runs of consecutive repeated characters, which is useful for compression. The transform therefore makes the data easier to compress with techniques that handle consecutive repeated characters, such as the MTF transform and run-length coding.
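The transform described above can be illustrated with a naive sketch that sorts all rotations of the string. This is the textbook construction, not the BWT matrix procedure of the patent; the `$` end marker and the function names are assumptions, and the quadratic memory use makes it suitable only for short examples.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler transform via sorted rotations (naive textbook version)."""
    if end_marker in text:
        raise ValueError("text must not contain the end marker")
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end_marker="$"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    # The original text is the row that ends with the end marker.
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]
```

The output contains exactly the same characters as the input plus the marker, only reordered, so the transform is lossless and `inverse_bwt(bwt(s))` returns `s`.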
- Smith-Waterman (the Smith-Waterman algorithm) is an algorithm that performs local sequence alignment (as opposed to global alignment) to find similar regions between two nucleotide sequences or protein sequences.
- The purpose of this algorithm is not to align the entire sequences, but to find fragments with high similarity in the two sequences.
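As an illustration of local alignment, the following is a minimal sketch of the textbook Smith-Waterman recurrence with a linear gap penalty. It is not the scoring matrix formula of the patent; the scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]  # scoring matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if h[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Each cell depends only on its upper, left, and upper-left neighbors, so cells on the same anti-diagonal are independent of one another, which is the usual reason the scoring matrix is considered a candidate for data-parallel hardware.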
- HASH: also known as a hash algorithm or hash function, a method of creating a small digital "fingerprint" from any kind of data.
- The hash function compresses the message or data into a digest, making the amount of data smaller and fixing the format of the data. The function scrambles the data and creates a fingerprint called a hash value (also hash code, hash sum, or hash).
- Hash values are usually represented by a short string of random-looking letters and numbers.
- Good hash functions rarely produce hash collisions in the input domain. In hash tables and data processing, failing to suppress collisions makes it harder to distinguish data and to find database records.
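The fingerprint idea can be sketched with Python's standard `hashlib`. BLAKE2b with an 8-byte digest is an arbitrary choice made for illustration, and the helper names are hypothetical; the patent does not state which hash function its Hash algorithm uses.

```python
import hashlib


def sequence_fingerprint(seq, digest_bytes=8):
    """Map a short sequence of any length to a fixed-size hexadecimal digest."""
    return hashlib.blake2b(seq.encode("ascii"), digest_size=digest_bytes).hexdigest()


def index_by_fingerprint(short_seqs):
    """Group the positions of short sequences by their fingerprint."""
    index = {}
    for position, seq in enumerate(short_seqs):
        index.setdefault(sequence_fingerprint(seq), []).append(position)
    return index
```

Identical sequences always land under the same key, and every sequence is hashed independently of the others, so the work can be spread across many parallel workers.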
- SSE2 (Streaming SIMD Extensions 2), is a SIMD (Single Instruction Multiple Data) instruction set of the IA-32 architecture.
- SSE2 is an instruction set that was launched in 2000 with Intel's release of the first-generation Pentium 4 processor. It extends the earlier SSE instruction set and can completely replace the MMX instruction set.
- FIG. 1 is a schematic structural diagram of a gene sequencing data processing device. As shown in FIG. 1, the gene sequencing data processing device is a heterogeneous multi-core framework, including: an ARM framework 10, a GPU framework 20 and a PCI bus 30; the ARM framework 10 is connected to the GPU framework 20 through the PCI bus 30; the ARM framework 10 includes at least one CPU module, and the GPU framework 20 includes at least one GPU module. The CPU module in the idle state is used to read the gene sequencing data in batches to obtain the batched gene sequencing data, to divide the gene analysis method to obtain the first algorithm and the second algorithm, to divide the batched gene sequencing data according to the first algorithm to obtain each short sequence, and to send each short sequence and the second algorithm to the GPU module in the idle state. The GPU module in the idle state is used to calculate each short sequence according to the second algorithm and to send the calculation result to the CPU module in the idle state. The CPU module in the idle state is also used to obtain batch processing results according to the calculation result and the first algorithm. The CPU module in the idle state and the GPU module in the idle state repeatedly execute the above steps until the processing of the gene sequencing data is completed.
- The gene sequencing data processing device is a heterogeneous multi-core framework, that is, an ARM+GPU framework, wherein the ARM framework 10 includes a CPU module and the GPU framework 20 includes a GPU module. The number of CPU modules and GPU modules is not fixed and can be determined according to actual needs, for example according to the amount of gene sequencing data, the CPU module performance, the GPU module performance (such as GPU memory, number of CUDA cores, and CUDA core frequency), and the complexity of the algorithms used in the gene analysis.
- The processing or computing capabilities of each CPU module may be the same or different.
- the processing or computing capabilities of each GPU module can also be the same or different.
- the GPU module may be a GPU computing card, where the GPU computing card usually adopts a SIMT architecture.
- The CPU module adopts the NEON acceleration technology, which can further improve the running speed of the CPU module.
- The gene sequencing data processing device can use the Jetson nano TX1 released by NVIDIA.
- The device uses a Maxwell architecture GPU with 128 CUDA cores and a computing power of 472 GFLOPS.
- The Jetson Nano also has a 4-core A57 processor as the ARM CPU arithmetic unit.
- Gene analysis methods refer to the methods used in the analysis and processing of gene sequencing data, including sequence comparison, gene set enrichment analysis (including GO analysis, KEGG analysis), and gene regulatory network analysis.
- The first algorithm and the second algorithm are obtained by dividing a gene analysis method according to its characteristics: the parts suitable for processing on the CPU module are split out of the gene analysis method to form the first algorithm, and the parts suitable for processing on the GPU module are split out to form the second algorithm. The first algorithm and the second algorithm are therefore each a part of the gene analysis method and can each consist of one or more small steps. There are no strict rules for the division, as long as the segmentation principle is met.
- The segmentation principle is mainly as follows. The first algorithm usually requires many logical judgments, and its calculation results depend on one another; for example, the second calculation step depends on or builds on the result of the first step, or yes/no decisions are involved.
- The second algorithm is usually one in which many data items can be calculated at the same time, with no logical judgment involved and no dependency between the data items.
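The segmentation principle can be expressed as a small sketch that sorts the steps of an analysis method into a CPU part and a GPU part. The `Step` record, its two flags, and the example step names are hypothetical; the patent does not prescribe a data structure for this decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    branches: bool             # involves logical (yes/no) judgments
    depends_on_previous: bool  # needs the result of an earlier calculation


def split_method(steps):
    """Return (first_algorithm, second_algorithm) as lists of step names."""
    first, second = [], []
    for step in steps:
        if step.branches or step.depends_on_previous:
            first.append(step.name)   # branch-heavy or sequential: CPU module
        else:
            second.append(step.name)  # independent, data-parallel: GPU module
    return first, second
```

With steps such as anchor cutting (branching), hashing (independent per sequence) and matrix transformation (dependent on the hash results), the hashing step alone ends up in the second algorithm.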
- The state of each CPU module in the ARM framework 10 may be different, that is, some CPU modules are in a running state while others are in an idle state.
- the GPU modules in the GPU architecture 20 have a similar situation. Therefore, in this embodiment, the CPU modules and GPU modules in an idle state are used to perform corresponding operations, and the selected CPU modules and GPU modules may be all modules in an idle state, or may be a part of them.
- The gene sequencing data may be data obtained by gene sequencing of any species, including DNA sequencing fragments, RNA sequencing fragments, and the like. Since a single sequencing run generates a large amount of data, the data can be analyzed and processed in batches, thereby avoiding data transmission congestion and similar problems. In this embodiment, the CPU modules in the idle state therefore read the gene sequencing data in batches, and the amount of data read each time need not be equal. Specifically, the number of GPU modules, the data processing capability of each GPU module, the data reading capability of the CPU module and the data transmission capability of the PCI bus are considered together to determine the most suitable amount of gene sequencing data, so as to keep the data processing efficiency as high as possible.
- The first algorithm is used to segment the batched gene sequencing data, wherein the lengths of the resulting short sequences may differ, and the number of short sequences is not fixed.
- The number of GPU modules in the idle state and the GPU processing capacity, among other factors, are considered to select the most appropriate value.
- While the GPU module in the idle state calculates each short sequence according to the second algorithm, the CPU module in the idle state can read and divide the next batch of gene sequencing data. When the GPU module in the idle state completes the processing of the short sequences, the calculation results are transmitted to the CPU module in the idle state, and the CPU module can compute the batch processing result according to the calculation results and the first algorithm.
- This process is repeated continuously, forming a pipeline between the CPU module and the GPU module, until all the gene sequencing data are processed.
- The gene sequencing data processing device is a heterogeneous multi-core framework, including an ARM framework 10, a GPU framework 20 and a PCI bus 30, wherein the ARM framework includes at least one CPU module and the GPU framework includes at least one GPU module; the CPU module is connected with the GPU module through the PCI bus, and information can be transmitted between them.
- The CPU module in the idle state is mainly used to read the gene sequencing data in batches and to divide the gene analysis method, so as to obtain the batched gene sequencing data, the first algorithm (the algorithm best suited to running on the CPU module) and the second algorithm (the algorithm best suited to running on the GPU module). It then uses the first algorithm to segment the batched gene sequencing data into a series of short sequences, and transmits these short sequences and the second algorithm through the PCI bus to the GPU module in the idle state. The GPU module calculates these short sequences according to the second algorithm and returns the calculation result to the CPU module in the idle state, and the CPU module in the idle state obtains a batch processing result according to the calculation result and the first algorithm.
- The CPU module in the idle state and the GPU module in the idle state repeatedly perform the above steps until the gene sequencing data is processed, and then the CPU module in the idle state integrates each batch processing result to obtain the final processing result.
- The gene sequencing data processing device and the gene sequencing data processing method split the analysis method (i.e. the analysis process) of the gene sequencing data and run the parts on the CPU module and the GPU module respectively according to their characteristics, which greatly improves the efficiency of gene sequencing data analysis.
- the gene sequencing data processing device can be provided with multiple CPU modules and GPU modules, and multiple GPU modules can simultaneously calculate short sequences of different lengths, which can solve the problem of low GPU parallel efficiency.
- The CPU module in the idle state is further configured to scan each GPU module, determine the number of GPU modules in the idle state and the data processing capacity of each GPU module in the idle state, and read the gene sequencing data in batches according to that number and those capacities.
- When the CPU module in the idle state starts the gene analysis, it can scan the GPU modules to determine how many GPU modules are currently available and the data processing capacity of each, thereby determining the amount of gene sequencing data to read in this batch, and then read the gene sequencing data according to this amount.
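The batch-sizing rule described above can be sketched as follows. This is only an illustration: the `GpuModule` record, the `capacity_reads` field and the rule "batch size equals the summed capacity of the idle modules" are assumptions made for the example, since the text does not say how the two quantities are combined.

```python
from dataclasses import dataclass


@dataclass
class GpuModule:
    name: str
    idle: bool
    capacity_reads: int  # hypothetical: reads this module can accept per batch


def plan_batch(gpus):
    """Scan the GPU modules and return (idle modules, reads to load next)."""
    idle = [g for g in gpus if g.idle]
    # Assumed rule: load as many reads as the idle modules can take together.
    return idle, sum(g.capacity_reads for g in idle)


gpus = [
    GpuModule("gpu0", idle=True, capacity_reads=4000),
    GpuModule("gpu1", idle=False, capacity_reads=4000),
    GpuModule("gpu2", idle=True, capacity_reads=2500),
]
idle, batch_size = plan_batch(gpus)
print([g.name for g in idle], batch_size)
```

A busy module contributes nothing to the batch, so the amount read shrinks automatically when fewer GPU modules are available.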
- Time T2: the split short sequences are transmitted to one of the idle GPU modules through the PCI bus, and the CPU module can then process the next batch of data Di+1, forming a 2-stage pipeline.
- Time T3: once the data Di has been transferred to the video memory of the GPU, the second algorithm can be started on the GPU. At this time, Di+1 enters the PCI transmission stage and the CPU module processes the next batch of data Di+2, forming a 3-stage pipeline.
- Time T4: the calculation on Di is completed and the calculation results are sent back to the CPU module through PCI. At this time, Di+1 enters the GPU calculation stage, Di+2 enters the PCI input stage, and Di+3 is processed by the CPU module, forming a 4-stage pipeline.
- Time T5: after the calculation result of Di is returned, the CPU module uses the first algorithm to complete the subsequent stage of the alignment algorithm. At this time, a 5-stage pipeline is formed.
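The timing above can be tabulated with a small sketch. The stage names are paraphrases chosen for the example, and it is assumed (inferred from the T2 entry, not stated here) that at time T1 the CPU module segments Di.

```python
# Five pipeline stages in the order a batch passes through them.
STAGES = ["CPU split", "PCI to GPU", "GPU compute", "PCI to CPU", "CPU finish"]


def stage_of(batch, t):
    """Stage of batch D(i+batch) at time step T_t (t starts at 1), or None."""
    k = t - 1 - batch  # each later batch lags one time step behind
    return STAGES[k] if 0 <= k < len(STAGES) else None


def snapshot(t, n_batches):
    """Map every batch that is in flight at time T_t to its current stage."""
    return {
        f"D(i+{b})": stage_of(b, t)
        for b in range(n_batches)
        if stage_of(b, t) is not None
    }


for t in range(1, 6):
    print(f"T{t}", snapshot(t, 5))
```

From T5 onward all five stages are occupied at once, which is the 5-stage pipeline the text describes; note that two of those stages run on CPU modules.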
- the gene analysis algorithm includes a gene alignment algorithm, a Dotplot algorithm, a blast algorithm, a PAM algorithm, an HMM algorithm, and an AI inference algorithm.
- The Dotplot algorithm and the blast algorithm are sequence alignment algorithms.
- The PAM algorithm is a data mining clustering algorithm that can be used in single-cell sequencing to analyze cell subsets.
- The HMM (hidden Markov) algorithm is based on a statistical model that describes a Markov process with hidden unknown parameters; it can be used in the prediction of target genes.
- An AI inference algorithm (for example DeepVariant) is a deep learning algorithm that can be used to identify genetic mutations and the like.
- The AI inference algorithm may be an inference algorithm related to CNN (Convolutional Neural Network) or RNN (Recurrent Neural Network).
- When the gene analysis algorithm is the Dotplot algorithm, the blast algorithm or the PAM algorithm, the algorithm usually needs to be ported to CUDA first. Porting the algorithm to CUDA makes the method better suited to running on the gene sequencing data processing device in the embodiment of the present invention.
- The gene alignment algorithm includes the BWT algorithm, and the first algorithm includes the anchor point cutting algorithm. The CPU module in the idle state is further configured to use the anchor point cutting algorithm to locate anchor points in the batched gene sequencing data, extend N bp forward and N bp backward from each anchor point as the center, and use NEON instructions to cut the batched gene sequencing data into segments of 2N+1 bp to obtain each short sequence, where N is any positive integer.
- The step of obtaining each short sequence includes calculating each short sequence using the following formula:
- $(2N + 1) \cdot x < L$
- x represents the number of anchor points
- N represents the number of extended bp
- L represents the length of the batched gene sequencing data.
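A minimal sketch of the anchor point cutting step, checking the constraint (2N+1)*x < L stated in the claims. The function name, the way anchor positions are supplied and the error handling are assumptions made for illustration, and plain string slicing stands in for the NEON instructions used on the ARM CPU.

```python
def cut_around_anchors(read, anchors, n):
    """Cut a window of 2*n + 1 bases centred on each anchor position."""
    width = 2 * n + 1
    # Constraint from the text: (2N + 1) * x < L.
    if width * len(anchors) >= len(read):
        raise ValueError("(2N+1)*x must be smaller than L")
    pieces = []
    for a in anchors:
        if a - n < 0 or a + n >= len(read):
            raise ValueError(f"anchor {a} is too close to an end of the read")
        pieces.append(read[a - n : a + n + 1])
    return pieces


read = "ACGTACGTTAGCCGATAGGCTTAACG"  # L = 26
short = cut_around_anchors(read, anchors=[4, 12, 20], n=2)
print(short)
```

Every short sequence has the same length 2N+1, which is convenient for fixed-width SIMD copies on the CPU and for uniform per-thread work on the GPU.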
- the gene alignment algorithm may be the BWT algorithm
- the first algorithm may be the anchor point cutting algorithm and the BWT matrix transformation algorithm
- the second algorithm may be the Hash algorithm.
- The specific process is as follows. As shown in Figure 3, the CPU module in the idle state uses the first algorithm (i.e. the anchor point cutting algorithm) to process the data Di. First, the gene sequencing data (i.e. reads) of length L are read in batches.
- An anchor point is then located and extended by N bp forward and backward to obtain a short read of length 2N+1, and NEON instructions are used to cut and copy the read of length 2N+1.
- The number of anchor points is x.
- The number of anchor points x and the extension N are related by the following formula:
- $(2N + 1) \cdot x < L$
- x represents the number of anchor points
- N represents the number of extended bp
- L represents the length of the batched gene sequencing data.
- the second algorithm is a Hash algorithm
- The GPU module in the idle state is also used to perform a Hash operation on each short sequence according to the Hash algorithm, obtain a Hash calculation result, and send the Hash calculation result to the CPU module in the idle state.
- The Hash calculation result is the value of the BWT algorithm matrix and is used in the calculation of the BWT algorithm matrix.
- The x*K short sequences calculated by the first algorithm are transferred to the video memory of the GPU module in the idle state, where K represents the number of Di; the number of short sequences is positively correlated with the total video memory of the multiple GPU modules.
- The Hash algorithm suits the SIMT architecture of the GPU. A GPU kernel function performs the hash calculation on multiple short sequences to obtain the Hash calculation result, which is sent to the CPU module in the idle state. The Hash calculation result is the value of the BWT algorithm matrix and is used in the calculation of that matrix. Compared with other traditional calculations (such as the k-mer site calculation algorithm), using the Hash algorithm can greatly save memory space.
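The text does not specify which hash function is used, so the following is only a stand-in: a 2-bit-per-base polynomial hash, with a table size chosen arbitrarily for the example. It illustrates why the step maps well to SIMT hardware: each short sequence is hashed independently, so on a GPU one thread could handle one sequence.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def hash_short_sequence(seq, table_bits=20):
    """Hypothetical hash: read the bases as base-4 digits, keep the low bits."""
    mask = (1 << table_bits) - 1
    h = 0
    for base in seq:
        h = (h * 4 + BASE_CODE[base]) & mask
    return h


def hash_batch(short_sequences):
    """CPU stand-in for the GPU kernel: one independent hash per sequence."""
    return [hash_short_sequence(s) for s in short_sequences]


print(hash_batch(["GTACG", "GCCGA", "GCTTA"]))
```

Because no iteration depends on another sequence, the list comprehension in `hash_batch` corresponds to the data-parallel loop a kernel launch would replace.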
- the first algorithm further includes a BWT matrix transformation algorithm; the CPU module in the idle state is further configured to use the BWT matrix transformation algorithm to transform the BWT algorithm matrix to obtain a BWT transformation result of a short sequence.
- the gene alignment algorithm may be a BWT algorithm
- the first algorithm may be an anchor point cutting algorithm and a BWT matrix transformation algorithm.
- After the GPU module sends the Hash calculation result to the CPU module in the idle state, the CPU module uses the Hash calculation result as the value of the BWT algorithm matrix for the calculation of that matrix, and then uses the BWT matrix transformation algorithm to transform the BWT algorithm matrix to obtain the BWT transformation result of the short sequence.
- h represents the Hash calculation result
- Y represents the BWT algorithm matrix
- r represents the short sequence.
- The method can quickly and accurately obtain the BWT transformation result of the short sequence, so that the compression of the gene sequencing data can be completed quickly and subsequent processing is more convenient.
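For orientation, the textbook Burrows-Wheeler transform of a short sequence can be written as below. This is the standard sorted-rotations construction, not the hash-assisted matrix calculation described here, and the `$` terminator is a common convention assumed for the example.

```python
def bwt(seq, end="$"):
    """Textbook BWT: sort all rotations of seq + end, take the last column."""
    s = seq + end
    # The sorted rotations form the rows of the BWT matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("ACGT"))
print(bwt("banana"))
```

The transform tends to group identical characters together, which is why it is useful for compressing and indexing sequencing data.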
- the alignment algorithm includes a Smith-Waterman algorithm
- the second algorithm includes a scoring matrix algorithm
- The GPU module in the idle state is further configured to calculate the Smith-Waterman scoring matrix according to the scoring matrix algorithm, each short sequence and the reference species sequence, and to send the Smith-Waterman scoring matrix to the CPU module in the idle state.
- The step of calculating the Smith-Waterman scoring matrix includes:
- M represents the Smith-Waterman scoring matrix
- R represents the length of the candidate interval sequence of the reference species
- C represents the length of the short sequence formed by screening and splicing each short sequence received from the CPU module in the idle state
- L represents the length of the short sequence.
- a and b represent constants.
- The traditional Smith-Waterman algorithm runs relatively inefficiently on a GPU and cannot be used directly in the gene sequencing data processing device of the embodiment of the present invention, so the Smith-Waterman algorithm is improved.
- The Smith-Waterman algorithm has a scoring matrix of size R*C; if the step of calculating the scoring matrix is placed in the GPU module, the second algorithm is the scoring matrix algorithm.
- M represents the Smith-Waterman scoring matrix
- R represents the length of the candidate interval sequence of the reference species
- C represents the length of the short sequence formed by screening and splicing each short sequence received from the CPU module in the idle state
- L represents the length of the batched gene sequencing data, and a and b represent constants.
- The length of C is related to the Hash calculation result calculated by the GPU module in the BWT algorithm. In this way, the traditional Smith-Waterman algorithm is adapted to run on the GPU with high efficiency.
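As a point of reference, the classic (unimproved) Smith-Waterman scoring matrix for a reference of length R and a query of length C can be filled as in the sketch below. The match, mismatch and gap scores are arbitrary example values, and this is the textbook recurrence rather than the improved variant described here.

```python
def sw_matrix(ref, query, match=2, mismatch=-1, gap=-2):
    """Fill the (R+1) x (C+1) local-alignment matrix; return (matrix, best)."""
    rows, cols = len(ref) + 1, len(query) + 1
    m = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if ref[i - 1] == query[j - 1] else mismatch
            m[i][j] = max(
                0,                    # start a new local alignment
                m[i - 1][j - 1] + step,  # align the two bases
                m[i - 1][j] + gap,    # gap in the query
                m[i][j - 1] + gap,    # gap in the reference
            )
            best = max(best, m[i][j])
    return m, best


matrix, best = sw_matrix("GGACGTCC", "ACGT")
print(best)
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal are independent of each other; GPU implementations of this recurrence commonly exploit that property.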
- an embodiment of the present invention further provides a gene sequencing data processing method.
- As shown in Figure 5, the gene sequencing data processing method is applied to the gene sequencing data processing device and comprises the following steps:
- Step S1: the CPU module in the idle state reads the gene sequencing data in batches to obtain the batched gene sequencing data;
- Step S2: the CPU module in the idle state divides the gene analysis method to obtain the first algorithm and the second algorithm;
- Step S3: the CPU module in the idle state segments the batched gene sequencing data according to the first algorithm to obtain each short sequence, and sends each short sequence and the second algorithm to the GPU module in the idle state;
- Step S4: the GPU module in the idle state calculates each short sequence according to the second algorithm, and sends the calculation result to the CPU module in the idle state;
- Step S5: the CPU module in the idle state obtains the batch processing result according to the calculation result and the first algorithm;
- Steps S1 to S5 are repeated until the processing of the gene sequencing data is completed, and the CPU module in the idle state integrates the batch processing results to obtain the final processing result.
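The control flow of steps S1 to S5 can be sketched as a plain sequential loop. Everything concrete here is a placeholder chosen so the example runs: fixed-width chunking stands in for the first algorithm, GC counting stands in for the second algorithm, and summation stands in for the integration step. No real CPU/GPU overlap is modelled.

```python
def read_batches(reads, batch_size):
    """S1: yield the sequencing data batch by batch."""
    for i in range(0, len(reads), batch_size):
        yield reads[i : i + batch_size]


def first_algorithm_split(batch, width=4):
    """S3 (CPU side, placeholder): cut every read into short sequences."""
    return [r[i : i + width] for r in batch for i in range(0, len(r), width)]


def second_algorithm(short_sequences):
    """S4 (GPU side, placeholder): one independent result per short sequence."""
    return [sum(base in "GC" for base in s) for s in short_sequences]


def first_algorithm_finish(results):
    """S5 (CPU side, placeholder): combine the results of one batch."""
    return sum(results)


def process(reads, batch_size):
    batch_results = []
    for batch in read_batches(reads, batch_size):        # S1
        shorts = first_algorithm_split(batch)            # S3
        computed = second_algorithm(shorts)              # S4
        batch_results.append(first_algorithm_finish(computed))  # S5
    return sum(batch_results)                            # final integration


print(process(["ACGTGGCC", "TTAA", "GCGCGC"], batch_size=2))
```

Step S2, dividing the analysis method into the two algorithms, is represented here simply by the existence of separate CPU-side and GPU-side functions.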
- the amount of gene sequencing data is relatively large, and the data can be analyzed and processed in batches, thereby avoiding data transmission congestion and the like.
- The gene sequencing data read in the i-th batch by the CPU module in the idle state is denoted Di.
- The CPU module in the idle state reads the gene sequencing data Di and divides the gene analysis method to obtain the first algorithm and the second algorithm; it segments Di according to the first algorithm to obtain each short sequence and sends each short sequence and the second algorithm to the GPU module in the idle state. The GPU module in the idle state then calculates each short sequence according to the second algorithm and sends the calculation result to the CPU module in the idle state, which obtains the batch processing result from the calculation result and the first algorithm.
- In addition, the CPU module in the idle state reads the gene sequencing data Di+1 (the data read in batch i+1), segments it, and sends the corresponding short sequences to the GPU module in the idle state; the GPU module in the idle state processes these short sequences and sends the processing results to the CPU module in the idle state.
- The CPU module in the idle state and the GPU module in the idle state continuously read, segment, transmit, calculate and return the gene sequencing data (that is, repeat steps S1 to S5) until all the gene sequencing data is processed; in this process a pipeline is formed between the CPU module in the idle state and the GPU module in the idle state.
- The CPU module in the idle state scans each GPU module, determines the number of GPU modules in the idle state and the data processing capacity of each GPU module in the idle state, and reads the gene sequencing data in batches according to that number and those capacities.
- the gene analysis algorithm includes a gene alignment algorithm, a Dotplot algorithm, a blast algorithm, a PAM algorithm, an HMM algorithm, and an AI inference algorithm.
- The gene alignment algorithm includes a BWT algorithm, and the first algorithm includes an anchor point cutting algorithm. The CPU module in the idle state uses the anchor point cutting algorithm to locate anchor points in the batched gene sequencing data, extends N bp forward and N bp backward from each anchor point as the center, and uses NEON instructions to cut the batched gene sequencing data into segments of 2N+1 bp to obtain each short sequence, where N is any positive integer.
- The step of obtaining each short sequence includes calculating each short sequence using the following formula:
- $(2N + 1) \cdot x < L$
- x represents the number of anchor points
- N represents the number of extended bp
- L represents the length of the batched gene sequencing data.
- the second algorithm is a Hash algorithm
- The GPU module in the idle state is also used to perform a Hash operation on each short sequence according to the Hash algorithm, obtain a Hash calculation result, and send the Hash calculation result to the CPU module in the idle state;
- The Hash calculation result is the value of the BWT algorithm matrix and is used in the calculation of the BWT algorithm matrix.
- the first algorithm further includes a BWT matrix transformation algorithm; the CPU module in the idle state transforms the BWT algorithm matrix by using the BWT matrix transformation algorithm to obtain a BWT transformation result of a short sequence.
- the alignment algorithm includes a Smith-Waterman algorithm
- the second algorithm includes a scoring matrix algorithm
- The GPU module in the idle state is further configured to calculate the Smith-Waterman scoring matrix according to the scoring matrix algorithm, each short sequence and the reference species sequence, and to send the Smith-Waterman scoring matrix to the CPU module in the idle state.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Claims (10)
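- A gene sequencing data processing method, applied to a gene sequencing data processing device, characterized in that the gene sequencing data processing device is a heterogeneous multi-core framework comprising an ARM framework, a GPU framework and a PCI bus; the ARM framework is connected to the GPU framework through the PCI bus; the ARM framework comprises at least one CPU module; the GPU framework comprises at least one GPU module; the method comprises the following steps: Step S1: the CPU module in an idle state reads gene sequencing data in batches to obtain batched gene sequencing data; Step S2: the CPU module in the idle state divides a gene analysis method to obtain a first algorithm and a second algorithm; Step S3: the CPU module in the idle state segments the batched gene sequencing data according to the first algorithm to obtain short sequences, and sends the short sequences and the second algorithm to the GPU module in an idle state; Step S4: the GPU module in the idle state performs calculation on the short sequences according to the second algorithm, and sends the calculation result to the CPU module in the idle state; Step S5: the CPU module in the idle state obtains a batch processing result according to the calculation result and the first algorithm; steps S1 to S5 are repeated until processing of the gene sequencing data is completed, and the CPU module in the idle state integrates the batch processing results to obtain a final processing result.
- The gene sequencing data processing method according to claim 1, characterized in that the CPU module in the idle state scans each GPU module, determines the number of GPU modules in the idle state and the data processing capacity of each GPU module in the idle state, and reads the gene sequencing data in batches according to the number of GPU modules in the idle state and each data processing capacity.
- The gene sequencing data processing method according to claim 1, characterized in that the gene analysis algorithm includes a gene alignment algorithm, a Dotplot algorithm, a blast algorithm, a PAM algorithm, an HMM algorithm and an AI inference algorithm.
- The gene sequencing data processing method according to claim 3, characterized in that the gene alignment algorithm includes a BWT algorithm and the first algorithm includes an anchor point cutting algorithm; the CPU module in the idle state uses the anchor point cutting algorithm to locate anchor points in the batched gene sequencing data, extends N bp forward and N bp backward from each anchor point as the center, and uses NEON instructions to cut the batched gene sequencing data into segments of 2N+1 bp to obtain the short sequences, where N is any positive integer.
- The gene sequencing data processing method according to claim 4, characterized in that the step of obtaining the short sequences includes calculating the short sequences using the following formula: (2*N+1)*x<L, where x represents the number of anchor points, N represents the number of extended bp, and L represents the length of the batched gene sequencing data.
- The gene sequencing data processing method according to claim 3 or 4, characterized in that the second algorithm is a Hash algorithm; the GPU module in the idle state performs a Hash operation on each short sequence according to the Hash algorithm to obtain a Hash calculation result, and sends the Hash calculation result to the CPU module in the idle state; wherein the Hash calculation result is the value of the BWT algorithm matrix and is used in the calculation of the BWT algorithm matrix.
- The gene sequencing data processing method according to claim 6, characterized in that the first algorithm further includes a BWT matrix transformation algorithm; the CPU module in the idle state uses the BWT matrix transformation algorithm to transform the BWT algorithm matrix to obtain the BWT transformation result of the short sequence.
- The gene sequencing data processing method according to claim 3, characterized in that the alignment algorithm includes a Smith-Waterman algorithm and the second algorithm includes a scoring matrix algorithm; the GPU module in the idle state calculates a Smith-Waterman scoring matrix according to the scoring matrix algorithm, the short sequences and a reference species sequence, and sends the Smith-Waterman scoring matrix to the CPU module in the idle state.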
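- The gene sequencing data processing method according to claim 8, characterized in that the step of calculating the Smith-Waterman scoring matrix includes calculating the Smith-Waterman scoring matrix using the following formulas: M=R*C, R=a*L^2+b, where M represents the Smith-Waterman scoring matrix, R is the length of the candidate interval sequence of the reference species, C represents the length of the short sequence formed by screening and splicing the short sequences received from the CPU module in the idle state, L represents the length of the batched gene sequencing data, and a and b represent constants.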
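- A gene sequencing data processing device, characterized in that the gene sequencing data processing device performs the gene sequencing data processing method according to any one of claims 1 to 9.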
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021571845A JP7393439B2 (ja) | 2020-10-22 | 2020-11-06 | Gene sequencing data processing method and gene sequencing data processing device |
AU2020450960A AU2020450960A1 (en) | 2020-10-22 | 2020-11-06 | Method for processing gene sequencing data and apparatus for processing gene sequencing data |
EP20937176.4A EP4235678A1 (en) | 2020-10-22 | 2020-11-06 | Gene sequencing data processing method and gene sequencing data processing device |
IL288594A IL288594A (en) | 2020-10-22 | 2021-12-01 | Method and apparatus for processing gene sequence data |
AU2023266239A AU2023266239A1 (en) | 2020-10-22 | 2023-11-13 | Method for processing gene sequencing data and apparatus for processing gene sequencing data |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011139823.4 | 2020-10-22 | ||
CN202011139823.4A CN112259168B (zh) | 2020-10-22 | 2020-10-22 | Gene sequencing data processing method and gene sequencing data processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022082879A1 (zh) | 2022-04-28 |
Family
ID=74264788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/127101 WO2022082879A1 (zh) | 2020-10-22 | 2020-11-06 | Gene sequencing data processing method and gene sequencing data processing device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112259168B (zh) |
WO (1) | WO2022082879A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116932631A (zh) * | 2023-07-18 | 2023-10-24 | 哈尔滨晨文科技开发有限公司 | Big-data-based detection data visualization management system and method |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
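CN113299344A (zh) * | 2021-06-23 | 2021-08-24 | 深圳华大医学检验实验室 | Gene sequencing analysis method, apparatus, storage medium and computer device |
TWI819480B (zh) | 2022-01-27 | 2023-10-21 | 緯創資通股份有限公司 | Acceleration system and dynamic configuration method thereof |
CN114328399B (zh) * | 2022-03-15 | 2022-05-24 | 四川大学华西医院 | Method and system for automatic pairing of multi-sample gene sequencing data files |
CN116594745A (zh) * | 2023-05-11 | 2023-08-15 | 阿里巴巴达摩院(杭州)科技有限公司 | Task execution method, system, chip and electronic device |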
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
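CN104239732A (zh) * | 2014-09-24 | 2014-12-24 | 湖南大学 | Parallel general-purpose sequence alignment method running on a multi-core computer platform |
CN104504303A (zh) * | 2014-09-29 | 2015-04-08 | 肇庆学院 | Sequence alignment method based on a CPU+GPU heterogeneous system |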
EP3428798A1 (en) * | 2016-04-08 | 2019-01-16 | Huawei Technologies Co., Ltd. | Resource allocation method and device for genetic analysis |
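CN110135584A (zh) * | 2019-03-30 | 2019-08-16 | 华南理工大学 | Large-scale symbolic regression method and system based on an adaptive parallel genetic algorithm |
CN110473593A (zh) * | 2019-07-25 | 2019-11-19 | 深圳大学 | FPGA-based method and apparatus for implementing the Smith-Waterman algorithm |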
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
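CN103279445A (zh) * | 2012-09-26 | 2013-09-04 | 上海中科高等研究院 | Computing method for computing tasks and supercomputing system |
CN106295250B (zh) * | 2016-07-28 | 2019-03-29 | 北京百迈客医学检验所有限公司 | Method and apparatus for rapid alignment analysis of second-generation sequencing short reads |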
WO2020124275A1 (en) * | 2018-12-21 | 2020-06-25 | Huawei Technologies Co., Ltd. | Method, system, and computing device for optimizing computing operations of gene sequencing system |
CN110427262B (zh) * | 2019-09-26 | 2020-05-15 | 深圳华大基因科技服务有限公司 | Gene data analysis method and heterogeneous scheduling platform |
2020
- 2020-10-22 CN CN202011139823.4A patent/CN112259168B/zh Active
- 2020-11-06 WO PCT/CN2020/127101 patent/WO2022082879A1/zh Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN112259168A (zh) | 2021-01-22 |
CN112259168B (zh) | 2023-03-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022082879A1 (zh) | Gene sequencing data processing method and gene sequencing data processing device | |
Nobile et al. | Graphics processing units in bioinformatics, computational biology and systems biology | |
CN107563150B (zh) | Method, apparatus, device and storage medium for predicting protein binding sites | |
Ng et al. | Reconfigurable acceleration of genetic sequence alignment: A survey of two decades of efforts | |
Sadasivan et al. | Accelerating Minimap2 for accurate long read alignment on GPUs | |
Wu et al. | FPGA accelerated INDEL realignment in the cloud | |
Du et al. | A tile-based parallel Viterbi algorithm for biological sequence alignment on GPU with CUDA | |
Chen et al. | A hybrid short read mapping accelerator | |
Du et al. | Deepadd: protein function prediction from k-mer embedding and additional features | |
Houtgast et al. | An efficient GPU-accelerated implementation of genomic short read mapping with BWA-MEM | |
Aguado-Puig et al. | Accelerating edit-distance sequence alignment on GPU using the wavefront algorithm | |
Ng et al. | Acceleration of short read alignment with runtime reconfiguration | |
Soto et al. | JACC-FPGA: A hardware accelerator for Jaccard similarity estimation using FPGAs in the cloud | |
WO2021113779A1 (en) | Rapid detection of gene fusions | |
Yin et al. | Improving the prediction of DNA-protein binding by integrating multi-scale dense convolutional network with fault-tolerant coding | |
JP7393439B2 (ja) | Gene sequencing data processing method and gene sequencing data processing device | |
RU2799005C2 (ru) | Method for processing gene sequencing data and device for processing gene sequencing data | |
CN114999566A (zh) | Drug repositioning method and system based on word-vector representation and attention mechanism | |
Hazelhurst | Algorithms for clustering expressed sequence tags: the wcd tool: reviewed article | |
Gudur et al. | Hardware-algorithm codesign for fast and energy efficient approximate string matching on FPGA for computational biology | |
Anderson et al. | An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models | |
Nasrin et al. | PSALR: Parallel Sequence Alignment for long Sequence Read with Hash model | |
Kieu-Do-Nguyen et al. | High-Performance FPGA-Based BWA-MEM Accelerator | |
Kawam et al. | A GPU-CPU heterogeneous algorithm for NGS read alignment | |
Surendar et al. | Micro Sequence Identification of DNA Data Using Pattern Mining Techniques |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2021571845; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2020450960; Country of ref document: AU; Date of ref document: 20201106; Kind code of ref document: A |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20937176; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020937176; Country of ref document: EP; Effective date: 20230522 |
| | WWE | Wipo information: entry into national phase | Ref document number: 521431052; Country of ref document: SA |