CN111326216B - Rapid partitioning method for big data gene sequencing file - Google Patents
- Publication number
- CN111326216B (application CN202010122470.0A)
- Authority
- CN
- China
- Prior art keywords
- file
- node
- processed
- fastq
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The invention relates to the field of high-performance computing, and in particular to a rapid partitioning method for big data gene sequencing files. In a multi-node gene analysis workflow, the method does not physically split the sequencing file and generates no subfiles, and it provides a flexible partitioning scheme suited to the subsequent analysis program, so that the loads of all nodes are better balanced, hard disk reads and writes are reduced, and partitioning efficiency is improved.
Description
Technical Field
The invention relates to the field of high-performance computing, and in particular to a rapid partitioning method for big data gene sequencing files.
Background
With the rapid development of the health sector, genetic analysis technology plays an increasingly important role. Gene sequencers produce large numbers of sequencing files, most commonly in the fastq format. A single sequencing file ranges from a few gigabytes to tens or even hundreds of gigabytes. Processing such large data quickly has increasingly become the bottleneck of genetic analysis.
Since the sequencing file is large, it takes a lot of time to perform analysis processing with a single node, and thus multiple nodes are required to perform parallel calculation to reduce the time for gene analysis. This requires the division of the sequencing file, each node processes only a portion of the sequencing file, and the processing results are finally combined, thereby obtaining complete results of the genetic analysis in a short time.
When multiple nodes process a sequencing file, the common partitioning method divides the file equally by the number of nodes, generates the corresponding subfiles, and writes them to the hard disk; each node then reads its own subfile for processing. This method is simple, but it increases the read/write load on the hard disk.
Moreover, the common partitioning method may affect the results of the subsequent program. In sequencing analysis, sequence alignment software such as bwa and bowtie is generally used for alignment. The bwa program, for example, reads its input in blocks and processes one block of the fastq file at a time. Because the conventional partitioning method does not take this into account, the bwa results are affected and the alignment results can easily become inconsistent.
Disclosure of Invention
The invention provides a rapid partitioning method for big data gene sequencing files, which comprises the following steps:
step 101, setting the size of a file block;
step 102, analyzing and counting fastq files according to the file block size set in step 101, and dividing the fastq files into a plurality of file blocks; storing the position information of each file block and the total number of the file blocks into an information file;
step 103, calculating the number of file blocks to be processed by each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed by each node in the fastq file;
step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed by each node in the fastq file, which are determined in step 103, and providing the reading instruction for a subsequent program in a pipeline mode.
Preferably, step 102 in the above method further comprises: if the ending position of a file block is in the middle of a sequence, the file block is extended to the end of the sequence.
Preferably, the position information of the file block in the above method includes a start position and an end position of the file block.
Preferably, the file block size in the above method may have a value ranging from 1M to 100M.
Preferably, the number of file blocks that each node needs to process in step 103 in the above method is calculated according to the following formula:
$$B_i = B_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$
wherein $B_i$ is the number of file blocks processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
According to another aspect of the present invention, there is provided a rapid partitioning method for big data gene sequencing files, comprising:
step 201, analyzing and counting fastq files according to sequences to obtain position information of each sequence and total number of the sequences;
step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of a file part to be processed of each node in a fastq file;
step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the fastq file, which are obtained in step 202, and providing the read instruction to a subsequent program in a pipeline mode.
Preferably, the position information of the sequence of the above method includes a start position and an end position of the sequence.
Preferably, the number of sequences that each node needs to process in step 202 of the above method is calculated according to the following formula:
$$S_i = S_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$
wherein $S_i$ is the number of sequences processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements any of the methods described above.
A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements any of the methods described above when executing the program.
To address the shortcomings of the prior art, the invention adopts a lazy partitioning strategy for fastq files and generates no subfiles, thereby avoiding the writing, reading and storage of subfiles. It also provides several partitioning modes to suit the subsequent analysis software. The method reduces the number of hard disk reads and writes, speeds up file partitioning, and eliminates alignment errors.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a partitioning method by block according to one embodiment of the invention.
FIG. 2 is a flow diagram of a method of partitioning by sequence according to one embodiment of the invention.
Detailed Description
Before describing the method in detail, the format of the fastq file will be briefly described. The fastq file is a text file, one sequence every four lines, the first line is name information of the sequence, the second line is a base sequence, the third line is explanatory information, and the fourth line is quality score information of the sequence. The length of each sequence is not exactly the same. The sequencing files are divided into single-ended sequencing files and double-ended sequencing files, wherein the single-ended sequencing file only comprises one file, the double-ended sequencing file comprises a pair of files, and each sequence in the pair of files corresponds to each other.
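The four-line record layout described above can be illustrated with a minimal Python sketch. The names `FastqRecord` and `read_fastq` are hypothetical and do not come from the patent, and the sketch assumes that no record wraps its base or quality line across several lines.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: bytes      # line 1: name information of the sequence
    bases: bytes     # line 2: base sequence
    comment: bytes   # line 3: explanatory information
    quality: bytes   # line 4: quality score information


def read_fastq(handle: BinaryIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines (assumes unwrapped records)."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:          # clean end of file
            return
        if not lines[3]:          # file ended in the middle of a record
            raise ValueError("truncated FASTQ record")
        yield FastqRecord(*(line.rstrip(b"\r\n") for line in lines))


# Two records of different lengths, as the text notes is usual.
sample = io.BytesIO(b"@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n")
records = list(read_fastq(sample))
```

Because record lengths vary, a byte offset chosen arbitrarily in the file will usually fall inside a record, which is why the partitioning steps below align their boundaries to record ends.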
According to one embodiment of the present invention, a block-wise partitioning method is described in connection with fig. 1, which method comprises the following steps.
Step 101, setting the file block size, preferably to a value between 1M and 100M.
The inventors have found that the gene sequencing analysis tool bwa reads sequencing files block by block during alignment, and that partitioning fastq files by block therefore helps balance the load when bwa is run on multiple nodes to align fastq files in parallel. The file block size may be chosen according to the analysis software and the processing capacity of the nodes; preferably it lies between 1M and 100M. In one embodiment of the present invention, a file block size of 10M gives a good processing speed and load-balancing effect.
Step 102, analyzing the fastq file according to the block size set in the previous step and dividing it into a plurality of file blocks; the analysis results are saved in the information file.
When the fastq file is analyzed by block size, take a block size of 10M as an example. Starting from the beginning of the file, the position is advanced by an offset of 10 Mbytes. If the byte at that offset is exactly the end of a sequence, the start position of the first file block is set to 0 and its end position to 10M. If the byte lies in the middle of a sequence, the end position of the first file block is set to the end position of that sequence. Once the first file block has been located, its start and end positions are stored in the information file. The first file block is therefore greater than or equal to 10M in size and contains only complete sequence data.
The byte immediately after the end position of the first file block is then taken as the start position of the second file block, and the position is again advanced by 10 Mbytes. If the byte reached is exactly the end of a sequence, its position is set as the end position of the second block; if it lies at the beginning or in the middle of a sequence, the end position of the second block is set to the end position of that sequence. The analysis continues in this way to the end of the fastq file, until the start and end positions of all file blocks have been found and stored in the information file. According to one embodiment of the invention, only the start positions need be stored in the information file.
The last file block may be smaller than 10M; the sizes of the other file blocks differ slightly from one another, floating around 10M. Each file block contains a number of complete sequences, and each sequence belongs to exactly one file block. According to one embodiment of the invention, the number of file blocks is also accumulated during the analysis and stored in the information file.
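The block analysis of step 102 can be sketched roughly as follows. This is an illustrative simplification rather than the patented implementation: it scans the file record by record instead of jumping ahead by the block size, it uses half-open `(start, end)` byte ranges (end exclusive) where the text uses inclusive end positions, and the function name `index_blocks` and the layout of the `info` dictionary are assumptions, since the text does not specify the format of the information file.

```python
import io
from typing import BinaryIO, List, Tuple


def index_blocks(handle: BinaryIO, block_size: int) -> List[Tuple[int, int]]:
    """Return record-aligned (start, end) byte ranges, end exclusive.

    A block is closed at the first record end at or beyond `block_size`
    bytes from its start, so no record is ever split across two blocks.
    """
    blocks = []
    start = pos = 0
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:                      # end of file
            break
        pos += sum(len(line) for line in record)
        if pos - start >= block_size:          # block is full: close it here
            blocks.append((start, pos))
            start = pos
    if pos > start:                            # final, possibly smaller block
        blocks.append((start, pos))
    return blocks


# Five 16-byte records and a deliberately tiny 40-byte block size.
sample = io.BytesIO(b"".join(b"@r%d\nACGT\n+\nIIII\n" % i for i in range(1, 6)))
blocks = index_blocks(sample, block_size=40)

# Hypothetical in-memory stand-in for the information file.
info = {"total_blocks": len(blocks), "blocks": blocks}
```

In this toy run the first block grows to 48 bytes (three whole records) because 40 bytes falls inside the third record, and the last block holds the remaining 32 bytes, mirroring the "at least the block size, except the last" behaviour described above.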
For double-ended sequencing files, if the two files are the same size, the block-wise analysis is performed on one of the files; if the sizes differ, the two files are divided using the sequence-based partitioning method of the invention.
Step 103, calculating the number of file blocks to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed of each node in the fastq file according to the statistical information obtained in step 102.
Specifically, in a multi-node genetic analysis workflow the number of cores and the computing power of the computing nodes differ, so when the sequencing file is divided, the processing range of each node is determined with these differences taken into account, which balances the load better.
According to one embodiment of the invention, the number of file blocks processed by each node is calculated according to equation 1:
$$B_i = B_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j} \qquad (1)$$
wherein $B_i$ is the number of file blocks processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
After the number of file blocks to be processed of each node is calculated, the starting position and the ending position of the file part to be processed of each node in the original sequencing file can be determined through the information file.
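Step 103 can be sketched as below, assuming that equation 1 is the core-proportional share B_i = B_t · c_i / Σ c_j suggested by its symbol definitions. The text does not say how fractional shares are rounded, so the largest-remainder rounding used here is an assumption, as are the function names `allocate` and `node_ranges` and the half-open `(start, end)` block ranges.

```python
from typing import List, Optional, Sequence, Tuple


def allocate(total: int, cores: Sequence[int]) -> List[int]:
    """Split `total` units across nodes in proportion to their core counts.

    Rounding rule (assumed, not from the source): take the floor of each
    share, then give the leftover units to the largest remainders.
    """
    total_cores = sum(cores)
    shares = [total * c // total_cores for c in cores]
    remainders = [total * c % total_cores for c in cores]
    leftover = total - sum(shares)
    by_remainder = sorted(range(len(cores)), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


def node_ranges(
    blocks: Sequence[Tuple[int, int]], cores: Sequence[int]
) -> List[Optional[Tuple[int, int]]]:
    """Map each node to one contiguous byte range covering its blocks."""
    ranges: List[Optional[Tuple[int, int]]] = []
    first = 0
    for count in allocate(len(blocks), cores):
        if count == 0:
            ranges.append(None)                # node receives no data
        else:
            ranges.append((blocks[first][0], blocks[first + count - 1][1]))
        first += count
    return ranges


# Eight 10-byte blocks shared by nodes with 8, 16 and 8 cores.
example_blocks = [(i * 10, (i + 1) * 10) for i in range(8)]
example_ranges = node_ranges(example_blocks, [8, 16, 8])
```

Because each node receives consecutive blocks, a single start and end position per node is enough, which is what allows the file to be read in place without creating subfiles.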
Step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed of each node in the original sequencing file, and providing the reading instruction for a subsequent program in a pipeline mode.
Pipes are well known in the art and are described here only briefly, taking the Linux system as an example. A Linux pipe connects multiple commands with the vertical bar "|", called the pipe symbol, so that the output of one command becomes the input of the next. The syntax of a Linux pipe is as follows:
command1|command2
In the present invention, command1 is the instruction that reads the specified range of the fastq file, and command2 is the invocation of a block-wise analysis tool such as bwa.
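The role of command1 can be sketched in Python as a function that streams one byte range of the fastq file into the standard input of the subsequent program, so that no subfile is written. The function name `pipe_range` and its parameters are hypothetical, the range is half-open (end exclusive), and error handling such as the downstream program exiting early is omitted.

```python
import subprocess
import sys
from typing import IO, List, Optional


def pipe_range(
    path: str,
    start: int,
    end: int,
    command: List[str],
    stdout: Optional[IO[bytes]] = None,
    chunk: int = 1 << 20,
) -> int:
    """Feed bytes [start, end) of `path` to `command` via a pipe.

    Returns the exit code of `command`. Its output goes to `stdout`
    (inherited from the caller when None).
    """
    with open(path, "rb") as src, subprocess.Popen(
        command, stdin=subprocess.PIPE, stdout=stdout
    ) as proc:
        src.seek(start)
        remaining = end - start
        while remaining > 0:
            data = src.read(min(chunk, remaining))
            if not data:                # file shorter than the requested range
                break
            proc.stdin.write(data)
            remaining -= len(data)
        proc.stdin.close()              # signal end of input to the tool
        return proc.wait()


# Stand-in for command2: a child Python process that copies stdin to stdout.
ECHO = [sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
```

In practice `command` would be the command line of the analysis tool, configured to read its reads from standard input. On Linux a comparable read instruction can likely be assembled from standard byte-oriented utilities such as `tail -c` and `head -c`, although the text does not say which reading command the inventors use.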
In accordance with another aspect of the present invention, the inventors have also found that bowtie processes fastq files sequence by sequence; therefore, when the subsequent program is a sequence-wise analysis tool such as bowtie, the fastq file needs to be divided by sequence.
In the following, a method of dividing in sequence is described in connection with fig. 2, according to an embodiment of the invention, which method comprises the following steps.
Step 201, analyzing the fastq file sequence by sequence and storing the analysis results in the information file as the analysis proceeds.
Specifically, during the analysis the start and end positions of each sequence are recorded and the number of sequences is counted, and both are saved in the information file.
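The per-sequence analysis of step 201 can be sketched as follows. The name `index_sequences`, the half-open `(start, end)` byte ranges and the layout of the `info` dictionary are assumptions made for illustration; the sketch also assumes four-line, unwrapped records.

```python
import io
from typing import BinaryIO, List, Tuple


def index_sequences(handle: BinaryIO) -> List[Tuple[int, int]]:
    """Return the (start, end) byte range, end exclusive, of every record."""
    offsets = []
    pos = 0
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:                      # end of file
            break
        length = sum(len(line) for line in record)
        offsets.append((pos, pos + length))
        pos += length
    return offsets


sample = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
offsets = index_sequences(sample)

# Hypothetical in-memory stand-in for the information file.
info = {"total_sequences": len(offsets), "sequences": offsets}
```

An index of this kind grows with the number of sequences, whereas the block index grows only with the number of blocks, which is one practical difference between the two partitioning modes.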
For double-ended sequencing files, if the two files are the same size, the sequence-wise analysis is performed on one of the files; if the sizes differ, each file is analyzed by sequence separately.
Step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed of each node in the fastq file according to the statistical information obtained in step 201.
According to one embodiment of the invention, in a multi-node gene analysis workflow the number of cores and the computing power of the computing nodes differ, so when the sequencing file is divided, the processing range of each node is determined with these differences taken into account, which balances the load better.
The number of sequences processed by each node is calculated according to equation 2:
$$S_i = S_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j} \qquad (2)$$
wherein $S_i$ is the number of sequences processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
After the number of sequences to be processed of each node is calculated, the starting position and the ending position of the file part to be processed of each node in the original sequencing file can be determined through the information file.
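As a worked illustration, assume that equation 2 is the core-proportional share S_i = S_t · c_i / Σ c_j suggested by its symbol definitions, and take hypothetical figures that do not come from the patent: three nodes with 8, 16 and 8 cores, and a fastq file containing 1,000,000 sequences.

```latex
\begin{aligned}
\sum_{j=1}^{3} c_j &= 8 + 16 + 8 = 32 \\
S_1 &= 1\,000\,000 \cdot \tfrac{8}{32} = 250\,000 \\
S_2 &= 1\,000\,000 \cdot \tfrac{16}{32} = 500\,000 \\
S_3 &= 1\,000\,000 \cdot \tfrac{8}{32} = 250\,000
\end{aligned}
```

Under these assumptions the first node would read from the start position of sequence 1 to the end position of sequence 250,000, the second from sequence 250,001 to sequence 750,000, and the third from sequence 750,001 to the end of the file, with each boundary looked up in the information file.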
Step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the original sequencing file, and providing the read instruction to the subsequent program in a pipeline manner.
The pipe command format is as follows:
command1|command2
command1 is an instruction for reading the fastq file range, and command2 is an instruction for a tool for analyzing sequences, such as bowtie.
The invention provides a rapid dividing method for big data gene sequencing files, which ensures that in the multi-node gene analysis process, the sequencing files are not required to be actually divided, subfiles are not generated, and a flexible dividing scheme is provided according to a subsequent analysis program, so that the loads of all nodes are more balanced, hard disk reading and writing are reduced, and the dividing efficiency is improved.
It should be noted that not all of the steps in the foregoing embodiments are necessarily required, and those skilled in the art may make appropriate selections, substitutions, modifications and the like according to actual needs.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that modifications and equivalents may be made thereto without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (6)
1. A rapid partitioning method for big data gene sequencing files, comprising:
step 101, setting the size of a file block;
step 102, analyzing and counting fastq files according to the file block size set in step 101, and dividing the fastq files into a plurality of file blocks; storing the position information of each file block and the total number of the file blocks into an information file, wherein the position information of the file blocks comprises a starting position and an ending position of the file blocks;
step 103, calculating the number of file blocks to be processed by each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed by each node in the fastq file;
step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed by each node in the fastq file, which are determined in step 103, providing the reading instruction to a subsequent program in a pipeline mode,
the number of file blocks to be processed by each node in step 103 is calculated according to the following formula:
$$B_i = B_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$
wherein $B_i$ is the number of file blocks processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
2. The rapid partitioning method for big data gene sequencing file of claim 1, said step 102 further comprising: if the ending position of a file block is in the middle of a sequence, the file block is extended to the end of the sequence.
3. The rapid partitioning method for big data gene sequencing file as claimed in claim 1, wherein the file block size has a value ranging from 1M to 100M.
4. A rapid partitioning method for big data gene sequencing files, comprising:
step 201, analyzing and counting fastq files according to sequences, obtaining position information of each sequence and total number of the sequences, and storing the position information of each sequence into an information file, wherein the position information of each sequence comprises a starting position and an ending position of each sequence;
step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of a file part to be processed of each node in a fastq file;
step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the fastq file obtained in step 202, providing the read instruction to a subsequent program in a pipeline manner,
the number of sequences that each node needs to process in step 202 is calculated according to the following formula:
$$S_i = S_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$
wherein $S_i$ is the number of sequences processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
5. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method according to any of claims 1-4.
6. A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the method according to any of claims 1-4 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010122470.0A CN111326216B (en) | 2020-02-27 | 2020-02-27 | Rapid partitioning method for big data gene sequencing file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111326216A (en) | 2020-06-23 |
CN111326216B (en) | 2023-07-21 |
Family
ID=71168260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010122470.0A Active CN111326216B (en) | 2020-02-27 | 2020-02-27 | Rapid partitioning method for big data gene sequencing file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111326216B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||