CN111326216B - Rapid partitioning method for big data gene sequencing file - Google Patents

Info

Publication number
CN111326216B
Authority
CN
China
Prior art keywords
file
node
processed
fastq
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010122470.0A
Other languages
Chinese (zh)
Other versions
CN111326216A (en
Inventor
张中海
谭光明
张春明
姚二林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The invention relates to the field of high-performance computing, and in particular to a rapid partitioning method for big data gene sequencing files. In a multi-node gene analysis workflow, the method avoids physically splitting the sequencing file and generates no subfiles, and it provides a flexible partitioning scheme matched to the downstream analysis program, so that node loads are better balanced, hard disk reads and writes are reduced, and partitioning efficiency is improved.

Description

Rapid partitioning method for big data gene sequencing file
Technical Field
The invention relates to the field of high-performance computing, and in particular to a rapid partitioning method for big data gene sequencing files.
Background
With the rapid development of the health industry, gene analysis technology plays an increasingly important role. Gene sequencers produce large numbers of sequencing files, most commonly in the fastq format. A single sequencing file is at least a few gigabytes and often tens to hundreds of gigabytes. Processing such large data quickly has increasingly become the bottleneck of gene analysis.
Because sequencing files are large, analysis on a single node takes a long time, so multiple nodes are used to compute in parallel and shorten the analysis. This requires partitioning the sequencing file: each node processes only part of the file, and the partial results are merged at the end, so that the complete analysis result is obtained in a short time.
When multiple nodes process a sequencing file, the common approach is to split the file evenly by the number of nodes, generate the corresponding subfiles, write them to the hard disk, and have each node read its own subfile for processing. This approach is simple, but it increases the read and write load on the hard disk.
Moreover, the common approach may affect the results of downstream programs. Sequencing analysis generally uses sequence alignment software such as bwa and bowtie. The bwa program, for example, reads its input in blocks and processes one block of the fastq file at a time while it runs. Because the conventional splitting method does not take this into account, the result of bwa is affected and the alignment results easily become inconsistent.
Disclosure of Invention
The invention provides a rapid partitioning method for big data gene sequencing files, which comprises the following steps:
step 101, setting the size of a file block;
step 102, analyzing and counting fastq files according to the file block size set in step 101, and dividing the fastq files into a plurality of file blocks; storing the position information of each file block and the total number of the file blocks into an information file;
step 103, calculating the number of file blocks to be processed by each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed by each node in the fastq file;
step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed by each node in the fastq file, which are determined in step 103, and providing the reading instruction for a subsequent program in a pipeline mode.
Preferably, step 102 in the above method further comprises: if the ending position of a file block is in the middle of a sequence, the file block is extended to the end of the sequence.
Preferably, the position information of the file block in the above method includes a start position and an end position of the file block.
Preferably, the file block size in the above method may have a value ranging from 1M to 100M.
Preferably, the number of file blocks that each node needs to process in step 103 in the above method is calculated according to the following formula:
$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times B_t$$
wherein $B_i$ is the number of file blocks processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
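As a worked illustration, assume that file blocks are assigned in proportion to each node's core count, as the definitions above indicate. The values below (three nodes with 8, 16 and 8 cores, and 100 file blocks in total) are hypothetical and chosen only so that the arithmetic is exact:

```latex
\[
B_i = \frac{c_i}{\sum_{j=1}^{n} c_j}\, B_t, \qquad
B_1 = \frac{8}{32}\cdot 100 = 25,\quad
B_2 = \frac{16}{32}\cdot 100 = 50,\quad
B_3 = \frac{8}{32}\cdot 100 = 25 .
\]
```

The three shares sum to 100, so every file block is assigned to exactly one node. When the shares are not whole numbers, a rounding rule is needed; the text does not specify one.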
According to another aspect of the present invention, there is provided a rapid partitioning method for big data gene sequencing files, comprising:
step 201, analyzing and counting fastq files according to sequences to obtain position information of each sequence and total number of the sequences;
step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of a file part to be processed of each node in a fastq file;
step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the fastq file, which are obtained in step 202, and providing the read instruction to a subsequent program in a pipeline mode.
Preferably, the position information of the sequence of the above method includes a start position and an end position of the sequence.
Preferably, the number of sequences that each node needs to process in step 202 of the above method is calculated according to the following formula:
$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times S_t$$
wherein $S_i$ is the number of sequences processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements any of the methods described above.
A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements any of the methods described above when executing the program.
To address the shortcomings of the prior art, the invention adopts a lazy partitioning strategy for fastq files: no subfiles are generated, which avoids writing, storing and re-reading them. It also provides several partitioning modes to suit the downstream analysis software. The method reduces the number of hard disk reads and writes, speeds up file partitioning, and eliminates alignment errors.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a partitioning method by block according to one embodiment of the invention.
FIG. 2 is a flow diagram of a method of partitioning by sequence according to one embodiment of the invention.
Detailed Description
Before describing the method in detail, the format of the fastq file is briefly described. A fastq file is a text file in which every four lines form one sequence record: the first line is the name of the sequence, the second line is the base sequence, the third line is explanatory information, and the fourth line is the quality scores of the sequence. The sequences are not all the same length. Sequencing files are either single-ended or double-ended: a single-ended sequencing file comprises only one file, while a double-ended sequencing file comprises a pair of files whose sequences correspond one to one.
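The four-line record layout can be illustrated with a minimal Python sketch. The names `FastqRecord` and `read_fastq` and the sample reads are hypothetical, introduced here only for illustration, and the sketch assumes a well-formed file with exactly four lines per record.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # line 1: name information of the sequence
    bases: str     # line 2: base sequence
    comment: str   # line 3: explanatory information
    quality: str   # line 4: quality scores, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of a fastq text stream."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return  # end of file
        yield FastqRecord(*lines)


# Two hypothetical reads of different lengths.
sample = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n"
records = list(read_fastq(io.StringIO(sample)))
```

Because the reads differ in length, the records occupy different numbers of bytes, which is why a fixed byte offset can fall in the middle of a record.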
According to one embodiment of the present invention, a block-wise partitioning method is described in connection with fig. 1, which method comprises the following steps.
Step 101, setting the file block size; preferably, the value is between 1M and 100M.
The inventors have found that the gene sequencing analysis tool bwa reads sequencing files in blocks during alignment, and that partitioning fastq files by block therefore helps balance the load when bwa is run on multiple nodes to align fastq files in parallel. The file block size can take different values depending on the analysis software and the processing capacity of the nodes; preferably, the value is between 1M and 100M. In one embodiment of the present invention, a file block size of 10M gives a good processing speed and load balancing effect.
Step 102, analyzing and counting the fastq file according to the block size set in the previous step, dividing the fastq file into a plurality of file blocks, and saving the analysis result in the information file.
Taking a block size of 10M as an example, the fastq file is analyzed by block size as follows. Starting from the beginning of the file, the position is moved forward by 10 Mbytes. If the byte at that position is exactly the end of a sequence, the start position of the first file block is set to 0 and its end position to 10M. If the byte lies in the middle of a sequence, the end position of the first file block is set to the end position of that sequence. Once the first file block of the fastq file has been located, its start position and end position are stored in the information file. The first file block is therefore greater than or equal to 10M in size and contains only complete sequence data.
The byte immediately after the end position of the first file block is then taken as the start position of the second file block, and the position is again moved forward by 10 Mbytes. If the byte reached is exactly the end of a sequence, its position is set as the end position of the second block; if it lies at the beginning or in the middle of a sequence, the end position of the second block is set to the end position of that sequence. The analysis continues in this way until the end of the fastq file, so that the start and end positions of all file blocks are found and stored in the information file. According to one embodiment of the invention, only the start positions need be stored in the information file.
The last file block may be smaller than 10M, while the other file blocks vary slightly in size, at or just above 10M. Each file block contains a plurality of complete sequences, and each sequence is present in only one file block. According to one embodiment of the invention, the number of file blocks is also accumulated during the analysis and stored in the information file.
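The boundary rule described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes a well-formed fastq file with four lines per record, it scans the records sequentially instead of jumping forward by the block size, it uses exclusive end offsets, and it keeps the information file as an in-memory dictionary. The names `index_blocks` and `info_file` are hypothetical.

```python
import io
from typing import BinaryIO, List, Tuple


def index_blocks(handle: BinaryIO, block_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of file blocks; end is exclusive.

    A block is closed at the end of the first record that brings its size
    to at least block_size, so no sequence is split between two blocks.
    """
    blocks: List[Tuple[int, int]] = []
    start = handle.tell()
    while True:
        record = [handle.readline() for _ in range(4)]  # one fastq record
        if not record[0]:
            break  # end of file
        position = handle.tell()
        if position - start >= block_size:
            blocks.append((start, position))
            start = position
    if handle.tell() > start:
        blocks.append((start, handle.tell()))  # last block may be smaller
    return blocks


# Five 16-byte records and a 30-byte block size, chosen only for illustration.
data = b"@r1\nACGT\n+\nIIII\n" * 5
blocks = index_blocks(io.BytesIO(data), block_size=30)
info_file = {"total_blocks": len(blocks), "blocks": blocks}
```

With these toy values the first two blocks are 32 bytes each, slightly above the 30-byte target because each is extended to the end of a record, and the last block holds the remaining 16 bytes.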
For double-ended sequencing files, if the two files are the same size, the block-wise statistical analysis is performed on one of the files; if their sizes differ, the two files are divided using the sequence-based partitioning method of the invention, described below.
Step 103, calculating the number of file blocks to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed of each node in the fastq file according to the statistical information obtained in step 102.
Specifically, in a multi-node gene analysis workflow, the computing nodes differ in core count and computing power. These differences are taken into account when the sequencing file is divided and the processing range of each node is determined, so that the load is better balanced.
According to one embodiment of the invention, the number of file blocks processed by each node is calculated according to equation 1.
$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times B_t \qquad (1)$$
wherein $B_i$ is the number of file blocks processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
After the number of file blocks to be processed of each node is calculated, the starting position and the ending position of the file part to be processed of each node in the original sequencing file can be determined through the information file.
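A minimal Python sketch of this step is given below. It assumes that file blocks are assigned in proportion to core count. The text does not say how fractional shares are rounded, so handing the leftover blocks to the nodes with the largest remainders is an assumption of this sketch. The function names `allocate` and `node_ranges` and the example values are hypothetical.

```python
from typing import List, Optional, Tuple


def allocate(total: int, cores: List[int]) -> List[int]:
    """Split `total` units across nodes in proportion to their core counts."""
    core_sum = sum(cores)
    shares = [total * c // core_sum for c in cores]      # floor of each share
    remainders = [total * c % core_sum for c in cores]
    leftover = total - sum(shares)
    # Assumption: leftover units go to the nodes with the largest remainders.
    order = sorted(range(len(cores)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def node_ranges(blocks: List[Tuple[int, int]],
                counts: List[int]) -> List[Optional[Tuple[int, int]]]:
    """Map per-node block counts to (start, end) byte ranges, end exclusive."""
    ranges: List[Optional[Tuple[int, int]]] = []
    first = 0
    for count in counts:
        if count == 0:
            ranges.append(None)  # this node receives no data
            continue
        last = first + count - 1
        ranges.append((blocks[first][0], blocks[last][1]))
        first += count
    return ranges


# Ten hypothetical 10-byte blocks shared by nodes with 8, 16 and 8 cores.
blocks = [(i * 10, (i + 1) * 10) for i in range(10)]
counts = allocate(len(blocks), [8, 16, 8])
ranges = node_ranges(blocks, counts)
```

Because each node receives a run of consecutive blocks, its part of the file is described by a single start position and a single end position, taken from the first and last block in its run.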
Step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed of each node in the original sequencing file, and providing the reading instruction for a subsequent program in a pipeline mode.
Pipeline commands are prior art and are briefly described here using the Linux system as an example. A Linux pipeline connects multiple commands with the vertical bar "|", which is called the pipe symbol. The syntax of a Linux pipeline is as follows:
command1|command2
In the present invention, command1 is an instruction that reads the assigned range of the fastq file, and command2 is the invocation of a block-wise analysis tool such as bwa.
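One possible way to build such a pipeline is sketched below in Python. The text does not name the command used to read a file range; using `tail -c` and `head -c` for command1 is an assumption of this sketch, as are the function name `read_command`, the file name `sample.fastq`, and the downstream command string, which is passed through unchanged and shown only as a placeholder.

```python
import shlex


def read_command(path: str, start: int, end: int, downstream: str) -> str:
    """Build a shell pipeline that streams bytes [start, end) of `path`.

    `downstream` is the analysis command (command2); it is appended as given.
    """
    length = end - start
    # tail -c +N starts its output at byte N, counting from 1.
    reader = f"tail -c +{start + 1} {shlex.quote(path)} | head -c {length}"
    return f"{reader} | {downstream}"


# Hypothetical byte range and downstream command, for illustration only.
cmd = read_command("sample.fastq", 30, 80, "bwa mem ref.fa -")
```

The resulting string can be run by a shell on the node. No subfile is written, because the selected byte range flows directly from the reader into the analysis tool.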
In accordance with another aspect of the present invention, the inventors have also found that bowtie processes fastq files sequence by sequence. Therefore, when the downstream program is a sequence-by-sequence analysis tool such as bowtie, the fastq file needs to be divided by sequence.
In the following, a method of dividing by sequence according to an embodiment of the invention is described in connection with fig. 2; the method comprises the following steps.
Step 201, analyzing and counting the fastq file by sequence, and storing the analysis result in the information file as the fastq file is analyzed.
Specifically, during the analysis, the start and end positions of each sequence are determined and recorded, the number of sequences is counted, and this information is saved in the information file.
For double-ended sequencing files, if the two files are the same size, the sequence-wise statistical analysis is performed on one of the files; if their sizes differ, each file is analyzed by sequence separately.
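The sequence-wise analysis of step 201 can be sketched in Python as follows, assuming a well-formed fastq file with four lines per record. The name `index_sequences`, the exclusive end offsets, and the in-memory `info_file` dictionary are assumptions made for illustration.

```python
import io
from typing import BinaryIO, List, Tuple


def index_sequences(handle: BinaryIO) -> List[Tuple[int, int]]:
    """Return the (start, end) byte offsets of every record; end is exclusive."""
    positions: List[Tuple[int, int]] = []
    while True:
        start = handle.tell()
        lines = [handle.readline() for _ in range(4)]  # one fastq record
        if not lines[0]:
            break  # end of file
        positions.append((start, handle.tell()))
    return positions


# Two hypothetical reads of different lengths.
data = b"@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
seq_index = index_sequences(io.BytesIO(data))
info_file = {"total_sequences": len(seq_index), "positions": seq_index}
```

For a pair of double-ended files indexed separately, the same sequence numbers can be assigned to a node in both files even though the byte offsets differ, because the sequences in the two files correspond to each other.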
Step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed of each node in the fastq file according to the statistical information obtained in step 201.
According to one embodiment of the invention, in a multi-node gene analysis workflow the computing nodes differ in core count and computing power. These differences are taken into account when the sequencing file is divided and the processing range of each node is determined, so that the load is better balanced.
The number of sequences processed by each node is calculated according to equation 2.
$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times S_t \qquad (2)$$
wherein $S_i$ is the number of sequences processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
After the number of sequences to be processed of each node is calculated, the starting position and the ending position of the file part to be processed of each node in the original sequencing file can be determined through the information file.
Step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the original sequencing file, and providing the read instruction to the subsequent program in a pipeline manner.
The pipe command format is as follows:
command1|command2
command1 is an instruction for reading the fastq file range, and command2 is an instruction for a tool for analyzing sequences, such as bowtie.
The invention provides a rapid partitioning method for big data gene sequencing files. In a multi-node gene analysis workflow, the sequencing file does not need to be physically split and no subfiles are generated, and a flexible partitioning scheme is provided to match the downstream analysis program, so that node loads are better balanced, hard disk reads and writes are reduced, and partitioning efficiency is improved.
It should be noted that not every step in the foregoing embodiments is necessary, and those skilled in the art may make appropriate selections, substitutions, modifications and the like according to actual needs.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that modifications and equivalents may be made thereto without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (6)

1. A rapid partitioning method for big data gene sequencing files, comprising:
step 101, setting the size of a file block;
step 102, analyzing and counting fastq files according to the file block size set in step 101, and dividing the fastq files into a plurality of file blocks; storing the position information of each file block and the total number of the file blocks into an information file, wherein the position information of the file blocks comprises a starting position and an ending position of the file blocks;
step 103, calculating the number of file blocks to be processed by each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed by each node in the fastq file;
step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed by each node in the fastq file, which are determined in step 103, providing the reading instruction to a subsequent program in a pipeline mode,
the number of file blocks to be processed by each node in step 103 is calculated according to the following formula:
$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times B_t$$
wherein $B_i$ is the number of file blocks processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
2. The rapid partitioning method for big data gene sequencing file of claim 1, said step 102 further comprising: if the ending position of a file block is in the middle of a sequence, the file block is extended to the end of the sequence.
3. The rapid partitioning method for big data gene sequencing file as claimed in claim 1, wherein the file block size has a value ranging from 1M to 100M.
4. A rapid partitioning method for big data gene sequencing files, comprising:
step 201, analyzing and counting fastq files according to sequences, obtaining position information of each sequence and total number of the sequences, and storing the position information of each sequence into an information file, wherein the position information of each sequence comprises a starting position and an ending position of each sequence;
step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of a file part to be processed of each node in a fastq file;
step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the fastq file obtained in step 202, providing the read instruction to a subsequent program in a pipeline manner,
the number of sequences that each node needs to process in step 202 is calculated according to the following formula:
$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times S_t$$
wherein $S_i$ is the number of sequences processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
5. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method according to any of claims 1-4.
6. A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the method according to any of claims 1-4 when executing the program.
CN202010122470.0A 2020-02-27 2020-02-27 Rapid partitioning method for big data gene sequencing file Active CN111326216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122470.0A CN111326216B (en) 2020-02-27 2020-02-27 Rapid partitioning method for big data gene sequencing file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122470.0A CN111326216B (en) 2020-02-27 2020-02-27 Rapid partitioning method for big data gene sequencing file

Publications (2)

Publication Number Publication Date
CN111326216A (en) 2020-06-23
CN111326216B (en) 2023-07-21

Family

ID=71168260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122470.0A Active CN111326216B (en) 2020-02-27 2020-02-27 Rapid partitioning method for big data gene sequencing file

Country Status (1)

Country Link
CN (1) CN111326216B (en)

Also Published As

Publication number Publication date
CN111326216A (en) 2020-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant