CN107169313A - Method for reading DNA data files and computer-readable storage medium - Google Patents
- Publication number: CN107169313A
- Application number: CN201710195158.2A
- Authority: CN (China)
- Prior art keywords: files, blocks, DNA, size, starting point
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Abstract
The invention discloses a method for reading a DNA data file and a computer-readable storage medium. The method includes: determining the size of each file block of the DNA data file based on a preset number of processes; splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and having each process read its corresponding file block in parallel. The invention speeds up the reading of DNA data files and greatly shortens read time, particularly for TB- and PB-scale DNA files, and reduces the memory a single machine consumes to hold a DNA data file.
Description
Technical field
The present invention relates to the technical field of bioinformatics, and in particular to a method for reading DNA data files and a computer-readable storage medium.
Background
Current software for gene sequencing consists mainly of sequence assembly programs such as EULER, SSAKE, VCAKE, Velvet and IDBA. Of these, Velvet, IDBA and similar algorithms read the DNA data file serially, which places a very high memory requirement on a single machine. Given the enormous data volume of DNA data files, a method for reading DNA data files is needed that reduces the memory a single machine consumes to hold them.
Summary of the invention
To overcome the excessive memory consumption of reading DNA data files in the prior art, an embodiment of the present invention provides, in one aspect, a method for reading a DNA data file, including:
Determining the size of each file block of the DNA data file based on a preset number of processes;
Splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process;
Each process reading its corresponding file block in parallel.
Wherein splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process, includes:
Determining the starting position of each file block based on the size of each file block;
Determining the start address and end address of each file block based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;
Splitting the DNA data file based on the start address and end address of each file block, to obtain the file block corresponding to each process;
And each process reading its corresponding file block in parallel includes:
Each process reading its corresponding file block in parallel based on the start address and end address of each file block.
Wherein determining the start address and end address of each file block, based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file, includes:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment;
Setting the start address of that file block to the starting point;
Setting the end address of the file block preceding that file block to the starting point.
Wherein searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment includes:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment using the file block dynamic adjustment algorithm FAA.
Wherein each process reading its corresponding file block in parallel includes:
Each process reading its corresponding file block in parallel in units of data blocks, the size of the data block being a preset adjustable size.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium for storing a computer program that can run on a processor, the computer program being used for:
Determining the size of each file block of the DNA data file based on a preset number of processes;
Splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process;
Each process reading its corresponding file block in parallel.
Wherein splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process, includes:
Determining the starting position of each file block based on the size of each file block;
Determining the start address and end address of each file block based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;
Splitting the DNA data file based on the start address and end address of each file block, to obtain the file block corresponding to each process;
And each process reading its corresponding file block in parallel includes:
Each process reading its corresponding file block in parallel based on the start address and end address of each file block.
Wherein determining the start address and end address of each file block, based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file, includes:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment;
Setting the start address of that file block to the starting point;
Setting the end address of the file block preceding that file block to the starting point.
Wherein searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment includes:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment using the file block dynamic adjustment algorithm FAA.
Wherein each process reading its corresponding file block in parallel includes:
Each process reading its corresponding file block in parallel in units of data blocks, the size of the data block being a preset adjustable size.
By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiment of the present invention lets each process read its corresponding file block in parallel. This speeds up the reading of DNA data files and greatly shortens read time, particularly for TB- and PB-scale DNA files, and reduces the memory a single machine consumes to hold a DNA data file.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the first embodiment of the method for reading DNA data files of the present invention;
Fig. 2 is a schematic flowchart of the second embodiment of the method for reading DNA data files of the present invention.
Detailed description of the embodiments
To make the technical problem solved by the invention, its technical solution and its beneficial effects clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
Refer to Fig. 1, a schematic flowchart of the first embodiment of the method for reading DNA data files of the present invention. The reading method includes the following steps:
101: Determine the size of each file block of the DNA data file based on a preset number of processes. Preferably, to keep the file blocks uniform in size, the size of each file block is size/p, where size is the size of the DNA data file and p is the preset number of processes.
102: Split the DNA data file based on the size of each file block, to obtain the file block corresponding to each process. Because the size of each file block has been determined, the DNA data file can be split block by block according to that size, yielding the file block corresponding to each process.
103: Each process reads its corresponding file block in parallel.
By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiment of the present invention lets each process read its corresponding file block in parallel, which reduces the memory a single machine consumes to hold a DNA data file.
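Steps 101 to 103 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation described in the patent: a thread pool stands in for the parallel processes, the function names block_ranges, read_block and read_file_in_parallel are hypothetical, and letting the last block absorb the remainder when size is not divisible by p is an assumption the source does not spell out.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def block_ranges(size, p):
    """Steps 101/102: split [0, size) into p contiguous blocks of about size/p bytes."""
    base = size // p
    # Assumption: the last block absorbs the remainder when size % p != 0
    # (the source only states that each block has size size/p).
    return [(i * base, size if i == p - 1 else (i + 1) * base) for i in range(p)]


def read_block(path, start, end):
    """Step 103 for one worker: read only the bytes of its own block."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)


def read_file_in_parallel(path, p):
    """Read every block concurrently and return the blocks in file order."""
    ranges = block_ranges(os.path.getsize(path), p)
    # A thread pool is used here only as a stand-in for p parallel processes.
    with ThreadPoolExecutor(max_workers=p) as pool:
        return list(pool.map(lambda r: read_block(path, *r), ranges))
```

Each worker opens the file itself and holds only its own block, which is the property the embodiment relies on to keep per-worker memory low. Note that this naive split can cut a sequencing fragment in two; the second embodiment addresses that.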
Refer to Fig. 2, a schematic flowchart of the second embodiment of the method for reading DNA data files of the present invention. The reading method includes the following steps:
201: Determine the size of each file block of the DNA data file based on a preset number of processes. Preferably, to keep the file blocks uniform in size, the size of the DNA data file is obtained in advance and the size of each file block is set to size/p, where size is the size of the DNA data file and p is the preset number of processes.
Step 201 can be implemented in an MPI environment. First, the MPI environment is initialized: before parallel input is performed, a series of initialization tasks must be carried out and the initialization function, MPI_Init(), called. Next, the information on all participating compute nodes, the identifier of each process taking part in group communication, and the total number of processes taking part in group communication are declared. Then the input file parameters are initialized: depending on the assembly requirements, the corresponding initialization parameters are entered, including the k-mer length, the input gene sequencing sequence file, and the output location of the gene sequencing sequence file. Finally, the DNA data file is read and its size, size, is computed: every participating process concurrently calls MPI_File_open() to open the DNA data file to be processed and reads the total size of the file.
202: Determine the starting position of each file block based on the size of each file block. Once the starting position of each file block is determined, the end position of each file block follows from the contiguity of the file blocks. If the starting position of the file block corresponding to the i-th process is i*size/p, then the end position of the file block corresponding to the i-th process is (i+1)*size/p, and the end position of the file block corresponding to the (i-1)-th process is i*size/p.
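The positions in step 202 can be written out as below. The subscripted names start_i and end_i are notation introduced here for illustration, i is taken to count from 0 (which is what a starting position of i*size/p implies), and the numeric example is hypothetical.

```latex
\[
\mathrm{start}_i = i \cdot \frac{\mathit{size}}{p}, \qquad
\mathrm{end}_i = (i+1) \cdot \frac{\mathit{size}}{p}, \qquad
\mathrm{end}_{i-1} = \mathrm{start}_i
\]
% Hypothetical example: size = 1000 bytes, p = 4 processes
\[
(\mathrm{start}_i, \mathrm{end}_i) = (0, 250),\ (250, 500),\ (500, 750),\ (750, 1000)
\quad \text{for } i = 0, 1, 2, 3
\]
```

These are nominal positions only; step 203 moves them so that each boundary falls on the start of a sequencing fragment.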
203: Determine the start address and end address of each file block based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file. Specifically, step 203 can include: searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment; setting the start address of that file block to the starting point of the DNA sequencing fragment; and setting the end address of the file block preceding that file block to the same starting point. The starting point of the first DNA sequencing fragment after the starting position can be found using the file block dynamic adjustment algorithm FAA.
The FAA algorithm is as follows. First, a file in FASTA or FASTQ format, of size size, is taken as input. Next, the size of the file block each thread receives is determined from the number of allocated threads, proc, as size/p. For thread i, the input start position start is i*size/p and the end position end is (i+1)*size/p. While the character in readBuf is not ">", the read position is advanced by one, until the first ">" is found. The current file block then sets start = start + sendAdujstDelta, and the preceding file block sets end = end + sendAdujstDelta, where sendAdujstDelta appears to denote the number of characters skipped. Proceeding in this way for each thread in turn, the starting position of every thread's file block is located exactly.
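The boundary adjustment described above can be sketched as follows. This is a minimal illustration, not the patent's FAA code: the function name adjust_block_bounds is hypothetical, the whole file is held in memory only for brevity, and requiring ">" to sit at the beginning of a line is an added assumption that suits FASTA headers (FASTQ records open with "@", which can also occur in quality lines and would need extra checks).

```python
def adjust_block_bounds(data, p, marker=b">"):
    """Move each nominal block start forward to the next record header.

    data   -- full file contents as bytes (kept in memory only for brevity)
    p      -- number of blocks/processes
    marker -- byte that opens a record; ">" is the FASTA header character
    """
    size = len(data)
    starts = [0]  # the first block always begins at the start of the file
    for i in range(1, p):
        # Nominal start i*size/p, never before the previous adjusted start.
        pos = max(i * size // p, starts[-1])
        # Advance one byte at a time until a marker at a line start is found.
        while pos < size and not (
            data[pos:pos + 1] == marker
            and (pos == 0 or data[pos - 1:pos] == b"\n")
        ):
            pos += 1
        starts.append(pos)
    # Each block ends where the next one starts; the last block ends at EOF.
    ends = starts[1:] + [size]
    return list(zip(starts, ends))
```

A block can come out empty (start equal to end) when no further header exists after its nominal start, for example when there are more processes than records near the end of the file.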
204: Split the DNA data file based on the start address and end address of each file block, to obtain the file block corresponding to each process. Specifically, in an MPI environment, the MPI_File_set_view() function can be called and combined with the FAA algorithm to partition the DNA data file into blocks.
205: Each process reads its corresponding file block in parallel based on the start address and end address of each file block. Specifically, each process can read its corresponding file block in parallel in units of data blocks, where the data block size is a preset adjustable size. Because the data block size is adjustable, tuning it can improve the I/O performance of the parallel file system.
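Reading one file block in data blocks of adjustable size might look like the sketch below. The function name read_range_in_chunks, the chunk_size parameter and its 1 MiB default are assumptions for illustration; the source only says that the data block size is preset and adjustable.

```python
def read_range_in_chunks(path, start, end, chunk_size=1 << 20):
    """Yield the bytes in [start, end) of the file in pieces of at most chunk_size."""
    with open(path, "rb") as fh:
        fh.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = fh.read(min(chunk_size, remaining))
            if not chunk:  # reached end of file before `end`
                break
            remaining -= len(chunk)
            yield chunk
```

A worker would call this with the start and end addresses of its own file block; a larger chunk_size means fewer I/O calls at the cost of a larger buffer, so the best value depends on the file system.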
The most important requirement for MPI parallel file reading is to know exactly where in the file the data to be processed is stored; only then can all processes read different parts of the file at the same time, achieving parallel file input. The embodiment of the present invention uses MPI's multi-view parallel file read/write mode and, taking into account the characteristics of gene data files, calls the MPI_File_set_view() function to partition the file by view and uses the FAA algorithm to determine the start and end positions of the file block each process is to handle. This achieves highly concurrent file block reading across multiple processes.
In addition, because the embodiment of the present invention positions the start address and end address of every file block at the starting point of a DNA sequencing fragment, the DNA sequence is not split crudely and no single DNA sequencing fragment is broken across blocks, which ensures that every DNA sequencing fragment is assigned to exactly one file block.
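The property stated above can be checked mechanically. The following sketch assumes FASTA-style records opened by ">" and takes the block boundaries as input; the function names check_partition and records_per_block are hypothetical and do not come from the patent.

```python
def check_partition(data, bounds, marker=b">"):
    """Return True if bounds tile the file and every non-empty block starts at a header."""
    expected_start = 0
    for start, end in bounds:
        if start != expected_start or end < start:
            return False  # gap, overlap, or negative-length block
        if start < end and data[start:start + 1] != marker:
            return False  # block would begin in the middle of a record
        expected_start = end
    return expected_start == len(data)


def records_per_block(data, bounds, marker=b">"):
    """Count the records whose header line lies inside each block."""
    counts = []
    for start, end in bounds:
        lines = data[start:end].split(b"\n")
        counts.append(sum(1 for line in lines if line.startswith(marker)))
    return counts
```

When check_partition returns True, the per-block counts add up to the total number of records in the file, and no record is shared between two blocks.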
The present invention also provides a computer-readable storage medium for storing a computer program that runs on a processor, where the computer program, when executed, performs the reading method of any of the embodiments of Fig. 1 and Fig. 2. The computer-readable storage medium includes at least one of a USB flash drive, an optical disc, a terminal and a server, without limitation.
In a specific embodiment, the computer-readable storage medium stores a computer program that can run on a processor, and the computer program is used for:
Determining the size of each file block of the DNA data file based on a preset number of processes;
Splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process;
Each process reading its corresponding file block in parallel.
Wherein splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process, includes:
Determining the starting position of each file block based on the size of each file block;
Determining the start address and end address of each file block based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;
Splitting the DNA data file based on the start address and end address of each file block, to obtain the file block corresponding to each process;
And each process reading its corresponding file block in parallel includes:
Each process reading its corresponding file block in parallel based on the start address and end address of each file block.
Wherein determining the start address and end address of each file block, based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file, includes:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment;
Setting the start address of that file block to the starting point;
Setting the end address of the file block preceding that file block to the starting point.
Wherein searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment includes:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment using the file block dynamic adjustment algorithm FAA.
Wherein each process reading its corresponding file block in parallel includes:
Each process reading its corresponding file block in parallel in units of data blocks, the data block size being a preset adjustable size.
By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiment of the present invention lets each process read its corresponding file block in parallel, which reduces the memory a single machine consumes to hold a DNA data file.
The above provides specific content in connection with one or more embodiments, and the specific implementation of the present invention is not to be regarded as limited to these descriptions. Anything approximate or identical to the method, structure and so on of the present invention, or any technical deduction or substitution made on the premise of the inventive concept, shall be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A method for reading a DNA data file, characterized by comprising:
Determining the size of each file block of the DNA data file based on a preset number of processes;
Splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process;
Each process reading its corresponding file block in parallel.
2. The reading method of claim 1, characterized in that splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process, comprises:
Determining the starting position of each file block based on the size of each file block;
Determining the start address and end address of each file block based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;
Splitting the DNA data file based on the start address and end address of each file block, to obtain the file block corresponding to each process;
And each process reading its corresponding file block in parallel comprises:
Each process reading its corresponding file block in parallel based on the start address and end address of each file block.
3. The reading method of claim 2, characterized in that determining the start address and end address of each file block, based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file, comprises:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment;
Setting the start address of that file block to the starting point;
Setting the end address of the file block preceding that file block to the starting point.
4. The reading method of claim 3, characterized in that searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment comprises:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment using the file block dynamic adjustment algorithm FAA.
5. The reading method of claim 1, characterized in that each process reading its corresponding file block in parallel comprises:
Each process reading its corresponding file block in parallel in units of data blocks, the size of the data block being a preset adjustable size.
6. A computer-readable storage medium, characterized by storing a computer program that can run on a processor, the computer program being used for:
Determining the size of each file block of the DNA data file based on a preset number of processes;
Splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process;
Each process reading its corresponding file block in parallel.
7. The computer-readable storage medium of claim 6, characterized in that splitting the DNA data file based on the size of each file block, to obtain the file block corresponding to each process, comprises:
Determining the starting position of each file block based on the size of each file block;
Determining the start address and end address of each file block based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;
Splitting the DNA data file based on the start address and end address of each file block, to obtain the file block corresponding to each process;
And each process reading its corresponding file block in parallel comprises:
Each process reading its corresponding file block in parallel based on the start address and end address of each file block.
8. The computer-readable storage medium of claim 7, characterized in that determining the start address and end address of each file block, based on the starting position of each file block and the starting points of the DNA sequencing fragments in the DNA data file, comprises:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment;
Setting the start address of that file block to the starting point;
Setting the end address of the file block preceding that file block to the starting point.
9. The computer-readable storage medium of claim 8, characterized in that searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment comprises:
Searching, from the starting position of any file block, for the starting point of the first DNA sequencing fragment using the file block dynamic adjustment algorithm FAA.
10. The computer-readable storage medium of claim 6, characterized in that each process reading its corresponding file block in parallel comprises:
Each process reading its corresponding file block in parallel in units of data blocks, the size of the data block being a preset adjustable size.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710195158.2A CN107169313A (en) | 2017-03-29 | 2017-03-29 | The read method and computer-readable recording medium of DNA data files |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710195158.2A CN107169313A (en) | 2017-03-29 | 2017-03-29 | The read method and computer-readable recording medium of DNA data files |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107169313A true CN107169313A (en) | 2017-09-15 |
Family
ID=59848975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710195158.2A Pending CN107169313A (en) | 2017-03-29 | 2017-03-29 | The read method and computer-readable recording medium of DNA data files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107169313A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111326216A (en) * | 2020-02-27 | 2020-06-23 | 中国科学院计算技术研究所 | Rapid partitioning method for big data gene sequencing file |
CN111475304A (en) * | 2020-04-10 | 2020-07-31 | 中国科学院计算机网络信息中心 | Feature extraction acceleration method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101826109A (en) * | 2010-04-07 | 2010-09-08 | 深圳创维-Rgb电子有限公司 | Large-capacity file splitting method, device and system |
CN103049680A (en) * | 2012-12-29 | 2013-04-17 | 深圳先进技术研究院 | gene sequencing data reading method and system |
CN103559020A (en) * | 2013-11-07 | 2014-02-05 | 中国科学院软件研究所 | Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data |
Non-Patent Citations (1)
Title |
---|
郭新 等: "基于大规模序列比对软件的并行优化方案", 《计算机工程》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111326216A (en) * | 2020-02-27 | 2020-06-23 | 中国科学院计算技术研究所 | Rapid partitioning method for big data gene sequencing file |
CN111475304A (en) * | 2020-04-10 | 2020-07-31 | 中国科学院计算机网络信息中心 | Feature extraction acceleration method and system |
CN111475304B (en) * | 2020-04-10 | 2023-10-03 | 中国科学院计算机网络信息中心 | Feature extraction acceleration method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11960726B2 (en) | Method and apparatus for SSD storage access | |
US8938603B2 (en) | Cache system optimized for cache miss detection | |
CN105095287A (en) | LSM (Log Structured Merge) data compact method and device | |
CN111913955A (en) | Data sorting processing device, method and storage medium | |
CN109710402A (en) | Method, apparatus, computer equipment and the storage medium of process resource acquisition request | |
US20120023110A1 (en) | Adaptive Processing for Sequence Alignment | |
CN111797096A (en) | Data indexing method and device based on ElasticSearch, computer equipment and storage medium | |
KR20120105294A (en) | Memory controller controlling a nonvolatile memory | |
US10915534B2 (en) | Extreme value computation | |
CN103164490A (en) | Method and device for achieving high-efficient storage of data with non-fixed lengths | |
CN112085644A (en) | Multi-column data sorting method and device, readable storage medium and electronic equipment | |
CN105830160B (en) | For the device and method of buffer will to be written to through shielding data | |
CN110955390B (en) | Data processing method, device, electronic equipment and storage medium | |
CN111931848A (en) | Data feature extraction method and device, computer equipment and storage medium | |
CN104407990B (en) | A kind of disk access method and device | |
US20190050298A1 (en) | Method and apparatus for improving database recovery speed using log data analysis | |
CN111813517A (en) | Task queue allocation method and device, computer equipment and medium | |
CN107169313A (en) | The read method and computer-readable recording medium of DNA data files | |
CN108268503A (en) | A kind of storage of database, querying method and device | |
CN116226681A (en) | Text similarity judging method and device, computer equipment and storage medium | |
CN110069772A (en) | Predict device, method and the storage medium of the scoring of question and answer content | |
CN114816322A (en) | External sorting method and device of SSD and SSD memory | |
CN111831585B (en) | Data storage device and data prediction method thereof | |
CN113254106A (en) | Task execution method and device based on Flink, computer equipment and storage medium | |
CN112148486A (en) | Memory page management method, device and equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170915 |