CN107169313A - The read method and computer-readable recording medium of DNA data files - Google Patents

Publication number
CN107169313A
Authority
CN
China
Prior art keywords
files
blocks
dna
size
starting point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710195158.2A
Other languages
Chinese (zh)
Inventor
葛健秋
孟金涛
郭宁
滕彦宁
魏彦杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201710195158.2A
Publication of CN107169313A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Abstract

The invention discloses a method for reading DNA data files and a computer-readable storage medium. The method includes: determining the size of each file block of a DNA data file based on a preset number of processes; splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and having each process read its corresponding file block in parallel. The invention can speed up the reading of DNA data files and greatly shorten read time, particularly for TB- and PB-scale DNA files, and can reduce the memory a single machine consumes to store a DNA data file.

Description

Method for reading DNA data files and computer-readable storage medium
Technical field
The present invention relates to the technical field of bioinformatics, and in particular to a method for reading DNA data files and a computer-readable storage medium.
Background
Software currently used for gene sequencing mainly comprises assembly tools such as EULER, SSAKE, VCAKE, Velvet, and IDBA. Among these, algorithms such as Velvet and IDBA read DNA data files serially, which places very high memory demands on a single machine. Given the huge data volume of DNA data files, a method for reading DNA data files is needed that reduces the memory a single machine consumes to store a DNA data file.
Summary of the invention
To overcome the excessive memory consumption of reading DNA data files in the prior art, one aspect of the embodiments of the present invention provides a method for reading DNA data files, including:
determining the size of each file block of a DNA data file based on a preset number of processes;
splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and
reading, by each process in parallel, its corresponding file block.
Splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:
determining the start position of each file block based on the size of each file block;
determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file; and
splitting the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process.
Reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block based on the start address and end address of each file block.
Determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment;
setting the start address of that file block to the starting point; and
setting the end address of the preceding file block to the starting point.
Searching, from the start position of any file block, for the starting point of the first DNA sequence fragment includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment based on the file-block dynamic adjustment algorithm FAA.
Reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block in units of data blocks, the size of a data block being a preset, adjustable size.
In another aspect, the embodiments of the present invention provide a computer-readable storage medium for storing a computer program that can run on a processor, the computer program being used for:
determining the size of each file block of a DNA data file based on a preset number of processes;
splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and
reading, by each process in parallel, its corresponding file block.
Splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:
determining the start position of each file block based on the size of each file block;
determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file; and
splitting the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process.
Reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block based on the start address and end address of each file block.
Determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment;
setting the start address of that file block to the starting point; and
setting the end address of the preceding file block to the starting point.
Searching, from the start position of any file block, for the starting point of the first DNA sequence fragment includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment based on the file-block dynamic adjustment algorithm FAA.
Reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block in units of data blocks, the size of a data block being a preset, adjustable size.
By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiments of the present invention enable each process to read its corresponding file block in parallel. This speeds up the reading of DNA data files and greatly shortens read time, particularly for TB- and PB-scale DNA files, and can reduce the memory a single machine consumes to store a DNA data file.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the first embodiment of the method for reading DNA data files of the present invention;
Fig. 2 is a flow chart of the second embodiment of the method for reading DNA data files of the present invention.
Detailed description
To make the technical problem solved, the technical solution, and the beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Refer to Fig. 1, a flow chart of the first embodiment of the method for reading DNA data files of the present invention. The reading method includes the following steps:
101: Determine the size of each file block of the DNA data file based on a preset number of processes. Preferably, to keep the file blocks uniform in size, the size of each file block is size/p, where size is the size of the DNA data file and p is the preset number of processes.
102: Split the DNA data file based on the size of each file block to obtain the file block corresponding to each process. Because the size of each file block has been determined, the DNA data file can be split successively according to that size, yielding the file block corresponding to each process.
103: Each process reads its corresponding file block in parallel.
By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiment of the present invention enables each process to read its corresponding file block in parallel, which can reduce the memory a single machine consumes to store a DNA data file.
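The uniform split of steps 101 and 102 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name nominal_blocks is hypothetical, and the use of integer division (so that the blocks stay contiguous and cover the whole file when size is not divisible by p) is a choice made here, not something the text specifies.

```python
def nominal_blocks(size: int, p: int) -> list[tuple[int, int]]:
    """Return one half-open byte range [start, end) per process.

    Block i nominally spans i*size/p to (i+1)*size/p. Integer division
    keeps neighbouring ranges contiguous, and the last range always ends
    at size, so no byte is skipped or read twice.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    return [(i * size // p, (i + 1) * size // p) for i in range(p)]


# A 1000-byte file shared by 3 processes.
print(nominal_blocks(size=1000, p=3))  # [(0, 333), (333, 666), (666, 1000)]
```

These ranges ignore record boundaries, so a range may begin in the middle of a sequence record.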
Refer to Fig. 2, a flow chart of the second embodiment of the method for reading DNA data files of the present invention. The reading method includes the following steps:
201: Determine the size of each file block of the DNA data file based on a preset number of processes. Preferably, to keep the file blocks uniform in size, the size of the DNA data file is obtained in advance and the size of each file block is set to size/p, where size is the size of the DNA data file and p is the preset number of processes.
Step 201 can be implemented in an MPI environment. The MPI environment is initialized first: before parallel input is performed, a series of initialization tasks is carried out and the initialization function MPI_Init() is called. Next, the information on all computer nodes taking part in the computation, the identification number of each process taking part in group communication, and the total number of processes taking part in group communication are declared. The input file parameters are then initialized: depending on the assembly requirements, the corresponding initialization parameters are entered, including the k-mer length, the input location of the gene sequencing sequence file, and the output location of the gene sequencing sequence file. Finally, the DNA data file is read and its size, size, is calculated: every process taking part in the computation concurrently calls MPI_File_open() to open the DNA data file to be processed and reads the total size of the DNA data file.
202: Determine the start position of each file block based on the size of each file block. Once the start position of each file block is determined, the end position of each file block follows from the contiguity of the file blocks. If the start position of the file block corresponding to the i-th process is i*size/p, then the end position of the file block corresponding to the i-th process is (i+1)*size/p, and the end position of the file block corresponding to the (i-1)-th process is i*size/p.
203: Determine the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file. Specifically, step 203 can include: searching, from the start position of any file block, for the starting point of the first DNA sequence fragment; setting the start address of that file block to the starting point of the DNA sequence fragment; and setting the end address of the preceding file block to the starting point of the DNA sequence fragment. The starting point of the first DNA sequence fragment can be searched for from the start position based on the file-block dynamic adjustment algorithm FAA.
The FAA algorithm is as follows. The input is a file in fasta or fastq format of size size. The size of the file block assigned to each thread is then determined from the number of allocated threads proc and is size/p. For thread i, the input start position start is i*size/p and the end position end is (i+1)*size/p. While readBuf[i] is not ">", i is incremented until the first ">" is found. The start of the current file block is then set to start=start+sendAdujstDelta, and the end of the preceding file block is set to end=end+sendAdujstDelta. Proceeding in this way, the start position of the file block corresponding to each thread is located exactly.
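The boundary adjustment described for FAA can be illustrated with a short Python sketch. It is not the patented implementation: the function name adjust_boundaries is hypothetical, the whole file is held in a bytes object for brevity (a real reader would scan only a small buffer such as readBuf at each nominal start), and only the ">" marker named in the text is searched for.

```python
def adjust_boundaries(data: bytes, p: int, marker: bytes = b">") -> list[tuple[int, int]]:
    """Move each nominal block start forward to the next record marker.

    The start of block i begins at i*size/p and is advanced to the first
    marker at or after that offset. The end of the preceding block is set
    to the same offset, so the blocks stay contiguous.
    """
    size = len(data)
    starts = []
    for i in range(p):
        pos = data.find(marker, i * size // p)
        # If no marker follows the nominal start, the block is empty.
        starts.append(size if pos == -1 else pos)
    ends = starts[1:] + [size]
    return list(zip(starts, ends))


fasta = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n"    # three 9-byte records
# With p=2 the nominal split is at byte 13, inside record r2,
# so the second block is moved forward to the start of r3 at byte 18.
print(adjust_boundaries(fasta, 2))  # [(0, 18), (18, 27)]
```

The difference between the adjusted and the nominal start (here 18 - 13 = 5 bytes) plays the role of the offset that the text adds to start and to the preceding block's end.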
204: Split the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process. Specifically, in an MPI environment, MPI_File_set_view() can be called and combined with the FAA algorithm to divide the DNA data file into blocks.
205: Each process reads its corresponding file block in parallel based on the start address and end address of each file block. Specifically, each process can read its corresponding file block in units of data blocks, where the data block size is a preset, adjustable size. Because the data block size is adjustable, tuning it can improve the I/O performance of the parallel file system.
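Reading a file block in units of data blocks, as in step 205, might look like the following Python sketch. Ordinary file I/O stands in for MPI file access here, and the names iter_block and chunk_size, as well as the 1 MiB default, are assumptions made for illustration.

```python
from typing import Iterator


def iter_block(path: str, start: int, end: int,
               chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the bytes in [start, end) of a file in pieces of at most chunk_size.

    chunk_size is the tunable data block size: a larger value means fewer
    read calls, a smaller value means less memory held at any one time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:  # the file ended before the requested end offset
                break
            remaining -= len(chunk)
            yield chunk
```

Because the function is a generator, a caller can process each data block as it arrives instead of holding the whole file block in memory.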
The most important requirement for parallel file reading with MPI is to know exactly where in the file the data to be processed is stored; only then can all processes read different parts of the file at the same time, achieving parallel file input. The embodiment of the present invention uses the multi-view parallel file read/write mode of MPI and, taking the characteristics of gene data files into account, calls MPI_File_set_view() to partition the file into views and uses the FAA algorithm to determine the start and end positions of the file block each process is to handle. This achieves highly concurrent, multi-process reading of file blocks.
In addition, by placing the start address and end address of every file block at the starting point of a DNA sequence fragment, the embodiment of the present invention avoids splitting the DNA sequences coarsely and never cuts a single DNA sequence fragment in two, which ensures that every DNA sequence fragment is assigned to exactly one file block.
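The property stated above, that every fragment lands in exactly one file block, can be checked with a small self-contained Python sketch. Several substitutions are made and should be read as assumptions: a thread pool stands in for the MPI processes, the data lives in memory rather than in a file, ">" is assumed to occur only at the start of a record, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_at_records(data: bytes, p: int) -> list[tuple[int, int]]:
    """Split data into p contiguous ranges whose starts lie on a '>' marker."""
    size = len(data)
    starts = []
    for i in range(p):
        pos = data.find(b">", i * size // p)
        starts.append(size if pos == -1 else pos)
    return list(zip(starts, starts[1:] + [size]))


def records_in(data: bytes, block: tuple[int, int]) -> list[bytes]:
    """Return the complete records contained in one block."""
    start, end = block
    # Each block begins at '>' or is empty, so every piece is a whole record.
    return [b">" + body for body in data[start:end].split(b">") if body]


def parallel_records(data: bytes, p: int) -> list[list[bytes]]:
    """Parse all blocks concurrently; worker threads stand in for processes."""
    blocks = split_at_records(data, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        return list(pool.map(lambda b: records_in(data, b), blocks))


data = b"".join(b">r%d\nACGT\n" % i for i in range(5))
per_block = parallel_records(data, 3)
print([len(records) for records in per_block])  # [2, 2, 1]
```

Concatenating the per-block lists gives back the five records in their original order, each exactly once, which is the behaviour the paragraph describes.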
The present invention also provides a computer-readable storage medium for storing a computer program that runs on a processor, where the computer program, when executed, performs the reading method of any of the embodiments of Fig. 1 and Fig. 2. The computer-readable storage medium includes at least one of a USB flash drive, an optical disc, a terminal, and a server, and is not limited here.
In a specific embodiment, the computer program is used for:
determining the size of each file block of a DNA data file based on a preset number of processes;
splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and
reading, by each process in parallel, its corresponding file block.
Splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:
determining the start position of each file block based on the size of each file block;
determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file; and
splitting the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process.
Reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block based on the start address and end address of each file block.
Determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment;
setting the start address of that file block to the starting point; and
setting the end address of the preceding file block to the starting point.
Searching, from the start position of any file block, for the starting point of the first DNA sequence fragment includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment based on the file-block dynamic adjustment algorithm FAA.
Reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block in units of data blocks, the data block size being a preset, adjustable size.
By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiment of the present invention enables each process to read its corresponding file block in parallel, which can reduce the memory a single machine consumes to store a DNA data file.
The above describes the present invention with reference to one or more embodiments, and the specific implementation of the present invention is not to be regarded as limited to these descriptions. Any method or structure similar or identical to that of the present invention, and any technical deduction or substitution made without departing from the inventive concept, shall fall within the protection scope of the present invention.

Claims (10)

1. A method for reading DNA data files, characterized by including:
determining the size of each file block of a DNA data file based on a preset number of processes;
splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and
reading, by each process in parallel, its corresponding file block.
2. The reading method of claim 1, characterized in that splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:
determining the start position of each file block based on the size of each file block;
determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file; and
splitting the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process;
and in that reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block based on the start address and end address of each file block.
3. The reading method of claim 2, characterized in that determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment;
setting the start address of that file block to the starting point; and
setting the end address of the preceding file block to the starting point.
4. The reading method of claim 3, characterized in that searching, from the start position of any file block, for the starting point of the first DNA sequence fragment includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment based on the file-block dynamic adjustment algorithm FAA.
5. The reading method of claim 1, characterized in that reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block in units of data blocks, the size of a data block being a preset, adjustable size.
6. A computer-readable storage medium, characterized by being used for storing a computer program that can run on a processor, the computer program being used for:
determining the size of each file block of a DNA data file based on a preset number of processes;
splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and
reading, by each process in parallel, its corresponding file block.
7. The computer-readable storage medium of claim 6, characterized in that splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:
determining the start position of each file block based on the size of each file block;
determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file; and
splitting the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process;
and in that reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block based on the start address and end address of each file block.
8. The computer-readable storage medium of claim 7, characterized in that determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequence fragments in the DNA data file includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment;
setting the start address of that file block to the starting point; and
setting the end address of the preceding file block to the starting point.
9. The computer-readable storage medium of claim 8, characterized in that searching, from the start position of any file block, for the starting point of the first DNA sequence fragment includes:
searching, from the start position of any file block, for the starting point of the first DNA sequence fragment based on the file-block dynamic adjustment algorithm FAA.
10. The computer-readable storage medium of claim 6, characterized in that reading, by each process in parallel, its corresponding file block includes:
reading, by each process in parallel, its corresponding file block in units of data blocks, the size of a data block being a preset, adjustable size.
CN201710195158.2A 2017-03-29 2017-03-29 The read method and computer-readable recording medium of DNA data files Pending CN107169313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710195158.2A CN107169313A (en) 2017-03-29 2017-03-29 The read method and computer-readable recording medium of DNA data files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710195158.2A CN107169313A (en) 2017-03-29 2017-03-29 The read method and computer-readable recording medium of DNA data files

Publications (1)

Publication Number Publication Date
CN107169313A true CN107169313A (en) 2017-09-15

Family

ID=59848975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710195158.2A Pending CN107169313A (en) 2017-03-29 2017-03-29 The read method and computer-readable recording medium of DNA data files

Country Status (1)

Country Link
CN (1) CN107169313A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826109A (en) * 2010-04-07 2010-09-08 深圳创维-Rgb电子有限公司 Large-capacity file splitting method, device and system
CN103049680A (en) * 2012-12-29 2013-04-17 深圳先进技术研究院 gene sequencing data reading method and system
CN103559020A (en) * 2013-11-07 2014-02-05 中国科学院软件研究所 Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭新 等: "基于大规模序列比对软件的并行优化方案", 《计算机工程》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326216A (en) * 2020-02-27 2020-06-23 中国科学院计算技术研究所 Rapid partitioning method for big data gene sequencing file
CN111475304A (en) * 2020-04-10 2020-07-31 中国科学院计算机网络信息中心 Feature extraction acceleration method and system
CN111475304B (en) * 2020-04-10 2023-10-03 中国科学院计算机网络信息中心 Feature extraction acceleration method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170915
