CN107480466A - Genomic data storage method and electronic equipment - Google Patents
- Publication number
- CN107480466A CN201710546293.7A
- Authority
- CN
- China
- Prior art keywords
- storage part
- representing
- information
- gene order
- statistical information
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0673—Single storage device
- G06F3/0674—Disk device
- G06F3/0676—Magnetic disk device
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B99/00—Subject matter not provided for in other groups of this subclass
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a genomic data storage method, comprising: during genome alignment, obtaining gene sequence alignment information and creating gene sequence statistical information; storing the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, storing a corresponding index in memory, the index being the storage location of the gene sequence alignment information on disk; classifying the genome statistical information to obtain first statistical information and second statistical information; storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency; and storing the second statistical information on disk, the second statistical information being statistical information that cannot be held in memory and/or whose access frequency during variant detection is lower than the preset frequency. The invention also discloses an electronic device that uses the genomic data storage method.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a genomic data storage method and an electronic device.
Background
The computational workflow for genome variant detection can generally be divided into steps such as alignment, sorting, deduplication, realignment, variant detection, and filtering. Each of the main steps writes a BAM file to the hard disk as its output (SAM stands for Sequence Alignment/Map, and a BAM file is the binary form of a SAM file, the B standing for binary). The next step then reads that file from the hard disk back into memory for further processing.
In the course of implementing the present invention, the inventors found the following problems in the prior art:
In human whole-genome data analysis, the raw data is typically around 100 GB, and every major intermediate analysis step has to read and write files of hundreds of GB, so the whole computation consumes a large amount of I/O resources and the program runs inefficiently.
The inventors found that the main causes of this problem are as follows:
1. The intermediate files are too large to be placed directly in memory.
64 GB of memory is a typical configuration for an ordinary bioinformatics analysis machine. For human whole-genome analysis data, the intermediate results are around 100 GB and generally cannot be held directly in memory. Moreover, variant detection itself requires the reference sequence and its index files to be loaded into memory, which further reduces the space available for intermediate results.
2. The format of the intermediate files cannot be used directly for computation.
Intermediate files are generally in SAM/BAM format, a row-oriented record format in which each row stores one record; loading such a file into memory does not make it directly usable for computation. The data that variant detection requires is mainly statistical information on the alignments at each site, including the count distribution of each base type at each site, the insertion/deletion (InDel) sequences and their frequencies, and the soft-clipped (soft clipping) sequences in the alignments.
Summary of the invention
In view of this, an object of the present invention is to provide a genomic data storage method and an electronic device that can solve the low efficiency caused by the need to frequently input and output large binary files during genome variant detection.
To achieve this object, the genomic data storage method provided by the present invention includes:
during alignment, obtaining gene sequence alignment information and creating gene sequence statistical information;
storing the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, storing a corresponding index in memory, the index being the storage location of the gene sequence alignment information on disk;
classifying the genome statistical information to obtain first statistical information and second statistical information;
storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency;
storing the second statistical information on disk, the second statistical information being statistical information that cannot be held in memory and/or whose access frequency during variant detection is lower than the preset frequency.
Optionally, the first statistical information includes base weighted quality value statistics, forward/reverse strand statistics, insertion/deletion statistics, and soft-clipping statistics.
Optionally, for a site at which no insertion/deletion or soft clipping occurs and at which at most 2 base types occur, the first statistical information of the site is stored using a first data structure;
The first data structure includes:
a first header for representing the base type;
a first quality value storage part for representing the base weighted quality value;
a first forward strand count storage part for representing the number of forward strands;
a first reverse strand count storage part for representing the number of reverse strands.
Optionally, for a site at which an insertion/deletion occurs and at which 3 to 4 base types occur, the first statistical information of the site is stored using the first data structure and a second data structure;
The second data structure includes:
base weighted quality value statistics and forward/reverse strand statistics for each of the 4 base types; the storage structure for the base weighted quality value statistics and forward/reverse strand statistics of each base type specifically includes: a second quality value storage part for representing the base weighted quality value, a second forward strand count storage part for representing the number of forward strands, and a second reverse strand count storage part for representing the number of reverse strands;
first insertion statistics, specifically including: a first insertion sequence storage part for representing the insertion sequence, and a first low-quality insertion count storage part for representing the number of low-quality insertions;
first deletion statistics, specifically including: a first deletion length storage part for representing the deletion length, a first high-quality deletion count storage part for representing the number of high-quality deletions, and a first low-quality deletion count storage part for representing the number of low-quality deletions;
The first data structure includes:
a second header filled with "11";
a first insertion information storage part for indicating whether an insertion exists, specifically including: a first insertion information sub-storage part for indicating whether an insertion exists, an insertion length sub-storage part for representing the insertion length, and a low-quality insertion count storage part for representing the number of low-quality insertions;
a first deletion information storage part for indicating whether a deletion exists, specifically including: a first deletion information sub-storage part for indicating whether a deletion exists;
a pointer to the storage location of the corresponding second data structure.
Optionally, for a site at which more than one insertion/deletion occurs or at which the insertion length exceeds 12 bases, the first statistical information of the site is stored using the first data structure and a third data structure, and a memory pool is created in memory to store the first statistical information of such sites;
The third data structure includes:
base weighted quality value statistics and forward/reverse strand statistics for each of the 4 base types; the storage structure for the base weighted quality value statistics and forward/reverse strand statistics of each base type specifically includes: a third quality value storage part for representing the base weighted quality value, a third forward strand count storage part for representing the number of forward strands, and a third reverse strand count storage part for representing the number of reverse strands;
second insertion statistics, specifically including: an insertion length storage part for representing the insertion length, a second insertion sequence storage part for representing the insertion sequence, a second low-quality insertion count storage part for representing the number of low-quality insertions, and a high-quality insertion count storage part for representing the number of high-quality insertions;
second deletion statistics, specifically including: a second deletion length storage part for representing the deletion length, a second high-quality deletion count storage part for representing the number of high-quality deletions, and a second low-quality deletion count storage part for representing the number of low-quality deletions;
The first data structure includes:
a third header filled with "11";
a second insertion information storage part for indicating whether an insertion exists, specifically including: a second insertion information sub-storage part for indicating whether an insertion exists, a first memory pool information sub-storage part for indicating whether the memory pool is used, and a first occupied length sub-storage part for representing the length occupied in the memory pool;
a second deletion information storage part for indicating whether a deletion exists, specifically including: a second deletion information sub-storage part for indicating whether a deletion exists, a second memory pool information sub-storage part for indicating whether the memory pool is used, and a second occupied length sub-storage part for representing the length occupied in the memory pool.
Optionally, the soft-clipping statistics are recorded using a dynamic array, each record of which includes:
a soft-clip position storage part for representing the position on the genome at which the soft clipping occurs;
a soft-clip left count storage part for representing the number of times soft clipping occurs on the left side of the corresponding site;
a soft-clip right count storage part for representing the number of times soft clipping occurs on the right side of the corresponding site.
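By way of illustration only, the soft-clipping record and its dynamic array might be sketched as follows in Python. The class names, field names, and the position lookup table are hypothetical and are not taken from the patent; only the three record fields (position, left count, right count) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class SoftClipRecord:
    position: int         # genome position at which soft clipping occurs
    left_count: int = 0   # times soft clipping occurs on the left of the site
    right_count: int = 0  # times soft clipping occurs on the right of the site


class SoftClipTable:
    """Dynamic array of soft-clip records (illustrative sketch only)."""

    def __init__(self):
        self.records = []  # the dynamic array; it grows as new sites appear
        self._slot = {}    # position -> index in records (hypothetical helper)

    def add(self, position, side):
        index = self._slot.get(position)
        if index is None:
            index = len(self.records)
            self._slot[position] = index
            self.records.append(SoftClipRecord(position))
        record = self.records[index]
        if side == "left":
            record.left_count += 1
        elif side == "right":
            record.right_count += 1
        else:
            raise ValueError("side must be 'left' or 'right'")
        return record
```

Because only sites that actually show soft clipping receive a record, the array stays small compared with a per-site table covering the whole genome.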
Optionally, the index includes a paired-end alignment information index and a single-end alignment information index;
The paired-end alignment information index is stored using a paired-end alignment array structure, which includes:
a first ID storage part for representing the ID of the gene sequence;
a first alignment position storage part for representing the position on the genome to which the gene sequence is aligned;
an insert fragment length storage part for representing the insert fragment length of the gene sequence;
a first alignment quality value storage part for representing the alignment quality value of the gene sequence;
a first average quality value storage part for representing the average quality value of the gene sequence;
The single-end alignment information index is stored using a single-end alignment array structure, which includes:
a second ID storage part for representing the ID of the gene sequence;
a second alignment position storage part for representing the position on the genome to which the gene sequence is aligned;
a second alignment quality value storage part for representing the alignment quality value of the gene sequence;
a second average quality value storage part for representing the average quality value of the gene sequence;
Wherein, for each gene sequence used for alignment, the corresponding indexes are arranged in order of the alignment position of the gene sequence on the genome.
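The benefit of keeping the index ordered by alignment position can be illustrated with a short Python sketch. The record type, the field names, and the two helper functions below are hypothetical; the patent specifies only which values an index entry holds and that entries are arranged by alignment position.

```python
import bisect
from typing import NamedTuple


class PairedEndIndexEntry(NamedTuple):
    read_id: int            # ID of the gene sequence
    position: int           # alignment position on the genome
    insert_length: int      # insert fragment length
    alignment_quality: int  # alignment quality value
    average_quality: int    # average quality value


def build_index(entries):
    """Arrange index entries in order of alignment position."""
    return sorted(entries, key=lambda entry: entry.position)


def entries_in_range(index, start, end):
    """Return the entries with start <= position < end by binary search."""
    # Rebuilding the key list on each call keeps the sketch short; a real
    # index would keep the positions in a separate array.
    positions = [entry.position for entry in index]
    low = bisect.bisect_left(positions, start)
    high = bisect.bisect_left(positions, end)
    return index[low:high]
```

With the entries sorted, all sequences covering a genomic window can be located by two binary searches instead of a scan over every entry.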
Optionally, storing the gene sequence alignment information on disk specifically includes:
dividing all of the gene sequence alignment information into 512 files stored on disk, each file storing the gene sequence alignment information of a certain genomic interval, the data storage structure of each piece of gene sequence alignment information including:
a sequence length storage part for representing the sequence length of the gene sequence;
a sequence storage part for representing the gene sequence itself;
a quality value storage part for representing the quality values of the gene sequence;
a start position storage part for representing the position at which the alignment algorithm starts when the gene sequence is aligned;
a strand storage part for representing the forward/reverse strand information of the gene sequence when it is aligned;
a region length storage part for representing the length of the genomic region selected when the gene sequence is aligned;
a left position storage part for representing the left anchor position of the gene sequence when it is aligned;
a right position storage part for representing the right anchor position of the gene sequence when it is aligned.
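One possible way to assign an alignment to one of the 512 files is sketched below in Python. The text states only that each file covers a certain genomic interval; the equal-width intervals, the single global coordinate, and the function name used here are assumptions made for illustration.

```python
NUM_FILES = 512  # number of files given in the text


def file_index(global_position, genome_length, num_files=NUM_FILES):
    """Map a genome position to the file covering its interval.

    Assumes equal-width intervals over one global coordinate that runs
    from 0 to genome_length - 1 (a hypothetical choice).
    """
    if not 0 <= global_position < genome_length:
        raise ValueError("position lies outside the genome")
    interval = -(-genome_length // num_files)  # ceiling division
    return global_position // interval
```

Under this scheme, alignments that fall close together on the genome are written to the same file, so a later step that processes one region only needs to open the file or files covering it.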
Optionally, the method further includes:
during deduplication, subtracting from the genome statistical information the interference caused by duplicate sequences;
and/or
during realignment, extracting the gene sequences of the realignment regions of the genome and, after the gene sequences of the realignment regions have been realigned, adjusting the genome statistical information of the gene sequences of the realignment regions.
As can be seen from the above, the genomic data storage method and electronic device provided by the present invention use a carefully designed data storage structure based on the characteristics of the intermediate files in the whole variant detection process. The main intermediate data is kept in memory and can be accessed directly from memory, so the steps of the variant detection process no longer need to perform large amounts of disk I/O, which significantly improves the efficiency of the whole variant detection and analysis workflow.
Brief description of the drawings
Fig. 1 is a schematic flow chart of one embodiment of the genomic data storage method provided by the present invention;
Fig. 1a is a schematic diagram of the first data structure for a site at which no insertion/deletion or soft clipping occurs and at most 2 base types occur;
Fig. 1b is a schematic diagram of the second data structure for a site at which an insertion/deletion (InDel) occurs and 3 to 4 base types occur;
Fig. 1c is a schematic diagram of the first data structure for a site at which an insertion/deletion (InDel) occurs and 3 to 4 base types occur;
Fig. 1d is a schematic diagram of the third data structure for a site at which more than one insertion/deletion occurs or the insertion length exceeds 12 bases;
Fig. 1e is a schematic diagram of the first data structure for a site at which more than one insertion/deletion occurs or the insertion length exceeds 12 bases;
Fig. 1f is a schematic diagram of the dynamic array used to record the soft-clipping statistics;
Fig. 1g is a schematic diagram of the index;
Fig. 1h is a schematic diagram of the data storage structure of each piece of gene sequence alignment information;
Fig. 2 is a schematic flow chart of one embodiment of the genome sequence alignment method provided by the present invention;
Fig. 3 is a schematic structural diagram of one embodiment of the genomic data storage device provided by the present invention;
Fig. 4 is a schematic structural diagram of one embodiment of the electronic device provided by the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two entities or two parameters that share a name but are not the same. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; subsequent embodiments do not repeat this explanation.
To achieve the above object, one aspect of the embodiments of the present invention provides an embodiment of a genomic data storage method that can solve the low efficiency caused by the need to frequently input and output large binary files during genome variant detection. Fig. 1 is a schematic flow chart of one embodiment of the genomic data storage method provided by the present invention.
The genomic data storage method includes:
Step 101: during alignment, obtain gene sequence alignment information and create gene sequence statistical information; the gene sequence alignment information is the alignment result information produced during genome alignment, and the gene sequence statistical information can be extracted from this alignment result information;
Step 102: store the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, store a corresponding index in memory; the index is the storage location of the gene sequence alignment information on disk;
Step 103: classify the gene sequence statistical information to obtain first statistical information and second statistical information;
Step 104: store the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency;
Step 105: store the second statistical information on disk, the second statistical information being statistical information that cannot be held in memory and/or whose access frequency during variant detection is lower than the preset frequency.
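The classification in steps 103 to 105 can be illustrated with a minimal Python sketch. The function name, the dictionaries of sizes and access frequencies, and the explicit memory budget are hypothetical; the patent does not state how access frequency is measured or how the preset frequency is chosen.

```python
def classify_statistics(sizes, access_frequency, preset_frequency, memory_budget):
    """Split statistics into a memory-resident set and a disk-resident set.

    sizes:            name -> size in bytes (hypothetical input)
    access_frequency: name -> access frequency during variant detection
    Returns (first, second): names kept in memory, names stored on disk.
    """
    first, second = [], []
    used = 0
    # Consider the most frequently accessed statistics first.
    for name in sorted(sizes, key=lambda n: access_frequency[n], reverse=True):
        size = sizes[name]
        is_hot = access_frequency[name] > preset_frequency
        if is_hot and used + size <= memory_budget:
            first.append(name)   # first statistical information: memory
            used += size
        else:
            second.append(name)  # second statistical information: disk
    return first, second
```

In this sketch a statistic whose access frequency exactly equals the preset frequency is sent to disk; the text leaves that boundary case open.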
As the above embodiment shows, the genomic data storage method provided by the embodiment of the present invention uses a carefully designed data storage structure based on the characteristics of the intermediate files in the whole variant detection process (including steps such as alignment, sorting, deduplication, realignment, variant detection, and filtering). The main intermediate data is kept in memory and can be accessed directly from memory, so the steps of the variant detection process no longer need to perform large amounts of disk I/O, which significantly improves the efficiency of the whole variant detection and analysis workflow.
In some optional embodiments, the first statistical information includes base weighted quality value statistics, forward/reverse strand statistics, insertion/deletion statistics, and soft-clipping statistics, specifically:
The base weighted quality value statistics (Weighted Count):
Each aligned base has a quality value between 0 and 40, and the weight assigned to a base according to its quality value is shown in the following table:
Base Quality Scores | Parameter* | Weight |
0–10 | [0–Weight0] | 0 |
11–13 | (Weight0–Weight1) | 1 |
14–17 | (Weight1–Weight2) | 2 |
18–20 | (Weight2–Weight3) | 3 |
21–40 | (Weight3–40) | 4 |
The weights of all identical bases aligned to the same position are added together to give the weighted quality value sum for that base type;
The forward/reverse strand statistics (Strand Count): the numbers of gene sequences aligned to the same position in the forward direction and in the reverse direction;
The insertion/deletion statistics and insertion sequence information (InDel Count): the inserted or deleted sequences that the aligned gene sequences show at a given genome position, and the cumulative number of times each occurs;
The soft-clipping statistics (Soft Clip Count): the number of times soft clipping (soft clip) occurs in the aligned gene sequences at a given genome position.
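The weighting in the table can be illustrated with a short Python sketch. Reading Weight0 to Weight3 as adjustable thresholds whose default values are 10, 13, 17, and 20 is an interpretation of the Parameter column, not something the text states outright; the function names are hypothetical.

```python
# Upper bounds of the first four quality bands, read from the table above.
# Treating them as the values of Weight0..Weight3 is an assumption.
DEFAULT_THRESHOLDS = (10, 13, 17, 20)


def base_weight(quality, thresholds=DEFAULT_THRESHOLDS):
    """Return the weight (0 to 4) for a base quality score between 0 and 40."""
    if not 0 <= quality <= 40:
        raise ValueError("base quality must lie between 0 and 40")
    # The weight equals the number of thresholds the quality exceeds.
    return sum(quality > threshold for threshold in thresholds)


def weighted_counts(pileup):
    """Sum the weights per base type for the bases aligned to one site.

    pileup: iterable of (base, quality) pairs (hypothetical input form).
    """
    totals = {}
    for base, quality in pileup:
        totals[base] = totals.get(base, 0) + base_weight(quality)
    return totals
```

For example, three reads showing A with qualities 30, 12, and 19 contribute weights 4, 1, and 3, so the weighted count for A at that site is 8.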
In some optional embodiments, consider the simplest case: for a site at which no insertion/deletion or soft clipping occurs and at which at most 2 base types occur, the first statistical information of the site is stored using a first data structure. Optionally, the first data structure is an 8-byte data structure, Counter, and one 8-byte Counter holds the information for one site. The whole human genome contains about 3G sites, so about 24 GB of memory is required;
The first data structure, as shown in Fig. 1a, stores the statistics of two bases (base1 information and base2 information), each in the same 4-byte structure, which includes:
a first header for representing the base type; optionally, the first header (base) uses 2 bits to represent the base type, with bases A, C, G, and T represented by 00, 01, 10, and 11 respectively;
a first quality value storage part for representing the base weighted quality value; optionally, the first quality value storage part (weighted count) uses 14 bits to represent the weighted quality value sum, with a maximum of 16383;
a first forward strand count storage part for representing the number of forward strands; optionally, the first forward strand count storage part (+ve strand count) uses 1 byte (8 bits) to represent the number of forward strands, with a maximum of 255;
a first reverse strand count storage part for representing the number of reverse strands; optionally, the first reverse strand count storage part (-ve strand count) uses 1 byte (8 bits) to represent the number of reverse strands, with a maximum of 255.
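The 8-byte Counter can be illustrated with the following Python sketch, which packs two bases into two 32-bit words. The field widths (2, 14, 8, and 8 bits) and the base codes come from the text; the order of the fields within each word, the big-endian byte order, and the function names are assumptions. How values beyond the field limits are handled is not described in this passage, so the sketch simply rejects them.

```python
import struct

BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def pack_base(base, weighted, forward, reverse):
    """Pack one base's statistics into 32 bits: 2 | 14 | 8 | 8."""
    if not (0 <= weighted <= 0x3FFF and 0 <= forward <= 0xFF and 0 <= reverse <= 0xFF):
        raise ValueError("value does not fit the Counter field width")
    return (BASE_CODE[base] << 30) | (weighted << 16) | (forward << 8) | reverse


def unpack_base(word):
    """Return (base, weighted count, forward count, reverse count)."""
    return (CODE_BASE[word >> 30], (word >> 16) & 0x3FFF,
            (word >> 8) & 0xFF, word & 0xFF)


def pack_counter(base1, base2):
    """Pack the statistics of two bases into one 8-byte Counter."""
    return struct.pack(">II", pack_base(*base1), pack_base(*base2))


def unpack_counter(raw):
    word1, word2 = struct.unpack(">II", raw)
    return unpack_base(word1), unpack_base(word2)
```

At 8 bytes per site, 3 × 10^9 sites occupy 24 × 10^9 bytes, which matches the figure of about 24 GB given above.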
In some optional embodiments, for a site at which an insertion/deletion (InDel) occurs and at which 3 to 4 base types occur, the first statistical information of the site is stored using the first data structure and a second data structure. Optionally, a 32-byte data structure, OverflowCounter (overflow counter), holds the information for one site: the statistics for bases A, C, G, and T (base A information, base C information, base G information, and base T information) take 6 bytes each, and the insertion information (Insertion Info.) and the deletion information (Deletion Info.) take 4 bytes each. Empirically, 30X whole-genome data contains about 200M such sites;
The second data structure, as shown in Fig. 1b, includes:
base weighted quality value statistics and forward/reverse strand statistics for each of the 4 base types; the storage structure for the base weighted quality value statistics and forward/reverse strand statistics of each base type specifically includes: a second quality value storage part for representing the base weighted quality value (weighted count; optionally, 2 bytes represent the weighted quality value sum, with a maximum of 65535), a second forward strand count storage part for representing the number of forward strands (+ve strand count; optionally, 2 bytes represent the number of forward strands, with a maximum of 65535), and a second reverse strand count storage part for representing the number of reverse strands (-ve strand count; optionally, 2 bytes represent the number of reverse strands, with a maximum of 65535);
first insertion statistics, specifically including: a first insertion sequence storage part for representing the insertion sequence (Insertion Pattern; optionally, 3 bytes are used, which can represent at most 12 bases), and a first low-quality insertion count storage part for representing the number of low-quality insertions (LQ count; optionally, 1 byte is used, with a maximum of 255);
first deletion statistics, specifically including: a first deletion length storage part for representing the deletion length (Del. Len; optionally, 1 byte is used, with a maximum of 255), a first high-quality deletion count storage part for representing the number of high-quality deletions (HQ count; optionally, 1 byte is used, with a maximum of 255), and a first low-quality deletion count storage part for representing the number of low-quality deletions (LQ count; optionally, 1 byte is used, with a maximum of 255); optionally, 1 byte of unused storage space is also included;
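The 32-byte OverflowCounter can be illustrated with the following Python sketch. The field sizes follow the text (four bases at 6 bytes each, 4 bytes of insertion information, 4 bytes of deletion information). The field order, the byte order, the 2-bits-per-base left-aligned packing of the insertion pattern, and all function names are assumptions made for illustration.

```python
import struct

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
# 12 unsigned 16-bit values (4 bases x 3 counts), a 3-byte insertion
# pattern, 1 byte LQ insertion count, then deletion length, HQ count,
# LQ count, and 1 unused byte.
OVERFLOW_FORMAT = ">12H3sB4B"


def encode_pattern(sequence):
    """Pack up to 12 bases into 3 bytes at 2 bits per base (left-aligned)."""
    if len(sequence) > 12:
        raise ValueError("3 bytes hold at most 12 bases")
    value = 0
    for base in sequence:
        value = (value << 2) | CODE[base]
    value <<= 2 * (12 - len(sequence))  # pad unused positions with zeros
    return value.to_bytes(3, "big")


def decode_pattern(raw, length):
    """Recover the first `length` bases from a 3-byte pattern."""
    value = int.from_bytes(raw, "big")
    return "".join("ACGT"[(value >> (22 - 2 * i)) & 3] for i in range(length))


def pack_overflow(base_stats, ins_seq, ins_lq, del_len, del_hq, del_lq):
    """base_stats: base -> (weighted count, forward count, reverse count)."""
    fields = []
    for base in "ACGT":
        fields.extend(base_stats[base])
    return struct.pack(OVERFLOW_FORMAT, *fields, encode_pattern(ins_seq),
                       ins_lq, del_len, del_hq, del_lq, 0)
```

Three bytes provide 24 bits, which at 2 bits per base gives the 12-base limit stated above. At about 200M such sites and 32 bytes each, these records would occupy roughly 6.4 GB.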
When OverflowCounter is used, the content stored in the corresponding first data structure changes. The first data structure, as shown in Fig. 1c, then includes:
a second header filled with "11"; the fields that originally stored the base types of base1 information and base2 information are both filled with "11" to indicate that OverflowCounter is in use;
a first insertion information storage part (Insertion Information) for indicating whether an insertion exists; optionally, the insertion information is held in 14 bits and specifically includes: a first insertion information sub-storage part (1 bit) for indicating whether an insertion exists, an insertion length sub-storage part (Ins. Len., 4 bits) for representing the insertion length, and a low-quality insertion count storage part (LQ count, 8 bits) for representing the number of low-quality insertions; optionally, the remaining 1 bit is set to 0;
a first deletion information storage part (Deletion Information) for indicating whether a deletion exists; optionally, the deletion information is held in 14 bits: 1 bit indicates whether a deletion exists (the first deletion information sub-storage part), 1 bit is set to 0, and 12 bits are unused (Unused);
a pointer to the storage location of the corresponding second data structure (array index pointing to dynamic array of overflow counter); optionally, 4 bytes are used to hold a pointer to the position of the OverflowCounter data.
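The bit budget of the Counter in this mode (2 + 14 + 2 + 14 + 32 = 64 bits) can be illustrated with the following Python sketch. Only the field widths come from the text. The arrangement shown here is hypothetical: it assumes each 4-byte half of the Counter begins with its 2-bit base-type field, keeps "11" in both of those fields, and splits the 32-bit array index across the two halves; the arrangement in Fig. 1c may differ. The function names and the order of sub-fields inside the 14-bit blocks are likewise assumptions.

```python
import struct


def pack_overflow_mode(has_ins, ins_len, ins_lq, has_del, overflow_index):
    """Pack an 8-byte Counter that refers to an OverflowCounter."""
    if not 0 <= ins_len <= 0xF:
        raise ValueError("insertion length must fit in 4 bits")
    if not 0 <= ins_lq <= 0xFF:
        raise ValueError("low-quality insertion count must fit in 8 bits")
    if not 0 <= overflow_index <= 0xFFFFFFFF:
        raise ValueError("array index must fit in 4 bytes")
    # 14-bit insertion info: present (1) | zero (1) | length (4) | LQ count (8)
    ins_info = (int(has_ins) << 13) | (ins_len << 8) | ins_lq
    # 14-bit deletion info: present (1) | zero (1) | unused (12)
    del_info = int(has_del) << 13
    word1 = (0b11 << 30) | (ins_info << 16) | (overflow_index >> 16)
    word2 = (0b11 << 30) | (del_info << 16) | (overflow_index & 0xFFFF)
    return struct.pack(">II", word1, word2)


def unpack_overflow_mode(raw):
    word1, word2 = struct.unpack(">II", raw)
    if word1 >> 30 != 0b11 or word2 >> 30 != 0b11:
        raise ValueError("not an overflow-mode Counter")
    ins_info = (word1 >> 16) & 0x3FFF
    del_info = (word2 >> 16) & 0x3FFF
    return {
        "has_insertion": bool(ins_info >> 13),
        "insertion_length": (ins_info >> 8) & 0xF,
        "low_quality_insertions": ins_info & 0xFF,
        "has_deletion": bool(del_info >> 13),
        "overflow_index": ((word1 & 0xFFFF) << 16) | (word2 & 0xFFFF),
    }
```

Since a Counter in its ordinary form describes at most two base types at a site, both base-type fields holding "11" (the code for T) presumably never arises in that form, which would let the pair serve as the overflow marker.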
In some optional embodiments, for a site at which more than one insertion/deletion occurs or at which the insertion length exceeds 12 bases, the first statistical information of the site is stored using the first data structure and a third data structure, and for the first statistical information of such sites a memory pool (Memory Pool, a block of memory set aside for this purpose) is created in memory for storage. The statistics for bases A, C, G, and T (base A information, base C information, base G information, and base T information) take 6 bytes each, and pointers to the insertion/deletion information are recorded in the OverflowCounter, as shown in Fig. 1d;
The third data structure, as shown in Fig. 1d, comprises:
Base weighted quality value statistical information and forward/reverse strand statistical information for each of the 4 base types; the storage structure for each base type specifically comprises: a third quality value storage part for representing the base weighted quality value (weighted count; optionally, the weighted quality value is represented with 2 bytes, with a maximum of 65535), a third forward strand count storage part for representing the number of forward strands (+ve strand count; optionally, represented with 2 bytes, with a maximum of 65535), and a third reverse strand count storage part for representing the number of reverse strands (-ve strand count; optionally, represented with 2 bytes, with a maximum of 65535);
Second insertion statistical information (Insertion Ptr), optionally represented with 4 bytes, specifically comprising: an insertion length storage part for representing the insertion length (Insertion length; optionally, 1 byte), a second insertion sequence storage part for representing the inserted sequence (Insertion pattern; variable length, 2 bits per base), a second low-quality insertion count storage part for representing the number of low-quality insertions (LQ count; optionally, 1 byte), and a high-quality insertion count storage part for representing the number of high-quality insertions (HQ count; optionally, 1 byte);
Second deletion statistical information (Deletion Ptr), optionally represented with 4 bytes, specifically comprising: a second deletion length storage part for representing the deletion length (Deletion length; optionally, 1 byte), a second high-quality deletion count storage part for representing the number of high-quality deletions (HQ count; optionally, 1 byte), and a second low-quality deletion count storage part for representing the number of low-quality deletions (LQ count; optionally, 1 byte);
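The per-base part of the third data structure (6 bytes for each of A, C, G and T) can be sketched with fixed-width records. The byte order, the A/C/G/T ordering and the choice to clamp counters at 65535 are assumptions made for this example; the text only gives the field widths and the maximum value.

```python
import struct

# One record per base: weighted count, +ve strand count, -ve strand count (2 bytes each).
BASE_RECORD = struct.Struct("<HHH")  # little-endian is an assumption
MAX_U16 = 65535


def pack_base_stats(stats: dict) -> bytes:
    """Serialize {base: (weighted, plus, minus)} into 4 x 6 = 24 bytes, in A, C, G, T order.

    Counters above 65535 are clamped to the maximum (an assumed policy, not from the text).
    """
    out = bytearray()
    for base in "ACGT":
        weighted, plus, minus = stats.get(base, (0, 0, 0))
        out += BASE_RECORD.pack(
            min(weighted, MAX_U16), min(plus, MAX_U16), min(minus, MAX_U16)
        )
    return bytes(out)
```

Reading a single base back is a matter of `BASE_RECORD.unpack_from(data, 6 * index)`, where `index` is the position of the base in "ACGT".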
At the same time, the information recorded in the Counter changes as shown in Fig. 1e; the first data structure then comprises:
A third header filled with 11; the data fields originally used to store the two base types, base1information and base2information, are all filled with "11" to indicate that the OverflowCounter is in use;
A second insertion information storage part (Insertion Information) for indicating whether an insertion exists, optionally represented with 14 bits and specifically comprising: a second insertion information sub-storage part (1 bit) for indicating whether an insertion exists, a first memory pool information sub-storage part (1 bit) for indicating whether the memory pool is used, and a first occupied length sub-storage part (12 bits) for representing the length occupied in the memory pool;
A second deletion information storage part (Deletion Information) for indicating whether a deletion exists, optionally represented with 14 bits and specifically comprising: a second deletion information sub-storage part (1 bit) for indicating whether a deletion exists, a second memory pool information sub-storage part (1 bit) for indicating whether the memory pool is used, and a second occupied length sub-storage part (12 bits) for representing the length occupied in the memory pool.
Empirically, soft clipping occurs at only a small number of genomic positions, so there is no need to allocate a separate block of storage for every site. Therefore, in some optional embodiments, the soft clipping statistical information is recorded in a dynamic array, as shown in Fig. 1f. Each record has the form {position, left counts, right counts}, occupies 12 bytes, and specifically comprises:
A soft clipping position storage part (position) for representing the position of the soft clipping on the genome, occupying 4 bytes;
A soft clipping left count storage part (left counts) for representing the number of times soft clipping occurs on the left side of the corresponding site, occupying 4 bytes;
A soft clipping right count storage part (right counts) for representing the number of times soft clipping occurs on the right side of the corresponding site, occupying 4 bytes.
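The sparse soft clipping table can be sketched as a growable list of 12-byte records. The class name, the helper dictionary used to find an existing record, and the little-endian encoding are assumptions for illustration; the text only fixes the record form {position, left counts, right counts}.

```python
import struct

SOFT_CLIP = struct.Struct("<III")  # position, left counts, right counts: 4 bytes each


class SoftClipTable:
    """Dynamic array holding one record per genomic position where soft clipping was seen."""

    def __init__(self):
        self._slot = {}      # position -> index into _records (hypothetical lookup helper)
        self._records = []   # each entry is [position, left_count, right_count]

    def add(self, position: int, side: str) -> None:
        """Count one soft clipping event on the 'left' or 'right' of a position."""
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        slot = self._slot.get(position)
        if slot is None:
            slot = len(self._records)
            self._slot[position] = slot
            self._records.append([position, 0, 0])
        self._records[slot][1 if side == "left" else 2] += 1

    def to_bytes(self) -> bytes:
        """Serialize every record as 12 bytes, in insertion order."""
        return b"".join(SOFT_CLIP.pack(*record) for record in self._records)
```

Only positions that actually see soft clipping consume space, which is the point made in the paragraph above.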
In some optional embodiments, as shown in Fig. 1g, the index comprises a paired-end (Pair End) alignment information index and a single-end (Single End) alignment information index;
The paired-end alignment information index is stored using a paired-end alignment array structure (PairEndAlignmentInfo, occupying 12 bytes), which comprises:
A first ID storage part (ReadID) for representing the ID of the gene sequence, occupying 4 bytes;
A first aligned position storage part (Aligned Position) for representing the position on the genome to which the gene sequence is aligned, occupying 4 bytes;
An insert size storage part (Insert Size) for representing the insert size of the gene sequence, occupying 2 bytes;
A first mapping quality storage part (MAPQ) for representing the mapping quality of the gene sequence, occupying 1 byte;
A first average quality value storage part (Average Base Quality) for representing the average base quality of the gene sequence, occupying 1 byte;
The single-end alignment information index is stored using a single-end alignment array structure (SingleEndAlignmentInfo, occupying 10 bytes), which comprises:
A second ID storage part (ReadID) for representing the ID of the gene sequence, occupying 4 bytes;
A second aligned position storage part (Aligned Position) for representing the position on the genome to which the gene sequence is aligned, occupying 4 bytes;
A second mapping quality storage part (MAPQ) for representing the mapping quality of the gene sequence, occupying 1 byte;
A second average quality value storage part (Average Base Quality) for representing the average base quality of the gene sequence, occupying 1 byte;
Wherein, for every aligned gene sequence, the corresponding index entries are arranged in order of the alignment position of the gene sequence on the genome.
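The two index record layouts and the position-ordered arrangement can be sketched as follows. The field widths come from the text; the byte order, the tuple representation and the function names are assumptions. Keeping the entries sorted by aligned position is what allows a region to be looked up by binary search.

```python
import struct
from bisect import bisect_left, bisect_right

# ReadID (4), Aligned Position (4), Insert Size (2), MAPQ (1), Average Base Quality (1)
PAIR_END = struct.Struct("<IIHBB")
# ReadID (4), Aligned Position (4), MAPQ (1), Average Base Quality (1)
SINGLE_END = struct.Struct("<IIBB")


def build_index(entries: list) -> list:
    """Arrange index entries in order of their aligned position (field 1 of each tuple)."""
    return sorted(entries, key=lambda entry: entry[1])


def reads_in_region(index: list, start: int, end: int) -> list:
    """Return the entries whose aligned position lies in [start, end], via binary search."""
    positions = [entry[1] for entry in index]
    return index[bisect_left(positions, start):bisect_right(positions, end)]
```

In a real implementation the position list would be kept alongside the index rather than rebuilt per query; it is recomputed here only to keep the sketch short.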
Because sequences are read in a highly random order during variant detection, in some optional embodiments storing the gene sequence alignment information on disk specifically comprises:
All gene sequence alignment information is divided into 512 files (buckets) stored on disk, each file storing the gene sequence alignment information of a certain genomic interval. As shown in Fig. 1h, the data storage structure of each piece of gene sequence alignment information comprises:
A sequence length storage part (Read Length) for representing the sequence length of the gene sequence, occupying 2 bytes;
A sequence storage part (Packed Read) for representing the gene sequence itself, of variable length, with 2 bits per base;
A quality value storage part (Base Qualities) for representing the quality values of the gene sequence, of variable length;
A start position storage part (DP Start Pos.) for representing the position at which the alignment algorithm starts when the gene sequence is aligned, occupying 4 bytes;
A strand storage part (Strand) for representing the forward/reverse strand information of the gene sequence when aligned, occupying 1 bit;
A region length storage part (DP Ref. length) for representing the length of the genomic region selected when the gene sequence is aligned, occupying 15 bits;
A left position storage part (Left Anchor) for representing the left anchor position of the gene sequence when aligned, occupying 4 bytes;
A right position storage part (Right anchor) for representing the right anchor position of the gene sequence when aligned, occupying 4 bytes.
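Three parts of the on-disk record lend themselves to a short sketch: the 2-bit base packing, the shared 16-bit word holding the strand bit and the 15-bit region length, and the choice of one of the 512 bucket files. The base codes (A=0, C=1, G=2, T=3), the bit order, and the equal-width bucket intervals are assumptions; the text does not say how interval boundaries are chosen.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
BASES = "ACGT"


def pack_read(seq: str) -> bytes:
    """Pack a read at 2 bits per base, four bases per byte, lowest bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_read(packed: bytes, length: int) -> str:
    """Recover the read; the length comes from the separate Read Length field."""
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 0x3] for i in range(length))


def pack_strand_and_ref_len(is_reverse: bool, ref_len: int) -> int:
    """Combine the 1-bit strand flag and the 15-bit region length into one 16-bit word."""
    if not 0 <= ref_len < (1 << 15):
        raise ValueError("region length must fit in 15 bits")
    return (int(is_reverse) << 15) | ref_len


def bucket_for(position: int, genome_length: int, n_buckets: int = 512) -> int:
    """Map a genomic position to a bucket file, assuming equal-width intervals."""
    return min(position * n_buckets // genome_length, n_buckets - 1)
```

Because each bucket covers one genomic interval, the reads needed for a region can be fetched from a single file instead of seeking across the whole data set.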
In addition to creating the statistical information and the index during alignment as described in the foregoing embodiments, in some optional embodiments the method further comprises:
During de-duplication, subtracting from the genome statistical information the interference caused by duplicate sequences;
And/or
During realignment, extracting the gene sequences of the genomic regions to be realigned and, after realigning the gene sequences of those regions, adjusting the genome statistical information of the gene sequences of the realigned regions;
During variant detection, using this statistical information directly to calculate the probabilities of the various genotypes.
With the genomic data storage method provided by the embodiments of the present invention, the analysis process as a whole no longer needs to repeatedly output large numbers of binary files. With the overall algorithm optimization, the analysis of one whole genome can be completed within 4 hours, whereas a typical analysis pipeline takes tens of hours. The I/O during variant detection and analysis is greatly reduced, and the analysis efficiency of the program is greatly improved.
To facilitate understanding of the foregoing solution, an embodiment of a genome sequence alignment method is briefly introduced here to explain the genome alignment process of step 101 in the foregoing embodiment. Fig. 2 is a schematic flowchart of one embodiment of the genome sequence alignment method provided by the present invention.
The genome sequence alignment method comprises the following steps:
Step 201: Obtain the reference genome sequence and the genome sequence file to be aligned. The files are obtained in a conventional manner. The genome sequence file to be aligned may be in FASTQ format.
In the genome sequence alignment method, sequence alignment is carried out in 3 levels: each time, a portion of the sequences is read from the input genome sequence file to be aligned, and the level-1, level-2 and level-3 alignment algorithms are then performed in turn; sequences that are not aligned at one level proceed to the alignment algorithm of the next level. Specifically, the method comprises the following steps.
Step 202: Read a portion of the genome sequences from the genome sequence file to be aligned.
Step 203: Align the portion of genome sequences to the reference genome sequence using the bidirectional BWT alignment algorithm (level 1; bidirectional BWT: Bi-directional Burrows-Wheeler Transform). The bidirectional BWT alignment algorithm handles the alignment of reads with at most 4 base mismatches. Reads are the sequenced sequences obtained in high-throughput sequencing; each read is a stretch of bases. In bioinformatic analysis, aligning each read to the reference genome reveals the differences between the sequenced sequences and the reference genome, from which variants are found.
Optionally, aligning genome sequences using the bidirectional BWT alignment algorithm may specifically comprise the following steps:
Segmenting the reads according to the pigeonhole principle, each segment allowing 0-2 base mismatches;
Then searching and aligning with the bidirectional BWT alignment algorithm, comprising:
Building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
Using backward search and forward search to look up the position in the reference genome sequence of each read, or each fragment of a read, in both directions, i.e. from right to left and from left to right.
Bidirectional BWT alignment is relatively inefficient when handling many base mismatches. When at most 4 base mismatches are allowed, the reads are segmented according to the pigeonhole principle so that each segment allows 0-2 base mismatches; the bidirectional BWT then only has to handle alignments with at most 2 base mismatches, which greatly improves efficiency.
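The pigeonhole argument can be made concrete with a small sketch. If a read with at most m mismatches is cut into n segments, at least one segment carries no more than floor(m / n) mismatches, so with m = 4 and n = 2 some segment has at most 2. The function names and the near-equal split are assumptions; the text does not say how many segments are used or where the cuts fall.

```python
def split_read(read: str, n_segments: int) -> list:
    """Cut a read into n near-equal segments; returns (start_offset, segment) pairs."""
    base, extra = divmod(len(read), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        segments.append((start, read[start:end]))
        start = end
    return segments


def per_segment_budget(max_mismatches: int, n_segments: int) -> int:
    """Pigeonhole bound: some segment has at most floor(max_mismatches / n_segments) mismatches."""
    return max_mismatches // n_segments


def mismatches_per_segment(read: str, reference: str, n_segments: int) -> list:
    """Count mismatches of each segment against an equal-length reference window."""
    return [
        sum(a != b for a, b in zip(segment, reference[start:start + len(segment)]))
        for start, segment in split_read(read, n_segments)
    ]
```

Searching each segment with the smaller budget and then verifying the full read around any hit is the usual way such a bound is exploited.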
The common alignment software BWA, after building the BWT of the reference sequence together with the corresponding index and SA (suffix array), uses backward search, i.e. it searches for the position on the genome of each read, or each fragment of a read, from right to left. In addition to the conventional BWT index (denoted B), the bidirectional BWT used in this patent also builds a BWT index of the reversed reference sequence (denoted B'). Using B, B' and SA, the positions of reads or seeds on the genome are searched in both directions by backward and forward search, which significantly improves the efficiency of sequence alignment.
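To illustrate the backward search that both BWA and the index B rely on, here is a minimal exact-match sketch over a toy reference. It is not the patent's bidirectional algorithm: it covers only the right-to-left direction, builds the suffix array naively, and allows no mismatches. All names are hypothetical.

```python
def build_fm_index(text: str):
    """Build the suffix array, the C table and the occurrence table of text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive; fine for a toy example
    bwt = "".join(text[i - 1] for i in sa)                 # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))
    counts, total = {}, 0
    for c in alphabet:               # counts[c]: characters in text strictly smaller than c
        counts[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):     # occ[c][i]: occurrences of c in bwt[:i]
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, counts, occ


def backward_search(pattern: str, sa, counts, occ) -> list:
    """Return the sorted start positions of exact matches, scanning the pattern right to left."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in counts:
            return []
        lo = counts[ch] + occ[ch][lo]
        hi = counts[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

For example, `backward_search("ATTA", *build_fm_index("GATTACAGATTA"))` locates both occurrences of the pattern. Extending a match from left to right is what the second index B' over the reversed reference makes possible.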
Step 204: Determine whether the portion of genome sequences contains at least one pair of reads in which only one read is aligned; if so, go to step 208; if not, go to step 205.
Step 205: Using the single-end dynamic programming alignment algorithm (level 2), realign to the reference genome sequence each pair of reads in the portion of genome sequences in which only one read is aligned. That is, for a pair of reads (A, A') in which, after the level-1 bidirectional BWT alignment algorithm, one read (A or A') is aligned to the reference genome sequence but the other (A' or A) is not, alignment continues with the level-2 alignment algorithm.
Optionally, aligning genome sequences using the single-end dynamic programming alignment algorithm may specifically comprise the following steps:
Determining the specific position (pos) in the reference genome sequence to which one read (A or A') of a pair of reads (A, A') is aligned; the reads obtained by paired-end sequencing come in pairs, so if one read (A or A') of a pair (A, A') is aligned at position pos in the reference genome sequence, the theoretical alignment position of the other read (A' or A) lies in a certain region around pos, i.e. the candidate region;
Therefore, selecting a specific range around the specific position (pos) according to a preset position range threshold; the preset position range threshold may be chosen as needed, for example set with reference to the error tolerance; specifically, in paired-end sequencing, when a pair of reads is aligned to the genome, the distance between the two reads plus the lengths of the two reads equals the length of the sequencing fragment, and the position of the candidate region is determined from this principle. For example, if the sequencing fragment is 500 bp and each read is 150 bp, then after alignment to the genome the theoretical distance between the two reads is 200 bp. Because the sequencing fragment length varies, the theoretical distance is roughly 100 bp to 200 bp;
Aligning, within the specific range, the other, unaligned read (A' or A) of the pair of reads using a dynamic programming algorithm.
Step 206: Determine whether the portion of genome sequences contains at least one pair of reads in which neither read is aligned; if so, go to step 208; if not, go to step 207.
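The candidate-region arithmetic used by the level-2 algorithm in step 205 can be sketched as follows. The relation gap = fragment length - 2 x read length comes from the text; the half-open coordinates, the assumption that both reads have the same length, and the `mate_is_downstream` flag are illustrative choices.

```python
def expected_gap(fragment_len: int, read_len: int) -> int:
    """Theoretical distance between the two reads of a pair."""
    return fragment_len - 2 * read_len


def candidate_region(pos: int, read_len: int, min_gap: int, max_gap: int,
                     mate_is_downstream: bool = True) -> tuple:
    """Half-open window [start, end) in which the unaligned mate is expected to lie.

    pos is the leftmost aligned base of the anchored read.
    """
    if mate_is_downstream:
        start = pos + read_len + min_gap
        end = pos + read_len + max_gap + read_len
    else:
        start = pos - max_gap - read_len
        end = pos - min_gap
    return max(start, 0), max(end, 0)
```

With the numbers from the example in the text, `expected_gap(500, 150)` gives 200, and a 100 bp to 200 bp gap downstream of a read anchored at position 1000 yields the window (1250, 1500). Dynamic programming then only has to consider that window rather than the whole genome.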
Step 207: Using the paired-end dynamic programming alignment algorithm (level 3), realign to the reference genome sequence each pair of reads in the portion of genome sequences in which neither read is aligned. That is, for a pair of reads (A, A') in which, after the level-1 bidirectional BWT alignment algorithm and the level-2 single-end dynamic programming alignment algorithm, neither A nor A' is aligned to the reference genome sequence, alignment continues with the level-3 alignment algorithm.
Optionally, aligning genome sequences using the paired-end dynamic programming alignment algorithm may specifically comprise the following steps:
Building seeds (substrings of a read) for each read (A and A') of a pair of reads; specifically, each read of a pair (A, A') is divided into many short segments to build the seeds; when a pair of reads is aligned to the genome, the distance between the two reads lies within a certain range, so the distance between the seeds of the two reads should also lie within a certain range;
Aligning each seed to the reference genome sequence; specifically, the regions where seeds align in pairs (i.e. where the distance between the two seeds meets the requirement) are retrieved to determine the candidate alignment regions for the pair of reads, and the reads are then aligned to the candidate regions with a dynamic programming algorithm;
If, in some region of the reference genome sequence, both reads (A and A') have corresponding seeds aligned, that region is a candidate region for the final alignment position;
Aligning the two reads (A and A') within the candidate region using a dynamic programming algorithm; after the alignment is complete, go to step 208.
Step 208: Determine whether the genome sequence file to be aligned has been completely aligned; if not, return to step 202; if so, go to step 209.
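The seed-pairing idea behind the level-3 algorithm of step 207 can be sketched as a scan over two sorted lists of seed hits. The function name, the single symmetric distance threshold and the (left, right) region representation are assumptions; how seed hits are obtained and how the distance requirement is set are not specified in the text.

```python
def candidate_regions(hits_a: list, hits_b: list, max_distance: int) -> list:
    """Pair seed hits of read A with seed hits of read A' that lie within max_distance.

    hits_a and hits_b are sorted reference positions. Each returned (left, right)
    tuple spans one candidate region for the final dynamic programming alignment.
    """
    regions = []
    first = 0  # index of the first hit in hits_b that can still pair with the current hit of A
    for pos_a in hits_a:
        while first < len(hits_b) and hits_b[first] < pos_a - max_distance:
            first += 1
        k = first
        while k < len(hits_b) and hits_b[k] <= pos_a + max_distance:
            regions.append((min(pos_a, hits_b[k]), max(pos_a, hits_b[k])))
            k += 1
    return regions
```

Seed hits that have no partner within range are discarded, which is how the distance constraint between the two reads prunes spurious seed matches before the more expensive dynamic programming step.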
Step 209: Output the alignment result. Optionally, the output file of genome sequence alignment is a BAM file; BAM is a format for storing genome sequence alignment results, recording the position of each genome sequence in the reference genome sequence and the details of the alignment.
As can be seen from the above embodiment, the genome sequence alignment method provided by the present invention sets up multiple levels of alignment algorithms; after the previous level completes, the sequences that were not aligned continue to be aligned with the next-level algorithm, so that the complexity of the algorithm matches the complexity of the data; each level is optimized, which in turn optimizes the speed of the algorithm as a whole. With the genome sequence alignment method provided by the present invention, given the same resources and the same alignment accuracy, the alignment time for a human whole-genome sequence can be shortened to about 4 hours, a significant reduction compared with the prior art, which improves data analysis efficiency.
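The escalation between the three levels can be summarized in a short control-flow sketch. The three aligners are passed in as placeholder callables, since their real signatures are not given in the text, and the sketch works pair by pair rather than batch by batch as in steps 202 to 208; only the rule for when a pair moves to the next level is taken from the description.

```python
def align_pairs(pairs: dict, level1, level2, level3) -> dict:
    """Run the three alignment levels in turn, escalating only the pairs that need it.

    level1(read) -> position or None                    (stands in for bidirectional BWT)
    level2(read, anchor) -> position or None            (single-end dynamic programming)
    level3(read_a, read_b) -> (position_a, position_b)  (paired-end dynamic programming)
    """
    results = {}
    for pair_id, (read_a, read_b) in pairs.items():
        hit_a, hit_b = level1(read_a), level1(read_b)
        if hit_a is None and hit_b is None:
            # Neither read aligned: hand the whole pair to level 3.
            hit_a, hit_b = level3(read_a, read_b)
        elif hit_a is None:
            # Only A' aligned: rescue A near its mate with level 2.
            hit_a = level2(read_a, anchor=hit_b)
        elif hit_b is None:
            # Only A aligned: rescue A' near its mate with level 2.
            hit_b = level2(read_b, anchor=hit_a)
        results[pair_id] = (hit_a, hit_b)
    return results
```

Most pairs are expected to be settled by the cheap first level, so the costlier dynamic programming levels run only on the remainder.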
Based on the above object, a second aspect of the embodiments of the present invention provides one embodiment of a genomic data storage device, which can solve the problem of low efficiency caused by the need to frequently input and output large numbers of binary files during genome variant detection. Fig. 3 is a schematic structural diagram of one embodiment of the genomic data storage device provided by the present invention.
The genomic data storage device comprises:
A creation module 301, for obtaining gene sequence alignment information and creating gene sequence statistical information during genome alignment;
An alignment information storage module 302, for storing the gene sequence alignment information on disk and storing, in memory, a corresponding index ordered by the alignment position of the gene sequence alignment information on the genome; the index is the storage location of the gene sequence alignment information on disk;
A statistical information classification module 303, for classifying the genome statistical information to obtain first statistical information and second statistical information;
A statistical information storage module 304, for storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency; and for storing the second statistical information on disk, the second statistical information being statistical information that cannot be held in memory and/or whose access frequency during variant detection is lower than the preset frequency.
In some optional embodiments, the first statistical information comprises base weighted quality value statistical information, forward/reverse strand statistical information, insertion/deletion statistical information and soft clipping statistical information.
In some optional embodiments, for a site at which no insertion/deletion or soft clipping occurs and at most 2 base types occur, the first statistical information of the site is stored using a first data structure;
The first data structure comprises:
A first header for representing the base type;
A first quality value storage part for representing the base weighted quality value;
A first forward strand count storage part for representing the number of forward strands;
A first reverse strand count storage part for representing the number of reverse strands.
In some optional embodiments, for a site at which an insertion/deletion occurs and 3-4 base types occur, the first statistical information of the site is stored using the first data structure and a second data structure;
The second data structure comprises:
Base weighted quality value statistical information and forward/reverse strand statistical information for each of the 4 base types; the storage structure for each base type specifically comprises: a second quality value storage part for representing the base weighted quality value, a second forward strand count storage part for representing the number of forward strands, and a second reverse strand count storage part for representing the number of reverse strands;
First insertion statistical information, specifically comprising: a first insertion sequence storage part for representing the inserted sequence, and a first low-quality insertion count storage part for representing the number of low-quality insertions;
First deletion statistical information, specifically comprising: a first deletion length storage part for representing the deletion length, a first high-quality deletion count storage part for representing the number of high-quality deletions, and a first low-quality deletion count storage part for representing the number of low-quality deletions;
The first data structure comprises:
A second header filled with 11;
A first insertion information storage part for indicating whether an insertion exists, specifically comprising: a first insertion information sub-storage part for indicating whether an insertion exists, an insertion length sub-storage part for representing the insertion length, and a low-quality insertion count storage part for representing the number of low-quality insertions;
A first deletion information storage part for indicating whether a deletion exists;
A pointer to the storage location of the corresponding second data structure.
In some optional embodiments, for a site that has more than one insertion/deletion, or an insertion longer than 12 bases, the first statistical information of the site is stored using the first data structure and a third data structure, and a memory pool is created in memory to store the first statistical information of such sites;
The third data structure comprises:
Base weighted quality value statistical information and forward/reverse strand statistical information for each of the 4 base types; the storage structure for each base type specifically comprises: a third quality value storage part for representing the base weighted quality value, a third forward strand count storage part for representing the number of forward strands, and a third reverse strand count storage part for representing the number of reverse strands;
Second insertion statistical information, specifically comprising: an insertion length storage part for representing the insertion length, a second insertion sequence storage part for representing the inserted sequence, a second low-quality insertion count storage part for representing the number of low-quality insertions, and a high-quality insertion count storage part for representing the number of high-quality insertions;
Second deletion statistical information, specifically comprising: a second deletion length storage part for representing the deletion length, a second high-quality deletion count storage part for representing the number of high-quality deletions, and a second low-quality deletion count storage part for representing the number of low-quality deletions;
The first data structure comprises:
A third header filled with 11;
A second insertion information storage part for indicating whether an insertion exists, specifically comprising: a second insertion information sub-storage part for indicating whether an insertion exists, a first memory pool information sub-storage part for indicating whether the memory pool is used, and a first occupied length sub-storage part for representing the length occupied in the memory pool;
A second deletion information storage part for indicating whether a deletion exists, specifically comprising: a second deletion information sub-storage part for indicating whether a deletion exists, a second memory pool information sub-storage part for indicating whether the memory pool is used, and a second occupied length sub-storage part for representing the length occupied in the memory pool.
In some optional embodiments, the soft clipping statistical information is recorded in a dynamic array, each record comprising:
A soft clipping position storage part for representing the position of the soft clipping on the genome;
A soft clipping left count storage part for representing the number of times soft clipping occurs on the left side of the corresponding site;
A soft clipping right count storage part for representing the number of times soft clipping occurs on the right side of the corresponding site.
In some optional embodiments, the index comprises a paired-end alignment information index and a single-end alignment information index;
The paired-end alignment information index is stored using a paired-end alignment array structure, which comprises:
A first ID storage part for representing the ID of the gene sequence;
A first aligned position storage part for representing the position on the genome to which the gene sequence is aligned;
An insert size storage part for representing the insert size of the gene sequence;
A first mapping quality storage part for representing the mapping quality of the gene sequence;
A first average quality value storage part for representing the average base quality of the gene sequence;
The single-end alignment information index is stored using a single-end alignment array structure, which comprises:
A second ID storage part for representing the ID of the gene sequence;
A second aligned position storage part for representing the position on the genome to which the gene sequence is aligned;
A second mapping quality storage part for representing the mapping quality of the gene sequence;
A second average quality value storage part for representing the average base quality of the gene sequence;
Wherein, for every aligned gene sequence, the corresponding index entries are arranged in order of the alignment position of the gene sequence on the genome.
In some optional embodiments, storing the gene sequence alignment information on disk specifically comprises:
All gene sequence alignment information is divided into 512 files stored on disk, each file storing the gene sequence alignment information of a certain genomic interval; the data storage structure of each piece of gene sequence alignment information comprises:
A sequence length storage part for representing the sequence length of the gene sequence;
A sequence storage part for representing the gene sequence itself;
A quality value storage part for representing the quality values of the gene sequence;
A start position storage part for representing the position at which the alignment algorithm starts when the gene sequence is aligned;
A strand storage part for representing the forward/reverse strand information of the gene sequence when aligned;
A region length storage part for representing the length of the genomic region selected when the gene sequence is aligned;
A left position storage part for representing the left anchor position of the gene sequence when aligned;
A right position storage part for representing the right anchor position of the gene sequence when aligned.
Based on the above object, a third aspect of the embodiments of the present invention provides one embodiment of a device that performs the genomic data storage method. Fig. 4 is a schematic diagram of the hardware structure of one embodiment of the device that performs the genomic data storage method provided by the present invention.
As shown in Fig. 4, the device comprises:
One or more processors 401 and a memory 402; one processor 401 is taken as an example in Fig. 4.
The device that performs the genomic data storage method may further comprise an input device 403 and an output device 404.
The processor 401, memory 402, input device 403 and output device 404 may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 4.
Memory 402, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the genomic data storage method in the embodiments of the present application (for example, the creation module 301, alignment information storage module 302, statistical information classification module 303 and statistical information storage module 304 shown in Fig. 3). By running the non-volatile software programs, instructions and modules stored in memory 402, processor 401 executes the various functional applications and data processing of the server, i.e. implements the genomic data storage method of the above method embodiments.
Memory 402 may include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created through use of the genomic data storage device, and so on. In addition, memory 402 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, memory 402 optionally includes memory located remotely from processor 401, and such remote memory may be connected over a network to the member user behavior monitoring device. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
Input unit 403 can receive the numeral or character information of input, and produce and genomic data storage device
The key signals input that user is set and function control is relevant.Output device 404 may include the display devices such as display screen.
The one or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the genomic data storage method of any of the above method embodiments. The embodiment of the device that performs the genomic data storage method has technical effects identical or similar to those of any of the foregoing method embodiments.
The embodiments of the present application further provide a non-transitory computer storage medium storing computer-executable instructions, where the computer-executable instructions can perform the list item operation processing method of any of the above method embodiments. The embodiment of the non-transitory computer storage medium has technical effects identical or similar to those of any of the foregoing method embodiments.
Finally, it should be noted that a person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like. The embodiment of the computer program has technical effects identical or similar to those of any of the foregoing method embodiments.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Under the idea of the present invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above which, for brevity, are not provided in detail.
In addition, to simplify the description and discussion, and so as not to obscure the invention, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form to avoid obscuring the invention, which also takes into account the fact that the details of implementing these block diagram devices depend heavily on the platform on which the invention is to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the invention, it will be apparent to those skilled in the art that the invention can be practiced without these specific details or with variations of them. Accordingly, these descriptions are to be regarded as illustrative rather than restrictive.
Although the invention has been described in conjunction with specific embodiments, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement and the like made within the spirit and principles of the invention shall be included within the scope of protection of the invention.
Claims (10)
- 1. A genomic data storage method, characterized by comprising: during alignment, obtaining gene sequence alignment information and creating gene sequence statistical information; storing the gene sequence alignment information on disk, and storing in memory an index corresponding to the alignment position of the gene sequence alignment information on the genome, the index being the storage location of the gene sequence alignment information on disk; classifying the genome statistical information to obtain first statistical information and second statistical information; storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency; and storing the second statistical information on disk, the second statistical information being statistical information that cannot be stored in memory and/or whose access frequency during variant detection is lower than the preset frequency.
- 2. The method according to claim 1, characterized in that the first statistical information includes base weighted quality value statistical information, positive/negative strand statistical information, insertion/deletion statistical information and soft-clipping statistical information.
- 3. The method according to claim 2, characterized in that, for a site at which no insertion/deletion and no soft clipping occurs and at which at most 2 base types occur, the first statistical information of the site is stored using a first data structure; the first data structure comprises: a first header for representing the base type; a first quality value storage part for representing the base weighted quality value; a first positive strand count storage part for representing the number of positive strands; and a first negative strand count storage part for representing the number of negative strands.
- 4. The method according to claim 2, characterized in that, for a site at which an insertion/deletion occurs and at which 3 to 4 base types have occurred, the first statistical information of the site is stored using the first data structure and a second data structure; the second data structure comprises: base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types, where the storage structure for each base type specifically comprises a second quality value storage part for representing the base weighted quality value, a second positive strand count storage part for representing the number of positive strands, and a second negative strand count storage part for representing the number of negative strands; first insertion statistical information, specifically comprising a first insertion sequence storage part for representing the insertion sequence and a first low-quality insertion count storage part for representing the number of low-quality insertions; and first deletion statistical information, specifically comprising a first deletion length storage part for representing the deletion length, a first high-quality deletion count storage part for representing the number of high-quality deletions, and a first low-quality deletion count storage part for representing the number of low-quality deletions; the first data structure comprises: a second header filled with 11; a first insertion information storage part for indicating whether an insertion exists, specifically comprising a first insertion information sub-storage part for indicating whether an insertion exists, an insertion length sub-storage part for representing the insertion length, and a low-quality insertion count storage part for representing the number of low-quality insertions; a first deletion information storage part for indicating whether a deletion exists, specifically comprising a first deletion information sub-storage part for indicating whether a deletion exists; and a pointer to the storage location of the corresponding second data structure.
- 5. The method according to claim 2, characterized in that, for a site at which more than one insertion/deletion occurs or at which the insertion length is greater than 12 bases, the first statistical information of the site is stored using the first data structure and a third data structure, and a memory pool is created in memory to store the first statistical information of such sites; the third data structure comprises: base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types, where the storage structure for each base type specifically comprises a third quality value storage part for representing the base weighted quality value, a third positive strand count storage part for representing the number of positive strands, and a third negative strand count storage part for representing the number of negative strands; second insertion statistical information, specifically comprising an insertion length storage part for representing the insertion length, a second insertion sequence storage part for representing the insertion sequence, a second low-quality insertion count storage part for representing the number of low-quality insertions, and a high-quality insertion count storage part for representing the number of high-quality insertions; and second deletion statistical information, specifically comprising a second deletion length storage part for representing the deletion length, a second high-quality deletion count storage part for representing the number of high-quality deletions, and a second low-quality deletion count storage part for representing the number of low-quality deletions; the first data structure comprises: a third header filled with 11; a second insertion information storage part for indicating whether an insertion exists, specifically comprising a second insertion information sub-storage part for indicating whether an insertion exists, a first memory pool information sub-storage part for indicating whether the memory pool is used, and a first occupied length sub-storage part for representing the length occupied in the memory pool; and a second deletion information storage part for indicating whether a deletion exists, specifically comprising a second deletion information sub-storage part for indicating whether a deletion exists, a second memory pool information sub-storage part for indicating whether the memory pool is used, and a second occupied length sub-storage part for representing the length occupied in the memory pool.
- 6. The method according to claim 2, characterized in that the soft-clipping statistical information is recorded using a dynamic array, each record comprising: a soft-clip position storage part for representing the position of the soft clipping on the genome; a soft-clip left count storage part for representing the number of times soft clipping occurs on the left of the corresponding site; and a soft-clip right count storage part for representing the number of times soft clipping occurs on the right of the corresponding site.
- 7. The method according to claim 1, characterized in that the index includes a paired-end alignment information index and a single-end alignment information index; the paired-end alignment information index is stored using a paired-end alignment array structure, which comprises: a first ID storage part for representing the ID of the gene sequence; a first alignment position storage part for representing the position on the genome to which the gene sequence is aligned; an insert fragment length storage part for representing the insert fragment length of the gene sequence; a first alignment quality value storage part for representing the alignment quality value of the gene sequence; and a first average quality value storage part for representing the average quality value of the gene sequence; the single-end alignment information index is stored using a single-end alignment array structure, which comprises: a second ID storage part for representing the ID of the gene sequence; a second alignment position storage part for representing the position on the genome to which the gene sequence is aligned; a second alignment quality value storage part for representing the alignment quality value of the gene sequence; and a second average quality value storage part for representing the average quality value of the gene sequence; wherein, for each gene sequence used for alignment, the corresponding indexes are arranged in order according to the alignment position of the gene sequence on the genome.
- 8. The method according to claim 1, characterized in that storing the gene sequence alignment information on disk specifically comprises: dividing all gene sequence alignment information into 512 files stored on disk, each file storing the gene sequence alignment information of a certain genomic interval, where the data storage structure of each piece of gene sequence alignment information comprises: a sequence length storage part for representing the sequence length of the gene sequence; a sequence storage part for representing the gene sequence itself; a quality value storage part for representing the quality values of the gene sequence; a start position storage part for representing the start position used by the alignment algorithm when the gene sequence is aligned; a positive/negative strand storage part for representing the strand information of the gene sequence when aligned; a region length storage part for representing the length of the genomic region selected when the gene sequence is aligned; a left position storage part for representing the left anchor position of the gene sequence when aligned; and a right position storage part for representing the right anchor position of the gene sequence when aligned.
- 9. The method according to any one of claims 1-8, characterized by further comprising: during deduplication, subtracting from the genome statistical information the interference caused by duplicate sequences; and/or, during realignment, extracting the gene sequences of the realignment region of the genome and, after realigning the gene sequences of the realignment region, adjusting the genome statistical information of the gene sequences of the realignment region.
- 10. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-9.
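The storage split described in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `AlignmentStore` and `classify_statistics`, the length-prefixed record layout, the byte-based memory budget and the treatment of an access frequency equal to the preset value (sent to disk here) are all assumptions made for illustration. The sketch keeps alignment records in a file on disk, keeps a position-sorted index of their disk offsets in memory, and splits statistics into an in-memory tier and an on-disk tier.

```python
import bisect
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class AlignmentStore:
    """Alignment records live on disk; a position-sorted index lives in memory."""
    path: str
    positions: list = field(default_factory=list)  # alignment positions, kept sorted
    offsets: list = field(default_factory=list)    # disk offset of each indexed record

    def add(self, position: int, record: bytes) -> None:
        # Append the record to the disk file and note the offset where it starts.
        with open(self.path, "ab") as fh:
            fh.seek(0, os.SEEK_END)
            offset = fh.tell()
            # Hypothetical layout: 4-byte little-endian length, then the payload.
            fh.write(len(record).to_bytes(4, "little") + record)
        # Insert into the in-memory index so it stays ordered by position.
        i = bisect.bisect_right(self.positions, position)
        self.positions.insert(i, position)
        self.offsets.insert(i, offset)

    def fetch(self, start: int, end: int) -> list:
        # Return every record whose alignment position lies in [start, end].
        lo = bisect.bisect_left(self.positions, start)
        hi = bisect.bisect_right(self.positions, end)
        if lo == hi:
            return []
        records = []
        with open(self.path, "rb") as fh:
            for offset in self.offsets[lo:hi]:
                fh.seek(offset)
                size = int.from_bytes(fh.read(4), "little")
                records.append(fh.read(size))
        return records


def classify_statistics(stats, access_freq, preset_freq, memory_budget):
    """Split statistics into (in_memory, on_disk) dictionaries.

    A statistic goes to memory only if its access frequency is above the
    preset frequency and it still fits in the assumed byte budget.
    """
    in_memory, on_disk, used = {}, {}, 0
    for name, value in stats.items():
        hot = access_freq.get(name, 0) > preset_freq
        if hot and used + len(value) <= memory_budget:
            in_memory[name] = value
            used += len(value)
        else:
            on_disk[name] = value
    return in_memory, on_disk


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    store = AlignmentStore(os.path.join(workdir, "alignments.bin"))
    store.add(300, b"read-c")
    store.add(100, b"read-a")
    print(store.fetch(0, 150))
```

Keeping only positions and offsets in memory means a range query touches the disk once per matching record, which is the trade-off the claim describes: the bulky alignment data stays on disk while the small, frequently consulted data stays in memory.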
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710546293.7A CN107480466B (en) | 2017-07-06 | 2017-07-06 | Genome data storage method and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710546293.7A CN107480466B (en) | 2017-07-06 | 2017-07-06 | Genome data storage method and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107480466A (en) | 2017-12-15 |
CN107480466B (en) | 2020-08-11 |
Family
ID=60595629
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710546293.7A Active CN107480466B (en) | 2017-07-06 | 2017-07-06 | Genome data storage method and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480466B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108197433A (en) * | 2017-12-29 | 2018-06-22 | 厦门极元科技有限公司 | Datarams and hard disk the shunting storage method of rapid DNA sequencing data analysis platform |
CN108920902A (en) * | 2018-06-29 | 2018-11-30 | 郑州云海信息技术有限公司 | A kind of gene order processing method and its relevant device |
CN110879782A (en) * | 2019-11-08 | 2020-03-13 | 浪潮电子信息产业股份有限公司 | Method, device, equipment and medium for testing gene comparison software |
CN111081314A (en) * | 2019-12-13 | 2020-04-28 | 北京市商汤科技开发有限公司 | Method and apparatus for identifying genetic variation, electronic device, and storage medium |
WO2022082878A1 (en) * | 2020-10-22 | 2022-04-28 | 深圳华大基因股份有限公司 | Shared memory-based gene analysis method and apparatus, and computer device |
CN115602246A (en) * | 2022-10-31 | 2023-01-13 | 哈尔滨工业大学(Cn) | Sequence comparison method based on group genome |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103201744A (en) * | 2010-10-13 | 2013-07-10 | 考利达基因组股份有限公司 | Methods for estimating genome-wide copy number variations |
CN104361264A (en) * | 2014-12-11 | 2015-02-18 | 天津工业大学 | Quick counting method for quantity of nucleic acid fragments of genome |
CN106202991A (en) * | 2016-06-30 | 2016-12-07 | 厦门艾德生物医药科技股份有限公司 | The detection method of abrupt information in a kind of genome multiplex amplification order-checking product |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108197433A (en) * | 2017-12-29 | 2018-06-22 | 厦门极元科技有限公司 | Datarams and hard disk the shunting storage method of rapid DNA sequencing data analysis platform |
CN108920902A (en) * | 2018-06-29 | 2018-11-30 | 郑州云海信息技术有限公司 | A kind of gene order processing method and its relevant device |
CN110879782A (en) * | 2019-11-08 | 2020-03-13 | 浪潮电子信息产业股份有限公司 | Method, device, equipment and medium for testing gene comparison software |
CN110879782B (en) * | 2019-11-08 | 2022-06-17 | 浪潮电子信息产业股份有限公司 | Method, device, equipment and medium for testing gene comparison software |
CN111081314A (en) * | 2019-12-13 | 2020-04-28 | 北京市商汤科技开发有限公司 | Method and apparatus for identifying genetic variation, electronic device, and storage medium |
WO2022082878A1 (en) * | 2020-10-22 | 2022-04-28 | 深圳华大基因股份有限公司 | Shared memory-based gene analysis method and apparatus, and computer device |
CN115602246A (en) * | 2022-10-31 | 2023-01-13 | 哈尔滨工业大学(Cn) | Sequence comparison method based on group genome |
Also Published As
Publication number | Publication date |
---|---|
CN107480466B (en) | 2020-08-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: 1002-1, 10th floor, No.56, Beisihuan West Road, Haidian District, Beijing 100080 Patentee after: Ronglian Technology Group Co., Ltd Address before: 100080, Beijing, Haidian District, No. 56 West Fourth Ring Road, glorious Times Building, 10, 1002-1 Patentee before: UNITED ELECTRONICS Co.,Ltd. |