CN107480466A

CN107480466A - Genomic data storage method and electronic equipment

Info

Publication number: CN107480466A
Application number: CN201710546293.7A
Authority: CN
Inventors: 蔡文君; 何光铸; 王东辉; 孔令雪
Original assignee: UNITED ELECTRONICS CO Ltd
Current assignee: Ronglian Technology Group Co., Ltd
Priority date: 2017-07-06
Filing date: 2017-07-06
Publication date: 2017-12-15
Anticipated expiration: 2037-07-06
Also published as: CN107480466B

Abstract

The invention discloses a genomic data storage method, including: during genome alignment, obtaining gene sequence alignment information and creating gene sequence statistical information; storing the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, storing a corresponding index in memory, the index being the storage location of the gene sequence alignment information on disk; classifying the genome statistical information to obtain first statistical information and second statistical information; storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency; and storing the second statistical information on disk, the second statistical information being statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency. The invention also discloses electronic equipment that uses the genomic data storage method.

Description

Genomic data storage method and electronic equipment

Technical field

The present invention relates to the technical field of data processing, and in particular to a genomic data storage method and electronic equipment.

Background art

The genome variant detection workflow can generally be divided into steps such as alignment, sorting, deduplication, realignment, variant detection, and filtering. The main steps write BAM files to the hard disk as output (SAM stands for Sequence Alignment/Map; a BAM file is the binary-format version of a SAM file, the B standing for binary), and the next step reads the file from the hard disk back into memory for further processing.

In the course of realizing the present invention, the inventors found the following problems in the prior art:

In human whole-genome data analysis, the raw data is typically around 100 GB, and each of the main intermediate analysis steps needs to read and write files of hundreds of GB. The whole computation consumes a large amount of I/O resources and the programs run inefficiently.

The inventors found that the main causes of this problem are:

1. The intermediate files are too large to be placed directly in memory.

64 GB of memory is a typical machine configuration for routine bioinformatics analysis. For human whole-genome analysis data, the intermediate results are around 100 GB and generally cannot be held directly in memory. Moreover, the variant detection process itself needs to load the reference sequence and index files into memory, which further reduces the space available for intermediate results.

2. The format of the intermediate files cannot be used directly for computation.

The usual intermediate file format is SAM/BAM, a row-record format in which each line stores one record; placing it directly in memory does not make it directly usable for computation. The data required for variant detection is mainly statistical information about the alignments at each site, including the count distribution of each base type at each site, the insertion/deletion (InDel) sequences and their frequencies, and the soft-clipping (soft clipping) sequences in the alignments.

Summary of the invention

In view of this, the object of the present invention is to provide a genomic data storage method and electronic equipment that solve the low efficiency caused by the need to frequently input and output large binary files during genome variant detection.

Based on the above object, the genomic data storage method provided by the present invention includes:

During alignment, gene sequence alignment information is obtained and gene sequence statistical information is created;

The gene sequence alignment information is stored on disk, and a corresponding index is stored in memory according to the alignment position of the gene sequence alignment information on the genome; the index is the storage location of the gene sequence alignment information on disk;

The genome statistical information is classified to obtain first statistical information and second statistical information;

The first statistical information is stored in memory; the first statistical information is statistical information whose access frequency during variant detection is higher than a preset frequency;

The second statistical information is stored on disk; the second statistical information is statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency.

Optionally, the first statistical information includes base weighted quality value statistical information, positive/negative strand statistical information, insertion/deletion statistical information, and soft-clipping statistical information.

Optionally, for a site at which no insertion/deletion or soft clipping occurs and at which at most 2 base types occur, the first statistical information of the site is stored using a first data structure;

The first data structure includes:

A first head for representing the base type;

A first quality value storage part for representing the base weighted quality value;

A first positive-strand count storage part for representing the number of positive strands;

A first negative-strand count storage part for representing the number of negative strands.

Optionally, for a site at which an insertion/deletion occurs and at which 3 to 4 base types occur, the first statistical information of the site is stored using the first data structure and a second data structure;

The second data structure includes:

Base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types; the storage structure of each base type's base weighted quality value statistical information and positive/negative strand statistical information specifically includes: a second quality value storage part for representing the base weighted quality value, a second positive-strand count storage part for representing the number of positive strands, and a second negative-strand count storage part for representing the number of negative strands;

First insertion statistical information, specifically including: a first insertion sequence storage part for representing the inserted sequence, and a first low-quality insertion count storage part for representing the number of low-quality insertions;

First deletion statistical information, specifically including: a first deletion length storage part for representing the deletion length, a first high-quality deletion count storage part for representing the number of high-quality deletions, and a first low-quality deletion count storage part for representing the number of low-quality deletions;

The first data structure includes:

A second head filled with 11;

A first insertion information storage part for indicating whether an insertion exists, specifically including: a first insertion information sub-storage part for indicating whether an insertion exists, an insertion length sub-storage part for representing the insertion length, and a low-quality insertion count storage part for representing the number of low-quality insertions;

A first deletion information storage part for indicating whether a deletion exists, specifically including: a first deletion information sub-storage part for indicating whether a deletion exists;

A pointer to the storage location of the corresponding second data structure.

Optionally, for a site at which more than 1 insertion/deletion occurs or the insertion length is greater than 12 bases, the first statistical information of the site is stored using the first data structure and a third data structure, and a memory pool is created in memory to store the first statistical information of such sites;

The third data structure includes:

Base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types; the storage structure of each base type's base weighted quality value statistical information and positive/negative strand statistical information specifically includes: a third quality value storage part for representing the base weighted quality value, a third positive-strand count storage part for representing the number of positive strands, and a third negative-strand count storage part for representing the number of negative strands;

Second insertion statistical information, specifically including: an insertion length storage part for representing the insertion length, a second insertion sequence storage part for representing the inserted sequence, a second low-quality insertion count storage part for representing the number of low-quality insertions, and a high-quality insertion count storage part for representing the number of high-quality insertions;

Second deletion statistical information, specifically including: a second deletion length storage part for representing the deletion length, a second high-quality deletion count storage part for representing the number of high-quality deletions, and a second low-quality deletion count storage part for representing the number of low-quality deletions;

The first data structure includes:

A third head filled with 11;

A second insertion information storage part for indicating whether an insertion exists, specifically including: a second insertion information sub-storage part for indicating whether an insertion exists, a first memory pool information sub-storage part for indicating whether the memory pool is used, and a first occupied-length sub-storage part for representing the length occupied in the memory pool;

A second deletion information storage part for indicating whether a deletion exists, specifically including: a second deletion information sub-storage part for indicating whether a deletion exists, a second memory pool information sub-storage part for indicating whether the memory pool is used, and a second occupied-length sub-storage part for representing the length occupied in the memory pool.

Optionally, the soft-clipping statistical information is recorded using a dynamic array, and each record includes:

A soft-clip position storage part for representing the position on the genome at which the soft clipping occurs;

A soft-clip left count storage part for representing the number of times soft clipping occurs on the left side of the corresponding site;

A soft-clip right count storage part for representing the number of times soft clipping occurs on the right side of the corresponding site.

Optionally, the index includes a paired-end alignment information index and a single-end alignment information index;

The paired-end alignment information index is stored using a paired-end alignment array structure, which includes:

A first ID storage part for representing the ID of the gene sequence;

A first alignment position storage part for representing the position on the genome to which the gene sequence is aligned;

An insert fragment length storage part for representing the insert fragment length of the gene sequence;

A first alignment quality value storage part for representing the alignment quality value of the gene sequence;

A first average quality value storage part for representing the average quality value of the gene sequence;

The single-end alignment information index is stored using a single-end alignment array structure, which includes:

A second ID storage part for representing the ID of the gene sequence;

A second alignment position storage part for representing the position on the genome to which the gene sequence is aligned;

A second alignment quality value storage part for representing the alignment quality value of the gene sequence;

A second average quality value storage part for representing the average quality value of the gene sequence;

Wherein, for each gene sequence used for alignment, the corresponding indexes are arranged in order according to the alignment position of the gene sequence on the genome.

Optionally, storing the gene sequence alignment information on disk specifically includes:

All gene sequence alignment information is divided into 512 files and stored on disk; each file stores the gene sequence alignment information of a certain genome interval, and the data storage structure of each piece of gene sequence alignment information includes:

A sequence length storage part for representing the sequence length of the gene sequence;

A sequence storage part for representing the gene sequence itself;

A quality value storage part for representing the quality values of the gene sequence;

A start position storage part for representing the position at which the alignment algorithm starts when the gene sequence is aligned;

A positive/negative strand storage part for representing the positive/negative strand information of the gene sequence when aligned;

A region length storage part for representing the length of the genomic region selected when the gene sequence is aligned;

A left position storage part for representing the position anchored on the left side when the gene sequence is aligned;

A right position storage part for representing the position anchored on the right side when the gene sequence is aligned.

Optionally, the method further includes:

During deduplication, the interference caused by duplicate sequences is subtracted from the genome statistical information;

And/or

During realignment, the gene sequences of the realignment region of the genome are extracted, and after the gene sequences of the realignment region are realigned, the genome statistical information of the gene sequences of the realignment region is adjusted.

As can be seen from the above, the genomic data storage method and electronic equipment provided by the present invention use a carefully designed data storage structure that is tailored to the characteristics of the intermediate files in the overall variant detection process. Some of the main intermediate data is kept in memory and can be called directly from memory, so that the steps of the overall variant detection process do not need to perform large amounts of disk I/O, which significantly improves the efficiency of the whole variant detection and analysis workflow.

Brief description of the drawings

Fig. 1 is a schematic flowchart of one embodiment of the genomic data storage method provided by the present invention;

Fig. 1a is a schematic diagram of the first data structure for a site at which no insertion/deletion or soft clipping occurs and at which at most 2 base types occur;

Fig. 1b is a schematic diagram of the second data structure for a site at which an insertion/deletion (InDel) occurs and at which 3 to 4 base types occur;

Fig. 1c is a schematic diagram of the first data structure for a site at which an insertion/deletion (InDel) occurs and at which 3 to 4 base types occur;

Fig. 1d is a schematic diagram of the third data structure for a site at which more than 1 insertion/deletion occurs or the insertion length is greater than 12 bases;

Fig. 1e is a schematic diagram of the first data structure for a site at which more than 1 insertion/deletion occurs or the insertion length is greater than 12 bases;

Fig. 1f is a schematic diagram of the dynamic array used to record the soft-clipping statistical information;

Fig. 1g is a schematic diagram of the index;

Fig. 1h is a schematic diagram of the data storage structure of each piece of gene sequence alignment information;

Fig. 2 is a schematic flowchart of one embodiment of the genome sequence alignment method provided by the present invention;

Fig. 3 is a schematic structural diagram of one embodiment of the genomic data storage device provided by the present invention;

Fig. 4 is a schematic structural diagram of one embodiment of the electronic equipment provided by the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two non-identical entities or non-identical parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; subsequent embodiments do not repeat this explanation.

Based on the above object, one aspect of the embodiments of the present invention proposes an embodiment of a genomic data storage method that solves the low efficiency caused by the need to frequently input and output large binary files during genome variant detection. Fig. 1 is a schematic flowchart of one embodiment of the genomic data storage method provided by the present invention.

The genomic data storage method includes:

Step 101: During alignment, gene sequence alignment information is obtained and gene sequence statistical information is created; the gene sequence alignment information is the gene sequence alignment result information generated during genome alignment, and the gene sequence statistical information can be extracted from this alignment result information;

Step 102: The gene sequence alignment information is stored on disk, and a corresponding index is stored in memory according to the alignment position of the gene sequence alignment information on the genome; the index is the storage location of the gene sequence alignment information on disk;

Step 103: The gene sequence statistical information is classified to obtain first statistical information and second statistical information;

Step 104: The first statistical information is stored in memory; the first statistical information is statistical information whose access frequency during variant detection is higher than a preset frequency;

Step 105: The second statistical information is stored on disk; the second statistical information is statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency.
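The classification in steps 103 to 105 can be illustrated with a minimal Python sketch. The names `StatBlock`, `classify`, `preset_freq`, and `memory_budget` are hypothetical and do not appear in the source, and the source does not say how access frequency is measured or how the case of equal frequency is handled; here a block is kept in memory only when its frequency is strictly higher than the preset frequency and it still fits in the remaining budget.

```python
from dataclasses import dataclass


@dataclass
class StatBlock:
    name: str           # hypothetical label for a group of statistics
    size_bytes: int     # memory the block would occupy
    access_freq: float  # estimated access frequency during variant detection


def classify(blocks, preset_freq, memory_budget):
    """Split blocks into (first, second): first stays in memory, second goes to disk."""
    first, second = [], []
    used = 0
    # Consider the most frequently accessed blocks first.
    for block in sorted(blocks, key=lambda b: b.access_freq, reverse=True):
        hot = block.access_freq > preset_freq            # assumption: strictly higher
        fits = used + block.size_bytes <= memory_budget  # "can be stored in memory"
        if hot and fits:
            first.append(block)
            used += block.size_bytes
        else:
            second.append(block)
    return first, second
```

In this sketch a block lands in the second group either because it is rarely accessed or because it no longer fits in memory, which mirrors the "and/or" wording of step 105.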

As can be seen from the above embodiment, the genomic data storage method provided by the embodiment of the present invention uses a carefully designed data storage structure that is tailored to the characteristics of the intermediate files in the overall variant detection process (including steps such as alignment, sorting, deduplication, realignment, variant detection, and filtering). Some of the main intermediate data is kept in memory and can be called directly from memory, so that the steps of the overall variant detection process do not need to perform large amounts of disk I/O, which significantly improves the efficiency of the whole variant detection and analysis workflow.

In some optional embodiments, the first statistical information includes base weighted quality value statistical information, positive/negative strand statistical information, insertion/deletion statistical information, and soft-clipping statistical information, specifically:

The base weighted quality value statistical information (Weighted Count):

Each aligned base has a quality value between 0 and 40; the weight assigned to a base aligned to the reference gene sequence is shown in the following table:

Base Quality Scores	Parameter*	Weight
0–10	[0–Weight0]	0
11–13	(Weight0–Weight1)	1
14–17	(Weight1–Weight2)	2
18–20	(Weight2–Weight3)	3
21–40	(Weight3–40)	4

The weights of all identical bases aligned to the same position are added to obtain the quality-value weight sum for that base type;

The positive/negative strand statistical information (Strand Count): counts of the gene sequences aligned to the same position in the forward and reverse directions;

The insertion/deletion statistical information and insertion sequence information (InDel Count): the insertion/deletion sequences at a given genome position in the aligned gene sequences and the cumulative number of times each occurs;

The soft-clipping statistical information (Soft Clip Count): the number of times soft clipping (soft clip) occurs at a given genome position in the aligned gene sequences.
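The weighted count described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the default thresholds for Weight0 to Weight3 are simply read off the score ranges in the table.

```python
from bisect import bisect_left

# Upper bounds of the first four quality bands (Weight0..Weight3 in the table).
# The defaults are taken from the table's score ranges and are an assumption.
DEFAULT_BOUNDS = (10, 13, 17, 20)


def base_weight(quality, bounds=DEFAULT_BOUNDS):
    """Map a base quality score in 0..40 to a weight in 0..4."""
    if not 0 <= quality <= 40:
        raise ValueError("quality score must be between 0 and 40")
    # bisect_left counts how many bounds lie strictly below the score.
    return bisect_left(bounds, quality)


def weighted_counts(observations):
    """Sum the weights per base type for (base, quality) pairs aligned to one site."""
    totals = {}
    for base, quality in observations:
        totals[base] = totals.get(base, 0) + base_weight(quality)
    return totals
```

For example, a site covered by an A with quality 30, an A with quality 12, and a C with quality 5 yields a weighted count of 5 for A and 0 for C.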

In some optional embodiments, consider the simplest case: for a site at which no insertion/deletion or soft clipping occurs and at which at most 2 base types occur, the first statistical information of the site is stored using the first data structure. Optionally, the first data structure is an 8-byte data structure, Counter, which holds the information of one site; the whole human genome contains about 3G sites, so about 24 GB of memory is required;

The first data structure, as shown in Fig. 1a, stores the statistical information of each of the two bases (base1 information and base2 information) in the same 4-byte layout, including:

A first head for representing the base type; optionally, the first head (base) uses 2 bits to represent the base type, with bases A, C, G, and T represented by 00, 01, 10, and 11 respectively;

A first quality value storage part for representing the base weighted quality value; optionally, the first quality value storage part (weighted count) uses 14 bits to represent the weighted quality value, with a maximum of 16383;

A first positive-strand count storage part for representing the number of positive strands; optionally, the first positive-strand count storage part (+ve strand count) uses 1 byte (8 bits) to represent the number of positive strands, with a maximum of 255;

A first negative-strand count storage part for representing the number of negative strands; optionally, the first negative-strand count storage part (-ve strand count) uses 1 byte (8 bits) to represent the number of negative strands, with a maximum of 255.
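The 8-byte Counter layout above can be sketched with plain bit operations. The field widths (2 + 14 + 8 + 8 bits per base) come from the text; the order of the fields inside each 32-bit word, the big-endian byte order, and the saturation of values at their maxima are assumptions made for illustration, since Fig. 1a is not reproduced here.

```python
import struct

BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def pack_slot(base, weighted, plus, minus):
    """Pack one base's statistics into 32 bits (assumed order: base, weight, +ve, -ve)."""
    weighted = min(weighted, 0x3FFF)  # 14 bits, maximum 16383 (saturation is assumed)
    plus = min(plus, 0xFF)            # 8 bits, maximum 255
    minus = min(minus, 0xFF)          # 8 bits, maximum 255
    return (BASE_CODE[base] << 30) | (weighted << 16) | (plus << 8) | minus


def unpack_slot(word):
    """Inverse of pack_slot."""
    return (CODE_BASE[word >> 30], (word >> 16) & 0x3FFF, (word >> 8) & 0xFF, word & 0xFF)


def pack_counter(slot1, slot2):
    """Pack the statistics of two bases into the 8-byte Counter."""
    return struct.pack(">II", pack_slot(*slot1), pack_slot(*slot2))


def unpack_counter(data):
    """Return the two (base, weighted, plus, minus) tuples stored in a Counter."""
    word1, word2 = struct.unpack(">II", data)
    return unpack_slot(word1), unpack_slot(word2)
```

At 8 bytes per site, roughly 3 × 10^9 sites occupy about 24 × 10^9 bytes, which matches the 24 GB estimate in the text.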

In some optional embodiments, for a site at which an insertion/deletion (InDel) occurs and at which 3 to 4 base types occur, the first statistical information of the site is stored using the first data structure and the second data structure. Optionally, a 32-byte data structure, OverflowCounter, holds the information of one site: the statistical information of bases A, C, G, and T (base A information, base C information, base G information, and base T information) is represented with 6 bytes each, and the insertion information (Insertion Info.) and deletion information (Deletion Info.) are represented with 4 bytes each. Empirically, 30X whole-genome data has about 200M such sites;

The second data structure, as shown in Fig. 1b, includes:

Base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types; the storage structure of each base type's base weighted quality value statistical information and positive/negative strand statistical information specifically includes: a second quality value storage part for representing the base weighted quality value (weighted count; optionally represented with 2 bytes, maximum 65535), a second positive-strand count storage part for representing the number of positive strands (+ve strand count; optionally represented with 2 bytes, maximum 65535), and a second negative-strand count storage part for representing the number of negative strands (-ve strand count; optionally represented with 2 bytes, maximum 65535);

First insertion statistical information, specifically including: a first insertion sequence storage part for representing the inserted sequence (Insertion Pattern; optionally represented with 3 bytes, which can represent at most 12 bases), and a first low-quality insertion count storage part for representing the number of low-quality insertions (LQ count; optionally represented with 1 byte, maximum 255);

First deletion statistical information, specifically including: a first deletion length storage part for representing the deletion length (Del.Len; optionally represented with 1 byte, maximum 255), a first high-quality deletion count storage part for representing the number of high-quality deletions (HQ count; optionally represented with 1 byte, maximum 255), and a first low-quality deletion count storage part for representing the number of low-quality deletions (LQ count; optionally represented with 1 byte, maximum 255); optionally, 1 byte of unused storage space is also included;
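The 32-byte OverflowCounter layout can be checked with Python's `struct` module. The field sizes follow the text (4 × 6 bytes of base statistics, 3 + 1 bytes of insertion data, 3 + 1 bytes of deletion data including the unused byte); the field order, the big-endian byte order, the left-aligned 2-bit base encoding of the insertion pattern, and the function names are assumptions for illustration.

```python
import struct

# 12 unsigned 16-bit values (A, C, G, T x weighted/+ve/-ve) = 24 bytes,
# 3-byte insertion pattern + 1-byte LQ count = 4 bytes,
# deletion length + HQ count + LQ count + 1 unused pad byte = 4 bytes.
OVERFLOW = struct.Struct(">12H3sB3Bx")
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_pattern(seq):
    """Encode up to 12 bases at 2 bits each into 3 bytes, left-aligned."""
    if len(seq) > 12:
        raise ValueError("at most 12 bases fit in 3 bytes")
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODE[base]
    value <<= 2 * (12 - len(seq))  # pad the unused low bits with zeros
    return value.to_bytes(3, "big")


def pack_overflow(base_stats, ins_pattern, ins_lq, del_len, del_hq, del_lq):
    """base_stats maps each of A, C, G, T to a (weighted, plus, minus) tuple."""
    fields = []
    for base in "ACGT":
        fields.extend(base_stats[base])
    return OVERFLOW.pack(*fields, encode_pattern(ins_pattern), ins_lq,
                         del_len, del_hq, del_lq)
```

Because the padded pattern alone cannot distinguish trailing A bases from padding, the insertion length has to be kept elsewhere; in the text it is held in the 4-bit Ins.Len. field of the Counter.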

When OverflowCounter is used, the content stored in the corresponding first data structure changes. The first data structure, as shown in Fig. 1c, includes:

A second head filled with 11; the fields originally used to store the base types of base1 information and base2 information are both filled with "11" to indicate that OverflowCounter is used;

A first insertion information storage part (Insertion Information) for indicating whether an insertion exists; optionally, the insertion information is held in 14 bits, specifically including: a first insertion information sub-storage part (1 bit) for indicating whether an insertion exists, an insertion length sub-storage part (Ins.Len., represented with 4 bits) for representing the insertion length, and a low-quality insertion count storage part (LQ count, represented with 8 bits) for representing the number of low-quality insertions; optionally, 1 bit is set to 0;

A first deletion information storage part (Deletion Information) for indicating whether a deletion exists; optionally, the deletion information is held in 14 bits: 1 bit (the first deletion information sub-storage part) indicates whether a deletion exists, 1 bit is set to 0, and 12 bits are unused (Unused);

A pointer to the storage location of the corresponding second data structure (array index pointing to dynamic array of overflow counter); optionally, 4 bytes are used to hold a pointer to the location of the OverflowCounter data.

In some optional embodiments, for a site at which more than 1 insertion/deletion occurs or the insertion length is greater than 12 bases, the first statistical information of the site is stored using the first data structure and the third data structure, and a memory pool (Memory Pool, a dedicated block of memory) is created in memory to store the first statistical information of such sites. The statistical information of bases A, C, G, and T (base A information, base C information, base G information, and base T information) is represented with 6 bytes each, and a pointer to the insertion/deletion information is recorded in OverflowCounter, as shown in Fig. 1d;

The third data structure, as shown in Fig. 1d, includes:

Base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types; the storage structure of each base type's base weighted quality value statistical information and positive/negative strand statistical information specifically includes: a third quality value storage part for representing the base weighted quality value (weighted count; optionally represented with 2 bytes, maximum 65535), a third positive-strand count storage part for representing the number of positive strands (+ve strand count; optionally represented with 2 bytes, maximum 65535), and a third negative-strand count storage part for representing the number of negative strands (-ve strand count; optionally represented with 2 bytes, maximum 65535);

Second insertion statistical information (Insertion Ptr), optionally represented with 4 bytes, specifically including: an insertion length storage part for representing the insertion length (Insertion length; optionally represented with 1 byte), a second insertion sequence storage part for representing the inserted sequence (Insertion pattern; variable length, with 2 bits per base), a second low-quality insertion count storage part for representing the number of low-quality insertions (LQ count; optionally represented with 1 byte), and a high-quality insertion count storage part for representing the number of high-quality insertions (HQ count; optionally represented with 1 byte);

Second deletion statistical information (Deletion Ptr), optionally represented with 4 bytes, specifically including: a second deletion length storage part for representing the deletion length (Deletion length; optionally represented with 1 byte), a second high-quality deletion count storage part for representing the number of high-quality deletions (HQ count; optionally represented with 1 byte), and a second low-quality deletion count storage part for representing the number of low-quality deletions (LQ count; optionally represented with 1 byte);
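One possible reading of the insertion record above is that the 4-byte Insertion Ptr refers to a variable-length record in the memory pool made of a 1-byte length, the inserted sequence at 2 bits per base, a 1-byte LQ count, and a 1-byte HQ count. The Python sketch below encodes and decodes such a record under that assumption; the function names, the field order within the record, and the zero padding of the last pattern byte are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_insertion(seq, lq_count, hq_count):
    """Return length byte + 2-bit-packed pattern + LQ count byte + HQ count byte."""
    if not 0 < len(seq) <= 255:
        raise ValueError("insertion length must fit in 1 byte")
    n_bytes = (len(seq) + 3) // 4            # 4 bases per byte
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODE[base]
    value <<= 2 * (n_bytes * 4 - len(seq))   # zero-pad the last byte
    # bytes([...]) raises ValueError if a count does not fit in 1 byte.
    return bytes([len(seq)]) + value.to_bytes(n_bytes, "big") + bytes([lq_count, hq_count])


def decode_insertion(record):
    """Inverse of encode_insertion: return (sequence, lq_count, hq_count)."""
    length = record[0]
    n_bytes = (length + 3) // 4
    value = int.from_bytes(record[1:1 + n_bytes], "big")
    value >>= 2 * (n_bytes * 4 - length)     # drop the padding bits
    seq = "".join(CODE_BASE[(value >> (2 * i)) & 3] for i in reversed(range(length)))
    return seq, record[1 + n_bytes], record[2 + n_bytes]
```

A 15-base insertion, which does not fit in the 3-byte pattern of OverflowCounter, occupies 1 + 4 + 2 = 7 bytes in this sketch.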

At the same time, the information recorded in Counter changes as shown in Fig. 1e. The first data structure includes:

A third head filled with 11; the fields originally used to store the base types of base1 information and base2 information are both filled with "11" to indicate that OverflowCounter is used;

A second insertion information storage part (Insertion Information) for indicating whether an insertion exists, optionally represented with 14 bits, specifically including: a second insertion information sub-storage part (1 bit) for indicating whether an insertion exists, a first memory pool information sub-storage part (1 bit) for indicating whether the memory pool is used, and a first occupied-length sub-storage part (12 bits) for representing the length occupied in the memory pool;

A second deletion information storage part (Deletion Information) for indicating whether a deletion exists, optionally represented with 14 bits, specifically including: a second deletion information sub-storage part (1 bit) for indicating whether a deletion exists, a second memory pool information sub-storage part (1 bit) for indicating whether the memory pool is used, and a second occupied-length sub-storage part (12 bits) for representing the length occupied in the memory pool.

Empirically, soft clipping occurs at only a small number of genomic positions, so there is no need to allocate a separate block of storage for every site. Therefore, in some optional embodiments, the soft clipping statistical information is recorded using a dynamic array, as shown in Fig. 1f. Each record has the form {position, left counts, right counts}, occupies 12 bytes, and specifically includes:

A soft clipping position storage part (position) for representing the position on the genome at which the soft clipping occurs, occupying 4 bytes;

A soft clipping left count storage part (left counts) for representing the number of times soft clipping occurs on the left of the corresponding site, occupying 4 bytes;

A soft clipping right count storage part (right counts) for representing the number of times soft clipping occurs on the right of the corresponding site, occupying 4 bytes.
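
As a minimal sketch of the dynamic array of soft clipping records described above, the following Python code keeps one 12-byte record per clipped site. The class name `SoftClipTable`, the little-endian byte order and the auxiliary position lookup dictionary are assumptions for illustration only.

```python
import struct

# One record: position, left counts, right counts (3 x 4 bytes = 12 bytes).
SOFT_CLIP_RECORD = struct.Struct("<III")


class SoftClipTable:
    """Grows only when a new genomic position shows soft clipping."""

    def __init__(self):
        self.records = []   # dynamic array of [position, left, right]
        self._slot = {}     # hypothetical helper: position -> index in records

    def add(self, position, on_left):
        slot = self._slot.get(position)
        if slot is None:
            slot = len(self.records)
            self._slot[position] = slot
            self.records.append([position, 0, 0])
        self.records[slot][1 if on_left else 2] += 1

    def to_bytes(self):
        """Serialize all records back to back, 12 bytes each."""
        return b"".join(SOFT_CLIP_RECORD.pack(*rec) for rec in self.records)
```

Because records exist only for sites where clipping was actually observed, the memory cost scales with the number of clipped sites rather than with the genome length.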

In some optional embodiments, as shown in Fig. 1g, the index includes a paired-end alignment (Pair End) information index and a single-end alignment (Single End) information index;

The paired-end alignment information index is stored using a paired-end alignment array structure, and the paired-end alignment array structure (PairEndAlignmentInfo, occupying 12 bytes) includes:

A first ID storage part (ReadID) for representing the ID of the gene sequence, occupying 4 bytes;

A first alignment position storage part (Aligned Position) for representing the position on the genome to which the gene sequence is aligned, occupying 4 bytes;

An insert size storage part (Insert Size) for representing the insert size of the gene sequence, occupying 2 bytes;

A first mapping quality storage part (MAPQ) for representing the mapping quality value of the gene sequence, occupying 1 byte;

A first average quality storage part (Average Base Quality) for representing the average base quality value of the gene sequence, occupying 1 byte;

The single-end alignment information index is stored using a single-end alignment array structure, and the single-end alignment array structure (SingleEndAlignmentInfo, occupying 10 bytes) includes:

A second ID storage part (ReadID) for representing the ID of the gene sequence, occupying 4 bytes;

A second alignment position storage part (Aligned Position) for representing the position on the genome to which the gene sequence is aligned, occupying 4 bytes;

A second mapping quality storage part (MAPQ) for representing the mapping quality value of the gene sequence, occupying 1 byte;

A second average quality storage part (Average Base Quality) for representing the average base quality value of the gene sequence, occupying 1 byte;
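
The byte budgets of the two index entries above (12 bytes and 10 bytes) can be checked with a short Python sketch. The field order follows the description; the little-endian packing, the unsigned integer types and the helper function names are assumptions for illustration.

```python
import struct

# PairEndAlignmentInfo: ReadID (4), Aligned Position (4), Insert Size (2),
# MAPQ (1), Average Base Quality (1) -> 12 bytes without padding.
PAIR_END = struct.Struct("<IIHBB")

# SingleEndAlignmentInfo: ReadID (4), Aligned Position (4), MAPQ (1),
# Average Base Quality (1) -> 10 bytes without padding.
SINGLE_END = struct.Struct("<IIBB")


def pack_pair_end(read_id, aligned_position, insert_size, mapq, avg_quality):
    return PAIR_END.pack(read_id, aligned_position, insert_size, mapq, avg_quality)


def pack_single_end(read_id, aligned_position, mapq, avg_quality):
    return SINGLE_END.pack(read_id, aligned_position, mapq, avg_quality)
```

The `<` prefix disables alignment padding, which is what keeps the entries at exactly 12 and 10 bytes.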

Because sequences are read in a highly random order during variant detection, in some optional embodiments, storing the gene sequence alignment information on disk specifically includes:

Dividing the gene sequence alignment information into 512 files (buckets) stored on disk, each file storing the gene sequence alignment information of a certain genomic interval. As shown in Fig. 1h, the data storage structure of each piece of gene sequence alignment information includes:

A sequence length storage part (Read Length) for representing the sequence length of the gene sequence, occupying 2 bytes;

A sequence storage part (Packed Read) for representing the gene sequence itself, of variable length, using 2 bits to represent one base;

A quality value storage part (Base Qualities) for representing the quality values of the gene sequence, of variable length;

A start position storage part (DP Start Pos.) for representing the start position of the alignment algorithm when the gene sequence is aligned, occupying 4 bytes;

A strand storage part (Strand) for representing the forward/reverse strand information of the gene sequence when aligned, occupying 1 bit;

A region length storage part (DP Ref. length) for representing the length of the genomic region selected when the gene sequence is aligned, occupying 15 bits;

A left position storage part (Left Anchor) for representing the position anchored on the left when the gene sequence is aligned, occupying 4 bytes;

A right position storage part (Right Anchor) for representing the position anchored on the right when the gene sequence is aligned, occupying 4 bytes.
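
The on-disk record layout above can be sketched in Python as follows. The 2-bit base codes, the sharing of one 16-bit word by the strand bit and the 15-bit region length, one byte per quality value, and the equal-width rule in `bucket_for_position` are all assumptions for illustration; the description fixes only the field sizes and the number of buckets. Bases other than A, C, G and T are not handled in this sketch.

```python
import struct

BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding
CODE_BASES = "ACGT"


def pack_bases(seq):
    """Pack a base string at 2 bits per base, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack_bases(packed, length):
    return "".join(
        CODE_BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def encode_record(seq, quals, dp_start, reverse, dp_ref_len, left_anchor, right_anchor):
    """Serialize one alignment record in the field order of the description."""
    if len(quals) != len(seq):
        raise ValueError("one quality value per base is expected")
    if not 0 <= dp_ref_len < (1 << 15):
        raise ValueError("region length must fit in 15 bits")
    strand_and_len = (int(reverse) << 15) | dp_ref_len   # 1 bit + 15 bits
    return (
        struct.pack("<H", len(seq))
        + pack_bases(seq)
        + bytes(quals)
        + struct.pack("<IHII", dp_start, strand_and_len, left_anchor, right_anchor)
    )


def bucket_for_position(position, genome_length, n_buckets=512):
    """Map a genomic position to one of the bucket files (equal-width intervals)."""
    return min(position * n_buckets // genome_length, n_buckets - 1)
```

Grouping records by genomic interval means that a variant caller working on one region only needs to open the bucket covering that region.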

In addition to the step of creating the statistics and the index during alignment in the foregoing embodiments, in some optional embodiments, the method further includes:

During de-duplication, subtracting from the genome statistical information the interference caused by duplicate sequences;

And/or

During realignment, extracting the gene sequences of the realignment regions of the genome and, after the gene sequences of a realignment region have been realigned, adjusting the genome statistical information of the gene sequences of that realignment region;

During variant detection, using these statistics directly to calculate the probabilities of the various genotypes.

With the genomic data storage method provided by the embodiments of the present invention, the whole analysis process does not need to repeatedly output large binary files. Through overall algorithm optimization, the analysis of the data of one whole genome can be completed within 4 hours, whereas a typical analysis pipeline needs tens of hours. The I/O during variant detection and analysis is greatly reduced, and the analysis efficiency of the program is greatly improved.

To facilitate understanding of the foregoing technical solution, an embodiment of a genome sequence alignment method is briefly introduced here to explain the genome alignment process of step 101 in the foregoing embodiments. Fig. 2 is a schematic flowchart of an embodiment of the genome sequence alignment method provided by the present invention.

The genome sequence alignment method comprises the following steps:

Step 201: Obtain the reference genome sequence and the genome sequence file to be aligned. The files are obtained in a conventional manner. The genome sequence file to be aligned may be in FASTQ format.

In the genome sequence alignment method, sequence alignment is carried out in 3 levels. Each time, a portion of the sequences is read from the input genome sequence file to be aligned, and the level-1, level-2 and level-3 alignment algorithms are executed on it in turn; sequences that were not aligned at one level proceed to the alignment algorithm of the next level. The method specifically includes the following steps.

Step 202: Read partial genome sequences from the genome sequence file to be aligned.

Step 203: According to the bidirectional BWT alignment algorithm (level 1; bidirectional BWT: Bi-directional Burrows-Wheeler Transform), align the partial genome sequences against the reference genome sequence. The bidirectional BWT alignment algorithm handles the alignment of reads with at most 4 base mismatches. Reads are the sequencing sequences obtained in high-throughput sequencing, and each read is a stretch of base sequence. In bioinformatics analysis, aligning each read to the reference genome reveals the differences between the sequencing sequences and the reference genome, and thereby the variants.

Optionally, the method of aligning genome sequences according to the bidirectional BWT alignment algorithm may specifically include the following steps:

Segmenting the reads using the pigeonhole principle, each segment allowing 0-2 base mismatches;

Then searching and aligning using the bidirectional BWT alignment algorithm, including:

Building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;

Using backward search and forward search to look up, in the two directions from right to left and from left to right respectively, the position in the reference genome sequence of each read or of each fragment of a read.

Bidirectional BWT alignment is relatively inefficient when handling many base mismatches. Where at most 4 base mismatches are allowed, the reads are segmented according to the pigeonhole principle and each segment is allowed 0-2 base mismatches, so that the bidirectional BWT only has to handle alignments with at most 2 base mismatches, which greatly increases efficiency.
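
The pigeonhole argument above can be made concrete with a small Python sketch. If a read with at most k mismatches is cut into n segments, at least one segment carries at most floor(k / n) mismatches; with k = 4 and n = 2 this gives the bound of 2 mentioned in the text. The number of segments is not stated in the description, so n = 2 and the helper names below are assumptions for illustration.

```python
def split_read(read, n_segments):
    """Cut a read into n nearly equal segments; returns (offset, segment) pairs."""
    base, extra = divmod(len(read), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        segments.append((start, read[start:end]))
        start = end
    return segments


def per_segment_bound(max_mismatches, n_segments):
    """Pigeonhole bound: some segment has at most this many mismatches."""
    return max_mismatches // n_segments


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))
```

A segment that matches within the bound can then serve as the anchor from which the rest of the read is verified.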

The common alignment software BWA, after building the BWT of the reference sequence together with the corresponding index and SA (suffix array), uses backward search, that is, it searches from right to left for the position on the genome of each read or of each fragment of a read. In addition to building the conventional BWT index (denoted B), the bidirectional BWT used in this patent also builds a BWT index on the reversed reference sequence (denoted B'). Using B, B' and SA, the positions of reads or seeds on the genome are searched in both directions by backward and forward search, and the efficiency of sequence alignment is significantly improved.
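
To illustrate how the indexes B and B' support search in two directions, here is a minimal exact-match sketch in Python. It is not the patented bidirectional algorithm: it builds a naive suffix array, runs the standard backward search on B, and emulates forward search by running backward search with the reversed pattern on an index of the reversed reference. All function names are assumptions for illustration, and mismatch handling is omitted.

```python
def build_index(reference):
    """Return (suffix array, C table, occurrence table) for reference + '$'."""
    text = reference + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive construction
    bwt = "".join(text[i - 1] for i in sa)                  # text[-1] is '$'
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):            # C[ch]: characters smaller than ch
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}       # occ[ch][i]: ch in bwt[:i]
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, index):
    """Extend the match right to left; return sorted start positions."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


def forward_search(pattern, reversed_index, ref_length):
    """Extend the match left to right by searching the reversed reference."""
    hits = backward_search(pattern[::-1], reversed_index)
    return sorted(ref_length - h - len(pattern) for h in hits)
```

Both searches report the same positions for an exact match; the benefit of having both directions is that a match can be extended from an anchor segment toward either end of the read.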

Step 204: Determine whether the partial genome sequences contain at least one pair of reads in which only one read has been aligned (that is, among the partial genome sequences, at least one pair of reads has only one read aligned); if so, go to step 208; if not, go to step 205.

Step 205: According to the single-end dynamic programming alignment algorithm (level 2), align again against the reference genome sequence each pair of reads in the partial genome sequences of which only one read has been aligned. That is, where, after the foregoing level-1 bidirectional BWT alignment algorithm, one read (A or A') of a pair of reads (A, A') has been aligned to the reference genome sequence but the other (A' or A) has not, alignment continues using the level-2 alignment algorithm.

Optionally, the method of aligning genome sequences according to the single-end dynamic programming alignment algorithm may specifically include the following steps:

Determining the specific position (pos) in the reference genome sequence to which one read (A or A') of a pair of reads (A, A') is aligned. The reads obtained by paired-end sequencing come in pairs; assuming that one read (A or A') of a pair (A, A') is aligned to position pos in the reference genome sequence, the theoretical alignment position of the other read (A' or A) lies within a certain region around pos, namely the candidate region;

Therefore, selecting a specific range around the specific position (pos) according to a preset position range threshold. The preset position range threshold can be chosen according to actual needs, for example set with reference to the error tolerance. Specifically, in paired-end sequencing, when a pair of reads is aligned to the genome, the distance between the two reads plus the sum of the lengths of the two reads equals the length of the sequencing fragment, and the position of the candidate region is determined on this principle. For example, if the sequencing fragment is 500bp and each read is 150bp, then after alignment to the genome the theoretical distance between the two reads is 200bp. Because the sequencing fragment length varies, the theoretical distance is approximately 100bp to 200bp;

Using a dynamic programming algorithm to align, within the specific range, the other read (A' or A) of the pair that has not been aligned.

Step 206: Determine whether the partial genome sequences contain at least one pair of reads in which neither read has been aligned (that is, among the partial genome sequences, at least one pair of reads has neither read aligned); if so, go to step 208; if not, go to step 207.
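
The candidate-region arithmetic used by the level-2 single-end dynamic programming alignment of step 205 can be sketched as follows. The function names, the half-open interval convention and the `mate_downstream` flag are assumptions for illustration; the description gives only the relation between fragment length, read length and distance.

```python
def expected_gap(fragment_length, read_length):
    """Distance between the two reads of a pair: fragment minus both reads."""
    return fragment_length - 2 * read_length


def candidate_region(pos, read_length, min_gap, max_gap, mate_downstream=True):
    """Half-open [start, end) window in which the unaligned mate is searched.

    pos is the start of the aligned read; min_gap and max_gap bound the
    distance between the two reads (for example 100 and 200).
    """
    if mate_downstream:
        start = pos + read_length + min_gap
        end = pos + read_length + max_gap + read_length
    else:
        start = pos - max_gap - read_length
        end = pos - min_gap
    return max(start, 0), max(end, 0)
```

For the example in the text, `expected_gap(500, 150)` gives 200, and a read aligned at position 1000 with gaps of 100 to 200 yields the downstream window from 1250 to 1500.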

Step 207: According to the paired-end dynamic programming alignment algorithm (level 3), align again against the reference genome sequence each pair of reads in the partial genome sequences of which neither read has been aligned. That is, where, after the foregoing level-1 bidirectional BWT alignment algorithm and level-2 single-end dynamic programming alignment algorithm, neither A nor A' of a pair of reads (A, A') has been aligned to the reference genome sequence, alignment continues using the level-3 alignment algorithm.

Optionally, the method of aligning genome sequences according to the paired-end dynamic programming alignment algorithm may specifically include the following steps:

Building seeds (substrings of a read) for each read (A and A') of a pair of reads;

Specifically, each read of a pair of reads (A, A') is divided into many short segments to build the seeds. When a pair of reads is aligned to the genome, the distance between the two reads lies within a certain range, so the distance between the seeds of the two reads should also lie within a certain range;

Aligning each seed to the reference genome sequence;

Specifically, the regions in which seeds align in pairs (that is, in which the distance between two seeds meets the requirement) are retrieved, and the candidate alignment region of this pair of reads is determined. The reads are then aligned to the candidate region with a dynamic programming algorithm.

If, in a certain region of the reference genome sequence, both reads (A and A') of the pair have corresponding seeds aligned, that region is a candidate region for the final alignment position;

In the candidate region, each of the two reads (A and A') is aligned using a dynamic programming algorithm; after the alignment is completed, go to step 208.

Step 208: Determine whether the genome sequence file to be aligned has been completely aligned; if not, return to step 202; if so, go to step 209.
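
The seed pairing of the level-3 algorithm in step 207 can be illustrated with a minimal Python sketch. Exact string matching with `str.find` stands in for the index lookup, and the seed length, step and distance bounds are hypothetical parameters; the final dynamic programming alignment is not shown.

```python
def build_seeds(read, seed_length, step):
    """Cut a read into (offset, seed) substrings."""
    return [
        (off, read[off:off + seed_length])
        for off in range(0, len(read) - seed_length + 1, step)
    ]


def seed_hits(reference, seeds):
    """Implied read start positions on the reference, from exact seed matches."""
    hits = set()
    for offset, seed in seeds:
        start = reference.find(seed)
        while start != -1:
            hits.add(start - offset)
            start = reference.find(seed, start + 1)
    return sorted(hits)


def candidate_pairs(hits_a, hits_b, min_dist, max_dist):
    """Keep only placements of A and A' whose distance is plausible for a pair."""
    return [
        (a, b)
        for a in hits_a
        for b in hits_b
        if min_dist <= abs(b - a) <= max_dist
    ]
```

Each surviving pair of placements marks a candidate region in which both reads would then be aligned by dynamic programming.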

Step 209: Output the alignment result. Optionally, a BAM file is the output file of the genome sequence alignment; BAM is the format in which genome sequence alignment results are saved, and it records the positions of the genome sequences on the reference genome sequence and the detailed sequence alignment.

As can be seen from the above embodiment, the genome sequence alignment method provided by the present invention sets up multi-level alignment algorithms: after the alignment of one level is completed, the sequences that were not aligned continue to be aligned by the next-level algorithm, so that the complexity of the algorithm matches the complexity of the data. Each level of the algorithm is also optimized, so that the overall algorithm is optimized for speed. With the genome sequence alignment method provided by the present invention, given the same resources and without loss of alignment accuracy, the alignment time for a human whole-genome sequence can be shortened to about 4 hours, which is significantly shorter than the alignment time of the prior art and improves data analysis efficiency.
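
One possible reading of the three-level cascade summarized above is sketched below in Python. The three aligners are passed in as hypothetical callables, and the routing (level 2 for pairs with exactly one aligned read, level 3 for pairs with neither read aligned) follows the descriptions of steps 205 and 207; it is an assumption wherever the branch wording of the steps is ambiguous.

```python
def align_in_levels(read_pairs, level1, level2, level3):
    """Run each read pair through progressively more expensive aligners.

    level1(read) -> position or None
    level2(read, mate_position) -> position or None
    level3(read_a, read_b) -> (position_a, position_b)
    """
    results = []
    for read_a, read_b in read_pairs:
        hit_a, hit_b = level1(read_a), level1(read_b)
        if hit_a is not None and hit_b is None:
            hit_b = level2(read_b, hit_a)        # rescue the mate near hit_a
        elif hit_a is None and hit_b is not None:
            hit_a = level2(read_a, hit_b)        # rescue the mate near hit_b
        elif hit_a is None and hit_b is None:
            hit_a, hit_b = level3(read_a, read_b)
        results.append((hit_a, hit_b))
    return results
```

Since only the reads left unaligned by a level reach the next one, the costlier algorithms run on a subset of the data.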

Based on the above purpose, in a second aspect of the embodiments of the present invention, an embodiment of a genomic data storage device is proposed, which can solve the problem of low efficiency caused by the need to frequently input and output large binary files during genome variant detection. Fig. 3 is a schematic structural diagram of an embodiment of the genomic data storage device provided by the present invention.

The genomic data storage device includes:

A creation module 301, configured to obtain gene sequence alignment information during genome alignment and to create gene sequence statistical information;

An alignment information storage module 302, configured to store the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, to store a corresponding index in memory; the index is the storage location of the gene sequence alignment information on disk;

A statistical information classification module 303, configured to classify the genome statistical information to obtain first statistical information and second statistical information;

A statistical information storage module 304, configured to store the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency; and to store the second statistical information on disk, the second statistical information being statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency.
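
The classification performed by modules 303 and 304 can be illustrated with a minimal Python sketch. The input format (a name mapped to an access frequency and a size), the greedy order by access frequency and the `memory_budget` parameter are assumptions for illustration; the description states only the two criteria (access frequency relative to a preset frequency, and whether the data can be stored in memory).

```python
def classify_statistics(stats, frequency_threshold, memory_budget):
    """Split statistics into an in-memory set and an on-disk set.

    stats maps a name to (access_frequency, size). An item goes to memory
    only if it is accessed more often than the threshold and still fits.
    """
    in_memory, on_disk, used = {}, {}, 0
    by_frequency = sorted(stats.items(), key=lambda kv: -kv[1][0])
    for name, (access_frequency, size) in by_frequency:
        if access_frequency > frequency_threshold and used + size <= memory_budget:
            in_memory[name] = size
            used += size
        else:
            on_disk[name] = size
    return in_memory, on_disk
```

In this sketch, an item that is accessed frequently but does not fit in the remaining budget is placed on disk, which mirrors the "cannot be stored in memory" branch of the second statistical information.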

In some optional embodiments, the first statistical information includes base weighted quality value statistical information, forward/reverse strand statistical information, insertion/deletion statistical information and soft clipping statistical information.

In some optional embodiments, for a site at which no insertion/deletion or soft clipping has occurred and at most 2 base types have appeared, the first statistical information of the site is stored using a first data structure;

The first data structure includes:

A first header for representing the base type;

In some optional embodiments, for a site at which an insertion/deletion has occurred and 3-4 base types have appeared, the first statistical information of the site is stored using the first data structure and a second data structure;

The second data structure includes:

The first data structure includes:

A second header filled with 11;

A first deletion information storage part for indicating whether a deletion exists;

In some optional embodiments, for a site at which more than 1 insertion/deletion has occurred or the insertion length is greater than 12 bases, the first statistical information of the site is stored using the first data structure and a third data structure, and a memory pool is created in memory to store the first statistical information of such sites;

The third data structure includes:

The first data structure includes:

A third header filled with 11;

In some optional embodiments, the soft clipping statistical information is recorded using a dynamic array, and each record includes:

In some optional embodiments, the index includes a paired-end alignment information index and a single-end alignment information index;

A first ID storage part for representing the ID of the gene sequence;

A second ID storage part for representing the ID of the gene sequence;

In some optional embodiments, storing the gene sequence alignment information on disk specifically includes:

A sequence storage part for representing the gene sequence itself;

Based on the above purpose, in a third aspect of the embodiments of the present invention, an embodiment of a device for performing the genomic data storage method is proposed. Fig. 4 is a schematic diagram of the hardware structure of an embodiment of the device for performing the genomic data storage method provided by the present invention.

As shown in Fig. 4, the device includes:

One or more processors 401 and a memory 402; one processor 401 is taken as an example in Fig. 4.

The device for performing the genomic data storage method may further include an input device 403 and an output device 404.

The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 4.

As a non-volatile computer-readable storage medium, the memory 402 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the genomic data storage method in the embodiments of the present application (for example, the creation module 301, the alignment information storage module 302, the statistical information classification module 303 and the statistical information storage module 304 shown in Fig. 3). By running the non-volatile software programs, instructions and modules stored in the memory 402, the processor 401 executes the various functional applications and data processing of the server, that is, implements the genomic data storage method of the above method embodiments.

The memory 402 may include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function, and the data storage area can store data created through use of the genomic data storage device, and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 402 optionally includes memories located remotely from the processor 401, and these remote memories may be connected to the member user behavior monitoring device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

The input device 403 can receive input numeric or character information and generate key signal inputs related to the user settings and function control of the genomic data storage device. The output device 404 may include a display device such as a display screen.

The one or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the genomic data storage method in any of the above method embodiments. The technical effect of the embodiment of the device for performing the genomic data storage method is the same as or similar to that of any of the foregoing method embodiments.

The embodiments of the present application further provide a non-transitory computer storage medium storing computer-executable instructions, and the computer-executable instructions can perform the processing method of the list item operation in any of the above method embodiments. The technical effect of the embodiment of the non-transitory computer storage medium is the same as or similar to that of any of the foregoing method embodiments.

Finally, it should be noted that those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like. The technical effect of the embodiment of the computer program is the same as or similar to that of any of the foregoing method embodiments.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of the present invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the present invention as described above which, for brevity, are not provided in detail.

In addition, to simplify the description and discussion, and in order not to obscure the present invention, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the present invention, which also takes into account the fact that the details of the implementation of these block diagram devices are highly dependent on the platform on which the present invention is to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present invention, it will be apparent to those skilled in the art that the present invention can be implemented without these specific details or with variations of them. Accordingly, these descriptions are to be regarded as illustrative rather than restrictive.

Although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.

The embodiments of the present invention are intended to cover all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims

1. A genomic data storage method, characterized by comprising:

During alignment, obtaining gene sequence alignment information and creating gene sequence statistical information;

Storing the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, storing a corresponding index in memory; the index is the storage location of the gene sequence alignment information on disk;

Classifying the genome statistical information to obtain first statistical information and second statistical information;

Storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency;

Storing the second statistical information on disk, the second statistical information being statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency.
2. The method according to claim 1, characterized in that the first statistical information includes base weighted quality value statistical information, forward/reverse strand statistical information, insertion/deletion statistical information and soft clipping statistical information.
3. The method according to claim 2, characterized in that, for a site at which no insertion/deletion or soft clipping has occurred and at most 2 base types have appeared, the first statistical information of the site is stored using a first data structure;

The first data structure includes:

A first header for representing the base type;

A first quality value storage part for representing the base weighted quality value;

A first forward strand count storage part for representing the number of forward strands;

A first reverse strand count storage part for representing the number of reverse strands.
4. The method according to claim 2, characterized in that, for a site at which an insertion/deletion has occurred and 3-4 base types have appeared, the first statistical information of the site is stored using a first data structure and a second data structure;

The second data structure includes:

The base weighted quality value statistical information and forward/reverse strand statistical information of each of the 4 base types; the storage structure of the base weighted quality value statistical information and forward/reverse strand statistical information of each base type specifically includes: a second quality value storage part for representing the base weighted quality value, a second forward strand count storage part for representing the number of forward strands, and a second reverse strand count storage part for representing the number of reverse strands;

First insertion statistical information, specifically including: a first inserted sequence storage part for representing the inserted sequence, and a first low-quality insertion count storage part for representing the number of low-quality insertions;

First deletion statistical information, specifically including: a first deletion length storage part for representing the deletion length, a first high-quality deletion count storage part for representing the number of high-quality deletions, and a first low-quality deletion count storage part for representing the number of low-quality deletions;

The first data structure includes:

A second header filled with 11;

A first insertion information storage part for indicating whether an insertion exists, specifically including: a first insertion information sub-storage part for indicating whether an insertion exists, an insertion length sub-storage part for representing the insertion length, and a low-quality insertion count storage part for representing the number of low-quality insertions;

A first deletion information storage part for indicating whether a deletion exists, specifically including: a first deletion information sub-storage part for indicating whether a deletion exists;

A pointer to the storage location of the corresponding second data structure.
5. according to the method for claim 2, it is characterised in that for occur unnecessary 1 insertion and deletion, intubating length it is big In the site of 12 bases, first statistical information in the site uses the first data structure and the 3rd data structure storage, and right The first statistical information in such site, memory pool is created in internal memory to be stored；

the third data structure including:

base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types; the storage structure of the base weighted quality value statistical information and positive/negative strand statistical information of each base type specifically includes: a third quality value storage part for representing the base weighted quality value, a third positive strand count storage part for representing the number of positive strands, and a third negative strand count storage part for representing the number of negative strands;

second insertion statistical information, specifically including: an insertion length storage part for representing the insertion length, a second insertion sequence storage part for representing the inserted sequence, a second low-quality insertion count storage part for representing the number of low-quality insertions, and a high-quality insertion count storage part for representing the number of high-quality insertions;

second deletion statistical information, specifically including: a second deletion length storage part for representing the deletion length, a second high-quality deletion count storage part for representing the number of high-quality deletions, and a second low-quality deletion count storage part for representing the number of low-quality deletions;

the first data structure including:

a third header filled with 11;

a second insertion information storage part for indicating whether an insertion exists, specifically including: a second insertion information sub-storage part for indicating whether an insertion exists, a first memory pool information sub-storage part for indicating whether the memory pool is used, and a first occupied length sub-storage part for representing the length occupied in the memory pool;

a second deletion information storage part for indicating whether a deletion exists, specifically including: a second deletion information sub-storage part for indicating whether a deletion exists, a second memory pool information sub-storage part for indicating whether the memory pool is used, and a second occupied length sub-storage part for representing the length occupied in the memory pool.
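
By way of non-limiting illustration, the header word of claim 5 (a header filled with 11, followed by an insertion information part and a deletion information part, each holding a presence flag, a memory pool flag, and an occupied length) may be sketched as a packed integer. The claim does not state any field widths or bit order; the 6-bit length field, the ordering of the fields, and the names `IndelInfo`, `pack_word`, and `unpack_word` are assumptions made only for this sketch.

```python
from dataclasses import dataclass

HEADER_COMPLEX = 0b11      # the header "filled with 11" named in the claim
LEN_BITS = 6               # ASSUMED width of each occupied-length field
FLAG_BITS = LEN_BITS + 2   # presence bit + memory-pool bit + length field


@dataclass
class IndelInfo:
    """Hypothetical view of one insertion (or deletion) information part."""
    present: bool      # whether an insertion/deletion exists at the site
    uses_pool: bool    # whether its details are kept in the memory pool
    pool_length: int   # length occupied in the memory pool


def _pack_info(info):
    if not 0 <= info.pool_length < (1 << LEN_BITS):
        raise ValueError("pool_length does not fit in LEN_BITS")
    return (
        (int(info.present) << (LEN_BITS + 1))
        | (int(info.uses_pool) << LEN_BITS)
        | info.pool_length
    )


def _unpack_info(bits):
    return IndelInfo(
        present=bool((bits >> (LEN_BITS + 1)) & 1),
        uses_pool=bool((bits >> LEN_BITS) & 1),
        pool_length=bits & ((1 << LEN_BITS) - 1),
    )


def pack_word(insertion, deletion):
    """Pack header, insertion part, and deletion part into one integer."""
    return (
        (HEADER_COMPLEX << (2 * FLAG_BITS))
        | (_pack_info(insertion) << FLAG_BITS)
        | _pack_info(deletion)
    )


def unpack_word(word):
    """Return (header, insertion part, deletion part) from a packed word."""
    mask = (1 << FLAG_BITS) - 1
    header = word >> (2 * FLAG_BITS)
    return header, _unpack_info((word >> FLAG_BITS) & mask), _unpack_info(word & mask)


# Example: a site whose insertion details occupy 13 units of the memory pool.
word = pack_word(IndelInfo(True, True, 13), IndelInfo(False, False, 0))
header, insertion, deletion = unpack_word(word)
```

Keeping only flags and an occupied length in the fixed-size word, and placing the variable-length insertion and deletion statistics in the memory pool, is one way such a layout could keep per-site records compact for the relatively rare complex sites.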
6. The method according to claim 2, characterized in that the soft clipping statistical information is recorded using a dynamic array, each record including:

a soft clipping position storage part for representing the position on the genome at which the soft clipping occurs;

a soft clipping left count storage part for representing the number of times the soft clipping occurs on the left side of the corresponding site;

a soft clipping right count storage part for representing the number of times the soft clipping occurs on the right side of the corresponding site.
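
By way of non-limiting illustration, the dynamic array of claim 6 may be sketched as a growable list of records, each holding a genome position and two counters. The names `SoftClipRecord` and `SoftClipStats`, and the auxiliary position-to-index dictionary used to find an existing record, are assumptions of this sketch and are not recited in the claim.

```python
from dataclasses import dataclass


@dataclass
class SoftClipRecord:
    position: int          # position of the soft clipping on the genome
    left_count: int = 0    # times soft clipping occurred left of the site
    right_count: int = 0   # times soft clipping occurred right of the site


class SoftClipStats:
    """Hypothetical container: one record per position, appended on demand."""

    def __init__(self):
        self.records = []    # the dynamic array of records
        self._index = {}     # ASSUMED helper: position -> index in records

    def add(self, position, side):
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        i = self._index.get(position)
        if i is None:
            # First soft clipping seen at this position: grow the array.
            i = len(self.records)
            self._index[position] = i
            self.records.append(SoftClipRecord(position))
        if side == "left":
            self.records[i].left_count += 1
        else:
            self.records[i].right_count += 1


stats = SoftClipStats()
stats.add(100, "left")
stats.add(100, "right")
stats.add(250, "right")
```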
7. The method according to claim 1, characterized in that the index includes a paired-end alignment information index and a single-end alignment information index;

the paired-end alignment information index is stored using a paired-end alignment array structure, the paired-end alignment array structure including:

a first ID storage part for representing the ID of the gene sequence;

a first alignment position storage part for representing the position on the genome to which the gene sequence is aligned;

an insert fragment length storage part for representing the insert fragment length of the gene sequence;

a first alignment quality value storage part for representing the alignment quality value of the gene sequence;

a first average quality value storage part for representing the average quality value of the gene sequence;

the single-end alignment information index is stored using a single-end alignment array structure, the single-end alignment array structure including:

a second ID storage part for representing the ID of the gene sequence;

a second alignment position storage part for representing the position on the genome to which the gene sequence is aligned;

a second alignment quality value storage part for representing the alignment quality value of the gene sequence;

a second average quality value storage part for representing the average quality value of the gene sequence;

wherein, for each gene sequence used for alignment, the corresponding indexes are arranged in order according to the alignment position of the gene sequence on the genome.
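
By way of non-limiting illustration, the two index structures of claim 7 and their ordering by alignment position may be sketched as follows. The field names, the use of plain integers for every field, and the helper functions `build_index` and `entries_in_range` are assumptions of this sketch; the claim specifies only which storage parts each entry contains and that the entries are arranged by alignment position.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class PairedEndEntry(NamedTuple):
    read_id: int        # ID of the gene sequence
    position: int       # position on the genome the sequence aligned to
    insert_size: int    # insert fragment length
    align_quality: int  # alignment quality value
    avg_quality: int    # average quality value


class SingleEndEntry(NamedTuple):
    read_id: int
    position: int
    align_quality: int
    avg_quality: int


def build_index(entries):
    """Arrange index entries in order of alignment position."""
    return sorted(entries, key=lambda e: e.position)


def entries_in_range(index, start, end):
    """Return entries with start <= position <= end from a sorted index."""
    positions = [e.position for e in index]
    lo = bisect_left(positions, start)
    hi = bisect_right(positions, end)
    return index[lo:hi]


index = build_index([
    PairedEndEntry(7, 300, 410, 60, 35),
    PairedEndEntry(3, 120, 395, 60, 33),
    PairedEndEntry(9, 200, 402, 20, 30),
])
nearby = entries_in_range(index, 150, 300)
```

Because the entries are ordered by position, all sequences covering a genomic window can be located with a binary search rather than a scan, which is the kind of lookup a variant detection step would presumably need.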
8. The method according to claim 1, characterized in that storing the gene sequence alignment information on disk specifically includes:

dividing all of the gene sequence alignment information into 512 files stored on disk, each file storing the gene sequence alignment information of a certain genomic interval, the data storage structure of each piece of gene sequence alignment information including:

a sequence length storage part for representing the sequence length of the gene sequence;

a sequence storage part for representing the gene sequence itself;

a quality value storage part for representing the quality values of the gene sequence;

a start position storage part for representing the position at which the alignment algorithm starts when the gene sequence is aligned;

a positive/negative strand storage part for representing the positive/negative strand information of the gene sequence when aligned;

a region length storage part for representing the length of the genomic region selected when the gene sequence is aligned;

a left position storage part for representing the left anchor position of the gene sequence when aligned;

a right position storage part for representing the right anchor position of the gene sequence when aligned.
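
By way of non-limiting illustration, the partitioning into 512 files and the per-sequence record of claim 8 may be sketched as follows. The claim does not state how genomic intervals are assigned to files or how wide each field is; the equal-width interval rule in `file_index`, the little-endian field widths, the one-byte strand flag (1 for positive, 0 for negative), and all function names are assumptions of this sketch.

```python
import struct

NUM_FILES = 512
# ASSUMED trailing fields: start position, strand flag, region length,
# left anchor position, right anchor position.
_TAIL = struct.Struct("<QBIQQ")
_LEN = struct.Struct("<I")


def file_index(position, genome_length):
    """Map a genome position to one of 512 files (ASSUMED equal-width intervals)."""
    if not 0 <= position < genome_length:
        raise ValueError("position outside genome")
    return position * NUM_FILES // genome_length


def encode_record(seq, quals, start_pos, strand, region_len, left_anchor, right_anchor):
    """Serialize one record in the field order listed in the claim."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return (
        _LEN.pack(len(seq))
        + seq.encode("ascii")
        + bytes(quals)  # one byte per quality value (0-255)
        + _TAIL.pack(start_pos, strand, region_len, left_anchor, right_anchor)
    )


def decode_record(buf):
    """Inverse of encode_record."""
    (n,) = _LEN.unpack_from(buf, 0)
    off = _LEN.size
    seq = buf[off:off + n].decode("ascii")
    quals = list(buf[off + n:off + 2 * n])
    tail = _TAIL.unpack_from(buf, off + 2 * n)
    return (seq, quals) + tail


record = encode_record("ACGT", [30, 31, 32, 33], 1000, 1, 200, 990, 1010)
fields = decode_record(record)
```

An index entry held in memory could then refer to such a record by file number and byte offset, although the exact form of that reference is not specified here.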
9. The method according to any one of claims 1-8, characterized by further including:

during deduplication, subtracting the interference caused by duplicate sequences from the genome statistical information;

and/or

during realignment, extracting the gene sequences of a realignment region of the genome and, after realigning the gene sequences of the realignment region, adjusting the genome statistical information of the gene sequences of the realignment region.
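
By way of non-limiting illustration, the adjustments of claim 9 may be sketched as updates to in-memory per-site counts instead of rewriting an intermediate file. The sketch assumes ungapped alignments and a simple per-position base count as the only statistic; the functions `apply_read`, `remove_duplicate`, and `realign` are hypothetical names and do not reproduce the statistical information actually defined in the earlier claims.

```python
from collections import defaultdict


def new_stats():
    """stats[position][base] -> count of reads supporting that base."""
    return defaultdict(lambda: defaultdict(int))


def apply_read(stats, start, seq, sign=1):
    """Add (sign=1) or subtract (sign=-1) one read's contribution."""
    for offset, base in enumerate(seq):
        stats[start + offset][base] += sign


def remove_duplicate(stats, start, seq):
    """Deduplication: subtract the contribution of a duplicate sequence."""
    apply_read(stats, start, seq, sign=-1)


def realign(stats, old_start, new_start, seq):
    """Realignment: move a read's contribution from its old position to its new one."""
    apply_read(stats, old_start, seq, sign=-1)
    apply_read(stats, new_start, seq, sign=1)


stats = new_stats()
apply_read(stats, 10, "AC")
apply_read(stats, 10, "AC")          # a duplicate of the first read
remove_duplicate(stats, 10, "AC")    # its contribution is subtracted
realign(stats, 10, 11, "AC")         # the remaining read shifts by one base
```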
10. An electronic device, including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-9.