CN103971031A - Read positioning method oriented to large-scale gene data - Google Patents

Publication number
CN103971031A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410185387.2A
Other languages
Chinese (zh)
Other versions
CN103971031B (en)
Inventor
杨明
涂金金
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Normal University
Original Assignee
Nanjing Normal University
Application filed by Nanjing Normal University
Priority to CN201410185387.2A
Publication of CN103971031A
Application granted
Publication of CN103971031B
Legal status: Expired - Fee Related

Abstract

The invention discloses a read positioning method oriented to large-scale gene data and belongs to the field of biological information analysis. The method comprises the following steps: randomly splitting the gene read data; load balancing the data; constructing a spatial index of the reads; positioning sub-reads that do not cross splice sites; positioning sub-reads that cross splice sites; splicing the sub-reads; and compiling statistics on the read positioning information. The method improves an existing read positioning method based on spaced seeds so that reads crossing splice sites can be handled, and it is designed and implemented as a parallel program under the MapReduce framework, which improves read positioning efficiency. In addition, the invention provides a load balancing scheme that handles the case in which a few nodes run for a long time and thereby reduce the efficiency of the whole task, so the method has high practical value.

Description

A read positioning method oriented to large-scale gene data
Technical field
The invention belongs to the field of biological information analysis and in particular relates to a read positioning method oriented to large-scale gene data.
Background art
Gene sequence matching (read positioning, also called read mapping) is an important research topic in bioinformatics and a key step in gene data analysis. Only after the reads have been positioned can the subsequent bioinformatics analyses be carried out, such as gene expression level estimation, alternative splicing event identification, and clustering. Read positioning has therefore received wide attention from researchers in bioinformatics.
The rapid development of next-generation gene sequencing technology has produced massive amounts of gene read data, which poses a great challenge to traditional read positioning algorithms. Running a traditional serial algorithm on a single-core machine over such huge read data sets falls far short of researchers' needs. Positioning tens of GB of gene read data typically takes about one week or even longer before the final result is obtained, which makes the whole gene data analysis workflow too slow. Current reads are generally long, and a traditional spaced seed algorithm that does not cross splice sites may miss some correct positions, so the positioning results are often not accurate enough. In addition, during parallel execution there is always some node whose execution time is longer, so the whole job waits for a few nodes, which seriously harms read positioning efficiency. Current research focuses on how to solve these problems.
A typical read positioning method that crosses splice sites comprises data set splitting, read splitting, positioning of sub-reads that do not cross splice sites, positioning of sub-reads that cross splice sites, and splicing. A good splitting method and efficient positioning of sub-reads across splice sites can, to some extent, eliminate load imbalance and reduce the time complexity of the algorithm, thereby improving read positioning efficiency.
Summary of the invention
To solve the low efficiency of gene read positioning and the load imbalance that arises when the data are split, the present invention proposes a read positioning method oriented to large-scale gene data that can effectively improve the efficiency of gene read positioning.
The technical solution adopted by the present invention is as follows:
A read positioning method oriented to large-scale gene data, characterized in that it comprises the following steps:
Step 1, random splitting of the gene read data: the given gene read data set is randomly split into a specified number of blocks;
Step 2, load balancing of the data: using the MapReduce framework, a small number of gene reads are randomly drawn from each of the split data blocks for a trial run, data blocks likely to have longer execution times are detected, and these blocks are re-split together with the other data blocks to balance the load;
Step 3, spatial index of the reads: using the MapReduce framework, each read is divided in the Map process into a specified number of sub-reads, spaced seed patterns are generated for each sub-read, and the hash value of each pattern is recorded in a hash table together with the corresponding read information, which builds the spatial index of the reads;
Step 4, positioning of sub-reads that do not cross splice sites: in the Map process, existing read positioning software is first used to position those sub-reads that can be positioned contiguously, and their positioning information is recorded;
Step 5, positioning of sub-reads that cross splice sites: after step 4, the sub-reads that can be positioned contiguously have positioning information; in the Map process, the positioning information of the sub-reads that cannot be positioned contiguously is inferred from the already positioned sub-reads, which realizes read positioning across splice sites;
Step 6, sub-read splicing: after step 5, the positioning information of all sub-reads into which the reads were divided has been recorded; in the Map process, all positioned sub-reads are assembled, and every combination of positions that can be joined into the original read is recorded;
Step 7, read positioning statistics: in the Reduce process, the read positioning information passed on from the Map process is aggregated, and the overall information about the reads positioned on the reference sequences is compiled.
The load balancing of the data in step 2 specifically comprises the following steps: the MapReduce framework splits the whole data set into N blocks by default, denoted $\{S_1, S_2, \ldots, S_N\}$; a certain number of gene reads are randomly drawn from each block to form a corresponding sub-block, the read positioning algorithm is run on the sub-blocks, and the blocks likely to have longer execution times are detected; the detected data blocks are re-split and merged with the other, unsplit data blocks respectively, and the result is finally re-split into the specified number of blocks.
The spatial index of the reads in step 3 specifically comprises the following steps: in the Map process, each read in the corresponding data subset is divided into M sub-reads, and seed patterns are created for each sub-read; taken together, the seed patterns necessarily exhaust all possible placements of the specified number of mismatched bases; each seed pattern is converted into a hash value, and the hash value and the read information are stored in a hash table as a key-value pair; during positioning, the information of a given read can then be obtained quickly from this table through its hash value, and this hash table is the spatial index of all reads.
The positioning of sub-reads across splice sites in step 5 specifically comprises the following steps: in each Map, according to the read length, the number of splice sites that a sub-read may cross is limited to a specified quantity; the maximum intron length is limited according to existing biological knowledge, which improves positioning efficiency; based on the already positioned sub-reads, the positions of the sub-reads that cannot be positioned contiguously are determined by extending and attempting to match; if all sub-reads of the original read can be positioned, the positioning information is kept; otherwise the read is discarded.
The sub-read splicing in step 6 specifically comprises the following steps: first, every sub-read records all possible positions to which it can be positioned on the reference sequences, and all this position information is stored in an array; then, starting from the first sub-read, the first position is taken from its position array and compared with all positions of the second sub-read to check whether it can be joined in order with one of them; if so, the check continues between that joined position of the second sub-read and all positions of the third sub-read, and so on until the last sub-read; if not, the check stops; the check then continues from the second position of the first sub-read, until all positions of the first sub-read have been examined; finally, all read positioning information that can be completely matched is extracted.
The present invention is a method proposed specifically for gene read positioning. Compared with the prior art, the present invention has the following features:
(1) To address the large volume of existing gene data and the low efficiency of traditional serial gene read positioning algorithms, the present invention proposes a parallel read positioning algorithm and implements it with the MapReduce framework, which improves the efficiency of gene read positioning without affecting its correctness;
(2) The present invention improves the spaced seed based read positioning algorithm that does not cross splice sites so that it can handle reads that cross splice sites; in addition, biological knowledge (the intron length and the number of splice sites) is added to shrink the search space of the algorithm; the improved positioning algorithm has lower time complexity, which further improves the efficiency of gene read positioning;
(3) The present invention specifically considers load balancing during read positioning and proposes a practical solution that effectively handles the case in which a few nodes run too long and reduce overall efficiency, so the method has high practical value.
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the sub-flowchart of the data load balancing step of the present invention;
Fig. 3 is the sub-flowchart of the step of positioning sub-reads that do not cross splice sites;
Fig. 4 is the sub-flowchart of the step of positioning sub-reads that cross splice sites.
Detailed description of the embodiments
Specific embodiments of the present invention are described below with reference to the drawings.
As shown in Fig. 1, the invention discloses a read positioning method oriented to large-scale gene data, with the following specific steps:
Step 1, random splitting of the gene read data: the given gene read data set is randomly split into a specified number of blocks;
Step 2, load balancing of the data: using the MapReduce framework, a small number of gene reads are randomly drawn from each of the split data blocks for a trial run, data blocks likely to have longer execution times are detected and re-split, the resulting pieces are added to the remaining data blocks respectively, and finally the remaining data blocks are split again to balance the load;
Step 3, spatial index of the reads: using the MapReduce framework, each read is divided in the Map process into a specified number of sub-reads, spaced seed patterns are generated for each sub-read, and the hash value of each pattern is recorded in a hash table together with the corresponding read information, which builds the spatial index of the reads;
Step 4, positioning of sub-reads that do not cross splice sites: in the Map process, existing read positioning software is first used to position those sub-reads that can be positioned contiguously, and their positioning information is recorded;
Step 5, positioning of sub-reads that cross splice sites: in the Map process, the positioning information of the sub-reads that cannot be positioned contiguously is inferred from the already positioned sub-reads, which realizes read positioning across splice sites;
Step 6, sub-read splicing: in the Map process, all positioned sub-reads are assembled, and every combination of positions that can be joined into the original read is recorded;
Step 7, read positioning statistics: in the Reduce process, the read positioning information passed on from the Map process is aggregated, and the overall information about the reads positioned on the reference sequences, such as the number of positions and their locations, is compiled.
Assume that 2 mismatches are allowed during read positioning.
In step 1, the random splitting of the gene read data can be implemented with the MapReduce framework. Suppose the read data set is to be randomly split into N blocks: the number of Reduce tasks is set to N, and a random generator that produces integers between 1 and N is set up in Map. Each time a read is traversed, a random number is generated; this number identifies a Reduce task, and the read is assigned to that task. Finally, the reads only need to be taken out in each Reduce to complete the data splitting.
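The random assignment described above can be sketched in plain Python, with the Map and Reduce stages simulated by two functions. This is a minimal illustration rather than the Hadoop implementation; the names `map_phase`, `reduce_phase`, and `random_split` are hypothetical.

```python
import random
from collections import defaultdict

def map_phase(reads, n_blocks, rng):
    """Emit (block_id, read) pairs; block_id is drawn uniformly from 1..n_blocks."""
    for read in reads:
        yield rng.randint(1, n_blocks), read

def reduce_phase(pairs):
    """Group reads by block id, as the shuffle and Reduce stages would."""
    blocks = defaultdict(list)
    for block_id, read in pairs:
        blocks[block_id].append(read)
    return dict(blocks)

def random_split(reads, n_blocks, seed=0):
    """Randomly split reads into at most n_blocks blocks (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return reduce_phase(map_phase(reads, n_blocks, rng))
```

Because each read draws its block number independently, block sizes are only approximately equal; that is one reason a separate load balancing step follows.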
In step 2, the MapReduce framework splits the whole data set into N blocks by default, denoted $\{S_1, S_2, \ldots, S_N\}$. A certain number of gene reads are randomly drawn from each block (64 MB) to form a corresponding sub-block, the read positioning algorithm is run on the sub-blocks, and the blocks likely to have longer execution times are detected (suppose the detected blocks are $S_1$ and $S_2$).
The data blocks are then re-split to balance the load: $S_1$ and $S_2$ are each split into N pieces, denoted $\{S_{11}, S_{12}, \ldots, S_{1N}\}$ and $\{S_{21}, S_{22}, \ldots, S_{2N}\}$ respectively. The other N-2 blocks are merged and then re-split into N pieces, denoted $\{S_{31}, S_{32}, \ldots, S_{3N}\}$. $S_{11}$, $S_{21}$, and $S_{31}$ are merged into $S'_1$; applying the same merge to the remaining pieces yields the N final blocks, denoted $S'_1, S'_2, \ldots, S'_N$.
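The re-split and merge just described can be illustrated with a short Python sketch. It assumes each block is an in-memory list of reads and that the heavy blocks have already been identified; `split_evenly` and `rebalance` are hypothetical helper names, not part of the patent.

```python
def split_evenly(items, n):
    """Split a list into n contiguous pieces whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        pieces.append(items[start:end])
        start = end
    return pieces

def rebalance(blocks, heavy_ids):
    """blocks: list of N lists of reads; heavy_ids: indices of the slow blocks."""
    n = len(blocks)
    # Split every heavy block into N pieces.
    heavy_pieces = [split_evenly(blocks[i], n) for i in heavy_ids]
    # Merge all other blocks and split the result into N pieces as well.
    rest = [r for i, b in enumerate(blocks) if i not in heavy_ids for r in b]
    rest_pieces = split_evenly(rest, n)
    # The j-th new block is the union of the j-th piece from every source.
    new_blocks = []
    for j in range(n):
        merged = list(rest_pieces[j])
        for pieces in heavy_pieces:
            merged.extend(pieces[j])
        new_blocks.append(merged)
    return new_blocks
```

After this step every new block holds an equal share of the reads from each heavy block, so no single node receives all of the expensive reads.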
The random extraction of a certain number of samples from each block can be implemented as one MapReduce job, with the following steps:
Map stage:
(1) read the read data set from the given path;
(2) while (the number of extracted reads has not reached the required number)
① randomly generate a read index;
② send the randomly drawn read to Reduce;
③ delete the read that has been drawn;
④ increase the number of extracted reads by 1;
(3) end while.
Reduce stage:
(1) traverse the set of positioning counts;
① output the reference sequence name and the final positioning count;
(2) end.
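The sampling loop of the Map stage above amounts to drawing reads without replacement. A minimal Python sketch, assuming the block fits in memory as a list (the function name `sample_reads` is hypothetical):

```python
import random

def sample_reads(block, sample_size, seed=0):
    """Draw up to sample_size reads from a block without replacement."""
    rng = random.Random(seed)
    pool = list(block)                      # copy so the original block is untouched
    sample = []
    while len(sample) < sample_size and pool:
        idx = rng.randrange(len(pool))      # random read index
        sample.append(pool.pop(idx))        # take the read and delete it from the pool
    return sample
```

The extra `and pool` condition simply guards against asking for more reads than the block contains.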
The flow of step 2 is shown in Fig. 2. A portion of the reads is randomly drawn from each read data block to form a test data set; under the MapReduce framework, the read positioning algorithm is run with these data subsets as input in order to detect the few nodes that would affect the overall execution time; the execution time of each Map node is inspected through the web interface, and the nodes that are likely to be heavily loaded are inferred from how long they run; the read data sets on the heavily loaded nodes are re-split; and the re-split subsets are added to the other, unsplit data sets to balance the load.
In step 3, in the Map process, every read in the corresponding data subset is divided into M sub-reads.
Seed patterns are created for each sub-read. The main idea is as follows: suppose the sub-read is further divided into k contiguous, non-overlapping parts and at most m mismatches are allowed (m < k). Then, for every match with no more than the allowed number of mismatches, at least k-m of the k parts contain no mismatch. Since these k-m parts are chosen from the k parts, there are $\binom{k}{k-m}$ possible combinations, so the number of seed patterns to choose from is $\binom{k}{k-m}$ (for example, k = 4 and m = 2 give 6 patterns). Taken together, the seed patterns necessarily exhaust all possible placements of the specified number of mismatched bases.
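The counting argument above can be checked with a few lines of Python. The sketch enumerates which k-m parts must match exactly and renders each pattern as a masked copy of the sub-read; the '-' mask character and the function names are illustrative assumptions, not SeqMap's internal representation.

```python
from itertools import combinations

def seed_patterns(k, m):
    """Every choice of k - m parts (out of k) that must match exactly."""
    return list(combinations(range(k), k - m))

def seed_keys(sub_read, k, m):
    """For each pattern, keep the chosen parts and mask the other parts with '-'."""
    part = len(sub_read) // k
    cuts = [i * part for i in range(k)] + [len(sub_read)]   # last part takes the remainder
    parts = [sub_read[cuts[i]:cuts[i + 1]] for i in range(k)]
    keys = []
    for keep in seed_patterns(k, m):
        keys.append("".join(p if i in keep else "-" * len(p)
                            for i, p in enumerate(parts)))
    return keys
```

For k = 4 and m = 2, `seed_patterns` returns 6 combinations, which agrees with the 6 seed patterns per sub-read stated in the description.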
Each seed pattern is converted into a hash value, and the hash value and the corresponding original read information are stored in a hash table as a key-value pair. During positioning, the information of a given read can then be obtained quickly from this table through its hash value; this hash table is the spatial index of all reads.
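A hedged sketch of such an index, using a Python dictionary as the hash table: the masked seed strings serve as keys (Python hashes them internally) and the values are lists of sub-read identifiers. The key format and the names `seed_keys` and `build_index` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def seed_keys(seq, k=4, m=2):
    """Masked copies of seq: k - m parts kept, m parts replaced by '-'."""
    part = len(seq) // k
    cuts = [i * part for i in range(k)] + [len(seq)]
    parts = [seq[cuts[i]:cuts[i + 1]] for i in range(k)]
    return ["".join(p if i in keep else "-" * len(p) for i, p in enumerate(parts))
            for keep in combinations(range(k), k - m)]

def build_index(sub_reads, k=4, m=2):
    """sub_reads: dict mapping a sub-read id to its sequence."""
    index = defaultdict(list)
    for read_id, seq in sub_reads.items():
        for key in seed_keys(seq, k, m):
            index[key].append(read_id)      # one entry per seed pattern
    return index
```

Sub-reads that share their unmasked parts land in the same bucket, so a single lookup returns every candidate that still needs full verification.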
In step 4, in the Map process, existing read positioning software is first used to position those sub-reads that can be positioned contiguously, and their positioning information is recorded. The present invention adopts the read positioning algorithm of the SeqMap software: each read is divided into 4 sub-reads and 2 mismatches are allowed, so each sub-read has 6 seed patterns. The spatial index of all sub-reads is built, all reference sequences are traversed, and the sub-reads that can be positioned contiguously are positioned on the reference sequences.
The flow of step 4 is shown in Fig. 3. Each sub-read is first split again (into 4 segments in the present invention) and 6 spaced seed patterns are created; all seed patterns are then converted into hash values, which are stored in a hash table together with the corresponding read information to form the spatial index of all sub-reads; each base of the reference sequence is traversed in turn; it is determined whether the hash value formed in step 15 exists in the spatial index; if it does, the corresponding read is taken out and matched further against the bases of the reference sequence; if it does not, this matching attempt is abandoned; the traversal then continues with the next base.
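The scan described above can be sketched as follows. This is a simplified stand-in for SeqMap's algorithm, not a reproduction of it: it assumes all sub-reads have the same length, uses masked strings as dictionary keys, and verifies every candidate by counting mismatches. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

K, M = 4, 2  # parts per sub-read and allowed mismatches, as in the description

def seed_keys(seq):
    """Masked copies of seq: K - M parts kept, M parts replaced by '-'."""
    part = len(seq) // K
    cuts = [i * part for i in range(K)] + [len(seq)]
    parts = [seq[cuts[i]:cuts[i + 1]] for i in range(K)]
    return ["".join(p if i in keep else "-" * len(p) for i, p in enumerate(parts))
            for keep in combinations(range(K), K - M)]

def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))

def locate(sub_reads, reference):
    """Return sorted (read_id, position, mismatch_count) hits with at most M mismatches."""
    length = len(next(iter(sub_reads.values())))   # assumes equal-length sub-reads
    index = defaultdict(set)
    for read_id, seq in sub_reads.items():
        for key in seed_keys(seq):
            index[key].add(read_id)
    hits = set()
    for pos in range(len(reference) - length + 1):  # slide over every reference base
        window = reference[pos:pos + length]
        for key in seed_keys(window):
            for read_id in index.get(key, ()):      # candidates sharing a seed
                d = mismatches(sub_reads[read_id], window)
                if d <= M:
                    hits.add((read_id, pos, d))
    return sorted(hits)
```

Any sub-read within M mismatches of a reference window agrees with it exactly on at least K - M parts, so at least one of its seed keys is found in the index at that window.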
In step 5, after the positioning of step 4, the sub-reads that can be positioned contiguously have positioning information.
In each Map, according to the read length, the number of splice sites that a sub-read may cross is limited to a specified quantity (for example, for reads of up to 84 bp, at most 2 splice sites may be crossed), because the more splice sites are crossed, the more time the algorithm needs to examine the additional possible positionings. The maximum intron length is limited according to existing biological knowledge (for example, very few introns in mammalian genomes are known to be longer than 400 kb), which shrinks the search space of the algorithm and improves positioning efficiency.
Based on the already positioned sub-reads, the positions of the sub-reads that cannot be positioned contiguously are determined by extending and attempting to match. Read positioning across splice sites falls into two cases:
The first case is that the sub-reads immediately upstream and downstream of the unpositioned sub-read have both been positioned; this case is relatively easy to handle. In the present invention, matching starts from the last base of the upstream sub-read and from the first base of the downstream sub-read respectively, and the positions reached with 0, 1, and 2 mismatches are recorded for each direction; among the recorded positions, the one at which the two extensions can be joined with the fewest mismatches is taken as the splice site. This method finds the splice site in a single traversal, which reduces the time complexity of this step to O(N).
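One way to realize the first case in linear time is sketched below. It is a simplification under stated assumptions: the unpositioned sub-read contains exactly one splice site (called a shearing position in some translations), its prefix continues directly after the upstream sub-read, and its suffix ends directly before the downstream sub-read. Instead of stopping at the second mismatch, the sketch keeps full prefix and suffix mismatch counts. The name `find_splice_point` and its parameters are hypothetical.

```python
def find_splice_point(seq, ref, up_end, down_start, max_mis=2):
    """Find where seq splits between the upstream and downstream anchors.

    up_end:     reference index just after the positioned upstream sub-read
    down_start: reference index where the positioned downstream sub-read begins
    Returns (split, intron_start, intron_end, mismatches) or None.
    Assumes up_end + len(seq) <= len(ref) and down_start >= len(seq).
    """
    n = len(seq)
    # pre[i]: mismatches when seq[:i] is aligned right after the upstream sub-read
    pre = [0] * (n + 1)
    for i in range(n):
        pre[i + 1] = pre[i] + (seq[i] != ref[up_end + i])
    # suf[i]: mismatches when seq[i:] is aligned right before the downstream sub-read
    suf = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suf[i] = suf[i + 1] + (seq[i] != ref[down_start - n + i])
    # One pass over the candidate split points picks the fewest total mismatches.
    best = min(range(n + 1), key=lambda i: pre[i] + suf[i])
    total = pre[best] + suf[best]
    if total > max_mis:
        return None
    return best, up_end + best, down_start - (n - best), total
```

Both arrays are filled in one pass each and the minimum is found in a third pass, so the work is linear in the sub-read length.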
The second case is that only one of the upstream and downstream sub-reads of the unpositioned sub-read has been positioned. For this case, the present invention cuts a segment of a certain length from the end of the unpositioned sub-read, denoted the h-mer; the position of the h-mer relative to the upstream or downstream sub-read is detected first, and the read is then positioned with the procedure of the first case. The length of the h-mer is chosen differently from general algorithms, which fix it directly at 2. The present invention obtains the length of the h-mer dynamically: the positioned upstream or downstream sub-read is first extended by matching until the position where 2 mismatches have occurred, the length L of this extension is recorded, and the h-mer length is taken as the length of the unpositioned sub-read minus L. In this way the number of matching attempts for the unpositioned sub-read is reduced as far as possible, and the time complexity of the improved algorithm drops from the original $O(N^3)$ to $O(N^2)$.
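The dynamic choice of the h-mer length can be sketched as follows, assuming the upstream sub-read is the positioned one. The exact-match search with `str.find`, the 400 kb default taken from the intron limit mentioned in the description, and the names `extend_until` and `locate_hmer` are illustrative assumptions rather than the patented procedure itself.

```python
def extend_until(seq, ref, start, max_mis=2):
    """Number of bases of seq that extend along ref from `start`
    before the max_mis-th mismatch occurs."""
    mis = 0
    for i, base in enumerate(seq):
        if start + i >= len(ref):
            return i
        if base != ref[start + i]:
            mis += 1
            if mis == max_mis:
                return i
    return len(seq)

def locate_hmer(seq, ref, up_end, max_intron=400_000):
    """Return (L, position of the h-mer in ref or None)."""
    L = extend_until(seq, ref, up_end)
    h = len(seq) - L                      # dynamic h-mer length
    if h == 0:
        return L, None                    # the sub-read extends contiguously
    h_mer = seq[L:]                       # the last h bases of the sub-read
    window_end = min(len(ref), up_end + L + max_intron + h)
    pos = ref.find(h_mer, up_end + L, window_end)
    return L, (pos if pos != -1 else None)
```

Once the h-mer is placed, both sides of the unpositioned sub-read are anchored, so the single-pass procedure of the first case can refine the exact splice site.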
After the positioning across splice sites described above, if all sub-reads of the original read can be positioned, the positioning information is kept; otherwise the read is discarded.
The flow of step 5 is shown in Fig. 4. The sub-reads that can be positioned contiguously are positioned with existing read positioning software based on spaced seeds, and their positions are recorded; the unpositioned sub-reads are inferred from the already positioned sub-reads. In the first possible case, both the upstream and the downstream sub-read of the unpositioned sub-read have been positioned successfully, and the position of the unpositioned sub-read is determined by extending the upstream and downstream sub-reads. In the second possible case, one of the upstream and downstream sub-reads has not been positioned; a segment of a certain length is then cut from the end of the unpositioned sub-read and its position relative to the upstream or downstream sub-read is detected, which converts the second case into the first, and the positioning is completed with the procedure of the first case. The splicing step then analyzes the positioning information of all sub-reads and assembles it into complete read positioning information.
In step 6, the positioning information of all sub-reads into which the reads were divided has been recorded after step 5.
First, every sub-read records all possible positions to which it can be positioned on the reference sequences (possibly more than one); all this position information can be stored in an array.
Then, starting from the first sub-read, the first position is taken from its position array and compared with all positions of the second sub-read to check whether it can be joined in order with one of them; if so, the check continues between that joined position of the second sub-read and all positions of the third sub-read, and so on until the last sub-read; if not, the check stops. The check then continues from the second position of the first sub-read, until all positions of the first sub-read have been examined.
Finally, all read positioning information that can be completely matched is extracted.
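The chain check described above can be sketched as a depth-first search over the position arrays. The sketch assumes equal-length sub-reads identified by their start positions, and treats two sub-reads as joinable when the next one starts at or after the end of the previous one with a gap no larger than a maximum intron length; the function name `chains` and this joining rule are illustrative assumptions.

```python
def chains(position_lists, sub_len, max_intron=400_000):
    """Enumerate all position tuples, one per sub-read, that join in order.

    position_lists[i] holds the candidate start positions of the i-th sub-read.
    """
    results = []

    def extend(chain, depth):
        if depth == len(position_lists):
            results.append(tuple(chain))          # every sub-read has been joined
            return
        for pos in position_lists[depth]:
            gap = pos - (chain[-1] + sub_len) if chain else 0
            if 0 <= gap <= max_intron:            # contiguous, or separated by an intron
                chain.append(pos)
                extend(chain, depth + 1)
                chain.pop()

    extend([], 0)
    return results
```

An empty result means that the sub-reads cannot be joined into the original read, which corresponds to discarding the read.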
In step 7, in the Reduce process, the read positioning information passed on from the Map process is aggregated, and the overall information about the reads positioned on the reference sequences, such as the number of positions and their locations, is compiled.
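A minimal sketch of this aggregation, assuming each Map output record is a (reference name, read name, position) tuple; the record layout and the function name `summarize` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def summarize(hits):
    """hits: iterable of (reference_name, read_name, position) records."""
    counts = Counter()
    positions = defaultdict(list)
    for ref_name, read_name, pos in hits:
        counts[ref_name] += 1                 # number of reads positioned on this reference
        positions[ref_name].append(pos)
    return {name: {"count": counts[name], "positions": sorted(positions[name])}
            for name in counts}
```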
Example:
The cloud computing platform used in this example has 21 servers (1 master node and 20 slave nodes), all configured identically. Each node has an Intel(R) Xeon(R) E5620 CPU with a clock frequency of 2.40 GHz and runs a 64-bit Debian Linux operating system. The Hadoop version is hadoop-0.20.2.
The gene data consist of 10 genes extracted from the Arabidopsis thaliana (mouse-ear cress) genome as reference sequences, with an average length of about 12,000 bp. The read data sets come from two sources: ① Arabidopsis was grown in ES nutrient solution and the experiment was repeated 3 times, giving Arabidopsis RNA from three replicates, which was sequenced to obtain 3 read sample sets. ② Arabidopsis was grown under phosphorus-deficient conditions, again with 3 repetitions, giving 3 read sample sets. The read data sets from the two conditions total 25.2 GB; the 6 read sample sets are named ES1, ES2, ES3, P1, P2, and P3.
This example comprises the following parts:
1. Random splitting of the gene read data:
The random splitting of the gene read data can be implemented with the MapReduce framework. Suppose the read data set is to be randomly split into N blocks: the number of Reduce tasks is set to N, and a random generator that produces integers between 1 and N is set up in Map. Each time a read is traversed, a random number is generated; this number identifies a Reduce task, and the read is assigned to that task. Finally, the reads only need to be taken out in each Reduce to complete the data splitting. The main code is as follows:
Map stage:
(1) read_data = get_read_dataset(read_path);  // load the read data set
(2) random = (int)(Math.random() * N) + 1;  // random block number in 1..N, drawn for each read
(3) context.write(random, read_info);  // assign the read to the corresponding Reduce
Reduce stage:
(1) for (count = 0; count < read_list.length; count++)
① read = read_list.get(count);
② context.write(task_num, read);  // output the read into this block
(2) end.
2. Load balancing of the data:
The MapReduce framework splits the whole data set into N blocks by default, denoted $\{S_1, S_2, \ldots, S_N\}$. A certain number of gene reads are randomly drawn from each block (64 MB) to form a corresponding sub-block, the read positioning algorithm is run on the sub-blocks, and the blocks likely to have longer execution times are detected.
The main code is as follows:
(1) split_to_N-2(S1, S2, N-2);  // S1 and S2 are the blocks with potentially higher load
(2) add(list_S1, list_S2, N-2);  // add the re-split pieces to the other N-2 blocks
(3) resplit(read_N-2_path, N);  // re-split the N-2 blocks into N blocks
The random extraction of a certain number of samples from each block can be implemented as one MapReduce job, with the following steps:
Map stage:
(1) read_data = get_readdata(read_path);
(2) while (read_num < fix_readNum)  // until the required number of reads has been drawn
① random = (int)(Math.random() * read_data.length);  // random read index
② read = read_data.get(random);
③ context.write(fix_reduce, read);
④ read_data.remove(random);  // delete the read that has been drawn
⑤ read_num++;
(3) end while.
Reduce stage:
(1) for (count = 0; count < read_data.length; count++)
① context.write(read_name, map_num);
(2) end.
3. Spatial index of the reads:
In the present invention, the read length is 84 bp, and a read is allowed to cross at most 2 splice sites (according to experience in bioinformatics, very few 84 bp reads cross more than 2 splice sites). Therefore, in the Map process, every read in the corresponding data subset is divided into 4 sub-reads (a moderate number).
Seed patterns are created for each sub-read. The seed patterns come from the SeqMap software; for details see: Jiang H, Wong W H. SeqMap: mapping massive amount of oligonucleotides to the genome [J]. Bioinformatics, 2008, 24(20): 2395-2396.
The main code is as follows:
Map stage:
(1) list_sub_read = split_read(read, fix_read_num);  // divide the read into the specified number of segments
(2) list_seed_modes = create_modes(list_sub_read);  // create the seed patterns
(3) for (count = 0; count < list_seed_modes.length; count++)
① hashcode = hashcode(list_seed_modes.get(count));
② hash_map.put(hashcode, read_info);
// store the hash value and the corresponding original read information in the hash table
(4) end.
4. the son section of reading is not crossed over and is sheared location, position:
In Map process, utilize the existing section of reading positioning software first to locate carrying out the section of reading of location continuously in the son section of reading, record location information.What in the present invention, adopt is the section of the reading location algorithm in SeqMap software, and the section of reading is divided into 4 son sections of reading, 2 of mispairing, and the spermotype of every height section of reading just has 6 kinds.Build the spatial index of all son sections of reading, travel through all reference sequences, the son section of reading that can locate continuously navigates on reference sequences.
The main code is as follows:
(1)for(int count=0;count<ref_list.length;count++) // traverse the reference sequence data set
①ref=ref_list.get(count);
②for(int loc=0;loc+84<=ref.length;loc++) // traverse each base of the reference sequence
a.seed_mode=create_mode(ref.substring(loc,loc+84));
b.if(hash_map.get(hashcode(seed_mode))!=null)match(ref,read);
③end;
(2)end.
The code of the match() function is as follows:
(1)matched=true;
(2)for(int count=0;count<read.length;count++)
①if(!read.get(count).equals(ref.get(count))){matched=false;break;}
(3)end;
(4)if(matched)context.write(read_name,one); // emit the read only when every base matches
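The traversal of step 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the SeqMap implementation: all sub-reads are assumed to have the same length, a dict replaces the hash table, candidates found through a seed key are verified by counting substitutions, and the names seed_keys, hamming and locate are hypothetical.

```python
from itertools import combinations

PARTS, MISMATCHES = 4, 2  # values stated in the description


def seed_keys(seq):
    """Masked keys: one per choice of PARTS - MISMATCHES parts kept exact."""
    size, extra = divmod(len(seq), PARTS)
    cuts = [0]
    for i in range(PARTS):
        cuts.append(cuts[-1] + size + (1 if i < extra else 0))
    pieces = [seq[cuts[i]:cuts[i + 1]] for i in range(PARTS)]
    return ["".join(p if i in kept else "-" * len(p)
                    for i, p in enumerate(pieces))
            for kept in combinations(range(PARTS), PARTS - MISMATCHES)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def locate(sub_reads, reference):
    """Return {name: [positions]} for sub-reads found with <= MISMATCHES substitutions."""
    if not sub_reads:
        return {}
    length = len(next(iter(sub_reads.values())))
    index = {}
    for name, seq in sub_reads.items():
        for key in seed_keys(seq):
            index.setdefault(key, set()).add(name)
    hits = {}
    for loc in range(len(reference) - length + 1):
        window = reference[loc:loc + length]
        candidates = set()
        for key in seed_keys(window):
            candidates |= index.get(key, set())
        # A shared seed key only proposes a candidate; verify it base by base.
        for name in candidates:
            if hamming(sub_reads[name], window) <= MISMATCHES:
                hits.setdefault(name, []).append(loc)
    return hits
```

Sub-reads that are absent from the returned dict could not be located contiguously and are left for step 5.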
5. Locating sub-reads that span splice sites:
After the locating in step 4, the sub-reads that can be located contiguously have obtained location information. Based on the information of the located sub-reads, the positions of the sub-reads that cannot be located contiguously are determined by extending from the located sub-reads and attempting to match.
The main code is as follows:
(1)if(pre_subread.isFixed&&next_subread.isFixed) // first case: the preceding and the following sub-read are both located
①brige_match(read,ref);
(2)end;
(3)if(!next_subread.isFixed) // second case: the following sub-read is not located
①h_mer=read.substring(read.length-L); // obtain h_mer dynamically
②loc=search(h_mer,ref); // search for the position of h_mer
③brige_match(h_mer,ref);
(4)end.
The code of the brige_match() function is as follows:
(1)for(int loc=0;loc<read.length;loc++) // extend forward from the preceding sub-read
①if(!read.get(loc).equals(ref.get(pre_subread.length+loc)))
a.mis_num++;
b.list_loc.add(loc);
②if(mis_num>2)break;
(2)end;
(3)index=1;
(4)for(int loc=read.length-1;loc>=0;loc--) // extend backward from the following sub-read
①if(!read.get(loc).equals(ref.get(next_subread.loc-index)))
a.mis_num++;
b.list_loc.add(loc);
②if(mis_num>2)break;
③index++;
(5)end;
(6)spliced_site=find_min_mistake(list_loc). // find the junction with the fewest mismatches
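One possible way to find the junction with the fewest mismatches is sketched below in Python. It is a simplified variant rather than the procedure of the invention: instead of stopping each extension after 2 mismatches, it counts mismatches for every prefix aligned after the preceding sub-read and every suffix aligned before the following sub-read, then picks the best split. The function name bridge_split and its parameters left_start and right_end are hypothetical.

```python
def bridge_split(sub_read, reference, left_start, right_end, max_mismatches=2):
    """Find where a sub-read crosses a splice junction.

    left_start: reference index at which the sub-read would begin if it
                simply continued after the preceding located sub-read.
    right_end:  reference index (exclusive) at which the sub-read would end
                if it ran directly into the following located sub-read.
    Returns (split, mismatches), meaning sub_read[:split] aligns on the left
    side and sub_read[split:] on the right side, or None if every split
    needs more than max_mismatches substitutions.
    """
    n = len(sub_read)
    right_start = right_end - n
    if left_start < 0 or right_start < 0:
        return None
    if left_start + n > len(reference) or right_end > len(reference):
        return None

    # prefix_mm[k]: mismatches when the first k bases are aligned on the left.
    prefix_mm = [0]
    for i in range(n):
        prefix_mm.append(prefix_mm[-1] + (sub_read[i] != reference[left_start + i]))

    # suffix_mm[k]: mismatches when bases k..n-1 are aligned on the right.
    suffix_mm = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix_mm[i] = suffix_mm[i + 1] + (sub_read[i] != reference[right_start + i])

    split = min(range(n + 1), key=lambda k: prefix_mm[k] + suffix_mm[k])
    total = prefix_mm[split] + suffix_mm[split]
    return (split, total) if total <= max_mismatches else None
```

In practice the distance between the two sides would also be checked against the maximum intron length before the result is accepted.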
6. Sub-read splicing: the location information of all reads that can be completely matched is extracted.
7. Statistics are compiled on the overall information of the reads located on the reference sequences.
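The splicing of step 6 can be illustrated with the following Python sketch, which follows the chaining procedure set out in claim 5: starting from each candidate position of the first sub-read, it checks whether a position of the next sub-read follows on from it, and continues to the last sub-read. The name chain_positions and the parameter max_gap (the largest number of reference bases that may be skipped between consecutive sub-reads, 0 meaning strictly adjacent) are assumptions of this sketch.

```python
def chain_positions(positions, lengths, max_gap=0):
    """Return all chains of positions, one per sub-read, that join end to start.

    positions[i]: candidate start positions of sub-read i on the reference.
    lengths[i]:   length of sub-read i.
    """
    if not positions:
        return []
    chains = []

    def extend(i, chain):
        if i == len(positions):
            chains.append(tuple(chain))  # every sub-read has been joined
            return
        expected = chain[-1] + lengths[i - 1]  # where sub-read i-1 ends
        for pos in positions[i]:
            if 0 <= pos - expected <= max_gap:
                extend(i + 1, chain + [pos])

    for first in positions[0]:
        extend(1, [first])
    return chains
```

A read whose sub-reads produce no chain is discarded; each returned chain is one location of the original read, and these are the records that the Reduce stage of step 7 would aggregate.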
The read positioning method oriented to large-scale gene data provided by the present invention has been described in detail above. It should be noted that there are many methods and ways of implementing this technical solution; the above is only a preferred embodiment of the present invention and serves only to help in understanding the method and core idea of the invention. Meanwhile, for those of ordinary skill in the art, modifications and adjustments made on the basis of the core idea of the present invention shall all be regarded as falling within the protection scope of the present invention. In summary, this description should not be construed as limiting the present invention, and the protection scope of the present invention shall be defined by the appended claims.

Claims (5)

1. A read positioning method oriented to large-scale gene data, characterized in that it comprises the following steps:
Step 1, random splitting of the gene read data: the given gene read data set is randomly split into a specified number of blocks;
Step 2, load balancing of the data: using the MapReduce framework, a small number of gene reads are randomly drawn from the split data blocks for a trial run, data blocks whose execution time may be long are detected, and these blocks are re-split together with the other data blocks to balance the load;
Step 3, spatial index of the reads: using the MapReduce framework, each read is divided into a specified number of sub-reads in the Map process, spaced seed patterns are generated for each sub-read, and the hash value of each pattern is recorded in a hash table together with the corresponding read information, thereby building the spatial index of the reads;
Step 4, locating sub-reads that do not span splice sites: in the Map process, existing read positioning software is used to first locate those sub-reads that can be located contiguously, and the location information is recorded;
Step 5, locating sub-reads that span splice sites: after the locating in step 4, the sub-reads that can be located contiguously have obtained location information; in the Map process, the location information of the sub-reads that cannot be located contiguously is inferred from the information of the located sub-reads, thereby realizing the positioning of reads that span splice sites;
Step 6, sub-read splicing: after step 5, the location information of the sub-reads into which all reads were divided has been recorded; in the Map process, all located sub-reads are assembled, and all location information that can be joined into the original read is recorded;
Step 7, read location statistics: in the Reduce process, the read location information passed on by the Map process is aggregated, and statistics are compiled on the overall information of the reads located on the reference sequences.
2. The read positioning method oriented to large-scale gene data according to claim 1, characterized in that the load balancing of the data in step 2 specifically comprises the following steps:
The MapReduce framework divides the whole data set into N blocks by default, denoted {S_1, S_2, ..., S_N}; a certain amount of gene data is randomly drawn from each block to form a corresponding sub-block, and the read positioning algorithm is run on it to detect blocks whose execution time may be long; the detected data blocks are split again and merged respectively with the other, unsplit data blocks, and the data are finally re-divided into the specified number of blocks.
3. The read positioning method oriented to large-scale gene data according to claim 1, characterized in that the spatial index of the reads in step 3 specifically comprises the following steps:
In the Map process, each read in the corresponding data-set sub-block is divided into M sub-reads, and seed patterns are created for each sub-read; the seed patterns necessarily exhaust all possible ways of containing the specified number of mismatched bases; each seed pattern is converted into a hash value, and the hash value and the read information are stored in a hash table as a key-value pair; during positioning, the information of a given read can be obtained quickly from this table through the hash value of the read, and this hash table is the spatial index of all reads.
4. The read positioning method oriented to large-scale gene data according to claim 1, characterized in that the locating of reads that span splice sites in step 5 specifically comprises the following steps:
In each Map, according to the length of the read, the number of splice sites that a sub-read can span is limited to a specified number; according to existing biological knowledge, the maximum intron length is limited, which improves locating efficiency;
Based on the information of the located sub-reads, the positions of the sub-reads that cannot be located contiguously are determined by extending and attempting to match; if all sub-reads of an original read can be located, the location information is retained; otherwise the read is discarded.
5. The read positioning method oriented to large-scale gene data according to claim 1, characterized in that the read splicing in step 6 specifically comprises the following steps:
First, each sub-read records all possible positions on the reference sequences at which it can be located, and all position information is stored in an array;
Then, starting from the first position in the position array of the first sub-read, that position is checked against all positions of the second sub-read to see whether it joins end to start with one of them; if so, the joined position of the second sub-read is checked in the same way against all positions of the third sub-read, and so on up to the last sub-read; if not, the check stops; the check then continues from the second position of the first sub-read, until all positions of the first sub-read have been examined;
Finally, the location information of all reads that can be completely matched is extracted.
CN201410185387.2A 2014-05-04 2014-05-04 Read positioning method oriented to large-scale gene data Expired - Fee Related CN103971031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410185387.2A CN103971031B (en) 2014-05-04 2014-05-04 Read positioning method oriented to large-scale gene data

Publications (2)

Publication Number Publication Date
CN103971031A true CN103971031A (en) 2014-08-06
CN103971031B CN103971031B (en) 2017-05-17

Family

ID=51240519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410185387.2A Expired - Fee Related CN103971031B (en) 2014-05-04 2014-05-04 Read positioning method oriented to large-scale gene data

Country Status (1)

Country Link
CN (1) CN103971031B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105243297A (en) * 2015-10-09 2016-01-13 人和未来生物科技(长沙)有限公司 Quick comparing and positioning method for gene sequence segments on reference genome
CN108287983A (en) * 2017-01-09 2018-07-17 朱瑞星 A kind of method and apparatus for carrying out compression and decompression to genome

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101457253A (en) * 2008-12-12 2009-06-17 深圳华大基因研究院 Sequencing sequence error correction method, system and device
CN101914628A (en) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 Method and system for detecting polymorphism locus of genome target region
JP2012003737A (en) * 2010-06-21 2012-01-05 Kosuke Amakatsu Method and system for classifying biological species based on codon information of gene
US20130116930A1 (en) * 2011-08-22 2013-05-09 The Board Of Trustees Of The Leland Stanford Junior University Method and System for Assessment of Regulatory Variants in a Genome
CN103392182A (en) * 2010-08-02 2013-11-13 人口诊断股份有限公司 Compositions and methods for discovery of causative mutations in genetic disorders

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
涂金金 等: "基于MapReduce的基因读段定位方法", 《模式识别与人工智能》 *

Also Published As

Publication number Publication date
CN103971031B (en) 2017-05-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170517
