CN110322927A

CN110322927A - A CRISPR guide RNA library design method

Info

Publication number: CN110322927A
Application number: CN201910712069.XA
Authority: CN
Inventors: 王建新; 李涛; 王劭恺; 严承; 李敏
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-08-02
Filing date: 2019-08-02
Publication date: 2019-10-11
Anticipated expiration: 2039-08-02
Also published as: CN110322927B

Abstract

The invention discloses a CRISPR guide RNA library design method comprising the following steps. Step 1: generate a kmer set from the reference genome. Step 2: split each kmer into two parts, kmer1 and kmer2, and group the kmer2 whose kmer1 are identical into one category; build the kmer2 of each category into one trie, whose key sequence is the kmer1 shared by the kmer2 it contains. Step 3: obtain the guide RNAs and their off-target sequences in parallel. In this step, when a kmer is compared with the kmers formed by joining the key sequence of a trie to the kmer2 stored in it, the kmer1 of the kmer is first compared with the key sequence of the trie to check whether a set condition is met; only if it is met does the comparison continue between the kmer2 of the kmer and the kmer2 in the trie. The invention improves computational efficiency.

Description

A CRISPR guide RNA library design method

Technical field

The invention belongs to the field of functional genomics, and in particular relates to a CRISPR guide RNA library design method.

Background technique

Among current-generation genome editing technologies, the CRISPR system (clustered regularly interspaced short palindromic repeats) and the Cas9 nuclease (CRISPR-associated RNA-guided nuclease 9) are developing the fastest. They can readily target almost any genomic location and represent a major step forward for genetic engineering. The Cas9 protein of the CRISPR system relies on base-pair complementarity between the guide RNA (gRNA, also called induction RNA) and the DNA target sequence to direct the complex of guide RNA and protein to the DNA target sequence. Binding to the protospacer adjacent motif (PAM) downstream of the target site helps direct Cas9 to cut the DNA double strand; the PAM is required for the Cas9 nuclease to cut the DNA double strand. The guide RNA is a key component of the CRISPR-Cas system and consists of a constant part and a variable part. The variable part is complementary to the DNA target sequence, so the guide RNA can be made to bind different DNA sites by engineering the variable part. CRISPR guide RNA libraries are essential for genome editing systems. As genome sequencing technology advances, guide RNA library design is becoming increasingly important for understanding genome function.

Many guide RNA design tools have been developed for genome editing, such as CRISPR Design, Cas-OFFinder, CRISPRscan and E-CRISP. However, these tools return guide RNA sets that are unique but overlapping, ignore the noncoding regions of the whole genome, and rely on third-party alignment tools to search for off-target sequences.

Recently, the GuideScan software improved the construction of CRISPR guide RNA databases. GuideScan is an open-source system that can design more comprehensive or fully customized guide RNA libraries from any genome or CRISPR endonuclease. GuideScan can also build single guide RNA and paired guide RNA databases, and it can obtain a considerable number of guide RNAs that have multiple perfect target sites. As a result, the guide RNAs obtained by GuideScan have, on average, higher specificity than those obtained by other tools. However, the computational cost of GuideScan is relatively high, and it becomes very high when computing the off-target sequences of guide RNAs (sequences whose number of mismatches with the guide RNA is not less than parameter M and not greater than parameter Q, with Q > M). Applying GuideScan to large genomes is therefore impractical. As more and more eukaryotic genomes are sequenced or resequenced, more efficient tools are needed to speed up the design of CRISPR guide RNA libraries.

Summary of the invention

The object of the invention is to provide a CRISPR guide RNA library design method that effectively reduces the computation time needed to design a CRISPR guide RNA library.

A CRISPR guide RNA library design method, comprising the following steps:

Step 1: scan the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix; these form the kmer set, that is, the targetable genome space;

Step 2: split each kmer in the kmer set into two parts, kmer1 and kmer2, where kmer1 is the sequence of its first n bases. Group the kmer2 whose corresponding kmer1 are identical into one category, and use the kmer1 corresponding to the kmer2 of a category as the key sequence of that category. Build the kmer2 of each category into one trie (dictionary tree), which yields multiple tries (the data task is thereby divided into a series of small tasks); the key sequence of each trie is the key sequence of the kmer2 of the corresponding category;

For each kmer in the kmer set: if it has a non-standard PAM as prefix or suffix, or it occurs more than once in the reference genome, or there exists a kmer whose Hamming distance to it is less than M, then it is a non-guide RNA; otherwise it is a guide RNA;

Step 3: classify all guide RNAs by kmer1 and, for the guide RNAs of every category, traverse all tries to search for kmers whose Hamming distance to them is not greater than Q. For the guide RNAs of one category, the search is carried out as follows. First, compute the Hamming distance m between the kmer1 of the guide RNAs of this category and the key sequence of each trie, and select the tries whose key sequence is within Hamming distance Q of that kmer1. Then, for each guide RNA of the category, search each selected trie for kmer2 whose Hamming distance to the kmer2 of the guide RNA is not greater than Q-m. Join each such kmer2 to the key sequence of its trie to form kmers; the sequences in the reference genome that are complementary to these kmers are the off-target sequences of the guide RNA. (The DNA in the reference genome is double-stranded, and each base on one strand pairs with the complementary base at the corresponding position on the other strand. If the Hamming distance between a kmer on one strand and the guide RNA is not less than M and not greater than Q, then the number of mismatches between the guide RNA and the kmer at the corresponding position on the other strand is likewise not less than M and not greater than Q, and that sequence is an off-target sequence of the guide RNA. The coordinates of each scanned kmer on the DNA sequence of the reference genome are recorded in Step 1; in this step, once a kmer meeting the condition has been found, its coordinates allow the kmer at the corresponding position on the DNA sequence, that is, its complementary partner, to be located quickly.);

The guide RNAs and their off-target sequence information obtained in this way constitute the CRISPR guide RNA library.

The steps above use a data-preprocessing optimization. Because the guide RNAs of each category share the same kmer1, and the kmer2 in each trie share the same key sequence, the guide RNAs are first classified and the kmer1 of a category is then compared with the key sequence of each trie. This greatly reduces the computation and time needed to compare guide RNAs with kmers, and avoids a full one-by-one comparison of every guide RNA with every kmer. Moreover, when the Hamming distance m between the kmer1 of a category of guide RNAs and the key sequence of a trie is greater than Q, the kmer2 of that category need not be compared with the kmer2 in that trie at all, which reduces computation and time further.
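The key-level prefilter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: each category is held as a plain list of kmer2 rather than a trie, and the names `hamming` and `find_off_targets` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def find_off_targets(grna, categories, n, M, Q):
    """Return (kmer, distance) pairs with M <= distance <= Q.

    `categories` maps a key sequence (kmer1) to the list of kmer2 that
    share it; a list stands in for the trie to keep the sketch short.
    """
    kmer1, kmer2 = grna[:n], grna[n:]
    results = []
    for key, members in categories.items():
        m = hamming(kmer1, key)
        if m > Q:
            continue  # whole category excluded without looking at its kmer2
        for other in members:
            d = m + hamming(kmer2, other)
            if M <= d <= Q:
                results.append((key + other, d))
    return results
```

With `categories = {"AA": ["CCCC", "CCGG"], "TT": ["CCCC"], "AT": ["CCCG"]}`, the call `find_off_targets("AACCCC", categories, n=2, M=2, Q=2)` inspects all three categories, whereas with `Q=1` the category keyed `"TT"` (m = 2) is skipped before any of its kmer2 is compared.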

Further, in Step 1, the multiple sequences of the reference genome are scanned in parallel for kmers that have a standard PAM or a non-standard PAM as prefix or suffix.

Further, scanning the multiple sequences of the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix is treated as one overall task and divided into multiple subtasks, each of which scans one sequence of the reference genome for such kmers. The Process and Queue facilities of Python's multiprocessing module are used to simulate a process pool, and the subtasks are executed in parallel.

Further, in Step 2, the kmer2 of the multiple categories are built into the multiple tries in parallel.

Further, building the kmer2 of the multiple categories into multiple tries is treated as one overall task and divided into multiple subtasks, each of which builds the kmer2 of one category into one trie. The Process and Queue facilities of Python's multiprocessing module are used to simulate a process pool, and the subtasks are executed in parallel.

Further, in Step 2, all tries are traversed in parallel to determine, for each kmer2, whether the kmer formed by joining it to the key sequence of its trie occurs more than once in the reference genome.

Further, traversing all tries to determine, for each kmer2, whether the kmer formed by joining it to the key sequence of its trie occurs more than once in the reference genome is treated as one overall task and divided into multiple subtasks, each of which traverses one trie and makes this determination for every kmer2 in it. The Process and Queue facilities of Python's multiprocessing module are used to simulate a process pool, and the subtasks are executed in parallel.

Further, in Step 2, for each kmer in the kmer set: if it has a non-standard PAM as prefix or suffix, or it occurs more than once in the reference genome, it is a non-guide RNA; otherwise it is a candidate guide RNA;

All candidate guide RNAs are classified by kmer1. For the candidate guide RNAs of all categories, all tries are traversed in parallel to determine whether there exists a kmer whose Hamming distance to the candidate is less than M; if so, the candidate is a non-guide RNA, otherwise it is a guide RNA.

Further, traversing all tries in parallel for the candidate guide RNAs of all categories to determine whether there exists a kmer whose Hamming distance to a candidate is less than M is treated as one overall task and divided into multiple subtasks, each of which handles the candidate guide RNAs of one category: it traverses all tries and determines whether such a kmer exists. The Process and Queue facilities of Python's multiprocessing module are used to simulate a process pool, and the subtasks are executed in parallel.

Further, for the candidate guide RNAs of one category, the method of traversing all tries to determine whether there exists a kmer whose Hamming distance to a candidate is less than M is specifically as follows. First, compute the Hamming distance m between the kmer1 of the candidate guide RNAs of the category and the key sequence of each trie, and select the tries whose key sequence is within Hamming distance M of that kmer1. Then, for each candidate guide RNA of the category, search each selected trie for a kmer2 whose Hamming distance to the kmer2 of the candidate is not greater than M-m; if one exists, there is a kmer whose Hamming distance to the candidate guide RNA is less than M. These steps use a data-preprocessing optimization. Because the candidate guide RNAs of each category share the same kmer1, and the kmer2 in each trie share the same key sequence, the candidates are first classified and the kmer1 of a category is then compared with the key sequence of each trie. This greatly reduces the computation and time needed to compare candidate guide RNAs with kmers, and avoids a full one-by-one comparison of every candidate with every kmer. Moreover, when the Hamming distance m between the kmer1 of a category of candidates and the key sequence of a trie is greater than M, the kmer2 of that category need not be compared with the kmer2 in that trie at all, which reduces computation and time further.
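The second stage, searching one trie for kmer2 within a mismatch budget, can be sketched as a depth-first walk that abandons a branch as soon as the running Hamming distance exceeds the budget. This is a sketch under assumptions: the trie is a nested dict, `END` is a hypothetical end-of-word marker, and the budget is whatever the caller passes (the text uses M-m for the similarity filter and Q-m for the off-target search).

```python
END = "$"  # hypothetical marker stored at the node that ends a kmer2

def build_trie(words):
    """Build a nested-dict trie from equal-length sequences."""
    root = {}
    for word in words:
        node = root
        for base in word:
            node = node.setdefault(base, {})
        node[END] = True
    return root

def search(trie, query, budget):
    """Return (word, mismatches) for every stored word within `budget`."""
    hits = []

    def walk(node, depth, h, path):
        if h > budget:
            return  # prune: this branch already exceeds the budget
        if depth == len(query):
            if END in node:
                hits.append(("".join(path), h))
            return
        for base, child in node.items():
            if base == END:
                continue
            path.append(base)
            walk(child, depth + 1, h + (base != query[depth]), path)
            path.pop()

    walk(trie, 0, 0, [])
    return hits
```

Pruning at the node level is what makes the trie worthwhile here: all kmer2 that share a mismatching prefix are rejected with a single comparison.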

Further, in Step 3, all tries are traversed in parallel for the guide RNAs of all categories, searching for kmers whose Hamming distance to them is not greater than Q.

Further, in Step 3, traversing all tries in parallel for the guide RNAs of all categories to search for kmers whose Hamming distance to them is not greater than Q is treated as one overall task and divided into multiple subtasks, each of which handles the guide RNAs of one category: it traverses all tries and searches for kmers whose Hamming distance to them is not greater than Q. The Process and Queue facilities of Python's multiprocessing module are used to simulate a process pool, and the subtasks are executed in parallel.

Further, the method of using the Process and Queue facilities of Python's multiprocessing module to simulate a process pool and execute multiple subtasks in parallel is specifically as follows:

Multiple processes are created with Process from Python's multiprocessing module. The parameters common to all subtasks are given to each process as fixed parameters, and the characteristic parameters of each subtask are placed in a queue. Each process takes one group of characteristic parameters from the queue at a time and executes one subtask according to these characteristic parameters and the fixed parameters, so that multiple processes execute multiple subtasks in parallel. After a process has finished a subtask, it takes another group of characteristic parameters from the queue and executes a new subtask according to them and the fixed parameters. This continues until all characteristic parameters have been taken from the queue and all subtasks have been executed. Here, the characteristic parameters of a subtask are those of its parameters that differ from the parameters of the other subtasks.
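The simulated process pool described above can be sketched with `multiprocessing.Process` and `multiprocessing.Queue`. The sketch assumes the "fork" start method (available on Unix-like systems), so that the fixed parameters are inherited by the child processes rather than pickled; whether the original implementation relies on this is not stated in the text. The names `run_subtasks`, `_worker` and `count_within` are hypothetical.

```python
import multiprocessing as mp

def _worker(func, fixed, tasks, results):
    """Take characteristic parameters from the queue until a sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more subtasks
            break
        results.put((item, func(fixed, item)))

def run_subtasks(func, fixed, items, n_procs=2):
    """Run func(fixed, item) for every item using n_procs worker processes."""
    # Assumption: "fork" start method, so `func` and `fixed` are inherited
    # by the children instead of being serialized.
    ctx = mp.get_context("fork")
    tasks, results = ctx.Queue(), ctx.Queue()
    for item in items:
        tasks.put(item)
    for _ in range(n_procs):
        tasks.put(None)  # one sentinel per worker
    procs = [ctx.Process(target=_worker, args=(func, fixed, tasks, results))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    # Collect results before joining so no child blocks on a full pipe.
    out = dict(results.get() for _ in items)
    for p in procs:
        p.join()
    return out

def count_within(fixed, query):
    """Toy subtask: count stored words within fixed['limit'] mismatches of query."""
    return sum(
        sum(a != b for a, b in zip(word, query)) <= fixed["limit"]
        for word in fixed["words"]
    )
```

Here `items` must be a list of hashable, picklable values (they travel through the queues), while `fixed` can hold objects that are awkward to serialize, which is the situation the text describes for the tries.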

The invention is based on the finding, made while studying how the GuideScan software builds a CRISPR guide RNA library, that the computation is data-parallel. A method was therefore designed that implements GuideScan's construction of the CRISPR guide RNA library with multi-process parallel processing. It comprises the following. Parallel determination of the targetable genome space: the multiple sequences of the reference genome are scanned in parallel to generate the kmer set. Splitting and classification: each kmer in the kmer set is split into two parts, kmer1 and kmer2, and the kmer2 whose kmer1 are identical are grouped into one category. Construction of tries by category: the kmer2 of each category are built into one trie (dictionary tree), whose key sequence is the kmer1 shared by the kmer2 it contains. Parallel filtering of non-candidate guide RNAs: all tries are traversed in parallel; each kmer2 whose corresponding kmer has a non-standard PAM as prefix, or occurs more than once in the reference genome, is marked as the kmer2 of a non-candidate guide RNA, and the remaining unmarked kmer2 are selected, the kmers formed by joining them to the key sequences of their tries being the candidate guide RNAs. Parallel filtering of candidate guide RNAs that have similar kmers in the tries: for each candidate guide RNA, an attempt is made to find in the tries another kmer whose Hamming distance to the candidate is less than M; if one exists, the kmer2 of the candidate and of all kmers similar to it are marked as kmer2 of non-candidate guide RNAs. Parallel retrieval of guide RNAs: all tries are traversed in parallel and the kmer2 not marked as kmer2 of non-candidate guide RNAs are selected; the kmers formed by joining them to the key sequences of their tries are the final guide RNAs. Parallel computation of the off-target sequences of guide RNAs: for each guide RNA, an attempt is made to find in the tries the kmers whose Hamming distance to it is not less than M and not greater than Q; the sequences in the reference genome that are complementary to these kmers are the off-target sequences of the guide RNA. Finally, a BAM file containing the guide RNAs and their off-target sequence information is generated. The invention divides the guide RNA library design task into a series of subtasks according to the classified data, then simulates a process pool and executes these subtasks in parallel with multiple processes. Before each subtask starts, the data are preprocessed to reduce the subsequent computation, which greatly shortens search time and improves search efficiency. The multi-process structure and the data-preprocessing optimization together improve computational efficiency.

The main feature of the invention is data parallelism, and a parallel design of this kind could also be implemented as distributed parallelism. Distributed parallelism can use the resources of more machines, but it also incurs more communication overhead. Because Steps 1 and 2 of the method (the first 6 steps of the embodiment) take little time, distributing them is unnecessary and would only add communication overhead. Computing the off-target sequence information of guide RNAs in parallel is the most time-consuming step. Because the way kmers fall into categories can differ greatly between data sets, the data volume of each task must be considered and the data must be allocated sensibly to each machine to avoid data skew. In addition, because the trie data structure cannot be serialized, the trie information must be saved to temporary files, copied to every machine, and read back into tries on each machine before further parallel processing. All results are then merged. Parallel processing on a distributed cluster can draw on more computing resources and is not limited to a single machine.

Beneficial effects:

The invention (MultiGuideScan) proposes a CRISPR guide RNA library design method. A kmer set is first generated from the genome sequence. Each kmer is then split into kmer1 and the remaining kmer2; the kmer2 whose kmer1 are identical are grouped into one category, and the kmer1 corresponding to the kmer2 of a category serves as the key sequence of that category. The kmer2 of each category are built into one trie (dictionary tree), whose key sequence is the key sequence of the kmer2 of that category, which yields multiple tries. In this way, the data tasks of filtering out non-guide RNAs in each trie to obtain guide RNAs, and of searching all kmers for the off-target sequences of guide RNAs, are divided into a series of small tasks. These small tasks can be executed in parallel, which effectively reduces the computation time for designing a CRISPR guide RNA library. In addition, each guide RNA is split into kmer1 and kmer2; the Hamming distance between kmer1 and the key sequence of each trie is computed first and the tries that do not meet the distance requirement are excluded, which reduces the subsequent kmer2 matching work when obtaining off-target sequence information. Furthermore, the method simulates a process pool to run the subtasks of the algorithm in parallel, exploiting multiprocessor hardware to raise computation speed. The efficient parallel algorithm of the invention makes it feasible to design guide RNA libraries for large genomes.

The method of the invention implements a multi-process parallel version of the GuideScan software in order to speed up CRISPR guide RNA library design. GuideScan is implemented in Python combined with C++. In CPython, the mainstream Python interpreter, the global interpreter lock (GIL) is a mutex that protects access to Python objects and prevents multiple threads from executing Python bytecode at the same time. Threads in CPython therefore cannot be used for parallel computation, and multithreading brings no speed-up for CPU-intensive tasks. Python's multiprocessing module, by contrast, makes it possible to write programs that run in parallel and use all CPU cores. Because the tasks in the method are computation-intensive, multiple processes are used to improve the execution efficiency of the algorithm. With the multiprocessing module, each process has its own Python interpreter and its own GIL. The module uses separate memory spaces and multiple CPU cores, bypasses the GIL limitation of CPython, runs the work in child processes, and is relatively easy to use. When the number of tasks is small, several processes can be created dynamically with Process from the multiprocessing module. When the number of objects to process is very large, however, managing processes by hand becomes cumbersome, and a process pool (Pool) is useful. A process pool provides a number of processes specified by the user. When a new task request is submitted to the pool and the pool is not yet full, a new process is created to execute the task; if the number of processes in the pool has reached the specified maximum, the request waits in a queue, and as soon as a task finishes, the idle process takes a new task from the queue and executes it. This makes efficient use of a limited number of processes and avoids continually creating and destroying them. However, a process pool requires its parameters to be serialized before they are passed, and the trie structure cannot be serialized, so the process pool cannot be used. The invention instead simulates the function of a process pool with the Queue and Process facilities of the multiprocessing module. The construction and search of each trie is treated as one task; multiple tasks are submitted at once and processes are invoked dynamically. Whenever a process is idle and tasks are waiting, the idle process takes a waiting task from the queue and handles it, which improves task execution efficiency.

Brief description of the drawings

Fig. 1 is a flow diagram of the embodiment of the invention (hereinafter MultiGuideScan), in which (a) is the workflow of MultiGuideScan; (b) is an example reference genome; (c) shows the kmers extracted from the reference genome and their coordinate information; (d) shows the result of classifying the kmers by their first n bases: the first n bases form kmer1, which serves as the key sequence (key) of the category and is converted to an index via base-4 encoding, and the remainder is kmer2; (e) shows the tries built from kmer2, which contain the kmer2 together with information such as the occurrence count and coordinates of the corresponding kmer in the genome, with kmer1 as the key sequence (key) of the corresponding trie and the corresponding index as the index of the trie;

Fig. 2 is a schematic diagram of the parallel computation of guide RNA off-target sequence information.

Fig. 3 shows the number of guide RNAs obtained from the yeast data set.

Fig. 4 shows the number of guide RNAs obtained from the Caenorhabditis elegans data set.

Fig. 5 shows the number of guide RNAs obtained from the Drosophila melanogaster data set.

Fig. 6 compares the total running time of MultiGuideScan and GuideScan for Q = 3 and n = 4.

Fig. 7 compares the total running time of MultiGuideScan and GuideScan for Q = 4 and n = 5.

Detailed description of the embodiments

The invention is further described below with reference to an embodiment.

This embodiment provides a CRISPR guide RNA library design method, which specifically comprises the following steps:

Step 1: determine the targetable genome space in parallel.

To generate the guide RNA library the user wants, the method lets the user supply the reference genome FASTA file, the guide RNA length, the standard PAM (protospacer adjacent motif), the position of the PAM relative to the guide RNA target sequence, the non-standard PAM, and the Hamming distances M and Q. As shown in Fig. 1 (b, c), based on the user-supplied parameters the algorithm scans the reference genome for kmers that have a standard or non-standard PAM as prefix or suffix, together with related information such as the corresponding coordinates and direction (which record the position and coding direction of the kmer in a given reference sequence). The scan can be parallelized: a reference genome file contains multiple reference sequences, and the scans of the individual reference sequences are independent of each other, so the scan of each reference sequence is treated as one task and the tasks are executed in parallel with multiple processes. The kmers, their PAMs and their coordinate information are saved, and the results of all scans are then merged to form the kmer set, which is the targetable genome space. The number of times each kmer occurs in the reference genome is also counted.

Standard and non-standard PAMs are also called canonical and non-canonical PAMs in some literature. Each PAM consists of s bases, each kmer consists of N bases, and there are t types of base. In this embodiment, s=3, N=23 and t=4, the four base types being A, C, G and T.
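The scan of a single reference sequence can be sketched as below. Several choices are assumptions made for illustration and are not fixed by the text: the PAM is taken as a suffix, "NGG" and "NAG" stand in for the standard and non-standard PAM, the N = 23 bases are taken to include the 3-base PAM, and the input is assumed to be upper case. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def pam_matches(pam, pattern):
    """True if `pam` matches `pattern`; 'N' in the pattern is a wildcard."""
    return len(pam) == len(pattern) and all(
        p in ("N", b) for b, p in zip(pam, pattern)
    )

def scan_sequence(name, seq, n_total=23, standard="NGG", nonstandard="NAG"):
    """Yield (kmer, pam_kind, name, start, strand) for windows ending in a PAM."""
    s = len(standard)
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(text) - n_total + 1):
            kmer = text[i:i + n_total]
            if set(kmer) - set("ACGT"):
                continue  # skip windows containing ambiguity codes such as N
            pam = kmer[-s:]
            if pam_matches(pam, standard):
                kind = "standard"
            elif pam_matches(pam, nonstandard):
                kind = "nonstandard"
            else:
                continue
            # Report the start coordinate on the forward strand.
            start = i if strand == "+" else len(seq) - i - n_total
            yield kmer, kind, name, start, strand
```

Because each call touches only one reference sequence, one call per sequence maps naturally onto the one-task-per-sequence parallelization described above.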

Step 2: split and classify the kmers in the kmer set.

As shown in Fig. 1 (d), each kmer is first split into two parts, kmer1 and kmer2. kmer1 has length n (n=Q+1 in this embodiment) and consists of the first n bases of the kmer; kmer2 is the remainder of the kmer. The kmer2 whose kmer1 are identical are then grouped into one category, with that kmer1 as the key sequence of the category. Because sequences are made up of the four bases A, C, G and T, all kmers fall into 4ⁿ categories. For convenience, the bases A, C, G and T in a key sequence are replaced by the digits 0, 1, 2 and 3 to give a base-4 number, which is then converted to a decimal number that serves as the index. Key sequences and indices can be converted into each other: for example, key sequence "AAAA" corresponds to index 0, and index 1 corresponds to key sequence "AAAC".
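The base-4 conversion between key sequences and indices, together with the grouping of kmer2 by kmer1, can be sketched as follows; the function names are hypothetical.

```python
BASES = "ACGT"  # A, C, G, T map to the base-4 digits 0, 1, 2, 3

def key_to_index(key):
    """Read the key sequence as a base-4 number and return it as an int."""
    index = 0
    for base in key:
        index = index * 4 + BASES.index(base)
    return index

def index_to_key(index, n):
    """Inverse of key_to_index for a key sequence of length n."""
    digits = []
    for _ in range(n):
        index, d = divmod(index, 4)
        digits.append(BASES[d])
    return "".join(reversed(digits))

def classify(kmers, n):
    """Group kmer2 by the index of their kmer1 (the first n bases)."""
    categories = {}
    for kmer in kmers:
        kmer1, kmer2 = kmer[:n], kmer[n:]
        categories.setdefault(key_to_index(kmer1), []).append(kmer2)
    return categories
```

For n = 4 this gives `key_to_index("AAAA") == 0` and `index_to_key(1, 4) == "AAAC"`, matching the example in the text, and the largest index is 4⁴ - 1 = 255 for "TTTT".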

Step 3: build the tries by category.

As shown in Fig. 1 (e), one trie is built for each category of kmer2 (the trie structure used in this embodiment is a dictionary tree), which yields multiple tries, and the key sequence of each category of kmer2 is used as the key sequence of the corresponding trie. This step divides the single trie that the original GuideScan method builds from all kmers into multiple smaller tries. The kmers corresponding to the kmer2 in each trie share the same prefix kmer1, and information such as the coordinates of each sequence is the same as for the original kmer. Each trie is built independently, so the construction of these 4ⁿ tries can be processed in parallel. Because the process pool (Pool) of Python's multiprocessing module requires the arguments of a method to be serialized before they are passed, and the trie structure cannot be serialized, the process pool cannot be used. This embodiment instead simulates the function of a process pool with the Queue and Process facilities of the multiprocessing module. The construction of each trie is treated as one task; multiple tasks are submitted at once and processes are invoked dynamically. Whenever a process is idle and tasks are waiting, the idle process takes a waiting task from the queue and handles it, which improves task execution efficiency. In this step, all sequences that have a non-standard PAM as prefix or suffix are marked as non-candidate guide RNAs, and sequences that have a standard PAM as prefix or suffix are marked as candidate guide RNAs.
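Building one small trie per category can be sketched with nested dicts. This is an illustration only: GuideScan's actual trie implementation is not described in the text, and `END`, `insert` and `build_tries` are hypothetical names. Each leaf carries the occurrence count and coordinates of the full kmer, as described for Fig. 1 (e).

```python
from collections import defaultdict

END = "$"  # hypothetical key under which a leaf stores its payload

def insert(trie, kmer2, coord):
    """Insert kmer2 into a nested-dict trie, recording count and coordinate."""
    node = trie
    for base in kmer2:
        node = node.setdefault(base, {})
    leaf = node.setdefault(END, {"count": 0, "coords": []})
    leaf["count"] += 1
    leaf["coords"].append(coord)

def build_tries(kmers_with_coords, n):
    """Return {kmer1: trie of the kmer2 sharing that kmer1}."""
    tries = defaultdict(dict)
    for kmer, coord in kmers_with_coords:
        insert(tries[kmer[:n]], kmer[n:], coord)
    return dict(tries)
```

Since no trie refers to any other, each entry of the returned mapping could be built by a separate subtask, which is the independence the text relies on.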

Step 4: filter out non-candidate guide RNAs in parallel.

Based on the kmer set and the number of occurrences of each kmer in the reference genome, if a kmer occurs more than once, the kmer2 corresponding to it is located in its trie and marked as the kmer2 of a non-candidate guide RNA. The tries are then traversed and the remaining kmer2 (those not marked as kmer2 of non-candidate guide RNAs) are selected; each is joined to the key sequence (kmer1) of its trie to form a candidate guide RNA. The filtering of the individual tries is again mutually independent, so a process pool can likewise be simulated, with the filtering of each trie treated as one task and the tasks executed in parallel. The candidate guide RNAs produced by each task come from the same trie and share the same key sequence.
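The effect of this filter, combined with the PAM-based marking of Step 3, can be sketched on a flat list of scanned occurrences. The flat-list representation and the name `candidate_kmers` are assumptions made for brevity; the embodiment applies the same rule per trie.

```python
from collections import Counter

def candidate_kmers(occurrences):
    """Return kmers that have a standard PAM and occur exactly once.

    `occurrences` holds one (kmer, pam_kind) pair per genomic occurrence,
    with pam_kind being "standard" or "nonstandard".
    """
    occurrences = list(occurrences)
    counts = Counter(kmer for kmer, _ in occurrences)
    return sorted({
        kmer for kmer, kind in occurrences
        if kind == "standard" and counts[kmer] == 1
    })
```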

Step 5: filter out, in parallel, the candidate guide RNAs that have similar kmers in the tries.

For each candidate guide RNA, an attempt is made to find in the tries another kmer that is similar to it, that is, a kmer whose Hamming distance to the candidate guide RNA is less than M. If one exists, the candidate guide RNA and all kmers similar to it are marked as non-candidate guide RNAs, that is, the kmer2 of these kmers are marked as kmer2 of non-candidate guide RNAs. This guarantees that the Hamming distance between every remaining candidate guide RNA and every other kmer is greater than or equal to M, so the candidate guide RNAs are not similar to one another.

When the candidate guide RNAs that have similar kmers in the tries are filtered in parallel with multiple processes, the procedure is as follows:

Step 5.1: classify all candidate guide RNAs by kmer1, placing the candidates with identical kmer1 in one category, and divide the task (filtering the candidate guide RNAs that have similar kmers in the tries) into small tasks according to these categories. Each small file, which contains one category of candidate guide RNAs, is placed together with its index (the index converted from the kmer1 of the candidates in the file) in a queue that the processes access. Each pair in the queue, consisting of a small file with one category of candidate guide RNAs and the corresponding index, forms the characteristic parameters of one subtask, where the characteristic parameters of a subtask are those of its parameters that differ from the parameters of the other subtasks;

Step 5.2: create multiple processes with the Process method of the multiprocessing module; the number of processes can be set by the user; the preset parameters of each process are all the trie trees and the Hamming distance M; an idle process takes one group of characteristic parameters from the queue and executes the corresponding task according to the preset parameters and the characteristic parameters: it first computes, from the index number, the kmer1 of the candidate induction RNAs in the corresponding file, then computes the Hamming distance m between this kmer1 and the key sequence of each trie tree and stores the results in a list;

Step 5.3: loop over each candidate induction RNA and each trie tree in the list, first retrieving the corresponding Hamming distance m; if m is greater than or equal to M, the trie tree is skipped (when the Hamming distance m between the kmer1 of the candidate induction RNA and the key sequence of the trie tree is greater than or equal to M, no kmer2 comparison is needed, which reduces the amount of computation); otherwise, the corresponding trie tree is searched for kmer2 whose Hamming distance to the kmer2 of the candidate induction RNA is less than M-m; if any exist, each such kmer2 and the index number of the trie tree containing it are written to a temporary file;

Step 5.4: when all tasks have finished, according to the kmer2 and index numbers in all the temporary files generated, mark the corresponding kmer2 in the trie tree with the corresponding index number as the kmer2 of a non-candidate induction RNA.
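The queue-driven scheme of steps 5.1 and 5.2 can be sketched with the `Process` and `Queue` classes of Python's `multiprocessing` module. The sketch below makes several assumptions that are not in the text: the index number is taken to be a base-4 encoding of kmer1 (A=0, C=1, G=2, T=3), the preset parameters are reduced to the list of key sequences, a `None` sentinel tells each process that the queue is exhausted, and the "fork" start method is used, so a Unix-like system is assumed. The names `index_to_kmer1`, `worker` and `run_subtasks` are hypothetical.

```python
import multiprocessing as mp

BASES = "ACGT"


def index_to_kmer1(index, n):
    """Decode an index number into a kmer1 of length n (assumed base-4 encoding)."""
    chars = []
    for _ in range(n):
        index, remainder = divmod(index, 4)
        chars.append(BASES[remainder])
    return "".join(reversed(chars))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def worker(tasks, results, keys, n):
    """Take characteristic parameters from the queue until a sentinel arrives."""
    while True:
        index = tasks.get()
        if index is None:  # sentinel: no subtasks left for this process
            break
        kmer1 = index_to_kmer1(index, n)
        # Hamming distance m between this kmer1 and every key sequence.
        results.put((index, [hamming(kmer1, key) for key in keys]))


def run_subtasks(indices, keys, n, n_procs=2):
    """Simulate a process pool: n_procs processes share one task queue."""
    indices = list(indices)
    ctx = mp.get_context("fork")  # assumption: Unix-like system
    tasks, results = ctx.Queue(), ctx.Queue()
    for index in indices:
        tasks.put(index)
    for _ in range(n_procs):
        tasks.put(None)  # one sentinel per process
    procs = [ctx.Process(target=worker, args=(tasks, results, keys, n))
             for _ in range(n_procs)]
    for proc in procs:
        proc.start()
    # Drain the result queue before joining so no process blocks on put().
    collected = dict(results.get() for _ in indices)
    for proc in procs:
        proc.join()
    return collected
```

With `keys = ["AA", "AC", "CG"]` and `n = 2`, `run_subtasks([0, 6], keys, 2)` returns the distance list for the classes "AA" (index 0) and "CG" (index 6). In this sketch the results travel back through a second queue rather than through temporary files as in step 5.3; that is a simplification for the example.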

Given a kmer2, the trie tree is searched for kmer2 whose Hamming distance to it is less than M-m as follows:

A: first set the recorded Hamming distance h to 0; starting from layer 1 of the trie tree, compare the base of the 1st node with the 1st base of the kmer2; if they are the same, h is unchanged, otherwise h is increased by 1; then continue by comparing the 1st child node of that node with the next base of the kmer2, and so on;

B: if h is still less than M-m when a leaf node is reached, the kmer corresponding to the kmer2 of this branch is a similar sequence, and its related information is recorded; continue by comparing its sibling nodes; once the sibling nodes have all been compared, go back to the parent node and continue with the sibling nodes of the parent node;

C: when comparing the j-th node of layer i with the i-th base of the kmer2, if h equals M-m, none of the kmer2 in this branch is a similar sequence being sought, so continue by comparing the (j+1)-th node of layer i with the i-th base of the kmer2; if h is less than M-m, continue by comparing the 1st child node of this node with the (i+1)-th base of the kmer2;

D: continue in this way until all the trie trees have been fully traversed, yielding all similar sequences of the kmer2 and their related information.
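The depth-first search of A to D can be sketched as below. This is an illustrative sketch rather than the implementation of the invention: the trie tree is modelled as nested dictionaries, all stored kmer2 are assumed to have the same length as the query, and the names `build_trie` and `search_within` are hypothetical. The argument `limit` plays the role of M-m.

```python
def build_trie(sequences):
    """Build a nested-dict trie; all sequences are assumed to have equal length."""
    root = {}
    for seq in sequences:
        node = root
        for base in seq:
            node = node.setdefault(base, {})
    return root


def search_within(trie, query, limit):
    """Return (sequence, h) for stored sequences with Hamming distance h < limit."""
    hits = []

    def descend(node, depth, h, path):
        if depth == len(query):
            # Leaf reached with h still below the limit: a similar sequence.
            hits.append(("".join(path), h))
            return
        for base, child in node.items():
            new_h = h + (base != query[depth])
            if new_h >= limit:
                continue  # prune: nothing below this node can be similar
            path.append(base)
            descend(child, depth + 1, new_h, path)
            path.pop()  # back to the parent, then on to the next sibling

    descend(trie, 0, 0, [])
    return sorted(hits)
```

For a trie built from "ACGT", "ACGA", "TTTT" and "ACCA", `search_within(trie, "ACGT", 2)` finds "ACGA" at distance 1 and "ACGT" itself at distance 0. In this sketch a query that is stored in the tree is always reported, so a caller looking for other similar kmers would have to exclude the exact match. A `limit` of 0 or less prunes every branch, which corresponds to skipping a trie tree when m is greater than or equal to M.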

Step 6: obtain the induction RNAs in parallel. In the same way as step 4, all trie trees are traversed in parallel using multiple processes to filter out all candidate induction RNAs, which form the final induction RNA set;

Step 7: compute the off-target sequences of the induction RNAs in parallel;

For all induction RNAs, the trie trees are searched for the off-target sequences that mismatch them, with the Hamming distance required to be greater than or equal to M and less than or equal to Q. Fig. 2 is a schematic diagram of this step, which is divided into 4ⁿ small data tasks that are processed in parallel.

When the off-target sequence information of the induction RNAs is computed in parallel using multiple processes, the procedure is as follows:

Step 7.1: classify all induction RNAs by kmer1, placing the induction RNAs with the same kmer1 in one class; for each class, create a corresponding small SAM file into which the off-target sequence information is written; put each small file containing one class of induction RNAs, its index number (the index number into which the kmer1 of the induction RNAs in the small file is converted) and its small SAM file into a queue that multiple processes can access; each entry in the queue, i.e. a small file containing one class of induction RNAs, its index number and its small SAM file, is the characteristic parameter of one subtask; the characteristic parameters are the parameters of a subtask that distinguish it from the other subtasks;

Step 7.2: create multiple processes with the Process method of the multiprocessing module; the number of processes can be set by the user; the preset parameters of each process are all the trie trees and the Hamming distance Q; an idle process takes one group of characteristic parameters from the queue and executes the corresponding task according to the preset parameters and the characteristic parameters: it first computes, from the index number, the kmer1 of the induction RNAs in the corresponding file, then computes the Hamming distance m between this kmer1 and the key sequence of each trie tree and stores the results in a list;

Step 7.3: each idle process takes the small file of one class of induction RNAs from the queue and reads from it the induction RNA sequences and the corresponding index number. The index number is converted into a key sequence, the Hamming distance m between this key sequence and the key sequence of each trie tree is computed, and the results are put into a list for later access.

Step 7.4: for each induction RNA, record its related information and then split it into kmer1 and kmer2. For each trie tree, the Hamming distance m between the kmer1 and the key sequence of the trie tree is retrieved from the saved list according to the index number of the file containing the induction RNA. If m is greater than Q, the trie tree is skipped (when the Hamming distance m between the kmer1 of the induction RNA and the key sequence of the trie tree is greater than Q, no kmer2 comparison is needed, which reduces the amount of computation); otherwise the trie tree is searched for kmer2 whose Hamming distance to the kmer2 of the induction RNA is not greater than Q-m. These kmer2 are each joined to the key sequence of the trie tree containing them to form a number of kmers, and the sequences in the reference genome that pair complementarily with these kmers are the off-target sequences of the induction RNA. When all trie trees have been searched, all the off-target sequences obtained and their related information can be converted into hexadecimal form.

Step 7.5: write the related information of each induction RNA, together with the corresponding off-target sequence information, into the small SAM file. Afterwards, all small SAM files are converted into BAM files, the binary format of SAM; all small BAM files are then merged and indexed so that they can be accessed quickly with the Samtools tool.

Given the kmer2 of an induction RNA, the trie tree is searched for kmer2 whose Hamming distance to it is not greater than Q-m as follows:

A: first set the recorded Hamming distance h to 0. Starting from layer 1 of the trie tree, compare the base of the 1st node with the 1st base of the kmer2; if they are the same, h is unchanged, otherwise h is increased by 1. Then continue by comparing the 1st child node of that node with the next base of the kmer2, and so on.

B: if h is still less than or equal to Q-m when a leaf node is reached, the kmer2 of this branch is joined to the key sequence of the trie tree containing it to form a kmer; the sequence in the reference genome that pairs complementarily with this kmer is an off-target sequence of the induction RNA, and its related information is recorded. Continue by comparing its sibling nodes; once the sibling nodes have all been compared, go back to the parent node and continue with the sibling nodes of the parent node.

C: when comparing the j-th node of layer i with the i-th base of the kmer2, if h is greater than Q-m, the branch is ignored, and the comparison continues with the (j+1)-th node of layer i and the i-th base of the kmer2; otherwise, continue by comparing the 1st child node of this node with the (i+1)-th base of the kmer2.

D: continue in this way until all the trie trees have been fully traversed, yielding all off-target sequences of the induction RNA sequence and their related information.
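Step 7.4 and the search A to D combine a cheap test on the key sequence with a bounded search inside each trie tree. The sketch below illustrates that combination under stated assumptions: the trie trees are nested dictionaries keyed by kmer1, the search uses an explicit stack instead of recursion, and the lower bound M from the opening paragraph of step 7 is applied when a leaf is reached. The names `build_tries` and `off_targets` are hypothetical, and the sketch returns the matching kmers rather than the complementary genome sequences.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_tries(kmers, n):
    """Map each key sequence (kmer1) to a nested-dict trie of its kmer2."""
    tries = {}
    for kmer in kmers:
        node = tries.setdefault(kmer[:n], {})
        for base in kmer[n:]:
            node = node.setdefault(base, {})
    return tries


def off_targets(guide, tries, n, M, Q):
    """Return (kmer, distance) pairs with M <= distance <= Q for one induction RNA."""
    kmer1, kmer2 = guide[:n], guide[n:]
    found = []
    for key, trie in tries.items():
        m = hamming(kmer1, key)
        if m > Q:
            continue  # key sequences already too far apart: skip this trie tree
        # Each stack entry: (node, depth, distance so far = m + h, kmer2 path).
        stack = [(trie, 0, m, "")]
        while stack:
            node, depth, dist, path = stack.pop()
            if depth == len(kmer2):
                if dist >= M:  # assumed lower bound from the start of step 7
                    found.append((key + path, dist))
                continue
            for base, child in node.items():
                new_dist = dist + (base != kmer2[depth])
                if new_dist <= Q:  # equivalent to h <= Q-m
                    stack.append((child, depth + 1, new_dist, path + base))
    return sorted(found)
```

With the invented kmers "AACGT", "AACGA", "ATCCA", "GGTTT" and "AACCA", `n = 2`, `M = 2` and `Q = 3`, the induction RNA "AACGT" yields "AACCA" at distance 2 and "ATCCA" at distance 3. The trie tree keyed "GG" passes the key test (m is 2) but every branch is pruned once the running distance exceeds Q.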

Evaluation of experimental results

To assess the experimental performance of the method of the invention, it was compared with the original method, GuideScan. This study used benchmark datasets for three species from the UCSC genome browser (http://hgdownload.soe.ucsc.edu) to test performance: yeast, Caenorhabditis elegans (C. elegans) and Drosophila melanogaster (D. melanogaster). The input files are the reference genome FASTA files of the three species. The sequences in a genome FASTA file generally fall into the following groups:

(1) Assembled chromosomes: sequences whose names begin with chr1, ..., chrX, chrY and chrM.

(2) Unlocalized sequences: sequences with the suffix _random; the chromosome they belong to is known, but their orientation and order are not.

(3) Unplaced sequences: sequences with the prefix chrUn_; the chromosome they belong to is not known.

The main characteristics of the datasets used in this embodiment are shown in Table 1:

Table 1: Main characteristics of the benchmark datasets

This embodiment implements the original GuideScan algorithm in parallel using multiple processes, and its computational logic is essentially the same as that of the original GuideScan algorithm. The experimental results of this embodiment are therefore identical to those of the original GuideScan algorithm. Figs. 3, 4 and 5 show the induction RNAs obtained from the yeast, Caenorhabditis elegans and Drosophila melanogaster datasets, respectively. The abscissa gives the sequence name labels in the reference genome, and the ordinate gives the number of induction RNAs found in the first 50 kb of each sequence (all bases are used if the sequence is shorter).

Comparative analysis of computation time

The running time of the method of this embodiment and of the original GuideScan algorithm was tested on a Linux computer with 40 Xeon E5-2630 v4 2.20 GHz CPUs and 128 GB of memory. Two configurations were used: 1) parameter M is 2, Hamming distance Q is 3, n is 4, the induction RNA length is 20, the standard PAM is "NGG" and the non-standard PAM is "NAG"; 2) parameter M is 2, Hamming distance Q is 4, n is 5, the induction RNA length is 20, the standard PAM is "NGG" and the non-standard PAM is "NAG". The datasets are described in Table 1.

Table 2 describes the steps of the method of this embodiment (MultiGuideScan), in which step 2 is an additional processing step compared with GuideScan. All steps other than step 2 are parallelized.

Table 2: Description of the MultiGuideScan steps

Tables 3 and 4 compare the step-by-step running time of the method of this embodiment (MultiGuideScan) and GuideScan under configurations 1) and 2), respectively. In step 1, because the number of sequences in the reference genome is limited, the computation speed levels off as the number of processes increases; step 3 likewise cannot scale further because its IO and communication overhead is large; step 7 is the most time-consuming step because its computational complexity is the highest.

Table 3: Step-by-step running time comparison of MultiGuideScan (Q=3, n=4)

Table 4: Step-by-step running time comparison of MultiGuideScan (Q=4, n=5)

Figs. 6 and 7 compare the total running time of the method of this embodiment (MultiGuideScan) and GuideScan under configurations 1) and 2), respectively. The figures show that the computation time decreases steadily as the number of processes increases. In this method, the induction RNAs and the trie trees are classified by kmer1, and the Hamming distance between the kmer1 of each class of induction RNAs and the key sequence of each trie tree is computed first. Because all induction RNAs in a class share the same kmer1 and all kmer2 in a trie tree share the same key sequence, this greatly reduces the amount of computation needed to compare induction RNAs with kmers and avoids a full-length comparison of every induction RNA with every kmer one by one. Moreover, when the Hamming distance m between the kmer1 of a class of induction RNAs and the key sequence of a trie tree is greater than Q, the kmer2 of that class need not be compared with the kmer2 in that trie tree at all, which reduces the amount of computation again. Therefore, when Q is 3 and n is 4, MultiGuideScan is still considerably faster than GuideScan even with a single process, reducing the time overhead by about one-fold. As shown in Fig. 6, 2 processes give a 3-fold acceleration, 4 processes a 5-fold acceleration, 8 processes a 6-8-fold acceleration, 16 processes an 8-10-fold acceleration and 32 processes a 9-12-fold acceleration.

When Q is 4 and n is 5, the induction RNAs and trie trees are divided into 4⁵, i.e. 1024, classes. In theory this should give an acceleration similar to that of configuration 1), but in practice the number of tasks increases greatly, and the IO and communication overhead increase considerably with it, so that performance with a single process is not improved. As the number of processes increases, however, the acceleration becomes more and more pronounced. As shown in Fig. 7, 2 processes give a 1.5-fold acceleration, 4 processes a 2.6-fold acceleration, 8 processes a 4.5-fold acceleration, 16 processes a 7-fold acceleration and 32 processes a 9-10-fold acceleration.

In addition, the algorithm still contains many special parts that cannot be processed in parallel, as well as considerable IO overhead, communication overhead and other overhead. According to Amdahl's law, parallelizing a single program cannot speed it up without limit. As the number of processes increases, scalability slowly declines.
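The bound referred to above can be made concrete with the standard form of Amdahl's law, speedup = 1 / ((1 - p) + p / N), where p is the fraction of the running time that can be parallelized and N is the number of processes. The short sketch below evaluates it; the function name `amdahl_speedup` is hypothetical, and any value of p used with it is an illustrative assumption, not a figure measured in this embodiment.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the serial running time that can be parallelized.
    n_procs: number of processes working on the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)
```

For a hypothetical program that is 90% parallelizable, the speedup can never reach 10 however many processes are used, because the serial 10% alone fixes that ceiling. This is the kind of saturation that the measured accelerations show as the process count grows.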

It should be emphasized that the examples described in the invention are illustrative rather than restrictive. The invention is therefore not limited to the examples described in the detailed description; any other embodiment derived by those skilled in the art from the technical solution of the invention that does not depart from the spirit and scope of the invention, whether a modification or a replacement, likewise falls within the scope of protection of the invention.

Claims

1. A CRISPR induction RNA library design method, characterized by comprising the following steps:

Step 1: scan the reference genome for kmers having a standard PAM or non-standard PAM as prefix or suffix, to form a kmer set;

Step 2: split each kmer in the kmer set into two parts, kmer1 and kmer2, where kmer1 is the sequence consisting of its first n bases; group the kmer2 whose corresponding kmer1 are identical into one class, and take the kmer1 corresponding to the kmer2 of a class as the key sequence of the kmer2 of that class; then build the kmer2 of the same class into the same trie tree, thereby obtaining multiple trie trees, the key sequence of each trie tree being the key sequence of the kmer2 of the corresponding class;

For each kmer in the kmer set, if it has a non-standard PAM as prefix or suffix, or it occurs more than once in the reference genome, or there exists a kmer whose Hamming distance to it is less than M, it is a non-induction RNA; otherwise it is an induction RNA;

Step 3: classify all induction RNAs by kmer1 and, for the induction RNAs of each class, traverse all trie trees to search for kmers whose Hamming distance to them is not greater than Q; wherein, for the induction RNAs of one class, the method of traversing all trie trees to search for kmers whose Hamming distance to them is not greater than Q is specifically: first compute the Hamming distance m between the kmer1 of the induction RNAs of the class and the key sequence of each trie tree, and find the trie trees for which m is not greater than Q; then, for each induction RNA of the class, search each trie tree found for kmer2 whose Hamming distance to the kmer2 of the induction RNA is not greater than Q-m; join these kmer2 to the key sequence of the trie tree containing them to form a number of kmers; the sequences in the reference genome that pair complementarily with these kmers are the off-target sequences of the induction RNA.

2. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 1, the multiple sequences of the reference genome are scanned in parallel for kmers having a standard PAM or non-standard PAM as prefix or suffix.

3. The CRISPR induction RNA library design method according to claim 2, characterized in that scanning the multiple sequences of the reference genome for kmers having a standard PAM or non-standard PAM as prefix or suffix is taken as one overall task and divided into multiple subtasks, each subtask scanning one sequence of the reference genome for kmers having a standard PAM or non-standard PAM as prefix or suffix; the Process and Queue methods in the multiprocessing module of python are used to simulate the function of a process pool, and the multiple subtasks are executed in parallel.

4. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 2, the kmer2 of the multiple classes are built into the multiple trie trees in parallel.

5. The CRISPR induction RNA library design method according to claim 4, characterized in that building the kmer2 of the multiple classes into the multiple trie trees is taken as one overall task and divided into multiple subtasks, each subtask building the kmer2 of one class into one trie tree; the Process and Queue methods in the multiprocessing module of python are used to simulate the function of a process pool, and the multiple subtasks are executed in parallel.

6. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 2, for each kmer in the kmer set, if it has a non-standard PAM as prefix or suffix, or it occurs more than once in the reference genome, it is a non-induction RNA; otherwise it is a candidate induction RNA;

all candidate induction RNAs are classified by kmer1; for the candidate induction RNAs of all classes, all trie trees are traversed in parallel to determine whether there exists a kmer whose Hamming distance to the candidate induction RNA is less than M; if so, it is a non-induction RNA, otherwise it is an induction RNA.

7. The CRISPR induction RNA library design method according to claim 6, characterized in that traversing all trie trees in parallel for the candidate induction RNAs of all classes, to determine whether there exists a kmer whose Hamming distance to them is less than M, is taken as one overall task and divided into multiple subtasks, each subtask traversing all trie trees for the candidate induction RNAs of one class to determine whether there exists a kmer whose Hamming distance to them is less than M; the Process and Queue methods in the multiprocessing module of python are used to simulate the function of a process pool, and the multiple subtasks are executed in parallel.

8. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 3, for the induction RNAs of all classes, all trie trees are traversed in parallel to search for kmers whose Hamming distance to them is not greater than Q.

9. The CRISPR induction RNA library design method according to claim 8, characterized in that, in step 3, traversing all trie trees in parallel for the induction RNAs of all classes, to search for kmers whose Hamming distance to them is not greater than Q, is taken as one overall task and divided into multiple subtasks, each subtask traversing all trie trees for the induction RNAs of one class to search for kmers whose Hamming distance to them is not greater than Q; the Process and Queue methods in the multiprocessing module of python are used to simulate the function of a process pool, and the multiple subtasks are executed in parallel.

10. The CRISPR induction RNA library design method according to any one of claims 3, 5, 7 and 9, characterized in that the method of using the Process and Queue methods in the multiprocessing module of python to simulate the function of a process pool and execute multiple subtasks in parallel is specifically:

Multiple processes are created using the Process method in the multiprocessing module of python; the parameters common to all subtasks are taken as the preset parameters of each process, and the characteristic parameters of each subtask are put into a queue; each process takes one group of characteristic parameters from the queue at a time and executes one subtask according to these characteristic parameters and the preset parameters; the multiple processes execute multiple subtasks in parallel; after a process has finished a subtask, it takes another group of characteristic parameters from the queue and executes a new subtask according to that group of characteristic parameters and the preset parameters; this continues until all characteristic parameters in the queue have been taken and all subtasks have been executed; wherein the characteristic parameters are the parameters of a subtask that distinguish it from the other subtasks.