CN110322927A - A CRISPR guide RNA library design method - Google Patents
- Publication number
- CN110322927A (application CN201910712069.XA)
- Authority
- CN
- China
- Prior art keywords
- rna
- kmer
- kmer2
- induction
- kmer1
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Abstract
The invention discloses a CRISPR guide RNA library design method comprising the following steps. Step 1: generate a kmer set from a reference genome. Step 2: split each kmer into two parts, kmer1 and kmer2; group the kmer2 whose corresponding kmer1 are identical into one category; build the kmer2 of each category into one trie, whose key sequence is the kmer1 shared by the kmer2 in it. Step 3: obtain the guide RNAs and their off-target sequences in parallel. In this step, when a kmer is compared with a kmer formed by joining the key sequence of a trie to a kmer2 in that trie, the kmer1 of the kmer is first compared with the key sequence of the trie to check whether a set condition is met; only if it is met is the kmer2 of the kmer then compared with the kmer2 in the trie. The invention improves computational efficiency.
Description
Technical field
The invention belongs to the field of functional genomics, and in particular relates to a CRISPR guide RNA library design method.
Background art
Among current genome editing technologies, the CRISPR system (clustered regularly interspaced short palindromic repeats) and the Cas9 nuclease (CRISPR-associated RNA-guided nuclease 9) have developed the fastest. They can readily target almost any genomic location and represent a major step forward for genetic engineering. The Cas9 protein of the CRISPR system relies on complementary base pairing between the guide RNA (gRNA, also called induction RNA) and the DNA target sequence to direct the guide RNA-protein complex to the DNA target sequence. Binding to the protospacer adjacent motif (PAM) downstream of the target site helps direct Cas9 to cut the DNA double strand; the PAM is required for the Cas9 nuclease to cut double-stranded DNA. The guide RNA is a key component of the CRISPR-Cas system. It consists of a constant part and a variable part; the variable part is complementary to the DNA target sequence, so the guide RNA can be made to bind different DNA sites by engineering the variable part. CRISPR guide RNA libraries are essential for genome editing systems. With advances in genome sequencing technology, the design of guide RNA libraries is becoming increasingly important for understanding genome function.
Many guide RNA design tools have been developed for genome editing, such as CRISPR Design, Cas-OFFinder, CRISPRscan and E-CRISP. However, these tools return sets of guide RNAs that are unique but overlapping, ignore the non-coding regions of the genome, and rely on third-party alignment tools to search for off-target sequences.
Recently, the GuideScan software improved the construction of CRISPR guide RNA databases. GuideScan is an open-source system that can design more comprehensive or fully customized guide RNA libraries for any genome or CRISPR endonuclease. In addition, GuideScan can build single guide RNA and paired guide RNA databases, and can identify a considerable number of guide RNAs that have multiple perfect target sites. The guide RNAs obtained by GuideScan therefore have, on average, higher specificity than those obtained by other tools. However, the computational cost of GuideScan is relatively high, and it is very high when computing the off-target sequences of guide RNAs (sequences whose number of mismatches with the guide RNA is not less than parameter M and not more than parameter Q, with Q > M). Applying GuideScan to large genomes is therefore impractical. As more and more eukaryotic genomes are sequenced or resequenced, more efficient tools are needed to speed up the design of CRISPR guide RNA libraries.
Summary of the invention
The object of the invention is to provide a CRISPR guide RNA library design method that effectively reduces the computation time needed to design a CRISPR guide RNA library.
A CRISPR guide RNA library design method, comprising the following steps:
Step 1: scan the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix; these kmers form the kmer set, i.e. the targetable genome space;
Step 2: split each kmer in the kmer set into two parts, kmer1 and kmer2, where kmer1 is the sequence of its first n bases and kmer2 is the remainder. Group the kmer2 whose corresponding kmer1 are identical into one category, and take that kmer1 as the key sequence of the category. Build the kmer2 of each category into one trie (dictionary tree), which yields multiple tries (the data task is thereby divided into a series of small tasks); the key sequence of each trie is the key sequence of its category of kmer2;
For each kmer in the kmer set: if it has a non-standard PAM as prefix or suffix, or it occurs more than once in the reference genome, or another kmer exists whose Hamming distance to it is less than M, it is a non-guide RNA; otherwise it is a guide RNA;
Step 3: classify all guide RNAs by kmer1; for the guide RNAs of every category, traverse all tries and search for kmers whose Hamming distance to the guide RNA is no greater than Q. For the guide RNAs of one category, this search proceeds as follows. First, compute the Hamming distance m between the kmer1 of the guide RNAs of the category and the key sequence of each trie, and select the tries for which m is no greater than Q. Then, for each guide RNA of the category, search each selected trie for kmer2 whose Hamming distance to the kmer2 of the guide RNA is no greater than Q-m. Joining each such kmer2 to the key sequence of the trie it belongs to gives a set of kmers; the sequences in the reference genome that pair complementarily with these kmers are the off-target sequences of the guide RNA. (DNA in the reference genome is double-stranded, and each base on one strand is complementary to the base at the corresponding position on the other strand. If the Hamming distance between a kmer on one strand and the guide RNA is not less than M and not more than Q, then the number of mismatches between the guide RNA and the kmer sequence at the corresponding position on the other strand is also not less than M and not more than Q, so that sequence is an off-target sequence of the guide RNA. The coordinates on the reference genome DNA of the kmers scanned in Step 1 are recorded; in this step, once a kmer satisfying the condition has been found, the kmer sequence at the corresponding position, i.e. the sequence that pairs complementarily with it, can be located quickly from its coordinates.);
The guide RNAs and their off-target sequence information obtained in this way constitute the CRISPR guide RNA library.
The above steps apply a data-preprocessing optimization. Because the guide RNAs of one category share the same kmer1, and the kmer2 in one trie share the same key sequence, the guide RNAs are classified first and the kmer1 of each category is then compared with the key sequence of each trie. This greatly reduces the computation and time spent comparing guide RNAs with kmers, and avoids comparing every guide RNA with every kmer over its full length one by one. Moreover, when the Hamming distance m between the kmer1 of a category of guide RNAs and the key sequence of a trie is greater than Q, the kmer2 of that category need not be compared with the kmer2 in that trie at all, which reduces computation and time further.
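The two-stage comparison described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: each trie is replaced by a plain list of kmer2 stored under its key sequence, and the function name `off_targets` and the exclusion of the guide RNA itself (distance 0) are assumptions made for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def off_targets(grna, classes, n, q):
    """Return (kmer, distance) pairs within Hamming distance q of grna.

    classes maps a key sequence (kmer1) to the list of kmer2 in that class;
    a plain list stands in for the trie here (assumption for illustration).
    """
    kmer1, kmer2 = grna[:n], grna[n:]
    hits = []
    for key, kmer2_list in classes.items():
        m = hamming(kmer1, key)
        if m > q:
            continue                      # the whole class is skipped
        for other in kmer2_list:
            d2 = hamming(kmer2, other)
            # Remaining budget for kmer2 is q - m; skip the guide RNA itself.
            if d2 <= q - m and m + d2 > 0:
                hits.append((key + other, m + d2))
    return hits
```

With `classes = {"AC": ["GTA", "GTT"], "AG": ["GTA"], "TT": ["GTA"]}`, a call such as `off_targets("ACGTA", classes, n=2, q=1)` never compares any kmer2 of the class keyed by "TT", because its key already differs from "AC" in two positions.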
Further, in Step 1, the multiple sequences of the reference genome are scanned in parallel for kmers that have a standard PAM or a non-standard PAM as prefix or suffix.
Further, scanning the multiple sequences of the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix is treated as one overall task and divided into multiple subtasks, each of which scans one sequence of the reference genome for such kmers. Process and Queue from Python's multiprocessing module are used to simulate a process pool and execute the subtasks in parallel.
Further, in Step 2, the kmer2 of the multiple categories are built into their respective tries in parallel.
Further, building the kmer2 of the multiple categories into their respective tries is treated as one overall task and divided into multiple subtasks, each of which builds the kmer2 of one category into one trie. Process and Queue from Python's multiprocessing module are used to simulate a process pool and execute the subtasks in parallel.
Further, in Step 2, all tries are traversed in parallel to determine, for each kmer2, whether the kmer formed by joining it to the key sequence of its trie occurs more than once in the reference genome.
Further, traversing all tries to make this determination is treated as one overall task and divided into multiple subtasks. Each subtask traverses one trie and determines, for each kmer2 in it, whether the kmer formed by joining the kmer2 to the key sequence of the trie occurs more than once in the reference genome. Process and Queue from Python's multiprocessing module are used to simulate a process pool and execute the subtasks in parallel.
Further, in Step 2, for each kmer in the kmer set: if it has a non-standard PAM as prefix or suffix, or it occurs more than once in the reference genome, it is a non-guide RNA; otherwise it is a candidate guide RNA;
All candidate guide RNAs are classified by kmer1. In parallel over the candidate guide RNAs of all categories, all tries are traversed to determine whether a kmer exists whose Hamming distance to the candidate is less than M. If so, the candidate is a non-guide RNA; otherwise it is a guide RNA.
Further, traversing all tries for the candidate guide RNAs of all categories to determine whether a kmer exists at Hamming distance less than M is treated as one overall task and divided into multiple subtasks. Each subtask handles the candidate guide RNAs of one category: it traverses all tries and determines whether a kmer exists whose Hamming distance to the candidate is less than M. Process and Queue from Python's multiprocessing module are used to simulate a process pool and execute the subtasks in parallel.
Further, for the candidate guide RNAs of one category, the method of traversing all tries and determining whether a kmer exists at Hamming distance less than M is as follows. First, compute the Hamming distance m between the kmer1 of the candidate guide RNAs of the category and the key sequence of each trie, and select the tries for which m is no greater than M. Then, for each candidate guide RNA of the category, search each selected trie for a kmer2 whose Hamming distance to the kmer2 of the candidate is no greater than M-m; if one exists, a kmer exists whose Hamming distance to the candidate guide RNA is less than M.
The above steps apply a data-preprocessing optimization. Because the candidate guide RNAs of one category share the same kmer1, and the kmer2 in one trie share the same key sequence, the candidate guide RNAs are classified first and the kmer1 of each category is then compared with the key sequence of each trie. This greatly reduces the computation and time spent comparing candidate guide RNAs with kmers, and avoids comparing every candidate guide RNA with every kmer over its full length one by one. Moreover, when the Hamming distance m between the kmer1 of a category of candidate guide RNAs and the key sequence of a trie is greater than M, the kmer2 of that category need not be compared with the kmer2 in that trie at all, which reduces computation and time further.
Further, in Step 3, all tries are traversed in parallel for the guide RNAs of all categories to search for kmers whose Hamming distance to the guide RNA is no greater than Q.
Further, in Step 3, this parallel search is treated as one overall task and divided into multiple subtasks. Each subtask handles the guide RNAs of one category: it traverses all tries and searches for kmers whose Hamming distance to the guide RNA is no greater than Q. Process and Queue from Python's multiprocessing module are used to simulate a process pool and execute the subtasks in parallel.
Further, the method of simulating a process pool with Process and Queue from Python's multiprocessing module and executing multiple subtasks in parallel is as follows:
Multiple processes are created with Process from Python's multiprocessing module. The parameters common to all subtasks are passed to each process as its fixed parameters, and the specific parameters of each subtask are put into a queue. Each process takes one set of specific parameters from the queue at a time and executes one subtask according to these specific parameters and the fixed parameters, so that the processes execute subtasks in parallel. After finishing a subtask, a process takes another set of specific parameters from the queue and executes a new subtask with them and the fixed parameters. This continues until all specific parameters have been taken from the queue and all subtasks have been executed. Here the specific parameters of a subtask are those of its parameters that are not shared with the other subtasks.
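The scheme above can be sketched with `multiprocessing.Process` and `multiprocessing.Queue` as follows. This is an illustrative sketch rather than the patent's code: the names `worker`, `run_subtasks` and `count_kmers_with_prefix` are hypothetical, and the explicit "fork" start method (available on Unix only) is an assumption chosen so that the fixed parameters are inherited by the child processes instead of being serialized.

```python
import multiprocessing as mp

def worker(task_queue, result_queue, fixed_params, subtask):
    """Take specific parameters from the queue until a sentinel arrives."""
    while True:
        specific = task_queue.get()
        if specific is None:              # sentinel: no subtasks left
            break
        result_queue.put((specific, subtask(specific, fixed_params)))

def run_subtasks(subtask, specific_params, fixed_params, n_procs=2):
    # Assumption: "fork" start method, so fixed_params is inherited by the
    # children and does not have to be picklable.
    ctx = mp.get_context("fork")
    task_queue, result_queue = ctx.Queue(), ctx.Queue()
    for params in specific_params:
        task_queue.put(params)
    for _ in range(n_procs):
        task_queue.put(None)              # one sentinel per worker
    procs = [ctx.Process(target=worker,
                         args=(task_queue, result_queue, fixed_params, subtask))
             for _ in range(n_procs)]
    for proc in procs:
        proc.start()
    # Collect every result before joining, so no worker blocks on a full pipe.
    results = dict(result_queue.get() for _ in specific_params)
    for proc in procs:
        proc.join()
    return results

def count_kmers_with_prefix(prefix, kmers):
    """Hypothetical subtask: count the kmers that start with `prefix`."""
    return sum(k.startswith(prefix) for k in kmers)
```

A call such as `run_subtasks(count_kmers_with_prefix, ["A", "C"], ["AAT", "ACG", "CGT"])` treats the kmer list as the fixed parameter and each prefix as the specific parameter of one subtask. The sketch does no error handling: if a subtask raises an exception, the parent would wait indefinitely for the missing result.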
The invention is based on the finding, from studying how the GuideScan software builds a CRISPR guide RNA library, that the computation can be parallelized over the data. A method was therefore designed that realizes GuideScan's construction of a CRISPR guide RNA library through multi-process parallel processing. It comprises the following. Determine the targetable genome space in parallel: scan the multiple sequences of the reference genome in parallel and generate the kmer set. Split each kmer in the kmer set into two parts, kmer1 and kmer2, and group the kmer2 whose corresponding kmer1 are identical into one category. Build tries by category: build the kmer2 of each category into one trie (dictionary tree), whose key sequence is the kmer1 corresponding to the kmer2 in it. Filter out non-candidate guide RNAs in parallel: traverse all tries in parallel and, for each kmer2, check whether its corresponding kmer has a non-standard PAM as prefix or occurs more than once in the reference genome; if so, mark it as the kmer2 of a non-candidate guide RNA. The kmers formed by joining the remaining unmarked kmer2 to the key sequences of their tries are the candidate guide RNAs. Filter out, in parallel, the candidate guide RNAs that have similar kmers in the tries: for each candidate guide RNA, try to find in the tries another kmer whose Hamming distance to the candidate is less than M; if one exists, mark the kmer2 of the candidate and of all kmers similar to it as kmer2 of non-candidate guide RNAs. Obtain the guide RNAs in parallel: traverse all tries in parallel and select the kmer2 that are not marked as kmer2 of non-candidate guide RNAs; the kmers formed by joining them to the key sequences of their tries are the final guide RNAs. Compute the off-target sequences of the guide RNAs in parallel: for each guide RNA, search the tries for kmers whose Hamming distance to it is not less than M and not more than Q; the sequences in the reference genome that pair complementarily with these kmers are the off-target sequences of the guide RNA. Finally, generate a BAM file containing the guide RNAs and their off-target sequence information. The invention divides the task of designing a guide RNA library into a series of subtasks according to the classified data, then simulates a process pool and executes these subtasks in parallel with multiple processes. The data are preprocessed before each subtask starts in order to reduce the subsequent computation, which greatly shortens search time and improves search efficiency. The invention improves computational efficiency through the combined strategy of a multi-process structure and a data-preprocessing optimization.
The main feature of the invention is data parallelism, and a parallel design of this kind can also be run in a distributed fashion. Distributed parallelism can use the resources of more machines, but it also incurs more communication overhead. Because Steps 1 and 2 of the method (the first 6 steps in the embodiment) take little time, there is no need to distribute them; doing so would only add communication overhead. Computing the off-target sequence information of the guide RNAs in parallel is the most time-consuming step. The way kmers fall into categories can differ greatly between data sets, so the data volume of each task must be considered and the data assigned to the machines in a balanced way to avoid data skew. In addition, because the trie data structure cannot be serialized, the trie information must be saved to a temporary file, copied to every machine, and read back into a trie on each machine before further parallel processing. Finally, all results are merged. Parallel processing on a distributed cluster can use more computing resources and is not limited to a single machine.
Beneficial effects:
The invention (MultiGuideScan) proposes a CRISPR guide RNA library design method. A kmer set is first generated from the genome sequence. Each kmer is then split into kmer1 and the remainder kmer2; the kmer2 whose corresponding kmer1 are identical are grouped into one category, with that kmer1 as the key sequence of the category. The kmer2 of each category are built into one trie (dictionary tree), whose key sequence is the key sequence of the category. This yields multiple tries, so that the data tasks of filtering out the non-guide RNAs in each trie to obtain the guide RNAs, and of searching all kmers for the off-target sequences of the guide RNAs, are divided into a series of small tasks. These small tasks can be executed in parallel, which effectively reduces the computation time needed to design a CRISPR guide RNA library. In addition, each guide RNA is split into kmer1 and kmer2; the Hamming distance between kmer1 and the key sequence of each trie is computed first, and the tries that do not meet the distance requirement are excluded, which reduces the subsequent kmer2 matching computation when obtaining the off-target sequence information of the guide RNAs. Furthermore, the method simulates a process pool to process the subtasks of the algorithm in parallel, exploiting multiple processors to increase computation speed. The efficient parallel algorithm of the invention makes it feasible to design guide RNA libraries for large genomes.
The method of the invention implements a multi-process parallel version of the GuideScan software in order to speed up the design of CRISPR guide RNA libraries. GuideScan is implemented in Python combined with C++. In CPython, the mainstream Python interpreter, the global interpreter lock (GIL) is a mutex that protects access to Python objects and prevents multiple threads from executing Python bytecode at the same time. Threads in Python therefore cannot be used for parallel computation, and multithreading brings no speedup for CPU-intensive tasks. Python's multiprocessing module, by contrast, allows programs to be written that run in parallel and use all CPU cores. In the method of the invention, the multi-process approach is used for the more computation-intensive tasks to improve the execution efficiency of the algorithm. The multiprocessing module gives each process its own Python interpreter, and each process has its own GIL. The module uses separate memory spaces and multiple CPU cores, bypasses the GIL limitation of CPython, and is easy to use. When the number of tasks is small, they can be handled by creating processes dynamically with Process from the multiprocessing module. When the number of objects to operate on is very large, managing processes by hand becomes cumbersome, and this is where a process pool (Pool) is effective. A process pool provides a specified number of processes as the user requires. When a new task is submitted to the pool and the pool is not yet full, a new process is created directly to execute the task. If the number of processes in the pool has reached the specified maximum, the task waits in a queue; once a task finishes, the idle process takes a new task from the queue and executes it. This uses a limited number of processes efficiently and avoids creating and destroying processes continually. However, a process pool requires its parameters to be serialized before they are transmitted, and the trie structure cannot be serialized, so a process pool cannot be used. The invention therefore simulates the function of a process pool with Queue and Process from the multiprocessing module. The building and searching of each trie is regarded as one task. Multiple tasks are submitted at once and processes are invoked dynamically; when a process is idle and tasks are waiting, the idle process takes a waiting task from the queue and handles it, which improves the execution efficiency of the tasks.
Brief description of the drawings
Fig. 1 is a flow diagram of the embodiment of the invention (hereinafter MultiGuideScan), in which (a) is the workflow of MultiGuideScan; (b) is an example reference genome; (c) shows the kmers extracted from the reference genome and their coordinate information; (d) shows the result of classifying the kmers by their first n bases: the first n bases form kmer1, which serves as the key sequence (key) of the category and is converted to an index number by reading it as a base-4 number, and the remainder is kmer2; (e) shows the tries built from kmer2, which hold each kmer2 together with information such as the occurrence count and coordinates of the corresponding kmer in the genome, with kmer1 as the key sequence (key) of the corresponding trie and the corresponding index number as the index of the trie;
Fig. 2 is a schematic diagram of computing the off-target sequence information of guide RNAs in parallel.
Fig. 3 shows the number of guide RNAs obtained from the yeast data set.
Fig. 4 shows the number of guide RNAs obtained from the Caenorhabditis elegans data set.
Fig. 5 shows the number of guide RNAs obtained from the Drosophila melanogaster data set.
Fig. 6 compares the total running time of MultiGuideScan and GuideScan with Q = 3 and n = 4.
Fig. 7 compares the total running time of MultiGuideScan and GuideScan with Q = 4 and n = 5.
Detailed description of the embodiments
The invention is further described below with reference to an embodiment.
This embodiment provides a CRISPR guide RNA library design method, which comprises the following steps:
Step 1: determine the targetable genome space in parallel.
To generate the guide RNA library the user wants, the method takes the following user input: a reference genome FASTA file, the length of the guide RNA, the standard PAM (protospacer adjacent motif), the position of the PAM relative to the guide RNA target sequence, the non-standard PAMs, and the Hamming distances M and Q. As shown in Fig. 1 (b, c), based on the parameters given by the user, the algorithm scans the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix, together with related information such as their coordinates and orientation (which record the position of the kmer in a reference sequence and the strand it lies on). The scan can be parallelized: the reference genome file contains multiple reference sequences, and the scan of each reference sequence is independent of the others, so the scan of each reference sequence can be treated as one task and these tasks executed in parallel with multiple processes. The kmers, their PAMs and their coordinate information are saved, and the results of all scans are then merged to form the kmer set, which is the targetable genome space. The number of times each kmer occurs in the reference genome is also counted.
Standard and non-standard PAMs are also called canonical (classical) and non-canonical (non-classical) PAMs in some literature. Each PAM consists of s bases and each kmer of N bases, and there are t types of base. In this embodiment s = 3, N = 23 and t = 4, the four base types being A, C, G and T.
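The scan in Step 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the embodiment's code: it only handles a PAM used as a suffix, it treats `N` in a PAM pattern as "any base", and the function names `scan_kmers`, `pam_matches` and `revcomp` and the tuple layout are invented for the example. Both strands are scanned, and the start coordinate is always reported on the forward strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def pam_matches(pattern, bases):
    """'N' in the pattern matches any base; other letters must be equal."""
    return len(pattern) == len(bases) and all(
        p in ("N", b) for p, b in zip(pattern, bases))

def scan_kmers(name, seq, pams, kmer_len=23):
    """Yield (kmer, sequence name, start, strand) for kmers ending in a PAM."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for i in range(len(s) - kmer_len + 1):
            kmer = s[i:i + kmer_len]
            if not set(kmer) <= set("ACGT"):
                continue                  # skip windows with ambiguous bases
            if any(pam_matches(p, kmer[-len(p):]) for p in pams):
                # Convert reverse-strand offsets to forward-strand starts.
                start = i if strand == "+" else len(s) - i - kmer_len
                yield kmer, name, start, strand
```

Because each call works on a single named sequence, one call per reference sequence is a natural unit for the per-sequence subtasks described in Step 1; occurrence counts could then be obtained by feeding the merged kmers into `collections.Counter`.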
Step 2: split and classify the kmers in the kmer set.
As shown in Fig. 1 (d), each kmer is first split into two parts, kmer1 and kmer2. kmer1 has length n (n = Q+1 in this embodiment) and is the sequence of the first n bases of the kmer; kmer2 is the rest of the kmer. The kmer2 whose corresponding kmer1 are identical are then grouped into one category, with that kmer1 as the key sequence of the category. Since sequences are made up of the four bases A, C, G and T, all kmers fall into 4^n categories. For convenience, the bases A, C, G and T in a key sequence are replaced by the digits 0, 1, 2 and 3 to give a base-4 number, which is then converted to a decimal number and used as the index number. Key sequences and index numbers can thus be converted into each other: for example, key sequence "AAAA" corresponds to index number 0, and index number 1 corresponds to key sequence "AAAC".
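The split and the base-4 index conversion described in Step 2 can be sketched as follows. The function names `key_to_index`, `index_to_key` and `classify` are hypothetical and chosen for illustration; the digit assignment A=0, C=1, G=2, T=3 follows the text above.

```python
from collections import defaultdict

BASES = "ACGT"   # digit values 0, 1, 2, 3

def key_to_index(key):
    """Read a key sequence as a base-4 number and return it as an int."""
    index = 0
    for base in key:
        index = index * 4 + BASES.index(base)
    return index

def index_to_key(index, n):
    """Inverse of key_to_index for key sequences of length n."""
    key = []
    for _ in range(n):
        index, digit = divmod(index, 4)
        key.append(BASES[digit])
    return "".join(reversed(key))

def classify(kmers, n):
    """Group kmer2 (the bases after the first n) by the index of kmer1."""
    classes = defaultdict(list)
    for kmer in kmers:
        classes[key_to_index(kmer[:n])].append(kmer[n:])
    return classes
```

Using the integer index rather than the string as the class identifier gives a dense range from 0 to 4^n - 1, which makes it straightforward to enumerate the categories as subtasks.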
Step 3: construct the tries by category.
As shown in Fig. 1(e), one trie is constructed for each class of kmer2 (the retrieval tree structure used in this embodiment is a trie, i.e., a dictionary tree), which yields multiple tries, and the key sequence of each class of kmer2 is used as the key sequence of the corresponding trie. The above steps split the single trie that the original GuideScan method builds from all kmers into multiple smaller tries. The kmer2 in each trie all have the same prefix kmer1 in their corresponding kmers, and the information attached to each sequence, such as its coordinates, is the same as for the original kmer. Each trie is built independently, so the construction of these 4^n tries can be processed in parallel. When the process pool (Pool) method of the Python multiprocessing module is used, the arguments of the method must be serialized before they are passed, and the trie structure cannot be serialized, so the process pool (Pool) method cannot be used. This embodiment therefore simulates the function of a process pool with the Queue and Process structures of the multiprocessing module. The construction of each trie is regarded as one task. Multiple tasks are submitted at the same time and processes are called dynamically: whenever a process is idle and tasks are waiting, the idle process takes a waiting task from the queue and handles it, which improves the efficiency of task execution. In this step, all sequences with a non-standard PAM as prefix or suffix are marked as non-candidate induction RNAs, and sequences with a standard PAM as prefix or suffix are marked as candidate induction RNAs.
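The simulated process pool described above can be sketched with multiprocessing.Process and multiprocessing.Queue. This is only an illustration under stated assumptions: the names worker, handle_task and run_tasks are hypothetical, handle_task is a placeholder for the real per-trie work, and a None sentinel (one per process) is an assumed way of telling workers that the queue is exhausted. The sketch returns small results through a second queue, whereas the embodiment writes results to files.

```python
import multiprocessing as mp


def handle_task(index, fixed_param):
    # Placeholder for the real work on one class, e.g. building one trie.
    return index * fixed_param


def worker(task_queue, result_queue, fixed_param):
    """Keep taking waiting tasks from the queue until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        result_queue.put((task, handle_task(task, fixed_param)))


def run_tasks(tasks, n_procs, fixed_param):
    """Submit all tasks at once and let n_procs processes drain the queue."""
    task_queue, result_queue = mp.Queue(), mp.Queue()
    for task in tasks:
        task_queue.put(task)
    for _ in range(n_procs):
        task_queue.put(None)  # one stop signal per process
    procs = [
        mp.Process(target=worker, args=(task_queue, result_queue, fixed_param))
        for _ in range(n_procs)
    ]
    for proc in procs:
        proc.start()
    # Collect results before joining so that no worker blocks on a full pipe.
    results = dict(result_queue.get() for _ in tasks)
    for proc in procs:
        proc.join()
    return results
```

In a script, run_tasks would be called under an `if __name__ == "__main__":` guard, for example `run_tasks(list(range(4 ** 4)), n_procs=8, fixed_param=1)`, with tasks given as a list so that the number of expected results is known.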
Step 4: filter out non-candidate induction RNAs in parallel.
Based on the kmer set and the number of occurrences of each kmer in the reference genome, if a kmer occurs more than once, the kmer2 corresponding to that kmer is located in its trie and marked as the kmer2 of a non-candidate induction RNA. The tries are then traversed and the remaining kmer2 (that is, the kmer2 not marked as the kmer2 of a non-candidate induction RNA) are selected and each joined to the key sequence (kmer1) of the trie that contains it, giving the candidate induction RNAs. The filtering of the individual tries is again mutually independent, so a process pool can be simulated in the same way: the filtering of each trie is regarded as one task, and these tasks are executed in parallel. The candidate induction RNAs selected by one task all come from the same trie and therefore have the same key sequence.
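The frequency filter of this step can be illustrated on toy data. The sketch below is an assumption-laden simplification: it groups kmer2 in plain dictionaries instead of tries, ignores the PAM marking of step 3, and the function name candidates_by_class is hypothetical.

```python
from collections import Counter, defaultdict


def candidates_by_class(kmers, n):
    """Drop kmers that occur more than once, then group the survivors by kmer1.

    Returns a dict mapping each key sequence (kmer1, the first n bases) to the
    sorted list of candidate sequences rebuilt as kmer1 + kmer2.
    """
    counts = Counter(kmers)
    classes = defaultdict(list)
    for kmer, count in counts.items():
        if count > 1:
            continue  # repeated in the genome: non-candidate
        classes[kmer[:n]].append(kmer[n:])
    return {key: [key + kmer2 for kmer2 in sorted(kmer2s)]
            for key, kmer2s in classes.items()}
```

Because every class is handled independently, each key of the returned dictionary corresponds to one task that could be handed to a separate process.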
Step 5: filter out, in parallel, the candidate induction RNAs that have similar kmers in the tries.
For each candidate induction RNA, an attempt is made to find in the tries another kmer that is similar to the candidate induction RNA, i.e., a kmer whose Hamming distance to the candidate induction RNA is less than M. If such a kmer exists, the candidate induction RNA and all kmers similar to it are marked as non-candidate induction RNAs, that is, the kmer2 of these kmers are marked as the kmer2 of non-candidate induction RNAs. This guarantees that the Hamming distance between every candidate induction RNA and any other kmer is greater than or equal to M, so that the candidate induction RNAs have no similarity to one another.
When multiple processes are used to filter out, in parallel, the candidate induction RNAs that have similar kmers in the tries, the procedure is as follows:
Step 5.1: all candidate induction RNAs are classified according to their kmer1, and candidate induction RNAs with the same kmer1 are placed in one class. The task (filtering out the candidate induction RNAs that have similar kmers in the tries) is divided into small tasks according to the classes of the candidate induction RNAs. Each small file containing one class of candidate induction RNAs, together with its index number (the index number converted from the kmer1 of the candidate induction RNAs in the small file), is put into a queue that the processes can access. Each entry in the queue, consisting of a small file containing one class of candidate induction RNAs and the corresponding index number, constitutes the characteristic parameters of one subtask, where the characteristic parameters of a subtask are those of its parameters that differ from the parameters of the other subtasks;
Step 5.2: multiple processes are created with the Process method of the multiprocessing module; the number of processes can be set by the user. The fixed parameters of every process are all the tries and the Hamming distance M. An idle process takes one group of characteristic parameters from the queue and executes the corresponding task according to the fixed parameters and the characteristic parameters: it first computes, from the index number, the kmer1 of the candidate induction RNAs in the corresponding file, then computes the Hamming distance m between this kmer1 and the key sequence of each trie, and stores the results in a list;
Step 5.3: looping over each candidate induction RNA and each trie, the corresponding Hamming distance m is first taken from the list. If m is greater than or equal to M, the trie is ignored directly (if the Hamming distance m between the kmer1 of the candidate induction RNA and the key sequence of the trie is greater than or equal to M, the kmer2 need not be compared, which reduces the amount of computation). Otherwise, an attempt is made to find in that trie the kmer2 whose Hamming distance to the kmer2 of the candidate induction RNA is less than M-m; if such kmer2 exist, each of them and the index number of the trie that contains it are written to a temporary file;
Step 5.4: when all tasks have finished, according to the kmer2 and index numbers in all the generated temporary files, the corresponding kmer2 in the trie with the corresponding index number is marked as the kmer2 of a non-candidate induction RNA.
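The list of prefix distances computed in step 5.2 and the skip rule of step 5.3 can be sketched as follows. This is a minimal illustration, assuming that key sequences are enumerated in index-number order (A, C, G, T as digits 0 to 3); the function names hamming, prefix_distances and tries_worth_searching are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def prefix_distances(kmer1):
    """Distance m from kmer1 to each of the 4**n key sequences, in index order."""
    n = len(kmer1)
    return [hamming(kmer1, "".join(key)) for key in product("ACGT", repeat=n)]


def tries_worth_searching(kmer1, M):
    """Index numbers (with their m) of the tries that cannot be skipped.

    A trie whose key sequence is already at distance m >= M from kmer1 cannot
    contain a kmer closer than M, so only tries with m < M are kept.
    """
    return [(index, m) for index, m in enumerate(prefix_distances(kmer1)) if m < M]
```

The list is computed once per class, since all candidate induction RNAs of a class share the same kmer1, and is then reused for every member of the class.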
Given a kmer2, the kmer2 whose Hamming distance to it is less than M-m are searched for in a trie as follows:
A: the Hamming distance h is first recorded as 0. Starting from layer 1 of the trie, the base of the 1st node is compared with the 1st base of the kmer2; if they are the same, h is unchanged, otherwise h is increased by 1. The comparison then continues with the 1st child node of that node and the next base of the kmer2, and so on;
B: if h is still less than M-m when a leaf node is reached, the kmer corresponding to the kmer2 of that branch is a similar sequence, and its relevant information is recorded. The comparison continues with the sibling nodes of the leaf; when all sibling nodes have been compared, the search backs up to the parent node and continues with the sibling nodes of the parent node;
C: when the j-th node of layer i is compared with the i-th base of the kmer2, if h equals M-m, none of the kmer2 on that branch is a similar sequence being sought, and the comparison continues with the (j+1)-th node of layer i and the i-th base of the kmer2; if h is less than M-m, the comparison continues with the 1st child node of that node and the (i+1)-th base of the kmer2;
D: and so on, until all tries have been traversed completely, which gives all similar sequences of the kmer2 and their relevant information.
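The depth-first search with pruning described in A to D can be sketched on a trie stored as nested dictionaries. This is an illustrative sketch rather than the patent's data structure: build_trie and search_within are hypothetical names, all stored sequences are assumed to have the same length as the query, and limit plays the role of M-m.

```python
def build_trie(sequences):
    """Store equal-length sequences in a trie made of nested dicts."""
    root = {}
    for seq in sequences:
        node = root
        for base in seq:
            node = node.setdefault(base, {})
    return root


def search_within(trie, query, limit):
    """Return (sequence, h) for every stored sequence with Hamming distance h < limit."""
    hits = []

    def descend(node, depth, h, path):
        if depth == len(query):  # leaf reached with h still below the limit
            hits.append(("".join(path), h))
            return
        for base, child in sorted(node.items()):
            new_h = h + (base != query[depth])
            if new_h >= limit:
                continue  # prune this branch and move on to the sibling node
            descend(child, depth + 1, new_h, path + [base])

    descend(trie, 0, 0, [])
    return hits
```

Pruning at the first node where h reaches the limit is what keeps the search from visiting every stored kmer2: the whole subtree below that node is skipped.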
Step 6: obtain the induction RNAs in parallel. In the same way as in step 4, all tries are traversed in parallel with multiple processes and all candidate induction RNAs are selected as the final set of induction RNAs;
Step 7: compute the off-target sequences of the induction RNAs in parallel;
For all induction RNAs, the tries are searched for off-target sequences that are mismatched with them, where the Hamming distance is required to be greater than or equal to M and less than or equal to Q. Fig. 2 is a schematic diagram of this step, in which the task is divided into 4^n small data tasks that are processed in parallel.
When multiple processes are used to compute the off-target sequence information of the induction RNAs in parallel, the procedure is as follows:
Step 7.1: all induction RNAs are classified according to their kmer1, and induction RNAs with the same kmer1 are placed in one class. Each class is stored in a small file together with its corresponding index number (the index number converted from the kmer1 of the induction RNAs in the small file), and a corresponding small SAM file is created for each class of induction RNAs, into which the off-target sequence information is written. Each small file containing one class of induction RNAs, the corresponding index number and the small SAM file are put into a queue that the processes can access. Each entry in the queue, consisting of a small file containing one class of induction RNAs, the corresponding index number and the small SAM file, constitutes the characteristic parameters of one subtask, where the characteristic parameters of a subtask are those of its parameters that differ from the parameters of the other subtasks;
Step 7.2: multiple processes are created with the Process method of the multiprocessing module; the number of processes can be set by the user. The fixed parameters of every process are all the tries and the Hamming distance Q. An idle process takes one group of characteristic parameters from the queue and executes the corresponding task according to the fixed parameters and the characteristic parameters: it first computes, from the index number, the kmer1 of the induction RNAs in the corresponding file, then computes the Hamming distance m between this kmer1 and the key sequence of each trie, and stores the results in a list;
Step 7.3: each idle process takes the small file of one class of induction RNAs from the queue and reads from it the induction RNA sequences and their corresponding index number. The index number is converted into a key sequence, the Hamming distance m between this key sequence and the key sequence of each trie is computed, and the results are put into a list for later access.
Step 7.4: for each induction RNA, its relevant information is recorded and it is then cut into kmer1 and kmer2. For each trie, the Hamming distance m between the kmer1 and the key sequence of the trie is obtained from the saved list according to the index number of the file that contains the induction RNA. If m is greater than Q, the trie is ignored (if the Hamming distance m between the kmer1 of the induction RNA and the key sequence of the trie is greater than Q, the kmer2 need not be compared, which reduces the amount of computation); otherwise the trie is searched for the kmer2 whose Hamming distance to the kmer2 of the induction RNA is not more than Q-m. These kmer2 are each joined to the key sequence of the trie that contains them to form multiple kmers, and the sequences in the reference genome that are complementary to these kmers are the off-target sequences of the induction RNA. When all tries have been searched, all the off-target sequences obtained and their relevant information can be converted to hexadecimal form.
Step 7.5: the relevant information of the induction RNA and the corresponding off-target sequence information are written together to the small SAM file. Afterwards, all small SAM files are converted to BAM files, the binary format of SAM, and all small BAM files are then merged and indexed so that they can be accessed quickly with the Samtools tool.
Given the kmer2 of an induction RNA, the kmer2 whose Hamming distance to it is not more than Q-m are searched for in a trie as follows:
A: the Hamming distance h is first recorded as 0. Starting from layer 1 of the trie, the base of the 1st node is compared with the 1st base of the kmer2; if they are the same, h is unchanged, otherwise h is increased by 1. The comparison then continues with the 1st child node of that node and the next base of the kmer2, and so on.
B: if h is still less than or equal to Q-m when a leaf node is reached, the kmer2 of that branch is joined to the key sequence of the trie that contains it to form a kmer; the sequence in the reference genome that is complementary to this kmer is an off-target sequence of the induction RNA, and its relevant information is recorded. The comparison continues with the sibling nodes of the leaf; when all sibling nodes have been compared, the search backs up to the parent node and continues with the sibling nodes of the parent node.
C: when the j-th node of layer i is compared with the i-th base of the kmer2, if h is greater than Q-m, the branch is ignored and the comparison continues with the (j+1)-th node of layer i and the i-th base of the kmer2; otherwise, the comparison continues with the 1st child node of that node and the (i+1)-th base of the kmer2.
D: and so on, until all tries have been traversed completely, which gives all off-target sequences of the induction RNA sequence and their relevant information.
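The two-level search of step 7 (first the prefix distance m, then the remaining budget Q-m for kmer2) can be sketched as below. This is a simplified illustration under stated assumptions: each class is a plain set of kmer2 scanned linearly instead of a trie, the complement lookup in the reference genome is omitted, and the names hamming and off_target_kmers are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def off_target_kmers(guide, classes, n, M, Q):
    """Find kmers whose Hamming distance d to the guide satisfies M <= d <= Q.

    classes maps each key sequence (kmer1 of length n) to its kmer2 collection.
    """
    kmer1, kmer2 = guide[:n], guide[n:]
    hits = []
    for key in sorted(classes):
        m = hamming(kmer1, key)
        if m > Q:
            continue  # the whole class is too far away: skip it without comparing kmer2
        budget = Q - m  # mismatches still allowed in the kmer2 part
        for other in sorted(classes[key]):
            d2 = hamming(kmer2, other)
            if d2 <= budget and m + d2 >= M:
                hits.append((key + other, m + d2))
    return hits
```

The lower bound M excludes the induction RNA itself (distance 0); after step 5 no other kmer lies closer than M to a retained induction RNA.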
Evaluation of experimental results
To evaluate the experimental performance of the method of the invention, it is compared with the performance of the original method GuideScan. This study uses benchmark datasets of three species from the UCSC genome browser (http://hgdownload.soe.ucsc.edu) to test the performance of the method: the yeast (Yeast), Caenorhabditis elegans (C. elegans) and Drosophila melanogaster (D. melanogaster) datasets. The input files are the reference genome FASTA files of the three species. A genome FASTA file is generally divided into the following parts:
(1) Assembled chromosomes: sequences whose names start with chr1.., chrX, chrY and chrM.
(2) Unlocalized sequences: sequences with the suffix _random, meaning that the chromosome they belong to is known but their orientation and order are not.
(3) Unplaced sequences: sequences with the prefix chrUn_, meaning that it is not known which chromosome they belong to.
The main features of the datasets used in this embodiment are shown in Table 1:
Table 1. Main features of the benchmark datasets
This embodiment implements the original GuideScan algorithm in parallel by means of multiple processes, and its computational logic is essentially the same as that of the original GuideScan algorithm. The experimental results of this embodiment are therefore identical to those of the original GuideScan algorithm.
Figs. 3, 4 and 5 show the induction RNAs obtained from the datasets of the three species yeast, Caenorhabditis elegans and Drosophila melanogaster, respectively. The abscissa indicates the name label of each sequence in the reference genome, and the ordinate indicates the number of induction RNAs found in the first 50 kb of each sequence (all bases are used if the sequence is shorter).
Comparative analysis of computation time
In this embodiment, the time performance of the method of this embodiment and of the original GuideScan algorithm was tested on a Linux computer with 40 Xeon E5-2630 v4 2.20 GHz CPUs and 128 GB of memory. Two configurations were used: 1) parameter M is 2, Hamming distance Q is 3, n is 4, the induction RNA length is 20, the standard PAM is "NGG" and the non-standard PAM is "NAG"; 2) parameter M is 2, Hamming distance Q is 4, n is 5, the induction RNA length is 20, the standard PAM is "NGG" and the non-standard PAM is "NAG". The datasets involved are described in Table 1.
Table 2 describes the steps of the method of this embodiment (MultiGuideScan), in which step 2 is an additional step compared with GuideScan. Apart from step 2, all remaining steps have been parallelized.
Table 2. Description of the steps of MultiGuideScan
Tables 3 and 4 compare, step by step, the running time of the method of this embodiment (MultiGuideScan) and of GuideScan under configurations 1) and 2), respectively. In step 1, because the number of reference genome sequences is limited, the computation speed levels off as the number of processes increases; step 3 also cannot scale further because its IO and communication overheads are large; step 7 has the largest time overhead because its computational complexity is the highest.
Table 3. Step-by-step running time comparison of MultiGuideScan (Q=3, n=4)
Table 4. Step-by-step running time comparison of MultiGuideScan (Q=4, n=5)
Figs. 6 and 7 compare the total running time of the method of this embodiment (MultiGuideScan) and of GuideScan under configurations 1) and 2), respectively. As the figures show, the computation time decreases steadily as the number of processes increases. In this method, the induction RNAs and the tries are classified according to kmer1, and the Hamming distance between the kmer1 of each class of induction RNAs and the key sequence of each trie is computed first. Because the induction RNAs of one class all have the same kmer1 and the kmer2 in one trie all have the same key sequence, this greatly reduces the amount of computation needed to compare induction RNAs with kmers and avoids a full-length comparison of every induction RNA with every kmer one by one. Moreover, when the Hamming distance m between the kmer1 of a class of induction RNAs and the key sequence of a trie is greater than Q, the kmer2 of that class of induction RNAs need not be compared with the kmer2 in that trie at all, which reduces the amount of computation again. Therefore, when Q is 3 and n is 4, the time performance of MultiGuideScan is much better than that of GuideScan even with a single process, reducing the time overhead by about one-fold. As shown in Fig. 6, 2 processes give a 3-fold speedup, 4 processes a 5-fold speedup, 8 processes a 6- to 8-fold speedup, 16 processes an 8- to 10-fold speedup, and 32 processes a 9- to 12-fold speedup.
When Q is 4 and n is 5, the induction RNAs and the tries are divided into 4^5, that is, 1024 classes. In theory the speedup should be similar to that of configuration 1), but in practice the number of tasks grows too much and the IO and communication overheads grow considerably with it, so that performance with a single process is not improved. As the number of processes increases, however, the speedup becomes more and more significant. As shown in Fig. 7, 2 processes give a 1.5-fold speedup, 4 processes a 2.6-fold speedup, 8 processes a 4.5-fold speedup, 16 processes a 7-fold speedup, and 32 processes a 9- to 10-fold speedup.
In addition, the algorithm contains many special parts that cannot be processed in parallel, as well as considerable IO overhead, communication overhead and other overheads. According to Amdahl's law, parallelizing a single program cannot speed it up without limit. As the number of processes increases, scalability slowly decreases.
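Amdahl's law bounds the speedup by the serial share of the program: if a fraction f of the running time can be parallelized over p processes, the speedup is 1 / ((1 - f) + f / p). The sketch below evaluates this formula; the function name amdahl_speedup is hypothetical, and the value f = 0.9 used in the usage note is an arbitrary illustrative number, not a figure measured in this document.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Upper bound on speedup when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)
```

For example, with a hypothetical f = 0.9 the bound for 32 processes is below 8, and no number of processes can push it past 1 / (1 - f) = 10, which is the kind of flattening the paragraph above describes.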
It should be emphasized that the examples described in the present invention are illustrative rather than restrictive. The present invention is therefore not limited to the examples described in the specific embodiments; any other embodiment that a person skilled in the art derives from the technical solution of the present invention without departing from the concept and scope of the invention, whether by modification or by substitution, likewise falls within the scope of protection of the present invention.
Claims (10)
1. A CRISPR induction RNA library design method, characterized by comprising the following steps:
Step 1: scanning the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix, to form a kmer set;
Step 2: cutting each kmer in the kmer set into two parts, kmer1 and kmer2, where kmer1 is the sequence consisting of its first n bases; placing the kmer2 whose corresponding kmer1 are identical in one class, and taking the kmer1 corresponding to the kmer2 of a class as the key sequence of that class of kmer2; then building the kmer2 of the same class into the same trie, thereby obtaining multiple tries, the key sequence of each trie being the key sequence of the kmer2 of the corresponding class;
for each kmer in the kmer set, if it has a non-standard PAM as prefix or suffix, or it occurs more than once in the reference genome, or there exists a kmer whose Hamming distance to it is less than M, it is a non-induction RNA; otherwise it is an induction RNA;
Step 3: classifying all induction RNAs according to kmer1 and, for the induction RNAs of every class, traversing all tries to search for kmers whose Hamming distance to them is not more than Q; wherein the method of traversing all tries for the induction RNAs of one class to search for kmers whose Hamming distance to them is not more than Q is specifically: first computing the Hamming distance m between the kmer1 of the induction RNAs of the class and the key sequence of each trie, and finding the tries whose key sequence has a Hamming distance of not more than Q to the kmer1 of the candidate induction RNAs of the class; then, for each candidate induction RNA of the class, searching in each of the tries found for the kmer2 whose Hamming distance to the kmer2 of the candidate induction RNA is not more than Q-m; joining these kmer2 respectively to the key sequence of the trie that contains them to form multiple kmers, the sequences in the reference genome that are complementary to these kmers being the off-target sequences of the induction RNA;
the induction RNAs thus obtained and their off-target sequence information constitute the CRISPR induction RNA library.
2. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 1, the multiple sequences of the reference genome are scanned in parallel for kmers that have a standard PAM or a non-standard PAM as prefix or suffix.
3. The CRISPR induction RNA library design method according to claim 2, characterized in that scanning the multiple sequences of the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix is taken as one overall task and divided into multiple subtasks, each subtask scanning one sequence of the reference genome for kmers that have a standard PAM or a non-standard PAM as prefix or suffix; the Process and Queue methods of the Python multiprocessing module are used to simulate the function of a process pool and execute the multiple subtasks in parallel.
4. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 2, the kmer2 of the multiple classes are built into the multiple tries in parallel.
5. The CRISPR induction RNA library design method according to claim 4, characterized in that building the kmer2 of the multiple classes into the multiple tries is taken as one overall task and divided into multiple subtasks, each subtask building the kmer2 of one class into one trie; the Process and Queue methods of the Python multiprocessing module are used to simulate the function of a process pool and execute the multiple subtasks in parallel.
6. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 2, for each kmer in the kmer set, if it has a non-standard PAM as prefix or suffix or it occurs more than once in the reference genome, it is a non-induction RNA; otherwise it is a candidate induction RNA;
all candidate induction RNAs are classified according to kmer1 and, in parallel for the candidate induction RNAs of every class, all tries are traversed to determine whether there exists a kmer whose Hamming distance to the candidate induction RNA is less than M; if so, it is a non-induction RNA, otherwise it is an induction RNA.
7. The CRISPR induction RNA library design method according to claim 6, characterized in that traversing all tries, in parallel for the candidate induction RNAs of every class, to determine whether there exists a kmer whose Hamming distance to them is less than M is taken as one overall task and divided into multiple subtasks, each subtask traversing all tries for the candidate induction RNAs of one class to determine whether there exists a kmer whose Hamming distance to them is less than M; the Process and Queue methods of the Python multiprocessing module are used to simulate the function of a process pool and execute the multiple subtasks in parallel.
8. The CRISPR induction RNA library design method according to claim 1, characterized in that, in step 3, all tries are traversed in parallel for the induction RNAs of every class to search for kmers whose Hamming distance to them is not more than Q.
9. The CRISPR induction RNA library design method according to claim 8, characterized in that, in step 3, traversing all tries in parallel for the induction RNAs of every class to search for kmers whose Hamming distance to them is not more than Q is taken as one overall task and divided into multiple subtasks, each subtask traversing all tries for the induction RNAs of one class to search for kmers whose Hamming distance to them is not more than Q; the Process and Queue methods of the Python multiprocessing module are used to simulate the function of a process pool and execute the multiple subtasks in parallel.
10. The CRISPR induction RNA library design method according to any one of claims 3, 5, 7 and 9, characterized in that the method of using the Process method and the Queue method of the Python multiprocessing module to simulate the function of a process pool and execute multiple subtasks in parallel is specifically:
creating multiple processes with the Process method of the Python multiprocessing module; taking the parameters common to all subtasks as the fixed parameters of each process, and putting the characteristic parameters of each subtask into a queue; each process takes one group of characteristic parameters from the queue at a time and executes one subtask according to these characteristic parameters and the fixed parameters; the multiple processes execute multiple subtasks in parallel; after a process has executed a subtask, it takes another group of characteristic parameters from the queue and executes a new subtask according to that group of characteristic parameters and the fixed parameters, until all characteristic parameters in the queue have been taken out and all subtasks have been executed;
wherein the characteristic parameters of a subtask are those of its parameters that differ from the parameters of the other subtasks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910712069.XA CN110322927B (en) | 2019-08-02 | 2019-08-02 | CRISPR (clustered regularly interspaced short palindromic repeats) induced RNA (ribonucleic acid) library design method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110322927A (en) | 2019-10-11 |
CN110322927B (en) | 2021-04-09 |
Family
ID=68123517
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910712069.XA Active CN110322927B (en) | 2019-08-02 | 2019-08-02 | CRISPR (clustered regularly interspaced short palindromic repeats) induced RNA (ribonucleic acid) library design method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110322927B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110322927B (en) | 2021-04-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||