CN106446600B - A kind of design method of the sgRNA based on CRISPR/Cas9 - Google Patents

A design method for sgRNA based on CRISPR/Cas9

Info

Publication number
CN106446600B
Authority
CN
China
Prior art keywords
sgrna
cas9
model
value
crispr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610341946.3A
Other languages
Chinese (zh)
Other versions
CN106446600A (en)
Inventor
刘琦
啜国晖
陈亚男
闫纪芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201610341946.3A
Publication of CN106446600A
Application granted
Publication of CN106446600B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a method for designing sgRNAs based on CRISPR/Cas9. The method comprises the following steps: obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values; building a personalized sgRNA design model; measuring the quality of the personalized sgRNA design model with the NDCG algorithm and updating the database; and designing sgRNAs and reporting an assessment score for each sgRNA. Compared with the prior art, the invention offers high accuracy, a more complete feature set, a wide range of application, and a broad range of analyzable data.

Description

A design method for sgRNA based on CRISPR/Cas9
Technical field
The present invention relates to the field of gene editing research, and in particular to a method for designing sgRNAs based on CRISPR/Cas9 gene editing technology.
Background art
With the development of molecular biology, our understanding of the constituent elements of life has deepened, but much about the mechanisms of life processes, and especially the therapeutic mechanisms of certain diseases, remains poorly understood. Studying the relationship between genes and phenotypes, and the interactions between genes, urgently requires an engineering technique that can rapidly knock out and insert genes in vivo. The CRISPR/Cas9 system emerged to meet this need.
The CRISPR/Cas9 system (clustered regularly interspaced short palindromic repeats/CRISPR-associated protein 9) is an easy-to-use, widely applicable gene editing tool. The system consists mainly of a nuclease (Cas9) and a guide RNA (sgRNA) that directs target recognition. The sgRNA recognizes the target site through complementary base pairing and recruits Cas9 to cleave it, producing a double-strand break and thereby enabling gene editing at the DNA level. Because it is broadly applicable, convenient, and fast, it was quickly adopted in many areas, with particular advantages in building cancer models and exploring gene therapy.
However, researchers have found that different sgRNAs designed for the same gene in the same cell differ greatly in cleavage efficiency. If an efficient sgRNA cannot be designed, the only remedy is to raise the concentration, which introduces a large amount of unwanted genetic material into the cell and produces a high proportion of off-target events, greatly hindering research. Designing sgRNAs with high cleavage efficiency is therefore very important for genetic research.
At present there are nearly 30 sgRNA design tools, which fall into two broad classes. The first class derives rules for sgRNAs from experiments, for example that one end of the paired sgRNA sequence must carry a PAM sequence, that the 5' end should be GG, that GC content should stay around 60%, and that the seed sequence does not tolerate mismatches, and then screens candidates directly against these conditions. The second class mainly uses statistical methods to compute sgRNA specificity by assigning a weight to each base, as in CRISPR Design. Both classes build a single general-purpose model. However, because different species and different cells are highly heterogeneous, the predictions of existing software are not very good, and because heterogeneity across experimental conditions also affects sgRNA cleavage efficiency, the assessment accuracy of general models is relatively low.
Therefore, taking into account the heterogeneity of data across platforms and species, building personalized models from platform-specific or species-specific data to improve the specificity and efficiency of sgRNAs is extremely important for research on the off-target problem of the CRISPR/Cas9 system.
Summary of the invention
The purpose of the present invention is to address the above problems by providing a CRISPR/Cas9-based sgRNA design method with high accuracy and a wide range of application.
To achieve this purpose, the present invention provides a method for designing sgRNAs based on CRISPR/Cas9, comprising the following steps:
1) obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values, specifically:
11) obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values from the literature;
12) obtaining sgRNAs from the SRA database and calculating the corresponding Cas9 cleavage efficiency values;
13) classifying the data obtained in steps 11) and 12) under different reference genomes according to species, cell type and experimental conditions, and listing under each reference genome a table whose first column is the sgRNA name, whose second column is the sgRNA sequence, and whose third column is the corresponding Cas9 cleavage efficiency;
2) building a personalized sgRNA design model, specifically:
21) extracting, as required, the sequence information of the sgRNAs obtained in step 1) from the corresponding reference genome;
22) binary-encoding the sgRNA sequence information extracted in step 21) according to the binary rule;
23) for the sgRNAs obtained in step 21), determining the data type of the Cas9 cleavage efficiency, proceeding to step 24) if it is numeric and to step 25) if it is categorical;
24) performing feature extraction on the sgRNA sequence information encoded in step 22) with a Lasso model, and building the personalized sgRNA design model by standard linear regression;
25) performing feature selection on the sgRNA sequence information encoded in step 22) with L1 regularization in binary logistic regression, and then building the personalized sgRNA design model with L2 regularization in binary logistic regression;
3) measuring the quality of the personalized sgRNA design model built in step 2) with the NDCG algorithm and updating the SRA database, specifically:
31) calculating the NDCG value of the personalized sgRNA design model built in step 2);
32) determining whether the existing SRA database already holds a corresponding personalized sgRNA model; if not, adding the model to the SRA database; if so, proceeding to step 33);
33) comparing the personalized sgRNA model with the corresponding sgRNA model in the SRA database and storing the one with the larger NDCG value in the SRA database;
4) designing sgRNAs and reporting an assessment score for each sgRNA, specifically:
41) selecting a suitable reference genome from the SRA database according to the genome region given by the user, and searching it for all sgRNAs that satisfy the design rule, which become the designed sgRNAs;
42) assessing the sgRNAs designed in step 41) with the personalized sgRNA model built in step 2).
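The branch in step 23) can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names and the returned pipeline labels are hypothetical, and the rule "only 0/1 values means categorical" is an assumption of the sketch, not something the text specifies.

```python
def score_type(scores):
    """Return 'categorical' for pure 0/1 labels and 'numeric' otherwise.

    Assumption of this sketch: a score column holding only 0 and 1 is a
    binary effective/ineffective label; anything else is a numeric
    cleavage efficiency.
    """
    values = {float(s) for s in scores}
    return "categorical" if values <= {0.0, 1.0} else "numeric"


def choose_pipeline(scores):
    """Map the score type to the modelling route of step 24) or 25)."""
    if score_type(scores) == "numeric":
        return "lasso_selection + linear_regression"    # step 24)
    return "l1_logistic_selection + l2_logistic_model"  # step 25)


print(choose_pipeline([1, 1, 0]))
print(choose_pipeline([0.2345, 0.7846, 0.2367]))
```

A numeric data set that happens to contain only the values 0 and 1 would be misrouted by this rule, so a real implementation would more likely take the data type as explicit input.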
Preferably, calculating the corresponding Cas9 cleavage efficiency value in step 12) specifically comprises:
121) aligning the sgRNAs and the corresponding next-generation sequencing reads to the reference genome;
122) extracting the reads that contain an sgRNA;
123) determining whether an insertion or deletion occurred in the DNA at the cut site, and whether that insertion or deletion is a frameshift mutation;
124) computing the frameshift mutation rate of each sgRNA, specifically: frameshift mutation rate = number of reads that contain the sgRNA and carry a frameshift mutation / total number of reads that contain the sgRNA;
125) taking the frameshift mutation rate calculated in step 124) as the Cas9 cleavage efficiency value.
Preferably, the sequence information of the sgRNA in step 21) includes the sgRNA sequence, the motif required for the sgRNA to recognize DNA (PAM), and the bases upstream and downstream of the sgRNA spacer; the number of upstream and downstream bases is either the platform default or a value set by the user.
Preferably, the binary rule in step 22) is: A corresponds to 1000, C to 0100, G to 0010, T to 0001, and N to 0000.
Preferably, the feature extraction with the Lasso model in step 24) selects feature vectors by extracting non-zero weights, specifically:
Wherein w is the weight vector of the features being estimated, x is the feature vector of the selected sgRNA, n is the number of sgRNAs, and y is the Cas9 cleavage efficiency value corresponding to each sgRNA; α is a constant and ||w||_1 is the L1 norm of the parameter vector. The Lasso model solves the least-squares loss function with the added penalty α||w||_1; under this regularization only the features with non-zero weights are extracted.
Preferably, the L1 regularization in step 25) is specifically:
Wherein w and c are the weights and intercept of the features being estimated, X is the binary matrix of the encoded sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage efficiency value corresponding to each sgRNA.
Preferably, the L2 regularization is specifically:
Preferably, calculating the NDCG value of the personalized sgRNA design model in step 31) is specifically:
Wherein DCG is the value computed from the predicted ranking, IDCG is the ideal DCG computed from the true ranking, and rel_i is the relevance value at position i of the predicted ranking.
Preferably, the design rule in step 41) is specifically:
20bp+PAM
Wherein bp is the unit of DNA length, PAM is the motif required for the sgRNA to recognize DNA, and "+" indicates that the DNA is 20 bp long and is accompanied by the PAM motif required for the sgRNA to recognize DNA.
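The 20bp+PAM design rule can be sketched as a simple scan over a genome region. The sketch assumes an NGG PAM (consistent with the example sequences in this document, which end in GG, but not stated as the rule), scans only the given strand, and uses the hypothetical name `find_candidates`.

```python
import re

# Lookahead pattern so that overlapping candidate sites are all reported.
# Assumption: the PAM is NGG; the text only says "PAM".
SITE = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")


def find_candidates(region):
    """Yield (0-based start, 23-nt site) for each 20 bp + NGG match."""
    for match in SITE.finditer(region.upper()):
        yield match.start(), match.group(1)


region = "A" * 20 + "TGG" + "CCC"
for start, site in find_candidates(region):
    print(start, site[:20], site[20:])
```

Sites on the reverse strand would need the same scan on the reverse complement, which is left out here for brevity.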
Compared with the prior art, the invention has the following advantages:
(1) A personalized strategy is used for different species and different cell types, and the models are built with data-driven machine learning algorithms, which significantly improves assessment accuracy.
(2) A new encoding rule is used that is not limited to the PAM and the spacer, so the features found are more complete.
(3) Users are given a procedure for building their own models, so the range of application is wider and is not limited to the species already in the database.
(4) The OTF rate of NGS data is used as the cleavage rate, which expands the range of data that can be analyzed.
(5) Users can upload their own data to expand the database, which accelerates data accumulation and helps resolve the current difficulty that insufficient data prevents optimal sgRNAs from being designed.
Description of the drawings
Fig. 1 is a flow chart of the method for building and evaluating the personalized sgRNA model;
Fig. 2 is a flow chart of the method for designing and assessing sgRNAs.
Specific embodiment
The present invention is described in detail below with reference to the drawings and a specific embodiment. The embodiment is implemented on the basis of the technical solution of the invention, and a detailed implementation and specific operating procedure are given, but the scope of protection of the invention is not limited to the following embodiment.
Abbreviations:
CRISPR: clustered regularly interspaced short palindromic repeats
Cas9: the CRISPR-associated enzyme of the type II CRISPR system
NGS: next-generation sequencing
PAM: protospacer-adjacent motif, the motif required for the sgRNA to recognize DNA
sgRNA: the RNA that plays the guiding role in the CRISPR/Cas9 system
indel: an insertion or deletion in DNA caused by CRISPR/Cas9 editing
spacer: the roughly 20 bases of the sgRNA that pair with the target by base complementarity
OTF: out of frame, i.e. a frameshift mutation
read: the sequence obtained from one reaction in high-throughput sequencing
This embodiment provides a method for designing sgRNAs based on CRISPR/Cas9, in which a dedicated personalized sgRNA design model is built for each species and cell type, so that models can be built and sgRNAs designed according to different needs. It comprises the following four steps:
(1) Data collection: data collected from the literature generally fall into two types: sgRNAs with numeric cleavage efficiencies, or sgRNAs with categorical cleavage efficiencies (for example a binary effective/ineffective label). NGS data downloaded from the SRA database are numeric only. Because NGS data, once the OTF rate has been computed, are processed in the same way as numeric data from the literature, this embodiment illustrates only two kinds of data: categorical data from the literature and NGS data.
Categorical data: for categorical data collected from the literature, this embodiment codes effective as 1 and ineffective as 0, and organizes the data in the format of Table 1.
Table 1
sgID Sequence Score
sgRNA_1 CGCAACCTGCTCAGCGCCTACGG 1
sgRNA_2 CAGTCTACATAACACGCCCATGG 1
sgRNA_3 CGCAACCTGCTCAGCGCCTACGG 1
…… …… ……
sgRNA_1_1 GGCAACCGTGGCGGCAATCGAGG 0
sgRNA_2_2 CTTCTCGGAATTCGGTGAAGGTGG 0
sgRNA_3_3 AACCTCCCGGCTTCTCGGAATTCGG 0
…… …… ……
Numeric data: for numeric NGS data, BWA is first used to align the sgRNA sequences and the NGS reads to the human reference genome. The reads containing an sgRNA are extracted, and for each it is determined whether an indel occurred at the cut site and whether the indel is OTF. The OTF rate of each sgRNA is then computed (OTF rate = number of reads that contain the sgRNA and are OTF, divided by the total number of reads that contain the sgRNA). Finally the data are organized in the format of Table 2.
Table 2
sgID Sequence Score
sgRNA_1 CGCAACCTGCTCAGCGCCTACGG 0.2345
sgRNA_2 CAGTCTACATAACACGCCCATGG 0.7846
sgRNA_3 CGCAACCTGCTCAGCGCCTACGG 0.2367
…… …… ……
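The OTF rate that fills the Score column of Table 2 can be sketched as below. The sketch assumes that the alignment step has already reduced each read covering an sgRNA to the net indel length at the cut site, and that an indel is out of frame when its length is not a multiple of 3; the function name and input format are hypothetical.

```python
def otf_rate(indel_lengths):
    """Fraction of reads covering an sgRNA whose indel shifts the frame.

    Each entry is the net indel length seen at the cut site in one read:
    positive for an insertion, negative for a deletion, 0 for no indel.
    """
    if not indel_lengths:
        raise ValueError("no reads cover this sgRNA")
    # Assumption: out of frame means the length is not divisible by 3.
    out_of_frame = sum(1 for n in indel_lengths if n % 3 != 0)
    return out_of_frame / len(indel_lengths)


# Eight reads: three carry frameshift indels (+1, -2, +4).
print(otf_rate([0, 1, -2, 3, -3, 0, 4, 0]))
```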
(2) Model building: as shown in Fig. 1, the sequence information of the collected sgRNAs is extracted from the corresponding reference genome. Assuming the upstream and downstream lengths are set to 35 and 32 bases respectively, the extracted sequence is 90 (35+20+3+32) bases long, for example: CACCTGGTATGTTCGTATCGGGCAGAATATCGCAACCTGCTCAGCGCCTACGGTCCATCTCGCTCAGGTACGACTGACCGACCCAGTCTA.
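Extracting the upstream + spacer + PAM + downstream window can be sketched as follows. This is a minimal illustration that treats one chromosome as a plain string and considers only the forward strand; the function name, the 0-based coordinate convention and the N-padding at sequence ends are assumptions of the sketch.

```python
def extract_window(genome, spacer_start, ups=35, dws=32,
                   spacer_len=20, pam_len=3):
    """Return upstream + spacer + PAM + downstream around a target site.

    `spacer_start` is the 0-based index of the first spacer base.
    Positions that fall outside the sequence are padded with N.
    """
    start = spacer_start - ups
    end = spacer_start + spacer_len + pam_len + dws
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(genome))
    return left_pad + genome[max(0, start):min(len(genome), end)] + right_pad


toy = "A" * 35 + "C" * 20 + "GGG" + "T" * 32
window = extract_window(toy, 35)
print(len(window))  # 35 + 20 + 3 + 32
```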
The extracted sgRNA information is binary-encoded according to the rule in Table 3.
Table 3
Base Code
A 1000
C 0100
G 0010
T 0001
N 0000
The 90 bases extracted above are then encoded accordingly, giving a binary vector of 360 (90 × 4) bits.
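The encoding rule (A 1000, C 0100, G 0010, T 0001, N 0000) can be sketched in a few lines. Treating any unexpected symbol like N is an assumption of the sketch, and the function name is hypothetical.

```python
# 4-bit code per base, as given in the text.
BASE_CODE = {"A": "1000", "C": "0100", "G": "0010", "T": "0001", "N": "0000"}


def encode_sequence(seq):
    """Return the flattened 4-bit-per-base encoding of a DNA sequence."""
    bits = []
    for base in seq.upper():
        # Assumption: symbols outside A/C/G/T/N are encoded like N.
        bits.extend(int(b) for b in BASE_CODE.get(base, "0000"))
    return bits


print(encode_sequence("ACGTN"))
print(len(encode_sequence("A" * 90)))  # four bits per base
```

Each bit position then corresponds to one candidate feature, namely a particular base at a particular position of the window.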
Features are extracted with machine learning methods and the personalized sgRNA design model is built.
For categorical data, logistic regression is used both to select features and to build the prediction model. Binary logistic regression offers two regularization options; the present invention uses L1 regularization for feature selection and L2 regularization to build the model.
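The two-stage use of logistic regression can be sketched with scikit-learn. This is a minimal sketch, not the patent's own code: the library choice, the `liblinear` solver, the regularization strengths and the function name are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_two_stage_logistic(X, y, c_select=0.5, c_model=1.0):
    """L1-penalized feature selection, then an L2-penalized classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Stage 1: the L1 penalty drives uninformative weights to exactly zero.
    selector = LogisticRegression(penalty="l1", solver="liblinear", C=c_select)
    selector.fit(X, y)
    selected = np.flatnonzero(selector.coef_[0])
    if selected.size == 0:
        raise ValueError("L1 penalty removed every feature; raise c_select")
    # Stage 2: refit on the surviving columns with the L2 penalty.
    model = LogisticRegression(penalty="l2", solver="liblinear", C=c_model)
    model.fit(X[:, selected], y)
    return selected, model


# Toy data: 12 binary features, label fully determined by column 0.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 12))
y_demo = X_demo[:, 0]
cols, clf = fit_two_stage_logistic(X_demo, y_demo)
print(cols, clf.score(X_demo[:, cols], y_demo))
```

Keeping the selected column indices alongside the model matters, because new sgRNAs must be reduced to the same columns before prediction.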
L1-regularized logistic regression solves the following sparse feature selection optimization problem:
Wherein w and c are the weights and intercept of the features being estimated, X is the feature representation of the training samples, n is the number of training samples, and y is the cleavage efficiency value corresponding to each sgRNA.
L2-penalized logistic regression minimizes the following cost function:
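The two objectives are not reproduced in this text. As a hedged reconstruction, the forms below are the standard L1- and L2-penalized binary logistic regression objectives as commonly written (for example in the scikit-learn documentation), using the symbols w, c, X, n and y defined above. The constant C (inverse regularization strength) and the label coding y_i ∈ {−1, +1} are assumptions; the 0/1 labels of Table 1 would map to −1/+1. X_i denotes the i-th row of X.

```latex
% L1-penalized objective (sparse feature selection)
\min_{w,\,c}\; \lVert w \rVert_1
  + C \sum_{i=1}^{n} \log\!\bigl(\exp\bigl(-y_i\,(X_i^{\top} w + c)\bigr) + 1\bigr)

% L2-penalized objective (final model)
\min_{w,\,c}\; \tfrac{1}{2}\, w^{\top} w
  + C \sum_{i=1}^{n} \log\!\bigl(\exp\bigl(-y_i\,(X_i^{\top} w + c)\bigr) + 1\bigr)
```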
For numeric data, the Lasso model is used for feature selection and standard linear regression is used to build the prediction model. Lasso is a linear model that estimates sparse coefficients and selects feature vectors mainly by extracting non-zero weights. The objective function to be minimized is:
Wherein w is the weight vector of the features being estimated, x is the feature vector of the selected sgRNA, n is the number of training samples, and y is the cleavage efficiency value corresponding to each sgRNA; α is a constant and ||w||_1 is the L1 norm of the parameter vector. The Lasso model solves the least-squares loss function with the added penalty α||w||_1; under this regularization only the features with non-zero weights are extracted, and these features are considered the important factors influencing sgRNA cleavage efficiency.
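The objective itself is not reproduced in this text. A standard form of the Lasso objective that agrees with the symbols defined above is given below as a hedged reconstruction; X stacks the feature vectors x row by row, and the 1/(2n) scaling of the least-squares term is an assumption taken from common implementations.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert Xw - y \rVert_2^2 + \alpha\,\lVert w \rVert_1
```

Larger values of α force more weights to exactly zero, so α controls how many features survive the selection.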
After these features have been selected, an assessment model is built with standard linear regression.
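Lasso selection followed by an ordinary least-squares refit can be sketched with scikit-learn. As before, the library, the value of `alpha` and the function name are assumptions of the sketch rather than details from the source.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression


def fit_lasso_then_ols(X, y, alpha=0.01):
    """Select features with Lasso, then refit them by linear regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Stage 1: keep only the columns whose Lasso weight is non-zero.
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)
    if selected.size == 0:
        raise ValueError("Lasso removed every feature; lower alpha")
    # Stage 2: standard linear regression on the selected columns only.
    model = LinearRegression().fit(X[:, selected], y)
    return selected, model


# Toy data: the efficiency depends on column 3 plus a little noise.
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 2, size=(200, 12)).astype(float)
y_demo = 0.1 + 0.6 * X_demo[:, 3] + rng.normal(0.0, 0.01, size=200)
cols, reg = fit_lasso_then_ols(X_demo, y_demo)
print(cols, round(reg.score(X_demo[:, cols], y_demo), 3))
```

The refit removes the shrinkage that Lasso applies to the surviving weights, which is one plausible reason for the two-stage design.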
Modeling of both numeric and categorical data produces two files: one is an xml file containing the selected features and the cross-validation results; the other is a pkl file, a binary file containing the trained prediction model.
The content of the xml file is as follows:
<report>
<features method="lasso" n="36" ups="35" dws="32">
<ups_33_C/>
<ups_30_G/>
<ups_22_G/>
<ups_21_G/>
<ups_20_A/>
<ups_14_G/>
<ups_13_C/>
<ups_9_G/>
<ups_8_A/>
<ups_5_G/>
<ups_1_A/>
<spa_1_C/>
<spa_2_C/>
<spa_4_A/>
<spa_5_T/>
<spa_6_C/>
<spa_8_A/>
<spa_9_C/>
<spa_9_T/>
<spa_10_C/>
<spa_15_T/>
<spa_17_A/>
<spa_19_C/>
<pam_1_G/>
<pam_2_G/>
<dws_2_A/>
<dws_6_G/>
<dws_12_G/>
<dws_13_C/>
<dws_14_C/>
<dws_15_A/>
<dws_24_A/>
<dws_24_G/>
<dws_26_G/>
<dws_26_T/>
<dws_28_A/>
</features>
<cross_validation fold="5">
<metric>
<pearson_cor value="0.862"/>
<r2 value="0.683"/>
</metric>
</cross_validation>
</report>
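A report of this shape can be read back with the Python standard library. The sketch below embeds a shortened copy of the report and assumes that each feature tag reads as region_position_base (ups, spa, pam or dws, then a position, then a base); that reading of the tag names, and the well-formed `r2` element, are assumptions.

```python
import xml.etree.ElementTree as ET

REPORT = """<report>
  <features method="lasso" n="3" ups="35" dws="32">
    <ups_33_C/>
    <spa_9_T/>
    <pam_1_G/>
  </features>
  <cross_validation fold="5">
    <metric>
      <pearson_cor value="0.862"/>
      <r2 value="0.683"/>
    </metric>
  </cross_validation>
</report>"""


def parse_report(xml_text):
    """Return ([(region, position, base), ...], {metric: value})."""
    root = ET.fromstring(xml_text)
    features = []
    for node in root.find("features"):
        region, position, base = node.tag.split("_")
        features.append((region, int(position), base))
    metric_node = root.find("cross_validation/metric")
    metrics = {m.tag: float(m.get("value")) for m in metric_node}
    return features, metrics


features, metrics = parse_report(REPORT)
print(features)
print(metrics)
```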
(3) Model evaluation: the quality of the prediction model is measured with the NDCG algorithm. NDCG (Normalized Discounted Cumulative Gain) is mainly used to measure the effectiveness of a ranking model. Its value represents the similarity between the predicted ranking and the actual ranking and lies between 0 and 1, where 1 indicates complete agreement; the larger the value, the better the model. The formula is as follows:
DCG (Discounted Cumulative Gain) is the value computed from the predicted ranking, and IDCG (ideal DCG) is the ideal DCG, computed from the true ranking. The mathematical definition of DCG is as follows:
Wherein rel_i is the relevance value at position i of the predicted ranking.
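The formulas are not reproduced in this text. The ratio NDCG = DCG/IDCG follows directly from the description above; the particular DCG form shown below (exponential gain with a log2 discount, truncated at a cut-off p such as the top 50) is one common variant and is an assumption here, since other variants of DCG exist.

```latex
\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p},
\qquad
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
```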
In Table 4, sgID is the name of the sgRNA, seq is the spacer sequence of the sgRNA, Benchmark Score is the benchmark score, BS_rank is the rank by Benchmark Score, Cage Score is the score assigned by the prediction model of the present invention, and C_rank is the rank by Cage Score.
Table 4
sgID seq Benchmark Score BS_rank Cage Score C_rank
sg1000 GCAGGTACCCTGCAACGTCGCGG 0.789456865 1 0.6905 1
sg1001 CTCCACTAGTCCCCGCGCCGCGG 0.506422166 2 0.6026 2
sg1 GTAATGGCTTCCTCGTGAGTTGG 0.325738326 3 0.5548 3
sg1002 GACTCCGTTGGGATCCGCGCCGG 0.092078991 4 0.5095 4
sg10 ATCTTAAGCAAACGCTTACCAGG 0.072255575 5 0.4959 5
sg1003 CCCGAAACGGTTGACTCCGTTGG 0.037552375 6 0.4473 6
sg1004 AGGCGCGCGATCCAGGTAGCTGG 0.019922477 7 0.3281 8
sg100 AAAAAGCTGATGAAGTTGTTTGG 0.017296539 8 0.3357 7
sg1005 CGGGGCCACCGCGACGTTGCAGG 0.002206787 9 0.3056 9
…… …… …… …… …… ……
TOP50 NDCG=0.876322904
TOP 10% NDCG=0.84340749
If the database does not yet contain a corresponding model, the new model is added to the database. Otherwise the NDCG values of the two models are computed and compared, and the database is updated if the new model has a larger NDCG value than the existing one.
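The NDCG comparison and the update rule can be sketched together. The sketch uses one common exponential-gain form of DCG, which is an assumption, and models the database as an in-memory dictionary; the function names and the example key are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with exponential gain, log2 discount."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))


def ndcg(true_scores, predicted_scores, k=None):
    """NDCG of the ranking induced by predicted_scores, cut off at k."""
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_scores[i] for i in order][:k]
    ideal = sorted(true_scores, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def update_model_store(store, key, model, score):
    """Keep the model with the larger NDCG; return True if stored."""
    current = store.get(key)
    if current is None or score > current[1]:
        store[key] = (model, score)
        return True
    return False


store = {}
print(update_model_store(store, "human/cell_type_A", "model_v1", 0.80))
print(update_model_store(store, "human/cell_type_A", "model_v2", 0.75))
```

Calling `ndcg(benchmark, predicted, k=50)` would correspond to a top-50 evaluation of the kind reported with Table 4.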
(4) Design and assessment: as shown in Fig. 2, the method either assesses sgRNAs already designed by the user or designs sgRNAs for a genome region given by the user (for example chromosome 1, 1,000,000 to 1,002,000, hg19). The species or cell type of the sgRNAs to be assessed is determined first, and a suitable model is then selected for the assessment; if no suitable model exists, a similar model may be selected. This embodiment provides 10 models covering 3 species and 8 cell types to choose from. The output is shown in Table 5.
Table 5
At this point, the user can choose the sgRNAs that suit their needs for the next step of research.

Claims (7)

1. A method for designing sgRNAs based on CRISPR/Cas9, characterized in that the method comprises the following steps:
1) obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values, specifically:
11) obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values from the literature;
12) obtaining sgRNAs from the SRA database and calculating the corresponding Cas9 cleavage efficiency values;
13) classifying the data obtained in steps 11) and 12) under different reference genomes according to species, cell type and experimental conditions, and listing under each reference genome a table whose first column is the sgRNA name, whose second column is the sgRNA sequence, and whose third column is the corresponding Cas9 cleavage efficiency;
2) building a personalized sgRNA design model, specifically:
21) extracting, as required, the sequence information of the sgRNAs obtained in step 1) from the corresponding reference genome;
22) binary-encoding the sgRNA sequence information extracted in step 21) according to the binary rule;
23) for the sgRNAs obtained in step 21), determining the data type of the Cas9 cleavage efficiency, proceeding to step 24) if it is numeric and to step 25) if it is categorical;
Categorical data: for categorical data collected from the literature, effective is coded as 1 and ineffective as 0;
Numeric data: for numeric NGS data, BWA is first used to align the sgRNA sequences and the NGS reads to the human reference genome; the reads containing an sgRNA are extracted, and for each it is determined whether an indel occurred at the cut site and whether the indel is OTF; the OTF rate of each sgRNA is then computed, where OTF rate = number of reads that contain the sgRNA and are OTF, divided by the total number of reads that contain the sgRNA;
24) performing feature extraction on the sgRNA sequence information encoded in step 22) with a Lasso model, and building the personalized sgRNA design model by standard linear regression;
25) performing feature selection on the sgRNA sequence information encoded in step 22) with L1 regularization in binary logistic regression, and then building the personalized sgRNA design model with L2 regularization in binary logistic regression;
3) measuring the quality of the personalized sgRNA design model built in step 2) with the NDCG algorithm and updating the SRA database, specifically:
31) calculating the NDCG value of the personalized sgRNA design model built in step 2);
32) determining whether the existing SRA database already holds a corresponding personalized sgRNA model; if not, adding the model to the SRA database; if so, proceeding to step 33);
33) comparing the personalized sgRNA model with the corresponding sgRNA model in the SRA database and storing the one with the larger NDCG value in the SRA database;
4) designing sgRNAs and reporting an assessment score for each sgRNA, specifically:
41) selecting a suitable reference genome from the SRA database according to the genome region given by the user, and searching it for all sgRNAs that satisfy the design rule, which become the designed sgRNAs;
42) assessing the sgRNAs designed in step 41) with the personalized sgRNA model built in step 2).
2. The method for designing sgRNAs based on CRISPR/Cas9 according to claim 1, characterized in that calculating the corresponding Cas9 cleavage efficiency value in step 12) specifically comprises:
121) aligning the sgRNAs and the corresponding next-generation sequencing reads to the reference genome;
122) extracting the reads that contain an sgRNA;
123) determining whether an insertion or deletion occurred in the DNA at the cut site, and whether that insertion or deletion is a frameshift mutation;
124) computing the frameshift mutation rate of each sgRNA, specifically: frameshift mutation rate = number of reads that contain the sgRNA and carry a frameshift mutation / total number of reads that contain the sgRNA;
125) taking the frameshift mutation rate calculated in step 124) as the Cas9 cleavage efficiency value.
3. The method for designing sgRNAs based on CRISPR/Cas9 according to claim 1, characterized in that the sequence information of the sgRNA in step 21) includes the sgRNA sequence, the motif required for the sgRNA to recognize DNA, and the bases upstream and downstream of the sgRNA spacer; the number of upstream and downstream bases is either the platform default or a value set by the user.
4. The method for designing sgRNAs based on CRISPR/Cas9 according to claim 1, characterized in that the binary rule in step 22) is: A corresponds to 1000, C to 0100, G to 0010, T to 0001, and N to 0000.
5. The method for designing sgRNAs based on CRISPR/Cas9 according to claim 1, characterized in that the feature extraction with the Lasso model in step 24) selects feature vectors by extracting non-zero weights, specifically:
Wherein w is the weight vector of the features being estimated, x is the feature vector of the selected sgRNA, n is the number of sgRNAs, and y is the Cas9 cleavage efficiency value corresponding to each sgRNA; α is a constant and ||w||_1 is the L1 norm of the parameter vector. The Lasso model solves the least-squares loss function with the added penalty α||w||_1; under this regularization only the features with non-zero weights are extracted.
6. The method for designing sgRNAs based on CRISPR/Cas9 according to claim 1, characterized in that calculating the NDCG value of the personalized sgRNA design model in step 31) is specifically:
Wherein DCG is the value computed from the predicted ranking, IDCG is the ideal DCG computed from the true ranking, and rel_i is the relevance value at position i of the predicted ranking.
7. The method for designing sgRNAs based on CRISPR/Cas9 according to claim 1, characterized in that the design rule in step 41) is specifically:
20bp+PAM
Wherein bp is the unit of DNA length, PAM is the motif required for the sgRNA to recognize DNA, and "+" indicates that the DNA is 20 bp long and is accompanied by the PAM motif required for the sgRNA to recognize DNA.
CN201610341946.3A 2016-05-20 2016-05-20 A kind of design method of the sgRNA based on CRISPR/Cas9 Expired - Fee Related CN106446600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610341946.3A CN106446600B (en) 2016-05-20 2016-05-20 A kind of design method of the sgRNA based on CRISPR/Cas9

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610341946.3A CN106446600B (en) 2016-05-20 2016-05-20 A kind of design method of the sgRNA based on CRISPR/Cas9

Publications (2)

Publication Number Publication Date
CN106446600A CN106446600A (en) 2017-02-22
CN106446600B true CN106446600B (en) 2019-10-18

Family

ID=58183551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610341946.3A Expired - Fee Related CN106446600B (en) 2016-05-20 2016-05-20 A kind of design method of the sgRNA based on CRISPR/Cas9

Country Status (1)

Country Link
CN (1) CN106446600B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10323236B2 (en) 2011-07-22 2019-06-18 President And Fellows Of Harvard College Evaluation and improvement of nuclease cleavage specificity
US20150044192A1 (en) 2013-08-09 2015-02-12 President And Fellows Of Harvard College Methods for identifying a target site of a cas9 nuclease
US9359599B2 (en) 2013-08-22 2016-06-07 President And Fellows Of Harvard College Engineered transcription activator-like effector (TALE) domains and uses thereof
US9526784B2 (en) 2013-09-06 2016-12-27 President And Fellows Of Harvard College Delivery system for functional nucleases
US9340799B2 (en) 2013-09-06 2016-05-17 President And Fellows Of Harvard College MRNA-sensing switchable gRNAs
US9388430B2 (en) 2013-09-06 2016-07-12 President And Fellows Of Harvard College Cas9-recombinase fusion proteins and uses thereof
US9840699B2 (en) 2013-12-12 2017-12-12 President And Fellows Of Harvard College Methods for nucleic acid editing
EP3177718B1 (en) 2014-07-30 2022-03-16 President and Fellows of Harvard College Cas9 proteins including ligand-dependent inteins
EP3365356B1 (en) 2015-10-23 2023-06-28 President and Fellows of Harvard College Nucleobase editors and uses thereof
GB2568182A (en) 2016-08-03 2019-05-08 Harvard College Adenosine nucleobase editors and uses thereof
AU2017308889B2 (en) 2016-08-09 2023-11-09 President And Fellows Of Harvard College Programmable Cas9-recombinase fusion proteins and uses thereof
US11542509B2 (en) 2016-08-24 2023-01-03 President And Fellows Of Harvard College Incorporation of unnatural amino acids into proteins using base editing
KR102622411B1 (en) 2016-10-14 2024-01-10 프레지던트 앤드 펠로우즈 오브 하바드 칼리지 AAV delivery of nucleobase editor
WO2018119359A1 (en) 2016-12-23 2018-06-28 President And Fellows Of Harvard College Editing of ccr5 receptor gene to protect against hiv infection
US11898179B2 (en) 2017-03-09 2024-02-13 President And Fellows Of Harvard College Suppression of pain by gene editing
WO2018165629A1 (en) 2017-03-10 2018-09-13 President And Fellows Of Harvard College Cytosine to guanine base editor
EP3601562A1 (en) 2017-03-23 2020-02-05 President and Fellows of Harvard College Nucleobase editors comprising nucleic acid programmable dna binding proteins
WO2018209320A1 (en) 2017-05-12 2018-11-15 President And Fellows Of Harvard College Aptazyme-embedded guide rnas for use with crispr-cas9 in genome editing and transcriptional activation
US11732274B2 (en) 2017-07-28 2023-08-22 President And Fellows Of Harvard College Methods and compositions for evolving base editors using phage-assisted continuous evolution (PACE)
EP3676376A2 (en) 2017-08-30 2020-07-08 President and Fellows of Harvard College High efficiency base editors comprising gam
KR20200121782A (en) 2017-10-16 2020-10-26 The Broad Institute, Inc. Uses of adenosine base editor
CN110751982B (en) * 2018-07-04 2023-11-10 Guangzhou Saiye Baimu Biotechnology Co., Ltd. Intelligent parallelization knockout strategy screening method and system
BR112021018606A2 (en) 2019-03-19 2021-11-23 Harvard College Methods and compositions for editing nucleotide sequences
CN111261223B (en) * 2020-01-12 2022-05-03 Hunan University CRISPR off-target effect prediction method based on deep learning
DE112021002672T5 (en) 2020-05-08 2023-04-13 President And Fellows Of Harvard College Methods and compositions for simultaneous editing of both strands of a double-stranded nucleotide target sequence
CN111613267A (en) * 2020-05-21 2020-09-01 Sun Yat-sen University CRISPR/Cas9 off-target prediction method based on attention mechanism
CN111881324B (en) * 2020-07-30 2023-12-15 Suzhou Industrial Park Institute of Services Outsourcing High-throughput sequencing data general storage format structure, construction method and application thereof
CN117252306B (en) * 2023-10-11 2024-02-27 Minzu University of China Gene editing capability index calculation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103805606A (en) * 2014-02-28 2014-05-21 Qingdao Institute of Animal Husbandry and Veterinary Medicine Pair of small guide RNAs (sgRNAs) for specifically identifying sheep DKK1 gene, coding DNA, and application of the sgRNAs
CN104109687A (en) * 2014-07-14 2014-10-22 Sichuan University Construction and application of Zymomonas mobilis CRISPR (clustered regularly interspaced short palindromic repeats)-Cas9 (CRISPR-associated protein 9) system
CN105255937A (en) * 2015-08-14 2016-01-20 Northwest A&F University Method for expression of CRISPR sgRNA by eukaryotic cell type III promoter and use thereof
CN105296518A (en) * 2015-12-01 2016-02-03 China Agricultural University Homologous arm vector construction method used for CRISPR/Cas9 technology
CN105400779A (en) * 2015-10-15 2016-03-16 Wuhu Yinuo Biotechnology Co., Ltd. Target sequence, recognized by Streptococcus thermophilus CRISPR-Cas9 system, of human CCR5 gene, sgRNA and application of CRISPR-Cas9 system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
sgRNA design and off-target effect evaluation in the CRISPR/Cas9 system; Xie Shengsong et al.; Hereditas (Beijing); 20151130; Vol. 37 (No. 11); pp. 1125-1136 *
CRISPR/Cas system-mediated editing of large genomic DNA fragments; Wang Liren; China Doctoral Dissertations Full-text Database, Basic Sciences; 20151015; Vol. 2015 (No. 10); pp. A006-18 *
sgRNA screening for targeted editing of the TCR gene with the CRISPR-Cas9 system; Shao Hongwei et al.; Journal of Jimei University (Natural Science); 20150731; Vol. 20 (No. 4); pp. 265-270 *
CRISPR-P: a web tool for synthetic single-guide RNA design of CRISPR system in plants; Yang Lei et al.; Molecular Plant; 20140930; Vol. 7 (No. 9); pp. 1494-1496 *
In Silico Predictive Modeling of CRISPR/Cas9 guide efficiency; Nicolo Fusi et al.; bioRxiv; 20150626; pp. 1-31 *
Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9; John G Doench et al.; Nature Biotechnology; 20160118; Vol. 34; pp. 184-191 *

Also Published As

Publication number Publication date
CN106446600A (en) 2017-02-22

Similar Documents

Publication Publication Date Title
CN106446600B (en) A kind of design method of the sgRNA based on CRISPR/Cas9
Zhao et al. Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols
CN104504304B (en) Method and device for recognizing clustered regularly interspaced short palindromic repeats
CA2424031C (en) System and process for validating, aligning and reordering genetic sequence maps using ordered restriction map
CN108319984B (en) Construction method and prediction method for a prediction model of woody plant leaf morphology and photosynthesis characteristics based on DNA methylation level
CN108564117B (en) SVM-based poverty and life assisting identification method
CN106446597B (en) Several species feature selecting and the method for identifying unknown gene
CN107679367B (en) Method and system for identifying co-regulation network function module based on network node association degree
JP2014505935A (en) DNA sequence data analysis method
CN105808976A (en) Recommendation model based miRNA target gene prediction method
Hu et al. Algorithm for discovering low-variance 3-clusters from real-valued datasets
CN114708910B (en) Method for calculating enrichment score of cell subpopulations in cell sequencing by using single cell sequencing data
Williams et al. Plant microRNA prediction by supervised machine learning using C5.0 decision trees
CN112349346A (en) Method for detecting structural variations in genomic regions
CN114496092A (en) miRNA and disease association relation prediction method based on graph convolution network
CN111462820A (en) Non-coding RNA prediction method based on feature screening and integration algorithm
CN104966106A (en) Biological age step-by-step predication method based on support vector machine
CN109545283B (en) Method for constructing phylogenetic tree based on sequence pattern mining algorithm
CN116525010A (en) Single-cell transcriptome double-source multi-cell filtering method, medium and equipment
CN115249538B (en) Construction method of lncRNA-disease association prediction model based on a heterogeneous-graph generative adversarial network
CN115394348A (en) lncRNA subcellular localization prediction method, equipment and medium based on graph convolution network
CN108595914A (en) High-precision prediction method for tobacco mitochondrial RNA (mt RNA) editing sites
CN114093420A (en) XGBoost-based DNA recombination site prediction method
CN113035279A (en) Parkinson disease evolution key module identification method based on miRNA sequencing data
Mu et al. Investigation on tree molecular genome of Arabidopsis thaliana for internet of things

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191018
