CN106446600B

CN106446600B - Method for designing sgRNA based on CRISPR/Cas9

Info

Publication number: CN106446600B
Application number: CN201610341946.3A
Authority: CN
Inventors: 刘琦; 啜国晖; 陈亚男; 闫纪芳
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2016-05-20
Filing date: 2016-05-20
Publication date: 2019-10-18
Anticipated expiration: 2036-05-20
Also published as: CN106446600A

Abstract

The present invention relates to a method for designing sgRNA based on CRISPR/Cas9, characterized in that the method comprises the following steps: obtaining sgRNAs and the corresponding Cas9 cleavage-efficiency values; building a personalized sgRNA design model; measuring the quality of the built personalized sgRNA design model with the NDCG algorithm and updating the database; and designing sgRNAs and providing an evaluation score for each sgRNA. Compared with the prior art, the invention offers high accuracy, more complete features, a wide range of application, and a wide range of analyzable data.

Description

Method for designing sgRNA based on CRISPR/Cas9

Technical field

The present invention relates to the field of gene editing research, and in particular to a method for designing sgRNA based on CRISPR/Cas9 gene editing technology.

Background art

With the development of molecular biology, people have gained a deeper understanding of the building blocks of life, but much about the mechanisms of life processes, and especially the mechanisms underlying certain diseases, remains unknown. Studying the relationship between genes and phenotypes, and the interactions between genes, urgently requires an engineering technique that can rapidly knock out and insert genes in vivo. The CRISPR/Cas9 system emerged in response and meets this need of researchers.

The CRISPR/Cas9 system (clustered regularly interspaced short palindromic repeats/CRISPR-associated protein 9) is an easy-to-use and widely applicable gene editing tool. The system consists mainly of a nuclease (Cas9) and an RNA that guides recognition (sgRNA). The sgRNA recognizes the target gene site through complementary base pairing and then recruits Cas9 to cleave it, producing a double-strand break and thereby achieving gene editing at the DNA level. Because it is widely applicable, convenient and time-saving, it was quickly adopted in many areas, and it shows great advantages especially in building cancer models and exploring gene therapy.

However, continued research has shown that different sgRNAs designed for the same gene in the same cell differ greatly in cleavage efficiency. If an efficient sgRNA cannot be designed, the only remedy is to increase the concentration, which introduces a large amount of genetic debris into the cell and produces a high proportion of off-target effects, greatly inconveniencing researchers. Designing an sgRNA with high cleavage efficiency is therefore very important for genetic research.

At present there are nearly 30 sgRNA design tools, which fall into two main classes. The first class summarizes sgRNA rules from experiments (for example, one end of the sequence paired with the sgRNA must carry a PAM sequence, the 5' end should be GG, the GC content should stay at around 60%, and the seed sequence does not tolerate mismatches) and then screens candidates directly against these conditions. The second class mainly uses statistical methods, assigning a weight to each base to compute the specificity of an sgRNA, as in CRISPR Design. Both classes build a single general-purpose model. However, because of the strong heterogeneity between species and between cell types, the predictions of existing software are not very good; and because heterogeneity between experimental conditions also affects sgRNA cleavage efficiency, general-purpose models have relatively low evaluation accuracy.

Therefore, taking into account the heterogeneity between data from different platforms and species, building personalized models from the data of each platform or species to improve the specificity and efficiency of sgRNAs is extremely important for research on the off-target problem of the CRISPR/Cas9 system.

Summary of the invention

The purpose of the present invention is to address the above problems by providing a method for designing sgRNA based on CRISPR/Cas9 that is highly accurate and widely applicable.

To achieve this purpose, the invention provides a method for designing sgRNA based on CRISPR/Cas9, comprising the following steps:

1) obtaining sgRNAs and the corresponding Cas9 cleavage-efficiency values, specifically:

11) obtaining sgRNAs and the corresponding Cas9 cleavage-efficiency values from the literature;

12) obtaining sgRNAs from the SRA database and calculating the corresponding Cas9 cleavage-efficiency values;

13) according to species, cell type and experimental conditions, classifying the data obtained in steps 11) and 12) under different reference genomes, and producing for each reference genome a table whose first column is the sgRNA name, second column the sgRNA sequence, and third column the corresponding Cas9 cleavage efficiency;

2) building a personalized sgRNA design model, specifically:

21) extracting, as required, the sequence information of the sgRNAs obtained in step 1) from the corresponding reference genome;

22) binary-encoding the sgRNA sequence information extracted in step 21) according to the binary rule;

23) for the sgRNAs obtained in step 21), determining the data type of their Cas9 cleavage efficiency: if numeric, proceeding to step 24); if categorical, proceeding to step 25);

24) for the sgRNA sequence information encoded in step 22), performing feature extraction with a Lasso model and building the personalized sgRNA design model by standard linear regression;

25) for the sgRNA sequence information encoded in step 22), performing feature selection with L1 regularization in binary logistic regression, then building the personalized sgRNA design model with L2 regularization in binary logistic regression;

3) measuring the quality of the personalized sgRNA design model built in step 2) with the NDCG algorithm and updating the SRA database, specifically:

31) calculating the NDCG value of the personalized sgRNA design model built in step 2);

32) determining whether a corresponding personalized sgRNA model already exists in the existing SRA database; if not, adding the model to the SRA database; if so, proceeding to step 33);

33) comparing the personalized sgRNA model with the corresponding sgRNA model in the SRA database, and storing the one with the larger NDCG value in the SRA database;

4) designing sgRNAs and providing an evaluation score for each sgRNA, specifically:

41) according to the genome region provided by the user, selecting a suitable reference genome from the SRA database and searching it for all sgRNAs that satisfy the design rule, which become the designed sgRNAs;

42) evaluating the sgRNAs designed in step 41) with the personalized sgRNA model built in step 2).

Preferably, the corresponding Cas9 cleavage-efficiency value in step 12) is calculated as follows:

121) aligning the sgRNAs and the corresponding next-generation sequencing reads to the reference genome;

122) extracting the reads that contain the sgRNA;

123) determining whether an insertion or deletion in the DNA was produced at the cut site, and whether that insertion or deletion is a frameshift mutation;

124) counting the frameshift mutation rate of each sgRNA, specifically:

$$\text{frameshift mutation rate} = \frac{\text{number of reads that contain the sgRNA and carry a frameshift mutation}}{\text{total number of reads that contain the sgRNA}}$$

125) using the frameshift mutation rate calculated in step 124) as the Cas9 cleavage-efficiency value.
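Steps 121) to 125) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes that alignment and indel calling have already been done upstream and that each read containing an sgRNA has been reduced to a net indel length at the cut site. The names `is_frameshift` and `otf_rates` are hypothetical. An indel is treated as a frameshift when its length is not a multiple of 3, which is the usual definition.

```python
from collections import defaultdict


def is_frameshift(indel_length: int) -> bool:
    """An indel shifts the reading frame when its length is not a multiple of 3."""
    return indel_length != 0 and indel_length % 3 != 0


def otf_rates(read_calls):
    """Frameshift rate per sgRNA.

    read_calls: iterable of (sgrna_id, net_indel_length) pairs, one per read
    that contains the sgRNA; 0 means no indel at the cut site (assumed input format).
    """
    total = defaultdict(int)
    frameshift = defaultdict(int)
    for sgrna_id, indel_length in read_calls:
        total[sgrna_id] += 1
        if is_frameshift(indel_length):
            frameshift[sgrna_id] += 1
    return {sg: frameshift[sg] / total[sg] for sg in total}


# Toy input: four reads for sgRNA_1, two reads for sgRNA_2.
calls = [("sgRNA_1", 0), ("sgRNA_1", -1), ("sgRNA_1", 3), ("sgRNA_1", 2),
         ("sgRNA_2", -4), ("sgRNA_2", 0)]
rates = otf_rates(calls)
```

In this toy input, the 3 bp insertion is in frame and is therefore not counted, while the 1 bp deletion and the 2 bp insertion are.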

Preferably, the sgRNA sequence information in step 21) comprises the sgRNA sequence, the PAM (the marker segment required for the sgRNA to recognize DNA), and the bases upstream and downstream of the sgRNA spacer; the lengths of the upstream and downstream flanks are either the platform defaults or values set by the user.

Preferably, the binary rule in step 22) is: A corresponds to 1000, C to 0100, G to 0010, T to 0001, and N to 0000.

Preferably, feature extraction with the Lasso model in step 24) selects feature vectors by extracting the non-zero weights, specifically:

$$\min_{w} \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$$

where w is the weight vector of the features being estimated, X holds the feature vectors of the selected sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage-efficiency value of each sgRNA; α is a constant and ‖w‖₁ is the L1 norm of the parameter vector. The Lasso model solves the least-squares loss function with the penalty term α‖w‖₁ added; under this regularization, the features with non-zero weights are extracted.

Preferably, the L1 regularization in step 25) is specifically:

where w and c are the weights and intercept of the features being estimated, X is the binary matrix of the encoded sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage-efficiency value of each sgRNA.

Preferably, the L2 regularization is specifically:

Preferably, the NDCG value of the built personalized sgRNA design model in step 31) is calculated as:

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$$

where DCG is the value calculated from the predicted ranking, IDCG is the ideal DCG calculated from the true ranking, and rel_i is the ranking value of the prediction at position i.

Preferably, the design rule in step 41) is specifically:

20bp+PAM

where bp is the unit of DNA length, PAM is the marker segment required for the sgRNA to recognize DNA, and "+" means that the DNA is 20 bp long and is accompanied by the PAM.
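The 20bp+PAM rule amounts to a sliding-window search over the target region. The sketch below is a hedged illustration only: it assumes the PAM is NGG (as for the commonly used SpCas9; the text itself only says "PAM"), scans the forward strand only, and uses the hypothetical name `find_candidates`.

```python
import re

# Assumption: PAM = NGG. The lookahead lets overlapping sites be reported.
_SITE = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")


def find_candidates(region: str):
    """Return (start, spacer, pam) for every 20 bp + PAM site on the given strand."""
    hits = []
    for match in _SITE.finditer(region.upper()):
        site = match.group(1)
        hits.append((match.start(), site[:20], site[20:]))
    return hits


# Toy region: the first sequence of Table 1 with two flanking bases on each side.
demo = "TT" + "CGCAACCTGCTCAGCGCCTACGG" + "AA"
candidates = find_candidates(demo)
```

A full search would also scan the reverse complement, which is omitted here to keep the example short.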

Compared with the prior art, the invention has the following advantages:

(1) For different species and cell types, a personalized strategy is used and modeling is done with data-driven machine learning algorithms, which significantly improves evaluation accuracy.

(2) A new encoding rule is used that is not limited to the PAM and spacer, so the features found are more complete.

(3) Users are able to build models themselves, which widens the range of application beyond the species already in the database.

(4) The OTF rate of NGS data is used as the cleavage rate, which widens the range of data that can be analyzed.

(5) Users can upload their own data to expand the database, which accelerates data accumulation and helps resolve the current predicament that optimal sgRNAs cannot be designed well because of insufficient data.

Brief description of the drawings

Fig. 1 is a flowchart of the method for building and evaluating the personalized sgRNA model;

Fig. 2 is a flowchart of the method for designing and evaluating sgRNAs.

Detailed description of the embodiments

The invention is described in detail below with reference to the drawings and a specific embodiment. The embodiment is implemented on the basis of the technical solution of the invention and gives a detailed implementation and specific operating procedure, but the scope of protection of the invention is not limited to the following embodiment.

Abbreviations:

CRISPR: clustered regularly interspaced short palindromic repeats

Cas9: the enzyme associated with the type II CRISPR system

NGS: next-generation sequencing

PAM: protospacer-adjacent motif, the marker segment required for the sgRNA to recognize DNA

sgRNA: the RNA that plays the guiding role in the CRISPR/Cas9 system

indel: insertion or deletion in the DNA caused by CRISPR/Cas9 editing

spacer: the roughly 20 bases of the sgRNA that perform complementary base pairing

OTF: out of frame, i.e. a frameshift mutation

read: the sequence obtained from a single reaction in high-throughput sequencing

This embodiment provides a method for designing sgRNA based on CRISPR/Cas9 in which a dedicated personalized sgRNA design model is built for each species and cell type, so that models can be built and sgRNAs designed according to different needs. It comprises the following four steps:

(1) Data collection: the data collected from the literature generally fall into two classes: sgRNAs with numeric cleavage-efficiency values, or sgRNAs with categorical cleavage efficiency (for example a binary effective/ineffective label). The NGS data downloaded from the SRA database are of the numeric type only. Because NGS data, once the OTF rate has been counted, are processed in the same way as numeric data collected from the literature, this embodiment illustrates only two kinds of data: categorical data from the literature and NGS data.

Categorical data: for categorical data collected from the literature, this embodiment defines effective as 1 and ineffective as 0, and arranges the data in the format of Table 1.

Table 1

sgID	Sequence	Score
sgRNA_1	CGCAACCTGCTCAGCGCCTACGG	1
sgRNA_2	CAGTCTACATAACACGCCCATGG	1
sgRNA_3	CGCAACCTGCTCAGCGCCTACGG	1
……	……	……
sgRNA_1_1	GGCAACCGTGGCGGCAATCGAGG	0
sgRNA_2_2	CTTCTCGGAATTCGGTGAAGGTGG	0
sgRNA_3_3	AACCTCCCGGCTTCTCGGAATTCGG	0
……	……	……

Numeric data: for numeric NGS data, the sgRNA sequences and the NGS reads are first each aligned to the human reference genome with BWA; the reads containing the sgRNA are extracted, and it is determined whether an indel was produced at the cut site and whether the indel is OTF. The OTF rate of each sgRNA is then counted (OTF rate = number of reads that contain the sgRNA and are OTF, divided by the total number of reads that contain the sgRNA). Finally the data are arranged in the format of Table 2.

Table 2

sgID	Sequence	Score
sgRNA_1	CGCAACCTGCTCAGCGCCTACGG	0.2345
sgRNA_2	CAGTCTACATAACACGCCCATGG	0.7846
sgRNA_3	CGCAACCTGCTCAGCGCCTACGG	0.2367
……	……	……
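Tables in this three-column layout can be loaded and routed to the numeric or the categorical branch with a few lines of Python. The sketch below is illustrative only: `parse_table` and `score_type` are hypothetical names, and treating "every score is exactly 0 or 1" as categorical is an assumption, not a rule stated in the text.

```python
def parse_table(text: str):
    """Parse whitespace-separated rows of (sgID, sequence, score)."""
    rows = []
    for line in text.strip().splitlines():
        name, sequence, score = line.split()
        rows.append((name, sequence, float(score)))
    return rows


def score_type(rows) -> str:
    """Return 'categorical' if every score is 0 or 1, otherwise 'numeric' (assumed heuristic)."""
    if all(score in (0.0, 1.0) for _, _, score in rows):
        return "categorical"
    return "numeric"


# Rows copied from Table 1 and Table 2 (header and elided rows left out).
TABLE_1 = """
sgRNA_1 CGCAACCTGCTCAGCGCCTACGG 1
sgRNA_1_1 GGCAACCGTGGCGGCAATCGAGG 0
"""
TABLE_2 = """
sgRNA_1 CGCAACCTGCTCAGCGCCTACGG 0.2345
sgRNA_2 CAGTCTACATAACACGCCCATGG 0.7846
"""
kind_1 = score_type(parse_table(TABLE_1))
kind_2 = score_type(parse_table(TABLE_2))
```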

(2) Model building: as shown in Fig. 1, the sequence information of the collected sgRNAs is extracted from the corresponding reference genome. Assuming the upstream and downstream lengths are set to 35 and 32 bases respectively, the extracted sequence is 90 (35+20+3+32) bases long:

CACCTGGTATGTTCGTATCGGGCAGAATATCGCAACCTGCTCAGCGCCTACGGTCCATCTCGCTCAGGTACGACTGACCGACCCAGTCTA
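The window arithmetic (upstream + 20-base spacer + 3-base PAM + downstream) can be illustrated with a short Python sketch. The function name `extract_window` and the synthetic reference string are assumptions made for the example; the region labels `ups`, `spa`, `pam` and `dws` are borrowed from the feature names that appear in the XML report later in this description.

```python
def extract_window(reference: str, spacer_start: int, upstream: int = 35,
                   downstream: int = 32, spacer_len: int = 20, pam_len: int = 3):
    """Cut the flanked target site out of a reference sequence.

    spacer_start is the 0-based index of the first spacer base (assumed convention).
    """
    start = spacer_start - upstream
    end = spacer_start + spacer_len + pam_len + downstream
    if start < 0 or end > len(reference):
        raise ValueError("window runs off the reference sequence")
    window = reference[start:end]
    pam_start = upstream + spacer_len
    return {
        "ups": window[:upstream],
        "spa": window[upstream:pam_start],
        "pam": window[pam_start:pam_start + pam_len],
        "dws": window[pam_start + pam_len:],
    }


# Synthetic reference: 40 A's, one spacer + PAM, 40 T's.
reference = "A" * 40 + "CGCAACCTGCTCAGCGCCTA" + "CGG" + "T" * 40
parts = extract_window(reference, spacer_start=40)
```

With the defaults above, the four parts always add up to 90 bases, matching the 35+20+3+32 layout described in the text.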

The extracted sgRNA information is binary-encoded according to the rule shown in Table 3.

Table 3

Base	Code
A	1000
C	0100
G	0010
T	0001
N	0000

The 90 bases extracted above can then be encoded as:
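The encoding rule stated in the summary (A 1000, C 0100, G 0010, T 0001, N 0000) is a four-bit one-hot code per base. A minimal Python sketch follows; the name `encode` is hypothetical, and flattening the codes into a single list is an assumption about how the feature vector is laid out.

```python
# One-hot code per base, as given by the binary rule; N maps to all zeros.
CODE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),
}


def encode(sequence: str):
    """Flatten the per-base codes into one binary feature vector."""
    bits = []
    for base in sequence.upper():
        bits.extend(CODE[base])
    return bits


vector = encode("ACGTN")
```

Under this layout a 90-base window yields a vector of 360 binary features.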

Features are extracted with machine learning methods and the personalized sgRNA design model is built.

For categorical data, logistic regression is used to select features and build the prediction model. Binary logistic regression offers two optional regularizations; the invention uses L1 regularization for feature selection and L2 regularization to build the model.
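The two-stage categorical model can be sketched in plain Python. This is an illustration under stated assumptions rather than the patent's implementation: it minimizes the mean log-loss with 0/1 labels by proximal gradient descent, and the function names, learning rate, penalty strengths and toy data are all hypothetical.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically safe logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(X, y, l1=0.0, l2=0.0, lr=0.5, steps=2000):
    """Proximal gradient descent on mean log-loss + l1*||w||_1 + (l2/2)*||w||^2."""
    n, p = len(X), len(X[0])
    w, c = [0.0] * p, 0.0
    for _ in range(steps):
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + c) - yi
                  for row, yi in zip(X, y)]
        for j in range(p):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n + l2 * w[j]
            step = w[j] - lr * grad
            # Soft-thresholding is the proximal operator of the L1 penalty.
            w[j] = math.copysign(max(abs(step) - lr * l1, 0.0), step)
        c -= lr * sum(errors) / n  # the intercept is not penalized
    return w, c


def select_then_fit(X, y, l1=0.1, l2=0.01):
    """Stage 1: keep features with non-zero L1 weights. Stage 2: refit with L2."""
    w_l1, _ = fit_logistic(X, y, l1=l1)
    kept = [j for j, wj in enumerate(w_l1) if wj != 0.0]
    X_kept = [[row[j] for j in kept] for row in X]
    return kept, fit_logistic(X_kept, y, l2=l2)


# Toy data: feature 0 determines the label, feature 1 carries no information.
X = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
kept, (weights, intercept) = select_then_fit(X, y)
```

On this toy data the L1 stage drops the uninformative feature and the L2 stage fits a model on the remaining one.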

L1-regularized logistic regression solves the following sparse feature-selection optimization problem:

where w and c are the weights and intercept of the features being estimated, X is the feature representation of the training samples, n is the number of training samples, and y is the cleavage-efficiency value of each sgRNA.

L2-penalized logistic regression minimizes the following cost function:

For numeric data, a Lasso model is used for feature selection and standard linear regression is used to build the prediction model. Lasso is a linear model that estimates sparse coefficients; it selects feature vectors mainly by extracting the non-zero weights. The objective function to be minimized is:

$$\min_{w} \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$$

where w is the weight vector of the features being estimated, X holds the feature vectors of the selected sgRNAs, n is the number of training samples, and y is the cleavage-efficiency value of each sgRNA; α is a constant and ‖w‖₁ is the L1 norm of the parameter vector. The Lasso model solves the least-squares loss function with the penalty term α‖w‖₁ added; under this regularization, the features with non-zero weights are extracted, and these features are regarded as important factors influencing sgRNA cleavage efficiency.

After these features have been selected, an evaluation model is built with standard linear regression.
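The two-stage numeric model (Lasso for feature selection, then linear regression on the surviving features) can be sketched in plain Python. This is a minimal sketch rather than the patent's implementation: it uses cyclic coordinate descent with soft-thresholding on the objective (1/(2n))‖Xw − y‖² + α‖w‖₁, omits the intercept, and reuses the same routine with α = 0 as a stand-in for standard linear regression. The function names, the toy data and α = 0.05 are illustrative assumptions.

```python
def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value towards zero by threshold; exact zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def coordinate_descent(X, y, alpha: float, sweeps: int = 200):
    """Minimize (1/(2n))*||Xw - y||^2 + alpha*||w||_1 without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column cannot carry weight
            # Correlation of feature j with the residual excluding its own contribution.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_w
    return w


def select_then_fit(X, y, alpha: float):
    """Stage 1: Lasso keeps non-zero weights. Stage 2: least squares on those features."""
    lasso_w = coordinate_descent(X, y, alpha)
    kept = [j for j, wj in enumerate(lasso_w) if wj != 0.0]
    X_kept = [[row[j] for j in kept] for row in X]
    ols_w = coordinate_descent(X_kept, y, alpha=0.0) if kept else []
    return kept, ols_w


# Toy data: the score depends on feature 0 only (y = 0.8 * x0).
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0]]
y = [0.8, 0.8, 0.0, 0.0, 0.8, 0.0]
kept, ols_w = select_then_fit(X, y, alpha=0.05)
```

Refitting without the penalty matters because the Lasso weights themselves are shrunk towards zero; the second stage recovers an unshrunk coefficient for the retained feature.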

For both numeric and categorical data, modeling produces two files: an xml file containing the selected features and the cross-validation results, and a pkl file, a binary file that contains the trained prediction model.

The content of the xml file is as follows:

<report>
  <features>
    <ups_33_C/>
    <ups_30_G/>
    <ups_22_G/>
    <ups_21_G/>
    <ups_20_A/>
    <ups_14_G/>
    <ups_13_C/>
    <ups_9_G/>
    <ups_8_A/>
    <ups_5_G/>
    <ups_1_A/>
    <spa_1_C/>
    <spa_2_C/>
    <spa_4_A/>
    <spa_5_T/>
    <spa_6_C/>
    <spa_8_A/>
    <spa_9_C/>
    <spa_9_T/>
    <spa_10_C/>
    <spa_15_T/>
    <spa_17_A/>
    <spa_19_C/>
    <pam_1_G/>
    <pam_2_G/>
    <dws_2_A/>
    <dws_6_G/>
    <dws_12_G/>
    <dws_13_C/>
    <dws_14_C/>
    <dws_15_A/>
    <dws_24_A/>
    <dws_24_G/>
    <dws_26_G/>
    <dws_26_T/>
    <dws_28_A/>
  </features>
  <cross_validation fold="5">
    <metric>
      <pearson_cor value="0.862"/>
      <r2 value="0.683"/>
    </metric>
  </cross_validation>
</report>
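A report of this shape can be read back with Python's standard `xml.etree.ElementTree` module. The sketch below assumes a well-formed file whose root element is `report`, with a `features` element and a `metric` element nested inside `cross_validation`; the sample is a shortened stand-in for the listing, and `read_report` is a hypothetical name. Each feature tag is split into region, position and base.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the report shown in the text.
SAMPLE = """<report>
  <features>
    <ups_33_C/>
    <spa_9_T/>
    <pam_1_G/>
    <dws_24_A/>
  </features>
  <cross_validation fold="5">
    <metric>
      <pearson_cor value="0.862"/>
      <r2 value="0.683"/>
    </metric>
  </cross_validation>
</report>"""


def read_report(xml_text: str):
    """Return (features, fold count, metrics) from a model report."""
    root = ET.fromstring(xml_text)
    features = []
    for node in root.find("features"):
        region, position, base = node.tag.split("_")
        features.append((region, int(position), base))
    cv = root.find("cross_validation")
    metrics = {m.tag: float(m.get("value")) for m in cv.find("metric")}
    return features, int(cv.get("fold")), metrics


features, folds, metrics = read_report(SAMPLE)
```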

(3) Model evaluation: the quality of the prediction model is measured with the NDCG algorithm. NDCG (Normalized Discounted Cumulative Gain) is mainly used to measure the effectiveness of a ranking model. Its value represents the similarity between the predicted ranking and the actual ranking and lies between 0 and 1, where 1 means the two rankings are identical; the larger the value, the better the model. The formula is:

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$$

DCG (Discounted Cumulative Gain) is the value calculated from the predicted ranking; IDCG (ideal DCG) is the ideal DCG, calculated from the true ranking. The mathematical definition of DCG is as follows:

where rel_i is the ranking value of the prediction at position i.
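The NDCG calculation can be sketched in Python as follows. This is a hedged example: several DCG variants are in common use, and the discount used here, rel_i / log2(i + 1) with positions counted from 1, is an assumption rather than the form confirmed by the text. Using the benchmark score as the relevance value is likewise an assumption, and the function names are hypothetical. The two score lists are the first nine rows of Table 4.

```python
import math


def dcg(relevances):
    """Assumed DCG form: sum of rel_i / log2(i + 1), positions counted from 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(true_scores, predicted_scores, k=None):
    """DCG of the true scores in predicted order, divided by the ideal DCG."""
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_scores[i] for i in order][:k]
    ideal = sorted(true_scores, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Benchmark Score and Cage Score columns, first nine rows of Table 4.
benchmark = [0.789456865, 0.506422166, 0.325738326, 0.092078991, 0.072255575,
             0.037552375, 0.019922477, 0.017296539, 0.002206787]
cage = [0.6905, 0.6026, 0.5548, 0.5095, 0.4959, 0.4473, 0.3281, 0.3357, 0.3056]

perfect = ndcg(benchmark, benchmark)
observed = ndcg(benchmark, cage)
```

Because these nine rows differ from the benchmark order only by one swap near the bottom, the value stays very close to 1; the figures reported for the full data set cover many more sgRNAs and are not reproduced by this excerpt.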

In Table 4, sgID is the name of the sgRNA, seq is the spacer sequence of the sgRNA, Benchmark Score is the benchmark score, BS_rank is the rank by Benchmark Score, Cage Score is the score assigned by the prediction model of the invention, and C_rank is the rank by Cage Score.

Table 4

sgID	seq	Benchmark Score	BS_rank	Cage Score	C_rank
sg1000	GCAGGTACCCTGCAACGTCGCGG	0.789456865	1	0.6905	1
sg1001	CTCCACTAGTCCCCGCGCCGCGG	0.506422166	2	0.6026	2
sg1	GTAATGGCTTCCTCGTGAGTTGG	0.325738326	3	0.5548	3
sg1002	GACTCCGTTGGGATCCGCGCCGG	0.092078991	4	0.5095	4
sg10	ATCTTAAGCAAACGCTTACCAGG	0.072255575	5	0.4959	5
sg1003	CCCGAAACGGTTGACTCCGTTGG	0.037552375	6	0.4473	6
sg1004	AGGCGCGCGATCCAGGTAGCTGG	0.019922477	7	0.3281	8
sg100	AAAAAGCTGATGAAGTTGTTTGG	0.017296539	8	0.3357	7
sg1005	CGGGGCCACCGCGACGTTGCAGG	0.002206787	9	0.3056	9
……	……	……	……	……	……

Top 50 NDCG = 0.876322904

Top 10% NDCG = 0.84340749

If the database does not yet contain such a model, the new model is added to the database. Otherwise the NDCG values of the two models are calculated and compared, and if the new model has a larger NDCG value than the existing model, the database is updated.
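The update rule reduces to "insert if absent, otherwise keep the model with the larger NDCG". A minimal Python sketch is given below; the in-memory dictionary, the (species, cell type) key and the name `update_model_store` are assumptions made for illustration and do not describe the actual database.

```python
def update_model_store(store: dict, key, model, ndcg_value: float) -> bool:
    """Store the model if the key is new or the NDCG improves; report whether it was stored."""
    current = store.get(key)
    if current is None or ndcg_value > current["ndcg"]:
        store[key] = {"model": model, "ndcg": ndcg_value}
        return True
    return False


store = {}
key = ("human", "HEK293T")  # hypothetical species / cell type pair
first = update_model_store(store, key, "model_a", 0.84)   # no model yet: stored
worse = update_model_store(store, key, "model_b", 0.80)   # lower NDCG: rejected
better = update_model_store(store, key, "model_c", 0.88)  # higher NDCG: replaces model_a
```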

(4) Design and evaluation: as shown in Fig. 2, either sgRNAs already designed by the user are evaluated, or sgRNAs are designed for a genome region given by the user (for example chromosome 1, 1,000,000 to 1,002,000, hg19). The species or cell type of the sgRNAs to be evaluated is determined first, and a suitable model is then selected for the evaluation; if no suitable model exists, a similar model may be selected. This embodiment provides 10 models covering 3 species and 8 cell types to choose from. The output is shown in Table 5.

Table 5

At this point, the user can select the sgRNAs that suit their needs for the next stage of research.

Claims

1. A method for designing sgRNA based on CRISPR/Cas9, characterized in that the method comprises the following steps:

13) according to species, cell type and experimental conditions, classifying the data obtained in steps 11) and 12) under different reference genomes, and producing for each reference genome a table whose first column is the sgRNA name, second column the sgRNA sequence, and third column the corresponding Cas9 cleavage efficiency;

2) building a personalized sgRNA design model, specifically:

23) for the sgRNAs obtained in step 21), determining the data type of their Cas9 cleavage efficiency: if numeric, proceeding to step 24); if categorical, proceeding to step 25);

Categorical data: for categorical data collected from the literature, effective is defined as 1 and ineffective as 0;

Numeric data: for numeric NGS data, the sgRNA sequences and the NGS reads are first each aligned to the human reference genome with BWA; the reads containing the sgRNA are extracted, and it is determined whether an indel was produced at the cut site and whether the indel is OTF; the OTF rate of each sgRNA is then counted, where OTF rate = number of reads that contain the sgRNA and are OTF, divided by the total number of reads that contain the sgRNA;

24) for the sgRNA sequence information encoded in step 22), performing feature extraction with a Lasso model and building the personalized sgRNA design model by standard linear regression;

25) for the sgRNA sequence information encoded in step 22), performing feature selection with L1 regularization in binary logistic regression, then building the personalized sgRNA design model with L2 regularization in binary logistic regression;

3) measuring the quality of the personalized sgRNA design model built in step 2) with the NDCG algorithm and updating the SRA database, specifically:

32) determining whether a corresponding personalized sgRNA model already exists in the existing SRA database; if not, adding the model to the SRA database; if so, proceeding to step 33);

33) comparing the personalized sgRNA model with the corresponding sgRNA model in the SRA database, and storing the one with the larger NDCG value in the SRA database;

41) according to the genome region provided by the user, selecting a suitable reference genome from the SRA database and searching it for all sgRNAs that satisfy the design rule, which become the designed sgRNAs;

2. The method for designing sgRNA based on CRISPR/Cas9 according to claim 1, characterized in that the corresponding Cas9 cleavage-efficiency value in step 12) is calculated as follows:

122) extracting the reads that contain the sgRNA;

123) determining whether an insertion or deletion in the DNA was produced at the cut site, and whether that insertion or deletion is a frameshift mutation;

124) counting the frameshift mutation rate of each sgRNA, specifically:

3. The method for designing sgRNA based on CRISPR/Cas9 according to claim 1, characterized in that the sgRNA sequence information in step 21) comprises the sgRNA sequence, the marker segment required for the sgRNA to recognize DNA, and the bases upstream and downstream of the sgRNA spacer; the lengths of the upstream and downstream flanks of the sgRNA spacer are either the platform defaults or values set by the user.

4. The method for designing sgRNA based on CRISPR/Cas9 according to claim 1, characterized in that the binary rule in step 22) is: A corresponds to 1000, C to 0100, G to 0010, T to 0001, and N to 0000.

5. The method for designing sgRNA based on CRISPR/Cas9 according to claim 1, characterized in that feature extraction with the Lasso model in step 24) selects feature vectors by extracting the non-zero weights, specifically:

$$\min_{w} \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$$

where w is the weight vector of the features being estimated, X holds the feature vectors of the selected sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage-efficiency value of each sgRNA; α is a constant and ‖w‖₁ is the L1 norm of the parameter vector; the Lasso model solves the least-squares loss function with the penalty term α‖w‖₁ added, and under this regularization the features with non-zero weights are extracted.

6. The method for designing sgRNA based on CRISPR/Cas9 according to claim 1, characterized in that the NDCG value of the built personalized sgRNA design model in step 31) is calculated as:

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$$

where DCG is the value calculated from the predicted ranking, IDCG is the ideal DCG calculated from the true ranking, and rel_i is the ranking value of the prediction at position i.

7. The method for designing sgRNA based on CRISPR/Cas9 according to claim 1, characterized in that the design rule in step 41) is specifically:

20bp+PAM

where bp is the unit of DNA length, PAM is the marker segment required for the sgRNA to recognize DNA, and "+" means that the DNA is 20 bp long and is accompanied by the PAM.