CN106446600A

CN106446600A - CRISPR/Cas9-based sgRNA design method

Info

Publication number: CN106446600A
Application number: CN201610341946.3A
Authority: CN
Inventors: 刘琦; 啜国晖; 陈亚男; 闫纪芳
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2016-05-20
Filing date: 2016-05-20
Publication date: 2017-02-22
Anticipated expiration: 2036-05-20
Also published as: CN106446600B

Abstract

The present invention relates to a CRISPR/Cas9-based sgRNA design method comprising the following steps: obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values; establishing a personalized sgRNA design model; using the NDCG algorithm to measure the quality of the established personalized sgRNA design model and updating the database; and designing sgRNAs and giving an evaluation value for each sgRNA. Compared with the prior art, the invention offers high accuracy, more complete features, a wide application range, and a broad range of analyzable data.

Description

A CRISPR/Cas9-based sgRNA design method

Technical field

The invention relates to the field of gene editing research, and in particular to a method for designing sgRNA based on CRISPR/Cas9 gene editing technology.

Background

With the development of molecular biology, our understanding of the building blocks of life has deepened, but much remains unclear about the mechanisms of life processes, especially the mechanisms underlying certain diseases. Studying the relationship between genes and phenotypes, and the interactions among genes, urgently requires an engineering technology that can rapidly knock out and insert genes in vivo. The CRISPR/Cas9 system emerged to meet this need.

The CRISPR/Cas9 system (clustered regularly interspaced short palindromic repeats/CRISPR-associated protein 9) is a gene editing tool that is simple to operate and widely applicable. The system consists mainly of a nuclease (Cas9) and a guide RNA (sgRNA) responsible for target recognition. The sgRNA recognizes the target gene site through complementary base pairing and recruits Cas9 to cleave it, producing a double-strand break and thereby enabling gene editing at the DNA level. Because it is broadly applicable, convenient, and fast, it was quickly adopted in many areas and is particularly advantageous for building cancer models and exploring gene therapy.

However, researchers have found that different sgRNAs designed for the same gene in the same cell differ greatly in cleavage efficiency. If a highly efficient sgRNA cannot be designed, the only remedy is to increase the concentration, which burdens the cell with a large amount of unwanted genetic material and produces a high proportion of off-target events, greatly inconveniencing researchers. Designing sgRNAs with high cleavage efficiency is therefore very important for genetic research.

At present there are nearly 30 sgRNA design tools, which fall into two main categories. The first category summarizes rules from experiments, for example that one end of the paired sgRNA sequence must contain a PAM sequence, that the 5' end should be GG, that the GC content should be kept at about 60%, and that the seed sequence cannot tolerate mismatches, and then screens candidates directly against these conditions. The second category uses statistical methods to assign a weight to each base in order to calculate sgRNA specificity, as in CRISPR Design. Both types of software build a general-purpose model. However, because different species and different cells are highly heterogeneous, the predictive performance of existing software is not very good; and because heterogeneity across experimental conditions also affects sgRNA cleavage efficiency, the evaluation accuracy of a general model is relatively low.

Therefore, taking into account the heterogeneity of data across platforms and species, and using data from different platforms or species to build personalized models that improve the specificity and efficiency of sgRNAs, is extremely important for research on the off-target problem of the CRISPR/Cas9 system.

Summary of the invention

The purpose of the present invention is to address the above problems by providing a CRISPR/Cas9-based sgRNA design method with high accuracy and a wide application range.

To achieve this purpose, the present invention provides a CRISPR/Cas9-based sgRNA design method comprising the following steps:

1) Obtain sgRNAs and the corresponding Cas9 cleavage efficiency values, specifically:

11) Obtain sgRNAs and the corresponding Cas9 cleavage efficiency values from the literature;

12) Obtain sgRNAs from the SRA database and calculate the corresponding Cas9 cleavage efficiency values;

13) Classify the data obtained in steps 11) and 12) into different reference genomes according to species, cell type, and experimental conditions; each reference genome lists a table whose first column is the sgRNA name, whose second column is the sgRNA sequence, and whose third column is the corresponding Cas9 cleavage efficiency;

2) Establish a personalized sgRNA design model, specifically:

21) As required, extract from the corresponding reference genome the sequence information of the sgRNAs obtained in step 1);

22) Binary-encode the sgRNA sequence information extracted in step 21) according to the binary rule;

23) For the sgRNAs obtained in step 21), determine the data type of the Cas9 cleavage efficiency; if it is numerical, go to step 24); if it is categorical, go to step 25);

24) For the sgRNA sequence information encoded in step 22), perform feature extraction with a Lasso model and establish the personalized sgRNA design model by standard linear regression;

25) For the sgRNA sequence information encoded in step 22), perform feature selection using L1 regularization in binary logistic regression, then establish the personalized sgRNA design model using L2 regularization in binary logistic regression;

3) Use the NDCG algorithm to measure the quality of the personalized sgRNA design model established in step 2) and update the SRA database, specifically:

31) Calculate the NDCG value of the personalized sgRNA design model established in step 2);

32) Determine whether the existing SRA database contains a corresponding personalized sgRNA model; if not, add the model to the SRA database; if so, go to step 33);

33) Compare the personalized sgRNA model with the corresponding sgRNA model in the SRA database, and store the one with the larger NDCG value in the SRA database;

4) Design sgRNAs and give an evaluation value for each sgRNA, specifically:

41) According to the genomic region given by the user, select a suitable reference genome from the SRA database, search it for all sgRNAs that satisfy the design rule, and take these as the designed sgRNAs;

42) Evaluate the sgRNAs designed in step 41) using the personalized sgRNA model established in step 2).

Preferably, the corresponding Cas9 cleavage efficiency value in step 12) is calculated as follows:

121) Align the sgRNAs and the corresponding next-generation sequencing reads to the reference genome;

122) Extract the reads that contain the sgRNA;

123) Determine whether an insertion or deletion in the DNA occurs at the cut site, and whether that insertion or deletion is a frameshift mutation;

124) Count the frameshift mutation rate of each sgRNA, specifically:

\text{frameshift mutation rate} = \frac{\text{number of reads that contain the sgRNA and carry a frameshift mutation}}{\text{total number of reads that contain the sgRNA}}

125) Use the frameshift mutation rate calculated in step 124) as the Cas9 cleavage efficiency value.

Preferably, the sequence information of the sgRNA in step 21) includes the sgRNA sequence, the marker fragment (PAM) required for the sgRNA to recognize DNA, and the bases upstream and downstream of the sgRNA spacer; the lengths of the upstream and downstream flanking regions are the platform defaults or values set by the user.

Preferably, the binary rule in step 22) is: A corresponds to 1000, C to 0100, G to 0010, T to 0001, and N to 0000.

Preferably, feature extraction with the Lasso model in step 24) selects features by extracting non-zero weights, specifically:

\min_{w} \frac{1}{2n} \| x w - y \|_2^2 + \alpha \| w \|_1

where w is the vector of feature weights to be estimated, x is the feature representation of the selected sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage efficiency value corresponding to each sgRNA; α is a constant and ||w||₁ is the L1 norm of the weight vector. The Lasso model solves the least-squares loss with the penalty term α||w||₁ added, and the features whose weights remain non-zero under this regularization are extracted.

Preferably, the L1 regularization in step 25) is specifically:

\min_{w, c} \frac{1}{2} \| w \|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)

where w and c are the feature weights and the intercept to be estimated, X is the binary matrix of encoded sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage efficiency value corresponding to each sgRNA.

Preferably, the L2 regularization is specifically:

\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)

Preferably, the NDCG value of the established personalized sgRNA design model in step 31) is calculated as:

\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}

\mathrm{DCG} = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2 i}

where DCG is the value calculated from the predicted ranking, IDCG is the ideal DCG calculated from the true ranking, and rel_i is the ranking value predicted at position i.

Preferably, the design rule in step 41) is:

20 bp + PAM

where bp (base pair) is the unit of DNA length and PAM is the marker fragment required for the sgRNA to recognize DNA.
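The 20 bp + PAM rule can be illustrated with a short search over a genomic region. This is only a sketch under stated assumptions: the PAM is taken to be NGG (the usual SpCas9 motif, which the document does not spell out), only the forward strand is scanned, and the function name `find_candidates` is hypothetical. The test region is the 90-base example sequence from the detailed description.

```python
import re

# Assumption: the PAM is "NGG" (any base followed by GG), as for SpCas9.
# A lookahead is used so that overlapping candidate sites are all reported.
CANDIDATE = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")


def find_candidates(region):
    """Return (start, site) for every forward-strand 20 bp + PAM site."""
    region = region.upper()
    return [(m.start(), m.group(1)) for m in CANDIDATE.finditer(region)]


# The 90-base example sequence from the detailed description.
region = (
    "CACCTGGTATGTTCGTATCGGGCAGAATAT"
    "CGCAACCTGCTCAGCGCCTACGG"
    "TCCATCTCGCTCAGGTACGACTGACCGACCCAGTCTA"
)

for start, site in find_candidates(region):
    print(start, site)
```

A fuller search would also scan the reverse complement, since a target site can lie on either strand.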

Compared with the prior art, the present invention has the following beneficial effects:

(1) A personalized strategy is used for different species and cell types, and modeling is done with data-driven machine learning algorithms, which substantially improves evaluation accuracy.

(2) A new encoding rule is used, so that the features found are more complete and are not limited to the PAM and the spacer.

(3) Users are given a workflow for building their own models, which widens the application range beyond the few species already in the database.

(4) Using the OTF rate of NGS data as the cleavage rate expands the range of analyzable data.

(5) Users can upload their own data to expand the database, which accelerates data accumulation and helps to resolve the current difficulty that optimal sgRNAs cannot be designed well because too little data is available.

Description of drawings

Figure 1 is a flowchart of the method for building and evaluating a personalized sgRNA model;

Figure 2 is a flowchart of the method for designing and evaluating sgRNAs.

Detailed description

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The embodiment is implemented on the basis of the technical solution of the present invention and gives a detailed implementation and a specific operating procedure, but the scope of protection of the present invention is not limited to the embodiment below.

Explanation of abbreviations:

CRISPR: clustered regularly interspaced short palindromic repeats

Cas9: an enzyme associated with the type II CRISPR system

NGS: next-generation sequencing

PAM: protospacer-adjacent motif, the marker fragment required for the sgRNA to recognize DNA

sgRNA: the RNA that acts as a guide in the CRISPR/Cas9 system

indel: an insertion or deletion in DNA caused by CRISPR/Cas9 editing

spacer: the roughly 20 bases of the sgRNA that take part in complementary base pairing

OTF: out of frame, a frameshift mutation

Read: a sequence obtained from one reaction in high-throughput sequencing

This embodiment provides a CRISPR/Cas9-based sgRNA design method: a workflow for building personalized sgRNA design models for different species and cell types, in which models can be built and sgRNAs designed according to different needs. It comprises the following four steps:

(1) Data collection: data collected from the literature generally fall into two types: numerical (each sgRNA with a corresponding cleavage efficiency value) or categorical (each sgRNA with a cleavage efficiency class, for example a binary effective/ineffective label). NGS data downloaded from the SRA database are numerical only. Because NGS data, once the OTF rate has been computed, follow the same workflow as numerical data collected from the literature, this embodiment describes only two kinds of data: categorical data from the literature and NGS data.

Categorical data: for categorical data collected from the literature, this embodiment defines effective as 1 and ineffective as 0, and organizes the data in the format of Table 1.

Table 1

sgID        Sequence                     Score
sgRNA_1     CGCAACCTGCTCAGCGCCTACGG      1
sgRNA_2     CAGTCTACATAACACGCCCATGG      1
sgRNA_3     CGCAACCTGCTCAGCGCCTACGG      1
...         ...                          ...
sgRNA_1_1   GGCAACCGTGGCGGCAATCGAGG      0
sgRNA_2_2   CTTCTCGGAATTCGGTGAAGGTGG     0
sgRNA_3_3   AACCTCCCGGCTTCTCGGAATTCGG    0
...         ...                          ...

Numerical data: for numerical NGS data, BWA is first used to align the sgRNA sequences and the NGS reads separately to the human reference genome. The reads containing each sgRNA are extracted, and for each read it is determined whether an indel occurred at the cut site and whether that indel is OTF. The OTF rate of each sgRNA is then computed (OTF rate = the number of reads that contain the sgRNA and are OTF, divided by the total number of reads that contain the sgRNA). Finally, the data are organized in the format of Table 2.

Table 2

sgID        Sequence                     Score
sgRNA_1     CGCAACCTGCTCAGCGCCTACGG      0.2345
sgRNA_2     CAGTCTACATAACACGCCCATGG      0.7846
sgRNA_3     CGCAACCTGCTCAGCGCCTACGG      0.2367
...         ...                          ...
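The OTF rate defined above can be sketched in a few lines once the reads containing an sgRNA have been extracted. The sketch assumes that alignment has already been done and that each read has been reduced to a signed net indel length at the cut site; it also assumes the usual definition of a frameshift, namely a net indel length that is not a multiple of 3. The function name `otf_rate` and the sample values are hypothetical.

```python
def otf_rate(indel_lengths):
    """Fraction of reads whose net indel length at the cut site is out of frame.

    indel_lengths holds one signed integer per read that contains the sgRNA:
    insertions are positive, deletions negative, and 0 means no indel.
    """
    if not indel_lengths:
        raise ValueError("no reads contain this sgRNA")
    # Assumption: an indel is out of frame when its length is not a multiple of 3.
    out_of_frame = sum(1 for length in indel_lengths if length % 3 != 0)
    return out_of_frame / len(indel_lengths)


# Hypothetical per-read indel lengths for one sgRNA.
reads = [0, -1, 2, -3, 0, 1, -6, 0]
print(otf_rate(reads))
```

In this toy input three of the eight reads (-1, 2 and 1) are out of frame, so the resulting score would go into the third column of a table shaped like Table 2.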

(2) Model building: as shown in Figure 1, the sequence information of the collected sgRNAs is extracted from the corresponding reference genome. Assuming the upstream and downstream flanks are set to 35 and 32 bases respectively, the extracted sequence is 90 (35 + 20 + 3 + 32) bases long:

CACCTGGTAT GTTCGTATCG GGCAGAATAT CGCAACCTGC TCAGCGCCTA CGGTCCATCT CGCTCAGGTA CGACTGACCG ACCCAGTCTA
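The extraction step can be sketched as a window around the spacer. This is a minimal illustration, not the patent's implementation: the function name `extract_context` and its parameters are hypothetical, only the forward strand is handled, and padding with N where the window runs past the end of the reference is an assumption (chosen because the encoding rule maps N to 0000).

```python
def extract_context(genome, spacer_start, upstream=35, downstream=32,
                    spacer_len=20, pam_len=3):
    """Return upstream flank + spacer + PAM + downstream flank.

    spacer_start is the 0-based index of the first spacer base in genome.
    Positions outside the reference are padded with N (an assumption).
    """
    begin = spacer_start - upstream
    end = spacer_start + spacer_len + pam_len + downstream
    left_pad = "N" * max(0, -begin)
    right_pad = "N" * max(0, end - len(genome))
    return left_pad + genome[max(0, begin):min(end, len(genome))] + right_pad


# Hypothetical reference: 40 A's, a 20-base spacer of C's, a TGG PAM, 40 T's.
genome = "A" * 40 + "C" * 20 + "TGG" + "T" * 40
window = extract_context(genome, spacer_start=40)
print(len(window), window[35:55], window[55:58])
```

With the default flank lengths the window is 35 + 20 + 3 + 32 = 90 bases, matching the length stated above.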

The extracted sgRNA information is binary-encoded according to the rule shown in Table 3.

Table 3

Base    Code
A       1000
C       0100
G       0010
T       0001
N       0000

The 90 bases extracted above are then encoded as:

0100 1000 0100 0100 0001 0010 0010 0001 1000 0001
0010 0001 0001 0100 0010 0001 1000 0001 0100 0010
0010 0010 0100 1000 0010 1000 1000 0001 1000 0001
0100 0010 0100 1000 1000 0100 0100 0001 0010 0100
0001 0100 1000 0010 0100 0010 0100 0100 0001 1000
0100 0010 0010 0001 0100 0100 1000 0001 0100 0001
0100 0010 0100 0001 0100 1000 0010 0010 0001 1000
0100 0010 1000 0100 0001 0010 1000 0100 0100 0010
1000 0100 0100 0100 1000 0010 0001 0100 0001 1000
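The encoding rule (A = 1000, C = 0100, G = 0010, T = 0001, N = 0000) is a one-hot code with four bits per base. A minimal sketch follows; the names `CODES` and `encode` are illustrative and not taken from the patent.

```python
# Four-bit code per base, as given by the binary rule in the document.
CODES = {"A": "1000", "C": "0100", "G": "0010", "T": "0001", "N": "0000"}


def encode(sequence):
    """Encode a DNA sequence as a flat list of 0/1 integers, four per base."""
    bits = []
    for base in sequence.upper():
        bits.extend(int(b) for b in CODES[base])
    return bits


# First ten bases of the 90-base example sequence.
print(" ".join(CODES[base] for base in "CACCTGGTAT"))
# -> 0100 1000 0100 0100 0001 0010 0010 0001 1000 0001
```

A 90-base window therefore becomes a feature vector of 360 binary values, one row of the matrix used for model fitting.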

Features are extracted with machine learning methods to build the personalized sgRNA design model.

For categorical data, logistic regression is used both to select features and to build the predictive model. Binary logistic regression offers two regularization options; the present invention uses L1 regularization for feature selection and L2 regularization to build the model.

L1-regularized logistic regression solves the following optimization problem for sparse feature selection:

\min_{w, c} \frac{1}{2} \| w \|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)

where w and c are the feature weights and the intercept to be estimated, X is the feature representation of the training samples, n is the number of training samples, and y is the cleavage efficiency value corresponding to each sgRNA.

L2-penalized logistic regression minimizes the following cost function:

\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)
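To make the two penalized objectives concrete, the sketch below evaluates them for a given weight vector, following the formulas as stated in claims 6 and 7 (including the factor 1/2 on the L1 term). It only computes the objective value and does not perform the minimization. Two assumptions are worth flagging: the logistic loss in this form expects labels in {-1, +1}, so the effective = 1 / ineffective = 0 coding of Table 1 is mapped to +1 / -1 here, and the names `logistic_objective` and `to_signed` are hypothetical.

```python
import math


def to_signed(score):
    """Map the 1/0 labels of Table 1 to +1/-1 (assumed label convention)."""
    return 1 if score == 1 else -1


def logistic_objective(X, y, w, c, C=1.0, penalty="l1"):
    """Value of the L1- or L2-penalized logistic regression objective.

    X: list of binary feature rows, y: labels in {-1, +1},
    w: weights, c: intercept, C: weight of the data-fit term.
    """
    if penalty == "l1":
        reg = 0.5 * sum(abs(wj) for wj in w)
    elif penalty == "l2":
        reg = 0.5 * sum(wj * wj for wj in w)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    loss = 0.0
    for row, label in zip(X, y):
        margin = label * (sum(xj * wj for xj, wj in zip(row, w)) + c)
        loss += math.log(math.exp(-margin) + 1.0)
    return reg + C * loss


# Hypothetical toy data: two encoded features, one sgRNA of each class.
X = [[1, 0], [0, 1]]
y = [to_signed(1), to_signed(0)]
print(logistic_objective(X, y, w=[0.0, 0.0], c=0.0, penalty="l1"))
print(logistic_objective(X, y, w=[2.0, 0.0], c=0.0, penalty="l2"))
```

The L1 penalty grows linearly with each weight, which tends to push uninformative weights to exactly zero and is why it suits feature selection; the L2 penalty shrinks weights smoothly, which is why it is used for the final model.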

For numerical data, a Lasso model is used for feature selection and standard linear regression is used to build the predictive model. Lasso is a linear model that estimates sparse coefficients, selecting features mainly by extracting non-zero weights. The objective function to be minimized is:

\min_{w} \frac{1}{2n} \| x w - y \|_2^2 + \alpha \| w \|_1

where w is the vector of feature weights to be estimated, x is the feature representation of the selected sgRNAs, n is the number of training samples, and y is the cleavage efficiency value corresponding to each sgRNA; α is a constant and ||w||₁ is the L1 norm of the weight vector. The Lasso model solves the least-squares loss with the penalty term α||w||₁ added, and the features whose weights remain non-zero under this regularization are extracted; these features are regarded as the important factors influencing sgRNA cleavage efficiency.

Once these features have been selected, a standard linear regression is used to build the evaluation model.
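The Lasso selection step can be illustrated with a dependency-free coordinate descent solver for the objective 1/(2n)·||xw − y||² + α||w||₁. This is a sketch, not the patent's implementation: the document does not say which solver is used, a library routine would normally be preferred in practice, no intercept is fitted (the stated objective has none), and the names `soft_threshold` and `lasso_fit` and the toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within it become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, alpha, n_iter=100):
    """Minimize 1/(2n) * ||Xw - y||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical toy data: two binary features, efficiency driven by the first.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [1.0, 0.05, 1.0, 0.05]
w = lasso_fit(X, y, alpha=0.1)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(w, selected)
```

The indices in `selected` are the features with non-zero weight; an ordinary least-squares fit restricted to those columns would then give the evaluation model described above.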

Both numerical and categorical modeling produce two files: an XML file containing the selected features and the cross-validation results, and a pkl file, a binary file containing the trained predictive model.

The content of the XML file is as follows:

(3) Model evaluation: the NDCG algorithm is used to measure the quality of the predictive model. NDCG (Normalized Discounted Cumulative Gain) is mainly used to measure the performance of a ranking model. Its value represents the similarity between the predicted ranking and the actual ranking and lies between 0 and 1, where 1 indicates complete agreement; the larger the value, the better the model. The formula is:

\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}

DCG (Discounted Cumulative Gain) is the value calculated from the predicted ranking, and IDCG (ideal DCG) is the ideal DCG, calculated from the true ranking. DCG is defined as:

\mathrm{DCG} = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2 i}

where rel_i is the ranking value predicted at position i.
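The NDCG computation can be sketched directly from the definition DCG = rel_1 + Σ_{i≥2} rel_i / log2(i) given in claim 8. One interpretation is assumed here: rel_i is taken to be the true (benchmark) score of the sgRNA that the model places at position i, so that IDCG is the DCG of the true scores sorted in descending order. The function names are illustrative.

```python
import math


def dcg(rels):
    """DCG = rel_1 + sum over i >= 2 of rel_i / log2(i), positions from 1."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))


def ndcg(true_scores, predicted_scores):
    """NDCG of the ranking induced by predicted_scores.

    Assumption: rel_i is the true score of the item ranked i-th by the model.
    """
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    predicted_rels = [true_scores[i] for i in order]
    ideal_rels = sorted(true_scores, reverse=True)
    return dcg(predicted_rels) / dcg(ideal_rels)


# Hypothetical scores: the model swaps the two lowest-ranked sgRNAs.
truth = [0.79, 0.51, 0.02, 0.017]
model = [0.69, 0.60, 0.3281, 0.3357]
print(ndcg(truth, model))
```

Because the discount grows with position, a swap near the bottom of the list costs far less than a swap near the top; the sketch does not guard against the case where all true scores are zero.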

In Table 4, sgID is the name of the sgRNA, seq is the spacer sequence of the sgRNA, Benchmark Score is the benchmark score, BS_rank is the rank by Benchmark Score, Cage Score is the score assigned by the predictive model of the present invention, and C_rank is the rank by Cage Score.

Table 4

sgID     seq                       Benchmark Score   BS_rank   Cage Score   C_rank
sg1000   GCAGGTACCCTGCAACGTCGCGG   0.789456865       1         0.6905       1
sg1001   CTCCACTAGTCCCCGCGCCGCGG   0.506422166       2         0.6026       2
sg1      GTAATGGCTTCCTCGTGAGTTGG   0.325738326       3         0.5548       3
sg1002   GACTCCGTTGGGATCCGCGCCGG   0.092078991       4         0.5095       4
sg10     ATCTTAAGCAAACGCTTACCAGG   0.072255575       5         0.4959       5
sg1003   CCCGAAACGGTTGACTCCGTTGG   0.037552375       6         0.4473       6
sg1004   AGGCGCGCGATCCAGGTAGCTGG   0.019922477       7         0.3281       8
sg100    AAAAAGCTGATGAAGTTGTTTGG   0.017296539       8         0.3357       7
sg1005   CGGGGCCACCGCGACGTTGCAGG   0.002206787       9         0.3056       9
...      ...                       ...               ...       ...          ...

Top 50 NDCG = 0.876322904

Top 10% NDCG = 0.84340749

If the database does not yet contain such a model, the new model is added to the database. Otherwise the NDCG values of the two models are calculated and compared, and if the NDCG value of the new model is larger than that of the existing model, the new model replaces it in the database.
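The update rule just described amounts to a keep-the-better comparison. In the sketch below an in-memory dictionary stands in for the database, and the key format, the record layout, and the function name `update_model_store` are all hypothetical.

```python
def update_model_store(store, key, model, ndcg_value):
    """Store model under key if none exists or if its NDCG is larger.

    Returns True when the store was updated, False otherwise.
    """
    current = store.get(key)
    if current is None or ndcg_value > current["ndcg"]:
        store[key] = {"model": model, "ndcg": ndcg_value}
        return True
    return False


store = {}
key = ("human", "example_cell_type")  # hypothetical species / cell type key
print(update_model_store(store, key, "model_v1", 0.84))  # True: nothing stored yet
print(update_model_store(store, key, "model_v2", 0.80))  # False: lower NDCG
print(update_model_store(store, key, "model_v3", 0.88))  # True: higher NDCG
```

Keying the store by species and cell type mirrors the personalized-model idea: each condition keeps only its best-ranked model.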

(4) Design and evaluation: as shown in Figure 2, sgRNAs already designed by the user are evaluated, or sgRNAs are designed for a genomic region given by the user (for example, chromosome 1, 1,000,000 to 1,002,000, hg19). First the species or cell type of the sgRNAs to be evaluated is determined, then a suitable model is selected for the evaluation; if no suitable model exists, a similar one can be chosen. This embodiment provides 10 models covering 8 cell types from 3 species to choose from. The output is shown in Table 5.

Table 5

At this point, the user can select the sgRNAs that suit their needs for further research.

Claims

1. A CRISPR/Cas9-based sgRNA design method, characterized in that the method comprises the following steps:

1) obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values, specifically:

11) obtaining sgRNAs and the corresponding Cas9 cleavage efficiency values from the literature;

12) obtaining sgRNAs from the SRA database and calculating the corresponding Cas9 cleavage efficiency values;

13) classifying the data obtained in steps 11) and 12) into different reference genomes according to species, cell type, and experimental conditions, each reference genome listing a table whose first column is the sgRNA name, whose second column is the sgRNA sequence, and whose third column is the corresponding Cas9 cleavage efficiency;

2) establishing a personalized sgRNA design model, specifically:

21) as required, extracting from the corresponding reference genome the sequence information of the sgRNAs obtained in step 1);

22) binary-encoding the sgRNA sequence information extracted in step 21) according to the binary rule;

23) for the sgRNAs obtained in step 21), determining the data type of the Cas9 cleavage efficiency, and proceeding to step 24) if it is numerical or to step 25) if it is categorical;

24) for the sgRNA sequence information encoded in step 22), performing feature extraction with a Lasso model and establishing the personalized sgRNA design model by standard linear regression;

25) for the sgRNA sequence information encoded in step 22), performing feature selection using L1 regularization in binary logistic regression, then establishing the personalized sgRNA design model using L2 regularization in binary logistic regression;

3) using the NDCG algorithm to measure the quality of the personalized sgRNA design model established in step 2) and updating the SRA database, specifically:

31) calculating the NDCG value of the personalized sgRNA design model established in step 2);

32) determining whether the existing SRA database contains a corresponding personalized sgRNA model, adding the model to the SRA database if not, and proceeding to step 33) if so;

33) comparing the personalized sgRNA model with the corresponding sgRNA model in the SRA database, and storing the one with the larger NDCG value in the SRA database;

4) designing sgRNAs and giving an evaluation value for each sgRNA, specifically:

41) according to the genomic region given by the user, selecting a suitable reference genome from the SRA database, searching it for all sgRNAs that satisfy the design rule, and taking these as the designed sgRNAs;

42) evaluating the sgRNAs designed in step 41) using the personalized sgRNA model established in step 2).

2. The CRISPR/Cas9-based sgRNA design method according to claim 1, characterized in that the corresponding Cas9 cleavage efficiency value in step 12) is calculated as follows:

121) aligning the sgRNAs and the corresponding next-generation sequencing reads to the reference genome;

122) extracting the reads that contain the sgRNA;

123) determining whether an insertion or deletion in the DNA occurs at the cut site, and whether that insertion or deletion is a frameshift mutation;

124) counting the frameshift mutation rate of each sgRNA, specifically:

\text{frameshift mutation rate} = \frac{\text{number of reads that contain the sgRNA and carry a frameshift mutation}}{\text{total number of reads that contain the sgRNA}}

125) using the frameshift mutation rate calculated in step 124) as the Cas9 cleavage efficiency value.

3. The CRISPR/Cas9-based sgRNA design method according to claim 1, characterized in that in step 21) the sequence information of the sgRNA includes the sgRNA sequence, the marker fragment required for the sgRNA to recognize DNA, and the bases upstream and downstream of the sgRNA spacer, the lengths of the upstream and downstream flanking regions being the platform defaults or values set by the user.

4. The CRISPR/Cas9-based sgRNA design method according to claim 1, characterized in that the binary rule in step 22) is: A corresponds to 1000, C to 0100, G to 0010, T to 0001, and N to 0000.

5. The CRISPR/Cas9-based sgRNA design method according to claim 1, characterized in that feature extraction with the Lasso model in step 24) selects features by extracting non-zero weights, specifically:

\min_{w} \frac{1}{2n} \| x w - y \|_2^2 + \alpha \| w \|_1

where w is the vector of feature weights to be estimated, x is the feature representation of the selected sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage efficiency value corresponding to each sgRNA; α is a constant and ||w||₁ is the L1 norm of the weight vector; the Lasso model solves the least-squares loss with the penalty term α||w||₁ added, and the features whose weights remain non-zero under this regularization are extracted.

6. The CRISPR/Cas9-based sgRNA design method according to claim 1, characterized in that the L1 regularization in step 25) is specifically:

\min_{w, c} \frac{1}{2} \| w \|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)

where w and c are the feature weights and the intercept to be estimated, X is the binary matrix of encoded sgRNAs, n is the number of sgRNAs, and y is the Cas9 cleavage efficiency value corresponding to each sgRNA.

7. The CRISPR/Cas9-based sgRNA design method according to claim 6, characterized in that the L2 regularization is specifically:

\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)

8. The CRISPR/Cas9-based sgRNA design method according to claim 1, characterized in that the NDCG value of the established personalized sgRNA design model in step 31) is calculated as:

\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}

\mathrm{DCG} = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2 i}

where DCG is the value calculated from the predicted ranking, IDCG is the ideal DCG calculated from the true ranking, and rel_i is the ranking value predicted at position i.

9. The CRISPR/Cas9-based sgRNA design method according to claim 1, characterized in that the design rule in step 41) is:

20 bp + PAM

where bp is the unit of DNA length and PAM is the marker fragment required for the sgRNA to recognize DNA.