CN110070912A

CN110070912A - Method for predicting CRISPR/Cas9 off-target effects

Info

Publication number: CN110070912A
Application number: CN201910299222.0A
Authority: CN
Inventors: 樊永显; 徐海波; 张向文; 张龙
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2019-04-15
Filing date: 2019-04-15
Publication date: 2019-07-30
Anticipated expiration: 2039-04-15
Also published as: CN110070912B

Abstract

The invention discloses a method for predicting CRISPR/Cas9 off-target effects, characterized by comprising the following steps: 1) constructing a data set containing positive and negative samples; 2) encoding the sample data set and adding features; 3) processing the sample data with feature selection; 4) constructing a Broad Learning predictor. The method predicts quickly and with high accuracy.

Description

Method for predicting CRISPR/Cas9 off-target effects

Technical field

The present invention relates to gene technology, and specifically to a method for predicting CRISPR/Cas9 off-target effects.

Background art

Since CRISPR/Cas9 technology was first applied to gene editing, the CRISPR/Cas9 system has rapidly swept through the life sciences and transformed gene-editing technology. CRISPR/Cas9 is the third generation of targeted genome-editing technology, after zinc finger nucleases and transcription activator-like effector nucleases, and can edit and modify the DNA sequence at a specific position. The first two generations of genome-editing technology recognize DNA sequences through protein specificity, whereas CRISPR locates the target DNA by complementary base pairing with a DNA sequence 20 nt long, and therefore has better generality. The CRISPR/Cas9 system consists of CRISPR sequence elements and the Cas9 nuclease. Guided by crRNA and tracrRNA, the Cas9 nuclease cuts both strands of the targeted DNA next to the protospacer adjacent motif (PAM, usually NGG, where N is any base), forming a DNA double-strand break. The CRISPR/Cas9 system has potential off-target effects when targeting an organism's genome. The Cas9 nuclease tolerates a certain degree of mismatch between the sgRNA and the target DNA sequence. Besides cutting the DNA strand at the target site, the sgRNA may also partially match non-target DNA sequences that are highly homologous to the target site, activating the Cas9 nuclease to cut those non-target sequences and producing off-target effects. Off-target effects can cause a large number of unintended cuts in the genes of the genome with uncontrollable consequences, and this is also the biggest problem in using the CRISPR/Cas9 system clinically.

An important current research direction for the CRISPR/Cas9 system is predicting on-target efficiency and off-target sites; accurately predicting the interaction between the CRISPR system and DNA sequences can be used to maximize on-target activity and minimize off-target effects. Most existing CRISPR/Cas9 off-target design tools simply search for off-target sites using sequence-match scores and base mismatches. Other tools predict the off-target efficiency of a site by designing an off-target site score. Outside China, off-target prediction currently relies mainly on sequence similarity and physicochemical properties, while within China convolutional neural networks are more often used for prediction on CRISPR/Cas9 sequences. In terms of the techniques used, existing work falls mainly into the following categories: support vector machines (SVM) (Wong N, Liu W, Wang X. WU-CRISPR: characteristics of functional guide RNAs for the CRISPR/Cas9 system. Genome Biol, 2015, 16: 218.), random forests (Abadi S, Yan WX, Amar D, Mayrose I. A machine learning approach for predicting CRISPR-Cas9 cleavage efficiencies and patterns underlying its mechanism of action. PLoS Comput Biol, 2017, 13(10): e1005807.), convolutional neural networks (Kim HK, Min S, Song M, Jung S, Choi JW, Kim Y, Lee S, Yoon S, Kim HH. Deep learning improves prediction of CRISPR-Cpf1 guide RNA activity. Nat Biotechnol, 2018, 36(3): 239-241.), logistic regression (Prykhozhij SV, Rajan V, Gaston D, Berman JN. CRISPR multitargeter: a web tool to find common and unique CRISPR single guide RNA targets in a set of similar sequences. PLoS One, 2015, 10(3): e0119372.), and Bayesian analysis (Hart T, Moffat J. BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 2016, 17: 164.). All of the above apply machine learning methods to the prediction of CRISPR/Cas9 off-target effects.

Summary of the invention

The object of the invention is to provide a method for predicting CRISPR/Cas9 off-target effects that overcomes two defects: building a CRISPR/Cas9 off-target library through biochemical experiments is time-consuming and costly, and conventional prediction methods are slow and insufficiently accurate. The method predicts quickly and with high accuracy.

The technical solution that achieves the object of the invention is:

A method for predicting CRISPR/Cas9 off-target effects which, unlike the prior art, comprises the following steps:

1) Construct a data set containing positive and negative samples: positive samples are obtained from published GUIDE-Seq, HTGTS and BLESS experimental data. Each sgRNA is mapped to the human genome with the bowtie2 program to find DNA sequences with fewer than 4 mismatches to the target DNA sequence, which are taken as possible off-target sequences; each sequence obtained is a DNA sequence 23 bases long ending in NGG, where N is any one of A, C, G and T. Removing the positive sample sequences from these sequences gives the negative samples. The positive samples are over-sampled with the Bootstrap method, and the negative samples are under-sampled so that the number of negative samples selected equals the number of positive samples. Because the human genome is very large, the resulting number of negative samples is enormous and the positive and negative samples are extremely imbalanced; this imbalance harms training and can even cause it to fail, so the negative samples must be resampled to remove it. The Bootstrap method is used here to sample the data: sampling with replacement is performed over the n original data points with the sample size kept at n, and every observation in the original data has the same probability, 1/n, of being drawn at each draw. The resulting sample is called a Bootstrap sample and serves as the benchmark data set S, which can be expressed as formula (1):

$$S = S^{+} \cup S^{-} \qquad (1)$$

where the subset $S^{+}$ contains only positive samples, that is, sgRNA sequences together with the off-target sequences that in practice bind the CRISPR/Cas9 system sgRNA; the subset $S^{-}$ contains only negative samples, that is, sites in the human genome that have four or fewer mismatches with the sgRNA sequence but cannot bind the sgRNA; and ∪ denotes the union of the two sets;
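The resampling in step 1) can be sketched as follows. This is a minimal illustration, not the implementation used in the invention: the function names, the toy sample identifiers and the choice to under-sample negatives without replacement are all assumptions made here.

```python
import random

def bootstrap_sample(items, size, rng):
    """Draw `size` items with replacement; every item has probability
    1/len(items) of being picked at each draw."""
    return [rng.choice(items) for _ in range(size)]

def build_balanced_set(positives, negatives, n_per_class, seed=0):
    """Return (samples, labels) holding n_per_class samples of each class.

    Assumption: positives are over-sampled with replacement (Bootstrap) and
    negatives are under-sampled without replacement.
    """
    rng = random.Random(seed)
    pos = bootstrap_sample(positives, n_per_class, rng)
    neg = rng.sample(negatives, n_per_class)
    samples = pos + neg
    labels = [1] * len(pos) + [0] * len(neg)
    return samples, labels

# Toy data: 3 positive and 10 negative sgRNA-DNA pairs (placeholder names).
positives = [f"pos_{i}" for i in range(3)]
negatives = [f"neg_{i}" for i in range(10)]
samples, labels = build_balanced_set(positives, negatives, n_per_class=5)
print(sum(labels), len(labels) - sum(labels))
```

Fixing the seed keeps the benchmark set reproducible between runs, which matters when several predictors are later compared on the same data.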

2) Encode the sample data set and add features: the sgRNA sequences and DNA sequences obtained in step 1) are one-hot encoded to obtain sequence vectors; the CFD score, CCTop score, CRISTA score, GC content, mismatch count and sgRNA-DNA sequence similarity score of each sgRNA-DNA pair are appended to obtain the feature vector, and the corresponding binary class label is generated at the same time. The CRISTA score is based on random forest and a regression model; it takes into account the influence of DNA bulges and RNA bulges on sgRNA editing efficiency and combines genomic nucleotide content, sgRNA thermodynamic properties, base similarity between the sgRNA and the target DNA sequence and other factors to produce the final CRISTA score.

For the CCTop score, an off-target mismatch score is first computed for each off-target site according to formula (2):

$$\mathrm{score}_{\text{off-target}} = \sum_{\text{mismatch}} 1.2^{pos} \qquad (2)$$

where pos is the position of a mismatch in the off-target sequence, counted from the 5' end. The off-target mismatch scores are then used to compute the off-target score, as in formula (3):

where dist is the distance from each off-target site to its nearest exon and total_off-targets is the number of off-target sites; this score considers only off-target sequences associated with exons.

For the CFD score, the reaction efficiency of sgRNA and DNA sequences in the CD33 data is first computed for a mismatch at each position and for each PAM variant; for multiple mismatches, the reaction efficiencies of the single mismatches are multiplied together. For example, if an sgRNA-DNA pair has an A:G mismatch at position 3 and a T:C mismatch at position 5, and the PAM type is 'AG', the CFD score of this pair is CFDscore = P(active | A:G, 3) × P(active | T:C, 5) × P(active | AG), where every term is calculated from frequencies observed in the CD33 data. The CFD score can be expressed as formula (4):

$$\mathrm{CFD} = \prod_{i:\,X_i = 1} P(Y = 1 \mid X_i = 1) \qquad (4)$$

where Y = 1 indicates that the sgRNA can react with the DNA sequence and $X_i = 1$ indicates that a mismatch occurs at position i.

GC content is the proportion of 'G' and 'C' bases among all bases of the DNA sequence and can be calculated by formula (5):

$$\mathrm{GC} = \frac{n_G + n_C}{n_A + n_C + n_G + n_T} \qquad (5)$$

where $n_A$, $n_C$, $n_G$ and $n_T$ are the counts of each base in the DNA sequence. The mismatch count is the number of positions at which the sgRNA sequence and the DNA sequence do not match, and the sgRNA-DNA similarity score is the ratio of the number of matched positions between the sgRNA sequence and the DNA sequence to the total sequence length;
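The sequence-derived features of step 2) are simple enough to sketch directly. The code below illustrates GC content (formula (5)), the mismatch count, the similarity score and the off-target mismatch score of formula (2); it assumes that both sequences are already aligned, have equal length and are written in the DNA alphabet, and all function names are hypothetical. The CFD, CRISTA and full CCTop scores depend on external data and models and are not covered.

```python
def gc_content(dna):
    """Fraction of G and C bases in the DNA sequence (formula (5))."""
    return sum(base in "GC" for base in dna) / len(dna)

def mismatch_positions(sgrna, dna):
    """1-based positions, counted from the 5' end, where the sequences differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(sgrna, dna)) if a != b]

def similarity(sgrna, dna):
    """Number of matched positions divided by the total sequence length."""
    return 1 - len(mismatch_positions(sgrna, dna)) / len(sgrna)

def offtarget_mismatch_score(sgrna, dna):
    """Formula (2): sum of 1.2**pos over all mismatch positions."""
    return sum(1.2 ** pos for pos in mismatch_positions(sgrna, dna))

# Toy 20-nt pair with mismatches at positions 3 and 5 (made-up sequences).
sgrna = "GACGCATAAAGATGAGACGC"
dna = "GATGAATAAAGATGAGACGC"
print(mismatch_positions(sgrna, dna))  # [3, 5]
print(gc_content(dna), similarity(sgrna, dna), offtarget_mismatch_score(sgrna, dna))
```

Because formula (2) weights a mismatch by 1.2 raised to its position, mismatches further from the 5' end contribute more to the score than mismatches near it.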

3) Process the sample data with feature selection: the vectors obtained are processed with the sklearn module of Python. An extremely randomized trees model is built with the Extratree module and trained on the data to obtain the feature importance of each dimension of the 190-dimensional vectors, and the 150 highest-ranked features are then selected for the next training step. This step reduces the vector dimension, which speeds up subsequent training, and removes redundant features, which improves training accuracy. Feature importance uses the Gini coefficient, defined as follows:

For a binary classification problem the class target value is 0 or 1. For node m, let $R_m$ be the region containing its $N_m$ observations, and let

$$p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k) \qquad (6)$$

where $p_{mk}$ is the proportion of class k observed in node m, k is the class (0 or 1), and $y_i$ is the class value of observation $x_i$. The Gini coefficient is then given by formula (7):

$$H(X_m) = \sum_{k} p_{mk}\,(1 - p_{mk}) \qquad (7)$$

Clearly, the larger the value of $H(X_m)$, the higher the discriminating ability of the feature; features can therefore be ranked by this value, those that need to be retained selected, and the useless ones discarded;
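As a small worked example of formula (7), the sketch below computes the Gini value of a node from its class labels. The `impurity_decrease` helper is an addition made here for illustration: tree-based importance measures are commonly accumulated from how much a split on a feature lowers this value, which is an assumption about the ranking procedure rather than something the text spells out.

```python
from collections import Counter

def gini_impurity(labels):
    """Formula (7): sum over classes k of p_k * (1 - p_k) within one node."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Gini of the parent node minus the size-weighted Gini of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

print(gini_impurity([0, 0, 1, 1]))                      # 0.5
print(gini_impurity([0, 0, 0, 1]))                      # 0.375
print(impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5
```

A perfectly mixed binary node has the maximum value 0.5 and a pure node has 0, so a split that separates the two classes completely removes all of the parent's impurity.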

4) Construct the Broad Learning predictor: the broad learning algorithm with added BP parameter tuning is used as the Broad learning predictor for CRISPR/Cas9 off-target effects; the prediction result determines whether the sgRNA and the DNA sequence can react, thereby predicting the off-target effect of the CRISPR/Cas9 system.

Broad learning is selected as the predictor and trained on the training samples. Feature mapping nodes are constructed first. For given input data $X \in \mathbb{R}^{N \times M}$, defined as N samples with M-dimensional features, weights $W_{e_i}$ and biases $\beta_{e_i}$ are generated at random, $\phi_i$ is the activation function, and the feature mapping nodes are defined by formula (8):

$$Z_i = \phi_i(X W_{e_i} + \beta_{e_i}), \quad i = 1, \ldots, n \qquad (8)$$

Enhancement nodes are then built on top of the feature mapping nodes: weights $W_{h_j}$ and biases $\beta_{h_j}$ are generated at random, $\xi_j$ is the activation function, and the enhancement nodes are defined by formula (9):

$$H_j = \xi_j(Z^n W_{h_j} + \beta_{h_j}), \quad j = 1, \ldots, m \qquad (9)$$

where $Z^n = [Z_1, \ldots, Z_n]$. The final output is $Y \in \mathbb{R}^{N \times C}$, where N is the number of samples and C is the number of sample classes; the output Y can be defined by formula (10):

$$Y = [Z^n \mid H^m]\, W^m \qquad (10)$$

where $W^m = [Z^n \mid H^m]^{+}\, Y$, $H^m = [H_1, \ldots, H_m]$, and $[Z^n \mid H^m]^{+}$ is the pseudoinverse of $[Z^n \mid H^m]$. The weights are then tuned with the BP algorithm to obtain the final weights. Finally, ten broad learning models are obtained by assigning different weights to the broad learning model, giving the outputs of ten classifiers, and the final output is obtained by a voting strategy.

The Broad Learning predictor is built on the Broad learning algorithm. In addition to sequence information and sequence physicochemical properties, this predictor uses an ensemble learning approach and adds the CRISTA score, CCTop score and CFD score described in step 2) as training features, which greatly improves the accuracy of the model.
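The core of step 4) can be sketched with NumPy: random feature mapping nodes, random enhancement nodes, and output weights solved with a pseudoinverse. This is a minimal sketch under stated assumptions, not the predictor of the invention: the node counts, the tanh and sigmoid activations, the toy data and all names are choices made here, and the BP tuning and the ten-model voting ensemble are deliberately left out.

```python
import numpy as np

def _sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def _nodes(X, feature_params, enhance_params):
    """Build the matrix [Z | H] for inputs X from fixed random weights."""
    # Feature mapping nodes: Z_i = tanh(X W_i + b_i), concatenated over groups.
    Z = np.hstack([np.tanh(X @ W + b) for W, b in feature_params])
    # Enhancement nodes: H = sigmoid(Z W_h + b_h).
    Wh, bh = enhance_params
    H = _sigmoid(Z @ Wh + bh)
    return np.hstack([Z, H])

def train_bls(X, Y, n_groups=5, nodes_per_group=8, n_enhance=20, seed=0):
    """Draw random node weights, then solve the output weights by pseudoinverse."""
    rng = np.random.default_rng(seed)
    feature_params = [
        (rng.standard_normal((X.shape[1], nodes_per_group)),
         rng.standard_normal(nodes_per_group))
        for _ in range(n_groups)
    ]
    enhance_params = (
        rng.standard_normal((n_groups * nodes_per_group, n_enhance)),
        rng.standard_normal(n_enhance),
    )
    A = _nodes(X, feature_params, enhance_params)
    W_out = np.linalg.pinv(A) @ Y  # output weights = pinv([Z | H]) Y
    return feature_params, enhance_params, W_out

def predict_bls(model, X):
    """Return the class index with the largest output for each row of X."""
    feature_params, enhance_params, W_out = model
    return (_nodes(X, feature_params, enhance_params) @ W_out).argmax(axis=1)

# Toy two-class data (synthetic, for illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 4)), rng.normal(2.0, 1.0, (100, 4))])
labels = np.array([0] * 100 + [1] * 100)
model = train_bls(X, np.eye(2)[labels])  # one-hot targets, shape (N, C)
train_acc = (predict_bls(model, X) == labels).mean()
print(train_acc)
```

Only the output weights are learned, and they come from a single linear solve; the random node weights stay fixed, which is where the short training time of this sketch comes from.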

Compared with existing prediction techniques, this technical solution has the following advantages:

(1) Short running time: compared with traditional deep structures, Broad learning does not require the complex computation of a multi-layer architecture and does not need backpropagation with the BP algorithm to adjust weights, which greatly shortens training time;

(2) High accuracy: the purpose-built CRISPR/Cas9 off-target predictor based on Broad learning extracts features efficiently and thus improves prediction accuracy.

The method predicts quickly and with high accuracy.

Brief description of the drawings

Fig. 1 is a schematic diagram of the prediction method in the embodiment.

Detailed description of the embodiments

The invention is further described below with reference to the drawing and the embodiment, which do not limit the invention.

Embodiment:

Referring to Fig. 1, a method for predicting CRISPR/Cas9 off-target effects comprises the following steps:

1) Construct a data set containing positive and negative samples: positive samples are obtained from published GUIDE-Seq, HTGTS and BLESS experimental data. Each sgRNA is mapped to the human genome with the bowtie2 program to find DNA sequences with fewer than 4 mismatches to the target DNA sequence, which are taken as possible off-target sequences; each sequence obtained is a DNA sequence 23 bases long ending in NGG, where N is any one of A, C, G and T. Excluding the positive sample sequences from these sequences gives the negative samples. The positive samples are over-sampled with the Bootstrap method, and the negative samples are under-sampled so that the number of negative samples selected equals the number of positive samples. Because the human genome is very large, the resulting number of negative samples is enormous and the positive and negative samples are extremely imbalanced; this imbalance harms training and can even cause it to fail, so the negative samples must be resampled to remove it. The Bootstrap method is used here to sample the negative samples at random: sampling with replacement is performed over the n original data points with the sample size kept at n, and every observation in the original data has the same probability, 1/n, of being drawn at each draw. The resulting sample is called a Bootstrap sample and serves as the benchmark data set S, which can be expressed as formula (1):

$$S = S^{+} \cup S^{-} \qquad (1)$$

where the subset $S^{+}$ contains only positive samples, that is, sgRNA sequences together with the off-target sequences that in practice bind the CRISPR/Cas9 system sgRNA; the subset $S^{-}$ contains only negative samples, that is, sites in the human genome that meet the preset mismatch condition for the sgRNA sequence but cannot bind the sgRNA; and ∪ denotes the union of the two sets. This example obtains 1744 samples, 872 positive and 872 negative;

2) Encode the sample data set and add features: the sgRNA sequences and DNA sequences obtained in step 1) are one-hot encoded to obtain a 23*4*2 = 184-dimensional vector; the CFD score, CCTop score, CRISTA score, GC content, mismatch count and sgRNA-DNA sequence similarity score of each sgRNA-DNA pair are appended to obtain a 190-dimensional vector, and the corresponding binary class label (0 or 1) is generated at the same time. The CRISTA score is based on random forest and a regression model; it takes into account the influence of DNA bulges and RNA bulges on sgRNA editing efficiency and combines genomic nucleotide content, sgRNA thermodynamic properties, base similarity between the sgRNA and the target DNA sequence and other factors to produce the final CRISTA score. The off-target mismatch score used by the CCTop score is computed by formula (2):

$$\mathrm{score}_{\text{off-target}} = \sum_{\text{mismatch}} 1.2^{pos} \qquad (2)$$
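The dimension count in step 2) (23 bases × 4 channels × 2 sequences = 184, plus six scores = 190) can be checked with a short sketch. It assumes the sgRNA is written in the DNA alphabet and that the bases are ordered A, C, G, T; the function names, the example sequences and the six placeholder feature values are hypothetical.

```python
BASES = "ACGT"

def one_hot(seq):
    """Flatten a sequence into 4 indicator values per base (A, C, G, T order)."""
    vec = []
    for base in seq:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec

def encode_pair(sgrna, dna, extra_features):
    """Concatenate both one-hot encodings with the six additional features."""
    return one_hot(sgrna) + one_hot(dna) + list(extra_features)

sgrna = "GACGCATAAAGATGAGACGCTGG"  # 23 nt, ends in NGG (made-up sequence)
dna = "GATGAATAAAGATGAGACGCAGG"    # 23 nt candidate off-target site (made-up)
extra = [0.0] * 6  # placeholders for CFD, CCTop, CRISTA, GC, mismatches, similarity
vector = encode_pair(sgrna, dna, extra)
print(len(one_hot(sgrna)), len(vector))  # 92 190
```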

3) Process the sample data with feature selection: the vectors obtained are processed with the sklearn module of Python. An extremely randomized trees model is built with the Extratree module and trained on the data to obtain the feature importance of each dimension of the 190-dimensional vectors, and the 150 highest-ranked features are then selected for the next training step. This step reduces the vector dimension, which speeds up subsequent training, and removes redundant features, which improves training accuracy. In this example feature importance uses the Gini coefficient, defined as follows:

$$H(X_m) = \sum_{k} p_{mk}\,(1 - p_{mk}) \qquad (7)$$

Clearly, the larger the value of $H(X_m)$, the higher the discriminating ability of the feature; features can therefore be ranked by this value, those that need to be retained selected, and the useless ones discarded. In this example the 150 highest-ranked features are taken as the final data, giving a 1744*150 data matrix;
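The selection step might look like the sketch below, which uses scikit-learn's `ExtraTreesClassifier` and its `feature_importances_` attribute. Reading the "Extratree module" as this class is an assumption, as are the tree count, the random seed and the helper name; the input here is random synthetic data with the shape used in this example, so the resulting ranking carries no biological meaning.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_top_features(X, y, k=150, n_estimators=50, seed=0):
    """Rank features by extra-trees Gini importance and keep the top k columns."""
    forest = ExtraTreesClassifier(
        n_estimators=n_estimators, criterion="gini", random_state=seed
    )
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]  # most important first
    keep = np.sort(order[:k])  # keep the original column order
    return X[:, keep], keep

# Synthetic stand-in for the 1744 x 190 feature matrix of this example.
rng = np.random.default_rng(0)
X = rng.random((1744, 190))
y = np.array([1] * 872 + [0] * 872)
X_selected, kept_columns = select_top_features(X, y)
print(X_selected.shape)  # (1744, 150)
```

Returning the kept column indices alongside the reduced matrix makes it possible to apply the same selection to new sgRNA-DNA pairs at prediction time.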

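This example selects Broad learning as the predictor and trains it on the training samples. Feature mapping nodes are constructed first. For given input data $X \in \mathbb{R}^{N \times M}$, defined as N samples with M-dimensional features, weights $W_{e_i}$ and biases $\beta_{e_i}$ are generated at random, $\phi_i$ is the activation function, and the activation function used is tanh. The feature mapping nodes are defined by formula (8):

$$Z_i = \phi_i(X W_{e_i} + \beta_{e_i}), \quad i = 1, \ldots, n \qquad (8)$$

Enhancement nodes are then built on top of the feature mapping nodes: weights $W_{h_j}$ and biases $\beta_{h_j}$ are generated at random, $\xi_j$ is the activation function, and the activation function used is sigmoid. The enhancement nodes are defined by formula (9):

$$H_j = \xi_j(Z^n W_{h_j} + \beta_{h_j}), \quad j = 1, \ldots, m \qquad (9)$$

where $Z^n = [Z_1, \ldots, Z_n]$.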

Experimental results:

The results predicted by the method of this example were compared with the CFD method, a current mainstream method for predicting CRISPR/Cas9 off-targets in the human genome (Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016; 34(2): 184-91.). The results are shown in Table 1:

Table 1. Experimental comparison with the CFD method

Prediction method	Accuracy	AUC value
CFD score	0.897	0.91
Broad learning	0.923	0.93

As Table 1 shows, the method of this example achieves higher accuracy than the CFD score and also a higher AUC value, which indicates that the predictor obtained by the method of this example has better prediction stability and classification performance.
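For reference, the two metrics reported in Table 1 can be computed as in the sketch below. The functions and the toy labels and scores are illustrative assumptions and do not reproduce the evaluation behind the table; AUC is computed here as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: four samples with predicted scores, thresholded at 0.5.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred))  # 0.5
print(auc(y_true, scores))       # 0.75
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank the samples, which is why the two are reported side by side.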

Claims

1. A method for predicting CRISPR/Cas9 off-target effects, characterized by comprising the following steps:

1) Construct a data set containing positive and negative samples: positive samples are obtained from published GUIDE-Seq, HTGTS and BLESS experimental data; each sgRNA is mapped to the human genome with the bowtie2 program to find DNA sequences with fewer than 4 mismatches to the target DNA sequence, which are taken as possible off-target sequences; each sequence obtained is a DNA sequence 23 bases long ending in NGG, where N is any one of A, C, G and T; excluding the positive sample sequences from these sequences gives the negative samples; the positive samples are over-sampled with the Bootstrap method, and the negative samples are under-sampled so that the number of negative samples selected equals the number of positive samples;

2) Encode the sample data set and add features: the positive and negative samples obtained in step 1), each composed of an sgRNA sequence and a DNA sequence, are one-hot encoded; six features of each sgRNA-DNA pair, namely the CFD score, CCTop score, CRISTA score, GC content, mismatch count and sgRNA-DNA sequence similarity score, are appended to form the feature vector, and the corresponding binary class label is generated at the same time;

3) Process the sample data with feature selection: the feature vectors obtained are processed with the sklearn module of Python; an extremely randomized trees model is built with the Extratree module and trained on the data to obtain the feature importance of the feature vector, and the 105 highest-ranked features are then selected for the next training step;

4) Construct the Broad Learning predictor: the broad learning algorithm with added BP parameter tuning is used as the Broad learning predictor for CRISPR/Cas9 off-target effects; the prediction result determines whether the sgRNA and the DNA sequence can react, thereby predicting the off-target effect of the CRISPR/Cas9 system.