CN113257359A

CN113257359A - CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR

Info

Publication number: CN113257359A
Application number: CN202110639647.9A
Authority: CN
Inventors: 张桂珊; 陈耀文
Original assignee: Shantou University
Current assignee: Shantou University
Priority date: 2021-06-08
Filing date: 2021-06-08
Publication date: 2021-08-13

Abstract

The embodiment of the invention discloses a CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR, which comprises the following steps: (1) constructing a reference data set and an independent test set, and encoding both; (2) constructing a CNN, pre-training it on the encoded reference data set, and extracting abstract features of the guide RNA sequences; (3) ranking the abstract features extracted by the CNN by importance using the minimum redundancy maximum relevance (mRMR) method, and adding features to the feature subset one by one in order of importance using a sequential forward search algorithm; (4) inputting the constructed feature subset into an SVR model to predict the editing efficiency of the guide RNA, tuning the model parameters according to the Spearman correlation coefficient and the AUROC value until an optimal solution is obtained, and saving the trained model; (5) applying the trained CNN-SVR model in combination with a transfer learning strategy to improve the prediction accuracy of the CNN-SVR on the independent test set. The method has high prediction accuracy and strong robustness.

Description

CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR

Technical Field

The invention relates to a gene editing technology, in particular to a method for predicting CRISPR/Cas9 guide RNA editing efficiency.

Background

The CRISPR/Cas9 system is derived from a bacterial defense mechanism and is currently one of the most widely used gene editing tools. The technology can edit and modify specific positions on the genome, and has brought revolutionary changes to biology, biotechnology, medicine and other fields. CRISPR/Cas9 consists of Cas9, which has nuclease activity, and a specifically programmed guide RNA that directs the complex to the target genomic region; the target is located by recognition of the 3' PAM, and recognition and cleavage are completed by complementary base pairing. The efficiency of gene editing depends largely on the activity of the guide RNA, and the activity of different guide RNAs varies widely, so gene editing efficiency varies widely as well. However, the specific factors affecting guide RNA editing efficiency have not been fully clarified. Cas9 can also bind to unintended genomic sites, resulting in off-target effects. Designing guide RNAs with high editing efficiency and low off-target effects is therefore an important research issue for optimizing this system. Accurate prediction of guide RNA editing efficiency helps to design guide RNAs with greater activity, maximizing editing efficiency at the target site while minimizing off-target effects. Therefore, computer-aided prediction of guide RNA editing efficiency is one of the key steps for successful gene editing using the CRISPR/Cas9 system.

Machine learning has gradually been applied to the prediction of CRISPR/Cas9 guide RNA editing efficiency. Such methods consider the features of the individual nucleotides of a given guide RNA sequence to assess its cleavage performance. However, the predictive performance of machine learning relies on manual feature engineering based on prior knowledge. In addition, manually extracted features may introduce redundant information, which degrades prediction performance. The design rules underlying current machine learning methods for guide RNA activity prediction are incomplete or even biased, and still require integrated analysis of large amounts of data.

Disclosure of Invention

The technical problem to be solved by the embodiment of the invention is to provide a CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR. The method can automatically extract abstract features of the input sequence, avoids manual feature selection, and has better potential in solving the problem of guide RNA activity prediction research.

In order to solve the technical problem, the embodiment of the invention provides a CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR, which comprises the following steps:

(1) Constructing a reference data set comprising guide RNA sequences and their binary labels. Data sets from different platforms are integrated and subjected to standardized preprocessing to construct a test set. The test set contains guide RNA sequences and their corresponding editing efficiency values. The guide RNA editing efficiency is normalized as follows: a matrix of guide RNAs and editing efficiencies is constructed, in which the rows correspond to the experiments and the columns to the guide RNAs. Each entry of the matrix is the editing efficiency of one guide RNA in one experiment, and is normalized as defined below:

[normalization formula not recoverable from the source text]

where $y_{nor}$ denotes the normalized guide RNA editing efficiency, and the remaining terms denote the mean of each row, the mean of each column, and the mean of the whole matrix.

(2) Encoding the reference data set and the test set: the guide RNA sequences of the reference data set and the independent test set obtained in step (1) are one-hot encoded to obtain binary matrices. One-hot encoding represents an input sequence as a $4 \times L$ binary matrix, where 4 is the number of nucleotide species (A, C, G and T) and $L$ is the sequence length. Each position of the sequence is represented by a binary vector of length 4, where A, C, G and T are represented by $[1,0,0,0]$, $[0,1,0,0]$, $[0,0,1,0]$ and $[0,0,0,1]$, respectively.
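As an illustration of the one-hot encoding in step (2), the following is a minimal Python sketch. The function name `one_hot_encode`, the row order A, C, G, T and the example sequence are assumptions made for illustration; they are not taken from the patent.

```python
# Assumed row order of the binary matrix: A, C, G, T.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return a 4 x L binary matrix (a list of 4 rows) for a DNA sequence."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(NUCLEOTIDES)
    if unknown:
        raise ValueError(f"unsupported nucleotides: {sorted(unknown)}")
    # Row k holds a 1 wherever the sequence contains NUCLEOTIDES[k].
    return [[1 if base == nt else 0 for base in sequence] for nt in NUCLEOTIDES]


# Arbitrary 23-nt example sequence (made up for this sketch).
matrix = one_hot_encode("ACGTACGTACGTACGTACGTAGG")
print(len(matrix), len(matrix[0]))  # number of rows, number of columns
```

Each column of the result contains exactly one 1, so a 23-nt guide sequence yields the 4 × 23 input described in the text.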

(3) Constructing a convolutional neural network (CNN) model: a CNN model is constructed and its basic network parameters are trained on the reference data set. The number of convolutional layers, the size and length of the convolution kernels in each convolutional layer, the number of fully connected layers and the number of neurons in each fully connected layer are determined empirically. The input layer feeds the one-hot encoded guide RNA (a 4 × 23 binary matrix) into a one-dimensional convolutional layer (conv_1) with 256 one-dimensional convolution kernels of length 5 and stride 1, which uses the ReLU activation function and is regularized by dropout with a rate of 0.3. The second layer (conv_2) has the same structure as conv_1. The output of conv_2 is flattened and passed through four fully connected layers (FC_1, FC_2, FC_3 and FC_4) with 256, 128, 64 and 40 neurons, respectively. Finally, the outputs of the last fully connected layers of the two branches are combined and fed into the output layer, a fully connected layer (FC) containing a single neuron with a linear activation function; the loss function is the mean squared error (MSE). The hyperparameters of the CNN are optimized by grid search, which yields an optimal dropout rate of 0.3, a batch size of 256 and 200 epochs, and thereby determines the optimal CNN model structure.
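To make the layer sizes in step (3) concrete, the following Python sketch walks through the output shape and trainable parameter count of each layer. It assumes a single branch, convolutions without padding, and no pooling layer; the text does not state the padding mode, so these figures are illustrative only. The function names are hypothetical.

```python
def conv1d_output_length(length, kernel, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (length - kernel) // stride + 1


def describe_cnn(seq_len=23, channels=4, filters=256, kernel=5,
                 dense_units=(256, 128, 64, 40, 1)):
    """Return (layer name, output shape, trainable parameters) per layer."""
    layers = []
    length, in_channels = seq_len, channels
    for name in ("conv_1", "conv_2"):
        length = conv1d_output_length(length, kernel)
        # kernel * in_channels weights plus one bias, per filter
        params = (kernel * in_channels + 1) * filters
        layers.append((name, (length, filters), params))
        in_channels = filters
    width = length * in_channels
    layers.append(("flatten", (width,), 0))
    names = ("FC_1", "FC_2", "FC_3", "FC_4", "output")
    for name, units in zip(names, dense_units):
        # width weights plus one bias, per neuron
        layers.append((name, (units,), (width + 1) * units))
        width = units
    return layers


for name, shape, params in describe_cnn():
    print(f"{name:8s} {str(shape):12s} {params:>9,d}")
```

Under these assumptions the sequence length shrinks from 23 to 19 after conv_1 and to 15 after conv_2, so the flattened vector has 15 × 256 = 3840 entries and FC_1 holds most of the trainable parameters.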

(4) Extracting important sequence features by feature selection: the abstract features of the guide sequence extracted by the CNN are ranked by importance using minimum redundancy maximum relevance (mRMR), and the 13 top-ranked features are selected for further training.
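The ranking in step (4) can be sketched as a greedy procedure that repeatedly picks the feature with the best trade-off between relevance to the target and redundancy with the features already chosen. The sketch below is an assumption-laden simplification: it uses the absolute Pearson correlation for both relevance and redundancy, whereas mRMR is classically formulated with mutual information, and the patent does not state which variant it uses. The function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_rank(features, target, top_k=None):
    """Greedy mRMR-style ranking; `features` is a list of feature columns."""
    relevance = [abs(pearson(col, target)) for col in features]
    remaining = list(range(len(features)))
    selected = []
    top_k = len(features) if top_k is None else min(top_k, len(features))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(abs(pearson(features[j], features[s]))
                         for s in selected) / len(selected)
        return relevance[j] - redundancy

    while len(selected) < top_k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): feature 1 is nearly a copy of feature 0.
target = [1, 2, 3, 4, 5, 6]
features = [[1, 2, 3, 4, 5, 7], [1, 2, 3, 4, 5, 8], [2, 1, 4, 3, 6, 5]]
print(mrmr_rank(features, target))
```

In the toy data the near-duplicate feature is ranked last even though it correlates strongly with the target, which is the behavior that distinguishes this kind of ranking from a pure relevance ranking. Calling `mrmr_rank(features, target, top_k=13)` on the CNN features would correspond to keeping the 13 top-ranked features.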

(5) Constructing the SVR model: the abstract features obtained in step (4) are input into an SVR with a Gaussian radial basis function (RBF) kernel for training. A grid search is used to find the optimal penalty parameter C of 1.7, kernel parameter γ of 0.12 and ε parameter of 0.11, giving the optimal SVR model.
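For orientation, the penalty parameter, the kernel parameter and the ε parameter play the following roles in the textbook ε-insensitive SVR with a Gaussian RBF kernel. This is the standard formulation, given here as background rather than as the patent's own equations; the symbols $w$, $b$, $\xi_i$, $\xi_i^*$ and $\phi$ are the usual textbook notation.

```latex
\min_{w,\,b,\,\xi,\,\xi^*}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right)
\quad\text{s.t.}\quad
\begin{cases}
y_i - w^\top\phi(x_i) - b \le \epsilon + \xi_i,\\
w^\top\phi(x_i) + b - y_i \le \epsilon + \xi_i^*,\\
\xi_i \ge 0,\ \xi_i^* \ge 0,
\end{cases}
\qquad
K(x, x') = \phi(x)^\top\phi(x') = \exp\!\left(-\gamma\,\lVert x - x'\rVert^2\right)
```

Here the penalty parameter weights violations of the ε-tube against the flatness of the regression function, ε sets the half-width of the tube inside which errors are not penalized, and the kernel parameter controls how quickly the similarity between two feature vectors decays with their distance.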

(6) Further improving the prediction accuracy of the CNN-SVR model by transfer learning: the CNN-SVR base model is trained from scratch on the reference data set. The parameters of all layers of the resulting base model except the last two fully connected layers are transferred to the specific test set for prediction, thereby improving the prediction accuracy of the model.

The embodiment of the invention has the following beneficial effects: on the basis of a classical CNN, the final linear regression layer is replaced by an SVR, combining the ability of the CNN to extract abstract features of guide RNA sequences with the strength of SVR in regression analysis of high-dimensional data, so that prediction accuracy is improved. The invention also trains a base model on the reference data set and, using transfer learning, fine-tunes only the parameters of the last two fully connected layers when predicting on the independent test set, thereby improving the robustness of the model and its prediction accuracy on small data sets.

Drawings

FIG. 1 is a flow chart of the CNN-SVR prediction method of the present invention;

FIG. 2 is a diagram of the CNN network architecture in the present invention;

FIG. 3 is a flow chart of the model training method in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.

The application discloses a CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR, which can accurately and robustly predict the editing efficiency of guide RNA. The successive layers of the CNN enable the model to learn abstract features automatically, and the last layer of the network can be regarded as a linear classifier operating on the features extracted by the preceding hidden layers. Because the multilayer perceptron that follows the CNN feature extraction layers contains many trainable parameters, the CNN is not always the best choice for classification. An SVR with a fixed kernel function excels at handling such feature vectors and has significant advantages in minimizing generalization error. The CNN and the SVR are therefore combined into a CNN-SVR model: the CNN automatically extracts features from the sequence, and the SVR fits a regression function in the learned high-dimensional feature space, performing regression analysis on the abstract guide RNA sequence features extracted by the CNN and outputting the predicted guide RNA editing efficiency value.

The CNN-SVR must be trained before it is used to predict CRISPR/Cas9 guide RNA editing efficiency. The invention is therefore divided into two parts: the first part trains the model, and the second part predicts the guide RNA editing efficiency of the test set. The main flow is shown in FIG. 1, the CNN network structure in FIG. 2, and the model training procedure in FIG. 3.

For data obtained from different platforms and under different experimental conditions, the model needs to be trained to ensure its effectiveness. Training a model from scratch on a given training set comprises the following steps:

First, the open-source CRISPR/Cas9 guide RNA editing efficiency experimental data from different experimental platforms are integrated to construct the reference data set.

Second, the reference data set is preprocessed. First, each guide RNA sequence, 23 bp in length, is converted into a 4 × 23 binary matrix by one-hot encoding. Second, the editing efficiency values of the guide RNAs are normalized to obtain the normalized editing efficiency values.

Third, the CNN is trained from scratch on the reference data set, and its hyperparameters are optimized by grid search. The hyperparameters are optimized in the following order: weight initialization method ("zero", "he_uniform", "uniform", "glorot_uniform", "lecun_uniform", "normal", "he_normal"), dropout rate (0.2, 0.3, 0.4, 0.5, 0.6), batch size (64, 128, 256, 512) and number of epochs (50, 100, 200, 300). To avoid overfitting, the model is trained with five-fold cross-validation, and the parameter setting that minimizes the average validation loss is selected for the optimal model.
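One way to read the ordered hyperparameter optimization described above is as a one-at-a-time search, in which each hyperparameter is tuned in turn while the others are held at their current best values. The Python sketch below follows that reading, which is an interpretation rather than something the text states outright. The candidate values come from the text; the key names, the `cv_loss` callback, the default configuration and `toy_loss` are hypothetical.

```python
# Candidate values are taken from the text; the key names are illustrative.
SEARCH_SPACE = [
    ("init", ["zero", "he_uniform", "uniform", "glorot_uniform",
              "lecun_uniform", "normal", "he_normal"]),
    ("dropout", [0.2, 0.3, 0.4, 0.5, 0.6]),
    ("batch_size", [64, 128, 256, 512]),
    ("epochs", [50, 100, 200, 300]),
]


def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


def sequential_search(space, cv_loss, defaults):
    """Tune one hyperparameter at a time, in the order given by `space`.

    `cv_loss(config)` is a hypothetical callback that should train the CNN on
    each fold and return the mean validation loss for that configuration.
    """
    best = dict(defaults)
    for name, candidates in space:
        losses = [cv_loss({**best, name: value}) for value in candidates]
        best[name] = candidates[losses.index(min(losses))]
    return best


def toy_loss(config):
    """Made-up stand-in for a real cross-validated loss, so the sketch runs."""
    return abs(config["dropout"] - 0.4) + abs(config["batch_size"] - 128) / 1000


defaults = {"init": "uniform", "dropout": 0.5, "batch_size": 64, "epochs": 50}
best = sequential_search(SEARCH_SPACE, toy_loss, defaults)
print(best)
```

A one-at-a-time search over these candidates needs 7 + 5 + 4 + 4 = 20 cross-validated evaluations, whereas an exhaustive grid over the same values would need 7 × 5 × 4 × 4 = 560.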

Fourth, the features extracted by the CNN are optimized in two steps. First, the features are ranked using minimum redundancy maximum relevance (mRMR). Second, the optimal feature set is determined by sequential forward search: the features are added to the SVR training one by one, from highest to lowest mRMR importance, and the feature subset that maximizes the AUROC is selected as the optimal feature subset.
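The sequential forward search can be sketched in Python as follows. The `evaluate` callback is hypothetical: in the setting described above it would train an SVR on the given feature subset and return the resulting AUROC, but here it is replaced by made-up scores so that the sketch runs on its own.

```python
def sequential_forward_search(ranked_features, evaluate):
    """Add features in ranked order and keep the prefix with the best score."""
    best_score, best_subset = float("-inf"), []
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, list(subset)
    return best_subset, best_score


# Made-up scores indexed by subset size, standing in for a real AUROC.
toy_scores = {1: 0.70, 2: 0.78, 3: 0.81, 4: 0.80, 5: 0.79}
ranked = ["f7", "f2", "f9", "f4", "f1"]  # hypothetical feature ranking
subset, score = sequential_forward_search(ranked, lambda s: toy_scores[len(s)])
print(subset, score)
```

Because the candidate subsets are prefixes of the ranking, the search costs one model fit per feature rather than one per possible subset.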

Fifth, the selected optimal feature subset is input into an SVR with a Gaussian radial basis function (RBF) kernel for training and evaluation, and a grid search is used to find the optimal penalty parameter C, kernel parameter γ and ε parameter, giving the optimal SVR model. The grid search ranges of the three parameters are as follows:

[parameter ranges not recoverable from the source text]

The second part predicts the guide RNA editing efficiency of the test set. After the CNN-SVR has been trained, it is used to predict the editing efficiency of the guide RNAs to be analyzed, as follows:

First, the independent test set to be analyzed is preprocessed: each guide RNA sequence, 23 bp in length, is converted into a 4 × 23 binary matrix by one-hot encoding, and the editing efficiency values are normalized to obtain the normalized editing efficiency values.

Second, the data to be analyzed are predicted with the aid of transfer learning, transferring the CNN model parameters pre-trained on the reference data set to the independent test set. Specifically, the parameters of the convolutional layers, the pooling layer and the first three fully connected layers are frozen, so that the parameters learned from the reference data set are transferred, and only the weights of the last two fully connected layers are fine-tuned. The model parameters are tuned according to the Spearman correlation coefficient and the AUROC value until an optimal solution is obtained, and the trained model is then saved.
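The two evaluation measures named above can be computed as in the following Python sketch, which uses only the standard library. It assumes that the Spearman coefficient is computed between predicted and measured efficiency values, and that the AUROC is computed against binary labels (1 for high-efficiency guides, 0 otherwise); the function names and example numbers are illustrative.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sx * sy)


def auroc(labels, scores):
    """AUROC from the rank sum of the positives (Mann-Whitney U statistic)."""
    ranks = average_ranks(scores)
    positive_ranks = [r for r, label in zip(ranks, labels) if label == 1]
    n_pos = len(positive_ranks)
    n_neg = len(labels) - n_pos
    u = sum(positive_ranks) - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Both functions require some variation in their inputs: `spearman` divides by zero if either list is constant, and `auroc` needs at least one positive and one negative label.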

While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR is characterized by comprising the following steps:

S1: constructing a reference data set containing guide RNA sequences and their binary labels, integrating data sets from different platforms, performing standardized preprocessing, and constructing a test set, wherein the test set comprises guide RNA sequences and their corresponding editing efficiency values;

S2: one-hot encoding the guide RNA sequences of the reference data set and the test set to obtain binary matrices;

S3: constructing a convolutional neural network (CNN) model, and training the basic network parameters of the CNN model on the reference data set; optimizing the hyperparameters of the CNN by grid search, the hyperparameters comprising the dropout rate, the batch size and the number of epochs, and determining the optimal CNN model structure;

S4: ranking the abstract features of the guide sequence extracted by the CNN by importance using minimum redundancy maximum relevance, and selecting the 13 top-ranked features for the next training step;

S5: inputting the abstract features into an SVR with a Gaussian radial basis function kernel for training, and searching for the optimal penalty parameter C, kernel parameter γ and ε parameter by grid search to obtain the optimal SVR model;

S6: training a CNN-SVR base model from scratch on the reference data set, and transferring the parameters of all layers of the resulting base model except the last two fully connected layers to a specific test set for prediction.

2. The CNN-SVR-based CRISPR/Cas9 guide RNA editing efficiency prediction method according to claim 1, wherein the standardized preprocessing in step S1 comprises:

constructing a matrix of guide RNAs and editing efficiencies, in which the rows correspond to the experiments and the columns to the guide RNAs, and normalizing each entry of the matrix as follows:

[normalization formula not recoverable from the source text]

wherein $y_{nor}$ denotes the normalized guide RNA editing efficiency, and the remaining terms denote the mean of each row, the mean of each column, and the mean of the whole matrix.

3. The CNN-SVR-based CRISPR/Cas9 guide RNA editing efficiency prediction method of claim 2, wherein in the step S2, the one-hot encoding represents the input sequence as a $4 \times L$ binary matrix, where 4 is the number of nucleotide species, namely adenine A, guanine G, cytosine C and thymine T, and $L$ is the length of the sequence, each position of the sequence being represented by a binary vector of length 4, where A, C, G and T are represented by $[1,0,0,0]$, $[0,1,0,0]$, $[0,0,1,0]$ and $[0,0,0,1]$, respectively.

4. The CNN-SVR-based CRISPR/Cas9 guide RNA editing efficiency prediction method of claim 3, wherein the penalty parameter C is 1.7, the kernel parameter γ is 0.12 and the ε parameter is 0.11.