CN113257359A - CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR - Google Patents
- Publication number
- CN113257359A (application number CN202110639647.9A)
- Authority
- CN
- China
- Prior art keywords
- cnn
- guide rna
- svr
- model
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Computing Systems (AREA)
- Bioethics (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Public Health (AREA)
- Biotechnology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The embodiment of the invention discloses a CNN-SVR-based method for predicting CRISPR/Cas9 guide RNA editing efficiency, comprising the following steps: (1) constructing a reference data set and an independent test set, and encoding both; (2) constructing a CNN, pre-training it on the encoded reference data set, and extracting abstract features of the guide RNA sequences; (3) ranking the abstract features extracted by the CNN by importance using the minimum-redundancy maximum-relevance (mRMR) method, and adding features to the feature subset one by one in that order using a sequential forward search algorithm; (4) inputting the constructed feature subset into an SVR to predict the guide RNA editing efficiency, tuning the model parameters according to the Spearman correlation coefficient and the AUROC value until an optimal solution is obtained, and saving the trained model; (5) applying the trained CNN-SVR model together with a transfer learning strategy to improve the prediction accuracy of the CNN-SVR on the independent test set. The method has high prediction accuracy and strong robustness.
Description
Technical Field
The invention relates to a gene editing technology, in particular to a method for predicting CRISPR/Cas9 guide RNA editing efficiency.
Background
The CRISPR/Cas9 system is derived from a bacterial defense mechanism and is currently one of the most widely used gene editing tools. It can edit and modify specific positions in the genome, and it has brought revolutionary changes to biology, biotechnology, medicine and other fields. CRISPR/Cas9 consists of Cas9, which has nuclease activity, and a specifically programmed guide RNA. The complex recognizes a 3' PAM and is directed by the guide RNA to the target genomic region, where recognition and cleavage are completed through complementary base pairing. The efficiency of gene editing depends largely on the activity of the guide RNA, and the activity of different guide RNAs varies widely, so gene editing efficiency varies widely as well. However, the specific factors that affect guide RNA editing efficiency have not been fully clarified. Cas9 can also bind to unintended genomic sites, causing off-target effects. Designing guide RNAs with high editing efficiency and low off-target effects is therefore an important research problem in optimizing this system. Accurate prediction of guide RNA editing efficiency helps to design guide RNAs with greater activity, maximizing editing efficiency at the target site while minimizing off-target effects. Computer-aided prediction of guide RNA editing efficiency is thus one of the key steps for successful gene editing with the CRISPR/Cas9 system.
Machine learning has gradually been applied to the prediction of CRISPR/Cas9 guide RNA editing efficiency. Such methods consider the features of the individual nucleotides of a given guide RNA sequence to assess its cleavage performance. Their predictive performance relies on manual feature engineering informed by prior knowledge. In addition, manually extracted features may introduce redundant information, which degrades prediction. The design rules of current machine-learning-based guide RNA activity prediction methods are incomplete or even biased, and they still require integrated analysis of large amounts of data.
Disclosure of Invention
The technical problem to be solved by the embodiment of the invention is to provide a CNN-SVR-based method for predicting CRISPR/Cas9 guide RNA editing efficiency. The method automatically extracts abstract features from the input sequence, avoids manual feature selection, and is better suited to the problem of guide RNA activity prediction.
In order to solve the technical problem, the embodiment of the invention provides a CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR, which comprises the following steps:
(1) A reference data set containing guide RNA sequences and their binary labels is constructed. Data sets from different platforms are integrated and subjected to standardized preprocessing to construct a test set. The test set contains guide RNA sequences and their corresponding editing efficiency values. The guide RNA editing efficiency is normalized as follows: a matrix of guide RNAs and editing efficiencies is constructed, whose rows and columns correspond to the experiments and the guide RNAs, respectively. Each entry of the matrix is the editing efficiency of one guide RNA in one experiment, and it is normalized as defined below:
where y_nor denotes the normalized guide RNA editing efficiency, and the remaining terms are the mean of each row, the mean of each column, and the mean of the whole matrix.
(2) Encoding the reference data set and the test set: the guide RNA sequences of the reference data set and the independent test set obtained in step (1) are one-hot encoded to obtain binary matrices. One-hot encoding represents an input sequence as a 4 × L binary matrix, where 4 is the number of nucleotide types (A, C, G and T) and L is the sequence length. Each position of the sequence is represented by a binary vector of length 4, where A, C, G and T are represented by [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], respectively.
(3) Constructing a convolutional neural network (CNN) model: a CNN model is constructed, and its base network parameters are trained on the reference data set. The number of convolutional layers, the number and length of the convolution kernels in each convolutional layer, the number of fully connected layers and the number of neurons in each fully connected layer are determined empirically. The input layer feeds the one-hot encoded guide RNA (a 4 × 23 binary matrix) into a one-dimensional convolutional layer (conv_1) with 256 one-dimensional convolution kernels of length 5 and stride 1, using the ReLU activation function and dropout regularization with a drop rate of 0.3. The second layer (conv_2) has the same structure as conv_1. The output of conv_2 is flattened and passed through four fully connected layers (FC_1, FC_2, FC_3, FC_4) with 256, 128, 64 and 40 neurons, respectively. Finally, the outputs of the last fully connected layer of the two branches are combined and fed into the output layer, a fully connected (FC) layer containing a single neuron with a linear activation function; the loss function is the mean squared error (MSE). The hyper-parameters of the CNN are optimized by grid search, which determines the optimal dropout rate (0.3), batch size (256) and number of epochs (200), and thereby the optimal CNN model structure.
(4) Extracting important sequence features with a feature selection method: the abstract features of the guide sequence extracted by the CNN are ranked by importance using minimum-redundancy maximum-relevance (mRMR), and the 13 most important features are selected for further training.
(5) Constructing the SVR model: the abstract features obtained in step (4) are input into an SVR with a Gaussian radial basis function (RBF) kernel for training. A grid search finds the optimal penalty parameter C of 1.7, kernel parameter γ of 0.12 and ε parameter of 0.11, giving the optimal SVR model.
(6) Further improving the prediction accuracy of the CNN-SVR model with transfer learning: the CNN-SVR base model is trained from scratch on the reference data set. The parameters of all layers of the resulting base model except the last two fully connected layers are transferred to the specific test set for prediction, thereby improving the prediction accuracy of the model.
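The freezing scheme of step (6) can be illustrated with a minimal NumPy sketch. It assumes that "the last two fully connected layers" means FC_4 and the single-neuron output layer, which is an interpretation and not stated explicitly in the text; the function name `fine_tune_step`, the 2 × 2 toy weights and the constant gradients are hypothetical placeholders.

```python
import numpy as np

# Layer names follow step (3); the final entry is the single-neuron output layer.
LAYERS = ["conv_1", "conv_2", "FC_1", "FC_2", "FC_3", "FC_4", "output"]
# Assumption: only the last two fully connected layers are fine-tuned.
TRAINABLE = set(LAYERS[-2:])

def fine_tune_step(params, grads, lr=0.01):
    """Apply one SGD step to trainable layers; copy frozen layers unchanged."""
    return {
        name: w - lr * grads[name] if name in TRAINABLE else w.copy()
        for name, w in params.items()
    }

# Toy stand-ins for weights pre-trained on the reference data set.
rng = np.random.default_rng(0)
params = {name: rng.normal(size=(2, 2)) for name in LAYERS}
grads = {name: np.ones((2, 2)) for name in LAYERS}  # hypothetical gradients
updated = fine_tune_step(params, grads)

for name in LAYERS:
    changed = not np.array_equal(updated[name], params[name])
    print(f"{name:7s} {'fine-tuned' if changed else 'frozen'}")
```

In a deep learning framework the same effect is usually obtained by marking the transferred layers as non-trainable before fitting on the new data set.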
The embodiment of the invention has the following beneficial effects: building on a classical CNN, the invention replaces the final linear regression layer with an SVR, combining the CNN's ability to extract abstract features from guide RNA sequences with the SVR's strength in regression on high-dimensional data, thereby improving prediction accuracy. The invention also trains a base model on the reference data set and, using transfer learning, fine-tunes only the parameters of the last two fully connected layers when predicting on the independent test set, which improves the robustness of the model and its prediction accuracy on small-sample data sets.
Drawings
FIG. 1 is a flow chart of a CNN-SVR prediction method of the present invention;
FIG. 2 is a diagram of the CNN architecture of the present invention;
FIG. 3 is a flow chart of the model training method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
The application discloses a CNN-SVR-based method for predicting CRISPR/Cas9 guide RNA editing efficiency, which predicts the editing efficiency of guide RNAs accurately and robustly. The successive layers of the CNN allow the model to learn abstract features automatically, and the last layer of the network can be regarded as a linear predictor operating on the features extracted by the preceding hidden layers. Because the multi-layer perceptron that follows the CNN feature extraction layers contains many trainable parameters, the CNN itself is not always the best choice for the final prediction. An SVR with a fixed kernel function handles such feature vectors well and has a significant advantage in minimizing generalization error. The CNN and the SVR are therefore combined into a CNN-SVR model: the CNN automatically extracts the sequence features, and the SVR fits a regression function in the learned high-dimensional feature space, performs regression analysis on the abstract guide RNA sequence features extracted by the CNN, and outputs the predicted guide RNA editing efficiency value.
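The regression function that an RBF-kernel SVR evaluates on such feature vectors can be sketched as follows, using the parameter values reported in step (5) (C = 1.7, γ = 0.12, ε = 0.11). The support vectors, dual coefficients and intercept below are hypothetical placeholders rather than trained values, and two-dimensional toy features stand in for the CNN features.

```python
import numpy as np

C, GAMMA, EPSILON = 1.7, 0.12, 0.11  # optimal values reported in the text

def rbf_kernel(A, B, gamma=GAMMA):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def epsilon_insensitive(y_true, y_pred, epsilon=EPSILON):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

def svr_predict(X_new, support_vectors, dual_coef, intercept):
    """f(x) = sum_i dual_coef[i] * K(sv_i, x) + intercept."""
    return rbf_kernel(X_new, support_vectors) @ dual_coef + intercept

# Hypothetical fitted quantities; C bounds each dual coefficient in magnitude.
support_vectors = np.array([[0.0, 0.0], [1.0, 0.0]])
dual_coef = np.clip(np.array([2.5, -0.4]), -C, C)
intercept = 0.3

prediction = svr_predict(np.array([[0.0, 0.0]]), support_vectors, dual_coef, intercept)
loss = epsilon_insensitive(np.array([0.5]), np.array([0.55]))
print(prediction, loss)
```

The three parameters play distinct roles: γ sets how quickly the kernel similarity decays with distance, ε sets the width of the tube inside which errors are ignored, and C caps the influence of any single training example.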
The CNN-SVR must be trained before it is used to predict CRISPR/Cas9 guide RNA editing efficiency. The invention is therefore divided into two parts: the first part trains the model, and the second part predicts the guide RNA editing efficiency of the test set. The main flow is shown in FIG. 1, the CNN structure in FIG. 2, and the model training procedure in FIG. 3.
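The layer sizes given for the CNN in step (3) can be tallied with a short sketch. It assumes unpadded ("valid") convolutions and a single branch, neither of which is specified in the text, and it ignores the pooling layer mentioned later, so the resulting shapes and parameter counts are illustrative only. The function names are hypothetical.

```python
def conv1d_out_len(length: int, kernel: int, stride: int = 1) -> int:
    """Output length of a 1-D convolution without padding (an assumption)."""
    return (length - kernel) // stride + 1

def tally_cnn(seq_len=23, channels=4, filters=256, kernel=5,
              dense_units=(256, 128, 64, 40)):
    """Return (layer name, output shape, trainable parameters) per layer."""
    layers = []
    length, in_ch = seq_len, channels
    for name in ("conv_1", "conv_2"):
        length = conv1d_out_len(length, kernel)
        params = kernel * in_ch * filters + filters  # weights + biases
        layers.append((name, (length, filters), params))
        in_ch = filters
    width = length * filters  # size after flattening conv_2
    for i, units in enumerate(dense_units, start=1):
        layers.append((f"FC_{i}", (units,), width * units + units))
        width = units
    layers.append(("output", (1,), width + 1))  # single linear neuron
    return layers

for name, shape, params in tally_cnn():
    print(f"{name:7s} {str(shape):12s} {params:>9,d}")
```

Under these assumptions most of the parameters sit in FC_1, which is consistent with the remark that the multi-layer perceptron after the feature extraction layers holds many trainable parameters.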
For data obtained from different platforms and under different experimental conditions, the model needs to be trained to ensure its validity. Training a model from scratch on a given training set comprises the following steps:
First, open-source CRISPR/Cas9 guide RNA editing efficiency data from different experimental platforms are integrated to construct a benchmark data set.
Second, the reference data set is preprocessed. First, each guide RNA sequence of length 23 bp is converted into a 4 × 23 binary matrix using one-hot encoding. Then, the editing efficiency values of the guide RNAs are normalized to obtain normalized editing efficiency values.
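The one-hot encoding step can be sketched as follows. The row order A, C, G, T is an assumption taken from the order in which the nucleotides are listed in the text, and the 23-nt example sequence is hypothetical.

```python
import numpy as np

# Assumed row order; the text lists the nucleotides as A, C, G, T.
NUCLEOTIDES = "ACGT"

def one_hot_encode(seq: str) -> np.ndarray:
    """Return a 4 x len(seq) binary matrix with a single 1 in each column."""
    seq = seq.upper()
    matrix = np.zeros((len(NUCLEOTIDES), len(seq)), dtype=np.int8)
    for pos, base in enumerate(seq):
        row = NUCLEOTIDES.find(base)
        if row < 0:
            raise ValueError(f"unexpected base {base!r} at position {pos}")
        matrix[row, pos] = 1
    return matrix

# Hypothetical 23-nt target: a 20-nt protospacer followed by a 3-nt PAM.
example = "GACGCATAAAGATGAGACGCTGG"
encoded = one_hot_encode(example)
print(encoded.shape)
```

Rejecting unexpected characters such as N is a design choice of this sketch; the text does not say how ambiguous bases are handled.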
Third, the CNN is trained from scratch on the reference data set, and its hyper-parameters are optimized using grid search. The hyper-parameters are optimized in the following order: weight initialization method ("zero", "he_uniform", "uniform", "glorot_uniform", "lecun_uniform", "normal", "he_normal"), dropout rate (0.2, 0.3, 0.4, 0.5, 0.6), batch size (64, 128, 256, 512) and number of epochs (50, 100, 200, 300). To avoid overfitting, the model is trained with five-fold cross-validation, and the parameters that minimize the average validation loss are selected as the optimal model.
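One way to read "optimized in the following order" is a one-at-a-time search in which each hyper-parameter is tuned while earlier winners stay fixed. This reading is an assumption; a full grid over the listed values would instead evaluate 7 × 5 × 4 × 4 = 560 configurations. In the sketch below, `fold_losses` is a hypothetical callback standing in for real training, and `toy_loss` is an arbitrary stand-in whose preferred initializer was chosen only for illustration.

```python
from statistics import mean

# Candidate values as listed in the text, in the stated tuning order.
SEARCH_ORDER = [
    ("init", ["zero", "he_uniform", "uniform", "glorot_uniform",
              "lecun_uniform", "normal", "he_normal"]),
    ("dropout", [0.2, 0.3, 0.4, 0.5, 0.6]),
    ("batch_size", [64, 128, 256, 512]),
    ("epochs", [50, 100, 200, 300]),
]

def sequential_search(fold_losses, start, n_folds=5):
    """Tune one hyper-parameter at a time, keeping earlier winners fixed.

    fold_losses(config, fold) is a hypothetical callback that trains on
    every fold except `fold` and returns the validation loss on `fold`.
    """
    best = dict(start)
    for name, candidates in SEARCH_ORDER:
        scores = {}
        for value in candidates:
            config = {**best, name: value}
            scores[value] = mean(fold_losses(config, k) for k in range(n_folds))
        best[name] = min(scores, key=scores.get)  # lowest mean validation loss
    return best

def toy_loss(config, fold):
    # Stand-in for real training; the preferred initializer is arbitrary.
    penalty = abs(config["dropout"] - 0.3)
    penalty += abs(config["batch_size"] - 256) / 256
    penalty += abs(config["epochs"] - 200) / 200
    penalty += 0.0 if config["init"] == "glorot_uniform" else 0.1
    return penalty + 0.01 * fold

start = {"init": "zero", "dropout": 0.5, "batch_size": 64, "epochs": 50}
best = sequential_search(toy_loss, start)
print(best)
```

The one-at-a-time search needs only 20 evaluations per fold for the listed values, at the cost of missing interactions between hyper-parameters.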
Fourth, the features extracted by the CNN are optimized in two steps. First, the features are ranked using minimum-redundancy maximum-relevance (mRMR). Second, the optimal feature set is determined by sequential forward search: the features are added to the SVR training one by one, from the highest to the lowest mRMR importance, and the feature subset that maximizes the AUROC is selected as the optimal feature subset.
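The two-step selection can be sketched as below. This is a simplified stand-in, not the method of the text: absolute Pearson correlation replaces mutual information in the mRMR ranking, and the forward search is scored by the negative hold-out error of a least-squares fit instead of the AUROC of an SVR. The function names and the random toy data are hypothetical.

```python
import numpy as np

def mrmr_rank(X, y):
    """Greedy mRMR-style ranking with |Pearson r| standing in for mutual information."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-feature redundancy
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    remaining = sorted(set(range(n_features)) - set(selected))
    while remaining:
        # Relevance to the target minus mean redundancy with chosen features.
        scores = [relevance[j] - corr[j, selected].mean() for j in remaining]
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected

def forward_search(ranked, score_fn):
    """Grow the subset in ranked order; keep the prefix with the best score."""
    best_score, best_subset = -np.inf, []
    for k in range(1, len(ranked) + 1):
        score = score_fn(ranked[:k])
        if score > best_score:
            best_score, best_subset = score, ranked[:k]
    return best_subset, best_score

# Toy data: only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=200)

def holdout_score(subset):
    """Negative hold-out MSE of a least-squares fit (stand-in for SVR AUROC)."""
    train, test = slice(0, 100), slice(100, 200)
    A = np.c_[X[train][:, subset], np.ones(100)]
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    B = np.c_[X[test][:, subset], np.ones(100)]
    return -float(np.mean((B @ coef - y[test]) ** 2))

ranked = mrmr_rank(X, y)
subset, score = forward_search(ranked, holdout_score)
print(ranked, subset)
```

Because the search only considers prefixes of the ranking, the number of model fits equals the number of features rather than growing combinatorially.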
Fifth, the selected optimal feature subset is input into an SVR with a Gaussian radial basis function (RBF) kernel for training and evaluation, and a grid search is used to find the optimal penalty parameter C, kernel parameter γ and ε parameter, giving the optimal SVR model. The parameter grid search range is as follows:
The second part predicts the guide RNA editing efficiency of the test set. After the CNN-SVR has been trained, it is used to predict the editing efficiency of the guide RNAs to be analyzed, in the following steps:
First, the independent test set to be analyzed is preprocessed. Each guide RNA sequence of length 23 bp is converted into a 4 × 23 binary matrix using one-hot encoding, and the editing efficiency values of the guide RNAs are then normalized to obtain normalized editing efficiency values.
Second, the data to be analyzed are predicted using transfer learning: the parameters of the CNN model pre-trained on the reference data set are transferred to the independent test set. Specifically, the parameters of the convolutional layers, the pooling layer and the first three fully connected layers are frozen, so that the parameters learned from the reference data set are carried over, and only the weights of the last two fully connected layers are fine-tuned. The model parameters are tuned according to the Spearman correlation coefficient and the AUROC value until an optimal solution is obtained, after which the trained model is saved.
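The two evaluation metrics named above can be computed as in the following sketch. The function names and the four-sample toy values are hypothetical; the AUROC is obtained from the rank-sum (Mann-Whitney U) statistic and assumes that binary labels are available for the guide RNAs.

```python
import numpy as np

def rank_average(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman(y_true, y_pred):
    """Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rank_average(y_true), rank_average(y_pred))[0, 1])

def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    labels = np.asarray(labels, dtype=bool)
    ranks = rank_average(scores)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2  # Mann-Whitney U
    return float(u / (n_pos * n_neg))

# Toy values: measured efficiencies, binary labels and model predictions.
measured = [0.1, 0.4, 0.35, 0.8]
labels = [0, 1, 0, 1]
predicted = [0.2, 0.3, 0.5, 0.9]

rho = spearman(measured, predicted)
auc = auroc(labels, predicted)
print(rho, auc)
```

Spearman correlation scores the ordering of the continuous efficiency values, while the AUROC scores how well the predictions separate the two label classes, so the two metrics can disagree on the same predictions.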
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (4)
1. A CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR, characterized by comprising the following steps:
S1: constructing a reference data set containing guide RNA sequences and their binary labels, integrating data sets from different platforms, performing standardized preprocessing, and constructing a test set, wherein the test set comprises guide RNA sequences and their corresponding editing efficiency values;
S2: one-hot encoding the guide RNA sequences of the reference data set and the test set to obtain binary matrices;
S3: constructing a convolutional neural network (CNN) model, and training the base network parameters of the CNN model on the reference data set; optimizing the hyper-parameters of the CNN using grid search, wherein the hyper-parameters comprise the dropout rate, the batch size and the number of epochs, and determining the optimal CNN model structure;
S4: ranking the abstract features of the guide sequence extracted by the CNN by importance using minimum-redundancy maximum-relevance (mRMR), and selecting the 13 most important features for subsequent training;
S5: inputting the abstract features into an SVR with a Gaussian radial basis function (RBF) kernel for training, and searching for the optimal penalty parameter C, kernel parameter γ and ε parameter using grid search to obtain the optimal SVR model;
S6: training a CNN-SVR base model from scratch on the reference data set, and transferring the parameters of all layers of the resulting base model except the last two fully connected layers to a specific test set for prediction.
2. The CNN-SVR-based CRISPR/Cas9 guide RNA editing efficiency prediction method according to claim 1, wherein the standardized preprocessing in step S1 comprises:
constructing a matrix of guide RNAs and editing efficiencies, whose rows and columns correspond to the experiments and the guide RNAs, respectively, each entry being the editing efficiency of one guide RNA in one experiment, normalized as defined below:
3. The CNN-SVR-based CRISPR/Cas9 guide RNA editing efficiency prediction method of claim 2, wherein in step S2 the one-hot encoding represents the input sequence as a 4 × L binary matrix, where 4 is the number of nucleotide types, namely adenine A, guanine G, cytosine C and thymine T, and L is the sequence length; each position of the sequence is represented by a binary vector of length 4, where A, C, G and T are represented by [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110639647.9A CN113257359A (en) | 2021-06-08 | 2021-06-08 | CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110639647.9A CN113257359A (en) | 2021-06-08 | 2021-06-08 | CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113257359A (en) | 2021-08-13 |
Family
ID=77187119
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110639647.9A Pending CN113257359A (en) | 2021-06-08 | 2021-06-08 | CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113257359A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115099275A (en) * | 2022-06-29 | 2022-09-23 | 西南医科大学 | Training method of arrhythmia diagnosis model based on artificial neural network |
WO2023207686A1 (en) * | 2022-04-29 | 2023-11-02 | 京东方科技集团股份有限公司 | Gene editing result prediction method and apparatus, electronic device, program and medium |
WO2024164131A1 (en) * | 2023-02-07 | 2024-08-15 | 中国科学院脑科学与智能技术卓越创新中心 | Method and system for predicting base editing efficiency |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110010194A (en) * | 2019-04-10 | 2019-07-12 | 浙江科技学院 | A kind of prediction technique of RNA secondary structure |
Non-Patent Citations (3)
Title |
---|
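GUISHAN ZHANG et al.: "A Novel Hybrid CNN-SVR for CRISPR/Cas9 Guide RNA Activity Prediction", FRONTIERS IN GENETICS * |
ZHANG Xiangrong et al. (eds.): "Pattern Recognition" (《模式识别》), 31 May 2019, Xi'an: Xidian University Press * |
ZHANG Guishan et al.: "Application of machine learning methods in the CRISPR/Cas9 system" ("机器学习方法在CRISPR/Cas9 系统中的应用"), Hereditas (《遗传》) * |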
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111341386B (en) | Attention-introducing multi-scale CNN-BilSTM non-coding RNA interaction relation prediction method | |
CN113257359A (en) | CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR | |
US20220301658A1 (en) | Machine learning driven gene discovery and gene editing in plants | |
CN114927162A (en) | Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution | |
CN111400494B (en) | Emotion analysis method based on GCN-Attention | |
US11861491B2 (en) | Deep learning-based pathogenicity classifier for promoter single nucleotide variants (pSNVs) | |
CN108427865B (en) | Method for predicting correlation between LncRNA and environmental factors | |
Huang et al. | Harnessing deep learning for population genetic inference | |
CN103020979A (en) | Image segmentation method based on sparse genetic clustering | |
CN115908909A (en) | Evolutionary neural architecture searching method and system based on Bayes convolutional neural network | |
El-Tohamy et al. | A deep learning approach for viral DNA sequence classification using genetic algorithm | |
EP4032093B1 (en) | Artificial intelligence-based epigenetics | |
CN117216656A (en) | 4mC site recognition algorithm based on pruning pre-training model and artificial feature code fusion | |
CN116343908B (en) | Method, medium and device for predicting protein coding region by fusing DNA shape characteristics | |
Sanchez | Reconstructing our past˸ deep learning for population genetics | |
Lahmer et al. | Classification of DNA microarrays using deep learning to identify cell cycle regulated genes | |
CN115691817A (en) | LncRNA-disease association prediction method based on fusion neural network | |
CN117012280A (en) | Method for constructing DNA sequence pre-training language model and application thereof | |
Kristensen et al. | Classification of DNA Sequences by a MLP and SVM Network | |
Chowdhury et al. | Cell type identification from single-cell transcriptomic data via gene embedding | |
Dong et al. | Cell type identification from single-cell transcriptomic data via semi-supervised learning | |
CN116994645B (en) | Prediction method of piRNA and mRNA target pair based on interactive reasoning network | |
Guo et al. | Deep Effective k-mer representation learning for polyadenylation signal prediction via co-occurrence embedding | |
de Abreu | Development of DNA sequence classifiers based on deep learning | |
CN118629494A (en) | Genome prediction method based on Transformer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210813 |