CN112951322A

CN112951322A - Regular weight distribution siRNA design method based on grid search

Info

Publication number: CN112951322A
Application number: CN202110251670.0A
Authority: CN
Inventors: 万季; 刘鹏; 沈一鸣; 徐韵婉; 潘有东; 王弈; 宋麒
Original assignee: Shenzhen Neocura Biotechnology Corp
Current assignee: Shenzhen Neocura Biotechnology Corp
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2021-06-11
Anticipated expiration: 2041-03-08
Also published as: CN112951322B

Abstract

The invention discloses an siRNA design method based on grid-search rule weight assignment, in the technical field of bioinformatics, comprising the following steps: S1, training on siRNA data to obtain siRNA design rule weights; S2, screening siRNAs with high silencing efficiency from a candidate siRNA set based on the obtained siRNA design rule weights. The invention assigns each rule its own weight. This multi-rule, weighted design avoids treating all rules as equally important, highlights the important rules, distinguishes effective from ineffective siRNAs, and allows the efficiency of candidate siRNA sequences to be predicted quantitatively.

Description

siRNA design method based on grid-search rule weight assignment

Technical Field

The invention relates to the technical field of bioinformatics, in particular to an siRNA design method based on grid-search rule weight assignment.

Background

RNA interference (RNAi) is a phenomenon of post-transcriptional gene silencing induced by short interfering RNA (siRNA). It is specific, selective, highly efficient, and fast-acting, and is widely used to study gene function and the mechanisms of gene expression regulation. It has accelerated functional genomics research and promoted related fields such as gene therapy.

An siRNA is a double-stranded RNA of 20-27 base pairs. Its efficiency is influenced by many factors, such as the mRNA sequence of the target gene and the sequence of the siRNA itself. A key factor is siRNA sequence design: numerous studies have shown that different siRNAs designed against the same target mRNA can differ greatly in effect. Screening for siRNAs that silence a target gene efficiently is a difficult problem, and verifying candidates one by one in biological experiments is costly, slow, and inefficient. Designing siRNAs with the aid of bioinformatics and computers, and thereby narrowing the range of candidates that must be screened for high silencing efficiency, can therefore effectively reduce the cost of researching and applying RNA interference technology.

Typically, high-efficiency and low-efficiency siRNA sequences are compared, sequence rules characteristic of siRNAs with high silencing efficiency are summarized, and candidate siRNA sequences are scored according to how well they satisfy these rules; a higher score generally indicates higher RNA interference efficiency. However, existing rule-based siRNA design has two drawbacks. First, the empirically summarized rules are strongly biased and are usually applicable only to certain specific data. Second, existing rule-based methods give every rule the same weight and do not treat rules differently by assigning them different weights.

Disclosure of Invention

To address these problems in siRNA design, the invention fully considers the different effects that different rules have on siRNA efficiency and provides a bioinformatics method for siRNA design. According to the scheme of the invention, a computer-implemented bioinformatics method for designing siRNA based on grid-search rule weight assignment is provided. The specific scheme is as follows:

An siRNA design method based on grid-search rule weight assignment comprises the following steps:

S1, training on siRNA data to obtain siRNA design rule weights;

S2, screening siRNAs with high silencing efficiency from the candidate siRNA set based on the obtained siRNA design rule weights.

Further, step S1 includes the following sub-steps:

S101, obtaining and processing siRNA data;

S102, obtaining siRNA design rules;

S103, setting siRNA design rule weight values for the siRNA design rules;

S104, calculating an siRNA design rule scoring matrix from the siRNA design rule weight values;

S105, determining the optimal siRNA design rule weights from the siRNA data obtained in S101 and the siRNA design rule scoring matrix obtained in S104.

Preferably, the siRNA data in S101 include siRNA sequences and the RNA interference efficiency values of the siRNAs.

Preferably, obtaining and processing the siRNA data in S101 includes normalizing the RNA interference efficiency values of all siRNAs, where an efficiency of 0 means that no RNA interference occurs and an efficiency of 100 means complete RNA interference, i.e. the corresponding mRNA or protein cannot be detected experimentally.

Preferably, obtaining and processing the siRNA data in S101 includes randomly dividing the siRNA data into two parts: 2/3 of the data serve as the training set and the remaining 1/3 serve as the test set.
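As an illustration only, the random 2/3 : 1/3 split could be sketched in Python as follows. The function name `split_train_test`, the fixed seed, and the example records are assumptions made for this sketch; they are not taken from the in-house program.

```python
import random

def split_train_test(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and split them into (train, test) lists."""
    shuffled = list(records)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for a reproducible split
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical (sequence id, efficiency) records, for illustration only.
records = [("SEQ%d" % i, float(i)) for i in range(9)]
train, test = split_train_test(records)
```

Every record ends up in exactly one of the two lists, so the test set stays unseen during weight selection.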

An siRNA consists of a double-stranded RNA sequence. The strand complementary to the mRNA is the guide strand, also called the antisense strand; the other strand is the passenger strand, also called the sense strand. On this basis,

preferably, the siRNA design rules in S102 include one or more of the following rules:

Rule 1: because the 5' UTR of a gene is rich in regulatory protein binding sites, which can interfere with the binding of the RNA-induced silencing complex (RISC) to the target sequence, the siRNA target sequence should lie after the first 100 bp downstream of the transcription start site of the gene CDS;

Rule 2: a GC content that is too low reduces the binding efficiency of the siRNA to the mRNA, while a GC content that is too high makes the duplex difficult to unwind in RISC into the single strand capable of recognition. The GC content of the siRNA target sequence should therefore be between 35% and 55%;

Rule 3: the siRNA sequence contains no hairpin-forming sequence. A hairpin is a structure formed when a single-stranded nucleotide molecule folds back on itself and complementary bases pair through hydrogen bonds;

Rule 4: the siRNA target sequence avoids single-base repeats, C/G base repeats, and T/G base repeats;

Rule 5: positions 2 to 8 from the 5' end of the siRNA antisense strand form the seed region, on which binding to the target gene mainly depends; the annealing temperature of the siRNA seed region should be below 25 ℃;

Rule 6: the 5' end of the siRNA antisense strand is rich in A/U bases; specifically, at least 3 of the first 5 bases and at least 4 of the first 7 bases at the 5' end of the antisense strand are A/U;

Rule 7: the siRNA antisense strand has an "energy valley"; specifically, counting from the 5' end of the antisense strand, positions 9 to 14 contain fewer G/C bases than the first 8 positions, and the first 8 positions contain fewer G/C bases than positions 15 to the last;

Rule 8: a U base at position 10 is highly correlated with silencing efficiency, so position 10 of the siRNA sense strand should be U;

Rule 9: position 1 of the siRNA sense strand is G/C, position 10 is A/U, positions 13 to 19 contain more than 3 A/U bases, and position 19 is A/U.
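Several of these rules are simple composition or position checks and can be expressed as predicates. The following Python sketch covers Rules 2, 6, 7, 8 and 9 only. It assumes that strands are uppercase RNA strings written 5' to 3', that the antisense strand is the plain reverse complement of the sense strand (overhangs ignored), and that the GC content is measured on the sense sequence. All function names are hypothetical. Rules 1, 3, 4 and 5 need information the text does not fix (CDS coordinates, a hairpin definition, repeat lengths, a temperature model) and are left out.

```python
COMPLEMENT = str.maketrans("AUGC", "UACG")

def antisense_of(sense):
    """Reverse complement of an RNA sense strand (assumption: no overhangs)."""
    return sense.translate(COMPLEMENT)[::-1]

def count(seq, bases):
    """Number of characters of seq that belong to the given base set."""
    return sum(1 for b in seq if b in bases)

def rule2_gc_content(sense):
    """Rule 2: GC content between 35% and 55%."""
    return 0.35 <= count(sense, "GC") / len(sense) <= 0.55

def rule6_au_rich_5prime(antisense):
    """Rule 6: >= 3 A/U in the first 5 bases and >= 4 A/U in the first 7."""
    return count(antisense[:5], "AU") >= 3 and count(antisense[:7], "AU") >= 4

def rule7_energy_valley(antisense):
    """Rule 7: G/C count of positions 9-14 < positions 1-8 < positions 15-end."""
    head = count(antisense[:8], "GC")
    middle = count(antisense[8:14], "GC")
    tail = count(antisense[14:], "GC")
    return middle < head < tail

def rule8_u_at_10(sense):
    """Rule 8: position 10 of the sense strand is U."""
    return sense[9] == "U"

def rule9_sense_positions(sense):
    """Rule 9: pos 1 is G/C, pos 10 is A/U, pos 13-19 hold > 3 A/U, pos 19 is A/U."""
    return (sense[0] in "GC" and sense[9] in "AU"
            and count(sense[12:19], "AU") > 3 and sense[18] in "AU")

def rule_vector(sense):
    """0/1 indicators for the five rules above; sense must be >= 19 nt long."""
    antisense = antisense_of(sense)
    checks = (rule2_gc_content(sense), rule6_au_rich_5prime(antisense),
              rule7_energy_valley(antisense), rule8_u_at_10(sense),
              rule9_sense_positions(sense))
    return [int(ok) for ok in checks]
```

The 0/1 vector returned by `rule_vector` is the per-rule indicator that a weighted score would consume.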

Preferably, setting the siRNA design rule weight values in S103 includes: setting a reasonable range of weight values for each design rule, then enumerating every combination of the rules' weights, from which the final weight value of each rule is obtained.

Preferably, each rule is given an initial weight range of 0, 0.5, 1.5, 2, 2.5, 3, and all combinations are then generated with the Python standard-library module itertools.
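A minimal sketch of this enumeration, assuming that "all combinations" means the Cartesian product of the per-rule value lists, which `itertools.product` provides. The weight values are copied as listed in the text; the function name `weight_grid` is hypothetical.

```python
from itertools import product

WEIGHT_VALUES = (0, 0.5, 1.5, 2, 2.5, 3)   # values as listed in the text

def weight_grid(n_rules, values=WEIGHT_VALUES):
    """Lazily yield every assignment of one weight value to each rule."""
    return product(values, repeat=n_rules)

# Three rules keep the demonstration small: 6**3 = 216 combinations.
grid = list(weight_grid(3))
```

With nine rules and six values per rule the grid holds 6^9 = 10,077,696 combinations, so iterating over the generator is preferable to materializing it as a list.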

Preferably, calculating the siRNA design rule scoring matrix in S104 includes: using the siRNA design rule weight values, calculating the score of every siRNA under every combination of rule weights; traversing the weight set and the siRNA data yields the siRNA design rule scoring matrix.

Preferably, the model for calculating the siRNA design rule score is:

$$\mathrm{Score} = \sum_{i=1}^{n} w_i \, r_i$$

where Score is the siRNA design rule score; i indexes the i-th rule; n, a positive integer not less than 1, is the number of rules; w_i is the weight value of the i-th siRNA design rule; and r_i indicates whether the siRNA satisfies the i-th rule, taking the value 1 if it does and 0 otherwise.
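A minimal sketch of this scoring model, assuming the score is the weighted sum of the rule indicators r_i with weights w_i as the definitions suggest. The function name `rule_score` and the example weights are hypothetical.

```python
def rule_score(weights, satisfied):
    """Weighted sum over rules: each satisfied rule contributes its weight."""
    if len(weights) != len(satisfied):
        raise ValueError("one weight is needed per rule")
    return sum(w * (1 if r else 0) for w, r in zip(weights, satisfied))

weights = (2, 0.5, 3)               # hypothetical weights for three rules
satisfied = (True, False, True)     # rules 1 and 3 are met, rule 2 is not
score = rule_score(weights, satisfied)   # 2*1 + 0.5*0 + 3*1
```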

Preferably, determining the optimal siRNA design rule weights in S105 comprises: calculating the true positive rate (TPR) and false positive rate (FPR) of each siRNA design rule weight combination on the training set data; selecting the combinations whose TPR and FPR values are both greater than 0.9; calculating the sum of the TPR and FPR values for each of these combinations; choosing the combination with the largest sum as the optimal siRNA design rule weights; and verifying it on the test set data to avoid overfitting.
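Computing the two rates requires a decision rule that the text does not specify. The sketch below assumes that an siRNA is predicted effective when its score reaches `score_cutoff`, and that it counts as truly effective when its normalized efficiency reaches `efficiency_cutoff`. Both cutoffs, and the function name `tpr_fpr`, are hypothetical.

```python
def tpr_fpr(scores, efficiencies, score_cutoff, efficiency_cutoff):
    """Return (TPR, FPR) for the prediction 'score >= score_cutoff'."""
    tp = fp = fn = tn = 0
    for score, efficiency in zip(scores, efficiencies):
        predicted = score >= score_cutoff
        actual = efficiency >= efficiency_cutoff
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0   # share of effective siRNAs found
    fpr = fp / (fp + tn) if fp + tn else 0.0   # share of ineffective siRNAs flagged
    return tpr, fpr

# Hypothetical scores and normalized efficiencies for four siRNAs.
tpr, fpr = tpr_fpr([5, 4, 1, 0.5], [90, 80, 20, 75],
                   score_cutoff=3, efficiency_cutoff=70)
```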

Further, step S2 includes the following sub-steps:

S201, obtaining a candidate siRNA set from the nucleotide sequence of the exon region of the target gene;

S202, screening siRNAs with high silencing efficiency from the candidate siRNA set according to the siRNA design rule weights.

Preferably, obtaining the candidate siRNA set in S201 comprises: searching the exon region of the target gene for nucleotide subsequences of a set length, and deriving the corresponding siRNA duplex sequences by the base complementarity rule as the candidate siRNAs of the candidate siRNA set.

The candidate siRNA sequence length is preferably 21.
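Candidate generation can be sketched as a sliding window over the exon sequence. The sketch assumes the exon is given as the coding-strand DNA sequence written 5' to 3', that the sense strand has the same sequence as the mRNA, and that the antisense strand is its reverse complement with overhangs ignored. The function name `candidate_sirnas` is hypothetical.

```python
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")

def candidate_sirnas(exon, length=21):
    """Yield (sense, antisense) RNA strands for every window of the exon."""
    exon = exon.upper()
    for start in range(len(exon) - length + 1):
        target = exon[start:start + length]
        sense = target.replace("T", "U")                            # matches the mRNA
        antisense = target.translate(DNA_TO_RNA_COMPLEMENT)[::-1]   # reverse complement
        yield sense, antisense

# A made-up 39-nt exon fragment gives 39 - 21 + 1 = 19 candidates.
candidates = list(candidate_sirnas("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
```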

Preferably, S202 includes: scoring the candidate siRNAs in the candidate siRNA set according to the siRNA design rule weights, ranking them from high to low score, and selecting the high scorers as the siRNAs with high silencing efficiency.

Preferably, the high scorers are the siRNAs whose scores rank in the top 5%, which are selected as the siRNAs with high silencing efficiency.

The system for implementing the method comprises a training module and a screening module, wherein

the training module is used for training siRNA data to obtain siRNA design rule weight;

the screening module is used for screening siRNA with high silencing efficiency from the candidate siRNA set based on the obtained siRNA design rule weight.

The method is realized by the following system: the siRNA design system based on grid-search rule weight assignment comprises a data acquisition, processing and storage module; a weight setting module; a matrix acquisition module; an optimal siRNA design rule weight screening module; a candidate siRNA set acquisition module; and a high silencing efficiency siRNA screening module;

the data acquisition, processing and storage module is used for obtaining and processing the siRNA data, obtaining the siRNA design rules, and storing all data generated during siRNA design based on grid-search rule weight assignment; the weight setting module is used for setting siRNA design rule weight values for the siRNA design rules; the matrix acquisition module is used for calculating the siRNA design rule scoring matrix from the siRNA design rule weight values; the optimal siRNA design rule weight screening module is used for determining the optimal siRNA design rule weights from the siRNA data obtained in S101 and the siRNA design rule scoring matrix obtained in S104;

the high silencing efficiency siRNA screening module is used for screening out high silencing efficiency siRNA from the candidate siRNA set according to the weight of the siRNA design rule.

Advantageous effects

Grid search is an exhaustive search over specified parameter values: the possible values of each parameter are permuted and combined, and all resulting combinations are listed to form a grid. Each combination is then used for training, and its performance is evaluated on validation data. After the fitting function has tried all parameter combinations, a suitable classifier is returned, automatically tuned to the optimal parameter combination. Determining the siRNA design rule weight values by grid search therefore makes it possible to screen effectively for siRNAs with high RNA interference efficiency.

The invention has the beneficial effects that:

1. The invention assigns each rule its own weight. This multi-rule, weighted design avoids treating all rules as equally important, highlights the important rules, distinguishes effective from ineffective siRNAs, and allows the efficiency of candidate siRNA sequences to be predicted quantitatively.

2. The large amount of siRNA data collected is used for model training, and different models are trained for different experimental conditions (such as RT-PCR and luciferase assays), which avoids the base preferences that arise in empirical rules when sample sets differ or the sample size is too small.

Drawings

FIG. 1 is a schematic flow chart of the siRNA design method based on grid-search rule weight assignment according to an embodiment of the present invention;

FIG. 2 is a flowchart of the sub-steps of S1 in an embodiment of the present invention;

FIG. 3 is a flowchart of the sub-steps of S2 in an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.

One embodiment of the invention: an siRNA design method based on grid-search rule weight assignment comprises the following steps:

S1, training on siRNA data to obtain siRNA design rule weights;

The siRNA design method based on grid-search rule weight assignment mainly comprises two steps: the first is to train on large-scale siRNA data to obtain the siRNA design rule weights; the second is to design siRNAs with high silencing efficiency based on the rule weights. FIG. 2 is a flowchart of the sub-steps of S1, and FIG. 3 is a flowchart of the sub-steps of S2. The method uses large-scale siRNA data to assign the siRNA design rule weights and treats each rule differently according to its weight, so that the important rules are highlighted and the RNA interference efficiency of the screened siRNAs can be effectively ensured.

In another embodiment of the present invention, step S1 specifically includes:

Step S101: obtaining and processing siRNA data

Specifically, the large-scale siRNA data obtained must include siRNA sequences and the RNA interference efficiency values of the siRNAs. All RNA interference efficiency values are normalized, where an efficiency of 0 means that no RNA interference occurs and an efficiency of 100 means complete RNA interference, i.e. the corresponding mRNA or protein cannot be detected experimentally. The data are then randomly divided into two parts: a training set comprising 2/3 of the data and a test set comprising 1/3.
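The text does not say how raw measurements are mapped onto the 0 to 100 scale. One possible reading, shown purely as an assumption, is that the raw value is the percentage of target expression remaining relative to an untreated control, so that efficiency is its complement, clamped to the stated range. The function name `normalize_efficiency` is hypothetical.

```python
def normalize_efficiency(remaining_percent):
    """Map % target expression remaining (assumed input) to a 0-100 efficiency."""
    efficiency = 100.0 - remaining_percent
    # Clamp: noisy measurements can report more than 100% or less than 0% remaining.
    return max(0.0, min(100.0, efficiency))

normalized = [normalize_efficiency(x) for x in (100, 25, 0, 120)]
```

Datasets that report knockdown in other units would need a different conversion before the same 0 to 100 scale applies.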

Step S102: obtaining siRNA design rules

An siRNA consists of a double-stranded RNA sequence. The strand complementary to the mRNA is the guide strand, also called the antisense strand; the other strand is the passenger strand, also called the sense strand. Preferably, the following rules may be included:

Rule 9: position 1 of the siRNA sense strand is G/C, position 10 is A/U, positions 13 to 19 contain more than 3 A/U bases, and position 19 is A/U;

Step S103: setting siRNA design rule weight values for the siRNA design rules

Specifically, a reasonable range of weight values is set for each design rule, and every combination of the rules' weights is then enumerated. In particular, each rule is given an initial weight range of 0, 0.5, 1.5, 2, 2.5, 3, and all combinations are generated with the Python standard-library module itertools.

Step S104: calculating an siRNA design rule scoring matrix from the siRNA design rule weight values

The model for calculating the siRNA design rule score is:

$$\mathrm{Score} = \sum_{i=1}^{n} w_i \, r_i$$

Specifically, the score of every siRNA under each rule weight combination is calculated; traversing the set of weight combinations and the siRNA data yields the siRNA scoring matrix.
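The traversal can be sketched as a nested loop that produces one row per siRNA and one column per weight combination. The layout (rows for siRNAs, columns for combinations), the function name `score_matrix`, and the small example grid are assumptions of this sketch; the score is taken to be the weighted sum of 0/1 rule indicators.

```python
from itertools import product

def score_matrix(rule_flags, weight_sets):
    """Rows: siRNAs; columns: weight combinations; entries: weighted rule scores."""
    return [[sum(w * r for w, r in zip(weights, flags)) for weights in weight_sets]
            for flags in rule_flags]

# Hypothetical 0/1 rule outcomes for three siRNAs under two rules.
rule_flags = [(1, 0), (1, 1), (0, 0)]
# A deliberately small grid: three values per rule gives 3**2 = 9 combinations.
weight_sets = list(product((0, 0.5, 3), repeat=2))
matrix = score_matrix(rule_flags, weight_sets)
```

Because the rule indicators do not depend on the weights, they can be computed once per siRNA and reused for every column.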

Step S105: determining the optimal siRNA design rule weights from the siRNA data obtained in S101 and the siRNA design rule scoring matrix obtained in S104

Specifically, the TPR and FPR values of each rule weight combination are calculated on the training set data; the combinations whose TPR and FPR are both greater than 0.9 are selected; among these, the combination with the largest sum of TPR and FPR is chosen as the optimal scoring weights; and the result is verified on the test set data.
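The selection step itself can be sketched generically over a pair of rates per weight combination. The sketch follows the criterion exactly as worded in the text (both rates above 0.9, then the largest sum) and takes the rates as precomputed inputs. The function name `select_best` and the example rates are hypothetical and carry no experimental meaning.

```python
def select_best(rates_by_combo, floor=0.9):
    """Return the combination whose two rates both exceed floor with the largest sum."""
    eligible = {combo: rates for combo, rates in rates_by_combo.items()
                if rates[0] > floor and rates[1] > floor}
    if not eligible:
        return None          # no combination passes the threshold
    return max(eligible, key=lambda combo: sum(eligible[combo]))

# Made-up rate pairs keyed by weight combination.
rates_by_combo = {
    (1.5, 2): (0.95, 0.92),
    (2, 0.5): (0.99, 0.80),   # second rate fails the 0.9 floor
    (3, 3): (0.93, 0.97),
}
best = select_best(rates_by_combo)
```

The chosen combination would then be re-evaluated on the held-out test set, as the text requires.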

The beneficial effects of this embodiment are: screening the siRNA design rules helps to obtain the optimal siRNA sequences; verifying the result on the test set data avoids overfitting; each rule is treated differently according to its weight, and this multi-rule, weighted design avoids treating all rules as equally important, highlights the important rules, distinguishes effective from ineffective siRNAs, and allows the efficiency of candidate siRNA sequences to be predicted quantitatively; the large amount of collected siRNA data is used for model training, and different models are trained for different experimental conditions (such as RT-PCR and luciferase assays), which avoids the base preferences that arise in empirical rules when sample sets differ or the sample size is too small.

In another embodiment of the present invention, step S2 includes the following sub-steps:

Step S201: obtaining a candidate siRNA set from the nucleotide sequence of the target gene exon

Specifically, the exon region of the target gene is searched for nucleotide subsequences of a set length, and the corresponding siRNA duplex sequences are derived by the base complementarity rule.

Preferably, the default length of a designed candidate siRNA sequence is 21.

Step S202: screening siRNAs with high silencing efficiency from the candidate siRNA set according to the siRNA design rule weights

Specifically, the candidate siRNAs are scored with the obtained rule weight values, ranked from high to low score, and the high scorers are selected as the siRNAs with high silencing efficiency.

Preferably, the program selects the siRNAs whose scores rank in the top 5% by default.
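A minimal sketch of the top 5% selection, assuming candidates arrive as (siRNA, score) pairs. Rounding the cut up and always keeping at least one candidate are choices of this sketch, since the text does not say how fractional counts or ties are handled. The function name `top_fraction` is hypothetical.

```python
import math

def top_fraction(scored, fraction=0.05):
    """Return the highest-scoring share of (sirna, score) pairs, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction)) if ranked else 0
    return ranked[:keep]

# 100 hypothetical candidates whose score equals their index.
scored = [("siRNA_%d" % i, i) for i in range(100)]
picked = top_fraction(scored)
```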

The beneficial effect of this embodiment is that the siRNAs with the highest efficiency can be screened out.

The software used in the invention and its specific parameters are as follows:

The data set is partitioned using an in-house program, with an example command:

python3 data.split.py -i siRNA.csv --train siRNA.train.csv --test siRNA.test.csv

where -i is the curated siRNA dataset, --train is followed by the generated training dataset, and --test is followed by the generated test dataset;

All possible weight sets are generated using an in-house program, with an example command:

python3 generate_scoring_set.py -i rules.csv -o set.csv

where -i is the list of rule weight value ranges and -o is followed by the file of all generated combinations, in which the first column is the combination number and the remaining columns are the weight values for each rule.

siRNAs are scored using an in-house program, with an example command:

python3 scoring_results.py -i siRNA.csv -r rule_set.csv -o out.csv

where -i is followed by the training set siRNA data, -r by the weight set, and -o by the scoring result file;

siRNAs are designed using an in-house program, with an example command:

python3 design.py -i gene.csv -o out.csv

where -i is the target gene list and -o is the candidate siRNA file;

siRNAs with high silencing efficiency are screened using an in-house program, with an example command:

python3 pick.py -i candidate.siRNA.csv -r rule.csv -o out.csv

where -i is the candidate siRNA sequence file, -r is the screening rule weights, and -o is the result file.

While the preferred embodiments and examples of the present invention have been described in detail, the present invention is not limited to these embodiments and examples, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. An siRNA design method based on grid-search rule weight assignment, characterized in that the method comprises the following steps:

S1, training on siRNA data to obtain siRNA design rule weights;

2. The siRNA design method based on grid-search rule weight assignment according to claim 1, characterized in that step S1 includes the following sub-steps:

S101, obtaining and processing siRNA data;

S102, obtaining siRNA design rules;

S103, setting siRNA design rule weight values for the siRNA design rules;

3. The siRNA design method based on grid-search rule weight assignment according to claim 2, characterized in that: the siRNA data in S101 include siRNA sequences and the RNA interference efficiency values of the siRNAs; obtaining and processing the siRNA data in S101 includes normalizing the RNA interference efficiency values of all siRNAs, where an efficiency of 0 means that no RNA interference occurs and an efficiency of 100 means complete RNA interference, i.e. the corresponding mRNA or protein cannot be detected experimentally; and obtaining and processing the siRNA data in S101 further includes randomly dividing the siRNA data into two parts, wherein 2/3 of the data serve as the training set and the remaining 1/3 serve as the test set.

4. The siRNA design method based on grid-search rule weight assignment according to claim 2, characterized in that the siRNA design rules in S102 include one or more of the following rules:

Rule 1: the siRNA target sequence lies after the first 100 bp downstream of the transcription start site of the gene CDS;

Rule 2: the GC content of the siRNA target sequence is between 35% and 55%;

Rule 3: the siRNA sequence contains no hairpin-forming sequence;

Rule 4: the siRNA target sequence contains no single-base repeats, C/G base repeats, or T/G base repeats;

Rule 5: the annealing temperature of the siRNA seed region is below 25 ℃;

Rule 6: at least 3 of the first 5 bases and at least 4 of the first 7 bases at the 5' end of the antisense strand are A/U;

Rule 7: counting from the 5' end of the siRNA antisense strand, positions 9 to 14 contain fewer G/C bases than the first 8 positions, and the first 8 positions contain fewer G/C bases than positions 15 to the last;

Rule 8: position 10 of the siRNA sense strand is U;

Rule 9: position 1 of the siRNA sense strand is G/C, position 10 is A/U, positions 13 to 19 contain more than 3 A/U bases, and position 19 is A/U.

5. The siRNA design method based on grid-search rule weight assignment according to claim 2, characterized in that setting the siRNA design rule weight values in S103 includes: setting a reasonable range of weight values for each design rule, then enumerating every combination of the rules' weights, from which the final weight value of each rule is obtained.

6. The siRNA design method based on grid-search rule weight assignment according to claim 2, characterized in that calculating the siRNA design rule scoring matrix in S104 comprises: using the siRNA design rule weight values, calculating the score of every siRNA under every combination of rule weights, and traversing the weight set and the siRNA data to obtain the siRNA design rule scoring matrix.

7. The siRNA design method based on grid-search rule weight assignment according to claim 2, characterized in that determining the optimal siRNA design rule weights in S105 comprises: calculating the TPR and FPR values of each siRNA design rule weight combination on the training set data; selecting the combinations whose TPR and FPR values are both greater than 0.9; calculating the sum of the TPR and FPR values for each of these combinations; choosing the combination with the largest sum as the optimal siRNA design rule weights; and verifying it on the test set data.

8. The siRNA design method based on grid-search rule weight assignment according to any one of claims 1 to 7, characterized in that step S2 includes the following sub-steps:

9. The siRNA design method based on grid-search rule weight assignment according to claim 8, characterized in that obtaining the candidate siRNA set in S201 comprises: searching the exon region of the target gene for nucleotide subsequences of a set length, and deriving the corresponding siRNA duplex sequences by the base complementarity rule as the candidate siRNAs of the candidate siRNA set.

10. The siRNA design method based on grid-search rule weight assignment according to claim 8, characterized in that S202 includes: scoring the candidate siRNAs in the candidate siRNA set according to the siRNA design rule weights, ranking them from high to low score, and selecting the high scorers as the siRNAs with high silencing efficiency.