CN112951322A - Regular weight distribution siRNA design method based on grid search - Google Patents

Regular weight distribution siRNA design method based on grid search

Info

Publication number
CN112951322A
Authority
CN
China
Prior art keywords
sirna
rule
weight
design
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110251670.0A
Other languages
Chinese (zh)
Other versions
CN112951322B (en)
Inventor
万季
刘鹏
沈一鸣
徐韵婉
潘有东
王弈
宋麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Neocura Biotechnology Corp
Original Assignee
Shenzhen Neocura Biotechnology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Neocura Biotechnology Corp filed Critical Shenzhen Neocura Biotechnology Corp
Priority to CN202110251670.0A priority Critical patent/CN112951322B/en
Publication of CN112951322A publication Critical patent/CN112951322A/en
Application granted granted Critical
Publication of CN112951322B publication Critical patent/CN112951322B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention discloses a grid-search-based rule weight assignment method for siRNA design, which relates to the technical field of bioinformatics and comprises the following steps: S1, training on siRNA data to obtain siRNA design rule weights; S2, screening siRNAs with high silencing efficiency from a candidate siRNA set based on the obtained siRNA design rule weights. Because each rule carries its own weight, the multi-rule design avoids treating all rules as equally important, highlights the important rules, distinguishes effective siRNAs from ineffective ones, and allows the efficiency of candidate siRNA sequences to be predicted quantitatively.

Description

Regular weight distribution siRNA design method based on grid search
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a grid-search-based rule weight assignment method for siRNA design.
Background
RNA interference (RNAi) is a phenomenon of post-transcriptional gene silencing induced by short interfering RNA (siRNA). It is specific, selective, efficient, and fast-acting, and is widely used to explore gene function and the mechanisms that regulate gene expression; it has accelerated research in functional genomics and promoted related fields such as gene therapy.
An siRNA is a double-stranded RNA of 20-27 base pairs. Its efficiency is influenced by many factors, such as the mRNA sequence of the target gene and the sequence of the siRNA itself. Sequence design is a key factor: numerous studies have shown that siRNAs designed against the same target mRNA can differ greatly in effect. Screening for siRNAs that silence a target gene efficiently is therefore difficult, and verifying candidates one by one in biological experiments is costly, slow, and inefficient. Designing siRNAs with the aid of bioinformatics and computers narrows the range of candidates to be screened for high silencing efficiency, and can thus effectively reduce the cost of research on and application of RNA interference technology.
Typically, high-efficiency and low-efficiency siRNA sequences are compared, sequence rules characteristic of siRNAs with high silencing efficiency are summarized, and candidate siRNA sequences are scored according to how many of those rules they satisfy; a higher score generally indicates higher RNA interference efficiency. However, existing rule-based siRNA design has two shortcomings. First, the empirically summarized rules are strongly biased and usually suit only certain specific data. Second, existing rule-based methods give every rule the same weight and do not treat rules differently through different weights.
Disclosure of Invention
To address these problems in siRNA design, the invention takes full account of the different influence that different rules have on siRNA efficiency and develops a bioinformatics method for siRNA design. According to the scheme of the invention, a computer-implemented bioinformatics method for designing siRNA based on grid-search rule weight assignment is provided. The specific scheme is as follows:
A grid-search-based rule weight assignment method for siRNA design comprises the following steps:
S1, training on siRNA data to obtain siRNA design rule weights;
S2, screening siRNAs with high silencing efficiency from a candidate siRNA set based on the obtained siRNA design rule weights.
Further, step S1 includes the following sub-steps:
S101, obtaining and processing siRNA data;
S102, obtaining siRNA design rules;
S103, setting siRNA design rule weight values for the siRNA design rules;
S104, calculating an siRNA design rule scoring matrix from the siRNA design rule weight values;
S105, determining the optimal siRNA design rule weights from the siRNA data obtained in S101 and the siRNA design rule scoring matrix obtained in S104.
Preferably, the siRNA data in S101 includes siRNA sequences and the RNA interference efficiency values of the siRNAs.
Preferably, obtaining and processing the siRNA data in S101 includes normalizing the RNA interference efficiency values of all siRNAs, where an efficiency of 0 means no RNA interference and an efficiency of 100 means complete RNA interference, that is, the corresponding mRNA or protein cannot be detected experimentally.
Preferably, obtaining and processing the siRNA data in S101 includes randomly dividing the siRNA data into two parts: 2/3 of the total data serves as training set data and the remaining 1/3 serves as test set data.
An siRNA consists of a double-stranded RNA sequence. The strand complementary to the mRNA is the guide strand, also called the antisense strand; the other strand is the passenger strand, also called the sense strand. On this basis,
preferably, the siRNA design rules in S102 include one or more of the following rules:
Rule 1: because the 5' UTR of a gene is rich in regulatory protein binding sites, which can interfere with binding of the RNA-induced silencing complex (RISC) to the target sequence, the siRNA target sequence should lie 100 bp downstream of the transcription start site of the gene CDS;
Rule 2: a GC content that is too low reduces the binding efficiency of the siRNA to the mRNA, while one that is too high makes it hard for the duplex to unwind in RISC into the single strand that recognizes the target. The GC content of the siRNA target sequence should therefore be between 35% and 55%;
Rule 3: the siRNA sequence contains no hairpin-forming sequence; a hairpin is a structure formed when a single-stranded nucleotide molecule folds back on itself and complementary bases pair through hydrogen bonds;
Rule 4: the siRNA target sequence avoids single-base repeats, C/G repeats, and T/G repeats;
Rule 5: positions 2 to 8 from the 5' end of the siRNA antisense strand form the seed region, on which binding to the target gene mainly depends; the annealing temperature of the siRNA seed region is below 25 ℃;
Rule 6: the 5' end of the siRNA antisense strand is rich in A/U bases; specifically, at least 3 of the first 5 bases and at least 4 of the first 7 bases at the 5' end of the antisense strand are A/U;
Rule 7: the siRNA antisense strand has an "energy valley"; specifically, counting from the 5' end of the antisense strand, positions 9 to 14 contain fewer G/C bases than the first 8 positions, and the first 8 positions contain fewer G/C bases than positions 15 to the end;
Rule 8: a U at position 10 of the siRNA is strongly correlated with silencing efficiency, so position 10 of the siRNA sense strand is U;
Rule 9: position 1 of the siRNA sense strand is G/C, position 10 is A/U, positions 13 to 19 contain more than 3 A/U bases, and position 19 is A/U.
Preferably, setting the siRNA design rule weight values in S103 includes: setting a reasonable range of weight values for each design rule, then enumerating all combinations of the rule weights, from which the weight value of each rule is finally obtained.
Preferably, the initial weight values for each rule are 0, 0.5, 1.5, 2, 2.5, and 3, and all combinations are then generated with the Python itertools module.
Preferably, calculating the siRNA design rule scoring matrix in S104 includes: using the siRNA design rule weight values, calculating the score of every siRNA under every rule weight combination by traversing the weight set and the siRNA data, thereby obtaining the siRNA design rule scoring matrix.
Preferably, the model for calculating the siRNA design rule score is:
$$\mathrm{Score} = \sum_{i=1}^{n} w_i \, r_i$$
where Score is the siRNA design rule score; i indexes the rules; n, a positive integer not less than 1, is the number of rules; w_i is the weight value of the i-th siRNA design rule; and r_i indicates whether the siRNA satisfies the i-th rule, taking the value 1 if it does and 0 otherwise.
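As a small worked illustration with hypothetical values (not taken from the patent): suppose there are n = 3 rules with weights w = (2, 0.5, 3), and a candidate siRNA satisfies rules 1 and 3 but not rule 2, so r = (1, 0, 1). The score is then the sum of the weights of the satisfied rules:

```latex
\mathrm{Score} = \sum_{i=1}^{3} w_i r_i = 2 \cdot 1 + 0.5 \cdot 0 + 3 \cdot 1 = 5
```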
Preferably, determining the optimal siRNA design rule weights in S105 includes: calculating the TPR and FPR values of each siRNA design rule weight combination on the training set data; keeping the combinations whose TPR and FPR values are both greater than 0.9; calculating the sum of the TPR and FPR values for each of these combinations and selecting the combination with the largest sum as the optimal siRNA design rule weights; and verifying the result on the test set data to avoid overfitting.
Further, step S2 includes the following sub-steps:
S201, obtaining a candidate siRNA set from the nucleotide sequence of the exon region of the target gene;
S202, screening siRNAs with high silencing efficiency from the candidate siRNA set according to the siRNA design rule weights.
Preferably, obtaining the candidate siRNA set in S201 includes: searching the exon region of the target gene for nucleotide subsequences of a set length, and deriving the corresponding double-stranded siRNA sequences by the base complementarity rule as the candidate siRNAs of the candidate siRNA set.
The sequence length of a candidate siRNA is preferably 21.
Preferably, S202 includes: scoring the candidate siRNAs in the candidate siRNA set according to the siRNA design rule weights, ranking them from high to low score, and selecting the high-scoring siRNAs as the siRNAs with high silencing efficiency.
Preferably, the siRNAs whose scores rank in the top 5% are selected as the siRNAs with high silencing efficiency.
A system for implementing the method comprises a training module and a screening module:
the training module is used for training on siRNA data to obtain siRNA design rule weights;
the screening module is used for screening siRNAs with high silencing efficiency from the candidate siRNA set based on the obtained siRNA design rule weights.
The method may also be implemented by the following system: a grid-search-based rule weight assignment siRNA design system comprising a data acquisition, processing and storage module, a weight setting module, a matrix acquisition module, an optimal siRNA design rule weight screening module, a candidate siRNA set acquisition module, and a high-silencing-efficiency siRNA screening module;
the data acquisition, processing and storage module is used for obtaining and processing siRNA data, obtaining siRNA design rules, and storing all data generated during the grid-search-based rule weight assignment siRNA design; the weight setting module is used for setting siRNA design rule weight values for the siRNA design rules; the matrix acquisition module is used for calculating the siRNA design rule scoring matrix from the siRNA design rule weight values; and the optimal siRNA design rule weight screening module is used for determining the optimal siRNA design rule weights from the siRNA data obtained in S101 and the siRNA design rule scoring matrix obtained in S104;
the high-silencing-efficiency siRNA screening module is used for screening siRNAs with high silencing efficiency from the candidate siRNA set according to the siRNA design rule weights.
Advantageous effects
Grid search is an exhaustive search over specified parameter values: the possible values of each parameter are permuted and combined, and all resulting combinations are listed to form a grid. Each combination is then used for training, and its performance is evaluated on validation data. After the fitting function has tried every parameter combination, it returns a suitable classifier, automatically tuned to the best parameter combination. Determining the siRNA design rule weight values by grid search therefore allows siRNAs with high RNA interference efficiency to be screened effectively.
The invention has the beneficial effects that:
1. The invention treats each rule differently through its own weight. This multi-rule design with differing weights avoids treating all rules as equally important, highlights the important rules, distinguishes effective siRNAs from ineffective ones, and allows the efficiency of candidate siRNA sequences to be predicted quantitatively.
2. The method trains its models on a large amount of collected siRNA data, and trains separate models for different experimental conditions (such as RT-PCR and luciferase assays). This avoids the base preferences that creep into empirical rules when sample sets differ or the sample size is too small.
Drawings
FIG. 1 is a schematic flow chart of the grid-search-based rule weight assignment siRNA design method according to an embodiment of the present invention;
FIG. 2 is a flow chart of the sub-steps of S1 according to an embodiment of the present invention;
FIG. 3 is a flow chart of the sub-steps of S2 according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
One embodiment of the invention: a grid-search-based rule weight assignment method for siRNA design comprises the following steps:
S1, training on siRNA data to obtain siRNA design rule weights;
S2, screening siRNAs with high silencing efficiency from a candidate siRNA set based on the obtained siRNA design rule weights.
The grid-search-based rule weight assignment siRNA design method consists of two main steps. The first step trains on large-scale siRNA data to obtain the weights of the siRNA design rules; the second step designs siRNAs with high silencing efficiency based on the rule weights. FIG. 2 is a flow chart of the sub-steps of S1, and FIG. 3 is a flow chart of the sub-steps of S2. The method uses large-scale siRNA data to assign the siRNA design rule weights and treats each rule differently through its own weight, so that the important rules are highlighted and the RNA interference efficiency of the screened siRNAs can be effectively ensured.
In another embodiment of the present invention, step S1 specifically includes:
step S101: obtaining and processing siRNA data
Specifically, the large-scale siRNA data obtained must include siRNA sequences and the RNA interference efficiency values of the siRNAs. All RNA interference efficiency values are normalized, where an efficiency of 0 means no RNA interference and an efficiency of 100 means complete RNA interference, that is, the corresponding mRNA or protein cannot be detected experimentally. The data is then randomly divided into two parts: the training set takes 2/3 of the data and the test set takes 1/3.
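A minimal sketch of this step is given below. The patent does not state how the efficiency values are normalized, so the min-max rescaling onto 0-100 is an assumption, as are the function names and the fixed random seed; only the random 2/3 to 1/3 split comes from the text.

```python
import random

def normalize_efficiency(values):
    """Rescale raw efficiency values linearly onto 0-100.

    Min-max scaling is an assumption; the text only says the values
    are normalized so that 0 means no interference and 100 complete.
    """
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot rescale a constant list of values")
    return [100.0 * (v - low) / (high - low) for v in values]

def split_dataset(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Usage with made-up records of the form (sequence, raw efficiency).
raw = [("seq%d" % i, eff) for i, eff in enumerate([10, 20, 30, 15, 25, 12])]
scaled = normalize_efficiency([eff for _, eff in raw])
train, test = split_dataset(list(zip([s for s, _ in raw], scaled)))
```

Fixing the seed keeps the split reproducible, which matters when weight combinations are later compared on the same training set.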
Step S102: obtaining siRNA design rules
An siRNA consists of a double-stranded RNA sequence. The strand complementary to the mRNA is the guide strand, also called the antisense strand; the other strand is the passenger strand, also called the sense strand. Preferably, the following rules may be included:
Rule 1: because the 5' UTR of a gene is rich in regulatory protein binding sites, which can interfere with binding of the RNA-induced silencing complex (RISC) to the target sequence, the siRNA target sequence should lie 100 bp downstream of the transcription start site of the gene CDS;
Rule 2: a GC content that is too low reduces the binding efficiency of the siRNA to the mRNA, while one that is too high makes it hard for the duplex to unwind in RISC into the single strand that recognizes the target. The GC content of the siRNA target sequence should therefore be between 35% and 55%;
Rule 3: the siRNA sequence contains no hairpin-forming sequence; a hairpin is a structure formed when a single-stranded nucleotide molecule folds back on itself and complementary bases pair through hydrogen bonds;
Rule 4: the siRNA target sequence avoids single-base repeats, C/G repeats, and T/G repeats;
Rule 5: positions 2 to 8 from the 5' end of the siRNA antisense strand form the seed region, on which binding to the target gene mainly depends; the annealing temperature of the siRNA seed region is below 25 ℃;
Rule 6: the 5' end of the siRNA antisense strand is rich in A/U bases; specifically, at least 3 of the first 5 bases and at least 4 of the first 7 bases at the 5' end of the antisense strand are A/U;
Rule 7: the siRNA antisense strand has an "energy valley"; specifically, counting from the 5' end of the antisense strand, positions 9 to 14 contain fewer G/C bases than the first 8 positions, and the first 8 positions contain fewer G/C bases than positions 15 to the end;
Rule 8: a U at position 10 of the siRNA is strongly correlated with silencing efficiency, so position 10 of the siRNA sense strand is U;
Rule 9: position 1 of the siRNA sense strand is G/C, position 10 is A/U, positions 13 to 19 contain more than 3 A/U bases, and position 19 is A/U.
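Several of these rules are simple position or composition checks and can be sketched as predicates. The code below is an illustration under stated assumptions, not the patent's program: sequences are uppercase RNA strings written 5' to 3', the 35%-55% bounds are treated as inclusive, the sense strand stands in for the target sequence in Rule 2, and all function names are hypothetical. Rules 1, 3, 4 and 5 need gene coordinates, structure prediction or thermodynamic data and are left out.

```python
def count_bases(seq, bases):
    """Number of positions in seq whose base is one of `bases`."""
    return sum(1 for base in seq if base in bases)

def rule2_gc_content(target):
    """Rule 2: GC content between 35% and 55% (bounds assumed inclusive)."""
    return 0.35 <= count_bases(target, "GC") / len(target) <= 0.55

def rule6_au_rich_5prime(antisense):
    """Rule 6: >= 3 A/U in the first 5 bases and >= 4 A/U in the first 7."""
    return (count_bases(antisense[:5], "AU") >= 3
            and count_bases(antisense[:7], "AU") >= 4)

def rule7_energy_valley(antisense):
    """Rule 7: G/C count of positions 9-14 < positions 1-8 < positions 15-end."""
    head = count_bases(antisense[:8], "GC")
    middle = count_bases(antisense[8:14], "GC")
    tail = count_bases(antisense[14:], "GC")
    return middle < head < tail

def rule8_sense_u10(sense):
    """Rule 8: position 10 of the sense strand is U."""
    return sense[9] == "U"

def rule9_sense_pattern(sense):
    """Rule 9: pos 1 G/C, pos 10 A/U, > 3 A/U in pos 13-19, pos 19 A/U."""
    return (sense[0] in "GC" and sense[9] in "AU"
            and count_bases(sense[12:19], "AU") > 3
            and sense[18] in "AU")

def rule_vector(sense, antisense):
    """0/1 indicators for rules 2, 6, 7, 8 and 9, in that order."""
    checks = (rule2_gc_content(sense), rule6_au_rich_5prime(antisense),
              rule7_energy_valley(antisense), rule8_sense_u10(sense),
              rule9_sense_pattern(sense))
    return [int(ok) for ok in checks]
```

The 0/1 vector returned by `rule_vector` plays the role of the indicator r in the scoring model of step S104.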
step S103: setting siRNA design rule weight value for siRNA design rule
Specifically, a reasonable range of weight values is set for each design rule, and the weights of the rules are then combined exhaustively. In particular, each rule is given the initial weight values 0, 0.5, 1.5, 2, 2.5, and 3, and all combinations are then generated with the Python itertools module.
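One way to enumerate the weight grid with itertools is `itertools.product`, sketched below. The text names the module but not the function, so the choice of `product` and the names `WEIGHT_VALUES` and `weight_combinations` are assumptions; the six values are those listed above.

```python
import itertools

# The six candidate weight values listed in the text.
WEIGHT_VALUES = (0, 0.5, 1.5, 2, 2.5, 3)

def weight_combinations(n_rules, values=WEIGHT_VALUES):
    """Lazily yield every assignment of one weight value to each rule."""
    return itertools.product(values, repeat=n_rules)

# Usage: the grid for three rules has 6 ** 3 = 216 combinations.
grid = list(weight_combinations(3))
```

With six values and nine rules the full grid has 6^9 = 10,077,696 combinations, so iterating over the generator instead of building a list keeps memory use low.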
Step S104: calculating according to the weighted value of the siRNA design rule to obtain an siRNA design rule scoring matrix
The model for calculating the siRNA design rule score is:
$$\mathrm{Score} = \sum_{i=1}^{n} w_i \, r_i$$
where Score is the siRNA design rule score; i indexes the rules; n, a positive integer not less than 1, is the number of rules; w_i is the weight value of the i-th siRNA design rule; and r_i indicates whether the siRNA satisfies the i-th rule, taking the value 1 if it does and 0 otherwise.
Specifically, the score of every siRNA under every rule weight combination is calculated by traversing the set of weight combinations and the siRNA data, which yields the siRNA scoring matrix.
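The scoring model and the resulting matrix can be sketched as follows. The layout (one row per weight combination, one column per siRNA) and the function names are assumptions made for illustration; the score itself is the weighted sum defined above.

```python
def score(weights, satisfied):
    """Weighted sum of the 0/1 rule indicators for one siRNA."""
    if len(weights) != len(satisfied):
        raise ValueError("need exactly one weight per rule")
    return sum(w * r for w, r in zip(weights, satisfied))

def score_matrix(weight_sets, rule_vectors):
    """Rows follow the weight combinations, columns follow the siRNAs."""
    return [[score(ws, rv) for rv in rule_vectors] for ws in weight_sets]

# Usage with two hypothetical weight combinations and two siRNAs.
weight_sets = [(2, 0.5, 3), (0, 3, 1.5)]
rule_vectors = [(1, 0, 1), (1, 1, 0)]
matrix = score_matrix(weight_sets, rule_vectors)
```

Because the rule indicators do not depend on the weights, they only need to be computed once per siRNA and can be reused for every weight combination.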
Step S105: determining the optimal siRNA design rule weight according to the siRNA data obtained in S101 and the siRNA design rule scoring matrix obtained in S104
Specifically, the TPR and FPR values of each rule weight combination are calculated on the training set data; the combinations whose TPR and FPR values are both greater than 0.9 are kept; among these, the combination with the largest sum of TPR and FPR is selected as the optimal scoring weights; and the result is verified on the test set data.
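The selection step can be sketched as below. The patent does not say how an siRNA is labelled effective or how a score is turned into a prediction, so the efficiency cutoff of 70 and the score threshold are hypothetical parameters. The filter in `select_best` (both values above 0.9, largest sum wins) follows the text as written and is applied to whatever pair of metric values is supplied per weight combination.

```python
def tpr_fpr(scores, efficiencies, score_threshold, efficiency_cutoff=70.0):
    """TPR and FPR of 'score >= threshold' as a predictor of effectiveness.

    Both thresholds are hypothetical; the source does not specify them.
    """
    tp = fp = tn = fn = 0
    for s, e in zip(scores, efficiencies):
        predicted = s >= score_threshold
        actual = e >= efficiency_cutoff
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def select_best(metrics, floor=0.9):
    """Pick the weight combination whose two metric values both exceed
    `floor` and whose sum is largest; return None if none qualifies.

    `metrics` maps a weight combination to its pair of metric values.
    """
    kept = {combo: pair for combo, pair in metrics.items()
            if pair[0] > floor and pair[1] > floor}
    if not kept:
        return None
    return max(kept, key=lambda combo: sum(kept[combo]))
```

The chosen combination would then be re-evaluated with the same functions on the held-out test set.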
The beneficial effects of this embodiment are: screening the siRNA design rules helps obtain the optimal siRNA sequences; verifying the result on the test set data avoids overfitting; each rule is treated differently through its own weight, and this multi-rule design with differing weights avoids treating all rules as equally important, highlights the important rules, distinguishes effective siRNAs from ineffective ones, and allows the efficiency of candidate siRNA sequences to be predicted quantitatively; and a large amount of collected siRNA data is used for model training, with separate models trained for different experimental conditions (such as RT-PCR and luciferase assays), which avoids the base preferences that creep into empirical rules when sample sets differ or the sample size is too small.
In another embodiment of the present invention, step S2 includes the following sub-steps:
step S201: obtaining a candidate siRNA set according to the nucleotide sequence of the target gene exon
Specifically, the exon region of the target gene is searched for nucleotide subsequences of a set length, and the corresponding double-stranded siRNA sequences are derived by the base complementarity rule.
Preferably, candidate siRNA sequences are designed with a default length of 21.
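A minimal sketch of candidate generation is shown below. It assumes the exon is supplied as a DNA or RNA string, slides a window one base at a time, and forms the antisense strand as the full-length reverse complement of each window; overhangs and any other duplex details are not described in the text and are ignored here. The function name is hypothetical.

```python
# Watson-Crick complement table for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def candidate_sirnas(exon, length=21):
    """Yield (start, sense, antisense) for every window of `length` bases.

    Sense and antisense strands are both written 5' to 3'.
    """
    rna = exon.upper().replace("T", "U")
    for start in range(len(rna) - length + 1):
        sense = rna[start:start + length]
        antisense = sense.translate(COMPLEMENT)[::-1]
        yield start, sense, antisense

# Usage on a toy sequence with a short window for readability.
pairs = list(candidate_sirnas("ATGC", length=3))
```

An exon of L bases yields L - 20 candidates at the default length of 21, so even a modest gene produces a set large enough to make rule-based ranking worthwhile.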
Step S202: screening siRNA with high silencing efficiency from candidate siRNA set according to siRNA design rule weight
Specifically, the candidate siRNAs are scored with the obtained rule weight values and ranked from high to low score, and the siRNAs with high silencing efficiency are selected from the top of the ranking.
Preferably, the program selects the siRNAs scoring in the top 5% by default.
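The ranking and cut-off can be sketched as follows. Rounding the 5% share up and always keeping at least one candidate are assumptions, since the text does not say how fractional counts are handled; the function name is hypothetical.

```python
def pick_top(scored, top_percent=5):
    """Return the highest-scoring `top_percent` of (name, score) pairs."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    # Integer ceiling of len * percent / 100, with a minimum of one.
    keep = max(1, (len(ranked) * top_percent + 99) // 100)
    return ranked[:keep]

# Usage with 40 made-up candidates whose score equals their index.
scored = [("s%d" % i, i) for i in range(40)]
best = pick_top(scored)
```

Integer arithmetic is used for the cut-off so that the number of selected siRNAs does not depend on floating-point rounding.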
The beneficial effect of this embodiment is that the siRNAs with the highest efficiency can be screened out.
The programs used in the invention and their parameters are as follows:
The data set is split using an in-house program; an example command is:
1. python3 data.split.py -i siRNA.csv --train siRNA.train.csv --test siRNA.test.csv
where -i is followed by the prepared siRNA data set, --train by the generated training data set, and --test by the generated test data set;
All possible weight sets are generated using an in-house program; an example command is:
2. python3 generate_scoring_set.py -i rules.csv -o set.csv
where -i is followed by the list of weight value ranges for the rules and -o by the file of all generated combinations, in which the first column is the combination number and the remaining columns are the weight values of the individual rules;
siRNAs are scored using an in-house program; an example command is:
3. python3 scoring_results.py -i siRNA.csv -r rule_set.csv -o out.csv
where -i is followed by the training set siRNA data, -r by the weight set, and -o by the scoring result file;
siRNAs are designed using an in-house program; an example command is:
4. python3 design.py -i gene.csv -o out.csv
where -i is followed by the target gene list and -o by the candidate siRNA file;
High-silencing-efficiency siRNAs are screened using an in-house program; an example command is:
5. python3 pick.py -i candidate.siRNA.csv -r rule.csv -o out.csv
where -i is followed by the candidate siRNA sequence file, -r by the screening rule weights, and -o by the result file.
While the preferred embodiments and examples of the present invention have been described in detail, the present invention is not limited to these embodiments and examples, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (10)

1. A grid-search-based rule weight assignment method for siRNA design, characterized in that the method comprises the following steps:
S1, training on siRNA data to obtain siRNA design rule weights;
S2, screening siRNAs with high silencing efficiency from a candidate siRNA set based on the obtained siRNA design rule weights.
2. The grid-search-based rule weight assignment method for siRNA design according to claim 1, characterized in that step S1 comprises the following sub-steps:
S101, obtaining and processing siRNA data;
S102, obtaining siRNA design rules;
S103, setting siRNA design rule weight values for the siRNA design rules;
S104, calculating an siRNA design rule scoring matrix from the siRNA design rule weight values;
S105, determining the optimal siRNA design rule weights from the siRNA data obtained in S101 and the siRNA design rule scoring matrix obtained in S104.
3. The grid-search-based rule weight assignment method for siRNA design according to claim 2, characterized in that: the siRNA data in S101 comprises siRNA sequences and the RNA interference efficiency values of the siRNAs; obtaining and processing the siRNA data in S101 comprises normalizing the RNA interference efficiency values of all siRNAs, where an efficiency of 0 means no RNA interference and an efficiency of 100 means complete RNA interference, that is, the corresponding mRNA or protein cannot be detected experimentally; and obtaining and processing the siRNA data in S101 comprises randomly dividing the siRNA data into two parts, wherein 2/3 of the total data serves as training set data and the remaining 1/3 serves as test set data.
4. The grid-search-based rule weight assignment method for siRNA design according to claim 2, characterized in that the siRNA design rules in S102 comprise one or more of the following rules:
Rule 1: the siRNA target sequence lies 100 bp downstream of the transcription start site of the gene CDS;
Rule 2: the GC content of the siRNA target sequence is between 35% and 55%;
Rule 3: the siRNA sequence contains no hairpin-forming sequence;
Rule 4: the siRNA target sequence contains no single-base repeats, C/G repeats, or T/G repeats;
Rule 5: the annealing temperature of the siRNA seed region is below 25 ℃;
Rule 6: the first 5 bases at the 5' end of the antisense strand comprise at least 3 A/U bases, and the first 7 bases comprise at least 4 A/U bases;
Rule 7: counting from the 5' end of the siRNA antisense strand, positions 9 to 14 contain fewer G/C bases than the first 8 positions, and the first 8 positions contain fewer G/C bases than positions 15 to the end;
Rule 8: position 10 of the siRNA sense strand is U;
Rule 9: position 1 of the siRNA sense strand is G/C, position 10 is A/U, positions 13 to 19 comprise more than 3 A/U bases, and position 19 is A/U.
5. The grid-search-based rule weight assignment method for siRNA design according to claim 2, characterized in that setting the siRNA design rule weight values in S103 comprises: setting a reasonable range of weight values for each design rule, then enumerating all combinations of the rule weights, from which the weight value of each rule is finally obtained.
6. The grid-search-based rule weight assignment method for siRNA design according to claim 2, characterized in that calculating the siRNA design rule scoring matrix in S104 comprises: using the siRNA design rule weight values, calculating the score of every siRNA under every rule weight combination by traversing the weight set and the siRNA data, thereby obtaining the siRNA design rule scoring matrix.
7. The grid-search-based rule weight assignment method for siRNA design according to claim 2, characterized in that determining the optimal siRNA design rule weights in S105 comprises: calculating the TPR and FPR values of each siRNA design rule weight combination on the training set data; keeping the combinations whose TPR and FPR values are both greater than 0.9; calculating the sum of the TPR and FPR values for each of these combinations and selecting the combination with the largest sum as the optimal siRNA design rule weights; and verifying the result on the test set data.
8. The grid search-based rule weight distribution siRNA design method according to any one of claims 1 to 7, wherein step S2 comprises the following sub-steps:
S201, acquiring a candidate siRNA set according to the nucleotide sequence of the exon region of a target gene;
S202, screening siRNAs with high silencing efficiency from the candidate siRNA set according to the siRNA design rule weights.
9. The method of claim 8, wherein obtaining the candidate siRNA set in S201 comprises: searching the exon region of the target gene for nucleotide subsequences of a set length, and deriving the corresponding siRNA double-stranded sequences according to the base complementarity rule as the candidate siRNAs of the candidate siRNA set.
10. The method of claim 8, wherein S202 comprises: scoring each candidate siRNA in the candidate siRNA set according to the siRNA design rule weights, ranking the candidates from highest to lowest score, and selecting the high-scoring siRNAs as the siRNAs with high silencing efficiency.
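The two sub-steps S201 and S202 can be sketched in Python as a sliding window followed by a weighted ranking. This is an illustration under stated assumptions, not the patented implementation: the exon is given as an RNA string over A/C/G/U, the sense strand is taken to be the target subsequence itself, the antisense strand is its reverse complement written 5' to 3', overhangs are ignored, and the score is the sum of the weights of the satisfied rules. The function names, the default length of 19, and the toy rules and weights are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def candidate_sirnas(exon_rna, length=19):
    """Slide a fixed-length window over the exon sequence and return
    (sense, antisense) pairs, both written 5' to 3'."""
    candidates = []
    for start in range(len(exon_rna) - length + 1):
        sense = exon_rna[start:start + length]
        antisense = "".join(COMPLEMENT[b] for b in reversed(sense))
        candidates.append((sense, antisense))
    return candidates


def rank_candidates(candidates, rules, weights):
    """Score each candidate as the sum of the weights of the rules it
    satisfies, then sort from highest to lowest score."""
    scored = [
        (
            sum(w for rule, w in zip(rules, weights) if rule(sense, antisense)),
            sense,
            antisense,
        )
        for sense, antisense in candidates
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Toy rules and weights, for illustration only.
toy_rules = [
    lambda sense, antisense: sense[0] in "GC",   # sense starts with G/C
    lambda sense, antisense: sense[-1] in "AU",  # sense ends with A/U
]
toy_weights = [2, 1]

ranked = rank_candidates(candidate_sirnas("GCAUG", length=4), toy_rules, toy_weights)
for score, sense, antisense in ranked:
    print(score, sense, antisense)
```

Because `sorted` is stable, candidates with equal scores keep their order of appearance along the exon.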
CN202110251670.0A 2021-03-08 2021-03-08 Rule weight distribution siRNA design method based on grid search Active CN112951322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110251670.0A CN112951322B (en) 2021-03-08 2021-03-08 Rule weight distribution siRNA design method based on grid search

Publications (2)

Publication Number Publication Date
CN112951322A (en) 2021-06-11
CN112951322B (en) 2023-09-26

Family

ID=76228748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110251670.0A Active CN112951322B (en) 2021-03-08 2021-03-08 Rule weight distribution siRNA design method based on grid search

Country Status (1)

Country Link
CN (1) CN112951322B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006062369A1 (en) * 2004-12-08 2006-06-15 Bioneer Corporation Method of inhibiting expression of target mrna using sirna consisting of nucleotide sequence complementary to said target mrna
CN103020489A (en) * 2013-01-04 2013-04-03 吉林大学 Novel method for forecasting siRNA interference efficiency based on ARM (Advanced RISC Machines) microprocessor
CN111354420A (en) * 2020-03-08 2020-06-30 吉林大学 siRNA research and development method for COVID-19 virus drug therapy
CN111986730A (en) * 2020-07-27 2020-11-24 中国科学院计算技术研究所苏州智能计算产业技术研究院 Method for predicting siRNA silencing efficiency

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JYOTI K SHAH, HAROLD R GARNER, MICHAEL A WHITE, DAVID S SHAMES AND JOHN D MINNA: "sIR: siRNA Information Resource, a web-based tool for siRNA sequence design and analysis and an open access siRNA database", 《BMC BIOINFORMATICS》, pages 1 - 7 *

Also Published As

Publication number Publication date
CN112951322B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
Kurihara et al. Transcripts from downstream alternative transcription start sites evade uORF-mediated inhibition of gene expression in Arabidopsis
CN108920895B (en) Incidence relation prediction method of circular RNA and diseases
Kampmann et al. Next-generation libraries for robust RNA interference-based genome-wide screens
CN108595913B (en) Supervised learning method for identifying mRNA and lncRNA
Brameier et al. Ab initio identification of human microRNAs based on structure motifs
Lindow et al. Intragenomic matching reveals a huge potential for miRNA-mediated regulation in plants
CN105808976B (en) A kind of miRNA microRNA target prediction methods based on recommended models
CN114639441B (en) Transcription factor binding site prediction method based on weighted multi-granularity scanning
CN112951322B (en) Rule weight distribution siRNA design method based on grid search
EP4179538A1 (en) Method for prediction of the guide efficiency when targeting a gene of interest
Muvva et al. In silico identification of miRNAs and their targets from the expressed sequence tags of Raphanus sativus
CN109754844B (en) Method for predicting plant endogenous siRNA on whole genome level
JP2008521909A (en) Methods for designing short interfering RNAs, antisense polynucleotides, and other hybridizing polynucleotides
Ramesh et al. Guide RNA design for genome-wide CRISPR screens in Yarrowia lipolytica
CN111808935B (en) Identification method of plant endogenous siRNA transcription regulation relationship
Zhong et al. Improved Pre-miRNA classification by reducing the effect of class imbalance
Morgado et al. Learning sequence patterns of AGO-sRNA affinity from high-throughput sequencing libraries to improve in silico functional small RNA detection and classification in plants
US7941278B2 (en) MicroRNA motifs
CN116312755A (en) Target determination method and device and computer equipment
CN116189756A (en) Target selection method and device and computer equipment
CN116168764B (en) Method, device and equipment for optimizing 5' untranslated region sequence of messenger ribonucleic acid
Yan et al. miTarDigger: A Fusion Deep-learning Approach for Predicting Human miRNA Targets
Li et al. New support vector machine-based method for microRNA target prediction
CN116798513B (en) Method and system for screening siRNA sequence to reduce off-target effect
Qin Analysis of tissue specificity of alternative polyadenylation sites

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant