CN114863995A - Silencer prediction algorithm based on bidirectional gated recurrent neural network - Google Patents
- Publication number
- CN114863995A (application CN202210325550.5A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- data
- neural network
- silencer
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G16B15/00 — ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a silencer prediction algorithm based on a bidirectional gated recurrent neural network, the algorithm comprising the following steps: S1, collecting a data set; S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in step 1; S3, training and verifying the model constructed in step 2; and S4, predicting the probability of a silencer with the model trained in step 3. The invention trains repeatedly on the training set data to construct an optimal model for silencer prediction and classification, contributing to subsequent silencer prediction and development.
Description
Technical Field
The invention relates to the field of biological information calculation, in particular to a silencer prediction algorithm based on a bidirectional gated recurrent neural network.
Background
In bioinformatics, a silencer is a non-coding DNA region; in contrast to enhancers, which enhance DNA transcription, silencers inhibit gene expression. The gene sequence on DNA is the template for the synthesis of messenger RNA, which is ultimately translated into protein. In the presence of a silencer, binding of a repressor protein to the silencer sequence prevents RNA polymerase from transcribing the DNA sequence, thereby preventing translation of the RNA into protein; silencers thus act to block gene expression. For example, deletion of silencer regions related to the drug transport genes ABCC2 and ABCG2 on chromosome 10 closes the drug transport channel, resulting in chemotherapy resistance. The existing silencer machine learning prediction model, gkm-SVM, is trained on data from MPRA (massively parallel reporter assay) experiments. With the development of bioinformatics technology, studying the effect of silencers on gene expression is becoming ever more important, yet as data sample sizes grow, the generalization capability of traditional machine learning methods falls short. A new technical means is therefore needed to solve this problem.
Disclosure of Invention
In order to solve the existing problems, the invention provides a silencer prediction algorithm based on a bidirectional gated recurrent neural network, which comprises the following specific scheme:
a silencer prediction algorithm based on a bidirectional gated recurrent neural network comprises the following steps:
s1, collecting a data set;
s2, constructing a bidirectional gating recurrent neural network model based on the data set collected in the step 1;
s3, training and verifying the model constructed in the step 2;
and S4, predicting the probability of the silencer according to the model trained in the step 3.
Preferably, the step of collecting the data set in step 1 comprises:
SA1, downloading silencer sequences from a known database and collecting a data set of an existing machine learning model;
SA2, applying the inter-group scrambling method to the positive samples in the silencer sequences downloaded in step A1 and de-duplicating to obtain the corresponding negative samples.
Preferably, the construction of the negative samples in step A2 uses an inter-group scrambling method comprising the steps of:
SA21, dividing the positive sample into a plurality of segments with a dividing step size of 1 and a segment length of k; if the sequence length of the positive sample is not divisible by k, the length of the last segment is the remainder of the sequence length divided by k;
SA22, the fragments generated from each positive sample in step A21 are combined to obtain a new sequence.
Preferably, the step of constructing the bidirectional gated recurrent neural network model in step 2 includes:
SB1, preprocessing the data in the data set collected in step 1;
SB2, taking a convolutional neural network (CNN) with a feature extraction function and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features from the target data set; specifically, the CNN performs convolution operations on the data, with the convolution layers connected in parallel and the kernel sizes increasing in turn; the convolved data are input into the BiGRU to obtain its output, yielding the feature information of the sequence;
SB3, information capture using a multi-head self-attention mechanism, where the multiple heads represent multiple different representation subspaces; each head is computed as head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), where the W matrices are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all head outputs are concatenated and passed through a fully connected layer to obtain the final global information;
SB4, performing target classification on the global information obtained in step B3; specifically, the output of the previous layer is input to a fully connected layer, and a cross-entropy loss function is selected to perform the binary classification task.
Preferably, the preprocessing of the data in step B1 is for converting the nucleotide sequence data into digitized data that can be input to a feature extractor, and the preprocessing comprises:
SB11, dictionary coding — the 16 dinucleotides contained in each sequence in the data set are converted into the integers 1, 2, 3, ..., 16, respectively;
SB12, sequence completion — because the CNN input must be a fixed-length sequence, each sequence in the data set is padded with the number 0 (zero filling) to the length of the longest sequence in the data set;
SB13, word embedding-since the numerical representation does not reflect the positional relationship between each element in the sequence, word embedding converts words into a vector form, which can correctly represent the relationship between each element in the sequence.
Preferably, the method for training and verifying the model in step 3 includes: dividing the data set into a training set and a verification set by five-fold cross-validation, wherein the training set is used to construct and train the bidirectional gated recurrent neural network model and the verification set is used to tune the model's parameters, finally obtaining an optimal model.
Preferably, the prediction method in step 4 includes: when external data need to be predicted, the sequence data are input directly into the trained bidirectional gated recurrent neural network model for prediction, obtaining the probability that the sequence is an immune cell silencer.
The invention also discloses a computer readable storage medium, which stores a computer program, and after the computer program runs, the silencer prediction algorithm based on the bidirectional gated recurrent neural network is executed.
The invention also discloses a computer system, which comprises a processor and a storage medium, wherein the storage medium is stored with a computer program, and the processor reads the computer program from the storage medium and runs the computer program to execute the silencer prediction algorithm based on the bidirectional gated recurrent neural network.
The invention has the beneficial effects that:
the invention adopts multiple training on the training set data to construct an optimal model for silencer prediction and classification, and makes contribution to the prediction development of subsequent silencers.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a general framework flow diagram of the present invention;
FIG. 2 is a schematic diagram of a model of a bi-directional gated recurrent neural network;
FIG. 3 is a table illustrating data sets according to the present invention;
FIG. 4 is a table comparing deep learning and machine learning methods of the first data set of the present invention;
FIG. 5 is a table comparing deep learning and machine learning methods for data set two according to the present invention;
FIG. 6 is a table comparing deep learning and machine learning methods for data set three according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a silencer prediction algorithm based on a bidirectional gated recurrent neural network includes the following steps:
s1, collecting the data set.
Specifically, as shown in fig. 3, the data sets include data set one, data set two, and data set three. Data set one is the K562 cancer cell silencer data for the human and mouse genomes published in Nature Communications in 2020 and used to train gkm-SVM, with 2000 positive and 2000 negative samples, each 200 bp long. Data set two downloads immune T cell silencer sequences from the database SilencerDB as positive samples and applies the inter-group scrambling method with de-duplication to obtain corresponding negative samples, each 150 bp long. Data set three downloads human cell silencer sequences from SilencerDB as positive samples and likewise obtains corresponding 150 bp negative samples by inter-group scrambling and de-duplication. Each data set finally contains sequence data with a positive-to-negative sample ratio of 1:1.
In summary, the step of collecting the data set in step 1 comprises:
SA1, downloading silencer sequences from a known database and collecting a data set of machine learning models provided by existing references;
SA2, applying the inter-group scrambling method to the positive samples in the silencer sequences downloaded in step A1 and de-duplicating to obtain the corresponding negative samples.
The method for inter-group scrambling used for construction of the negative sample in the step A2 comprises the following steps:
SA21, dividing the positive sample into a plurality of segments with a dividing step size of 1 and a segment length of k; if the sequence length of the positive sample is not divisible by k, the length of the last segment is the remainder of the sequence length divided by k;
SA22, the fragments generated from each positive sample in step A21 are rearranged and combined to obtain a new sequence. Overlapping division is selected for encoding, with k = 2; for example, the sequence ATCG is segmented into (AT)(TC)(CG).
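The inter-group scrambling steps above can be sketched in Python. This is a minimal illustration under an assumption the patent does not spell out: the shuffled overlapping k-mers are rejoined by their leading bases so the negative sequence keeps the positive sample's length. `scrambled_negative` is a hypothetical helper, not the patent's implementation.

```python
import random

def kmer_segments(seq: str, k: int = 2) -> list:
    """Overlapping k-mer segments with step size 1, e.g. ATCG -> [AT, TC, CG]."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def scrambled_negative(seq: str, k: int = 2, seed: int = 0) -> str:
    """Shuffle the k-mer segments, then rejoin the first base of each
    segment plus the tail of the last one, so the scrambled negative
    sequence has the same length as the positive sample (an assumed
    recombination rule; the patent only says segments are combined)."""
    segs = kmer_segments(seq, k)
    random.Random(seed).shuffle(segs)
    return "".join(s[0] for s in segs) + segs[-1][1:]

print(kmer_segments("ATCG"))                 # ['AT', 'TC', 'CG']
neg = scrambled_negative("ATCGATCG")
print(len(neg) == len("ATCGATCG"))           # True
```

De-duplicating the resulting negatives against the positive set (step SA2) would then be a simple set-membership check.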
And S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in the step 1. Fig. 2 is a schematic diagram of a bidirectional gated recurrent neural network model.
The method for constructing the bidirectional gated recurrent neural network model comprises the following steps of:
SB1, data preprocessing-preprocessing the data in the data set collected in step 1; the purpose of the pre-processing is to convert the nucleotide sequence data into digitized data that can be input to a feature extractor, the steps of the pre-processing including:
SB11, dictionary coding — the 16 dinucleotides contained in each sequence in the data set (AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, GG) are converted into the integers 1, 2, 3, ..., 16, respectively;
SB12, sequence completion — because the input of the convolutional neural network CNN must be a fixed-length sequence, each sequence in the data set is padded with the number 0 (zero filling) to the length of the longest sequence in the data set;
SB13, word embedding-since the numerical representation does not reflect the positional relationship between each element in the sequence, word embedding converts words into a vector form, which can correctly represent the relationship between each element in the sequence.
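The dictionary coding and zero-filling steps can be sketched as follows. This is a minimal illustration assuming the 16 dinucleotides are numbered in one fixed order with 0 reserved for padding; the text does not specify the exact numbering.

```python
from itertools import product

# Dictionary code: map the 16 dinucleotides to 1..16 (0 is reserved for padding).
DINUC2ID = {a + b: i for i, (a, b) in enumerate(product("ATCG", repeat=2), start=1)}

def encode(seq: str) -> list:
    """Overlapping dinucleotide tokens -> integer ids (step size 1, k = 2)."""
    return [DINUC2ID[seq[i:i + 2]] for i in range(len(seq) - 1)]

def pad(batch: list) -> list:
    """Zero-fill every encoded sequence to the longest length in the batch,
    so the CNN receives fixed-length input."""
    max_len = max(len(s) for s in batch)
    return [s + [0] * (max_len - len(s)) for s in batch]

batch = pad([encode("ATCG"), encode("ATCGAT")])
print(batch[0])  # [2, 7, 12, 0, 0] — three real tokens, two padding zeros
```

The word-embedding step would then map each integer id to a trainable dense vector (e.g. an embedding layer with 17 rows: 16 dinucleotides plus the padding id).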
SB2, feature extraction — taking a convolutional neural network (CNN) with a feature extraction function and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features from the target data set; specifically, the CNN performs convolution operations on the data, with the convolution layers connected in parallel and the kernel sizes increasing in turn; the convolved data are input into the BiGRU to obtain its output, yielding the feature information of the sequence;
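A minimal numpy sketch of the BiGRU half of the feature extractor: one GRU runs forward, a second runs backward, and their final hidden states are concatenated. The parallel-convolution branch is omitted, and the gate equations follow the standard GRU formulation rather than any patent-specific variant; random matrices stand in for trained weights.

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: update gate z, reset gate r, candidate state h~."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ Wz + h @ Uz)
    r = sig(x @ Wr + h @ Ur)
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1.0 - z) * h + z * h_cand

def make_params(d_in, d_h, rng):
    shapes = [(d_in, d_h), (d_h, d_h)] * 3   # Wz, Uz, Wr, Ur, Wh, Uh
    return [rng.standard_normal(s) * 0.1 for s in shapes]

def bigru(xs, d_h, rng):
    """Run a GRU over the sequence forward and a second GRU backward,
    each with its own parameters, and concatenate the final states."""
    d_in = xs.shape[1]
    fwd, bwd = make_params(d_in, d_h, rng), make_params(d_in, d_h, rng)
    h_f = np.zeros(d_h)
    for x in xs:
        h_f = gru_step(x, h_f, *fwd)
    h_b = np.zeros(d_h)
    for x in xs[::-1]:
        h_b = gru_step(x, h_b, *bwd)
    return np.concatenate([h_f, h_b])        # shape (2 * d_h,)

rng = np.random.default_rng(0)
seq = rng.standard_normal((10, 8))           # 10 timesteps, 8 features each
feat = bigru(seq, d_h=16, rng=rng)
print(feat.shape)                            # (32,)
```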
SB3, sequence feature capture — information capture with a multi-head self-attention mechanism, where the multiple heads represent multiple different representation subspaces; in natural language processing, for example, the word "apple" carries the meaning of a fruit but also of a trademark, and the different meanings are learned by different representation subspaces. Each head is computed as head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), where the W matrices are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all head outputs are concatenated and passed through the matrix transformation of a fully connected layer to obtain the final global information;
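The per-head computation head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V) can be illustrated with a small numpy sketch using standard scaled dot-product attention. Random matrices stand in for the trained weights, so this shows only shapes and data flow, not the patent's trained model.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    """head_i = Attention(X W_i^Q, X W_i^K, X W_i^V); heads are
    concatenated and mixed by W_O (the fully connected layer)."""
    L, d = X.shape
    d_k = d // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))  # each row of A sums to 1
        heads.append(A @ V)
    Wo = rng.standard_normal((d, d)) * 0.1
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))              # 5 sequence positions, width 8
out = multi_head_self_attention(X, num_heads=2, rng=rng)
print(out.shape)                             # (5, 8)
```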
SB4, target classification — the global information obtained in step B3 is classified; specifically, the output of the previous layer is input to a fully connected layer, and a cross-entropy loss function is selected to perform the binary classification task.
S3, training and verifying the model constructed in step 2; the training and verifying method comprises: dividing the data set into a training set and a verification set by five-fold cross-validation, i.e. in a 4:1 ratio, wherein the training set is used to construct and train the bidirectional gated recurrent neural network model and the verification set is used to tune the model's parameters, finally obtaining an optimal model.
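The five-fold, 4:1 split described above can be sketched in plain Python; `five_fold_splits` is an illustrative helper, not the patent's implementation.

```python
import random

def five_fold_splits(n_samples: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for five-fold cross-validation:
    each fold serves exactly once as the validation set, giving the
    4:1 train/validation ratio."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val

for train, val in five_fold_splits(100):
    print(len(train), len(val))   # 80 20, printed five times
```

In practice one would also stratify the split so each fold keeps the 1:1 positive-to-negative ratio of the data sets.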
And S4, predicting the probability of a silencer with the model trained in step 3. The prediction method comprises: when external data need to be predicted, the sequence data are input directly into the trained bidirectional gated recurrent neural network model for prediction, obtaining the probability that the sequence is an immune cell silencer.
The parameter indexes of the invention are as follows:
the verification criteria we used include Recall, Precision (PRE), Accuracy (ACC) and AUC (Area Under the ROC Curve), calculated as follows: Recall = TP / (TP + FN), Precision = TP / (TP + FP), ACC = (TP + TN) / (TP + TN + FP + FN).
Here TP (true positive) is the number of true immune cell silencer sequences correctly predicted as silencer sequences; TN (true negative) is the number of true non-silencer sequences correctly predicted as non-silencers; FP (false positive) is the number of non-silencer sequences incorrectly predicted as immune cell silencer sequences; and FN (false negative) is the number of true immune cell silencer sequences incorrectly predicted as non-silencers. In addition, AUC and ACC are adopted in the experiments to measure the overall performance of the model. In general, the threshold-dependent indicators predict a positive sample when the score is greater than or equal to the threshold and a negative sample otherwise; the threshold defaults to 0.5 but can be adjusted manually. AUC is not affected by the threshold and ranges between 0 and 1, with values closer to 1 representing better overall performance, so AUC and ACC are often regarded as the more important evaluation indexes.
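The threshold-based metrics can be computed directly from the four confusion counts; `confusion_metrics` is an illustrative helper with made-up example counts. AUC is omitted because it requires the full distribution of prediction scores rather than the counts alone (in practice one would use e.g. `sklearn.metrics.roc_auc_score`).

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int):
    """Recall, Precision and Accuracy from the confusion counts
    defined in the text above."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, acc

# Hypothetical counts for a batch of 100 predicted sequences.
r, p, a = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(r, 2), round(p, 2), round(a, 2))  # 0.8 0.89 0.85
```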
Specifically, to verify the superiority of the bidirectional gated recurrent neural network model over the current machine learning model gkm-SVM, three sets of experiments were carried out on the three data sets. Data set one: 2000 positive and 2000 negative samples, sequence length 200 bp. Data set two: 7142 immune T cell silencer positive samples, length 150 bp. Data set three: 8000 human immune cell silencer positive samples, length 150 bp. As shown in fig. 4, fig. 5 and fig. 6, comparing the two models on 4 evaluation indexes (Recall, Precision, ACC and AUC) shows that the deep learning model achieves a substantial improvement over the machine learning model on all three data sets.
The invention trains repeatedly on the training set data to construct an optimal model for silencer prediction and classification, contributing to subsequent silencer prediction and development.
The invention also discloses a computer readable storage medium, which stores a computer program, and after the computer program runs, the silencer prediction algorithm based on the bidirectional gated recurrent neural network is executed.
The invention also discloses a computer system, which comprises a processor and a storage medium, wherein the storage medium is stored with a computer program, and the processor reads the computer program from the storage medium and runs the computer program to execute the silencer prediction algorithm based on the bidirectional gated recurrent neural network.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A silencer prediction algorithm based on a bidirectional gated recurrent neural network is characterized by comprising the following steps:
S1, collecting a data set;
S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in step S1;
S3, training and validating the model constructed in step S2;
S4, predicting the silencer probability with the model trained in step S3.
2. The algorithm of claim 1, wherein collecting the data set in step S1 comprises:
SA1, downloading silencer sequences from a known database and collecting the data sets of existing machine learning models;
SA2, applying the inter-group shuffling method to the positive samples, i.e., the silencer sequences downloaded in step SA1, and de-duplicating to obtain the corresponding negative samples.
3. The algorithm of claim 2, wherein constructing the negative samples in step SA2 by the inter-group shuffling method comprises:
SA21, dividing each positive sample into a number of fragments, with a dividing step size of 1 and a fragment length of k; if the sequence length of the positive sample is not divisible by k, the length of the last fragment is the remainder of dividing the sequence length by k;
SA22, permuting and concatenating the fragments generated from each positive sample in step SA21 to obtain a new sequence.
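The fragment-shuffling construction above can be sketched in a few lines of Python. This is an illustrative sketch only: the claim's "dividing step size is 1" is ambiguous, so non-overlapping length-k fragments (with a remainder tail, per step SA21) are assumed here, and the function name is hypothetical.

```python
import random

def make_negative(positive, k, seed=None):
    """Cut a positive silencer sequence into length-k fragments (the
    last fragment keeps the remainder) and concatenate the fragments
    in random order to form a candidate negative sample."""
    rng = random.Random(seed)
    fragments = [positive[i:i + k] for i in range(0, len(positive), k)]
    rng.shuffle(fragments)  # permute the fragments (step SA22)
    return "".join(fragments)
```

Because only fragment order changes, the negative sample preserves the length and nucleotide composition of the positive sample; per step SA2, duplicates of existing positives would still need to be filtered out afterwards.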
4. The algorithm of claim 1, wherein constructing the bidirectional gated recurrent neural network model in step S2 comprises:
SB1, preprocessing the data in the data set collected in step S1;
SB2, using a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features from the target data set; specifically, the CNN performs convolution operations on the data, with the convolutional layers arranged in parallel and the convolution kernel sizes increasing in turn; the convolved data are then input to the BiGRU, whose output gives the feature information of the sequence;
SB3, capturing information with a multi-head self-attention mechanism, where the multiple heads represent different representation subspaces, according to head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), where the W_i are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all the captured information is concatenated and passed through a fully connected layer to obtain the final global information;
SB4, classifying on the global information obtained in step SB3; specifically, the output of the previous layer is input to a fully connected layer, and the cross-entropy loss function is used for the binary classification task.
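The multi-head self-attention of step SB3 can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the patent's implementation: the projection shapes are taken as square d_model x d_model matrices, each head's W_i^Q, W_i^K, W_i^V is realized as a column block of the full projection, and scaled dot-product attention is assumed for the Attention(...) operator.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).
    Computes head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) per head,
    then concatenates the heads and applies the output projection."""
    seq_len, d_model = X.shape
    dh = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(n_heads):
        q = Q[:, i * dh:(i + 1) * dh]
        k = K[:, i * dh:(i + 1) * dh]
        v = V[:, i * dh:(i + 1) * dh]
        weights = softmax(q @ k.T / np.sqrt(dh))  # (seq_len, seq_len)
        heads.append(weights @ v)                 # (seq_len, dh)
    return np.concatenate(heads, axis=-1) @ Wo    # (seq_len, d_model)
```

In a real model the weight matrices would be learned jointly with the CNN and BiGRU layers rather than fixed as here.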
5. The algorithm of claim 4, wherein the preprocessing in step SB1 converts the nucleotide sequence data into numerical data that can be input to the feature extractor, the preprocessing comprising:
SB11, dictionary coding: the 16 nucleotide symbols contained in the sequences of the data set are mapped to the integers 1, 2, 3, ..., 16, respectively;
SB12, sequence padding: because the CNN input must be of fixed length, each sequence in the data set is padded with the number 0 to the length of the longest sequence in the data set;
SB13, word embedding: since the numerical representation does not reflect the positional relationship between the elements of a sequence, word embedding converts the tokens into vector form, which can correctly represent the relationship between the elements of the sequence.
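Steps SB11 and SB12 can be sketched together as below. This is an illustrative sketch: the helper name is hypothetical, a small 4-letter vocabulary stands in for the 16 dictionary symbols of step SB11, and the word-embedding layer of step SB13 (which would normally follow in the network) is omitted.

```python
def encode_and_pad(sequences, vocab):
    """Dictionary-code each sequence (symbol -> 1..len(vocab)), then
    zero-pad every sequence to the length of the longest one.
    The code 0 is reserved for padding, as in step SB12."""
    code = {symbol: i + 1 for i, symbol in enumerate(vocab)}
    encoded = [[code[s] for s in seq] for seq in sequences]
    max_len = max(len(e) for e in encoded)
    return [e + [0] * (max_len - len(e)) for e in encoded]
```

The resulting fixed-length integer matrix is what an embedding layer would consume to produce the vector representation of step SB13.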
6. The algorithm of claim 1, wherein training and validating the model in step S3 comprises: dividing the data set into a training set and a validation set by five-fold cross-validation, wherein the training set is used to build and train the bidirectional gated recurrent neural network model, and the validation set is used to tune the model parameters, finally yielding the optimal model.
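The five-fold split of claim 6 can be sketched without any ML library (an illustrative helper; the function name and shuffling seed are assumptions, and in practice a utility such as scikit-learn's KFold would serve the same purpose):

```python
import random

def five_fold_splits(n_samples, seed=0):
    """Return five (train_indices, val_indices) pairs in which each
    sample serves as validation data exactly once across the folds."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::5] for i in range(5)]  # five roughly equal folds
    return [
        ([j for k, f in enumerate(folds) if k != i for j in f], folds[i])
        for i in range(5)
    ]
```

Each of the five models is trained on four folds and evaluated on the held-out fold; the hyperparameters that perform best on average across the validation folds define the final model.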
7. The algorithm of claim 1, wherein the prediction in step S4 comprises: when external data need to be predicted, the sequence data are input directly into the trained bidirectional gated recurrent neural network model for prediction, yielding the probability that the external data are an immune-cell silencer.
8. A computer-readable storage medium, characterized in that: a computer program is stored on the medium which, when executed, performs the bidirectional gated recurrent neural network-based silencer prediction algorithm of any one of claims 1 to 7.
9. A computer system, characterized in that: it comprises a processor and a storage medium, a computer program being stored on the storage medium; the processor reads and executes the computer program from the storage medium to perform the bidirectional gated recurrent neural network-based silencer prediction algorithm of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210325550.5A CN114863995B (en) | 2022-03-30 | 2022-03-30 | Silencer prediction method based on bidirectional gating cyclic neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114863995A true CN114863995A (en) | 2022-08-05 |
CN114863995B CN114863995B (en) | 2024-05-07 |
Family
ID=82630315
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210325550.5A Active CN114863995B (en) | 2022-03-30 | 2022-03-30 | Silencer prediction method based on bidirectional gating cyclic neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114863995B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020215694A1 (en) * | 2019-04-22 | 2020-10-29 | 平安科技(深圳)有限公司 | Chinese word segmentation method and apparatus based on deep learning, and storage medium and computer device |
CN111986730A (en) * | 2020-07-27 | 2020-11-24 | 中国科学院计算技术研究所苏州智能计算产业技术研究院 | Method for predicting siRNA silencing efficiency |
CN114121145A (en) * | 2021-11-26 | 2022-03-01 | 安徽大学 | Phage promoter prediction method based on multi-source transfer learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6955580B2 (en) | Document summary automatic extraction method, equipment, computer equipment and storage media | |
US11593556B2 (en) | Methods and systems for generating domain-specific text summarizations | |
CN111563209B (en) | Method and device for identifying intention and computer readable storage medium | |
WO2016062044A1 (en) | Model parameter training method, device and system | |
Bai et al. | NHL Pathological Image Classification Based on Hierarchical Local Information and GoogLeNet‐Based Representations | |
JP2013206187A (en) | Information conversion device, information search device, information conversion method, information search method, information conversion program and information search program | |
WO2016095645A1 (en) | Stroke input method, device and system | |
CN110727765A (en) | Problem classification method and system based on multi-attention machine mechanism and storage medium | |
CN110992941A (en) | Power grid dispatching voice recognition method and device based on spectrogram | |
CN117708339B (en) | ICD automatic coding method based on pre-training language model | |
WO2021223467A1 (en) | Gene regulatory network reconstruction method and system, and device and medium | |
CN114881131A (en) | Biological sequence processing and model training method | |
CN116994745B (en) | Multi-mode model-based cancer patient prognosis prediction method and device | |
CN113886821A (en) | Malicious process identification method and device based on twin network, electronic equipment and storage medium | |
CN117527495A (en) | Modulation mode identification method and device for wireless communication signals | |
CN111091001B (en) | Method, device and equipment for generating word vector of word | |
CN114863995A (en) | Silencer prediction algorithm based on bidirectional gated recurrent neural network | |
CN116010902A (en) | Cross-modal fusion-based music emotion recognition method and system | |
CN109558735A (en) | A kind of rogue program sample clustering method and relevant apparatus based on machine learning | |
CN111552963B (en) | Malicious software classification method based on structure entropy sequence | |
CN111310176B (en) | Intrusion detection method and device based on feature selection | |
CN111159996B (en) | Short text set similarity comparison method and system based on text fingerprint algorithm | |
CN113742525A (en) | Self-supervision video hash learning method, system, electronic equipment and storage medium | |
CN113312619A (en) | Malicious process detection method and device based on small sample learning, electronic equipment and storage medium | |
WO2021072892A1 (en) | Legal provision search method based on neural network hybrid model, and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||