CN114863995A - Silencer prediction algorithm based on bidirectional gated recurrent neural network - Google Patents


Info

Publication number
CN114863995A
Authority
CN
China
Prior art keywords
sequence
data
neural network
silencer
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210325550.5A
Other languages
Chinese (zh)
Other versions
CN114863995B (en)
Inventor
郑春厚
江林杰
魏丕静
苏延森
夏俊峰
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202210325550.5A
Publication of CN114863995A
Application granted
Publication of CN114863995B
Active (legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a silencer prediction algorithm based on a bidirectional gated recurrent neural network. The algorithm comprises the following steps: S1, collecting a data set; S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in step S1; S3, training and verifying the model constructed in step S2; and S4, predicting the probability of a silencer using the model trained in step S3. The invention trains on the training set data multiple times to construct an optimal model for silencer prediction and classification, contributing to subsequent silencer prediction and research.

Description

Silencer prediction algorithm based on bidirectional gated recurrent neural network
Technical Field
The invention relates to the field of bioinformatics computation, in particular to a silencer prediction algorithm based on a bidirectional gated recurrent neural network.
Background
In bioinformatics, a silencer is a DNA sequence in a non-coding region. In contrast to enhancers, which enhance DNA transcription, silencers repress the expression of a gene. The gene sequence on DNA is the template for the synthesis of messenger RNA, which is ultimately translated into protein. In the presence of a silencer, binding of a repressor protein to the silencer sequence prevents RNA polymerase from transcribing the DNA sequence, thereby preventing translation of the RNA into protein; silencers therefore act to block gene expression. For example, deletion of the silencer regions related to the drug transport genes ABCC2 and ABCG2 on chromosome 10 closes the drug transport channel, resulting in chemotherapy resistance. The existing machine learning silencer prediction model, gkm-SVM, was trained on data obtained from MPRA analysis. With the development of bioinformatics technology, the importance of studying the effect of silencers on gene expression is becoming more and more prominent, yet as data sample sizes gradually increase, the generalization capability of such machine learning methods declines. Therefore, in order to solve this technical problem, a new technical means needs to be provided.
Disclosure of Invention
In order to solve the existing problems, the invention provides a silencer prediction algorithm based on a bidirectional gated recurrent neural network, which comprises the following specific scheme:
a silencer prediction algorithm based on a bidirectional gated recurrent neural network comprises the following steps:
s1, collecting a data set;
s2, constructing a bidirectional gating recurrent neural network model based on the data set collected in the step 1;
s3, training and verifying the model constructed in the step 2;
and S4, predicting the probability of the silencer according to the model trained in the step 3.
Preferably, the step of collecting the data set in step 1 comprises:
SA1, downloading silencer sequences from a known database and collecting the data sets used by existing machine learning models;
SA2, applying the inter-group scrambling method to the positive samples among the silencer sequences downloaded in step SA1, and de-duplicating, to obtain the corresponding negative samples.
Preferably, the construction of the negative samples in step SA2 uses the inter-group scrambling method, which comprises the following steps:
SA21, dividing the positive sample into a plurality of segments, wherein the dividing step size is 1, and the length of each segment is k, and if the sequence length of the positive sample cannot be divided by k, the length of the last segment is the remainder of dividing the sequence length of the positive sample by k;
SA22, the fragments generated from each positive sample in step A21 are combined to obtain a new sequence.
Preferably, the step of constructing the bidirectional gated recurrent neural network model in step 2 includes:
SB1, preprocessing the data in the data set collected in step 1;
SB2, taking a convolutional neural network (CNN) with a feature extraction function and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features of the target data set; specifically, the CNN performs convolution operations on the data, with the convolution layers connected in parallel and the convolution kernel sizes increasing in turn; the convolved data are input into the BiGRU to obtain its output, finally yielding the feature information of the sequence;
SB3, information capture using a multi-head self-attention mechanism, where the multiple heads represent multiple different representation subspaces, according to head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), where the W_i are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all information-capture results are concatenated and passed through a fully connected layer to obtain the final global information;
SB4, performing target classification on the global information obtained in step SB3; specifically, the output of the previous layer is input to a fully connected layer, and a cross-entropy loss function is selected to perform the binary classification task.
Preferably, the preprocessing of the data in step B1 is for converting the nucleotide sequence data into digitized data that can be input to a feature extractor, and the preprocessing comprises:
SB11, dictionary coding - converting the 16 dinucleotides contained in each sequence in the data set into 1, 2, 3, …, 16, respectively;
SB12, sequence completion - because the CNN input must be a fixed-length sequence, every sequence in the data set is zero-filled with the number 0 up to the length of the longest sequence in the data set;
SB13, word embedding - since the numerical representation does not reflect the positional relationship between the elements of a sequence, word embedding converts each token into a vector, which can correctly represent the relationships between the elements of the sequence.
Preferably, the method for training and verifying the model in step 3 includes: dividing a data set into a training set and a verification set according to a five-fold cross verification mode, wherein the training set is used for constructing and training a bidirectional gated cyclic neural network model, and the verification set is used for adjusting parameters of the model to finally obtain an optimal model.
Preferably, the prediction method in step 4 includes: when external data needs to be predicted, the sequence data is directly input into a trained bidirectional gating cyclic neural network model for prediction, and the probability that the sequence data is an immune cell silencer is obtained.
The invention also discloses a computer readable storage medium, which stores a computer program, and after the computer program runs, the silencer prediction algorithm based on the bidirectional gated recurrent neural network is executed.
The invention also discloses a computer system, which comprises a processor and a storage medium, wherein the storage medium is stored with a computer program, and the processor reads the computer program from the storage medium and runs the computer program to execute the silencer prediction algorithm based on the bidirectional gated recurrent neural network.
The invention has the beneficial effects that:
the invention adopts multiple training on the training set data to construct an optimal model for silencer prediction and classification, and makes contribution to the prediction development of subsequent silencers.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a general framework flow diagram of the present invention;
FIG. 2 is a schematic diagram of a model of a bi-directional gated recurrent neural network;
FIG. 3 is a table illustrating data sets according to the present invention;
FIG. 4 is a table comparing deep learning and machine learning methods of the first data set of the present invention;
FIG. 5 is a table comparing deep learning and machine learning methods for data set two according to the present invention;
FIG. 6 is a table comparing deep learning and machine learning methods on data set three according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a silencer prediction algorithm based on a bidirectional gated recurrent neural network includes the following steps:
s1, collecting the data set.
Specifically, as shown in fig. 3, the data sets include data set one, data set two, and data set three. Data set one consists of the K562 cancer cell silencer data for the human and mouse genomes, published in Nature Communications in 2020 and used to train gkm-SVM, with 2000 positive and 2000 negative samples, each 200 bp in length. For data set two, immune T cell silencer sequences were downloaded from the database SilencerDB as positive samples, and the corresponding negative samples, 150 bp in length, were constructed from the positive samples by the inter-group scrambling method followed by de-duplication. For data set three, human cell silencer sequences were downloaded from the database SilencerDB as positive samples, and the corresponding negative samples, 150 bp in length, were likewise obtained by inter-group scrambling and de-duplication. In each case the final sequence data have a positive-to-negative sample ratio of 1:1.
In summary, the step of collecting the data set in step 1 comprises:
SA1, downloading silencer sequences from a known database and collecting the data sets of machine learning models provided by existing references;
SA2, applying the inter-group scrambling method to the positive samples among the silencer sequences downloaded in step SA1, and de-duplicating, to obtain the corresponding negative samples.
The method for inter-group scrambling used for construction of the negative sample in the step A2 comprises the following steps:
SA21, dividing the positive sample into a plurality of segments, wherein the dividing step size is 1, and the length of each segment is k, and if the sequence length of the positive sample cannot be divided by k, the length of the last segment is the remainder of dividing the sequence length of the positive sample by k;
SA22, the fragments generated from each positive sample in step SA21 are rearranged and combined to obtain a new sequence. The encoding method selects overlapping division with k = 2; for example, the sequence ATCG is segmented into (AT)(TC)(CG).
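Steps SA21 and SA22 can be sketched in Python as follows. The patent text does not fully specify how fragments are recombined, so this minimal sketch (function names are illustrative) shuffles the fragments of a single positive sample to destroy the original motif order:

```python
import random

def fragment(seq, k=2):
    # Overlapping division with step size 1: "ATCG" -> ["AT", "TC", "CG"].
    # If the sequence is shorter than k, the single fragment is the remainder.
    if len(seq) < k:
        return [seq]
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def scrambled_negative(pos_seq, k=2, seed=None):
    # Shuffle the fragments of one positive sample and rejoin them,
    # preserving local k-mer composition while scrambling the order.
    frags = fragment(pos_seq, k)
    random.Random(seed).shuffle(frags)
    return "".join(frags)
```

Note that with overlapping division the rejoined sequence is longer than the original; de-duplication against the positive set would follow as in step SA2.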
And S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in the step 1. Fig. 2 is a schematic diagram of a bidirectional gated recurrent neural network model.
The method for constructing the bidirectional gated recurrent neural network model comprises the following steps of:
SB1, data preprocessing-preprocessing the data in the data set collected in step 1; the purpose of the pre-processing is to convert the nucleotide sequence data into digitized data that can be input to a feature extractor, the steps of the pre-processing including:
SB11, dictionary coding - converting the 16 dinucleotides contained in each sequence in the data set (AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, GG) into 1, 2, 3, …, 16, respectively;
SB12, sequence completion - because the input of the convolutional neural network (CNN) must be a fixed-length sequence, every sequence in the data set is zero-filled with the number 0 up to the length of the longest sequence in the data set;
SB13, word embedding - since the numerical representation does not reflect the positional relationship between the elements of a sequence, word embedding converts each token into a vector, which can correctly represent the relationships between the elements of the sequence.
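The preprocessing steps SB11 and SB12 can be sketched as follows. The exact dictionary ordering is an assumption; any fixed one-to-one mapping of the 16 dinucleotides to 1..16 works, with 0 reserved for padding:

```python
from itertools import product

# Dictionary coding: the 16 dinucleotides mapped to the integers 1..16
# (0 is reserved for the zero-padding step below).
DINUC_CODE = {"".join(p): i + 1 for i, p in enumerate(product("ATCG", repeat=2))}

def encode(seq):
    # Overlapping dinucleotides (step size 1), looked up in the dictionary.
    return [DINUC_CODE[seq[i:i + 2]] for i in range(len(seq) - 1)]

def zero_pad(encoded_batch):
    # The CNN input must be fixed length, so every encoded sequence is
    # filled with 0 up to the length of the longest sequence in the batch.
    max_len = max(len(s) for s in encoded_batch)
    return [s + [0] * (max_len - len(s)) for s in encoded_batch]
```

A trainable word-embedding layer (step SB13) then maps each integer to a dense vector so that relationships between sequence elements can be represented.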
SB2, feature extraction - taking a convolutional neural network (CNN) with a feature extraction function and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features of the target data set; specifically, the CNN performs convolution operations on the data, with the convolution layers connected in parallel and the convolution kernel sizes increasing in turn; the convolved data are then input into the BiGRU to obtain its output, finally yielding the feature information of the sequence;
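To make the BiGRU computation concrete, here is a minimal NumPy sketch of a single-layer bidirectional GRU; the weight matrices are illustrative random parameters, and a real implementation would use a deep learning framework with trained weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_pass(x, Wz, Uz, Wr, Ur, Wh, Uh):
    # One directional GRU pass over x of shape (seq_len, d_in);
    # returns the hidden states, shape (seq_len, d_h).
    h = np.zeros(Uz.shape[0])
    states = []
    for x_t in x:
        z = sigmoid(x_t @ Wz.T + h @ Uz.T)             # update gate
        r = sigmoid(x_t @ Wr.T + h @ Ur.T)             # reset gate
        h_cand = np.tanh(x_t @ Wh.T + (r * h) @ Uh.T)  # candidate state
        h = (1.0 - z) * h + z * h_cand
        states.append(h)
    return np.stack(states)

def bigru(x, fwd_params, bwd_params):
    # Bidirectional GRU: a forward pass plus a pass over the reversed
    # sequence, concatenated per position -> shape (seq_len, 2 * d_h).
    fwd = gru_pass(x, *fwd_params)
    bwd = gru_pass(x[::-1], *bwd_params)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

The per-position outputs (past and future context concatenated) are what the attention stage consumes.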
SB3, sequence feature capture - information capture using a multi-head self-attention mechanism, where the multiple heads represent multiple different representation subspaces. For example, in natural language processing the word "apple" carries the meaning of a fruit while also being a brand name, and the different meanings are learned by different representation subspaces. The heads are computed according to head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), where the W_i are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all information-capture results are concatenated and, through the matrix transformation of a fully connected layer, yield the final global information;
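The multi-head computation head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) can be sketched in NumPy as follows; dimensions and weight initialization are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_self_attention(X, WQ, WK, WV, WO):
    # One head per (WQ_i, WK_i, WV_i) triple of trainable weight matrices:
    # head_i = Attention(X WQ_i, X WK_i, X WV_i). The heads are concatenated
    # and projected by WO, the fully connected transformation.
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO
```

In self-attention, Q, K, and V are all derived from the same input X, here the per-position features produced by the CNN-BiGRU extractor.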
SB4, target classification - classifying the global information obtained in step SB3; specifically, the output of the previous layer is input to a fully connected layer, and a cross-entropy loss function is selected to perform the binary classification task.
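The final classification step amounts to a fully connected layer with a sigmoid output trained under binary cross-entropy; a sketch, with layer sizes left to the feature extractor:

```python
import numpy as np

def fc_sigmoid(features, W, b):
    # Fully connected layer followed by a sigmoid, giving the probability
    # that each input sequence is a silencer.
    return 1.0 / (1.0 + np.exp(-(features @ W + b)))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Cross-entropy loss for the binary silencer / non-silencer task:
    # -mean( y*log(p) + (1-y)*log(1-p) ).
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```

The loss is small when predicted probabilities agree with the labels and large when they disagree, which is what gradient descent minimizes during training.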
S3, training and verifying the model constructed in step S2. The training and verification method comprises: dividing the data set into a training set and a verification set in a five-fold cross-validation manner, i.e. at a ratio of 4:1, wherein the training set is used for constructing and training the bidirectional gated recurrent neural network model, and the verification set is used for tuning the parameters of the model, finally obtaining an optimal model.
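The five-fold split described above can be sketched as follows; each round holds out one fold for verification and trains on the remaining four (the 4:1 ratio):

```python
import numpy as np

def five_fold_indices(n, seed=0):
    # Shuffle the sample indices and split them into 5 folds; yield
    # (train, verification) index pairs, one per cross-validation round.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, 5)
    for i in range(5):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train, val
```

Each of the five rounds trains a model; the parameter setting that performs best on the verification folds is kept as the optimal model.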
And S4, predicting the probability of the silencer according to the model trained in the step 3. The prediction method comprises the following steps: when external data needs to be predicted, the sequence data is directly input into a trained bidirectional gating cyclic neural network model for prediction, and the probability that the sequence data is an immune cell silencer is obtained.
The parameter indexes of the invention are as follows:
The validation criteria we used include Recall, Precision (PRE), Accuracy (ACC), and AUC (Area Under Curve), which are calculated as follows:
Recall = TP / (TP + FN)
PRE = TP / (TP + FP)
ACC = (TP + TN) / (TP + TN + FP + FN)
AUC = the area under the ROC curve, obtained by plotting the true positive rate against the false positive rate over all thresholds
wherein TP (true positive) denotes the number of true positives, i.e., true immune cell silencer sequences correctly predicted as silencer sequences; TN (true negative) denotes the number of true negatives, i.e., true non-silencer sequences correctly predicted as non-silencer sequences; FP (false positive) denotes the number of false positives, i.e., non-silencer sequences incorrectly predicted as immune cell silencer sequences; and FN (false negative) denotes the number of false negatives, i.e., true immune cell silencer sequences incorrectly predicted as non-silencer sequences. In addition, AUC and ACC are adopted in the experiments to measure the overall performance of the model. In general, Recall, Precision and ACC are influenced by the threshold: a score greater than or equal to the threshold is predicted as a positive sample and a score below it as a negative sample, with the threshold set to 0.5 by default but adjustable manually. AUC is not affected by the threshold; it ranges between 0 and 1, and the closer it is to 1, the better the overall performance of the model, so AUC and ACC are often regarded as the most important evaluation indexes.
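Under these definitions, the confusion-matrix counts combine into the threshold-dependent metrics as a direct transcription of the formulas above:

```python
def classification_metrics(tp, tn, fp, fn):
    # Recall: fraction of true silencer sequences that were recovered.
    # PRE (precision): fraction of predicted silencers that are truly silencers.
    # ACC (accuracy): overall fraction of correct predictions.
    return {
        "Recall": tp / (tp + fn),
        "PRE": tp / (tp + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }
```

AUC is not computed from a single confusion matrix; it integrates the trade-off between the true positive rate and the false positive rate across all thresholds.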
Specifically, in order to verify the superiority of the bidirectional gated recurrent neural network model over the current machine learning model gkm-SVM, three sets of experiments were carried out using the three data sets. Data set one: 2000 positive and 2000 negative samples, sequence length 200 bp. Data set two: 7142 immune T cell silencer positive samples, 150 bp in length. Data set three: 8000 human immune cell silencer positive samples, 150 bp in length. As shown in fig. 4, fig. 5 and fig. 6, the performance of the two models is compared on 4 evaluation indexes (Recall, Precision, ACC and AUC); as can be seen from the tables, the deep learning model achieves a substantial improvement over the machine learning model on all three data sets.
The invention adopts multiple training on the training set data to construct an optimal model for silencer prediction and classification, and makes contribution to the prediction development of subsequent silencers.
The invention also discloses a computer readable storage medium, which stores a computer program, and after the computer program runs, the silencer prediction algorithm based on the bidirectional gated recurrent neural network is executed.
The invention also discloses a computer system, which comprises a processor and a storage medium, wherein the storage medium is stored with a computer program, and the processor reads the computer program from the storage medium and runs the computer program to execute the silencer prediction algorithm based on the bidirectional gated recurrent neural network.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A silencer prediction algorithm based on a bidirectional gated recurrent neural network is characterized by comprising the following steps:
s1, collecting a data set;
s2, constructing a bidirectional gating recurrent neural network model based on the data set collected in the step 1;
s3, training and verifying the model constructed in the step 2;
and S4, predicting the probability of the silencer according to the model trained in the step 3.
2. The algorithm of claim 1, wherein the step of collecting the data set in step 1 comprises:
SA1, downloading silencer sequences from a known database and collecting the data sets used by existing machine learning models;
SA2, applying the inter-group scrambling method to the positive samples among the silencer sequences downloaded in step SA1, and de-duplicating, to obtain the corresponding negative samples.
3. The algorithm of claim 2, wherein the construction of the negative samples in step SA2 uses the inter-group scrambling method, comprising the following steps:
SA21, dividing the positive sample into a plurality of segments, wherein the dividing step size is 1, and the length of each segment is k, and if the sequence length of the positive sample cannot be divided by k, the length of the last segment is the remainder of dividing the sequence length of the positive sample by k;
SA22, the fragments generated from each positive sample in step SA21 are rearranged and combined to obtain a new sequence.
4. The algorithm of claim 1, wherein constructing the bidirectional gated recurrent neural network model in step 2 comprises:
SB1, preprocessing the data in the data set collected in step 1;
SB2, using a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features from the target data set; specifically, the CNN performs convolution operations on the data, wherein the convolution layers are connected in parallel and their kernel sizes increase in turn; the convolved data are then input into the BiGRU, whose output constitutes the feature information of the sequence;
SB3, capturing information with a multi-head self-attention mechanism, where "multi-head" denotes multiple distinct representation subspaces; each head is computed as head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), where W_i^Q, W_i^K and W_i^V are three different trainable weight matrices and Q, K, V are the initialization vectors; finally, the outputs of all heads are concatenated and passed through a fully connected layer to obtain the final global information;
SB4, classifying the target based on the global information obtained in step B3; specifically, the output of the previous layer is fed into a fully connected layer, and a cross-entropy loss function is used to perform the binary classification task.
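The per-head computation head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) in step B3 can be illustrated with a minimal NumPy sketch of multi-head self-attention. Scaled dot-product attention is assumed, as is standard; the weight shapes and function names are illustrative and not taken from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, weights, num_heads):
    """x: (seq_len, d_model). weights["Wq"/"Wk"/"Wv"]: per-head
    projection matrices of shape (num_heads, d_model, d_k);
    weights["Wo"]: output matrix of shape (num_heads * d_k, d_model)."""
    heads = []
    for i in range(num_heads):
        q = x @ weights["Wq"][i]                 # QW_i^Q
        k = x @ weights["Wk"][i]                 # KW_i^K
        v = x @ weights["Wv"][i]                 # VW_i^V
        d_k = q.shape[-1]
        scores = softmax(q @ k.T / np.sqrt(d_k)) # scaled dot-product
        heads.append(scores @ v)                 # head_i
    concat = np.concatenate(heads, axis=-1)      # splice all heads
    return concat @ weights["Wo"]                # fully connected layer
```

Each head attends to the sequence in its own subspace; concatenating the heads and projecting through Wo yields the global information that claim 4 feeds to the classifier.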
5. The algorithm of claim 4, wherein the preprocessing in step B1 converts the nucleotide sequence data into numerical data that can be input to the feature extractor, the preprocessing comprising:
SB11, dictionary encoding: the 16 nucleotide tokens contained in each sequence in the data set are converted into the integers 1, 2, 3, …, 16, respectively;
SB12, sequence padding: because the CNN requires fixed-length input, every sequence in the data set is padded with the number 0 to the length of the longest sequence in the data set;
SB13, word embedding: because the numerical representation does not reflect the positional relationships among the elements of a sequence, word embedding converts the tokens into vectors that can correctly represent the relationships among the elements.
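The three preprocessing steps can be sketched as follows, assuming the 16-token dictionary corresponds to the 16 possible dinucleotides taken without overlap (the claim only states 16 tokens, so this vocabulary is an assumption); 0 is reserved for padding, and all names are hypothetical.

```python
import itertools
import numpy as np

# Assumed vocabulary: the 16 dinucleotides AA..TT mapped to 1..16
# (0 is reserved for the zero-padding token).
VOCAB = {"".join(p): i + 1
         for i, p in enumerate(itertools.product("ACGT", repeat=2))}

def encode(seq):
    """Dictionary encoding: map non-overlapping dinucleotides to 1..16."""
    return [VOCAB[seq[i:i + 2]] for i in range(0, len(seq) - 1, 2)]

def pad(batch):
    """Sequence padding: zero-pad every encoded sequence to the
    length of the longest sequence in the batch."""
    max_len = max(len(s) for s in batch)
    return np.array([s + [0] * (max_len - len(s)) for s in batch])

def embed(ids, table):
    """Word embedding: look up a dense vector for every token id."""
    return table[ids]
```

In a real model the embedding table would be a trainable layer; here a plain array lookup shows the shape transformation from token ids to vectors.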
6. The algorithm of claim 1, wherein training and validating the model in step 3 comprises: dividing the data set into a training set and a validation set by five-fold cross-validation, wherein the training set is used to build and train the bidirectional gated recurrent neural network model and the validation set is used to tune the model parameters, finally yielding the optimal model.
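The five-fold partition in claim 6 can be sketched as follows (an illustrative index split, not the patented implementation; the function name is hypothetical):

```python
import random

def five_fold_split(n_samples, seed=0):
    """Partition sample indices into 5 folds; each fold serves once
    as the validation set while the remaining four folds form the
    training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for i in range(5):
        val = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        yield train, val
```

Every sample appears in exactly one validation fold, so each model configuration is scored on all of the data before the best parameters are chosen.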
7. The algorithm of claim 1, wherein the prediction in step 4 comprises: when external data need to be predicted, the sequence data are input directly into the trained bidirectional gated recurrent neural network model, which outputs the probability that the external data represent an immune-cell silencer.
8. A computer-readable storage medium, characterized in that: a computer program is stored on the medium, and when executed, the program performs the bidirectional gated recurrent neural network-based silencer prediction algorithm of any one of claims 1 to 7.
9. A computer system, characterized in that: it comprises a processor and a storage medium storing a computer program, the processor reading and executing the computer program from the storage medium to perform the bidirectional gated recurrent neural network-based silencer prediction algorithm of any one of claims 1 to 7.
CN202210325550.5A 2022-03-30 2022-03-30 Silencer prediction method based on bidirectional gating cyclic neural network Active CN114863995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210325550.5A CN114863995B (en) 2022-03-30 2022-03-30 Silencer prediction method based on bidirectional gating cyclic neural network


Publications (2)

Publication Number Publication Date
CN114863995A true CN114863995A (en) 2022-08-05
CN114863995B CN114863995B (en) 2024-05-07

Family

ID=82630315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210325550.5A Active CN114863995B (en) 2022-03-30 2022-03-30 Silencer prediction method based on bidirectional gating cyclic neural network

Country Status (1)

Country Link
CN (1) CN114863995B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020215694A1 (en) * 2019-04-22 2020-10-29 Ping An Technology (Shenzhen) Co., Ltd. Chinese word segmentation method and apparatus based on deep learning, and storage medium and computer device
CN111986730A (en) * 2020-07-27 2020-11-24 Suzhou Institute of Intelligent Computing Technology, Institute of Computing Technology, Chinese Academy of Sciences Method for predicting siRNA silencing efficiency
CN114121145A (en) * 2021-11-26 2022-03-01 Anhui University Phage promoter prediction method based on multi-source transfer learning


Also Published As

Publication number Publication date
CN114863995B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
JP6955580B2 (en) Document summary automatic extraction method, equipment, computer equipment and storage media
US11593556B2 (en) Methods and systems for generating domain-specific text summarizations
CN111563209B (en) Method and device for identifying intention and computer readable storage medium
WO2016062044A1 (en) Model parameter training method, device and system
Bai et al. NHL Pathological Image Classification Based on Hierarchical Local Information and GoogLeNet‐Based Representations
JP2013206187A (en) Information conversion device, information search device, information conversion method, information search method, information conversion program and information search program
WO2016095645A1 (en) Stroke input method, device and system
CN110727765A (en) Problem classification method and system based on multi-attention machine mechanism and storage medium
CN110992941A (en) Power grid dispatching voice recognition method and device based on spectrogram
CN117708339B (en) ICD automatic coding method based on pre-training language model
WO2021223467A1 (en) Gene regulatory network reconstruction method and system, and device and medium
CN114881131A (en) Biological sequence processing and model training method
CN116994745B (en) Multi-mode model-based cancer patient prognosis prediction method and device
CN113886821A (en) Malicious process identification method and device based on twin network, electronic equipment and storage medium
CN117527495A (en) Modulation mode identification method and device for wireless communication signals
CN111091001B (en) Method, device and equipment for generating word vector of word
CN114863995A (en) Silencer prediction algorithm based on bidirectional gated recurrent neural network
CN116010902A (en) Cross-modal fusion-based music emotion recognition method and system
CN109558735A (en) A kind of rogue program sample clustering method and relevant apparatus based on machine learning
CN111552963B (en) Malicious software classification method based on structure entropy sequence
CN111310176B (en) Intrusion detection method and device based on feature selection
CN111159996B (en) Short text set similarity comparison method and system based on text fingerprint algorithm
CN113742525A (en) Self-supervision video hash learning method, system, electronic equipment and storage medium
CN113312619A (en) Malicious process detection method and device based on small sample learning, electronic equipment and storage medium
WO2021072892A1 (en) Legal provision search method based on neural network hybrid model, and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant