CN114863995A - Silencer prediction algorithm based on bidirectional gated recurrent neural network - Google Patents
- Publication number
- CN114863995A (application CN202210325550.5A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- data
- neural network
- silencer
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G16B15/00 — ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a silencer prediction algorithm based on a bidirectional gated recurrent neural network, the algorithm comprising the following steps: S1, collecting a data set; S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in step 1; S3, training and verifying the model constructed in step 2; and S4, predicting the probability of a silencer with the model trained in step 3. The invention trains repeatedly on the training set data to construct an optimal model for silencer prediction and classification, contributing to subsequent silencer prediction and development.
Description
Technical Field
The invention relates to the field of biological information calculation, in particular to a silencer prediction algorithm based on a bidirectional gated recurrent neural network.
Background
In bioinformatics, a silencer is a non-coding DNA region; in contrast to enhancers, which enhance DNA transcription, silencers inhibit gene expression. The gene sequence on DNA is the template for the synthesis of messenger RNA, which is ultimately translated into protein. In the presence of a silencer, binding of a repressor protein to the silencer sequence prevents RNA polymerase from transcribing the DNA sequence, thereby preventing translation of the RNA into protein; silencers thus act to block gene expression. For example, deletion of silencer regions related to the drug transport genes ABCC2 and ABCG2 on chromosome 10 closes the drug transport channel, resulting in chemotherapy resistance. The existing silencer machine learning prediction model, gkm-SVM, is trained on data from MPRA (massively parallel reporter assay) experiments. With the development of bioinformatics technology, studying the effect of silencers on gene expression is becoming ever more important, yet as data sample sizes grow, the generalization capability of traditional machine learning methods falls short. A new technical means is therefore needed to solve this problem.
Disclosure of Invention
In order to solve the existing problems, the invention provides a silencer prediction algorithm based on a bidirectional gated recurrent neural network, which comprises the following specific scheme:
a silencer prediction algorithm based on a bidirectional gated recurrent neural network comprises the following steps:
s1, collecting a data set;
s2, constructing a bidirectional gating recurrent neural network model based on the data set collected in the step 1;
s3, training and verifying the model constructed in the step 2;
and S4, predicting the probability of the silencer according to the model trained in the step 3.
Preferably, the step of collecting the data set in step 1 comprises:
SA1, downloading silencer sequences from a known database and collecting a data set of an existing machine learning model;
SA2, applying the inter-group scrambling method to the positive samples in the silencer sequences downloaded in step A1 and de-duplicating to obtain the corresponding negative samples.
Preferably, the construction of the negative samples in step A2 uses an inter-group scrambling method comprising the steps of:
SA21, dividing the positive sample into a plurality of segments with a dividing step size of 1 and a segment length of k; if the sequence length of the positive sample is not divisible by k, the length of the last segment is the remainder of the sequence length divided by k;
SA22, the fragments generated from each positive sample in step A21 are combined to obtain a new sequence.
Preferably, the step of constructing the bidirectional gated recurrent neural network model in step 2 includes:
SB1, preprocessing the data in the data set collected in step 1;
SB2, taking a convolutional neural network (CNN) with a feature extraction function and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features from the target data set; specifically, the CNN performs convolution operations on the data, with the convolution layers connected in parallel and the kernel sizes increasing in turn; the convolved data are input into the BiGRU to obtain its output, yielding the feature information of the sequence;
SB3, information capture using a multi-head self-attention mechanism, where the multiple heads represent multiple different representation subspaces; each head is computed as head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), where the W matrices are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all head outputs are concatenated and passed through a fully connected layer to obtain the final global information;
SB4, performing target classification on the global information obtained in step B3; specifically, the output of the previous layer is input to a fully connected layer, and a cross-entropy loss function is selected to perform the binary classification task.
Preferably, the preprocessing of the data in step B1 is for converting the nucleotide sequence data into digitized data that can be input to a feature extractor, and the preprocessing comprises:
SB11, dictionary coding — the 16 dinucleotides contained in each sequence in the data set are converted into the integers 1, 2, 3, ..., 16, respectively;
SB12, sequence completion — because the CNN input must be a fixed-length sequence, each sequence in the data set is padded with the number 0 (zero filling) to the length of the longest sequence in the data set;
SB13, word embedding-since the numerical representation does not reflect the positional relationship between each element in the sequence, word embedding converts words into a vector form, which can correctly represent the relationship between each element in the sequence.
Preferably, the method for training and verifying the model in step 3 includes: dividing the data set into a training set and a verification set by five-fold cross-validation, wherein the training set is used to construct and train the bidirectional gated recurrent neural network model and the verification set is used to tune the model's parameters, finally obtaining an optimal model.
Preferably, the prediction method in step 4 includes: when external data need to be predicted, the sequence data are input directly into the trained bidirectional gated recurrent neural network model for prediction, obtaining the probability that the sequence is an immune cell silencer.
The invention also discloses a computer readable storage medium, which stores a computer program, and after the computer program runs, the silencer prediction algorithm based on the bidirectional gated recurrent neural network is executed.
The invention also discloses a computer system, which comprises a processor and a storage medium, wherein the storage medium is stored with a computer program, and the processor reads the computer program from the storage medium and runs the computer program to execute the silencer prediction algorithm based on the bidirectional gated recurrent neural network.
The invention has the beneficial effects that:
the invention adopts multiple training on the training set data to construct an optimal model for silencer prediction and classification, and makes contribution to the prediction development of subsequent silencers.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a general framework flow diagram of the present invention;
FIG. 2 is a schematic diagram of a model of a bi-directional gated recurrent neural network;
FIG. 3 is a table illustrating data sets according to the present invention;
FIG. 4 is a table comparing deep learning and machine learning methods of the first data set of the present invention;
FIG. 5 is a table comparing deep learning and machine learning methods for data set two according to the present invention;
FIG. 6 is a table comparing deep learning and machine learning methods for data set three according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a silencer prediction algorithm based on a bidirectional gated recurrent neural network includes the following steps:
s1, collecting the data set.
Specifically, as shown in fig. 3, the data sets include data set one, data set two, and data set three. Data set one is the K562 cancer cell silencer data for the human and mouse genomes published in Nature Communications in 2020 and used to train gkm-SVM, with 2000 positive and 2000 negative samples, each 200 bp long. Data set two downloads immune T cell silencer sequences from the database SilencerDB as positive samples and applies the inter-group scrambling method with de-duplication to obtain corresponding negative samples, each 150 bp long. Data set three downloads human cell silencer sequences from SilencerDB as positive samples and likewise obtains corresponding 150 bp negative samples by inter-group scrambling and de-duplication. Each data set finally contains sequence data with a positive-to-negative sample ratio of 1:1.
In summary, the step of collecting the data set in step 1 comprises:
SA1, downloading silencer sequences from a known database and collecting a data set of machine learning models provided by existing references;
SA2, applying the inter-group scrambling method to the positive samples in the silencer sequences downloaded in step A1 and de-duplicating to obtain the corresponding negative samples.
The method for inter-group scrambling used for construction of the negative sample in the step A2 comprises the following steps:
SA21, dividing the positive sample into a plurality of segments with a dividing step size of 1 and a segment length of k; if the sequence length of the positive sample is not divisible by k, the length of the last segment is the remainder of the sequence length divided by k;
SA22, the fragments generated from each positive sample in step A21 are rearranged and combined to obtain a new sequence. Overlapping division is selected for encoding, with k = 2; for example, the sequence ATCG is segmented into (AT)(TC)(CG).
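The inter-group scrambling steps above can be sketched in Python. This is a minimal illustration under an assumption the patent does not spell out: the shuffled overlapping k-mers are rejoined by their leading bases so the negative sequence keeps the positive sample's length. `scrambled_negative` is a hypothetical helper, not the patent's implementation.

```python
import random

def kmer_segments(seq: str, k: int = 2) -> list:
    """Overlapping k-mer segments with step size 1, e.g. ATCG -> [AT, TC, CG]."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def scrambled_negative(seq: str, k: int = 2, seed: int = 0) -> str:
    """Shuffle the k-mer segments, then rejoin the first base of each
    segment plus the tail of the last one, so the scrambled negative
    sequence has the same length as the positive sample (an assumed
    recombination rule; the patent only says segments are combined)."""
    segs = kmer_segments(seq, k)
    random.Random(seed).shuffle(segs)
    return "".join(s[0] for s in segs) + segs[-1][1:]

print(kmer_segments("ATCG"))                 # ['AT', 'TC', 'CG']
neg = scrambled_negative("ATCGATCG")
print(len(neg) == len("ATCGATCG"))           # True
```

De-duplicating the resulting negatives against the positive set (step SA2) would then be a simple set-membership check.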
And S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in the step 1. Fig. 2 is a schematic diagram of a bidirectional gated recurrent neural network model.
The method for constructing the bidirectional gated recurrent neural network model comprises the following steps of:
SB1, data preprocessing-preprocessing the data in the data set collected in step 1; the purpose of the pre-processing is to convert the nucleotide sequence data into digitized data that can be input to a feature extractor, the steps of the pre-processing including:
SB11, dictionary coding — the 16 dinucleotides contained in each sequence in the data set (AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, GG) are converted into the integers 1, 2, 3, ..., 16, respectively;
SB12, sequence completion — because the input of the convolutional neural network CNN must be a fixed-length sequence, each sequence in the data set is padded with the number 0 (zero filling) to the length of the longest sequence in the data set;
SB13, word embedding-since the numerical representation does not reflect the positional relationship between each element in the sequence, word embedding converts words into a vector form, which can correctly represent the relationship between each element in the sequence.
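The dictionary coding and zero-filling steps can be sketched as follows. This is a minimal illustration assuming the 16 dinucleotides are numbered in one fixed order with 0 reserved for padding; the text does not specify the exact numbering.

```python
from itertools import product

# Dictionary code: map the 16 dinucleotides to 1..16 (0 is reserved for padding).
DINUC2ID = {a + b: i for i, (a, b) in enumerate(product("ATCG", repeat=2), start=1)}

def encode(seq: str) -> list:
    """Overlapping dinucleotide tokens -> integer ids (step size 1, k = 2)."""
    return [DINUC2ID[seq[i:i + 2]] for i in range(len(seq) - 1)]

def pad(batch: list) -> list:
    """Zero-fill every encoded sequence to the longest length in the batch,
    so the CNN receives fixed-length input."""
    max_len = max(len(s) for s in batch)
    return [s + [0] * (max_len - len(s)) for s in batch]

batch = pad([encode("ATCG"), encode("ATCGAT")])
print(batch[0])  # [2, 7, 12, 0, 0] — three real tokens, two padding zeros
```

The word-embedding step would then map each integer id to a trainable dense vector (e.g. an embedding layer with 17 rows: 16 dinucleotides plus the padding id).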
SB2, feature extraction — taking a convolutional neural network (CNN) with a feature extraction function and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features from the target data set; specifically, the CNN performs convolution operations on the data, with the convolution layers connected in parallel and the kernel sizes increasing in turn; the convolved data are input into the BiGRU to obtain its output, yielding the feature information of the sequence;
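A minimal numpy sketch of the BiGRU half of the feature extractor: one GRU runs forward, a second runs backward, and their final hidden states are concatenated. The parallel-convolution branch is omitted, and the gate equations follow the standard GRU formulation rather than any patent-specific variant; random matrices stand in for trained weights.

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: update gate z, reset gate r, candidate state h~."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ Wz + h @ Uz)
    r = sig(x @ Wr + h @ Ur)
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1.0 - z) * h + z * h_cand

def make_params(d_in, d_h, rng):
    shapes = [(d_in, d_h), (d_h, d_h)] * 3   # Wz, Uz, Wr, Ur, Wh, Uh
    return [rng.standard_normal(s) * 0.1 for s in shapes]

def bigru(xs, d_h, rng):
    """Run a GRU over the sequence forward and a second GRU backward,
    each with its own parameters, and concatenate the final states."""
    d_in = xs.shape[1]
    fwd, bwd = make_params(d_in, d_h, rng), make_params(d_in, d_h, rng)
    h_f = np.zeros(d_h)
    for x in xs:
        h_f = gru_step(x, h_f, *fwd)
    h_b = np.zeros(d_h)
    for x in xs[::-1]:
        h_b = gru_step(x, h_b, *bwd)
    return np.concatenate([h_f, h_b])        # shape (2 * d_h,)

rng = np.random.default_rng(0)
seq = rng.standard_normal((10, 8))           # 10 timesteps, 8 features each
feat = bigru(seq, d_h=16, rng=rng)
print(feat.shape)                            # (32,)
```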
SB3, sequence feature capture — information capture with a multi-head self-attention mechanism, where the multiple heads represent multiple different representation subspaces; in natural language processing, for example, the word "apple" carries the meaning of a fruit but also of a trademark, and the different meanings are learned by different representation subspaces. Each head is computed as head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V), where the W matrices are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all head outputs are concatenated and passed through the matrix transformation of a fully connected layer to obtain the final global information;
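The per-head computation head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V) can be illustrated with a small numpy sketch using standard scaled dot-product attention. Random matrices stand in for the trained weights, so this shows only shapes and data flow, not the patent's trained model.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    """head_i = Attention(X W_i^Q, X W_i^K, X W_i^V); heads are
    concatenated and mixed by W_O (the fully connected layer)."""
    L, d = X.shape
    d_k = d // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))  # each row of A sums to 1
        heads.append(A @ V)
    Wo = rng.standard_normal((d, d)) * 0.1
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))              # 5 sequence positions, width 8
out = multi_head_self_attention(X, num_heads=2, rng=rng)
print(out.shape)                             # (5, 8)
```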
SB4, target classification — the global information obtained in step B3 is classified; specifically, the output of the previous layer is input to a fully connected layer, and a cross-entropy loss function is selected to perform the binary classification task.
S3, training and verifying the model constructed in step 2; the training and verifying method comprises: dividing the data set into a training set and a verification set by five-fold cross-validation, i.e. in a 4:1 ratio, wherein the training set is used to construct and train the bidirectional gated recurrent neural network model and the verification set is used to tune the model's parameters, finally obtaining an optimal model.
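The five-fold, 4:1 split described above can be sketched in plain Python; `five_fold_splits` is an illustrative helper, not the patent's implementation.

```python
import random

def five_fold_splits(n_samples: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for five-fold cross-validation:
    each fold serves exactly once as the validation set, giving the
    4:1 train/validation ratio."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val

for train, val in five_fold_splits(100):
    print(len(train), len(val))   # 80 20, printed five times
```

In practice one would also stratify the split so each fold keeps the 1:1 positive-to-negative ratio of the data sets.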
And S4, predicting the probability of a silencer with the model trained in step 3. The prediction method comprises: when external data need to be predicted, the sequence data are input directly into the trained bidirectional gated recurrent neural network model for prediction, obtaining the probability that the sequence is an immune cell silencer.
The parameter indexes of the invention are as follows:
the verification criteria we used include Recall, Precision (PRE), Accuracy (ACC) and AUC (Area Under the ROC Curve), calculated as follows: Recall = TP / (TP + FN), Precision = TP / (TP + FP), ACC = (TP + TN) / (TP + TN + FP + FN).
Here TP (true positive) is the number of true immune cell silencer sequences correctly predicted as silencer sequences; TN (true negative) is the number of true non-silencer sequences correctly predicted as non-silencers; FP (false positive) is the number of non-silencer sequences incorrectly predicted as immune cell silencer sequences; and FN (false negative) is the number of true immune cell silencer sequences incorrectly predicted as non-silencers. In addition, AUC and ACC are adopted in the experiments to measure the overall performance of the model. In general, the threshold-dependent indicators predict a positive sample when the score is greater than or equal to the threshold and a negative sample otherwise; the threshold defaults to 0.5 but can be adjusted manually. AUC is not affected by the threshold and ranges between 0 and 1, with values closer to 1 representing better overall performance, so AUC and ACC are often regarded as the more important evaluation indexes.
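The threshold-based metrics can be computed directly from the four confusion counts; `confusion_metrics` is an illustrative helper with made-up example counts. AUC is omitted because it requires the full distribution of prediction scores rather than the counts alone (in practice one would use e.g. `sklearn.metrics.roc_auc_score`).

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int):
    """Recall, Precision and Accuracy from the confusion counts
    defined in the text above."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, acc

# Hypothetical counts for a batch of 100 predicted sequences.
r, p, a = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(r, 2), round(p, 2), round(a, 2))  # 0.8 0.89 0.85
```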
Specifically, to verify the superiority of the bidirectional gated recurrent neural network model over the current machine learning model gkm-SVM, three sets of experiments were carried out on the three data sets. Data set one: 2000 positive and 2000 negative samples, sequence length 200 bp. Data set two: 7142 immune T cell silencer positive samples, length 150 bp. Data set three: 8000 human immune cell silencer positive samples, length 150 bp. As shown in fig. 4, fig. 5 and fig. 6, comparing the two models on 4 evaluation indexes (Recall, Precision, ACC and AUC) shows that the deep learning model achieves a substantial improvement over the machine learning model on all three data sets.
The invention trains repeatedly on the training set data to construct an optimal model for silencer prediction and classification, contributing to subsequent silencer prediction and development.
The invention also discloses a computer readable storage medium, which stores a computer program, and after the computer program runs, the silencer prediction algorithm based on the bidirectional gated recurrent neural network is executed.
The invention also discloses a computer system, which comprises a processor and a storage medium, wherein the storage medium is stored with a computer program, and the processor reads the computer program from the storage medium and runs the computer program to execute the silencer prediction algorithm based on the bidirectional gated recurrent neural network.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A silencer prediction algorithm based on a bidirectional gated recurrent neural network is characterized by comprising the following steps:
S1, collecting a data set;
S2, constructing a bidirectional gated recurrent neural network model based on the data set collected in step S1;
S3, training and validating the model constructed in step S2;
S4, predicting the silencer probability with the model trained in step S3.
2. The algorithm of claim 1, wherein collecting the data set in step S1 comprises:
SA1, downloading silencer sequences from a known database and collecting the data sets of existing machine learning models;
SA2, applying the inter-group shuffling method to the positive samples, i.e., the silencer sequences downloaded in step SA1, and de-duplicating to obtain the corresponding negative samples.
3. The algorithm of claim 2, wherein constructing the negative samples in step SA2 by the inter-group shuffling method comprises:
SA21, dividing each positive sample into a number of fragments, with a dividing step size of 1 and a fragment length of k; if the sequence length of the positive sample is not divisible by k, the length of the last fragment is the remainder of dividing the sequence length by k;
SA22, permuting and concatenating the fragments generated from each positive sample in step SA21 to obtain a new sequence.
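The fragment-shuffling construction above can be sketched in a few lines of Python. This is an illustrative sketch only: the claim's "dividing step size is 1" is ambiguous, so non-overlapping length-k fragments (with a remainder tail, per step SA21) are assumed here, and the function name is hypothetical.

```python
import random

def make_negative(positive, k, seed=None):
    """Cut a positive silencer sequence into length-k fragments (the
    last fragment keeps the remainder) and concatenate the fragments
    in random order to form a candidate negative sample."""
    rng = random.Random(seed)
    fragments = [positive[i:i + k] for i in range(0, len(positive), k)]
    rng.shuffle(fragments)  # permute the fragments (step SA22)
    return "".join(fragments)
```

Because only fragment order changes, the negative sample preserves the length and nucleotide composition of the positive sample; per step SA2, duplicates of existing positives would still need to be filtered out afterwards.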
4. The algorithm of claim 1, wherein constructing the bidirectional gated recurrent neural network model in step S2 comprises:
SB1, preprocessing the data in the data set collected in step S1;
SB2, using a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU) as feature extractors to extract features from the target data set; specifically, the CNN performs convolution operations on the data, with the convolutional layers arranged in parallel and the convolution kernel sizes increasing in turn; the convolved data are then input to the BiGRU, whose output gives the feature information of the sequence;
SB3, capturing information with a multi-head self-attention mechanism, where the multiple heads represent different representation subspaces, according to head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), where the W_i are three different trainable weight matrices and Q, K, V are initialization vectors; finally, all the captured information is concatenated and passed through a fully connected layer to obtain the final global information;
SB4, classifying on the global information obtained in step SB3; specifically, the output of the previous layer is input to a fully connected layer, and the cross-entropy loss function is used for the binary classification task.
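The multi-head self-attention of step SB3 can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the patent's implementation: the projection shapes are taken as square d_model x d_model matrices, each head's W_i^Q, W_i^K, W_i^V is realized as a column block of the full projection, and scaled dot-product attention is assumed for the Attention(...) operator.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).
    Computes head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) per head,
    then concatenates the heads and applies the output projection."""
    seq_len, d_model = X.shape
    dh = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(n_heads):
        q = Q[:, i * dh:(i + 1) * dh]
        k = K[:, i * dh:(i + 1) * dh]
        v = V[:, i * dh:(i + 1) * dh]
        weights = softmax(q @ k.T / np.sqrt(dh))  # (seq_len, seq_len)
        heads.append(weights @ v)                 # (seq_len, dh)
    return np.concatenate(heads, axis=-1) @ Wo    # (seq_len, d_model)
```

In a real model the weight matrices would be learned jointly with the CNN and BiGRU layers rather than fixed as here.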
5. The algorithm of claim 4, wherein the preprocessing in step SB1 converts the nucleotide sequence data into numerical data that can be input to the feature extractor, the preprocessing comprising:
SB11, dictionary coding: the 16 nucleotide symbols contained in the sequences of the data set are mapped to the integers 1, 2, 3, ..., 16, respectively;
SB12, sequence padding: because the CNN input must be of fixed length, each sequence in the data set is padded with the number 0 to the length of the longest sequence in the data set;
SB13, word embedding: since the numerical representation does not reflect the positional relationship between the elements of a sequence, word embedding converts the tokens into vector form, which can correctly represent the relationship between the elements of the sequence.
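Steps SB11 and SB12 can be sketched together as below. This is an illustrative sketch: the helper name is hypothetical, a small 4-letter vocabulary stands in for the 16 dictionary symbols of step SB11, and the word-embedding layer of step SB13 (which would normally follow in the network) is omitted.

```python
def encode_and_pad(sequences, vocab):
    """Dictionary-code each sequence (symbol -> 1..len(vocab)), then
    zero-pad every sequence to the length of the longest one.
    The code 0 is reserved for padding, as in step SB12."""
    code = {symbol: i + 1 for i, symbol in enumerate(vocab)}
    encoded = [[code[s] for s in seq] for seq in sequences]
    max_len = max(len(e) for e in encoded)
    return [e + [0] * (max_len - len(e)) for e in encoded]
```

The resulting fixed-length integer matrix is what an embedding layer would consume to produce the vector representation of step SB13.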
6. The algorithm of claim 1, wherein training and validating the model in step S3 comprises: dividing the data set into a training set and a validation set by five-fold cross-validation, wherein the training set is used to build and train the bidirectional gated recurrent neural network model, and the validation set is used to tune the model parameters, finally yielding the optimal model.
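The five-fold split of claim 6 can be sketched without any ML library (an illustrative helper; the function name and shuffling seed are assumptions, and in practice a utility such as scikit-learn's KFold would serve the same purpose):

```python
import random

def five_fold_splits(n_samples, seed=0):
    """Return five (train_indices, val_indices) pairs in which each
    sample serves as validation data exactly once across the folds."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::5] for i in range(5)]  # five roughly equal folds
    return [
        ([j for k, f in enumerate(folds) if k != i for j in f], folds[i])
        for i in range(5)
    ]
```

Each of the five models is trained on four folds and evaluated on the held-out fold; the hyperparameters that perform best on average across the validation folds define the final model.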
7. The algorithm of claim 1, wherein the prediction in step S4 comprises: when external data need to be predicted, the sequence data are input directly into the trained bidirectional gated recurrent neural network model for prediction, yielding the probability that the external data are an immune-cell silencer.
8. A computer-readable storage medium, characterized in that: a computer program is stored on the medium which, when executed, performs the bidirectional gated recurrent neural network-based silencer prediction algorithm of any one of claims 1 to 7.
9. A computer system, characterized in that: it comprises a processor and a storage medium, a computer program being stored on the storage medium; the processor reads and executes the computer program from the storage medium to perform the bidirectional gated recurrent neural network-based silencer prediction algorithm of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210325550.5A CN114863995B (en) | 2022-03-30 | 2022-03-30 | Silencer prediction method based on bidirectional gating cyclic neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114863995A true CN114863995A (en) | 2022-08-05 |
CN114863995B CN114863995B (en) | 2024-05-07 |
Family
ID=82630315
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210325550.5A Active CN114863995B (en) | 2022-03-30 | 2022-03-30 | Silencer prediction method based on bidirectional gating cyclic neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114863995B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020215694A1 (en) * | 2019-04-22 | 2020-10-29 | 平安科技(深圳)有限公司 | Chinese word segmentation method and apparatus based on deep learning, and storage medium and computer device |
CN111986730A (en) * | 2020-07-27 | 2020-11-24 | 中国科学院计算技术研究所苏州智能计算产业技术研究院 | Method for predicting siRNA silencing efficiency |
CN114121145A (en) * | 2021-11-26 | 2022-03-01 | 安徽大学 | Phage promoter prediction method based on multi-source transfer learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6955580B2 (en) | Document summary automatic extraction method, equipment, computer equipment and storage media | |
US11593556B2 (en) | Methods and systems for generating domain-specific text summarizations | |
CN111563209B (en) | Method and device for identifying intention and computer readable storage medium | |
WO2016062044A1 (en) | Model parameter training method, device and system | |
Bai et al. | NHL Pathological Image Classification Based on Hierarchical Local Information and GoogLeNet‐Based Representations | |
JP2013206187A (en) | Information conversion device, information search device, information conversion method, information search method, information conversion program and information search program | |
WO2016095645A1 (en) | Stroke input method, device and system | |
CN110727765A (en) | Problem classification method and system based on multi-attention machine mechanism and storage medium | |
CN110992941A (en) | Power grid dispatching voice recognition method and device based on spectrogram | |
CN117708339B (en) | ICD automatic coding method based on pre-training language model | |
WO2021223467A1 (en) | Gene regulatory network reconstruction method and system, and device and medium | |
CN114881131A (en) | Biological sequence processing and model training method | |
CN116994745B (en) | Multi-mode model-based cancer patient prognosis prediction method and device | |
CN113886821A (en) | Malicious process identification method and device based on twin network, electronic equipment and storage medium | |
CN117527495A (en) | Modulation mode identification method and device for wireless communication signals | |
CN111091001B (en) | Method, device and equipment for generating word vector of word | |
CN114863995A (en) | Silencer prediction algorithm based on bidirectional gated recurrent neural network | |
CN116010902A (en) | Cross-modal fusion-based music emotion recognition method and system | |
CN109558735A (en) | A kind of rogue program sample clustering method and relevant apparatus based on machine learning | |
CN111552963B (en) | Malicious software classification method based on structure entropy sequence | |
CN111310176B (en) | Intrusion detection method and device based on feature selection | |
CN111159996B (en) | Short text set similarity comparison method and system based on text fingerprint algorithm | |
CN113742525A (en) | Self-supervision video hash learning method, system, electronic equipment and storage medium | |
CN113312619A (en) | Malicious process detection method and device based on small sample learning, electronic equipment and storage medium | |
WO2021072892A1 (en) | Legal provision search method based on neural network hybrid model, and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||