WO2023044931A1 - RNA-protein interaction prediction method, device, medium and electronic equipment - Google Patents

RNA-protein interaction prediction method, device, medium and electronic equipment

Info

Publication number
WO2023044931A1
Authority
WO
WIPO (PCT)
Prior art keywords
rna
protein
sequence
predicted
mer
Prior art date
Application number
PCT/CN2021/121103
Other languages
English (en)
French (fr)
Inventor
王斯凡
张振中
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Priority to CN202180002693.6A (CN116897396A)
Priority to PCT/CN2021/121103 (WO2023044931A1)
Publication of WO2023044931A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, in particular, to a method for predicting RNA-protein interaction, a device for predicting RNA-protein interaction, a computer-readable storage medium and electronic equipment.
  • Noncoding RNA (noncoding RNA, ncRNA) participates in many complex cellular processes, plays an important role in life processes such as alternative splicing, chromatin modification and epigenetics, and is closely related to many diseases. Studies have shown that most non-coding RNAs achieve their regulatory functions by interacting with proteins. Therefore, studying the interaction between non-coding RNA and protein is of great significance for revealing the molecular mechanism of non-coding RNA in human diseases and life activities, and has become one of the important ways to analyze the function of non-coding RNA and protein.
  • the present disclosure provides a method for predicting RNA-protein interaction, a device for predicting RNA-protein interaction, a computer-readable storage medium and electronic equipment.
  • the present disclosure provides a method for predicting RNA-protein interaction, including:
  • vectorizing the RNA-protein pair to be predicted, and obtaining an RNA sequence representation vector and a protein sequence representation vector in the RNA-protein pair to be predicted;
  • an interaction prediction model is used to obtain the interaction prediction value of the RNA-protein pair to be predicted;
  • the interaction between the RNA and the protein is determined based on the interaction prediction value.
  • the feature extraction of the RNA-protein pair to be predicted is performed to obtain the sequence features of the RNA-protein pair to be predicted, including:
  • sequence features of the RNA-protein pair to be predicted are determined according to the original sequence feature set.
  • the determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
  • Each k-mer subsequence is searched in the original sequence feature set, and the sequence feature of the RNA-protein pair to be predicted is obtained according to the search result.
  • the determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
  • RNA sequence and the protein sequence in the RNA-protein pair to be predicted are respectively converted into k-mer subsequences, and the k-mer subsequences include RNA k-mer subsequences and protein k-mer subsequences;
  • the determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
  • RNA sequence and the protein sequence in the RNA-protein pair to be predicted are respectively converted into k-mer subsequences, and the k-mer subsequences include RNA k-mer subsequences and protein k-mer subsequences;
  • the sequence characteristics of the RNA-protein pair to be predicted are composed of the first sequence characteristics and the second sequence characteristics.
  • the vectorization of the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and protein sequence representation vector in the RNA-protein pair to be predicted includes:
  • The RNA sequence and the protein sequence in the RNA-protein pair to be predicted are converted into k-mer subsequences respectively, and the k-mer subsequences include M RNA k-mer subsequences and N protein k-mer subsequences;
  • the vectorization of the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and protein sequence representation vector in the RNA-protein pair to be predicted includes:
  • the obtaining of the interaction prediction value of the RNA-protein pair to be predicted by using the interaction prediction model, based on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, includes:
  • Based on the sequence features of the RNA-protein pair to be predicted, and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, at least three interaction prediction models are used to obtain multiple interaction prediction values of the RNA-protein pair to be predicted.
  • the obtaining of multiple interaction prediction values of the RNA-protein pair to be predicted by using the interaction prediction models, based on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, includes:
  • At least one first interaction model and at least two second interaction prediction models are included; or at least two first interaction models and at least one second interaction prediction model are included.
  • the obtaining of multiple interaction prediction values of the RNA-protein pair to be predicted by using the interaction prediction models, based on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, includes:
  • At least one of the traditional machine learning models and at least two of the deep learning models are included; or, at least two of the traditional machine learning models and at least one of the deep learning models are included.
  • the traditional machine learning model includes at least one of a support vector machine model, a logistic regression model, and a decision tree model;
  • the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.
  • the determining the interaction between the RNA and the protein according to the interaction prediction value includes:
  • said obtaining the original sequence feature set includes:
  • the feature extraction is performed on each RNA-protein pair in the original data set to obtain the original sequence feature set, including:
  • The basic units of RNA and protein are arranged and combined to obtain k-mer subsequences;
  • the original sequence feature set is determined according to the variance of each k-mer subsequence.
  • the calculating of the average number of occurrences of each k-mer subsequence in each RNA-protein pair, and the calculating of the variance of each k-mer subsequence, include:
  • the variance of each k-mer subsequence is calculated according to the average number of occurrences of that k-mer subsequence in each RNA-protein pair and its number of occurrences in each RNA-protein pair.
  • the calculating of the variance of each k-mer subsequence according to its occurrences in each RNA-protein pair includes:
  • n is the number of RNA-protein pairs in the original data set
  • m is the average value of the number of occurrences of each k-mer subsequence in each RNA-protein pair
  • x_n is the number of occurrences of each k-mer subsequence in the n-th RNA-protein pair.
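The formula that these three definitions belong to did not survive in the text. A plausible reconstruction is sketched below, assuming the ordinary population variance (division by n rather than n - 1 is an assumption, as are the summation index j and the symbol s^2 for the variance); m and x_j follow the definitions above.

```latex
m = \frac{1}{n}\sum_{j=1}^{n} x_j,
\qquad
s^2 = \frac{1}{n}\sum_{j=1}^{n}\left(x_j - m\right)^2
```

Under this reading, a k-mer subsequence that occurs equally often in every RNA-protein pair has variance 0, while one whose count differs strongly between pairs has a large variance.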
  • the determining the original sequence feature set according to the variance of each k-mer sequence includes:
  • the k-mer subsequences satisfying the preset condition are determined according to the variance of each k-mer subsequence, and the original sequence feature set is composed of the k-mer subsequences satisfying the preset condition.
  • performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further includes:
  • the frequency of occurrence of each k-mer subsequence pair in the original data set is counted, and the original sequence feature set is composed of k-mer subsequence pairs satisfying the preset condition of the frequency of occurrence.
  • the method further includes:
  • the interaction prediction model is trained.
  • the training the interaction prediction model includes:
  • RNA-protein pair a positive example RNA-protein pair and a negative example RNA-protein pair in the training data set
  • the training data set is used as the input of the interaction prediction model, and the model parameters of the interaction prediction model are iteratively updated. When the iteration termination condition is satisfied, the training of all model parameters is completed to use the trained The interaction prediction model predicts the interaction of the RNA-protein pair to be predicted.
  • the method further includes:
  • a prediction result of the interaction between the RNA and the protein is output.
  • RNA-protein interaction prediction device including:
  • the data acquisition module is used to acquire the RNA-protein pair to be predicted
  • a feature extraction module configured to perform feature extraction on the RNA-protein pair to be predicted, to obtain sequence features of the RNA-protein pair to be predicted;
  • a data vectorization module configured to vectorize the RNA-protein pair to be predicted, and obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
  • an interaction prediction module configured to obtain the interaction prediction value of the RNA-protein pair to be predicted by using an interaction prediction model, based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
  • An interaction determination module configured to determine the interaction between the RNA and the protein according to the interaction prediction value.
  • the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, any one of the methods described above is implemented.
  • the present disclosure provides an electronic device, including: a processor; and a memory, configured to store executable instructions of the processor; wherein the processor is configured to perform any one of the methods described above by executing the executable instructions.
  • Figure 1 shows a schematic diagram of an exemplary system architecture of an RNA-protein interaction prediction method and device that can be applied to an embodiment of the present disclosure
  • Figure 2 schematically shows a flow chart of an RNA-protein interaction prediction method according to an embodiment of the present disclosure
  • Fig. 3 schematically shows a flow chart of determining the sequence characteristics of the RNA-protein pair to be predicted according to an embodiment of the present disclosure
  • Fig. 4 schematically shows a flow chart of obtaining an original sequence feature set according to an embodiment of the present disclosure
  • Fig. 5 schematically shows a flow chart of training an interaction prediction model according to an embodiment of the present disclosure
  • Figure 6 schematically shows a block diagram of an RNA-protein interaction prediction device according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments may, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • numerous specific details are provided in order to give a thorough understanding of embodiments of the present disclosure.
  • those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced with one or more of the specific details omitted, or that other methods, components, devices, steps, etc. may be adopted.
  • well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
  • FIG. 1 shows a schematic diagram of the system architecture of an exemplary application environment in which a method and device for predicting RNA-protein interaction according to an embodiment of the present disclosure can be applied.
  • the system architecture 100 may include one or more of terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • Terminal devices 101, 102, 103 may be various electronic devices, including but not limited to desktop computers, portable computers, smart phones, and tablet computers. It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are only illustrative. According to the implementation needs, there can be any number of terminal devices, networks and servers.
  • the server 105 may be one server, or a server cluster composed of multiple servers, or a cloud computing platform or a virtualization center.
  • the server 105 can be used to perform: obtaining the RNA-protein pair to be predicted; performing feature extraction on the RNA-protein pair to be predicted to obtain its sequence features; vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the pair; based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the pair, obtaining the interaction prediction value of the RNA-protein pair by using the interaction prediction model; and determining the interaction between the RNA and the protein according to the interaction prediction value.
  • the RNA-protein interaction prediction method provided by the embodiments of the present disclosure is generally executed by the server 105.
  • the RNA-protein interaction prediction device is generally set in the server 105, and the server can send the prediction result of the RNA-protein interaction to be predicted to the terminal device, which displays it to the user.
  • the RNA-protein interaction prediction method provided by the embodiments of the present disclosure can also be executed by one or more of the terminal devices 101, 102, and 103.
  • the prediction device can also be set in the terminal equipment 101, 102, 103, for example, after being executed by the terminal equipment, the prediction result can be directly displayed on the display screen of the terminal equipment, or the prediction result can be provided to the user through voice broadcast, This is not specifically limited in this exemplary embodiment.
  • ncRPI noncoding RNA-protein interactions
  • RNA-protein interaction prediction method may include the following steps S210 to S250:
  • Step S210 Obtain the RNA-protein pair to be predicted
  • Step S220 Perform feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted;
  • Step S230 Vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and protein sequence representation vector in the RNA-protein pair to be predicted;
  • Step S240 Based on the sequence features of the RNA-protein pair to be predicted, and the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, use an interaction prediction model to obtain the interaction prediction value of the RNA-protein pair to be predicted;
  • Step S250 Determine the interaction between the RNA and the protein according to the interaction prediction value.
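Steps S210 to S250 can be read as a simple pipeline. The sketch below is only an illustration of that control flow: the three callables (`extract_features`, `vectorize`, `model`) are hypothetical placeholders supplied by the caller, and the fixed `threshold` used for step S250 is an assumption, since the text does not state how the prediction value is turned into a decision.

```python
from typing import Callable, Sequence, Tuple

Features = Sequence[float]


def predict_pair(
    rna: str,
    protein: str,
    extract_features: Callable[[str, str], Features],            # S220, hypothetical
    vectorize: Callable[[str, str], Tuple[Features, Features]],  # S230, hypothetical
    model: Callable[[Features, Features, Features], float],      # S240, hypothetical
    threshold: float = 0.5,                                      # S250, assumed rule
) -> Tuple[float, bool]:
    """Run S220 to S250 for one RNA-protein pair obtained in S210."""
    seq_features = extract_features(rna, protein)      # S220: sequence features
    rna_vec, protein_vec = vectorize(rna, protein)     # S230: representation vectors
    score = model(seq_features, rna_vec, protein_vec)  # S240: interaction prediction value
    return score, score >= threshold                   # S250: decide interaction
```

Passing the stages in as callables keeps the sketch independent of any particular feature extractor or model.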
  • In the RNA-protein interaction prediction method, the RNA-protein pair to be predicted is obtained; feature extraction is performed on the RNA-protein pair to be predicted to obtain its sequence features; the RNA-protein pair to be predicted is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector in the pair; based on the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein sequence representation vector in the pair, the interaction prediction model is used to obtain the interaction prediction value of the RNA-protein pair to be predicted; and the interaction between the RNA and the protein is determined according to the interaction prediction value.
  • On the one hand, RNA sequences and protein sequences can be fully mined, so as to accurately predict the interaction between RNA and proteins; on the other hand, the effective combination of the characteristics of the interaction prediction models can further improve the accuracy of predicting the interaction between RNA and proteins.
  • step S210 the RNA-protein pair to be predicted is obtained.
  • The RNA-protein pair to be predicted can be obtained; the interaction between the RNA and the protein in each RNA-protein pair to be predicted is unknown.
  • the user can input the RNA-protein pair to be predicted through the terminal device.
  • the user may manually input the RNA-protein pair to be predicted, or input the RNA-protein pair to be predicted by voice, which is not specifically limited in this example.
  • an RNA can be input, and then a protein can be input, and there is no limitation on the input order of the two.
  • RNA and protein can be entered into different text boxes, or they can be entered into the same text box. For example, after the input is completed, click the "Start Prediction" button, and then start to execute the prediction steps provided in some embodiments of the present application.
  • The interaction between RNA and protein means that the function of the protein is reflected in its interaction with other proteins and with RNA.
  • protein-RNA interactions play an important role in protein synthesis.
  • many functions of RNA are also inseparable from the interaction with proteins.
  • the interaction can be regulation, guidance, etc., and is not limited here.
  • In the presence of an interaction, RNA can guide protein synthesis, or RNA can regulate protein function.
  • the interaction between RNA and protein can also mean that the two can regulate each other's life cycle and function through physical interaction.
  • the RNA coding sequence can guide protein synthesis, and correspondingly, the protein can also regulate the expression and function of RNA.
  • the prediction result of the RNA-protein interaction to be predicted can also be output to the terminal device for users to view.
  • the prediction result may be directly displayed on the display screen of the terminal device, or the prediction result may be provided to the user through voice broadcast, which is not specifically limited in this example.
  • At least one RNA sequence to be predicted can also be obtained, and a protein sequence that interacts with each input RNA sequence to be predicted can be searched in the database through the interaction prediction model.
  • at least one protein sequence in the database can be selected, multiple RNA-protein pairs are formed from the RNA sequence to be predicted and each selected protein sequence, the interaction prediction model then predicts the interaction of each RNA-protein pair, and the protein sequences that can interact with the RNA sequence to be predicted are output according to the prediction result.
  • several types of protein sequences can be stored in the database in advance, so as to be recalled when predicting the interaction of RNA-protein pairs.
  • the protein sequence can be stored in the Redis database or in the MySQL database, and then the protein sequence to be predicted can be queried and selected in real time.
  • Redis is a key-value storage system.
  • the Redis database can include: a key-value pair (key-value) formed by a sequence identifier and a corresponding protein sequence, wherein the key (key) is a sequence identifier, and the value ( value) is the corresponding protein sequence.
  • Redis can support more than 100K read and write operations per second, and has certain advantages in data reading and storage speed.
  • MySQL is a relational database management system. A relational database stores data in separate tables instead of storing all data in one place, which increases storage speed and flexibility. It is stable for data storage and can avoid data loss.
  • RNA sequences can also be stored in the database in advance, so as to be recalled when predicting the interaction of RNA-protein pairs. Therefore, at least one protein sequence to be predicted can also be obtained, and an RNA sequence that interacts with each input protein sequence to be predicted can be searched in the database through the interaction prediction model. Similarly, after the user enters the protein sequence through the terminal device, at least one RNA sequence in the database can be selected to form multiple RNA-protein pairs from the protein sequence to be predicted and each RNA sequence, and then the interaction prediction model can be used to predict each The interaction of RNA-protein pairs, and outputting the RNA sequence that can interact with the protein sequence to be predicted according to the prediction result, which is not specifically limited in the present disclosure.
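The database screening described above (pair one query RNA with every stored protein, score each pair, report the likely interactors) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: a plain Python dict stands in for the Redis key-value store (sequence identifier as key, protein sequence as value), `predict` is a hypothetical scoring function supplied by the caller, and the `threshold` is an assumed decision rule.

```python
from typing import Callable, Dict, List, Tuple


def screen_proteins(
    rna: str,
    protein_db: Dict[str, str],            # stand-in for Redis: identifier -> sequence
    predict: Callable[[str, str], float],  # hypothetical interaction predictor
    threshold: float = 0.5,                # assumed cut-off on the prediction value
) -> List[Tuple[str, float]]:
    """Return (identifier, score) for stored proteins predicted to interact with rna."""
    hits = []
    for seq_id, protein in protein_db.items():
        score = predict(rna, protein)      # one RNA-protein pair per stored protein
        if score >= threshold:
            hits.append((seq_id, score))
    # Highest-scoring candidates first.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

The symmetric case, a query protein screened against stored RNA sequences, follows by swapping the roles of the two arguments.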
  • step S220 feature extraction is performed on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted.
  • Before using the interaction prediction model to predict the interaction of each RNA-protein pair to be predicted, it is necessary to obtain the input features of the interaction prediction model.
  • feature extraction can be performed on the RNA-protein pair to be predicted, that is, feature extraction is performed sequentially on the RNA sequence and protein sequence in the RNA-protein pair to be predicted, to obtain corresponding RNA sequence features and protein sequence features,
  • the sequence features of the RNA-protein pair to be predicted are composed of RNA sequence features and protein sequence features, and the sequence features can be used as the input of the interaction prediction model.
  • the RNA-protein pair to be predicted can also be vectorized, that is, the RNA sequence and the protein sequence in the RNA-protein pair are vectorized respectively to obtain the corresponding RNA sequence representation vector and protein representation vector, and the RNA sequence representation vector and the protein representation vector are each used as input to the interaction prediction model.
  • It is also possible to perform feature extraction and vectorization on the RNA-protein pair to be predicted at the same time, to obtain the sequence features of the RNA-protein pair to be predicted and the RNA sequence representation vector and the protein representation vector in the RNA-protein pair to be predicted.
  • the sequence features of the RNA-protein pair to be predicted, together with the RNA sequence representation vector and the protein representation vector in the RNA-protein pair to be predicted, can then be used as the input of the interaction prediction model, which is not specifically limited in the present disclosure.
  • feature extraction can be performed on the RNA-protein pair to be predicted according to step S310 and step S320 .
  • step S310 Obtain the original sequence feature set.
  • the original data set can be obtained, and feature extraction is performed on each RNA-protein pair in the original data set to obtain the original sequence feature set.
  • the RPI1807 data set can be used as the original data set.
  • This data set can contain 3243 RNA-protein pairs, and 1807 pairs of positive examples and 1436 pairs of negative examples are included in the 3243 RNA-protein pairs.
  • a positive example can indicate that there is an interaction between the RNA and the protein in the RNA-protein pair
  • a negative example can indicate that there is no interaction between the RNA and the protein in the RNA-protein pair.
  • the RPI2241 data set, the RPI369 data set, etc. may also be used as the original data set for experimentation, which is not specifically limited in the present disclosure.
  • feature extraction can be performed on the RNA-protein pairs in the original data set according to steps S410 to S430 to obtain the original sequence feature set.
  • Step S410 Arranging and combining the basic units of RNA and protein to obtain k-mer subsequences.
  • bases are the basic units of RNA.
  • An RNA sequence can include four kinds of bases, namely adenine (A), uracil (U), guanine (G) and cytosine (C). All k-mer subsequences of the RNA sequence can be obtained by permuting and combining the four bases.
  • amino acids are the basic units of proteins.
  • A protein sequence can include 20 amino acids, which are coded in order as A, G, V, I, L, F, P, Y, M, T, S, H, N, Q, W, R, K, D, E, C.
  • the 20 amino acids can be divided into 7 classes, {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E} and {C}, and each class of amino acid is recoded, for example as 1, 2, 3, 4, 5, 6 and 7.
  • the protein sequence ALQDVG can be converted to 124611. Then all the k-mer subsequences of the amino acid sequence can be obtained by permuting and combining the seven types of amino acids.
  • the 20 amino acids can also be classified according to the composition of the amino acids, and the k-mer subsequence of the amino acid sequence can be obtained directly based on the arrangement and combination of the 20 amino acids without classification, which is not specifically limited in this disclosure .
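The seven-class recoding can be written as a small lookup table. The sketch below uses the class membership and the codes 1 to 7 exactly as listed above; the names `AMINO_ACID_CLASSES`, `CLASS_OF` and `recode_protein` are illustrative, and raising `KeyError` on a residue outside the 20 listed letters is a design choice of this sketch rather than something the text specifies.

```python
# Seven amino acid classes in the order given in the text; class i is coded as str(i).
AMINO_ACID_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each one-letter amino acid code to its class digit, e.g. "A" -> "1", "C" -> "7".
CLASS_OF = {
    amino_acid: str(index)
    for index, group in enumerate(AMINO_ACID_CLASSES, start=1)
    for amino_acid in group
}


def recode_protein(sequence: str) -> str:
    """Replace every amino acid letter with the digit of its class."""
    return "".join(CLASS_OF[amino_acid] for amino_acid in sequence.upper())


print(recode_protein("ALQDVG"))  # 124611, matching the example in the text
```

Recoding shrinks the protein alphabet from 20 symbols to 7, which is what keeps the number of possible protein k-mers manageable.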
  • the k-mer subsequence refers to a subsequence formed by taking k bases or k amino acids as a group.
  • the k-mer subsequence may include an RNA k-mer subsequence and a protein k-mer subsequence.
  • the k-mer subsequence may refer to an RNA k-mer subsequence obtained by permuting and combining four types of bases, and for a certain value of k, 4^k types of k-mer subsequences may be obtained.
  • a k-mer subsequence may also refer to a protein k-mer subsequence obtained by permuting and combining 7 types of amino acids, and 7^k types of k-mer subsequences can be obtained for a certain k value. It can be understood that the classification of the 20 amino acids into 7 categories is only illustrative, and the amino acids may also be left unclassified. Similarly, the four bases of the RNA sequence can also be classified according to actual needs.
  • one or more values of k may be used, and the specific value of k may be adjusted according to actual conditions, which is not limited herein.
  • two values of 3 and 4 may be taken as an example for illustration.
  • AAA and AUC are two 3-mer subsequences of RNA sequences
  • AAAA and AAAU are two 4-mer subsequences of RNA sequences.
  • 111 and 112 are two 3-mer subsequences of the protein sequence, and 1111 and 1122 are two 4-mer subsequences of the protein sequence.
  • k may only be 3 or only 4, which is not specifically limited in the present disclosure.
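Enumerating every possible k-mer over an alphabet is a Cartesian product, which makes the 4^k and 7^k counts easy to check. The sketch below is one way to do it with the standard library; the function name `all_kmers` and the use of the digits 1 to 7 as the protein class alphabet are illustrative choices.

```python
from itertools import product
from typing import List

RNA_ALPHABET = "AUGC"               # the four bases
PROTEIN_CLASS_ALPHABET = "1234567"  # the seven recoded amino acid classes


def all_kmers(alphabet: str, k: int) -> List[str]:
    """Return every length-k string over the alphabet (len(alphabet) ** k of them)."""
    return ["".join(symbols) for symbols in product(alphabet, repeat=k)]


rna_3mers = all_kmers(RNA_ALPHABET, 3)                # 4 ** 3 = 64 subsequences
rna_4mers = all_kmers(RNA_ALPHABET, 4)                # 4 ** 4 = 256 subsequences
protein_3mers = all_kmers(PROTEIN_CLASS_ALPHABET, 3)  # 7 ** 3 = 343 subsequences
protein_4mers = all_kmers(PROTEIN_CLASS_ALPHABET, 4)  # 7 ** 4 = 2401 subsequences
```

With k = 3 and k = 4 together this gives 320 candidate RNA k-mers and 2744 candidate protein k-mers.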
  • Step S420 Calculate the average number of occurrences of each k-mer subsequence in each RNA-protein pair, and calculate the variance of each k-mer subsequence according to the average occurrence number.
  • all 3-mer subsequences and 4-mer subsequences of RNA sequences and protein sequences can be obtained according to step S410, that is, 64 kinds of 3-mer subsequences and 256 kinds of 4-mer subsequences of RNA sequences, and 343 kinds of 3-mer subsequences and 2401 kinds of 4-mer subsequences of protein sequences.
  • the average number of occurrences of each 3-mer subsequence or 4-mer subsequence in each RNA-protein pair in the original data set can be calculated, and the variance of each 3-mer subsequence or 4-mer subsequence can be calculated based on the average number of occurrences.
  • the RNA sequences and protein sequences can be converted into 3-mer subsequences and 4-mer subsequences.
  • the 3-mer subsequences of the sequence may include "AGA", "GAU", "AUG" and "UGG",
  • and the 4-mer subsequences of the sequence may include "AGAU", "GAUG" and "AUGG"; that is, the RNA sequence can be read with forward overlap to obtain the corresponding 3-mer subsequences or 4-mer subsequences.
  • the RNA sequence can also be read with reverse overlap to obtain the corresponding 3-mer subsequences or 4-mer subsequences.
  • the 3-mer subsequences of this sequence can then also include "GGU", "GUA", "UAG" and "AGA",
  • and the 4-mer subsequences of this sequence can also include "GGUA", "GUAG" and "UAGA".
  • the RNA sequence can also be read in a non-overlapping manner to obtain the corresponding 3-mer subsequences or 4-mer subsequences; for example, the 3-mer subsequences of the sequence can also include "AGA" and "UGG". This disclosure does not specifically limit the reading manner.
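The three reading modes differ only in direction and step size. The sketch below assumes the example sequence is AGAUGG, which is inferred from the listed subsequences rather than stated in the surviving text; the function name `read_kmers` and its flags are illustrative.

```python
from typing import List


def read_kmers(sequence: str, k: int, overlapping: bool = True,
               reverse: bool = False) -> List[str]:
    """Split a sequence into k-mers.

    overlapping=True slides the window one position at a time;
    overlapping=False jumps k positions, so windows do not share symbols.
    reverse=True reads the sequence from its last symbol to its first.
    """
    seq = sequence[::-1] if reverse else sequence
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


rna = "AGAUGG"  # assumed example sequence, inferred from the k-mers in the text
forward_3 = read_kmers(rna, 3)                      # AGA, GAU, AUG, UGG
forward_4 = read_kmers(rna, 4)                      # AGAU, GAUG, AUGG
reverse_3 = read_kmers(rna, 3, reverse=True)        # GGU, GUA, UAG, AGA
reverse_4 = read_kmers(rna, 4, reverse=True)        # GGUA, GUAG, UAGA
disjoint_3 = read_kmers(rna, 3, overlapping=False)  # AGA, UGG
```

In non-overlapping mode any trailing symbols that do not fill a whole window are dropped, which is one possible convention among several.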
  • the number of occurrences of each 3-mer subsequence and/or 4-mer subsequence in each RNA-protein pair can be determined. Summing these counts over all RNA-protein pairs gives the total number of occurrences of the subsequence in the original data set, from which the average number of occurrences of each 3-mer subsequence and/or 4-mer subsequence in each RNA-protein pair can be calculated. Finally, the variance of each subsequence can be calculated from its average number of occurrences in each RNA-protein pair and its number of occurrences in each RNA-protein pair.
  • the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or a 4-mer subsequence of an RNA sequence or a protein sequence.
  • The total number of occurrences of the subsequence in the RPI1807 data set can be counted first. For example, the n RNA-protein pairs (n = 3243) in the RPI1807 data set can be traversed, and the number of occurrences of the subsequence in each RNA-protein pair can be counted as x_1, x_2, ..., x_n.
  • the total number of occurrences of the subsequence in the RPI1807 data set is obtained, which is recorded as num_i.
  • the average number of occurrences of the subsequence in each RNA-protein pair can be calculated according to the total number of occurrences num_i, that is, according to: m_i = num_i / n
  • the variance of the subsequence can be calculated by the average number of occurrences of the i-th k-mer subsequence in each RNA-protein pair and the number of occurrences in each RNA-protein pair, that is, according to:
  • n is the number of RNA-protein pairs in the RPI1807 data set
  • m_i is the average number of occurrences of the subsequence in each RNA-protein pair
  • x_n is the number of occurrences of the subsequence in the n-th RNA-protein pair
  • x_1 is the number of occurrences of the subsequence in the first RNA-protein pair
  • x_2 is the number of occurrences of the subsequence in the second RNA-protein pair.
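  • A minimal sketch of the mean and variance calculation, under stated assumptions: the mean is taken as m_i = num_i / n, and the variance as var_i = ((x_1 - m_i)^2 + ... + (x_n - m_i)^2) / n. The divisor n (population variance) and the name var_i are assumptions, since the variance formula itself is not reproduced in this text. The toy sequences and helper names are hypothetical.

```python
def count_overlapping(seq, sub):
    """Count occurrences of sub in seq, allowing overlapping matches."""
    width = len(sub)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == sub)

def mean_and_variance(counts):
    """counts holds x_1 .. x_n, one occurrence count per RNA-protein pair."""
    n = len(counts)
    num_i = sum(counts)                                  # total occurrences
    m_i = num_i / n                                      # average per pair
    var_i = sum((x - m_i) ** 2 for x in counts) / n      # assumed divisor: n
    return m_i, var_i

# Toy data: only the RNA side of three hypothetical RNA-protein pairs.
rna_sequences = ["AGAUGG", "AGAAGA", "CCCUGG"]
counts = [count_overlapping(s, "AGA") for s in rna_sequences]
m_i, var_i = mean_and_variance(counts)
print(counts, m_i, var_i)
```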
  • Step S430 Determine the original sequence feature set according to the variance of each k-mer subsequence.
  • the k-mer subsequences that meet the preset conditions can be determined according to the variance of each k-mer subsequence, and the original sequence feature set is composed of the k-mer subsequences that meet the preset conditions.
  • all the 3-mer subsequences and 4-mer subsequences of the RNA sequence and all the 3-mer subsequences and 4-mer subsequences of the protein sequence can be sorted according to the variance, for example in descending order, and the top-ranked k-mer subsequences can be selected to constitute the original sequence feature set.
  • the top 560 k-mer subsequences can be selected, and these 560 k-mer subsequences form the original sequence feature set.
  • it can include the top 60 RNA 3-mer subsequences, the top 200 RNA 4-mer subsequences, the top 200 protein 3-mer subsequences, and the top 100 protein 4-mer subsequences.
  • the number of selected k-mer subsequences is only illustrative, and any number of k-mer subsequences can be selected according to actual needs.
  • a variance threshold may also be preset, and k-mer subsequences with a variance greater than the threshold are screened out, and the filtered k-mer subsequences form the original sequence feature set.
  • the preset variance threshold is 3
  • k-mer subsequences with a variance greater than 3 can be selected to form the original sequence feature set.
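  • Both selection rules (top-ranked by variance, or variance above a preset threshold) can be sketched as below. The variance values are made up for illustration, and the function names `select_top` and `select_above` are hypothetical.

```python
def select_top(variances, top_n):
    """Keep the top_n subsequences with the largest variance (ties broken by name)."""
    ranked = sorted(variances, key=lambda s: (-variances[s], s))
    return ranked[:top_n]

def select_above(variances, threshold):
    """Keep every subsequence whose variance is greater than the threshold."""
    return [s for s in variances if variances[s] > threshold]

# Made-up variances for four RNA 3-mer subsequences.
variances = {"AGA": 4.2, "GAU": 0.7, "AUG": 3.1, "UGG": 5.0}

top2 = select_top(variances, 2)
above3 = select_above(variances, 3)
print(top2, above3)
```

  • In the 560-feature example above, the ranking would be applied separately to each of the four groups (RNA 3-mers, RNA 4-mers, protein 3-mers, protein 4-mers) with its own cut-off.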
  • the number of occurrences of each k-mer subsequence in the original data set can be counted, and the occurrence frequency of each k-mer subsequence in the original data set can be calculated according to the number of occurrences.
  • the ratio between the number of occurrences and the total number of RNA-protein pairs in the original data set can be calculated to obtain the occurrence frequency of the subsequence in the original data set.
  • each k-mer subsequence is flagged for its presence in each RNA-protein pair.
  • the variance of each k-mer subsequence can be calculated according to the occurrence frequency of each k-mer subsequence in the original data set and the flag value in each RNA-protein pair.
  • the number of occurrences of the i-th k-mer subsequence in the RPI1807 data set obtained by statistics is recorded as num_i, and then the occurrence frequency of the subsequence in the RPI1807 data set, that is, its occurrence frequency Freq_i in each RNA-protein pair, can be calculated according to the number of occurrences num_i.
  • all 3-mer subsequences and 4-mer subsequences of the RNA sequence and all 3-mer subsequences and 4-mer subsequences of the protein sequence can be respectively sorted according to the size of the variance, for example in descending order, and the top-ranked k-mer subsequences can be selected to form the original sequence feature set.
  • a variance threshold may also be preset, and k-mer subsequences with a variance greater than the threshold are screened out, and the filtered k-mer subsequences form the original sequence feature set.
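  • The flag-based variant can be sketched as follows, under stated assumptions: the flag is taken to be 1 when the subsequence is present in a pair and 0 otherwise, Freq_i is the number of flagged pairs divided by n, and the variance uses the divisor n. None of these details is spelled out in the text, and the toy pairs are hypothetical.

```python
def presence_variance(flags):
    """flags[j] is 1 if the subsequence occurs in the j-th RNA-protein pair, else 0."""
    n = len(flags)
    num_i = sum(flags)                                    # pairs containing the subsequence
    freq_i = num_i / n                                    # occurrence frequency
    var_i = sum((f - freq_i) ** 2 for f in flags) / n     # assumed divisor: n
    return freq_i, var_i

# Hypothetical (RNA, protein) pairs; the protein uses the 7-class digit alphabet.
pairs = [("AGAUGG", "1312137"), ("CCCUGG", "7771234"),
         ("AGAAGA", "1231231"), ("GGGCCC", "3737373")]
flags = [1 if "AGA" in rna else 0 for rna, _ in pairs]
freq_i, var_i = presence_variance(flags)
print(flags, freq_i, var_i)
```

  • With binary flags this variance reduces to Freq_i * (1 - Freq_i), so subsequences present in about half of the pairs rank highest.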
  • the k-mer feature of each RNA-protein pair in the original data set may be extracted, and the extracted k-mer feature of the RNA sequence and the k-mer feature of the protein sequence form the original sequence feature set.
  • the k-mer feature may include monomer component information (that is, each base contained) and sequence order information of the RNA sequence. Therefore, using the k-mer feature can better describe an RNA sequence, that is, an RNA sequence can be more accurately determined according to the k-mer feature, and different RNA sequences can also be distinguished by the k-mer feature.
  • the frequent itemset features of each RNA-protein pair in the original data set can also be extracted, and the original sequence feature set is composed of the extracted frequent itemset features.
  • the frequent itemset feature can combine the k-mer feature of the RNA sequence and the k-mer feature of the protein sequence. Therefore, using frequent itemset features can better distinguish between interacting and non-interacting RNA-protein pairs. It is also possible to extract k-mer features and frequent itemset features at the same time, and combine them to form the original sequence feature set. By combining the characteristics of k-mer features and frequent itemset features, it is possible to predict unknown RNA-protein pairs more accurately.
  • the interaction between RNA and protein is not specifically limited in this disclosure.
  • the frequent itemset feature refers to a k-mer subsequence pair, composed of an RNA k-mer subsequence and a protein k-mer subsequence, that has a certain degree of support in the original data set, where the support of an itemset {A, B} refers to the proportion of records in which A and B appear together.
  • for example, (AAU, 137) denotes a 3-mer subsequence pair composed of the RNA 3-mer subsequence AAU and the protein 3-mer subsequence 137.
  • the support degree of this subsequence pair is the ratio of RNA-protein pairs containing both subsequences AAU and 137 to all RNA-protein pairs in the original data set.
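  • The support of a subsequence pair, as defined above, can be sketched as below. The four RNA-protein pairs are hypothetical toy data and the function name `support` is illustrative only.

```python
def support(pairs, rna_kmer, protein_kmer):
    """Fraction of RNA-protein pairs that contain both subsequences."""
    hits = sum(1 for rna, protein in pairs
               if rna_kmer in rna and protein_kmer in protein)
    return hits / len(pairs)

# Hypothetical toy data set of (RNA sequence, protein sequence) pairs.
pairs = [("GAAUCC", "2137"), ("AAUAAU", "1371"),
         ("CCAAUG", "1234"), ("GGGCCC", "1377")]

s = support(pairs, "AAU", "137")
print(s)
```

  • A substring test is used here because a k-mer occurs among the overlapping k-mers of a sequence exactly when it is a substring of that sequence.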
  • the RNA sequence and protein sequence in each RNA-protein pair can be converted into k-mer subsequences respectively to obtain k-mer subsequence pairs.
  • the frequency of occurrence of each k-mer subsequence pair in the original data set is counted, and the k-mer subsequence pair that meets the preset condition of the frequency of occurrence is used as the frequent itemset feature and constitutes the original sequence feature set.
  • the RNA sequences and protein sequences of all positive RNA-protein pairs in the RPI1807 dataset can be converted into positive 3-mer subsequences and positive 4-mer subsequences, respectively.
  • the RNA sequences and protein sequences of all negative RNA-protein pairs in this dataset are converted into negative 3-mer subsequences and negative 4-mer subsequences, respectively.
  • all positive and negative RNA 3-mer subsequences, positive and negative RNA 4-mer subsequences, positive and negative protein 3-mer subsequences, and positive and negative protein 4-mer subsequences in the data set can be found.
  • RNA 3-mer subsequences and protein 3-mer subsequences, and RNA 4-mer subsequences and protein 4-mer subsequences, in the data set are cross-combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs.
  • Exemplarily, positive example RNA 3-mer subsequences and positive example protein 3-mer subsequences can be cross-combined to obtain positive example 3-mer subsequence pairs.
  • Negative RNA 3-mer subsequences and negative protein 3-mer subsequences can be cross-combined to obtain negative 3-mer subsequence pairs.
  • the positive example RNA 4-mer subsequence and the positive example protein 4-mer subsequence can be cross-combined to obtain the positive example 4-mer subsequence pair.
  • the negative example RNA 4-mer subsequence and the negative example protein 4-mer subsequence can be cross-combined to obtain the negative example 4-mer subsequence pair.
  • the occurrence frequency of each subsequence pair in the data set can be counted. For example, for any positive example 3-mer subsequence pair, it can be based on: occurrence frequency = num / NUM
  • num is the number of occurrences of the positive 3-mer subsequence pair in the data set
  • NUM is the total number of occurrences of all positive 3-mer subsequence pairs in the data set.
  • all 3-mer subsequence pairs and 4-mer subsequence pairs can be sorted according to the frequency of occurrence, for example in descending order, and the top-ranked k-mer subsequence pairs can be selected to form frequent itemsets. For example, all the positive 3-mer subsequence pairs are sorted in descending order, and the first m 3-mer subsequence pairs can be selected to form the frequent itemset A1. All the positive 4-mer subsequence pairs are sorted in descending order, and the first n 4-mer subsequence pairs can be selected to form the frequent itemset A2.
  • All the negative 3-mer subsequence pairs are sorted in descending order, and the first p 3-mer subsequence pairs can be selected to form the frequent itemset A3.
  • All the negative 4-mer subsequence pairs are sorted in descending order, and the first q 4-mer subsequence pairs can be selected to form the frequent itemset A4. Then the original sequence feature set is composed of these four frequent itemsets A1, A2, A3 and A4.
  • the occurrence frequency threshold can also be preset, and the k-mer subsequence pairs whose occurrence frequency is greater than the threshold are filtered out, and the filtered k-mer subsequence pairs are used as frequent itemset features to form the original sequence feature set , which is not specifically limited in the present disclosure.
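  • The construction of one frequent itemset (for example A1, from the positive 3-mer subsequence pairs) can be sketched as below. It assumes that a subsequence pair is counted once for every RNA-protein pair in which both subsequences occur; the text does not fix the counting rule. The toy data and helper names are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_set(seq, k):
    """Distinct overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pair_frequencies(pairs, k):
    """Occurrence frequency num / NUM of every cross-combined k-mer subsequence pair."""
    counts = Counter()
    for rna, protein in pairs:
        for r, p in product(kmer_set(rna, k), kmer_set(protein, k)):
            counts[f"{r}_{p}"] += 1            # assumed: one count per RNA-protein pair
    total = sum(counts.values())               # NUM
    return {item: num / total for item, num in counts.items()}

def top_items(freqs, m):
    """The m most frequent subsequence pairs (ties broken by name)."""
    return sorted(freqs, key=lambda item: (-freqs[item], item))[:m]

positive_pairs = [("AAUC", "1371"), ("AAUG", "1372")]   # hypothetical positive examples
freqs = pair_frequencies(positive_pairs, 3)
A1 = top_items(freqs, 1)
print(A1, freqs["AAU_137"])
```

  • Running the same two functions on the negative pairs, and with k = 4, would give the remaining itemsets A2, A3 and A4.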
  • the RNA sequence and protein sequence in each RNA-protein pair can be converted into k-mer subsequences respectively, and the first candidate item set is composed of the k-mer subsequences, which include RNA k-mer subsequences and protein k-mer subsequences.
  • the RNA sequence and protein sequence of each RNA-protein pair in the RPI1807 data set can be converted into 3-mer subsequences and 4-mer subsequences respectively.
  • By traversing the RPI1807 data set, all RNA 3-mer subsequences, RNA 4-mer subsequences, protein 3-mer subsequences and protein 4-mer subsequences in the data set can be found, and all the 3-mer subsequences and 4-mer subsequences in the data set form the first candidate itemset C1.
  • the occurrence frequency of each k-mer subsequence in the first candidate item set C1 in the original data set can be counted.
  • the subsequence may be a 3-mer subsequence of an RNA sequence or a protein sequence, or may be a 4-mer subsequence of an RNA sequence or a protein sequence.
  • the number of occurrences of the j-th k-mer subsequence in the RPI1807 data set obtained by statistics is recorded as num_j, and then the occurrence frequency Freq_j of the subsequence in the RPI1807 data set can be calculated according to the number of occurrences num_j.
  • the occurrence frequency in the RPI1807 data set of each 3-mer subsequence or 4-mer subsequence in the first candidate item set C1 can be calculated.
  • all 3-mer subsequences and 4-mer subsequences can be screened according to a preset occurrence frequency threshold. For example, RNA 3-mer subsequences with frequency greater than the first threshold, RNA 4-mer subsequences with frequency greater than the second threshold, protein 3-mer subsequences with frequency greater than the third threshold, and protein 4-mer subsequences with frequency greater than the fourth threshold can be selected, and together they constitute the frequent itemset L1.
  • the first threshold, the second threshold, the third threshold and the fourth threshold may be the same or different, which is not specifically limited in the present disclosure.
  • the occurrence frequencies of the 3-mer subsequences and 4-mer subsequences of RNA sequences and of the 3-mer subsequences and 4-mer subsequences of protein sequences can also be sorted in descending order, and the top-ranked subsequences form the frequent itemset L1, which is not specifically limited in this disclosure.
  • the RNA 3-mer subsequences and protein 3-mer subsequences, and the RNA 4-mer subsequences and protein 4-mer subsequences, in the frequent itemset L1 can be cross-combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs, and the second candidate item set C2 is composed of the subsequence pairs obtained by combination.
  • the 3-mer subsequence pairs "AUC_137" and "AUC_123" can be obtained by cross-combining the RNA 3-mer subsequences and the protein 3-mer subsequences, and the 4-mer subsequence pairs "AAUU_1737", "AAUU_1234", "AGUC_1737" and "AGUC_1234" can be obtained by cross-combining the RNA 4-mer subsequences and the protein 4-mer subsequences.
  • the occurrence frequency of each subsequence pair in the second candidate item set C2 in the RPI1807 data set can be counted.
  • the subsequence pair may be a 3-mer subsequence pair, or a 4-mer subsequence pair.
  • the number of occurrences of the f-th k-mer subsequence pair in the RPI1807 data set obtained by statistics is recorded as num_f, and then the occurrence frequency of the subsequence pair in the RPI1807 data set, that is, the support support_f of the subsequence pair, can be calculated according to the number of occurrences num_f.
  • the support degree of each subsequence pair in the second candidate item set C2 can be calculated.
  • the subsequence pair satisfying the preset condition can be determined according to the support degree of each subsequence pair, and the original sequence feature set is composed of the subsequence pair satisfying the preset condition.
  • a support threshold can be preset, and subsequence pairs whose support is greater than the threshold are screened out, and the screened subsequence pairs form the original sequence feature set.
  • there are 370 subsequence pairs whose support degree is greater than the threshold and these 370 subsequence pairs are 370 frequent itemset features, and the original sequence feature set can be composed of 370 frequent itemset features.
  • all subsequence pairs in the second candidate item set C2 can also be sorted in descending order according to the support degree, and the top-ranked subsequence pairs are selected to form the original sequence feature set, which is not specifically limited in this disclosure .
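  • The two-stage screening (C1 to L1, then C2 to the final features) resembles one round of Apriori-style candidate generation and can be sketched as below for a single k. The sketch assumes that the occurrence frequency of a single subsequence is the fraction of RNA-protein pairs containing it, and uses one shared threshold instead of four; both are simplifications, and all names and data are hypothetical.

```python
from itertools import product

def kmer_set(seq, k):
    """Distinct overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def frequent_kmers(sequences, k, min_freq):
    """Stage 1: candidates C1 -> frequent single subsequences L1."""
    n = len(sequences)
    c1 = set().union(*(kmer_set(s, k) for s in sequences))
    return {c for c in c1 if sum(c in s for s in sequences) / n > min_freq}

def frequent_pairs(pairs, k, min_freq, min_support):
    """Stage 2: cross-combine L1 into C2 and keep pairs with enough support."""
    l1_rna = frequent_kmers([rna for rna, _ in pairs], k, min_freq)
    l1_protein = frequent_kmers([protein for _, protein in pairs], k, min_freq)
    n = len(pairs)
    features = {}
    for r, p in product(l1_rna, l1_protein):            # candidate set C2
        sup = sum(1 for rna, protein in pairs if r in rna and p in protein) / n
        if sup > min_support:
            features[f"{r}_{p}"] = sup
    return features

pairs = [("AAUC", "1371"), ("AAUG", "1372"), ("CAAU", "2137"), ("GGGC", "1234")]
features = frequent_pairs(pairs, k=3, min_freq=0.5, min_support=0.5)
print(features)
```

  • Screening single subsequences first keeps C2 small: only subsequences that are themselves frequent are cross-combined, instead of all 64 x 343 possible 3-mer pairs.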
  • the frequent itemset feature can combine the k-mer feature of the RNA sequence and the k-mer feature of the protein sequence, and using the frequent itemset feature can better distinguish between interacting and non-interacting RNA-protein pairs. Therefore, when the original sequence feature set is composed of frequent itemset features, and the feature extraction of the RNA-protein pair to be predicted is performed based on the original sequence feature set, the extracted sequence features of the RNA-protein pair to be predicted can more accurately determine whether the RNA-protein pair to be predicted has an interaction.
  • Step S320 Determine the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
  • the RNA sequence and protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively, and after the original sequence feature set is obtained, each k-mer subsequence can be searched in the original sequence feature set, and the sequence features of the RNA-protein pair to be predicted can be obtained according to the search results.
  • the sequence feature of the RNA-protein pair to be predicted may refer to a complete sequence feature composed of RNA sequence features and protein sequence features.
  • the original sequence feature set may consist of 560 k-mer subsequences.
  • the 560 kinds of k-mer subsequences can be [CCC, ..., AGU, CCCC, ..., CUGG, 777, ..., 373, 7774, ..., 7571].
  • the feature calculation can be performed on the RNA sequence and the protein sequence in the RNA-protein pair to be predicted based on the original sequence feature set to obtain the sequence feature of the RNA-protein pair to be predicted.
  • the features of each feature dimension correspond to a k-mer subsequence.
  • the subsequence CCC is the feature of the first feature dimension
  • the subsequence 7571 is the feature of the 560th feature dimension. All 3-mer subsequences and 4-mer subsequences of the RNA-protein pair to be predicted can be searched in the original sequence feature set, and it is determined whether the features on each feature dimension in the original sequence feature set exist according to the search results.
  • if the feature exists, the feature value on that feature dimension is 1; if it does not exist, the feature value is 0.
  • the corresponding RNA 3-mer subsequence includes CCA, ..., CCC, ..., CCA, ..., AUA
  • RNA 4-mer subsequences include CCAC,...,CCCC,...,AAUA
  • protein 3-mer subsequences include 123,...,373,...,777,...,373, protein 4-mer subsequences include 1233,..., 7377, ..., 7373.
  • the search results of each subsequence in the original sequence feature set can be referred to in Table 1.
  • the feature CCC of the first feature dimension in the original sequence feature set is also a 3-mer subsequence of the RNA-protein pair. Therefore, it can be determined that the feature CCC on the first feature dimension in the original sequence feature set exists, and the corresponding feature value can be recorded as 1. For another example, if the feature CUGG in the original sequence feature set does not exist in the RNA sequence of the RNA-protein pair to be predicted, then the feature value on the 260th feature dimension in the original sequence feature set can be recorded as 0. Finally, a 560-dimensional eigenvalue vector [1, ..., 1, 1, ..., 0, 1, ..., 1, 0, ..., 0] can be calculated, which is the sequence feature of the RNA-protein pair to be predicted. It can be understood that each eigenvalue contained in the eigenvalue vector is in one-to-one correspondence with the feature of each feature dimension in the original sequence feature set.
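  • The lookup that turns a feature set of k-mer subsequences into a 0/1 vector can be sketched as below. The eight-entry feature set is an abridged, illustrative stand-in for the 560 features, and both sequences are hypothetical, so the resulting vector is not the one quoted in the text.

```python
def kmer_set(seq, k):
    """Distinct overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def binary_features(rna, protein, feature_set):
    """1 where the feature occurs among the pair's 3-mers or 4-mers, else 0."""
    present = set()
    for k in (3, 4):
        present |= kmer_set(rna, k) | kmer_set(protein, k)
    return [1 if feature in present else 0 for feature in feature_set]

# Abridged stand-in for the 560-entry feature set (order fixes the dimensions).
feature_set = ["CCC", "AGU", "CCCC", "CUGG", "777", "373", "7774", "7571"]

rna = "CCACCCCAAUA"          # hypothetical RNA sequence
protein = "123373777373"     # hypothetical protein sequence (7-class digits)
vector = binary_features(rna, protein, feature_set)
print(vector)
```

  • RNA and protein subsequences can share one lookup set because their alphabets (A, C, G, U versus the class digits 1 to 7) do not overlap.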
  • the original sequence feature set is composed of the extracted k-mer feature of the RNA sequence and the k-mer feature of the protein sequence.
  • the feature extraction of the RNA-protein pair to be predicted is performed to obtain the sequence characteristics of the RNA-protein pair to be predicted.
  • the k-mer feature can include the monomer component information (that is, each base contained) and the sequence order information of the RNA sequence. Therefore, using the k-mer feature can better describe an RNA sequence, that is, an RNA sequence can be more accurately determined according to the k-mer feature, and different RNA sequences can also be distinguished by the k-mer feature.
  • the RNA sequence and protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively, and the RNA k-mer subsequences and protein k-mer subsequences are cross-combined to obtain a variety of RNA-protein k-mer subsequence pairs.
  • the RNA 3-mer subsequences and protein 3-mer subsequences, and the RNA 4-mer subsequences and protein 4-mer subsequences, can be cross-combined in pairs to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs.
  • Each RNA-protein k-mer subsequence pair can be searched in the original sequence feature set, and the sequence features of the RNA-protein pair can be obtained according to the search results.
  • the original sequence feature set may consist of 370 frequent itemset features.
  • the 370 frequent itemset features can be [CCA_121, ..., UCUG_1312, ..., AAU_122, ..., CUUU_1312, ...]. Convert the RNA-protein pair to be predicted into 3-mer subsequence and 4-mer subsequence to obtain RNA 3-mer subsequence, RNA 4-mer subsequence, protein 3-mer subsequence and protein 4-mer subsequence.
  • RNA 3-mer subsequences and protein 3-mer subsequences, RNA 4-mer subsequences and protein 4-mer subsequences can be paired to obtain a variety of 3-mer subsequence pairs and 4-mer subsequence pairs. Then, feature calculation can be performed on the RNA sequence and protein sequence in the RNA-protein pair to be predicted based on the original sequence feature set to obtain the sequence feature of the RNA-protein pair.
  • the feature of each feature dimension corresponds to a kind of k-mer subsequence pair.
  • the subsequence pair CUG_122 is a feature of the first feature dimension. All subsequence pairs of the RNA-protein pair can be searched in the original sequence feature set, and according to the search result, it is determined whether the feature on each feature dimension in the original sequence feature set exists. If it exists, the feature value on the feature dimension is 1, and if it does not exist, it is 0.
  • the RNA sequence in the RNA-protein pair to be predicted is "CCAUCUGAAU"
  • the protein sequence is "1312137122". It can be seen that the subsequence pairs CCA_121, UCUG_1312 and AAU_122 of the RNA-protein pair exist in the original sequence feature set. Therefore, the feature values on the corresponding feature dimensions in the original sequence feature set can be recorded as 1.
  • the search results of each subsequence pair in the original sequence feature set can be referred to in Table 2.
  • a 370-dimensional eigenvalue vector [1, ..., 1, ..., 1, ..., 0, ...] can be calculated, which is the sequence feature of the RNA-protein pair to be predicted.
  • each eigenvalue contained in the eigenvalue vector is in one-to-one correspondence with the eigenvalues of each feature dimension in the original sequence feature set.
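  • The same lookup for frequent itemset features can be sketched with the example sequences given above. Only the four features named in the text are used here as an abridged stand-in for the 370 features; the function name `pair_features` is hypothetical.

```python
def pair_features(rna, protein, feature_set):
    """1 where both halves of an 'RNA_protein' feature occur in the pair, else 0."""
    vector = []
    for item in feature_set:
        rna_kmer, protein_kmer = item.split("_")
        # A k-mer is among the overlapping k-mers exactly when it is a substring.
        hit = rna_kmer in rna and protein_kmer in protein
        vector.append(1 if hit else 0)
    return vector

# Abridged stand-in for the 370 frequent itemset features.
feature_set = ["CCA_121", "UCUG_1312", "AAU_122", "CUUU_1312"]

vector = pair_features("CCAUCUGAAU", "1312137122", feature_set)
print(vector)
```

  • The last entry is 0 because the RNA subsequence CUUU does not occur in the RNA sequence, even though the protein subsequence 1312 does occur in the protein sequence.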
  • the RNA sequence and protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively, and each k-mer subsequence can be searched in the original sequence feature set to obtain the first sequence feature. Then, RNA k-mer subsequences and protein k-mer subsequences can be combined to obtain multiple RNA-protein k-mer subsequence pairs, and each RNA-protein k-mer subsequence pair can be searched in the original sequence feature set to obtain the second sequence feature. Finally, the sequence features of the RNA-protein pair to be predicted can be composed of the first sequence feature and the second sequence feature.
  • the original sequence feature set may include two feature subsets, which respectively include 560 k-mer subsequences [CCC, ..., CCCC, ..., 777, ..., 7774, ...] and 370 frequent itemset features [CCA_121, ..., UCUG_1312, ..., AAU_122, ..., CUUU_1312, ...].
  • the RNA-protein pair to be predicted can be converted into RNA 3-mer subsequence, RNA 4-mer subsequence, protein 3-mer subsequence and protein 4-mer subsequence.
  • the RNA 3-mer subsequences can also be paired with the protein 3-mer subsequences, and the RNA 4-mer subsequences with the protein 4-mer subsequences, to obtain various 3-mer subsequence pairs and 4-mer subsequence pairs.
  • feature calculation can be performed on the RNA sequence and protein sequence in the RNA-protein pair to be predicted based on the original sequence feature set to obtain the sequence feature of the RNA-protein pair to be predicted.
  • all subsequences and subsequence pairs of the RNA-protein pair to be predicted can be searched in the original sequence feature set, and it is determined whether the features on each feature dimension in the original sequence feature set exist according to the search results.
  • a 560-dimensional eigenvalue vector [1, ..., 1, ..., 1, ..., 0, ...] can be obtained for the k-mer subsequence features.
  • a 370-dimensional eigenvalue vector [1, ..., 1, ..., 1, ..., 0, ...] can be obtained for the frequent itemset features.
  • a 930-dimensional eigenvalue vector can be obtained by concatenating two eigenvalue vectors, which is the sequence feature of the RNA-protein pair to be predicted. It is also possible to directly input two eigenvalue vectors into the interaction prediction model at the same time, which is not specifically limited in this disclosure.
  • a 930-dimensional original sequence feature set can also be composed of 560 k-mer subsequences and 370 frequent itemset features.
  • the original sequence feature set is [CCC, ..., CCCC, ..., 777, ..., 7774, ..., CCA_121, ..., UCUG_1312, ..., AAU_122, ..., CUUU_1312, ...].
  • All subsequences and subsequence pairs of the RNA-protein pair can be searched in the original sequence feature set, and according to the search results, it is determined whether the features on each feature dimension in the original sequence feature set exist, and a 930-dimensional eigenvalue vector [1 ,...,1,...,1,...,0,...,1,...,1,...,1,...,0,...], the eigenvalue vector is the sequence feature of the RNA-protein pair to be predicted.
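  • Concatenating the two feature vectors into one can be sketched as below. The two abridged feature lists stand in for the 560 k-mer subsequences and the 370 frequent itemset features, and the function name `combined_features` is hypothetical.

```python
def combined_features(rna, protein, kmer_features, itemset_features):
    """First sequence feature (single k-mers) followed by the second (k-mer pairs)."""
    # Substring tests are equivalent to searching the overlapping k-mers.
    first = [1 if (f in rna or f in protein) else 0 for f in kmer_features]
    second = []
    for item in itemset_features:
        rna_kmer, protein_kmer = item.split("_")
        second.append(1 if (rna_kmer in rna and protein_kmer in protein) else 0)
    return first + second

kmer_features = ["CCA", "CCCC", "137", "7774"]                      # abridged
itemset_features = ["CCA_121", "UCUG_1312", "AAU_122", "CUUU_1312"]  # abridged

vector = combined_features("CCAUCUGAAU", "1312137122",
                           kmer_features, itemset_features)
print(vector)
```

  • With the full feature lists the result would have 560 + 370 = 930 dimensions.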
  • In step S230, the RNA-protein pair to be predicted is vectorized to obtain an RNA sequence representation vector and a protein sequence representation vector in the RNA-protein pair to be predicted.
  • the obtained sequence features can be used as the input of the first interaction prediction model.
  • the RNA-protein pair to be predicted can also be vectorized, and the obtained vector can be used as the input of the second interaction prediction model.
  • the RNA sequence and the protein sequence in the RNA-protein pair to be predicted can be converted into k-mer subsequences respectively.
  • the RNA sequence can be divided without overlap into M RNA k-mer subsequences
  • the protein sequence can be divided without overlap into N protein k-mer subsequences.
  • if the RNA sequence is AUCUGAAAU, it can be divided into three RNA 3-mer subsequences, namely AUC, UGA and AAU.
  • the non-overlapping division of the RNA sequence and the protein sequence into multiple k-mer subsequences serves to vectorize the RNA sequence and the protein sequence, that is, the bases in the RNA sequence and the amino acids in the protein sequence are vectorized in the form of k-mers.
  • each base contained in the RNA sequence in the RNA-protein pair to be predicted can also be vectorized to obtain multiple base vectors, and the multiple base vectors can be spliced to obtain an RNA sequence representation vector.
  • each amino acid contained in the protein sequence in the RNA-protein pair to be predicted can be vectorized to obtain multiple amino acid vectors, and multiple amino acid vectors can be spliced to obtain a protein sequence representation vector.
  • the RNA sequence can also be divided with overlap into P RNA k-mer subsequences, and the protein sequence can be divided with overlap into Q protein k-mer subsequences, which is not specifically limited in this disclosure. Then, the RNA sequence representation vector and the protein sequence representation vector can be respectively input into the second interaction prediction model.
  • each k-mer subsequence of the RNA sequence and the protein sequence can be encoded first.
  • Embedding (vector mapping) encoding can be performed in turn on each RNA 3-mer subsequence and protein 3-mer subsequence, so that each 3-mer subsequence is represented by a low-dimensional vector, and the corresponding multiple 3-mer subsequence vectors e are obtained.
  • each 3-mer subsequence can be One-Hot encoded. One-Hot encoding is also called one-bit effective encoding.
  • the method is to use an N-bit status register to encode N states: each state has its own independent register bit, and at any time only one bit in the register is valid.
  • a 64-dimensional One-Hot vector can be obtained by encoding, in which the i-th element is set to 1 and the other elements are set to 0, such as [0, 1, 0, 0, ..., 0].
  • a 343-dimensional One-Hot vector can be obtained by encoding, in which the j-th element is set to 1 and all other elements are set to 0.
  • each RNA 3-mer subsequence and protein 3-mer subsequence can correspond to a 3-mer One-Hot vector.
  • dense vectors can also be used to represent each 3-mer subsequence.
  • the Word2vec algorithm can be used to map each 3-mer subsequence into a vector space, and each 3-mer subsequence can be represented by a subsequence vector in the vector space.
  • RNA sequence data can be obtained, and the BERT pre-training model can be used for training. After the training is completed, a certain RNA sequence can be input into the trained model to obtain the high-dimensional vector of the RNA sequence, which is not specifically limited in the present disclosure.
  • RNA sequences and protein sequences in the RNA-protein pairs to be predicted can be converted into 3-mer subsequences respectively, such as obtaining M RNA 3-mer subsequence and N protein 3-mer subsequences.
  • M 3-mer One-Hot vectors corresponding to the M RNA 3-mer subsequences can be determined through query, and the M 3-mer One-Hot vectors are spliced in sequence, for example in the row direction, to obtain a two-dimensional matrix of size M*64.
  • the two-dimensional matrix is the 3-mer One-Hot representation vector of the RNA sequence.
  • N 3-mer One-Hot vectors corresponding to the N protein 3-mer subsequences can also be determined by query, and the N 3-mer One-Hot vectors are spliced in sequence in the row direction to obtain a two-dimensional matrix of size N*343, which is the 3-mer One-Hot representation vector of the protein sequence. It is understandable that the M 3-mer One-Hot vectors or the N 3-mer One-Hot vectors can also be spliced by columns, or spliced directly (that is, head to tail), to obtain the 3-mer One-Hot representation vector of the sequence, which is not specifically limited in the present disclosure.
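  • Building the M*64 One-Hot matrix for an RNA sequence can be sketched as below, using the non-overlapping division of AUCUGAAAU from the earlier example. The ordering of the 64 RNA 3-mers (lexicographic over A, C, G, U) is an assumption, since the text does not state which index belongs to which subsequence.

```python
from itertools import product

# All 64 RNA 3-mers; the lexicographic A, C, G, U ordering is an assumption.
RNA_3MERS = ["".join(p) for p in product("ACGU", repeat=3)]
INDEX = {kmer: i for i, kmer in enumerate(RNA_3MERS)}

def one_hot_matrix(seq, k=3):
    """One row per non-overlapping 3-mer, each row a 64-dimensional One-Hot vector."""
    matrix = []
    for start in range(0, len(seq) - k + 1, k):
        row = [0] * len(RNA_3MERS)
        row[INDEX[seq[start:start + k]]] = 1
        matrix.append(row)
    return matrix

matrix = one_hot_matrix("AUCUGAAAU")     # AUC, UGA, AAU -> a 3 x 64 matrix
print(len(matrix), len(matrix[0]), [row.index(1) for row in matrix])
```

  • The protein side would work the same way with the 343 possible 3-mers over the seven amino acid classes, giving an N*343 matrix.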
  • the RNA sequence representation vector and the protein sequence representation vector can be used as the input of the deep learning model to further discover rare or new feature combinations and reveal the interactions between implicit features.
  • each base included in the RNA sequence in the RNA-protein pair can also be vectorized to obtain multiple base vectors, and the multiple base vectors can be spliced to obtain an RNA sequence representation vector.
  • each amino acid contained in the protein sequence in the RNA-protein pair can be vectorized to obtain multiple amino acid vectors, and multiple amino acid vectors can be spliced to obtain a protein sequence representation vector.
  • each base and each type of amino acid can be One-Hot coded.
  • base A can be represented by a One-Hot vector [1, 0, 0, 0]
  • U can be represented by [0, 0, 0, 1].
  • Class 1 amino acids can be coded as [1, 0, 0, 0, 0, 0, 0] and class 7 amino acids can be coded as [0, 0, 0, 0, 0, 0, 1].
  • dense vectors can also be used to represent each 3-mer subsequence.
  • the Word2vec algorithm can be used to map each base and each type of amino acid into a vector space to obtain a corresponding vector representation.
  • Doc2vec algorithm, Glove algorithm, etc. can also be used to convert each base and each type of amino acid into a vector, which is not specifically limited in the present disclosure.
  • the RNA sequence can be represented by a single-base One-Hot vector to obtain an L*4 matrix.
  • L is the sequence length of the RNA sequence, that is, the number of bases contained in the RNA sequence.
  • a One-Hot vector of 8 bases can be obtained.
  • each column represents the One-Hot vector representation of a base.
  • the eight One-Hot vectors can be concatenated in the row direction to obtain an 8*4 matrix, which is the One-Hot vector representation of the RNA sequence.
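  • As a small illustration of the single-base representation, the sketch below builds the L*4 matrix for the 8-base example sequence "CCAUCUGA" used elsewhere in this description. The base order A, C, G, U is an assumption: the text only fixes A as [1, 0, 0, 0] and U as [0, 0, 0, 1].

```python
# Assumed column order; only the positions of A and U are given in the text.
BASE_ORDER = "ACGU"


def base_one_hot(base):
    """Return the 4-dimensional One-Hot vector of a single base."""
    vec = [0] * len(BASE_ORDER)
    vec[BASE_ORDER.index(base)] = 1
    return vec


def sequence_one_hot(rna):
    """Return an L x 4 matrix with one One-Hot row per base."""
    return [base_one_hot(b) for b in rna]


matrix = sequence_one_hot("CCAUCUGA")
for row in matrix:
    print(row)
```

  • Each printed row corresponds to one base, so the example yields an 8*4 matrix.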
  • In step S240, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, the interaction prediction value of the RNA-protein pair to be predicted is obtained using an interaction prediction model.
  • At least three interaction prediction models can be used to predict the interaction of the RNA-protein pair to be predicted.
  • the interaction prediction models may all be traditional machine learning models, or may all be deep learning models, or the at least three interaction prediction models may include both traditional machine learning models and deep learning models.
  • the traditional machine learning model refers to the processing of natural data in its original form. For example, composing a pattern recognition or machine learning system requires expert knowledge to extract features from raw data (such as pixel values of an image) and convert them into an appropriate feature representation.
  • the traditional machine learning model may include a linear regression model, a logistic regression model, a support vector machine model, a decision tree model, a K-nearest neighbor (K-Nearest Neighbor, KNN) model, a random forest model, a naive Bayes model, etc.
  • the deep learning model has the ability to automatically extract features, and can be composed of multiple processing layers to form a complex computing model, thereby automatically obtaining data representation and multiple levels of abstraction, which is a kind of learning for feature representation.
  • the deep learning model may include a convolutional neural network model, a recurrent neural network model, and the like.
  • when the interaction prediction models are all traditional machine learning models, the RNA-protein pair to be predicted need not be vectorized; only the sequence features obtained by feature extraction of the RNA-protein pair to be predicted are used as the input to each traditional machine learning model.
  • when the interaction prediction models are all deep learning models, feature extraction of the RNA-protein pair to be predicted need not be performed; only the RNA sequence representation vector and the protein sequence representation vector obtained by vectorizing the RNA-protein pair to be predicted are used as the input to each deep learning model.
  • the sequence features of the RNA-protein pair to be predicted can be input into the first interaction prediction model to obtain the first interaction prediction value.
  • the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted can be input into the second interaction prediction model to obtain the second interaction prediction value.
  • the interaction prediction models used may include at least one first interaction prediction model and at least two second interaction prediction models, or may include at least two first interaction prediction models and at least one second interaction prediction model.
  • At least three interaction prediction models are trained to obtain multiple learning results, and the multiple learning results are fused to obtain a strong supervised model.
  • the fusion of various learning results can obtain a better learning effect than a single model.
  • a strongly supervised model can be viewed as an ensemble model that actually contains multiple weakly supervised models. Therefore, the first interaction prediction model and the second interaction prediction model can be combined, and the outputs of multiple models can be fused and learned.
  • each model is trained individually during model training.
  • the model parameters of each model can be optimized through the backpropagation algorithm, and a strong supervision model can be obtained from the optimized models, which addresses the low generalization ability and the tendency to overfit that arise when a single model is used for prediction.
  • the sequence features of the RNA-protein pair to be predicted can be input into a traditional machine learning model to obtain the first interaction prediction value.
  • the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted can be input into the deep learning model to obtain the second interaction prediction value.
  • each deep learning model may include at least two sub-deep learning models, which are respectively used to process the RNA sequence and the protein sequence in the RNA-protein pair.
  • the representation vector of the RNA sequence can be input into the first sub-deep learning model to obtain the first sequence feature.
  • the representation vector of the protein sequence can be input into the second sub-deep learning model to obtain the second sequence feature.
  • the first sequence feature and the second sequence feature can be fused through the fully connected layer of each deep learning model to obtain a second interaction prediction value based on the fused features.
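  • The two-branch arrangement above can be illustrated with a minimal sketch of the fusion step only: the two sequence features are concatenated and passed through a single fully connected unit with a Sigmoid output. The function name `fuse_and_predict`, the single output unit and the hand-picked weights are assumptions for illustration; the sub-models that produce the two features are not modeled here.

```python
import math


def sigmoid(x):
    """Logistic function mapping a score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_and_predict(rna_feature, protein_feature, weights, bias):
    """Concatenate the two branch features and apply one fully connected unit."""
    fused = list(rna_feature) + list(protein_feature)
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(score)


# Hypothetical feature values and weights, chosen only to show the call.
value = fuse_and_predict([0.2, 0.8], [0.5], [1.0, -0.5, 0.3], bias=0.1)
print(round(value, 3))
```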
  • the interaction prediction model used may include at least one traditional machine learning model and at least two deep learning models, or may also include at least two traditional machine learning models and at least one deep learning model.
  • the traditional machine learning model can include at least one of an SVM model, an LR (Logistic Regression) model and a random forest model, and the deep learning model can include at least one of a CNN model and a recurrent neural network model.
  • the recurrent neural network model may be an LSTM model or a BiLSTM (Bi-directional LSTM, bidirectional long short-term memory network) model.
  • a 930-dimensional original sequence feature set can be composed of 560 k-mer subsequences and 370 frequent itemset features; feature calculation is performed on the original sequence feature set according to the k-mer subsequences of the RNA-protein pair to be predicted to obtain a 930-dimensional feature value vector, which is used as the input of the SVM model to obtain the interaction prediction value y_1.
  • the RNA-protein pair to be predicted can also be vectorized to obtain the k-mer One-Hot vectors of the RNA sequence and the protein sequence respectively, which are used as the input of the CNN model and the BiLSTM model to obtain the interaction prediction values y_2 and y_3.
  • a feature extraction method and model parameters may be manually designed to make the model more explanatory.
  • the CNN model and the BiLSTM model have good generalization capabilities, and can discover feature combinations that are rare or new in the data, and then reveal the interaction between implicit features.
  • the CNN model can better capture the features but ignores the location information of the features, that is, the relationship between bases and amino acids at a certain interval cannot be extracted.
  • the BiLSTM model has a better memory ability, and can use the sequence information and position information of the data to make up for the shortcomings of the CNN model in memory ability.
  • the SVM model is simple and has good interpretability.
  • the interpretability of the overall model for RPI prediction can be enhanced.
  • the present disclosure can effectively combine the characteristics of each interaction prediction model, thereby improving the prediction ability of the overall model.
  • step S250 the interaction between the RNA and the protein is determined according to the interaction prediction value.
  • the interaction between the RNA and the protein can be determined based on the fusion result of the interaction prediction values.
  • a predictive value threshold may be preset, and each interaction predictive value is marked according to the threshold.
  • if the interaction prediction value is greater than or equal to the prediction value threshold, it is marked as 1.
  • if the interaction prediction value is less than the prediction value threshold, it is marked as 0.
  • set the prediction value threshold to 0.6, and if the interaction prediction value is greater than or equal to 0.6, mark the prediction result of the corresponding interaction prediction model as 1, that is, the interaction prediction model predicts that the input RNA-protein pair has interaction. Otherwise, mark the prediction result of the corresponding interaction prediction model as 0.
  • the individual marked values can be summed, that is, according to:
  • $$T = \sum_{i=1}^{n} T_i$$
  • T_i is the tag value of the prediction result of the i-th interaction prediction model
  • n is the number of interaction prediction models.
  • if T ≥ n/2, it indicates that the RNA-protein pair to be predicted has an interaction.
  • if T < n/2, it indicates that the RNA-protein pair to be predicted has no interaction.
  • if n is an even number, it can also be set so that T > n/2 indicates that the RNA-protein pair to be predicted has an interaction, and T ≤ n/2 indicates that the RNA-protein pair to be predicted has no interaction, which is not specifically limited in the present disclosure.
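  • The marking and voting steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the default rule is taken to be "interaction if T ≥ n/2", the optional rule for even n is "interaction only if T > n/2", and the names `mark`, `majority_vote` and `strict_when_even` are hypothetical.

```python
def mark(prediction, threshold=0.6):
    """Mark one model output as 1 (interaction) or 0 (no interaction)."""
    return 1 if prediction >= threshold else 0


def majority_vote(predictions, threshold=0.6, strict_when_even=False):
    """Sum the marks of n models (T) and compare the sum with n/2."""
    marks = [mark(p, threshold) for p in predictions]
    total = sum(marks)  # T in the text
    n = len(marks)
    if strict_when_even and n % 2 == 0:
        # Optional rule for an even number of models: a tie counts as 0.
        return 1 if total > n / 2 else 0
    return 1 if total >= n / 2 else 0


print(majority_vote([0.9, 0.7, 0.2]))  # two of three models vote 1 -> 1
```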
  • weight parameters of multiple interaction prediction models can be obtained, and the multiple interaction prediction values and the corresponding weight parameters can be fused, for example by computing a weighted sum of the interaction prediction values, and the interaction of the RNA-protein pair is determined according to the calculated result.
  • when the SVM model, the CNN model and the BiLSTM model are used to predict the interaction of RNA-protein pairs, the interaction prediction value y_out of the RNA-protein pair can be calculated according to:
  • $$y_{out} = \alpha y_1 + \beta y_2 + \gamma y_3$$
  • y_1 is the output value of the SVM model
  • y_2 is the output value of the CNN model
  • y_3 is the output value of the BiLSTM model
  • α, β, γ are the weight parameters of the SVM model, CNN model and BiLSTM model respectively
  • y_out can be any value between 0 and 1.
  • 0.5 may be used as a boundary value
  • if y_out ≥ 0.5, the prediction result may be marked as 1, which means that the RNA-protein pair has an interaction.
  • if y_out < 0.5, the prediction result can be marked as 0, which means that the RNA-protein pair has no interaction.
  • the weight parameter may be set manually, for example, a larger weight may be set for an interaction prediction model with a higher accuracy rate, which is not specifically limited in the present disclosure.
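  • A minimal sketch of the weighted fusion follows. It assumes that the weights are non-negative and normalizes them by their sum so that the fused value stays between 0 and 1; the normalization, the function name `fuse_weighted` and the example weights are assumptions, not values from the disclosure.

```python
def fuse_weighted(predictions, weights, boundary=0.5):
    """Weighted sum of model outputs, then a 0/1 decision at the boundary."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per prediction")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the weight sum is an assumption that keeps y_out in [0, 1].
    y_out = sum(w * y for w, y in zip(weights, predictions)) / total_weight
    label = 1 if y_out >= boundary else 0
    return y_out, label


# Hypothetical outputs of the SVM, CNN and BiLSTM models and their weights.
y_out, label = fuse_weighted([0.9, 0.4, 0.7], [0.2, 0.4, 0.4])
print(round(y_out, 2), label)
```

  • Giving a larger weight to a model with a higher accuracy rate, as the text suggests, only changes the `weights` argument.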
  • At least three interaction prediction models can be pre-trained according to step S510 and step S520, so as to optimize all model parameters in each prediction model, and the finally trained models can be used to make predictions for RNA-protein pairs whose interactions are unknown.
  • Step S510 Obtain a training data set, which includes positive RNA-protein pairs and negative RNA-protein pairs.
  • all RNA-protein pairs in the original data set can be used as the training data set, or only some of them can be used. The original data set can also be divided in proportion into a training data set, a verification data set and a test data set; the training data set and the verification data set can then be used to adjust the model parameters of each model and train multiple models with better performance, and the test data set can be used to test the generalization performance of each optimized model.
  • taking the RPI1807 data set as an example, there are 3243 RNA-protein pairs in the data set, including 1807 positive pairs and 1436 negative pairs.
  • the data set may be divided into training data set, verification data set and test data set according to the ratio of 7:2:1.
  • the ratio of positive and negative cases in each data set can be consistent with the distribution of the overall data set, that is, the ratio is 1807:1436, which is about 1.25:1.
  • 1250 positive examples and 1000 negative examples can be selected as the training data set
  • 360 positive examples and 280 negative examples can be selected as the verification data set
  • 180 positive examples and 140 negative examples can be selected as the test data set.
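  • A proportional split that keeps the positive-to-negative ratio in every subset can be sketched as below. This is a minimal illustration assuming a 7:2:1 split applied separately to the positive and negative pairs; the function name `stratified_split`, the fixed random seed and the integer rounding are assumptions, so the exact counts differ slightly from the rounded illustrative numbers in the text.

```python
import random


def stratified_split(positives, negatives, ratios=(7, 2, 1), seed=0):
    """Split positives and negatives separately so each subset keeps the ratio."""
    rng = random.Random(seed)
    total = sum(ratios)
    splits = [[] for _ in ratios]
    for label, group in ((1, positives), (0, negatives)):
        items = list(group)
        rng.shuffle(items)
        start = 0
        for i, ratio in enumerate(ratios):
            if i == len(ratios) - 1:
                end = len(items)  # the last subset takes the remainder
            else:
                end = start + len(items) * ratio // total
            splits[i].extend((item, label) for item in items[start:end])
            start = end
    return splits


# Integer ids stand in for the 1807 positive and 1436 negative pairs.
train, val, test = stratified_split(range(1807), range(1436))
print(len(train), len(val), len(test))  # 2269 648 326
```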
  • the number of RNA-protein pairs in the training data set is only illustrative, and any number of RNA-protein pairs can be obtained to train each interaction prediction model multiple times to improve the performance of each interaction prediction model.
  • Positive RNA-protein pairs can be labeled, and the resulting label value is "1", which means that the RNA-protein pair has an interaction.
  • Negative RNA-protein pairs can be labeled, and the resulting label value is "0", which means that the RNA-protein pair has no interaction.
  • Step S520. Use the training data set as the input of the interaction prediction models, iteratively update the model parameters of each interaction prediction model, and complete the training of all model parameters when the iteration termination condition is satisfied, so that the trained interaction prediction models can be used to predict the interaction of the RNA-protein pair to be predicted.
  • the training data set can be input into at least three interaction prediction models, and the model parameters are adjusted by using the backpropagation algorithm to obtain multiple weakly supervised models.
  • model parameters can be weight parameters, bias parameters, penalty factors, etc.
  • each model parameter may be iteratively updated using a stochastic gradient descent algorithm. According to the principle of backpropagation, the objective function is calculated repeatedly and the model parameters are updated according to the objective function. When the objective function converges to its minimum value, the training of the model parameters is completed. Alternatively, the model parameters can be updated iteratively by backpropagation until a preset number of iterations is reached, at which point the training of all model parameters is completed. It should be noted that the at least three interaction prediction models can be trained simultaneously or sequentially, which is not specifically limited in the present disclosure; in either case, each interaction prediction model is trained separately.
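  • The two termination conditions above (convergence of the objective function, or a preset number of iterations) can be sketched with a small stochastic gradient descent loop. For brevity the sketch trains a single logistic unit with a cross-entropy loss rather than any of the disclosed models; the function names, the learning rate, the tolerance and the toy data are assumptions.

```python
import math
import random


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def mean_cross_entropy(samples, labels, w, b):
    """Average cross-entropy loss of a logistic unit over the data set."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(samples, labels):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(samples)


def train_sgd(samples, labels, lr=0.1, max_iters=20, tol=1e-6, seed=0):
    """Stop when the loss change falls below tol or after max_iters passes."""
    rng = random.Random(seed)
    w, b = [0.0] * len(samples[0]), 0.0
    previous = float("inf")
    loss = previous
    for _ in range(max_iters):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:  # one parameter update per training sample
            x, y = samples[i], labels[i]
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            w = [wj - lr * error * xj for wj, xj in zip(w, x)]
            b -= lr * error
        loss = mean_cross_entropy(samples, labels, w, b)
        if abs(previous - loss) < tol:
            break
        previous = loss
    return w, b, loss


toy_x = [[-2.0], [-1.0], [1.0], [2.0]]
toy_y = [0, 0, 1, 1]
weights, bias, final_loss = train_sgd(toy_x, toy_y)
print(weights, bias, final_loss)
```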
  • a set of hyperparameters can be initialized first, and at least three interaction prediction models are continuously trained using the training data set to obtain the first model.
  • the hyperparameters can be the learning rate, the number of CNN layers, the size of the convolution kernel, etc.
  • the verification data set can be input into the trained first model to verify the prediction accuracy of the first model.
  • if the prediction accuracy reaches the preset accuracy threshold, the current first model can be used as the second model, that is, the final trained model.
  • the final performance of the model can be tested using the test dataset on the trained model.
  • if the prediction accuracy does not reach the preset accuracy threshold, a new set of hyperparameters can be set, and the interaction prediction model can be trained and verified again using the training data set and the verification data set in turn. When the prediction accuracy of the trained interaction prediction model on the verification data set reaches the preset accuracy threshold, a new test data set can be used to test the final performance of the prediction model.
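  • The train, verify and reset loop above can be sketched as follows. The function `select_model` and the callables `train_fn` and `eval_fn` are hypothetical placeholders for whatever training and evaluation routines are used; the toy "model" here is just a cut-off value, chosen only to make the example runnable.

```python
def select_model(candidates, train_fn, eval_fn, train_set, val_set, threshold):
    """Try hyperparameter sets in order until validation accuracy is reached."""
    for hyperparams in candidates:
        model = train_fn(train_set, hyperparams)
        accuracy = eval_fn(model, val_set)
        if accuracy >= threshold:
            return model, hyperparams, accuracy
    return None  # no candidate reached the preset accuracy threshold


# Toy stand-ins: the "model" is a cut-off applied to a single score.
def toy_train(train_set, hyperparams):
    return hyperparams["cut"]


def toy_eval(model, data):
    correct = sum(1 for x, y in data if (x >= model) == bool(y))
    return correct / len(data)


validation = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
result = select_model([{"cut": 0.05}, {"cut": 0.5}], toy_train, toy_eval,
                      train_set=[], val_set=validation, threshold=0.9)
print(result)  # (0.5, {'cut': 0.5}, 1.0)
```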
  • each RNA-protein pair in the test data set can be input into the weakly supervised model to judge the accuracy of the weakly supervised model. If the accuracy of the model is greater than the preset accuracy threshold, the weakly supervised model training is completed.
  • the prediction results of multiple weak supervision models can be fused to obtain a strong supervision model, and as the final prediction model, the interaction between unknown RNA-protein pairs can be predicted by using this model.
  • the test data set can also be used to judge the Matthews correlation coefficient of the weakly supervised model.
  • the Matthews correlation coefficient refers to the correlation coefficient between the actual classification and the predicted classification. Its value range is [-1, 1]; a value of 1 indicates that the prediction result is exactly right.
  • the test data set can also be used to judge the specificity and recall of the weakly supervised model, which is not specifically limited in this disclosure. It can be understood that if the accuracy of the weakly supervised model is not greater than the preset accuracy threshold, a new training data set can be obtained to train the model parameters of each interaction prediction model again, so as to continuously improve the model performance.
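  • The evaluation measures named above can be computed from the confusion counts of a test set, as in the following sketch. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Accuracy, recall, specificity and Matthews correlation coefficient."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "recall": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


print(evaluate([1, 1, 0, 0], [1, 0, 0, 0]))
```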
  • a strong supervision model is obtained by training multiple weakly supervised models such as a traditional machine learning model, a convolutional neural network model, and a recurrent neural network model, and fusing the prediction results of the multiple weakly supervised models.
  • the strong supervision model is simple and effective, and has low requirements on hardware equipment, and has a wide range of applications.
  • the SVM model, the CNN model and the BiLSTM model can be trained respectively.
  • Feature extraction is performed on each RNA-protein pair in the training data set, and the extracted sequence features can be sequentially input into the SVM model.
  • the original sequence feature set consists of 560 k-mer subsequences and 370 frequent itemset features
  • feature calculation can be performed on the original sequence feature set according to the k-mer subsequences of the RNA-protein pairs to obtain a 930-dimensional feature value vector, which is used as the input of the SVM model to train the model parameters of the model.
  • the model parameters of the SVM model may include a penalty factor C, a gamma parameter, and the like.
  • the kernel function can adopt polynomial kernel function.
  • the penalty item can be reduced, and the penalty factor C is set to 0.8.
  • the kernel function can also be the radial basis kernel function, in which the gamma parameter defaults to 1/n_features. Parameter tuning experiments show that the model performs better and learns better when the gamma parameter value is 0.1. After the model parameters are determined, the SVM model can be trained on the training data set using these parameters to obtain an SVM model with better performance.
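  • For reference, the two kernel functions mentioned above can be written out directly. The sketch uses the common definitions of the radial basis and polynomial kernels with gamma defaulting to 1/n_features; the `degree` and `coef0` defaults are assumptions, since the text does not state them.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """Radial basis kernel exp(-gamma * ||x - y||^2)."""
    if gamma is None:
        gamma = 1.0 / len(x)  # default of 1/n_features, as stated in the text
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(x, y, degree=3, gamma=None, coef0=0.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree."""
    if gamma is None:
        gamma = 1.0 / len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


# gamma = 0.1 is the value the text reports as performing well.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.1))
```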
  • the CNN model can be used for feature extraction and classification to achieve end-to-end RNA-protein prediction.
  • the CNN model can be used to extract the feature information of adjacent bases and amino acids, but the relationship between bases and amino acids at a certain interval cannot be extracted, so the BiLSTM model can be used to extract sequence information.
  • each RNA-protein pair in the training data set can be vectorized, and the obtained RNA sequence representation vector and protein sequence representation vector in each RNA-protein pair can be input into the CNN model and the BiLSTM model. The training of each prediction model with the i-th RNA-protein pair is taken as an example below.
  • the sequence features of the RNA-protein pair can be input into the SVM model, and the predicted value y_1 can be output.
  • the representation vector of the RNA sequence and the representation vector of the protein sequence in the RNA-protein pair can be input into the two CNN sub-models respectively, and the outputs of the two CNN sub-models are spliced to finally output the predicted value y_2.
  • the representation vector of the RNA-protein pair can be input into the BiLSTM model, and the predicted value y_3 can be output.
  • the output value of each interaction prediction model can be marked as 0 or 1, 0 means that the RNA-protein pair has no interaction, and 1 means that the RNA-protein pair has interaction.
  • the representation vector of the RNA sequence can be the One-Hot vector of the sequence obtained from the One-Hot vectors of the individual bases in the sequence, the 3-mer One-Hot vector of the sequence, or the 4-mer One-Hot vector of the sequence, which is not specifically limited in this disclosure. It can be understood that, for each RNA-protein pair in the training data set, three interaction prediction values can be obtained through the three interaction prediction models.
  • RNA sequence and the protein sequence can be input into two CNN sub-networks respectively, and the network architecture of the two CNN sub-networks is the same.
  • the RNA sequence may be used as an input of the first CNN sub-network as an example for illustration.
  • the first CNN sub-network may include 5 convolutional layers with a convolution kernel size of 1×3, 4 normalization layers, 4 downsampling layers, a concatenation layer, a fully connected layer and a random deactivation (dropout) layer.
  • the convolution layer C1 can use a 1×3 convolution kernel for feature extraction, and the number of output channels is 32.
  • the BN (Batch Normalization) operation can be performed on the sequence features extracted by each channel to normalize them, thereby accelerating network training.
  • the downsampling layer P1 can use a 1×2 kernel to take the maximum value of the feature points in the neighborhood, reducing the number of parameters to be learned by the network.
  • the RNA sequence can be down-sampled layer by layer to obtain the feature vector of the RNA sequence.
  • the protein sequence can be down-sampled layer by layer to obtain the feature vector of the protein sequence.
  • the feature vector of the RNA sequence and the feature vector of the protein sequence can be spliced end to end by using the splicing layer of the CNN model, the spliced feature vector can be processed in the fully connected layer, and random deactivation (dropout) can be applied.
  • the activation function of the last layer can choose the Sigmoid activation function.
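  • The convolution and downsampling operations used in each CNN sub-network can be illustrated on a single channel. The sketch assumes no padding, stride 1 for the 1×3 convolution and non-overlapping 1×2 max pooling; the real sub-network uses multi-channel inputs and 32 output channels, which are omitted here, and the kernel values are arbitrary.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation form) with stride 1."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(sequence) - k + 1)
    ]


def max_pool1d(sequence, size=2):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [
        max(sequence[i:i + size])
        for i in range(0, len(sequence) - size + 1, size)
    ]


signal = [1, 3, 2, 5, 4, 6, 0, 2]
features = conv1d(signal, kernel=[1, 0, -1])  # one 1x3 kernel
pooled = max_pool1d(features)                 # 1x2 max pooling
print(features)  # [-1, -2, -2, -1, 4, 4]
print(pooled)    # [-1, -1, 4]
```

  • Each pooling step roughly halves the feature length, which is how the layer-by-layer downsampling reduces the number of values passed to later layers.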
  • RNA sequences and protein sequences can be fed into two identical BiLSTM subnetworks respectively. For example, take the RNA sequence as the input of the first BiLSTM subnetwork. If the RNA sequence is "CCAUCUGA", the representation vector of the RNA sequence can be an 8*4 matrix obtained from the One-Hot vectors of the 8 bases in the sequence, the 3-mer One-Hot vector of the sequence, or the 4-mer One-Hot vector of the sequence.
  • each base vector can be sequentially input into the BiLSTM network to obtain the feature vector of the RNA sequence.
  • BiLSTM network is composed of forward LSTM network and backward LSTM network.
  • LSTM network is a time recurrent neural network, which is suitable for processing and predicting important events with relatively long intervals and delays in time series.
  • the One-Hot vector of 8 bases can be forwardly input into the forward LSTM network sequentially, and the first hidden vector of the RNA sequence is output.
  • the One-Hot vector of 8 bases can be reversely and sequentially input into the backward LSTM network, and the second hidden vector of the RNA sequence can be output.
  • the first hidden vector and the second hidden vector of the RNA sequence are spliced to finally obtain the feature vector of the RNA sequence.
  • the One-Hot vector of the first base "C" can be input into the forward LSTM network first, the hidden features of the One-Hot vector of "C" are extracted through the forward LSTM network, and the output at this moment is taken as the hidden vector at time t.
  • the hidden vector at time t and the One-Hot vector of the second base "C" at time t+1 can be spliced, the spliced vector is input into the forward LSTM network, hidden feature extraction is performed on the spliced vector, and the hidden vector at time t+1 is output.
  • by analogy, the One-Hot vector of the base at the current moment can be spliced in turn with the hidden vector passed down from the previous moment, and feature extraction is performed on the spliced vector through the forward LSTM network, until the One-Hot vector of the eighth base "A" is input into the forward LSTM network and the hidden vector at the last moment is output, that is, the first hidden vector of the RNA sequence is obtained.
  • the One-Hot vector of the eighth base "A" can be input into the backward LSTM network first, the hidden features of the One-Hot vector of "A" are extracted through the backward LSTM network, and the output at this moment is taken as the hidden vector at time t. Then, the hidden vector at time t and the One-Hot vector of the seventh base "G" at time t+1 can be spliced, the spliced vector is input into the backward LSTM network, hidden feature extraction is performed on the spliced vector, and the hidden vector at time t+1 is output.
  • by analogy, the One-Hot vector of the base at the current moment can be spliced in turn with the hidden vector passed down from the previous moment, and feature extraction is performed on the spliced vector through the backward LSTM network, until the One-Hot vector of the first base "C" is input into the backward LSTM network and the hidden vector at the last moment is output, that is, the second hidden vector of the RNA sequence is obtained. After the first hidden vector and the second hidden vector of the RNA sequence are spliced, the feature vector of the RNA sequence is obtained.
  • the feature vector of the RNA sequence and the feature vector of the protein sequence can be spliced end to end by using the splicing layer of the BiLSTM model, and the bias can be added in the fully connected layer
  • the activation function of the last layer can choose the Sigmoid activation function.
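  • The forward and backward passes described above can be sketched as follows. To keep the example short, a simplified tanh recurrent cell without gates stands in for the LSTM cell, so this shows only the bidirectional traversal and the splicing of the two final hidden vectors, not a real LSTM. The weights, the hidden size of 2 and the base order "ACGU" are arbitrary assumptions.

```python
import math


def run_direction(vectors, weights, hidden_size):
    """Feed vectors one by one; each step splices the previous hidden vector
    with the current input and applies a tanh layer (stand-in for an LSTM cell)."""
    hidden = [0.0] * hidden_size
    for x in vectors:
        joined = hidden + list(x)
        hidden = [
            math.tanh(sum(w * v for w, v in zip(row, joined)))
            for row in weights
        ]
    return hidden  # hidden vector at the last moment


def bidirectional_features(vectors, fwd_weights, bwd_weights, hidden_size):
    """Splice the final forward and backward hidden vectors."""
    forward = run_direction(vectors, fwd_weights, hidden_size)
    backward = run_direction(list(reversed(vectors)), bwd_weights, hidden_size)
    return forward + backward


BASES = "ACGU"  # assumed base order
one_hot = [[1 if b == c else 0 for c in BASES] for b in "CCAUCUGA"]
# Each weight row has hidden_size + 4 entries (hidden part, then input part).
w_fwd = [[0.1, -0.2, 0.3, 0.4, -0.1, 0.2], [0.2, 0.1, -0.3, 0.5, 0.1, -0.4]]
w_bwd = [[-0.1, 0.3, 0.2, -0.2, 0.4, 0.1], [0.3, -0.1, 0.1, 0.2, -0.5, 0.3]]
feature = bidirectional_features(one_hot, w_fwd, w_bwd, hidden_size=2)
print(len(feature))  # 4
```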
  • the model parameters of the two models may be updated using a stochastic gradient descent algorithm.
  • the objective function such as cross-entropy loss function is continuously calculated, and the model parameters of each interaction prediction model are simultaneously updated according to the calculated loss value.
  • when the objective function converges to its minimum value, the training of all model parameters is completed.
  • the model parameters can also be updated iteratively in reverse, and when the preset number of iterations is met, the training of all model parameters is completed.
  • the preset number of iterations may be 20, and each interaction prediction model is constantly updating model parameters during the 20 reverse iterations.
  • the optimized model parameters can be obtained.
  • the objective function may also be minimized by the alternating least squares method, the Adam optimization algorithm, etc., and the model parameters are updated sequentially from back to front to optimize the model parameters.
  • each finally obtained model can be used to predict the interaction between unknown RNA-protein pairs, obtain multiple prediction values, and fuse multiple prediction values to obtain the final prediction result.
  • the prediction result of the RNA-protein interaction can be output to a terminal device for viewing by the user.
  • At least one RNA sequence may also be obtained, and at least three interaction prediction models may be used to search the database for protein sequences that interact with each input RNA sequence.
  • at least three interaction prediction models may be pre-trained with reference to FIG. 5 using the original data set.
  • all protein sequences participating in the training can be stored in the database.
  • the database can also include other protein sequences not involved in training, that is, the number of protein sequences in the database can be arbitrary. The database can also include any number of RNA sequences, such as but not limited to all RNA sequences involved in training, which is not specifically limited in this disclosure.
  • each input RNA sequence can be combined with all protein sequences in the database to form several RNA-protein pairs.
  • at least three interaction prediction models can be used to predict the interaction of each RNA-protein pair according to step S220 to step S250.
  • feature extraction and vectorization processing can be performed on each RNA-protein pair, and the obtained sequence features, RNA sequence representation vectors and protein sequence representation vectors in the RNA-protein pair are input into at least three interaction prediction models, And get the interaction prediction value of each RNA-protein pair.
  • An interaction prediction value of 1 indicates that the RNA-protein pair has an interaction
  • an interaction prediction value of 0 indicates that the RNA-protein pair has no interaction.
  • all RNA-protein pairs with an interaction prediction value of 1 can be screened out, and the protein sequence in each RNA-protein pair can be output to the terminal device for users to view protein sequences that interact with the input RNA sequence .
  • At least one protein sequence can also be obtained, and at least three interaction prediction models can be used to search the database for RNA sequences that interact with each input protein sequence.
  • each input protein sequence can be combined with all RNA sequences in the database to form several RNA-protein pairs.
  • at least three interaction prediction models can be used to predict the interaction of each RNA-protein pair according to step S220 to step S250.
  • feature extraction and vectorization can be performed on each RNA-protein pair, and the obtained sequence features, RNA sequence representation vector and protein sequence representation vector in each RNA-protein pair can be input into at least 3 interaction prediction models , and get the interaction predictions for each RNA-protein pair.
  • An interaction prediction value of 1 indicates that the RNA-protein pair has an interaction
  • an interaction prediction value of 0 indicates that the RNA-protein pair has no interaction. Then, all RNA-protein pairs with an interaction prediction value of 1 can be screened out, and the RNA sequence in each RNA-protein pair is output to the terminal device for users to view RNA sequences that interact with the input protein sequence .
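  • Both search directions above follow the same pattern: pair the query with every candidate in the database, predict each pair, and keep the candidates whose prediction is 1. A minimal sketch follows; `screen_partners` is a hypothetical helper, and `toy_predict` is a placeholder rule with no biological meaning that stands in for the fused interaction prediction models.

```python
def screen_partners(query, database, predict, query_is_rna=True):
    """Return the database sequences predicted to interact with the query.

    predict(rna, protein) is assumed to return 1 (interaction) or 0.
    """
    hits = []
    for candidate in database:
        pair = (query, candidate) if query_is_rna else (candidate, query)
        if predict(*pair) == 1:
            hits.append(candidate)
    return hits


def toy_predict(rna, protein):
    # Placeholder rule used only to make the example runnable.
    return 1 if protein.startswith("M") and "A" in rna else 0


proteins = ["MKV", "GAV", "MLL"]
rnas = ["CCAUCUGA", "GGGG"]
print(screen_partners("CCAUCUGA", proteins, toy_predict))
print(screen_partners("MKV", rnas, toy_predict, query_is_rna=False))
```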
  • in the RNA-protein interaction prediction method described above, the RNA-protein pair to be predicted is obtained; feature extraction is performed on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted; the RNA-protein pair to be predicted is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted; based on the sequence features, the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted, the interaction prediction model is used to obtain the interaction prediction value of the RNA-protein pair to be predicted; and the interaction between the RNA and the protein is determined according to the interaction prediction value.
  • on the one hand, the information in the RNA sequences and protein sequences can be fully mined, so as to accurately predict the interaction between RNA and proteins; on the other hand, the effective combination of the characteristics of the interaction prediction models can further improve the accuracy of predicting the interaction between RNA and proteins.
  • RNA-protein interaction prediction device 600 may include a data acquisition module 610, a feature extraction module 620, a data vectorization module 630, an interaction prediction module 640 and an interaction determination module 650, wherein:
  • a data acquisition module 610 configured to acquire the RNA-protein pair to be predicted
  • Feature extraction module 620 configured to perform feature extraction on the RNA-protein pair to be predicted, to obtain sequence features of the RNA-protein pair to be predicted;
  • the data vectorization module 630 is used to vectorize the RNA-protein pair to be predicted, and obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
  • the interaction prediction module 640 is configured to use an interaction prediction model to obtain the interaction prediction value of the RNA-protein pair to be predicted, based on the sequence features of the pair and on the RNA sequence representation vector and the protein sequence representation vector in the pair;
  • An interaction determination module 650 configured to determine the interaction between the RNA and the protein according to the interaction prediction value.
  • the feature extraction module 620 includes:
  • the feature set acquisition module is used to obtain the original sequence feature set
  • a feature determination module configured to determine the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
  • the feature determination module includes:
  • a sequence conversion unit for converting the RNA sequence and protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively;
  • the first sequence search unit is configured to search for each k-mer subsequence in the original sequence feature set, and obtain the sequence feature of the RNA-protein pair to be predicted according to the search result.
  • the feature determination module also includes:
  • a sequence conversion unit for converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
  • a sequence combination unit for combining the RNA k-mer subsequence and the protein k-mer subsequence to obtain a variety of RNA-protein k-mer subsequence pairs;
  • the second sequence search unit is configured to search for each RNA-protein k-mer subsequence pair in the original sequence feature set, and obtain the sequence feature of the RNA-protein pair to be predicted according to the search result.
  • the feature determination module also includes:
  • a sequence conversion unit for converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
  • the first sequence search unit is used to search for each k-mer subsequence in the original sequence feature set to obtain the first sequence feature;
  • a sequence combination unit for combining the RNA k-mer subsequence and the protein k-mer subsequence to obtain a variety of RNA-protein k-mer subsequence pairs;
  • a second sequence search unit configured to search for each RNA-protein k-mer subsequence pair in the original sequence feature set to obtain a second sequence feature
  • a feature splicing unit configured to form the sequence feature of the RNA-protein pair to be predicted from the first sequence feature and the second sequence feature.
  • the data vectorization module 630 includes:
  • a sequence conversion unit for converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including M RNA k-mer subsequences and N protein k-mer subsequences;
  • the first vectorization unit is used to vectorize each RNA k-mer subsequence to obtain M RNA k-mer vectors;
  • the first splicing unit is used to splice the M RNA k-mer vectors to obtain the RNA sequence representation vector;
  • the second vectorization unit is used to vectorize each protein k-mer subsequence to obtain N protein k-mer vectors;
  • the second splicing unit is used to splice the N protein k-mer vectors to obtain the protein sequence representation vector.
  • the data vectorization module 630 also includes:
  • a base vector acquisition unit configured to vectorize each base included in the RNA sequence in the RNA-protein pair to be predicted, to obtain multiple base vectors
  • a base vector splicing unit configured to splice the plurality of base vectors to obtain the RNA sequence representation vector
  • An amino acid vector acquisition unit configured to vectorize each amino acid included in the protein sequence in the RNA-protein pair to be predicted, to obtain multiple amino acid vectors
  • the amino acid vector splicing unit is used for splicing the plurality of amino acid vectors to obtain the protein sequence representation vector.
  • the interaction prediction module 640 is configured to use at least three interaction prediction models to obtain multiple interaction prediction values of the RNA-protein pair to be predicted, based on the sequence features of the pair and on the RNA sequence representation vector and the protein sequence representation vector in the pair.
  • the interaction prediction module 640 includes:
  • the first prediction unit is configured to input the sequence features of the RNA-protein pair to be predicted into the first interaction prediction model to obtain the first interaction prediction value;
  • the second prediction unit is configured to input the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into the second interaction prediction model to obtain a second interaction prediction value;
  • At least one first interaction prediction model and at least two second interaction prediction models are included; or at least two first interaction prediction models and at least one second interaction prediction model are included.
  • the interaction prediction module 640 also includes:
  • the third prediction unit is used to input the sequence features of the RNA-protein pair into the traditional machine learning model to obtain the first interaction prediction value;
  • the fourth prediction unit is used to input the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair into the deep learning model to obtain the second interaction prediction value;
  • At least one of the traditional machine learning models and at least two of the deep learning models are included; or, at least two of the traditional machine learning models and at least one of the deep learning models are included.
  • the traditional machine learning model includes at least one of a support vector machine model, a logistic regression model, and a decision tree model;
  • the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.
  • the interaction determining module 650 includes:
  • a predicted value marking unit configured to mark the multiple interaction predicted values to obtain multiple marked values
  • an interaction determination unit configured to sum the multiple marker values, and determine the interaction between the RNA and the protein according to the sum result.
  • the feature set acquisition module includes:
  • the data set acquisition module is used to obtain the original data set
  • a feature extraction module configured to perform feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set.
  • the feature extraction module includes:
  • the sequence generation unit is used to arrange and combine the basic units of RNA and protein to obtain k-mer subsequences
  • a variance calculation unit configured to calculate the average number of occurrences of each k-mer subsequence in each RNA-protein pair, and to calculate the variance of each k-mer subsequence according to that average;
  • the data set determination unit is configured to determine the original sequence feature set according to the variance of each k-mer subsequence.
  • the variance calculation unit includes:
  • a counting subunit for traversing the original data set to determine the number of occurrences of each of the k-mer subsequences in each of the RNA-protein pairs;
  • the total count subunit is used to tally the number of occurrences of each k-mer subsequence in each RNA-protein pair to obtain the total number of occurrences of each k-mer subsequence in the original data set;
  • the average number of times determination subunit is used to calculate the average number of occurrences of each k-mer subsequence in each RNA-protein pair according to the total number of occurrences;
  • a variance calculation subunit for calculating the variance of each k-mer subsequence according to the average number of occurrences of the k-mer subsequence in each RNA-protein pair and its number of occurrences in each RNA-protein pair.
  • the variance calculation subunit is configured to calculate the variance of each k-mer subsequence, where:
  • n is the number of RNA-protein pairs in the original data set;
  • m is the average number of occurrences of the k-mer subsequence in each RNA-protein pair;
  • x_n is the number of occurrences of the k-mer subsequence in the n-th RNA-protein pair.
  • the data set determination unit is configured to determine, according to the variance of each k-mer subsequence, the k-mer subsequences that satisfy a preset condition, and the k-mer subsequences that satisfy the preset condition constitute the original sequence feature set.
  • the feature extraction module also includes:
  • a subsequence pair generating unit configured to convert the RNA sequence and the protein sequence in each RNA-protein pair into a k-mer subsequence, respectively, to obtain a k-mer subsequence pair;
  • the feature set acquisition unit is used to count the frequency of occurrence of each k-mer subsequence pair in the original data set, and the original sequence feature set is composed of k-mer subsequence pairs satisfying the preset condition of the frequency of occurrence.
  • the RNA-protein interaction prediction device 600 also includes:
  • a training module is used to train the interaction prediction model.
  • the training module includes:
  • a training data acquisition unit configured to acquire a training data set that includes positive-example RNA-protein pairs and negative-example RNA-protein pairs;
  • a model parameter training unit configured to use the training data set as the input of the interaction prediction model and iteratively update the model parameters of the interaction prediction model, the training of all model parameters being complete when the iteration termination condition is met, so that the trained interaction prediction model can be used to predict the interaction of the RNA-protein pair to be predicted.
  • the RNA-protein interaction prediction device 600 also includes:
  • the data output module is used for outputting the prediction result of the interaction between the RNA and the protein.
  • Each module in the above-mentioned device can be a general-purpose processor, including a central processing unit, a network processor, etc.; it can also be a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Each module may also be implemented in software, firmware, or other forms. The processors in the above device may be independent processors or may be integrated together.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-mentioned method in this specification is stored.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code.
  • When the program product is run on an electronic device, the program code causes the electronic device to execute the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section above in this specification.
  • the program product may take the form of a portable compact disc read-only memory (CD-ROM) and include program code, and may run on an electronic device, such as a personal computer.
  • the program product of the present disclosure is not limited thereto.
  • a readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, apparatus or device.
  • a program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer readable signal medium may include a data signal carrying readable program code in baseband or as part of a carrier wave. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium may also be any readable medium other than a readable storage medium that can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
  • Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
  • An electronic device 700 according to such an exemplary embodiment of the present disclosure is described below with reference to FIG. 7 .
  • the electronic device 700 shown in FIG. 7 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 700 may take the form of a general-purpose computing device.
  • Components of the electronic device 700 may include, but are not limited to: at least one processing unit 710 , at least one storage unit 720 , a bus 730 connecting different system components (including the storage unit 720 and the processing unit 710 ), and a display unit 740 .
  • the storage unit 720 stores program codes, which can be executed by the processing unit 710, so that the processing unit 710 executes the steps described in the "Exemplary Methods" section above in this specification according to various exemplary embodiments of the present disclosure.
  • the processing unit 710 may execute any one or more method steps in FIG. 2 to FIG. 7 .
  • the storage unit 720 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 721 and/or a cache storage unit 722 , and may further include a read-only storage unit (ROM) 723 .
  • Storage unit 720 may also include a program/utility tool 724 having a set (at least one) of program modules 725, such program modules 725 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • Bus 730 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 700 can also communicate with one or more external devices 800 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through the input/output (I/O) interface 750 .
  • the electronic device 700 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through the network adapter 760 . As shown, the network adapter 760 communicates with other modules of the electronic device 700 through the bus 730 .
  • other hardware and/or software modules may be used in conjunction with electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the RNA-protein interaction prediction method described in this disclosure can be performed by the processing unit 710 of the electronic device.
  • the RNA-protein pair to be predicted/RNA sequence to be predicted/protein sequence to be predicted, raw data sets, and training data sets used to train each interaction prediction model can be input through the input interface 750.
  • the RNA-protein pair to be predicted, the original data set, and the training data set used to train each interaction prediction model are input through the user interface of the electronic device.
  • the prediction result of the to-be-predicted RNA-protein interaction can be output to the external device 800 through the output interface 750 for viewing by the user.
  • the technical solutions according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a terminal device, or a network device) to execute the method according to the exemplary embodiments of the present disclosure.

Abstract

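The present disclosure provides an RNA-protein interaction prediction method, device, medium, and electronic device, and relates to the field of artificial intelligence technology. The method includes: obtaining an RNA-protein pair to be predicted; performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the pair; vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector in the pair; using an interaction prediction model to obtain an interaction prediction value of the pair based on its sequence features, RNA sequence representation vector, and protein sequence representation vector; and determining the interaction between the RNA and the protein according to the interaction prediction value.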

Description

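RNA-Protein Interaction Prediction Method, Device, Medium, and Electronic Device
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to an RNA-protein interaction prediction method, an RNA-protein interaction prediction device, a computer-readable storage medium, and an electronic device.
Background
Noncoding RNA (ncRNA) participates in many complex cellular processes, plays an important role in life processes such as alternative splicing, chromatin modification, and epigenetics, and is closely associated with many diseases. Studies have shown that most noncoding RNAs carry out their regulatory functions by interacting with proteins. Studying the interactions between noncoding RNAs and proteins is therefore of great significance for revealing the molecular mechanisms of noncoding RNAs in human diseases and life activities, and has become one of the important approaches for analyzing the functions of noncoding RNAs and proteins.
It should be noted that the information disclosed in the Background section above is intended only to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.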
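Summary
The present disclosure provides an RNA-protein interaction prediction method, an RNA-protein interaction prediction device, a computer-readable storage medium, and an electronic device.
The present disclosure provides an RNA-protein interaction prediction method, including:
obtaining an RNA-protein pair to be predicted;
performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted;
vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector in the RNA-protein pair to be predicted;
using an interaction prediction model to obtain an interaction prediction value of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
determining the interaction between the RNA and the protein according to the interaction prediction value.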
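In an exemplary embodiment of the present disclosure, performing feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted includes:
obtaining an original sequence feature set;
determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.
In an exemplary embodiment of the present disclosure, determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively;
searching for each k-mer subsequence in the original sequence feature set, and obtaining the sequence features of the RNA-protein pair to be predicted according to the search result.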
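In an exemplary embodiment of the present disclosure, determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
combining the RNA k-mer subsequences and the protein k-mer subsequences to obtain multiple RNA-protein k-mer subsequence pairs;
searching for each RNA-protein k-mer subsequence pair in the original sequence feature set, and obtaining the sequence features of the RNA-protein pair to be predicted according to the search result.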
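In an exemplary embodiment of the present disclosure, determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set includes:
converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
searching for each k-mer subsequence in the original sequence feature set to obtain a first sequence feature;
combining the RNA k-mer subsequences and the protein k-mer subsequences to obtain multiple RNA-protein k-mer subsequence pairs;
searching for each RNA-protein k-mer subsequence pair in the original sequence feature set to obtain a second sequence feature;
forming the sequence features of the RNA-protein pair to be predicted from the first sequence feature and the second sequence feature.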
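As a non-limiting illustration, the lookup-and-combine embodiment above can be sketched as follows. The sketch assumes that the "search result" is a 0/1 presence flag per entry of the original sequence feature set, that proteins are already encoded as class digits (so RNA and protein k-mers cannot collide), and that the names `kmer_set`, `sequence_features`, `single_features`, and `pair_features` are hypothetical; the disclosure does not fix any of these details.

```python
def kmer_set(sequence, k):
    """Distinct overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def sequence_features(rna, protein, single_features, pair_features, k=3):
    """Build the first and second sequence features and join them (assumed 0/1 flags)."""
    rna_kmers = kmer_set(rna, k)
    protein_kmers = kmer_set(protein, k)
    # First sequence feature: one flag per single k-mer in the feature set.
    first = [int(f in rna_kmers or f in protein_kmers) for f in single_features]
    # Second sequence feature: one flag per (RNA k-mer, protein k-mer) pair.
    second = [int(r in rna_kmers and p in protein_kmers) for r, p in pair_features]
    return first + second

# Toy feature set, invented for illustration only.
single_features = ["AGA", "CCC", "124"]
pair_features = [("AGA", "124"), ("GAU", "777")]
print(sequence_features("AGAUGG", "124611", single_features, pair_features))
# [1, 0, 1, 1, 0]
```

A count-based feature (how many times each entry occurs) would fit the same text equally well; the flag form is chosen here only for brevity.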
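In an exemplary embodiment of the present disclosure, vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted includes:
converting the RNA sequence and the protein sequence in the RNA-protein pair to be predicted into k-mer subsequences respectively, the k-mer subsequences including M RNA k-mer subsequences and N protein k-mer subsequences;
vectorizing each RNA k-mer subsequence to obtain M RNA k-mer vectors;
splicing the M RNA k-mer vectors to obtain the RNA sequence representation vector;
vectorizing each protein k-mer subsequence to obtain N protein k-mer vectors;
splicing the N protein k-mer vectors to obtain the protein sequence representation vector.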
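The split, vectorize, and splice steps above can be sketched as follows for the RNA side. The disclosure does not state how a single k-mer is vectorized, so this sketch assumes a plain one-hot code per symbol; a learned embedding would slot into `one_hot_kmer` in the same way. The function names are hypothetical.

```python
RNA_BASES = "AUGC"

def one_hot_kmer(kmer, alphabet=RNA_BASES):
    """Vectorize one k-mer as the concatenated one-hot codes of its symbols (assumed scheme)."""
    vector = []
    for symbol in kmer:
        vector.extend(1 if symbol == candidate else 0 for candidate in alphabet)
    return vector

def sequence_vector(sequence, k, alphabet=RNA_BASES):
    """Split into M overlapping k-mers, vectorize each, and splice the M vectors end to end."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    vector = []
    for kmer in kmers:
        vector.extend(one_hot_kmer(kmer, alphabet))
    return vector

rna_vector = sequence_vector("AGAU", 3)   # M = 2 k-mers: AGA, GAU
print(len(rna_vector))                    # 2 k-mers * 3 symbols * 4 bases = 24
```

The protein side would follow the same pattern with a protein alphabet, for example `sequence_vector("124611", 3, "1234567")` under the 7-class encoding.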
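In an exemplary embodiment of the present disclosure, vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted includes:
vectorizing each base contained in the RNA sequence in the RNA-protein pair to be predicted to obtain multiple base vectors;
splicing the multiple base vectors to obtain the RNA sequence representation vector;
vectorizing each amino acid contained in the protein sequence in the RNA-protein pair to be predicted to obtain multiple amino acid vectors;
splicing the multiple amino acid vectors to obtain the protein sequence representation vector.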
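In an exemplary embodiment of the present disclosure, using the interaction prediction model to obtain the interaction prediction value of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, includes:
using at least three interaction prediction models to obtain multiple interaction prediction values of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted.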
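In an exemplary embodiment of the present disclosure, using the interaction prediction models to obtain the multiple interaction prediction values of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, includes:
inputting the sequence features of the RNA-protein pair to be predicted into a first interaction prediction model to obtain a first interaction prediction value;
inputting the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted into a second interaction prediction model to obtain a second interaction prediction value;
wherein at least one first interaction prediction model and at least two second interaction prediction models are included; or at least two first interaction prediction models and at least one second interaction prediction model are included.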
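In an exemplary embodiment of the present disclosure, using the interaction prediction models to obtain the multiple interaction prediction values of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted, includes:
inputting the sequence features of the RNA-protein pair to be predicted into a traditional machine learning model to obtain a first interaction prediction value;
inputting the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted into a deep learning model to obtain a second interaction prediction value;
wherein at least one traditional machine learning model and at least two deep learning models are included; or at least two traditional machine learning models and at least one deep learning model are included.
In an exemplary embodiment of the present disclosure, the traditional machine learning model includes at least one of a support vector machine model, a logistic regression model, and a decision tree model, and the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.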
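In an exemplary embodiment of the present disclosure, determining the interaction between the RNA and the protein according to the interaction prediction values includes:
marking the multiple interaction prediction values to obtain multiple marker values;
summing the multiple marker values, and determining the interaction between the RNA and the protein according to the summation result.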
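One way the mark-and-sum decision above could work is sketched below. The marking scheme is an assumption made for illustration (a prediction of 1 is marked +1, a prediction of 0 is marked -1, and a positive sum means interaction); this passage does not fix the marker values or the decision threshold, and `decide_interaction` is a hypothetical name.

```python
def decide_interaction(predictions):
    """Mark each 0/1 prediction as -1/+1 (assumed scheme), sum the marks, and test the sign."""
    marks = [1 if value == 1 else -1 for value in predictions]
    return sum(marks) > 0

# Three models vote; two of them predict an interaction.
print(decide_interaction([1, 0, 1]))  # True: marks +1, -1, +1 sum to 1
print(decide_interaction([0, 0, 1]))  # False: marks -1, -1, +1 sum to -1
```

With an odd number of models this reduces to a majority vote; with an even number, a sum of zero would need a tie-breaking rule that this sketch resolves as "no interaction".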
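In an exemplary embodiment of the present disclosure, obtaining the original sequence feature set includes:
obtaining an original data set;
performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set.
In an exemplary embodiment of the present disclosure, performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set includes:
permuting and combining the basic units of RNA and of protein respectively to obtain k-mer subsequences;
calculating the average number of occurrences of each k-mer subsequence in each RNA-protein pair, and calculating the variance of each k-mer subsequence according to that average;
determining the original sequence feature set according to the variance of each k-mer subsequence.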
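In an exemplary embodiment of the present disclosure, calculating the average number of occurrences of each k-mer subsequence in each RNA-protein pair, and calculating the variance of each k-mer subsequence according to that average, includes:
traversing the original data set to determine the number of occurrences of each k-mer subsequence in each RNA-protein pair;
tallying the number of occurrences of each k-mer subsequence in each RNA-protein pair to obtain the total number of occurrences of each k-mer subsequence in the original data set;
calculating the average number of occurrences of each k-mer subsequence in each RNA-protein pair according to the total number of occurrences;
calculating the variance of each k-mer subsequence according to the average number of occurrences of the k-mer subsequence in each RNA-protein pair and its number of occurrences in each RNA-protein pair.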
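In an exemplary embodiment of the present disclosure, calculating the variance of each k-mer subsequence according to the average number of occurrences of the k-mer subsequence in each RNA-protein pair and its number of occurrences in each RNA-protein pair includes:
according to:
Figure PCTCN2021121103-appb-000001
calculating the variance s^2 of each k-mer subsequence, where n is the number of RNA-protein pairs in the original data set, m is the average number of occurrences of the k-mer subsequence in each RNA-protein pair, and x_n is the number of occurrences of the k-mer subsequence in the n-th RNA-protein pair.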
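The variance formula itself is an image that is not reproduced in this text. As a hedged sketch only, the code below assumes the usual population form, that is, the squared deviations of x_1 ... x_n from m summed and divided by n; a divisor of n - 1 would be the other common convention and cannot be ruled out from the text alone. The name `kmer_variance` is hypothetical.

```python
def kmer_variance(counts):
    """Variance of per-pair occurrence counts x_1..x_n (divisor n is an assumption)."""
    n = len(counts)                  # number of RNA-protein pairs
    m = sum(counts) / n              # average number of occurrences per pair
    return sum((x - m) ** 2 for x in counts) / n

# Toy counts of one k-mer in four RNA-protein pairs, invented for illustration.
print(kmer_variance([1, 2, 0, 1]))  # 0.5
```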
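In an exemplary embodiment of the present disclosure, determining the original sequence feature set according to the variance of each k-mer subsequence includes:
determining, according to the variance of each k-mer subsequence, the k-mer subsequences that satisfy a preset condition, and forming the original sequence feature set from the k-mer subsequences that satisfy the preset condition.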
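The preset condition is not specified in this passage. As one possible reading, the sketch below keeps the `top_n` k-mer subsequences with the largest variance; a fixed variance threshold would be an equally plausible condition. `select_features`, `top_n`, and the toy variances are all hypothetical.

```python
def select_features(variances, top_n):
    """Keep the top_n k-mers with the largest variance (one possible preset condition)."""
    # Sort by descending variance; ties are broken alphabetically for determinism.
    ranked = sorted(variances.items(), key=lambda item: (-item[1], item[0]))
    return [kmer for kmer, _ in ranked[:top_n]]

# Toy variances, invented for illustration only.
toy_variances = {"AGA": 0.50, "GAU": 0.10, "111": 0.25, "112": 0.0}
print(select_features(toy_variances, 2))  # ['AGA', '111']
```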
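In an exemplary embodiment of the present disclosure, performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further includes:
converting the RNA sequence and the protein sequence in each RNA-protein pair into k-mer subsequences respectively to obtain k-mer subsequence pairs;
counting the frequency of occurrence of each k-mer subsequence pair in the original data set, and forming the original sequence feature set from the k-mer subsequence pairs whose frequency of occurrence satisfies a preset condition.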
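A minimal sketch of the pair-frequency embodiment above follows. It assumes that "frequency of occurrence" means the fraction of RNA-protein pairs in the data set that contain a given (RNA k-mer, protein k-mer) combination, and that the preset condition is a minimum fraction; both are assumptions, as are the names `distinct_kmers`, `frequent_kmer_pairs`, and `min_fraction`.

```python
from collections import Counter
from itertools import product

def distinct_kmers(sequence, k):
    """Distinct overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def frequent_kmer_pairs(pairs, k, min_fraction):
    """Keep k-mer pairs present in at least min_fraction of the RNA-protein pairs."""
    seen = Counter()
    for rna, protein in pairs:
        # Every RNA k-mer of this pair combined with every protein k-mer of this pair.
        seen.update(product(distinct_kmers(rna, k), distinct_kmers(protein, k)))
    return {pair for pair, count in seen.items() if count / len(pairs) >= min_fraction}

# Toy data set: (RNA sequence, class-encoded protein sequence), invented for illustration.
toy_pairs = [("AGAUGG", "124611"), ("AGAAGA", "111246")]
print(sorted(frequent_kmer_pairs(toy_pairs, 3, 1.0)))
# [('AGA', '124'), ('AGA', '246')]
```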
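In an exemplary embodiment of the present disclosure, the method further includes:
training the interaction prediction model.
In an exemplary embodiment of the present disclosure, training the interaction prediction model includes:
obtaining a training data set that includes positive-example RNA-protein pairs and negative-example RNA-protein pairs;
using the training data set as the input of the interaction prediction model and iteratively updating the model parameters of the interaction prediction model, the training of all model parameters being complete when an iteration termination condition is met, so that the trained interaction prediction model can be used to predict the interaction of the RNA-protein pair to be predicted.
In an exemplary embodiment of the present disclosure, the method further includes:
outputting the prediction result of the interaction between the RNA and the protein.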
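The present disclosure provides an RNA-protein interaction prediction device, including:
a data acquisition module for obtaining an RNA-protein pair to be predicted;
a feature extraction module for performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted;
a data vectorization module for vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector in the RNA-protein pair to be predicted;
an interaction prediction module for using an interaction prediction model to obtain an interaction prediction value of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
an interaction determination module for determining the interaction between the RNA and the protein according to the interaction prediction value.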
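The present disclosure provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the method of any one of the above when executed by a processor.
The present disclosure provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the method of any one of the above by executing the executable instructions.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.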
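Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure. Obviously, the drawings described below show only some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 shows a schematic diagram of an exemplary system architecture to which an RNA-protein interaction prediction method and device according to an embodiment of the present disclosure can be applied;
FIG. 2 schematically shows a flowchart of an RNA-protein interaction prediction method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of determining the sequence features of the RNA-protein pair to be predicted according to an embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of obtaining the original sequence feature set according to an embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of training the interaction prediction model according to an embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of an RNA-protein interaction prediction device according to an embodiment of the present disclosure;
FIG. 7 shows a schematic structural diagram of a computer system of an electronic device suitable for implementing an embodiment of the present disclosure.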
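Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present disclosure. Those skilled in the art will appreciate, however, that the technical solutions of the present disclosure may be practiced while omitting one or more of the specific details, or that other methods, components, devices, steps, and the like may be employed. In other instances, well-known technical solutions are not shown or described in detail, to avoid obscuring aspects of the present disclosure.
In addition, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated description of them will therefore be omitted. Some of the block diagrams shown in the drawings are functional entities that do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.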
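FIG. 1 shows a schematic diagram of the system architecture of an exemplary application environment to which an RNA-protein interaction prediction method and device according to an embodiment of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables. The terminal devices 101, 102, 103 may be various electronic devices, including but not limited to desktop computers, portable computers, smartphones, and tablet computers. It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers as required by the implementation. For example, the server 105 may be a single server, a server cluster composed of multiple servers, a cloud computing platform, or a virtualization center. Specifically, the server 105 may be used to: obtain an RNA-protein pair to be predicted; perform feature extraction on the RNA-protein pair to obtain the sequence features of the RNA-protein pair to be predicted; vectorize the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted; use an interaction prediction model to obtain the interaction prediction value of the RNA-protein pair, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the pair; and determine the interaction between the RNA and the protein according to the interaction prediction value.
The RNA-protein interaction prediction method provided by the embodiments of the present disclosure is generally executed by the server 105, and accordingly the RNA-protein interaction prediction device is generally provided in the server 105; the server can send the prediction result for the interaction of the RNA-protein pair to be predicted to a terminal device, which presents it to the user. However, those skilled in the art will readily understand that the RNA-protein interaction prediction method provided by the embodiments of the present disclosure can also be executed by one or more of the terminal devices 101, 102, and 103, and accordingly the RNA-protein interaction prediction device can also be provided in the terminal devices 101, 102, 103. For example, after execution by a terminal device, the prediction result can be displayed directly on the display screen of the terminal device or provided to the user by voice broadcast, which is not specifically limited in this exemplary embodiment.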
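The technical solutions of the embodiments of the present disclosure are described in detail below:
At present, experimental methods can be used to study noncoding RNA-protein interactions (ncRPI). Traditional experimental methods can yield valuable data from which an ncRNA-protein interaction network can be constructed, but they are expensive and time-consuming.
This example embodiment provides an RNA-protein interaction prediction method, which can be applied to the server 105 described above or to one or more of the terminal devices 101, 102, and 103, which is not specifically limited in this exemplary embodiment. Referring to FIG. 2, the RNA-protein interaction prediction method may include the following steps S210 to S250: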
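Step S210. Obtain the RNA-protein pair to be predicted;
Step S220. Perform feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted;
Step S230. Vectorize the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
Step S240. Use an interaction prediction model to obtain the interaction prediction value of the RNA-protein pair to be predicted, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector in the RNA-protein pair to be predicted;
Step S250. Determine the interaction between the RNA and the protein according to the interaction prediction value.
In the RNA-protein interaction prediction method provided by the example embodiments of the present disclosure, the RNA-protein pair to be predicted is obtained; feature extraction is performed on the pair to obtain its sequence features; the pair is vectorized to obtain the RNA sequence representation vector and the protein sequence representation vector in the pair; based on the sequence features, the RNA sequence representation vector, and the protein sequence representation vector, an interaction prediction model is used to obtain the interaction prediction value of the pair; and the interaction between the RNA and the protein is determined according to the interaction prediction value. On the one hand, performing feature extraction and vectorization on the RNA-protein pair makes it possible to fully mine the relationship between the RNA sequence and the protein sequence, so that the interaction between RNA and protein can be predicted accurately; on the other hand, effectively combining the characteristics of the interaction prediction models can further improve the accuracy of predicting the interaction between RNA and proteins.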
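The above steps of this example embodiment are described in more detail below.
In step S210, the RNA-protein pair to be predicted is obtained.
In this example embodiment, at least one RNA-protein pair to be predicted can be obtained, and the interaction between the RNA and the protein in each pair to be predicted is unknown. For example, the user can input the RNA-protein pair to be predicted through a terminal device. The pair can be entered manually or by voice, which is not specifically limited in this example. For example, an RNA can be entered and then a protein, and the order in which the two are entered is not limited. The RNA and the protein can be entered into different text boxes or into the same text box. For example, after the input is complete, clicking the "Start prediction" button starts the prediction steps provided in some embodiments of the present application.
Here, the interaction between RNA and protein refers to the fact that the function of a protein is manifested in its interactions with other proteins and with RNA. For example, the interaction between protein and RNA plays an important role in protein synthesis. At the same time, many functions of RNA also depend on interactions with proteins. The interaction may be a regulatory effect, a guiding effect, and so on, which is not limited here; for example, where an interaction exists, the RNA can guide the synthesis of the protein, or the RNA can regulate the functioning of the protein. The interaction between RNA and protein can also mean that the two regulate each other's life cycle and function through physical interaction. For example, an RNA coding sequence can guide the synthesis of a protein and, correspondingly, the protein can regulate the expression and function of the RNA.
After the RNA-protein pairs to be predicted are obtained, multiple interaction prediction models can be used to predict the interaction of each input pair, and whether each pair interacts is determined according to the prediction results. The prediction result for the interaction of the RNA-protein pair to be predicted can also be output to the terminal device for the user to view. For example, the prediction result can be displayed directly on the display screen of the terminal device or provided to the user by voice broadcast, which is not specifically limited in this example.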
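In other examples, at least one RNA sequence to be predicted may be obtained instead, and the interaction prediction model may be used to search a database for protein sequences that interact with each input RNA sequence to be predicted. For example, after the user inputs the RNA sequence to be predicted through the terminal device, at least one protein sequence in the database can be selected, and the RNA sequence to be predicted is combined with each protein sequence to form multiple RNA-protein pairs; the interaction of each RNA-protein pair can then be predicted by the interaction prediction model, and the protein sequences that can interact with the RNA sequence to be predicted are output according to the prediction results. Preferably, several kinds of protein sequences can be stored in the database in advance so that they can be retrieved when predicting the interaction of an RNA-protein pair. For example, the protein sequences can be stored in a Redis database or in a MySQL database, so that the protein sequences to be predicted can be queried and selected in real time. Redis is a key-value storage system; when protein sequences are stored in a Redis database, each entry can be a key-value pair formed by a sequence identifier and the corresponding protein sequence, where the key is the sequence identifier and the value is the corresponding protein sequence. As an efficient caching technology, Redis can support more than 100K reads and writes per second, which gives it an advantage in the speed of reading and storing data. MySQL is a relational database management system; a relational database stores data in separate tables rather than in a single store, which increases speed and flexibility, offers stable data storage, and helps avoid data loss.
It can be understood that several kinds of RNA sequences can likewise be stored in the database in advance so that they can be retrieved when predicting the interaction of an RNA-protein pair. Accordingly, at least one protein sequence to be predicted can also be obtained, and the interaction prediction model can be used to search the database for RNA sequences that interact with each input protein sequence to be predicted. Similarly, after the user inputs the protein sequence through the terminal device, at least one RNA sequence in the database can be selected, and the protein sequence to be predicted is combined with each RNA sequence to form multiple RNA-protein pairs; the interaction of each RNA-protein pair can then be predicted by the interaction prediction model, and the RNA sequences that can interact with the protein sequence to be predicted are output according to the prediction results, which is not specifically limited in the present disclosure.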
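The database search described above can be sketched as follows. The sketch uses a plain Python dict as a stand-in for the key-value store (sequence identifier as key, sequence as value) and a placeholder predictor; `find_partners`, `protein_store`, and `toy_predict` are hypothetical names, and `toy_predict` is not a real interaction model.

```python
def find_partners(query_rna, protein_store, predict):
    """Pair the query RNA with every stored protein and keep those predicted to interact."""
    return [
        sequence_id
        for sequence_id, protein in protein_store.items()
        if predict(query_rna, protein) == 1   # 1 means "interaction", 0 means "none"
    ]

# Stand-in for the database: sequence identifier -> protein sequence (toy data).
protein_store = {"P1": "ALQDVG", "P2": "CCKR"}

def toy_predict(rna, protein):
    """Placeholder rule for illustration only; a trained model would go here."""
    return 1 if len(protein) > 4 else 0

print(find_partners("AGAUGG", protein_store, toy_predict))  # ['P1']
```

Searching for RNA partners of an input protein would use the same loop with the roles of the two sequences swapped.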
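In step S220, feature extraction is performed on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted.
In an example embodiment, obtaining at least one RNA-protein pair to be predicted and predicting its interaction is taken as an example. Before the interaction of each RNA-protein pair to be predicted is predicted by the interaction prediction model, the input features of the interaction prediction model need to be obtained. For example, feature extraction can be performed on the RNA-protein pair to be predicted, that is, on the RNA sequence and the protein sequence in the pair in turn, to obtain the corresponding RNA sequence features and protein sequence features; the RNA sequence features and the protein sequence features together form the sequence features of the RNA-protein pair to be predicted, which can be used as the input of the interaction prediction model. The RNA-protein pair to be predicted can also be vectorized, that is, the RNA sequence and the protein sequence in the pair are each represented as a vector to obtain the corresponding RNA sequence representation vector and protein sequence representation vector, which are each used as inputs of the interaction prediction model. Feature extraction and vectorization can also both be performed on the RNA-protein pair to be predicted, to obtain the sequence features of the pair together with the RNA sequence representation vector and the protein sequence representation vector in the pair, all of which can be used as inputs of the interaction prediction model, which is not specifically limited in the present disclosure.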
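Referring to FIG. 3, feature extraction can be performed on the RNA-protein pair to be predicted according to steps S310 and S320.
In step S310, the original sequence feature set is obtained.
In an example embodiment, an original data set can be obtained, and feature extraction can be performed on each RNA-protein pair in it to obtain the original sequence feature set. For example, the RPI1807 data set can be used as the original data set; it can contain 3243 RNA-protein pairs, of which 1807 are positive examples and 1436 are negative examples. A positive example indicates that the RNA and the protein in the pair interact, and a negative example indicates that they do not. It can be understood that in other examples the RPI2241 data set, the RPI369 data set, and the like can also be used as the original data set for experiments, which is not specifically limited in the present disclosure.
After the original data set is obtained, referring to FIG. 4, feature extraction can be performed on the RNA-protein pairs in the original data set according to steps S410 to S430 to obtain the original sequence feature set.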
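Step S410. Permute and combine the basic units of RNA and of protein respectively to obtain k-mer subsequences.
For example, bases are taken as the basic units of RNA. An RNA sequence can include 4 kinds of bases: adenine (A), uracil (U), guanine (G), and cytosine (C). The 4 bases can be permuted and combined to obtain all k-mer subsequences of RNA sequences. Similarly, amino acids are taken as the basic units of protein. A protein sequence can include 20 kinds of amino acids, which are encoded in turn as A, G, V, I, L, F, P, Y, M, T, S, H, N, Q, W, R, K, D, E, and C. For example, the 20 amino acids can first be divided into 7 classes according to their physicochemical properties, namely {A,G,V}, {I,L,F,P}, {Y,M,T,S}, {H,N,Q,W}, {R,K}, {D,E}, and {C}, and each class can be re-encoded, for instance as 1, 2, 3, 4, 5, 6, and 7 in turn. For example, the protein sequence ALQDVG can be converted to 124611. The 7 amino acid classes can then be permuted and combined to obtain all k-mer subsequences of amino acid sequences. In other examples, the 20 amino acids can also be classified according to their composition, or the k-mer subsequences of amino acid sequences can be obtained by permuting and combining the 20 amino acids directly without classification, which is not specifically limited in the present disclosure.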
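The 7-class re-encoding described above can be sketched as follows. The class membership and the digits 1 to 7 come from the text; the names `AMINO_ACID_CLASSES`, `CLASS_OF`, and `encode_protein` are hypothetical helpers introduced for illustration.

```python
# The 7 physicochemical classes listed in the text, in order (class 1 .. class 7).
AMINO_ACID_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each amino acid letter to the digit of its class.
CLASS_OF = {
    residue: str(index)
    for index, members in enumerate(AMINO_ACID_CLASSES, start=1)
    for residue in members
}

def encode_protein(sequence):
    """Replace every amino acid letter with its class digit (1-7)."""
    return "".join(CLASS_OF[residue] for residue in sequence.upper())

print(encode_protein("ALQDVG"))  # 124611, the example given in the text
```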
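Here, a k-mer subsequence is a k-tuple formed by a group of k bases or k amino acid classes. Accordingly, in the example embodiments of the present disclosure, k-mer subsequences may include RNA k-mer subsequences and protein k-mer subsequences. For example, an RNA k-mer subsequence is obtained by arranging and combining the 4 bases, so a given k yields $4^k$ kinds of k-mer subsequences. A protein k-mer subsequence is obtained by arranging and combining the 7 amino acid classes, so a given k yields $7^k$ kinds of k-mer subsequences. It can be understood that dividing the 20 amino acids into 7 classes is only illustrative, and the classification may be omitted. Similarly, the 4 bases of an RNA sequence may also be classified according to actual needs.
In the example embodiments of the present disclosure, k may take one or more values, and the specific values of k may be adjusted according to the actual situation, which is not limited here. In one example, k takes the two values 3 and 4. RNA sequences then have $4^3 = 64$ kinds of 3-mer subsequences and $4^4 = 256$ kinds of 4-mer subsequences. Protein sequences have $7^3 = 343$ kinds of 3-mer subsequences and $7^4 = 2401$ kinds of 4-mer subsequences. For example, AAA and AUC are two 3-mer subsequences of an RNA sequence, and AAAA and AAAU are two 4-mer subsequences of an RNA sequence; 111 and 112 are two 3-mer subsequences of a protein sequence, and 1111 and 1122 are two 4-mer subsequences of a protein sequence. In other examples, k may take only the value 3 or only the value 4, which the present disclosure does not specifically limit.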
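As a quick check of the counts above, the full k-mer vocabularies can be enumerated with the standard library. This is only a sketch; `all_kmers` and the two alphabet constants are assumed names for illustration.

```python
from itertools import product

RNA_ALPHABET = "ACGU"         # the four bases
PROTEIN_ALPHABET = "1234567"  # the seven amino acid classes


def all_kmers(alphabet, k):
    """Every possible length-k string over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


for k in (3, 4):
    print(k, len(all_kmers(RNA_ALPHABET, k)), len(all_kmers(PROTEIN_ALPHABET, k)))
# 3 64 343
# 4 256 2401
```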
步骤S420.计算每种k-mer子序列在所述每个RNA-蛋白质对中出现次数的平均值,并根据所述出现次数的平均值计算所述每种k-mer子序列的方差。
一种示例实施方式中,可以根据步骤S410得到RNA序列和蛋白质序列的所有3-mer子序列和4-mer子序列,即包括RNA序列的64种3-mer子序列、256种4-mer子序列和蛋白质序列的343种3-mer子序列、2401种4-mer子序列。可以计算每种3-mer子序列或4-mer子序列在原始数据集每个RNA-蛋白质对中出现次数的平均值,并根据出现次数的平均值计算每种3-mer子序列或4-mer子序列的方差。其中,在计算每种3-mer子序列或4-mer子序 列在原始数据集每个RNA-蛋白质对中出现次数的平均值之前,需要将原始数据集中每个RNA-蛋白质对的RNA序列和蛋白质序列转化为3-mer子序列和4-mer子序列。例如,对于RNA序列“AGAUGG”,该序列的3-mer子序列可以包括“AGA”、“GAU”、“AUG”和“UGG”,该序列的4-mer子序列可以包括“AGAU”、“GAUG”和“AUGG”,也即可以通过正向重叠读取该RNA序列得到对应的3-mer子序列或4-mer子序列。类似的,也可以通过反向重叠读取该RNA序列得到对应的3-mer子序列或4-mer子序列。例如,该序列的3-mer子序列也可以包括“GGU”、“GUA”、“UAG”和“AGA”,该序列的4-mer子序列也可以包括“GGUA”、“GUAG”和“UAGA”。在一些实施方式中,还可以通过非重叠的方式读取该RNA序列得到对应的3-mer子序列或4-mer子序列,例如,该序列的3-mer子序列也可以包括“AGA”和“UGG”,本公开对此不做具体限定。
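The three reading modes described above (forward overlapping, reverse overlapping and non-overlapping) can be illustrated with one small helper. The function name `kmers` and the `mode` strings are assumptions made for this sketch.

```python
def kmers(seq, k, mode="forward"):
    """Split seq into k-mers.

    mode="forward":         overlapping windows read left to right
    mode="reverse":         overlapping windows read over the reversed sequence
    mode="non_overlapping": consecutive windows with stride k (a trailing remainder is dropped)
    """
    if mode == "reverse":
        seq = seq[::-1]
    step = k if mode == "non_overlapping" else 1
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


print(kmers("AGAUGG", 3))                     # ['AGA', 'GAU', 'AUG', 'UGG']
print(kmers("AGAUGG", 4))                     # ['AGAU', 'GAUG', 'AUGG']
print(kmers("AGAUGG", 3, "reverse"))          # ['GGU', 'GUA', 'UAG', 'AGA']
print(kmers("AGAUGG", 3, "non_overlapping"))  # ['AGA', 'UGG']
```

The outputs match the "AGAUGG" examples given in the paragraph above.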
示例性的,通过遍历原始数据集,可以确定每种3-mer子序列和/或4-mer子序列在每个RNA-蛋白质对中的出现次数。对每种3-mer子序列和/或4-mer子序列在每个RNA-蛋白质对中的出现次数进行统计可以得到该子序列在原始数据集中的出现总次数,根据出现总次数可以计算得到每种3-mer子序列和/或4-mer子序列在每个RNA-蛋白质对中出现频次的平均值。最后,可以根据每种3-mer子序列和/或4-mer子序列在每个RNA-蛋白质对中出现次数的平均值和在每个RNA-蛋白质对的出现次数计算每种子序列的方差。
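For example, the i-th k-mer subsequence may be a 3-mer subsequence of an RNA or protein sequence, or a 4-mer subsequence of an RNA or protein sequence. The total number of occurrences of the subsequence in the RPI1807 data set is counted first. For instance, by looping over the n RNA-protein pairs in the RPI1807 data set (n = 3243), the numbers of occurrences of the subsequence in the individual pairs, $x_1, x_2, \dots, x_n$, are obtained. Summing $x_1, x_2, \dots, x_n$ gives the total number of occurrences of the subsequence in the RPI1807 data set, denoted $num_i$. The average number of occurrences of the subsequence per RNA-protein pair, $m_i$, is then computed from $num_i$ according to:
$$m_i = \frac{num_i}{n}$$
This gives the average number of occurrences of the i-th k-mer subsequence in each RNA-protein pair. The variance of the subsequence can be computed from this average and from the number of occurrences in each RNA-protein pair, according to:
$$s^2 = \frac{(x_1 - m_i)^2 + (x_2 - m_i)^2 + \cdots + (x_n - m_i)^2}{n}$$
which gives the variance $s^2$ of the i-th k-mer subsequence. Here, n is the number of RNA-protein pairs in the RPI1807 data set, $m_i$ is the average number of occurrences of the subsequence per RNA-protein pair, and $x_n$ is the number of occurrences of the subsequence in the n-th RNA-protein pair; similarly, $x_1$ is the number of occurrences in the first RNA-protein pair and $x_2$ the number of occurrences in the second.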
步骤S430.根据所述每种k-mer子序列的方差大小确定所述原始序列特征集。
计算得到每种k-mer子序列的方差后,可以根据每种k-mer子序列的方差大小确定满足预设条件的k-mer子序列,并由满足预设条件的k-mer子序列组成原始序列特征集。示 例性的,可以分别将RNA序列所有的3-mer子序列、4-mer子序列和蛋白质序列所有的3-mer子序列、4-mer子序列根据方差大小进行排序,如降序排序,可以选取排名靠前的k-mer子序列组成原始序列特征集。例如,可以选取排名靠前的560种k-mer子序列,并由这560种k-mer子序列组成原始序列特征集。其中,可以包括前60种RNA序列的3-mer子序列、前200种RNA序列的4-mer子序列、前200种蛋白质序列的3-mer子序列、前100种蛋白质序列的4-mer子序列。可以理解的是,选取的k-mer子序列的数目仅仅是示意性的,根据实际需求可以选取任意数目的k-mer子序列。其他示例中,也可以预设方差阈值,将方差大于该阈值的k-mer子序列筛选出来,并由筛选得到的k-mer子序列组成原始序列特征集。例如,预设的方差阈值为3时,可以选取方差大于3的k-mer子序列组成原始序列特征集。需要说明的是,进行特征选择时,可以优选方差较大的特征,方差较大表示该特征的数据差异性较大,也即利用该特征可以更好的区分样本,进而可以提高相互作用预测模型的分类能力和预测能力。
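The variance-based selection of steps S420 and S430 can be sketched as follows. This is a minimal illustration under stated assumptions: the variance is taken as the mean squared deviation of the per-sequence counts (dividing by n), only k-mers that actually occur are scored (k-mers that never occur have variance 0 and would rank last anyway), and the names `kmer_counts` and `variance_ranked` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Overlapping k-mer counts of one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def variance_ranked(sequences, k):
    """Rank k-mers by the variance of their per-sequence occurrence counts."""
    n = len(sequences)
    per_seq = [kmer_counts(s, k) for s in sequences]
    vocab = set().union(*per_seq)            # every k-mer seen at least once
    scores = {}
    for kmer in vocab:
        xs = [c[kmer] for c in per_seq]      # x_1 .. x_n (a Counter returns 0 if absent)
        mean = sum(xs) / n                   # m_i = num_i / n
        scores[kmer] = sum((x - mean) ** 2 for x in xs) / n
    ranked = sorted(scores, key=lambda kmer: (-scores[kmer], kmer))
    return ranked, scores


ranked, scores = variance_ranked(["AAAA", "AAAU", "UUUU"], k=2)
print(ranked)  # ['UU', 'AA', 'AU']
```

Selecting the top entries of `ranked` (for example `ranked[:60]` for RNA 3-mers) or filtering `scores` against a threshold corresponds to the two selection rules described above.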
另一种示例实施方式中,也可以通过统计每种k-mer子序列在原始数据集中的出现频率,并根据出现频率计算每种k-mer子序列的方差,进而根据每种k-mer序列的方差大小确定原始序列特征集。
示例性的,可以统计每种k-mer子序列在原始数据集中的出现频次,根据出现频次计算得到每种k-mer子序列在原始数据集中的出现频率。例如,可以计算出现频次与原始数据集中RNA-蛋白质对的总个数之间的比值得到子序列在原始数据集中的出现频率。通过遍历原始数据集,对每种k-mer子序列是否出现在每个RNA-蛋白质对中进行标记。可以根据每种k-mer子序列在原始数据集中的出现频率和在每个RNA-蛋白质对中的标记值计算得到每种k-mer子序列的方差。
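For example, for the i-th k-mer subsequence, the number of RNA-protein pairs in the RPI1807 data set that contain the subsequence is counted first. For instance, loop over the N RNA-protein pairs in the RPI1807 data set (N = 3243); if the subsequence appears in the current RNA-protein pair, the count is incremented by 1, and if it does not appear, the count is unchanged. The resulting count for the i-th k-mer subsequence is denoted $num_i$, from which the occurrence frequency of the subsequence in the RPI1807 data set, that is, its occurrence frequency per RNA-protein pair, $Freq_i$, is computed:
$$Freq_i = \frac{num_i}{N}$$
After the occurrence frequency of the i-th k-mer subsequence in the RPI1807 data set is determined, the data set is traversed again to check whether the subsequence occurs in each RNA-protein pair, and the result is recorded as a mark value, written here as $f_i^{(n)}$. That is, if the subsequence appears in the n-th RNA-protein pair, the mark value is $f_i^{(n)} = 1$; if it does not appear in the n-th RNA-protein pair, the mark value is $f_i^{(n)} = 0$.
With the occurrence frequency $Freq_i$ and the mark values $f_i^{(n)}$ available, the squared differences between the mark value in each RNA-protein pair and the occurrence frequency are summed according to:
$$Var_i = \frac{1}{N}\sum_{n=1}^{N}\left(f_i^{(n)} - Freq_i\right)^2$$
which gives the variance $Var_i$ of the i-th k-mer subsequence in the RPI1807 data set. Here, $f_i^{(n)}$ is the mark value of the i-th k-mer subsequence in the n-th RNA-protein pair, $Freq_i$ is the occurrence frequency of the i-th k-mer subsequence in the RPI1807 data set, and N is the total number of RNA-protein pairs in the RPI1807 data set.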
计算得到每种k-mer子序列的方差后,示例性的,可以分别将RNA序列所有的3-mer子序列、4-mer子序列和蛋白质序列所有的3-mer子序列、4-mer子序列根据方差大小进行排序,如降序排序,可以选取排名靠前的k-mer子序列组成原始序列特征集。其他示例中,也可以预设方差阈值,将方差大于该阈值的k-mer子序列筛选出来,并由筛选得到的k-mer子序列组成原始序列特征集。
本公开示例实施方式中,可以提取原始数据集中每个RNA-蛋白质对的k-mer特征,由提取到的RNA序列的k-mer特征和蛋白质序列的k-mer特征组成原始序列特征集。以RNA序列的k-mer特征为例,k-mer特征中可以包含RNA序列的单体组分信息(即包含的各个碱基)和序列顺序信息。因此,使用k-mer特征可以更好的刻画一个RNA序列,也即可以根据k-mer特征更加准确的确定一个RNA序列,同时也可以通过k-mer特征区分不同的RNA序列。为了进一步挖掘RNA序列和蛋白质序列之间的联系,也可以提取原始数据集中的每个RNA-蛋白质对的频繁项集特征,由提取到的频繁项集特征组成原始序列特征集。频繁项集特征可以将RNA序列的kmer特征和蛋白质序列的kmer特征组合到一起。因此,使用频繁项集特征可以更好的区分有相互作用和没有相互作用的RNA-蛋白质对。还可以同时提取k-mer特征和频繁项集特征,并由二者共同组成原始序列特征集,通过结合k-mer特征和频繁项集特征的特性,可以更加准确地预测未知RNA-蛋白质对中RNA和蛋白质之间的相互作用,本公开对此不做具体限定。
其中,频繁项集特征是指在原始数据集中具备一定支持度的RNA k-mer子序列与蛋白质k-mer子序列组合成的k-mer子序列对,支持度是指同时包含A和B的事务占所有事务的比例。例如,对于子序列对(AAU,137),表示由一个RNA的3-mer子序列AAU和一个蛋白质的3-mer子序列137组合成的3-mer子序列对。该子序列对的支持度即为原始数据集中同时包含子序列AAU和137的RNA-蛋白质对占所有RNA-蛋白质对的比例。
一种示例实施方式中,可以将每个RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,得到k-mer子序列对。统计每种k-mer子序列对在原始数据集中的出现频率,将满足出现频率预设条件的k-mer子序列对作为频繁项集特征并组成原始序列特征集。
举例而言,可以先将RPI1807数据集中所有正例RNA-蛋白质对的RNA序列和蛋白质序列分别转化为正例3-mer子序列和正例4-mer子序列。类似的,将该数据集中所有负例RNA-蛋白质对的RNA序列和蛋白质序列分别转化为负例3-mer子序列和负例4-mer子序列。通过遍历RPI1807数据集,可以找出该数据集中所有的正、负例RNA 3-mer子序列、 正、负例RNA 4-mer子序列、正、负例蛋白质3-mer子序列和正、负例蛋白质4-mer子序列。将该数据集中的RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列进行两两交叉组合,得到多种3-mer子序列对和4-mer子序列对。示例性的,可以将正例RNA 3-mer子序列和正例蛋白质3-mer子序列进行交叉组合,得到正例3-mer子序列对。可以将负例RNA 3-mer子序列和负例蛋白质3-mer子序列进行交叉组合,得到负例3-mer子序列对。可以将正例RNA 4-mer子序列和正例蛋白质4-mer子序列进行交叉组合,得到正例4-mer子序列对。可以将负例RNA 4-mer子序列和负例蛋白质4-mer子序列进行交叉组合,得到负例4-mer子序列对。
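The occurrence frequency of each kind of subsequence pair in the data set may then be counted. For example, for any positive 3-mer subsequence pair, the frequency is computed according to:
$$Freq = \frac{num}{NUM}$$
which gives the occurrence frequency Freq of that positive 3-mer subsequence pair in the data set. Here, num is the number of occurrences of that positive 3-mer subsequence pair in the data set, and NUM is the total number of occurrences of all positive 3-mer subsequence pairs in the data set.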
计算得到每种k-mer子序列对在原始数据集中的出现频率后,示例性的,可以分别将所有的3-mer子序列对和4-mer子序列对根据出现频率的大小分别进行排序,如降序排序,可以选取排名靠前的k-mer子序列对组成频繁项集。例如,将所有的正例3-mer子序列对进行降序排序,可以选取前m个3-mer子序列对组成频繁项集A1。将所有的正例4-mer子序列对进行降序排序,可以选取前n个4-mer子序列对组成频繁项集A2。将所有的负例3-mer子序列对进行降序排序,可以选取前p个3-mer子序列对组成频繁项集A3。将所有的负例4-mer子序列对进行降序排序,可以选取前q个4-mer子序列对组成频繁项集A4。进而由这四种频繁项集A1、A2、A3和A4组成原始序列特征集。其他示例中,也可以预设出现频率阈值,将出现频率大于该阈值的k-mer子序列对筛选出来,并由筛选得到的k-mer子序列对作为频繁项集特征并组成原始序列特征集,本公开对此不做具体限定。
另一种示例实施方式中,可以将每个RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,由k-mer子序列组成第一候选项集,k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列。举例而言,可以先将RPI1807数据集中每个RNA-蛋白质对的RNA序列和蛋白质序列分别转化为3-mer子序列和4-mer子序列。通过遍历RPI1807数据集,可以找出该数据集中所有的RNA 3-mer子序列、RNA 4-mer子序列、蛋白质3-mer子序列和蛋白质4-mer子序列,并由该数据集中的所有3-mer子序列和4-mer子序列组成第一候选项集C1。
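The occurrence frequency of each k-mer subsequence of the first candidate set C1 in the original data set may then be counted. For example, the j-th k-mer subsequence may be a 3-mer subsequence of an RNA or protein sequence, or a 4-mer subsequence of an RNA or protein sequence. The number of occurrences of the subsequence in the RPI1807 data set is counted first. For instance, loop over the N RNA-protein pairs in the RPI1807 data set; if the subsequence appears in the current RNA-protein pair, the count is incremented by 1, and if it does not appear, the count is unchanged. The resulting count for the j-th k-mer subsequence is denoted $num_j$, and the occurrence frequency $Freq_j$ of the subsequence in the RPI1807 data set is computed from $num_j$:
$$Freq_j = \frac{num_j}{N}$$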
类似的,可以计算出第一候选项集C1中每种3-mer子序列或4-mer子序列在RPI1807数据集中的出现频率。然后,可以根据预设出现频率阈值对所有的3-mer子序列和4-mer子序列进行筛选。例如,可以选取出现频率大于第一阈值的RNA 3-mer子序列、出现频率大于第二阈值的RNA 4-mer子序列、出现频率大于第三阈值的蛋白质3-mer子序列和出现频率大于第四阈值的蛋白质4-mer子序列共同组成频繁项集L1。其中,第一阈值、第二阈值、第三阈值和第四阈值可以相同,也可以不相同,本公开对此不做具体限定。其他示例中,也可以将RNA序列的3-mer子序列、4-mer子序列的出现频率以及蛋白质序列的3-mer子序列、4-mer子序列的出现频率分别进行降序排序,由排名靠前的子序列组成频繁项集L1,本公开对此不做具体限定。
可以将频繁项集L1中的RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列分别进行两两交叉组合,得到多种3-mer子序列对和4-mer子序列对,并由组合得到的多种子序列对组成第二候选项集C2。举例而言,若频繁项集中包括[AAU,AUC,137,123,AAUU,AGUC,1737,1234],可以将“AAU”和“137”组合得到3-mer子序列对“AAU_137”,也可以将“AAU”和“123”组合得到3-mer子序列对“AAU_123”。类似的,通过将RNA 3-mer子序列和蛋白质3-mer子序列进行交叉组合还可以得到3-mer子序列对“AUC_137”、“AUC_123”,通过将RNA 4-mer子序列和蛋白质4-mer子序列进行交叉组合可以得到4-mer子序列对“AAUU_1737”、“AAUU_1234”、“AGUC_1737”和“AGUC_1234”。
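The occurrence frequency of each subsequence pair of the second candidate set C2 in the RPI1807 data set may then be counted. For example, the f-th k-mer subsequence pair may be a 3-mer subsequence pair or a 4-mer subsequence pair. The number of occurrences of the subsequence pair in the RPI1807 data set is counted first. For instance, loop over the N RNA-protein pairs in the RPI1807 data set; if the subsequence pair appears in the current RNA-protein pair, the count is incremented by 1, and if it does not appear, the count is unchanged. The resulting count for the f-th k-mer subsequence pair is denoted $num_f$, and the occurrence frequency of the subsequence pair in the RPI1807 data set, which is the support $support_f$ of the subsequence pair, is computed from $num_f$:
$$support_f = \frac{num_f}{N}$$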
类似的,可以计算出第二候选项集C2中每种子序列对的支持度。
计算得到每种子序列对的支持度后,可以根据每种子序列对的支持度大小确定满足预设条件的子序列对,并由满足预设条件的子序列对组成原始序列特征集。示例性的,可以预设支持度阈值,将支持度大于该阈值的子序列对筛选出来,并由筛选得到的子序列对组 成原始序列特征集。例如,支持度大于该阈值的子序列对共有370个,这370个子序列对即为370个频繁项集特征,可以由370个频繁项集特征组成原始序列特征集。其他示例中,也可以根据支持度大小将第二候选项集C2中的所有子序列对进行降序排序,并选取排名靠前的子序列对组成原始序列特征集,本公开对此不做具体限定。
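The two-pass procedure above (candidate set C1, frequent set L1, candidate set C2, support filtering) resembles an Apriori-style search and can be sketched as below. This is an illustration under assumptions: a single k is used, one frequency threshold is shared by RNA and protein k-mers, and the names `kmer_set` and `frequent_pairs` are hypothetical.

```python
from itertools import product


def kmer_set(seq, k):
    """Distinct overlapping k-mers of one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frequent_pairs(pairs, k, min_item_freq, min_support):
    """pairs: list of (rna, protein) strings. Returns {"RNAkmer_PROTkmer": support}."""
    n = len(pairs)
    rna_sets = [kmer_set(r, k) for r, _ in pairs]
    pro_sets = [kmer_set(p, k) for _, p in pairs]

    def frequent(sets):
        # C1 -> L1: keep k-mers whose frequency (fraction of pairs containing them) passes the threshold
        vocab = set().union(*sets)
        return sorted(x for x in vocab if sum(x in s for s in sets) / n > min_item_freq)

    l1_rna, l1_pro = frequent(rna_sets), frequent(pro_sets)

    support = {}
    for r, p in product(l1_rna, l1_pro):  # C2: cross-combine frequent RNA and protein k-mers
        hits = sum(r in rs and p in ps for rs, ps in zip(rna_sets, pro_sets))
        if hits / n > min_support:        # support = fraction of pairs containing both
            support[f"{r}_{p}"] = hits / n
    return support


data = [("AAUC", "1371"), ("AAUG", "1372"), ("GGGG", "5555")]
print(sorted(frequent_pairs(data, k=3, min_item_freq=0.5, min_support=0.5)))  # ['AAU_137']
```

Filtering single k-mers first keeps the second candidate set small, since only k-mers that are themselves frequent can form a pair with high support.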
本公开示例实施方式中,频繁项集特征可以将RNA序列的kmer特征和蛋白质序列的kmer特征组合到一起,使用频繁项集特征可以更好的区分有相互作用和没有相互作用的RNA-蛋白质对。因此,由频繁项集特征组成原始序列特征集,并基于该原始序列特征集对待预测的RNA-蛋白质对进行特征提取时,根据提取到的该待预测的RNA-蛋白质对的序列特征可以更加准确的确定该待预测的RNA-蛋白质对是否有相互作用。
步骤S320.根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征。
一种示例实施方式中,可以将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,得到原始序列特征集后,可以在原始序列特征集中查找每种k-mer子序列,并根据查找结果得到待预测的RNA-蛋白质对的序列特征。其中,待预测的RNA-蛋白质对的序列特征可以是指由RNA序列特征和蛋白质序列特征组成的完整序列特征。
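For example, the original sequence feature set may consist of 560 kinds of k-mer subsequences, such as [CCC, …, AUA, CCCC, …, CUGG, 777, …, 373, 7774, …, 7571]. The RNA-protein pair to be predicted is converted into 3-mer and 4-mer subsequences, giving RNA 3-mer subsequences, RNA 4-mer subsequences, protein 3-mer subsequences and protein 4-mer subsequences. Feature values are then computed for the RNA sequence and the protein sequence of the pair against the original sequence feature set, which yields the sequence features of the RNA-protein pair to be predicted.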
具体地,对于包括560个特征维度的原始序列特征集[CCC,…,AUA,CCCC,…,CUGG,777,…,373,7774,…,7571],每个特征维度的特征对应于一种k-mer子序列。例如,子序列CCC为第一个特征维度的特征,子序列7571为第560个特征维度的特征。可以在原始序列特征集中查找该待预测的RNA-蛋白质对的所有3-mer子序列和4-mer子序列,并根据查找结果确定原始序列特征集中每个特征维度上的特征是否存在。若存在则该特征维度上的特征值为1,不存在则为0。举例而言,若待预测的RNA-蛋白质对中的RNA序列为“CCACCCCAAUA”,蛋白质序列为“123373777373”,对应的RNA 3-mer子序列包括CCA、…、CCC、…、CCA、…、AUA,RNA 4-mer子序列包括CCAC、…、CCCC、…、AAUA,蛋白质3-mer子序列包括123、…、373、…、777、…、373,蛋白质4-mer子序列包括1233、…、7377、…、7373。每种子序列在原始序列特征集中的查找结果可以参考表1所示。
表1
Figure PCTCN2021121103-appb-000014
Figure PCTCN2021121103-appb-000015
由表1可知,原始序列特征集中的第一个特征维度的特征CCC同时也是该RNA-蛋白质对的一个3-mer子序列。因此,可以确定原始序列特征集中第一个特征维度上的特征CCC存在,对应的可以将特征值记为1。再例如,原始序列特征集中的特征CUGG在该待预测的RNA-蛋白质对的RNA序列中并不存在,则可以将原始序列特征集中第260个特征维度上的特征值记为0。最后,可以计算得到一个560维的特征值向量[1,…,1,1,…,0,1,…,1,0,…,0],该特征值向量即为待预测的RNA-蛋白质对的序列特征。可以理解的是,该特征值向量中包含的每个特征值与原始序列特征集中每个特征维度的特征值是一一对应的。
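The lookup described above amounts to a binary presence vector over the feature set. The sketch below uses a shortened 8-entry feature list in place of the 560 features, taken from the entries named in the example; the function names are illustrative assumptions.

```python
def kmer_set(seq, ks=(3, 4)):
    """All distinct overlapping 3-mers and 4-mers of a sequence."""
    return {seq[i:i + k] for k in ks for i in range(len(seq) - k + 1)}


def kmer_feature_vector(rna, protein, rna_features, protein_features):
    """1 if the feature k-mer occurs in the corresponding sequence, else 0."""
    rna_kmers, pro_kmers = kmer_set(rna), kmer_set(protein)
    return ([int(f in rna_kmers) for f in rna_features]
            + [int(f in pro_kmers) for f in protein_features])


vec = kmer_feature_vector(
    "CCACCCCAAUA", "123373777373",
    rna_features=["CCC", "AUA", "CCCC", "CUGG"],
    protein_features=["777", "373", "7774", "7571"],
)
print(vec)  # [1, 1, 1, 0, 1, 1, 0, 0]
```

As in the text, CCC is present (value 1) and CUGG is absent (value 0).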
该示例实施方式中,通过提取原始数据集中每个RNA-蛋白质对的k-mer特征,由提取到的RNA序列的k-mer特征和蛋白质序列的k-mer特征组成原始序列特征集。基于该原始序列特征集对待预测的RNA-蛋白质对进行特征提取得到该待预测的RNA-蛋白质对的序列特征,以RNA序列的k-mer特征为例,k-mer特征中可以包含RNA序列的单体组分信息(即包含的各个碱基)和序列顺序信息。因此,使用k-mer特征可以更好的刻画一个RNA序列,也即可以根据k-mer特征更加准确的确定一个RNA序列,同时也可以通过k-mer特征区分不同的RNA序列。
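In another example embodiment, after the original sequence feature set is obtained, the RNA sequence and the protein sequence of the RNA-protein pair to be predicted may each be converted into k-mer subsequences, and the RNA k-mer subsequences and protein k-mer subsequences may be cross-combined to obtain multiple RNA-protein k-mer subsequence pairs. For example, after the RNA 3-mer subsequences, protein 3-mer subsequences, RNA 4-mer subsequences and protein 4-mer subsequences of the pair to be predicted are obtained, the RNA 3-mer subsequences may be cross-combined pairwise with the protein 3-mer subsequences, and the RNA 4-mer subsequences with the protein 4-mer subsequences, giving multiple 3-mer subsequence pairs and 4-mer subsequence pairs. Each RNA-protein k-mer subsequence pair is then looked up in the original sequence feature set, and the sequence features of the RNA-protein pair are obtained from the lookup results.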
示例性的,原始序列特征集可以由370个频繁项集特征组成。例如,370个频繁项集特征可以为[CCA_121,…,UCUG_1312,…,AAU_122,…,CUUU_1312,…]。将待预测的RNA-蛋白质对转化为3-mer子序列和4-mer子序列,得到RNA 3-mer子序列、RNA4-mer子序列、蛋白质3-mer子序列和蛋白质4-mer子序列。可以将RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列进行配对,得到多种3-mer子序列对和4-mer子序列对。然后,可以基于原始序列特征集对待预测的RNA-蛋白质对中的RNA序列和蛋白质序列进行特征计算,得到该RNA-蛋白质对的序列特征。
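Specifically, for an original sequence feature set with 370 feature dimensions, [CCA_121, …, UCUG_1312, …, AAU_122, …, CUUU_1312, …], the feature of each dimension corresponds to one kind of k-mer subsequence pair. For example, the subsequence pair CCA_121 is the feature of the first dimension. All subsequence pairs of the RNA-protein pair are looked up in the original sequence feature set, and the lookup results determine whether the feature of each dimension is present. If it is present, the feature value of that dimension is 1; otherwise it is 0. For example, suppose the RNA sequence of the pair to be predicted is "CCAUCUGAAU" and the protein sequence is "1312137122". The subsequence pairs CCA_121, UCUG_1312 and AAU_122 of this RNA-protein pair exist in the original sequence feature set, so the feature values of the corresponding dimensions of the original sequence feature set are set to 1. The lookup result for each subsequence pair is shown in Table 2.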
表2
Figure PCTCN2021121103-appb-000016
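Finally, a 370-dimensional feature value vector [1, …, 1, …, 1, …, 0, …] is obtained, and this vector is the sequence feature of the RNA-protein pair to be predicted. Similarly, each feature value in the vector corresponds one-to-one to a feature dimension of the original sequence feature set.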
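The frequent-itemset lookup can be sketched in the same way as the single k-mer lookup. The feature list here is a four-entry excerpt of the 370 features named in the example, and `pair_feature_vector` is an assumed name. Since an overlapping k-mer is just a contiguous substring, plain substring tests are enough.

```python
def pair_feature_vector(rna, protein, pair_features):
    """pair_features: strings of the form "<RNA k-mer>_<protein k-mer>"."""
    vec = []
    for feat in pair_features:
        rna_kmer, pro_kmer = feat.split("_")
        # The pair is present when both halves occur in their respective sequences.
        vec.append(int(rna_kmer in rna and pro_kmer in protein))
    return vec


features = ["CCA_121", "UCUG_1312", "AAU_122", "CUUU_1312"]
print(pair_feature_vector("CCAUCUGAAU", "1312137122", features))  # [1, 1, 1, 0]
```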
再一种示例实施方式中,得到原始序列特征集后,可以将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,在原始序列特征集中查找每种k-mer子序列,得到第一序列特征。然后,可以将RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对,在原始序列特征集中查找每种RNA-蛋白质k-mer子序列对,得到第二序列特征。最后,可以由第一序列特征和第二序列特征组成待预测的RNA-蛋白质对的序列特征。
示例性的,原始序列特征集可以包括两个特征子集,两个特征子集分别包括560种k-mer子序列[CCC,…,CCCC,…,777,…,7774,…]和370个频繁项集特征[CCA_121,…,UCUG_1312,…,AAU_122,…,CUUU_1312,…]。可以将待预测的RNA-蛋白质对转化为RNA 3-mer子序列、RNA 4-mer子序列、蛋白质3-mer子序列和蛋白质4-mer子序列,同时,也可以将RNA 3-mer子序列和蛋白质3-mer子序列、RNA 4-mer子序列和蛋白质4-mer子序列进行配对,得到多种3-mer子序列对和4-mer子序列对。然后,可以基于原始序列特征集对待预测的RNA-蛋白质对中的RNA序列和蛋白质序列进行特征计算,得到该待预测的RNA-蛋白质对的序列特征。
具体地,可以在原始序列特征集中查找该待预测的RNA-蛋白质对的所有子序列和子序列对,并根据查找结果确定原始序列特征集中每个特征维度上的特征是否存在。例如,通过查找该待预测的RNA-蛋白质对的所有子序列可以计算得到一个560维的特征值向量[1,…,1,…,1,…,0,…],即第一序列特征。通过查找该待预测的RNA-蛋白质对的所有子序列对可以计算得到一个370维的特征值向量[1,…,1,…,1,…,0,…],即第二序列特征。例如,可以通过将两个特征值向量进行拼接,得到一个930维的特征值向量即为该待预测的RNA-蛋白质对的序列特征。也可以直接将两个特征值向量同时输入相互作用预测模型中,本公开对此不做具体限定。
其他示例中,也可以由560种k-mer子序列和370个频繁项集特征组成一个930维的 原始序列特征集。例如,原始序列特征集为[CCC,…,CCCC,…,777,…,7774,…,CCA_121,…,UCUG_1312,…,AAU_122,…,CUUU_1312,…]。可以在原始序列特征集中查找该RNA-蛋白质对的所有子序列和子序列对,并根据查找结果确定原始序列特征集中每个特征维度上的特征是否存在,计算得到一个930维的特征值向量[1,…,1,…,1,…,0,…,1,…,1,…,1,…,0,…],该特征值向量即为待预测的RNA-蛋白质对的序列特征。
在步骤S230中,向量化所述待预测的RNA-蛋白质对,得到所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量。
对待预测的RNA-蛋白质对进行特征提取后,可以将得到的序列特征作为第一相互作用预测模型的输入。为了获取RNA序列和蛋白质序列的序列信息以及临近碱基、氨基酸之间的信息,还可以对待预测的RNA-蛋白质对进行向量化处理,并将得到的向量作为第二相互作用预测模型的输入。
一种示例实施方式中,可以将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列。例如,可以将RNA序列不重叠的划分为M个RNA k-mer子序列,可以将蛋白质序列不重叠的划分为N个蛋白质k-mer子序列。举例而言,若RNA序列为AUCUGAAAU,则可以划分为三个RNA k-mer子序列,分别为AUC、UGA和AAU。可以理解的是,将RNA序列和蛋白质序列不重叠的划分为多个k-mer子序列,是为了将RNA序列和蛋白质序列进行向量化表示,也就是将RNA序列中的碱基和蛋白质序列中的氨基酸以k联体的方式进行向量化表示。类似的,其他示例中,也可以将待预测的RNA-蛋白质对中的RNA序列包含的每个碱基向量化,得到多个碱基向量,拼接多个碱基向量得到RNA序列表示向量。同时,可以将待预测的RNA-蛋白质对中的蛋白质序列包含的每个氨基酸向量化,得到多个氨基酸向量,拼接多个氨基酸向量得到蛋白质序列表示向量。还可以将RNA序列重叠的划分为P个RNA k-mer子序列,可以将蛋白质序列重叠的划分为Q个蛋白质k-mer子序列,本公开对此不做具体限定。然后,可以分别将RNA序列表示向量和蛋白质序列表示向量输入第二相互作用预测模型中。
具体地,可以先将RNA序列和蛋白质序列的每个k-mer子序列进行编码。示例性的,当k=3时,RNA 3-mer子序列可以有64种,蛋白质3-mer子序列可以有343种。可以将每种RNA 3-mer子序列和蛋白质3-mer子序列依次进行Embedding(向量映射)编码,将每个3-mer子序列分别用一个低维向量表示,可以得到对应的多个3-mer子序列向量。一种示例中,可以将每个3-mer子序列进行One-Hot(独热)编码,One-Hot编码又称作一位有效编码,其方法是使用N位状态寄存器来对N个状态进行编码,每个状态都有独立的寄存器位,并且在任意时候,寄存器中只有一位有效。举例而言,对于第i种RNA 3-mer子序列,即索引为整数i的RNA 3-mer子序列,通过编码可以得到一个64维的One-Hot向量,并将该向量中第i个元素设为1,其它元素均设为0,形如[0,1,0,0,…,0]。对于第j个蛋白质3-mer子序列,即索引为整数j的蛋白质3-mer子序列,通过编码可以得到一个 343维的One-Hot向量,并将该向量中第j个元素设为1,其它元素均设为0。类似的,每种RNA 3-mer子序列和蛋白质3-mer子序列可以对应一个3-mer One-Hot向量。其他示例中,也可以利用稠密向量来表示每个3-mer子序列。例如,可以利用Word2vec算法将每个3-mer子序列映射到向量空间中,每个3-mer子序列可以用该向量空间中的一个子序列向量表示。还可以利用BERT(来自变换器的双向编码表示)预训练模型对每个3-mer子序列进行编码,得到对应的多个3-mer子序列向量。具体地,可以获取大规模的RNA序列数据,使用BERT预训练模型进行训练,训练完成后,将某个RNA序列输入训练好的模型后即可得到该RNA序列的高维向量,本公开对此不做具体限定。
得到RNA序列和蛋白质序列的所有3-mer One-Hot向量后,可以将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为3-mer子序列,如得到M个RNA 3-mer子序列和N个蛋白质3-mer子序列。然后,通过查询可以确定M个RNA 3-mer子序列对应的M个3-mer One-Hot向量,将M个3-mer One-Hot向量依次进行拼接,如可以在行方向上进行拼接,得到一个M*64的二维矩阵,如:
[[0,1,0,0,….,0];[0,0,0,1,….,0];…;[1,0,0,0,….,0]]
该二维矩阵即为RNA序列的3-mer One-Hot表示向量。通过查询也可以确定N个蛋白质3-mer子序列对应的N个3-mer One-Hot向量,将N个3-mer One-Hot向量在行方向依次进行拼接,可以得到一个N*343的二维矩阵,该二维矩阵即为蛋白质序列的3-mer One-Hot表示向量。可以理解的是,也可以将M个3-mer One-Hot向量或N个3-mer One-Hot向量进行列拼接,还可以进行直接(即尾部)拼接得到序列的3-mer One-Hot向量,本公开对此不做具体限定。
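A minimal sketch of the M*64 one-hot matrix follows. The disclosure does not specify how k-mers are indexed, so lexicographic order over the alphabet "ACGU" is assumed here, and the sequence is split without overlap as in the AUCUGAAAU example; all names are hypothetical.

```python
from itertools import product


def kmer_index(alphabet, k):
    """Assumed indexing: lexicographic order of all k-mers over the alphabet."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def one_hot_matrix(seq, alphabet, k):
    """Split seq into non-overlapping k-mers and stack their one-hot rows."""
    index = kmer_index(alphabet, k)
    chunks = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]  # trailing remainder dropped
    rows = []
    for chunk in chunks:
        row = [0] * len(index)
        row[index[chunk]] = 1
        rows.append(row)
    return rows


m = one_hot_matrix("AUCUGAAAU", "ACGU", 3)
print(len(m), len(m[0]))  # 3 64
```

Using the alphabet "1234567" with k = 3 gives rows of length 343 for a class-encoded protein sequence.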
本公开示例实施方式中,通过将待预测的RNA-蛋白质对中的RNA序列和蛋白质序列向量化,可以将RNA序列表示向量和蛋白质序列表示向量作为深度学习模型的输入,以进一步发现数据中出现次数很少或者新的特征组合,进而揭示隐式特征之间的相互作用。
另一种示例实施方式中,也可以将RNA-蛋白质对中的RNA序列包含的每个碱基向量化,得到多个碱基向量,拼接多个碱基向量得到RNA序列表示向量。同时,可以将RNA-蛋白质对中的蛋白质序列包含的每个氨基酸向量化,得到多个氨基酸向量,拼接多个氨基酸向量得到蛋白质序列表示向量。
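Specifically, the 4 bases that may occur in an RNA sequence (A, C, G and U) and the 7 amino acid classes that may occur in a protein sequence (1, 2, 3, 4, 5, 6 and 7) are first embedding-encoded, so that each base and each amino acid class is represented by a low-dimensional vector, giving the corresponding vectors. In one example, each base and each amino acid class is One-Hot encoded. For example, base A may be represented by the One-Hot vector [1,0,0,0] and U by [0,0,0,1]. Amino acid class 1 may be encoded as [1,0,0,0,0,0,0] and amino acid class 7 as [0,0,0,0,0,0,1]. In other examples, dense vectors may be used to represent each base and each amino acid class. For example, the Word2vec algorithm may be used to map each base and each amino acid class into a vector space to obtain the corresponding vector representation. Similarly, the Doc2vec algorithm, the GloVe algorithm and the like may also be used to convert each base and each amino acid class into a vector, which the present disclosure does not specifically limit.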
对于待预测的RNA-蛋白质对,可以将其中的RNA序列用单个碱基的One-Hot向量表示,得到一个L*4的矩阵。其中,L为该RNA序列的序列长度,即该RNA序列中包含的碱基个数。示例性的,若该RNA序列为“ACUGAUGC”,可以得到8个碱基的One-Hot向量。参考表3所示,每一列表示一个碱基的One-Hot向量表示。例如,可以将这8个One-Hot向量在行方向上进行拼接,就可以得到一个8*4的矩阵,该矩阵即为该RNA序列的One-Hot向量表示。
表3
序列 A C U G A U G C
A 1 0 0 0 1 0 0 0
C 0 1 0 0 0 0 0 1
G 0 0 0 1 0 0 1 0
U 0 0 1 0 0 1 0 0
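The per-base encoding of Table 3 can be reproduced with a short sketch. The base order A, C, G, U follows the One-Hot examples in the text; `base_one_hot` is an assumed name.

```python
BASES = "ACGU"  # position of the 1 in each One-Hot vector


def base_one_hot(seq):
    """Return an L x 4 matrix with one One-Hot row per base."""
    return [[int(b == base) for base in BASES] for b in seq]


matrix = base_one_hot("ACUGAUGC")
print(len(matrix), len(matrix[0]))  # 8 4
print(matrix[2])                    # [0, 0, 0, 1]  (third base is U)

# Transposing gives the layout of Table 3: one row per base type, one column per position.
table3 = [list(col) for col in zip(*matrix)]
print(table3[0])                    # [1, 0, 0, 0, 1, 0, 0, 0]  (row "A")
```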
在步骤S240中,基于所述待预测的RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用相互作用预测模型得到所述待预测的RNA-蛋白质对的相互作用预测值。
一种示例实施方式中,可以使用至少三个相互作用预测模型对待预测的RNA-蛋白质对的相互作用进行预测。示例性的,相互作用预测模型可以均为传统机器学习模型,也可以均为深度学习模型,还可以是同时包括传统机器学习模型和深度学习模型在内的至少三个相互作用预测模型。其中,传统机器学习模型是指使用原始形式处理自然数据。例如,构成一个模式识别或机器学习系统需要利用专业知识从原始数据中(如图像的像素值)提取特征,并转换成一个适当的特征表示。示例性的,传统机器学习模型可以包括线性回归模型、逻辑回归模型、支持向量机模型、决策树模型、K-近邻(K-Nearest Neighbor,KNN)模型、随机森林模型和朴素贝叶斯模型等。而深度学习模型具有自动提取特征的能力,可以由多个处理层组成复杂计算模型,从而自动获取数据的表示与多个抽象级别,是一种针对特征表示的学习。示例性的,深度学习模型可以包括卷积神经网络模型和循环神经网络模型等。
需要说明的是,当相互作用预测模型均为传统机器学习模型时,可以不对待预测的RNA-蛋白质对进行向量化处理,而仅将对待预测的RNA-蛋白质对进行特征提取得到的序列特征作为各个传统机器学习模型的输入。当相互作用预测模型均为深度学习模型时,可以不对待预测的RNA-蛋白质对进行特征提取,而仅将对待预测的RNA-蛋白质对进行向量化处理得到的RNA序列表示向量和蛋白质序列表示向量作为各个深度学习模型的输入。
一种示例实施方式中,得到待预测的RNA-蛋白质对的序列特征和该待预测RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量后,可以将待预测的RNA-蛋白质对的序列特征输入到第一相互作用预测模型中,得到第一相互作用预测值。同时,可以将待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入到第二相互作用预 测模型中,得到第二相互作用预测值。其中,使用的相互作用预测模型可以包括至少一个第一相互作用预测模型和至少两个第二相互作用预测模型,或者,也可以包括至少两个第一相互作用预测模型和至少一个第二相互作用预测模型。
本公开示例实施方式中,通过训练至少三个相互作用预测模型(即弱监督模型),得到多个学习结果,并将多个学习结果进行融合得到强监督模型。其中,将各个学习结果进行融合可以获得比单个模型更好的学习效果。可以理解的是,可以将强监督模型看作是一个整体模型,实际上其中包含多个弱监督模型。因此,可以将第一相互作用预测模型和第二相互作用预测模型相结合,并对多个模型的输出进行融合学习。而且,在模型训练过程中每个模型是单独训练的。示例性的,可以通过反向传播算法优化各个模型的模型参数,由优化后的各个模型得到强监督模型,从而可以解决使用单一模型进行预测时存在的泛化能力较低,容易过拟合等问题。
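In one example embodiment, the sequence features of the RNA-protein pair to be predicted may be input into a traditional machine learning model to obtain a first interaction prediction value. The RNA sequence representation vector and the protein sequence representation vector of the pair to be predicted may be input into a deep learning model to obtain a second interaction prediction value. It should be noted that each deep learning model may include at least two sub deep learning models, which process the RNA sequence and the protein sequence of the RNA-protein pair respectively. For example, the representation vector of the RNA sequence may be input into a first sub deep learning model to obtain a first sequence feature, while the representation vector of the protein sequence is input into a second sub deep learning model to obtain a second sequence feature. Finally, the fully connected layer of each deep learning model may fuse the first sequence feature and the second sequence feature, and the second interaction prediction value is obtained from the fused feature. The interaction prediction models used may include at least one traditional machine learning model and at least two deep learning models, or alternatively at least two traditional machine learning models and at least one deep learning model.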
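For example, the traditional machine learning model may include at least one of an SVM model, an LR (Logistic Regression) model and a random forest model, and the deep learning model may include at least one of a CNN model and a recurrent neural network model, where the recurrent neural network model may be an LSTM model or a BiLSTM (Bi-directional LSTM, bidirectional long short-term memory network) model.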
可以以SVM模型、CNN模型和BiLSTM模型进行预测RNA-蛋白质对的相互作用为例进行说明。示例性的,可以由560种k-mer子序列和370个频繁项集特征组成一个930维的原始序列特征集,根据待预测RNA-蛋白质对的k-mer子序列对原始序列特征集进行特征计算,得到一个930维的特征值向量,并将其作为SVM模型的输入,得到相互作用预测值y 1。同时,还可以将待预测RNA-蛋白质对向量化,分别得到RNA序列和蛋白质序列的k-mer One-Hot向量,并将其作为CNN模型和BiLSTM模型的输入,得到相互作用预测值y 2、y 3
在本公开示例实施方式中,首先,对于SVM模型,可以手工设计特征提取方式以及设置模型参数,使得模型解释性较强。CNN模型和BiLSTM模型具有较好的泛化能力,可以发现数据中出现次数很少或者新的特征组合,进而揭示隐式特征之间的相互作用。其次, CNN模型可以更好地捕捉特征但忽略了特征所在的位置信息,即无法提取一定间隔的碱基、氨基酸之间的相互关系。而BiLSTM模型具有更好的记忆能力,可以利用数据的序列信息与位置信息,弥补CNN模型在记忆能力上的缺陷。最后,SVM模型简单,且可解释性好,使用CNN模型、BiLSTM模型与SVM模型对未知RNA-蛋白质对的相互作用进行预测时,可以增强整体模型对RPI进行预测的可解释性。总的来说,本公开可以将各个相互作用预测模型的特性进行有效结合,从而可以提高整体模型的预测能力。
在步骤S250中,根据所述相互作用预测值确定所述RNA和蛋白质之间的相互作用。
得到至少三个相互作用预测值后,可以融合多个相互作用预测值,并根据融合结果确定RNA和蛋白质之间的相互作用。
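In one example embodiment, a prediction threshold may be preset, and each interaction prediction value is labeled against it: a value greater than or equal to the threshold is labeled 1, and a value below the threshold is labeled 0. For example, with the threshold set to 0.6, if a model's interaction prediction value is greater than or equal to 0.6, the prediction result of that model is labeled 1, meaning that the model predicts that the input RNA-protein pair interacts; otherwise the prediction result of that model is labeled 0. After the prediction results of the at least three interaction prediction models have been labeled, the labels are summed according to:
$$T = \sum_{i=1}^{n} t_i$$
to obtain the final interaction prediction value T. Here, $t_i$ is the label of the prediction result of the i-th interaction prediction model, and n is the number of interaction prediction models. When n is odd, $T \ge n/2$ indicates that the RNA-protein pair to be predicted interacts, and $T < n/2$ indicates that it does not. When n is even, $T \ge n/2$ likewise indicates that the pair interacts and $T < n/2$ that it does not. In other examples, when n is even, the rule may instead be that $T > n/2$ indicates an interaction and $T \le n/2$ indicates no interaction, which the present disclosure does not specifically limit.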
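The label-and-sum rule above is a majority vote and can be sketched as follows. The default threshold of 0.6 comes from the example in the text; the function name and the `tie_is_positive` switch (covering the alternative rule for an even number of models) are assumptions.

```python
def majority_vote(scores, threshold=0.6, tie_is_positive=True):
    """Label each model score as 0/1, sum the labels (T) and compare with n/2."""
    labels = [int(s >= threshold) for s in scores]  # t_i
    total = sum(labels)                             # T
    half = len(labels) / 2
    interacts = total >= half if tie_is_positive else total > half
    return total, interacts


print(majority_vote([0.9, 0.7, 0.2]))                               # (2, True)
print(majority_vote([0.9, 0.7, 0.2, 0.1], tie_is_positive=False))   # (2, False)
```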
另一种示例实施方式中,可以获取多个相互作用预测模型的权重参数,对多个相互作用预测值和对应的权重参数进行融合计算,如可以对多个相互作用预测值进行加权求和计算,并根据计算结果确定RNA-蛋白质对的相互作用。举例而言,使用SVM模型、CNN模型和BiLSTM模型对RNA-蛋白质对的相互作用进行预测时,可以根据:
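$$y_{out} = \alpha \cdot y_1 + \beta \cdot y_2 + \gamma \cdot y_3$$
which gives the interaction prediction value $y_{out}$ of the RNA-protein pair. Here, $y_1$ is the output of the SVM model, $y_2$ is the output of the CNN model, $y_3$ is the output of the BiLSTM model, and α, β and γ are the weight parameters of the SVM model, the CNN model and the BiLSTM model respectively; $y_{out}$ may be any value between 0 and 1. For example, taking 0.5 as the boundary, when $y_{out} > 0.5$ the prediction result is labeled 1, indicating that the RNA-protein pair interacts, and when $y_{out} \le 0.5$ the prediction result is labeled 0, indicating that the RNA-protein pair does not interact. In the example embodiments of the present disclosure, the weight parameters may be set manually; for example, a larger weight may be set for an interaction prediction model with higher accuracy, which the present disclosure does not specifically limit.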
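The weighted fusion can be sketched as below. One assumption goes beyond the text: the weights are required to sum to 1, which is what keeps the fused value between 0 and 1 when each model output lies in that range. The weight and score values in the usage line are made up for illustration.

```python
def weighted_fusion(preds, weights, boundary=0.5):
    """Weighted sum of model outputs, then a 0/1 label at the boundary value."""
    if len(preds) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:  # assumption: weights form a convex combination
        raise ValueError("weights are assumed to sum to 1")
    y_out = sum(w * y for w, y in zip(weights, preds))
    return y_out, int(y_out > boundary)


# Hypothetical outputs of the SVM, CNN and BiLSTM models with weights alpha, beta, gamma.
y_out, label = weighted_fusion(preds=(0.8, 0.4, 0.7), weights=(0.2, 0.4, 0.4))
print(round(y_out, 2), label)  # 0.6 1
```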
本公开示例实施方式中,参考图5所示,可以根据步骤S510和步骤S520对至少三个相互作用预测模型预先进行训练,以实现各个预测模型中所有模型参数的优化,可以根据训练得到的最终模型对相互作用未知的RNA-蛋白质对进行预测。
步骤S510.获取训练数据集,所述训练数据集中包括正例RNA-蛋白质对和负例RNA-蛋白质对。
可以将原始数据集中的所有RNA-蛋白质对作为训练数据集,也可以将原始数据集中的部分RNA-蛋白质对作为训练数据集,还可以将原始数据集按比例分为训练数据集、验证数据集和测试数据集,进而可以使用训练数据集和验证数据集调整各个模型的模型参数,训练得到性能较优的多个模型,可以使用测试数据集测试优化后的各个模型的泛化性能。以RPI1807数据集为例,该数据集中共有3243个RNA-蛋白质对,具体包含1807对正例和1436对负例。示例性的,可以按照7:2:1的比例将该数据集划分成训练数据集、验证数据集和测试数据集。其中,每个数据集中正例和负例的比例可以与整体数据集的分布保持一致,即比例为1807:1436,约为1.25:1。举例而言,可以选取1250个正例和1000个负例作为训练数据集,选取360个正例和280个负例作为验证数据集,选取180个正例和140个负例作为测试数据集。可以理解的是,训练数据集中RNA-蛋白质对的数目仅仅是示意性的,可以获取任意数目的RNA-蛋白质对对每个相互作用预测模型进行多次训练,以提高每个相互作用预测模型的性能。需要说明的是,可以对正例RNA-蛋白质对进行标记,得到的标签值为“1”,即表示RNA-蛋白质对有相互作用。可以对负例RNA-蛋白质对进行标记,得到的标签值为“0”,即表示RNA-蛋白质对没有相互作用。
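A 7:2:1 split that preserves the positive-to-negative ratio in every subset can be sketched with the standard library alone. The function name, the fixed seed and the rounding rule are assumptions; exact subset sizes depend on how fractions are rounded, so they differ slightly from the round numbers used in the example above.

```python
import random


def stratified_split(pairs, labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (pair, label) samples into train/validation/test, class by class."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in (1, 0):  # positives first, then negatives
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(len(idx) * ratios[0])
        n_val = round(len(idx) * ratios[1])
        parts = (idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:])
        for subset, part in zip((train, val, test), parts):
            subset.extend((pairs[i], labels[i]) for i in part)
    return train, val, test


# Dummy stand-ins for the 1807 positive and 1436 negative pairs of RPI1807.
labels = [1] * 1807 + [0] * 1436
pairs = [f"pair_{i}" for i in range(len(labels))]
train, val, test = stratified_split(pairs, labels)
print(len(train), len(val), len(test))  # 2270 648 325
```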
步骤S520.将所述训练数据集作为所述相互作用预测模型的输入,对各个相互作用预测模型的模型参数进行迭代更新,当满足迭代终止条件时,完成对所有模型参数的训练,以使用训练好的所述相互作用预测模型对所述待预测的RNA-蛋白质对的相互作用进行预测。
示例性的,可以将训练数据集输入至少三个相互作用预测模型中,利用反向传播算法对模型参数进行调整,得到多个弱监督模型。例如,模型参数可以是权重参数、偏置参数、惩罚因子等。示例性的,可以利用随机梯度下降算法对各个模型参数进行迭代更新。根据反向传播原理,不断计算目标函数,并根据目标函数更新模型参数。当目标函数收敛至最小值时,完成对模型参数的训练。也可以反向迭代式更新模型参数,当满足预设的迭代次数时,完成对所有模型参数的训练。需要说明的是,可以同时训练至少三个相互作用预测模型,也可以对至少三个相互作用预测模型依次进行训练,本公开对此不做具体限定。但是,每个相互作用预测模型是单独训练的。
对模型参数进行优化时,可以先初始化一组超参数,利用训练数据集对至少三个相互作用预测模型进行不断训练,得到第一模型。其中,超参数可以是学习率、CNN层数、卷积核尺寸等。然后,可以将验证数据集输入到训练好的第一模型中,以对第一模型的预测 精度进行验证。当预测精度达到预设精度阈值时,可以将当前的第一模型作为第二模型,即得到最终训练模型。最后,可以在该训练模型上使用测试数据集对模型的最终性能进行测试。可以理解的是,若根据测试数据集得出第二模型的预测效果较差,则可以重新设定一组超参数,利用训练数据集、验证数据集再次对相互作用预测模型依次进行训练和验证,当训练好的相互作用预测模型在验证数据集上得到的预测精度达到预设精度阈值时,可以利用新的测试数据集对该预测模型的最终性能进行测试。
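After weakly supervised models with good prediction accuracy are obtained, each RNA-protein pair in the test data set may be input into a weakly supervised model to evaluate its accuracy. If the accuracy of the model is greater than a preset accuracy threshold, training of the weakly supervised model is complete. The prediction results of the multiple weakly supervised models may then be fused to obtain a strongly supervised model, which serves as the final prediction model and can be used to predict the interaction of unknown RNA-protein pairs. In other examples, the test data set may also be used to evaluate the Matthews correlation coefficient of a weakly supervised model. The Matthews correlation coefficient is the correlation coefficient between the actual classification and the predicted classification. Its value range is [-1, 1]: a larger value indicates a stronger correlation between the predicted and true values, a value of 1 indicates a completely correct prediction, 0 indicates a prediction no better than random, and -1 indicates a completely wrong prediction. If the Matthews correlation coefficient of the model is greater than a preset threshold, training of the weakly supervised model is complete. The test data set may also be used to evaluate the specificity, recall and other metrics of the weakly supervised model, which the present disclosure does not specifically limit. It can be understood that if the accuracy of a weakly supervised model does not exceed the preset accuracy threshold, a new training data set may be obtained to train the model parameters of each interaction prediction model again, so as to keep improving model performance.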
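For reference, the Matthews correlation coefficient for binary labels can be computed directly from the confusion-matrix counts; it ranges from -1 to 1. The sketch below follows the standard definition; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(mcc([1, 1, 0, 0], [0, 0, 1, 1]))  # -1.0
print(mcc([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
```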
本公开示例实施方式中,通过训练传统机器学习模型、卷积神经网络模型及循环神经网络模型等多个弱监督模型,并将多个弱监督模型的预测结果进行融合,得到强监督模型。该强监督模型简洁有效,而且对硬件设备要求较低,适用范围较广。
一种具体实施方式中,可以分别对SVM模型、CNN模型和BiLSTM模型进行训练。
对训练数据集中的每个RNA-蛋白质对进行特征提取,可以将提取到的序列特征依次输入SVM模型中。示例性的,原始序列特征集由560种k-mer子序列和370个频繁项集特征组成,可以根据RNA-蛋白质对的k-mer子序列对原始序列特征集进行特征计算,得到一个930维的特征值向量,并将其作为SVM模型的输入,以对该模型的模型参数进行训练。其中,SVM模型的模型参数可以包括惩罚因子C、gamma参数等。在SVM模型中,核函数可以采用多项式核函数。为了增加该模型的泛化能力,可以减小惩罚项,将惩罚因子C设置为0.8。核函数也可以采用径向基核函数,其中,gamma参数默认为1/n_features,通过实验调整参数可知,当gamma参数值为0.1时,模型的性能较优,学习效果更好。模型参数确定后,可以利用模型参数对训练数据集进行训练,得到性能较优的SVM模型。
由对SVM模型进行训练可知,传统机器学习模型需要手工设计特征提取方式以及设置模型参数,使得模型解释性较强,但模型的泛化能力较弱。因此,可以采用CNN模型进行特征提取及分类,实现端到端的RNA-蛋白质预测。另外,利用CNN模型可以实现临近碱基、氨基酸特征信息的提取,但无法提取一定间隔的碱基、氨基酸之间的相互关系,所以可以采用BiLSTM模型提取序列信息。
示例性的,可以对训练数据集中的每个RNA-蛋白质对进行向量化处理,将得到的每个RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入CNN模型和BiLSTM模型中。进一步地,可以以利用第i个RNA-蛋白质对对各个预测模型进行训练为例。
可以将该RNA-蛋白质对的序列特征输入SVM模型中,并输出预测值y 1。可以将该RNA-蛋白质对中RNA序列的表示向量和蛋白质序列的表示向量分别输入两个CNN子模型中,并将两个CNN子模型的输出进行拼接,最终输出预测值y 2。类似的,可以将该RNA-蛋白质对的表示向量输入BiLSTM模型中,并输出预测值y 3。可以将各个相互作用预测模型的输出值标记为0或1,0表示该RNA-蛋白质对没有相互作用,1表示该RNA-蛋白质对有相互作用。其中,对于CNN模型和BiLSTM模型的输入,以RNA序列为例,RNA序列的表示向量可以是由该序列中多个碱基的One-Hot向量得到的序列的One-Hot向量,也可以是该序列的3-mer One-Hot向量,还可以是该序列的4-mer One-Hot向量,本公开中对此不做具体限定。可以理解的是,对于训练数据集中的每个RNA-蛋白质对,通过这三个相互作用预测模型均可以得到三个相互作用预测值。
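Table 4 shows one network architecture of the CNN model. As can be seen, the RNA sequence and the protein sequence are input into two CNN sub-networks with the same architecture. The first CNN sub-network, which takes the RNA sequence as input, is described as an example. It may contain 5 convolutional layers with a kernel size of 1×3, 4 normalization layers, 4 downsampling layers, a concatenation layer, a fully connected layer and a dropout layer. For example, convolutional layer C1 may use a 1×3 kernel for feature extraction, with 32 output channels. A BN (Batch Normalization) operation may be applied to the sequence features extracted by each channel to normalize them and thereby speed up network training. Downsampling layer P1 may use a 1×2 window to take the maximum over the feature points in a neighborhood, which reduces the number of parameters the network has to learn. The RNA sequence is downsampled layer by layer to obtain the feature vector of the RNA sequence. Similarly, the protein sequence is downsampled layer by layer to obtain the feature vector of the protein sequence. The concatenation layer of the CNN model joins the feature vector of the RNA sequence and the feature vector of the protein sequence end to end, the fully connected layer adds a bias and outputs the concatenated feature vector, and a dropout layer is added so that training uses a network in which a portion of the nodes is randomly dropped. In addition, the Sigmoid activation function may be chosen as the activation function of the last layer.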
表4
Figure PCTCN2021121103-appb-000018
Figure PCTCN2021121103-appb-000019
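Table 5 shows one network architecture of the BiLSTM model. As can be seen, the RNA sequence and the protein sequence are input into two identical BiLSTM sub-networks. The first BiLSTM sub-network, which takes the RNA sequence as input, is described as an example. If the RNA sequence is "CCAUCUGA", its representation vector may be an 8*4 matrix built from the One-Hot vectors of the 8 bases of the sequence, or the 3-mer One-Hot vector of the sequence, or the 4-mer One-Hot vector of the sequence. For example, when the input is the 8*4 matrix built from the One-Hot vectors of the 8 bases, each base vector is input into the BiLSTM network in turn to obtain the feature vector of the RNA sequence. The BiLSTM network is composed of a forward LSTM network and a backward LSTM network. An LSTM network is a recurrent neural network over time, suited to processing and predicting important events separated by relatively long intervals and delays in a time series.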
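Specifically, the One-Hot vectors of the 8 bases are input into the forward LSTM network in forward order, which outputs the first hidden vector of the RNA sequence. At the same time, the One-Hot vectors of the 8 bases are input into the backward LSTM network in reverse order, which outputs the second hidden vector of the RNA sequence. The first and second hidden vectors are concatenated to give the feature vector of the RNA sequence. For example, the One-Hot vector of the first base "C" is first input into the forward LSTM network, which extracts hidden features from it and outputs the hidden vector of that time step, say time t. The hidden vector of time t is then concatenated with the One-Hot vector of the second base "C" at time t+1, the concatenated vector is input into the forward LSTM network, hidden features are extracted from it, and the hidden vector of time t+1 is output. In the same way, the One-Hot vector of the base at the current time step is concatenated with the hidden vector passed down from the previous time step and processed by the forward LSTM network, until the One-Hot vector of the eighth base "A" has been input and the hidden vector of the last time step is output, which is the first hidden vector of the RNA sequence. Meanwhile, the One-Hot vector of the eighth base "A" is first input into the backward LSTM network, which extracts hidden features from it and outputs the hidden vector of that time step, say time t. The hidden vector of time t is then concatenated with the One-Hot vector of the seventh base "G" at time t+1, the concatenated vector is input into the backward LSTM network, hidden features are extracted from it, and the hidden vector of time t+1 is output. In the same way, the One-Hot vector of the base at the current time step is concatenated with the hidden vector passed down from the previous time step and processed by the backward LSTM network, until the One-Hot vector of the first base "C" has been input and the hidden vector of the last time step is output, which is the second hidden vector of the RNA sequence. Concatenating the first and second hidden vectors gives the feature vector of the RNA sequence. When the feature vector of an RNA sequence is obtained through the BiLSTM model, all sequence information in both the forward and backward directions is available, so relationships between bases separated by some distance can also be extracted, which helps predict the interaction between the input RNA sequence and protein sequence more accurately.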
通过两个BiLSTM子网络分别得到RNA序列的特征向量和蛋白质序列的特征向量后,可以利用BiLSTM模型的拼接层将RNA序列的特征向量与蛋白质序列的特征向量首尾拼接,在全连接层增加偏置输出拼接后的RNA序列的特征向量与蛋白质序列的特征向量,并添加随机失活层,以使用随机丢弃一部分神经网络节点的神经网络进行训练。另外,最后一层的激活函数可以选择Sigmoid激活函数。
表5
Figure PCTCN2021121103-appb-000020
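For example, when training the CNN model and the BiLSTM model, the stochastic gradient descent algorithm may be used to update the model parameters of the two models. Following the back-propagation principle, an objective function such as the cross-entropy loss is computed repeatedly, and the model parameters of each interaction prediction model are updated according to the computed loss value. When the objective function converges to its minimum, training of all model parameters is complete. Alternatively, the model parameters may be updated iteratively by back-propagation until a preset number of iterations is reached, at which point training of all model parameters is complete. For example, the preset number of iterations may be 20; during the 20 backward iterations each interaction prediction model keeps updating its model parameters, and the optimized model parameters are obtained when the iterations finish. In other examples, alternating least squares, the Adam optimization algorithm and the like may also be used to minimize the objective function, updating the model parameters from back to front to optimize them.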
各个模型训练完成后,可以利用最终得到的各个模型对未知RNA-蛋白质对之间的相互作用进行预测,得到多个预测值,并融合多个预测值得到最终的预测结果。最后,可以将该RNA-蛋白质对相互作用的预测结果输出至终端设备以供用户查看。
本公开示例实施方式中,也可以获取至少一个RNA序列,并通过至少三个相互作用预测模型在数据库中查找与输入的每个RNA序列之间有相互作用的蛋白质序列。具体地,获取原始数据集后,可以参考图5利用原始数据集对至少三个相互作用预测模型预先进行训练。训练完成后,可以将参与训练的所有蛋白质序列存储至数据库中。需要说明的是,数据库中也可以包括其他未参与训练的蛋白质序列,也即数据库中的蛋白质序列的种类数目 可以是任意的,数据库中还可以包括任意数目的RNA序列,如可以包括但不限于参与训练的所有RNA序列,本公开中对此不做具体限定。
当用户输入至少一个RNA序列后,可以将输入的每个RNA序列与数据库中的所有蛋白质序列组合成若干个RNA-蛋白质对。进一步的,可以根据步骤S220至步骤S250使用至少三个相互作用预测模型对每个RNA-蛋白质对的相互作用进行预测。具体地,可以对每个RNA-蛋白质对进行特征提取和向量化处理,将得到的序列特征、RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入至少三个相互作用预测模型中,并得到每个RNA-蛋白质对的相互作用预测值。相互作用预测值为1表示该RNA-蛋白质对有相互作用,相互作用预测值为0表示RNA-蛋白质对没有相互作用。然后,可以将所有相互作用预测值为1的RNA-蛋白质对筛选出来,并将每个RNA-蛋白质对中的蛋白质序列输出至终端设备,以供用户查看与输入RNA序列有相互作用的蛋白质序列。
类似的,本公开示例实施方式中,还可以获取至少一个蛋白质序列,并通过至少三个相互作用预测模型在数据库中查找与输入的每个蛋白质序列之间有相互作用的RNA序列。示例性的,当用户输入至少一个蛋白质序列后,可以将输入的每个蛋白质序列与数据库中的所有RNA序列组合成若干个RNA-蛋白质对。进一步的,可以根据步骤S220至步骤S250使用至少三个相互作用模型对每个RNA-蛋白质对的相互作用进行预测。具体地,可以对每个RNA-蛋白质对进行特征提取和向量化处理,将得到的序列特征、每个RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入至少3个相互作用预测模型中,并得到每个RNA-蛋白质对的相互作用预测值。相互作用预测值为1表示该RNA-蛋白质对有相互作用,相互作用预测值为0表示RNA-蛋白质对没有相互作用。然后,可以将所有相互作用预测值为1的RNA-蛋白质对筛选出来,并将每个RNA-蛋白质对中的RNA序列输出至终端设备,以供用户查看与输入蛋白质序列有相互作用的RNA序列。
在本公开示例实施方式所提供的RNA-蛋白质相互作用预测方法中,通过获取待预测的RNA-蛋白质对;对所述待预测的RNA-蛋白质对进行特征提取,得到所述待预测的RNA-蛋白质对的序列特征;向量化所述待预测的RNA-蛋白质对,得到所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量;基于所述待预测的RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用相互作用预测模型得到所述待预测的RNA-蛋白质对的相互作用预测值;根据所述相互作用预测值确定所述RNA和蛋白质之间的相互作用。一方面,通过对RNA-蛋白质对进行特征提取和向量化处理,可以充分挖掘RNA序列和蛋白质序列之间的联系,以便于准确地预测RNA和蛋白质的相互作用;另一方面,将相互作用预测模型的特性进行有效结合,可以进一步提高预测RNA与蛋白质相互作用的准确性。
应当注意,尽管在附图中以特定顺序描述了本公开中方法的各个步骤,但是,这并非要求或者暗示必须按照该特定顺序来执行这些步骤,或是必须执行全部所示的步骤才能实现期望的结果。附加的或备选的,可以省略某些步骤,将多个步骤合并为一个步骤执行, 以及/或者将一个步骤分解为多个步骤执行等。
进一步的,本示例实施方式中,还提供了一种RNA-蛋白质相互作用预测装置。该装置可以应用于一服务器或终端设备。参考图6所示,该RNA-蛋白质相互作用预测装置600可以包括数据获取模块610、特征提取模块620、数据向量化模块630、相互作用预测模块640以及相互作用确定模块650,其中:
数据获取模块610,用于获取待预测的RNA-蛋白质对;
特征提取模块620,用于对所述待预测的RNA-蛋白质对进行特征提取,得到所述待预测的RNA-蛋白质对的序列特征;
数据向量化模块630,用于向量化所述待预测的RNA-蛋白质对,得到所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量;
相互作用预测模块640,用于基于所述待预测的RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用相互作用预测模型得到所述待预测的RNA-蛋白质对的相互作用预测值;
相互作用确定模块650,用于根据所述相互作用预测值确定所述RNA和蛋白质之间的相互作用。
在一种可选的实施方式中,特征提取模块620包括:
特征集获取模块,用于获取原始序列特征集;
特征确定模块,用于根据所述原始序列特征集确定所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,特征确定模块包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列;
第一序列查找单元,用于在所述原始序列特征集中查找每种k-mer子序列,并根据查找结果得到所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,特征确定模块还包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列;
序列组合单元,用于将所述RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对;
第二序列查找单元,用于在所述原始序列特征集中查找每种RNA-蛋白质k-mer子序列对,并根据查找结果得到所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,特征确定模块还包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括RNA k-mer子序列和蛋白质k-mer子序列;
第一序列查找单元,用于在所述原始序列特征集中查找每种k-mer子序列,得到第一 序列特征;
序列组合单元,用于将所述RNA k-mer子序列和蛋白质k-mer子序列进行组合,得到多种RNA-蛋白质k-mer子序列对;
第二序列查找单元,用于在所述原始序列特征集中查找每种RNA-蛋白质k-mer子序列对,得到第二序列特征;
特征拼接单元,用于由所述第一序列特征和第二序列特征组成所述待预测的RNA-蛋白质对的序列特征。
在一种可选的实施方式中,数据向量化模块630包括:
序列转化单元,用于将所述待预测的RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,所述k-mer子序列包括M个RNA k-mer子序列和N个蛋白质k-mer子序列;
第一向量化单元,用于将每个RNA k-mer子序列向量化,得到M个RNA k-mer向量;
第一拼接单元,用于拼接所述M个RNA k-mer向量得到所述RNA序列表示向量;
第二向量化单元,用于将每个蛋白质k-mer序列向量化,得到N个蛋白质k-mer向量;
第二拼接单元,用于拼接所述N个蛋白质k-mer向量得到所述蛋白质序列表示向量。
在一种可选的实施方式中,数据向量化模块630还包括:
碱基向量获取单元,用于将所述待预测的RNA-蛋白质对中的RNA序列包含的每个碱基向量化,得到多个碱基向量;
碱基向量拼接单元,用于拼接所述多个碱基向量得到所述RNA序列表示向量;
氨基酸向量获取单元,用于将所述待预测的RNA-蛋白质对中的蛋白质序列包含的每个氨基酸向量化,得到多个氨基酸向量;
氨基酸向量拼接单元,用于拼接所述多个氨基酸向量得到所述蛋白质序列表示向量。
在一种可选的实施方式中,相互作用预测模块640被配置为用于基于所述待预测的RNA-蛋白质对的序列特征、待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量,使用至少三个相互作用预测模型得到所述待预测的RNA-蛋白质对的多个相互作用预测值。
在一种可选的实施方式中,相互作用预测模块640包括:
第一预测单元,用于将所述待预测的RNA-蛋白质对的序列特征输入到第一相互作用预测模型中,得到第一相互作用预测值;
第二预测单元,用于将所述待预测的RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入到第二相互作用预测模型中,得到第二相互作用预测值;
其中,包括至少一个所述第一相互作用模型和至少两个所述第二相互作用预测模型;或者,包括至少两个所述第一相互作用模型和至少一个所述第二相互作用预测模型。
在一种可选的实施方式中,相互作用预测模块640还包括:
第三预测单元,用于将所述RNA-蛋白质对的序列特征输入到传统机器学习模型中,得 到第一相互作用预测值;
第四预测单元,用于将所述RNA-蛋白质对中的RNA序列表示向量和蛋白质序列表示向量输入到深度学习模型中,得到第二相互作用预测值;
其中,包括至少一个所述传统机器学习模型和至少两个所述深度学习模型;或者,包括至少两个所述传统机器学习模型和至少一个所述深度学习模型。
在一种可选的实施方式中,所述传统机器学习模型包括支持向量机模型、逻辑回归模型、和决策树模型中的至少一种,深度学习模型包括卷积神经网络模型和循环神经网络模型中的至少一种。
在一种可选的实施方式中,相互作用确定模块650包括:
预测值标记单元,用于对所述多个相互作用预测值进行标记,得到多个标记值;
相互作用确定单元,用于对所述多个标记值进行求和,并根据求和结果确定所述RNA和蛋白质之间的相互作用。
在一种可选的实施方式中,特征集获取模块包括:
数据集获取模块,用于获取原始数据集;
特征提取模块,用于对所述原始数据集中的每个RNA-蛋白质对进行特征提取,得到所述原始序列特征集。
在一种可选的实施方式中,特征提取模块包括:
序列生成单元,用于对RNA和蛋白质的基本单元分别进行排列组合得到k-mer子序列;
方差计算单元,用于计算每种k-mer子序列在所述每个RNA-蛋白质对中出现次数的平均值,并根据所述出现次数的平均值计算所述每种k-mer子序列的方差;
数据集确定单元,用于根据所述每种k-mer子序列的方差大小确定所述原始序列特征集。
在一种可选的实施方式中,方差计算单元包括:
次数统计子单元,用于遍历所述原始数据集,确定所述每种k-mer子序列在所述每个RNA-蛋白质对的出现次数;
总次数统计子单元,用于对所述每种k-mer子序列在所述每个RNA-蛋白质对的出现次数进行统计得到所述每种k-mer子序列在所述原始数据集中的出现总次数;
次数均值确定子单元,用于根据所述出现总次数计算得到所述每种k-mer子序列在所述每个RNA-蛋白质对中出现次数的平均值;
方差计算子单元,用于根据所述每种k-mer子序列在所述每个RNA-蛋白质对中出现次数的平均值和在每个RNA-蛋白质对的出现次数计算所述每种k-mer子序列的方差。
在一种可选的实施方式中,方差计算子单元被配置为用于根据:
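$$s^2 = \frac{(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_n - m)^2}{n}$$
to calculate the variance $s^2$ of each k-mer subsequence, where n is the number of RNA-protein pairs in the original data set, m is the average number of occurrences of the k-mer subsequence in each RNA-protein pair, and $x_n$ is the number of occurrences of the k-mer subsequence in the n-th RNA-protein pair.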
在一种可选的实施方式中,数据集确定单元被配置为用于根据所述每种k-mer子序列的方差大小确定满足预设条件的k-mer子序列,并由所述满足预设条件的k-mer子序列组成所述原始序列特征集。
在一种可选的实施方式中,特征提取模块还包括:
子序列对生成单元,用于将所述每个RNA-蛋白质对中的RNA序列和蛋白质序列分别转化为k-mer子序列,得到k-mer子序列对;
特征集获取单元,用于统计所述每种k-mer子序列对在所述原始数据集中的出现频率,由满足出现频率预设条件的k-mer子序列对组成所述原始序列特征集。
在一种可选的实施方式中,RNA-蛋白质相互作用预测装置600还包括:
训练模块,用于对所述相互作用预测模型进行训练。
在一种可选的实施方式中,训练模块包括:
训练数据获取单元,用于获取训练数据集,所述训练数据集中包括正例RNA-蛋白质对和负例RNA-蛋白质对;
模型参数训练单元,用于将所述训练数据集作为所述相互作用预测模型的输入,对所述相互作用预测模型的模型参数进行迭代更新,当满足迭代终止条件时,完成对所有模型参数的训练,以使用训练好的所述相互作用预测模型对所述待预测的RNA-蛋白质对的相互作用进行预测。
在一种可选的实施方式中,RNA-蛋白质相互作用预测装置600还包括:
数据输出模块,用于输出所述RNA和蛋白质之间的相互作用的预测结果。
上述RNA-蛋白质相互作用预测装置中各模块的具体细节已经在对应的RNA-蛋白质相互作用预测方法中进行了详细的描述,因此此处不再赘述。
上述装置中各模块可以是通用处理器,包括:中央处理器、网络处理器等;还可以是数字信号处理器、专用集成电路、现场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。各模块也可以由软件、固件等形式来实现。上述装置中的各处理器可以是独立的处理器,也可以集成在一起。
本公开的示例性实施方式还提供了一种计算机可读存储介质,其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中,本公开的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当程序产品在电子设备上运行时,程序代码用于使电子设备执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。该程序产品可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在电子设备,例如个人电脑上运行。然而,本公开的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。
本公开的示例性实施方式还提供了一种能够实现上述方法的电子设备。下面参照图7来描述根据本公开的这种示例性实施方式的电子设备700。图7显示的电子设备700仅仅是一个示例,不应对本公开实施方式的功能和使用范围带来任何限制。
如图7所示,电子设备700可以以通用计算设备的形式表现。电子设备700的组件可以包括但不限于:至少一个处理单元710、至少一个存储单元720、连接不同系统组件(包括存储单元720和处理单元710)的总线730和显示单元740。
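The storage unit 720 stores program code that can be executed by the processing unit 710, so that the processing unit 710 performs the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification. For example, the processing unit 710 may perform any one or more of the method steps in FIG. 2 to FIG. 5.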
存储单元720可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)721和/或高速缓存存储单元722,还可以进一步包括只读存储单元(ROM)723。
存储单元720还可以包括具有一组(至少一个)程序模块725的程序/实用工具724,这样的程序模块725包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。
总线730可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。
电子设备700也可以与一个或多个外部设备800(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该电子设备700交互的设备通信,和/或与使得该电子设备700能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口750进行。并且,电子设备700还可以通过网络适配器760与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器760通过总线730与电子设备700的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备700使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
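In some embodiments, the RNA-protein interaction prediction method described in the present disclosure may be executed by the processing unit 710 of the electronic device. In some embodiments, the RNA-protein pair to be predicted (or the RNA sequence and the protein sequence to be predicted), the original data set, and the training data sets used to train the individual interaction prediction models may be input through the input/output (I/O) interface 750, for example via a user interface of the electronic device. In some embodiments, the prediction result for the interaction of the RNA-protein pair to be predicted may be output through the I/O interface 750 to an external device 800 for the user to view.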
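By way of a non-limiting illustration, the flow described above (a pair is received, sequence features are derived, several models produce prediction values, and a result is output) might be sketched in Python as follows. Everything named here is hypothetical: the sliding-window k-mer split, the values of `k_rna` and `k_protein`, the 0.5 threshold, and the +1/-1 marking scheme are assumptions, and the lambda "models" are placeholders for trained interaction prediction models. For brevity every placeholder model consumes the k-mer count features, whereas the claims also recite models that consume sequence representation vectors.

```python
from collections import Counter
from typing import Callable, Sequence

Feature = tuple[str, str]  # (molecule type, k-mer), e.g. ("rna", "AUG")


def kmers(seq: str, k: int) -> list[str]:
    """Split a sequence into overlapping k-mers (stride 1; an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_features(rna: str, protein: str, feature_set: Sequence[Feature],
                      k_rna: int = 3, k_protein: int = 2) -> list[int]:
    """Count how often each entry of the feature set occurs in the pair."""
    counts = Counter(("rna", s) for s in kmers(rna, k_rna))
    counts.update(("protein", s) for s in kmers(protein, k_protein))
    return [counts[f] for f in feature_set]  # Counter yields 0 for absent k-mers


def vote(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Mark each score +1 or -1, sum the marks, and report an interaction
    when the sum is positive (marking scheme is an assumption)."""
    marks = [1 if s >= threshold else -1 for s in scores]
    return sum(marks) > 0


def predict_interaction(rna: str, protein: str, feature_set: Sequence[Feature],
                        models: Sequence[Callable[[list[int]], float]]) -> bool:
    """Feature extraction, per-model prediction values, then a vote."""
    x = sequence_features(rna, protein, feature_set)
    return vote([model(x) for model in models])


# Hypothetical feature set and stand-in models, for illustration only.
feature_set = [("rna", "AUG"), ("protein", "KV"), ("rna", "CCC")]
models = [lambda x: 0.9, lambda x: 0.2, lambda x: 0.7]
print(predict_interaction("AUGAUG", "MKV", feature_set, models))  # True
```

Using an odd number of models (at least three) means the sum of +1/-1 marks can never be zero, which avoids ties in this particular marking scheme.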
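From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solutions according to the embodiments of the present disclosure may therefore be embodied as a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes instructions that cause a computing device (such as a personal computer, a server, a terminal device, or a network device) to perform the method according to the exemplary embodiments of the present disclosure.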
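In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to exemplary embodiments of the present disclosure and are not intended to be limiting. It is readily understood that the processing shown in the drawings does not indicate or restrict the chronological order of these processes. It is also readily understood that these processes may be performed, for example, synchronously or asynchronously in multiple modules.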
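It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units. It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.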

Claims (24)

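  1. An RNA-protein interaction prediction method, characterized by comprising:
    obtaining an RNA-protein pair to be predicted;
    performing feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted;
    vectorizing the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the RNA-protein pair to be predicted;
    obtaining an interaction prediction value for the RNA-protein pair to be predicted using an interaction prediction model, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted; and
    determining the interaction between the RNA and the protein according to the interaction prediction value.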
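  2. The RNA-protein interaction prediction method according to claim 1, characterized in that performing feature extraction on the RNA-protein pair to be predicted to obtain the sequence features of the RNA-protein pair to be predicted comprises:
    obtaining an original sequence feature set; and
    determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set.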
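  3. The RNA-protein interaction prediction method according to claim 2, characterized in that determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set comprises:
    converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively; and
    looking up each kind of k-mer subsequence in the original sequence feature set, and obtaining the sequence features of the RNA-protein pair to be predicted from the lookup results.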
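  4. The RNA-protein interaction prediction method according to claim 2, characterized in that determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set comprises:
    converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
    combining the RNA k-mer subsequences and the protein k-mer subsequences to obtain multiple kinds of RNA-protein k-mer subsequence pairs; and
    looking up each kind of RNA-protein k-mer subsequence pair in the original sequence feature set, and obtaining the sequence features of the RNA-protein pair to be predicted from the lookup results.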
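  5. The RNA-protein interaction prediction method according to claim 2, characterized in that determining the sequence features of the RNA-protein pair to be predicted according to the original sequence feature set comprises:
    converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively, the k-mer subsequences including RNA k-mer subsequences and protein k-mer subsequences;
    looking up each kind of k-mer subsequence in the original sequence feature set to obtain first sequence features;
    combining the RNA k-mer subsequences and the protein k-mer subsequences to obtain multiple kinds of RNA-protein k-mer subsequence pairs;
    looking up each kind of RNA-protein k-mer subsequence pair in the original sequence feature set to obtain second sequence features; and
    forming the sequence features of the RNA-protein pair to be predicted from the first sequence features and the second sequence features.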
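  6. The RNA-protein interaction prediction method according to claim 1, characterized in that vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted comprises:
    converting the RNA sequence and the protein sequence of the RNA-protein pair to be predicted into k-mer subsequences, respectively, the k-mer subsequences including M RNA k-mer subsequences and N protein k-mer subsequences;
    vectorizing each RNA k-mer subsequence to obtain M RNA k-mer vectors;
    concatenating the M RNA k-mer vectors to obtain the RNA sequence representation vector;
    vectorizing each protein k-mer subsequence to obtain N protein k-mer vectors; and
    concatenating the N protein k-mer vectors to obtain the protein sequence representation vector.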
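  7. The RNA-protein interaction prediction method according to claim 1, characterized in that vectorizing the RNA-protein pair to be predicted to obtain the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted comprises:
    vectorizing each base contained in the RNA sequence of the RNA-protein pair to be predicted to obtain multiple base vectors;
    concatenating the multiple base vectors to obtain the RNA sequence representation vector;
    vectorizing each amino acid contained in the protein sequence of the RNA-protein pair to be predicted to obtain multiple amino acid vectors; and
    concatenating the multiple amino acid vectors to obtain the protein sequence representation vector.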
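  8. The RNA-protein interaction prediction method according to claim 1, characterized in that obtaining the interaction prediction value for the RNA-protein pair to be predicted using the interaction prediction model, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted, comprises:
    obtaining multiple interaction prediction values for the RNA-protein pair to be predicted using at least three interaction prediction models, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted.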
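  9. The RNA-protein interaction prediction method according to claim 1, characterized in that obtaining the interaction prediction value for the RNA-protein pair to be predicted using the interaction prediction model, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted, comprises:
    inputting the sequence features of the RNA-protein pair to be predicted into a first interaction prediction model to obtain a first interaction prediction value; and
    inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into a second interaction prediction model to obtain a second interaction prediction value;
    wherein at least one first interaction prediction model and at least two second interaction prediction models are included, or at least two first interaction prediction models and at least one second interaction prediction model are included.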
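  10. The RNA-protein interaction prediction method according to claim 1, characterized in that obtaining the interaction prediction value for the RNA-protein pair to be predicted using the interaction prediction model, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted, comprises:
    inputting the sequence features of the RNA-protein pair to be predicted into a traditional machine learning model to obtain a first interaction prediction value; and
    inputting the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted into a deep learning model to obtain a second interaction prediction value;
    wherein at least one traditional machine learning model and at least two deep learning models are included, or at least two traditional machine learning models and at least one deep learning model are included.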
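  11. The RNA-protein interaction prediction method according to claim 10, characterized in that the traditional machine learning model includes at least one of a support vector machine model, a logistic regression model, and a decision tree model, and the deep learning model includes at least one of a convolutional neural network model and a recurrent neural network model.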
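  12. The RNA-protein interaction prediction method according to claim 1, characterized in that determining the interaction between the RNA and the protein according to the interaction prediction value comprises:
    marking the multiple interaction prediction values to obtain multiple mark values; and
    summing the multiple mark values, and determining the interaction between the RNA and the protein according to the result of the summation.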
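  13. The RNA-protein interaction prediction method according to claim 2, characterized in that obtaining the original sequence feature set comprises:
    obtaining an original data set; and
    performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set.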
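  14. The RNA-protein interaction prediction method according to claim 13, characterized in that performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set comprises:
    arranging and combining the basic units of RNA and of protein, respectively, to obtain k-mer subsequences;
    calculating the average number of occurrences of each kind of k-mer subsequence in each RNA-protein pair, and calculating the variance of each kind of k-mer subsequence from that average number of occurrences; and
    determining the original sequence feature set according to the magnitude of the variance of each kind of k-mer subsequence.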
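  15. The RNA-protein interaction prediction method according to claim 14, characterized in that calculating the average number of occurrences of each kind of k-mer subsequence in each RNA-protein pair, and calculating the variance of each kind of k-mer subsequence from that average number of occurrences, comprises:
    traversing the original data set to determine the number of occurrences of each kind of k-mer subsequence in each RNA-protein pair;
    tallying the number of occurrences of each kind of k-mer subsequence in each RNA-protein pair to obtain the total number of occurrences of each kind of k-mer subsequence in the original data set;
    calculating, from the total number of occurrences, the average number of occurrences of each kind of k-mer subsequence in each RNA-protein pair; and
    calculating the variance of each kind of k-mer subsequence from its average number of occurrences in each RNA-protein pair and its number of occurrences in each RNA-protein pair.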
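  16. The RNA-protein interaction prediction method according to claim 15, characterized in that calculating the variance of each kind of k-mer subsequence from its average number of occurrences in each RNA-protein pair and its number of occurrences in each RNA-protein pair comprises:
    calculating the variance s^2 of each kind of k-mer subsequence according to:
    Figure PCTCN2021121103-appb-100001
    where n is the number of RNA-protein pairs in the original data set, m is the average number of occurrences of each kind of k-mer subsequence in each RNA-protein pair, and x_n is the number of occurrences of each kind of k-mer subsequence in the n-th RNA-protein pair.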
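  17. The RNA-protein interaction prediction method according to claim 14, characterized in that determining the original sequence feature set according to the magnitude of the variance of each kind of k-mer subsequence comprises:
    determining, according to the magnitude of the variance of each kind of k-mer subsequence, the k-mer subsequences that satisfy a preset condition, and forming the original sequence feature set from the k-mer subsequences that satisfy the preset condition.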
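  18. The RNA-protein interaction prediction method according to claim 13, characterized in that performing feature extraction on each RNA-protein pair in the original data set to obtain the original sequence feature set further comprises:
    converting the RNA sequence and the protein sequence of each RNA-protein pair into k-mer subsequences, respectively, to obtain k-mer subsequence pairs; and
    counting the frequency of occurrence of each kind of k-mer subsequence pair in the original data set, and forming the original sequence feature set from the k-mer subsequence pairs that satisfy a preset frequency-of-occurrence condition.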
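  19. The RNA-protein interaction prediction method according to any one of claims 1 to 18, characterized in that the method further comprises:
    training the interaction prediction model.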
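  20. The RNA-protein interaction prediction method according to claim 19, characterized in that training the interaction prediction model comprises:
    obtaining a training data set, the training data set including positive-example RNA-protein pairs and negative-example RNA-protein pairs; and
    using the training data set as the input of the interaction prediction model and iteratively updating the model parameters of the interaction prediction model, the training of all model parameters being completed when an iteration termination condition is met, so that the trained interaction prediction model is used to predict the interaction of the RNA-protein pair to be predicted.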
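  21. The RNA-protein interaction prediction method according to claim 1, characterized in that the method further comprises:
    outputting the prediction result for the interaction between the RNA and the protein.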
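  22. An RNA-protein interaction prediction apparatus, characterized by comprising:
    a data acquisition module, configured to obtain an RNA-protein pair to be predicted;
    a feature extraction module, configured to perform feature extraction on the RNA-protein pair to be predicted to obtain sequence features of the RNA-protein pair to be predicted;
    a data vectorization module, configured to vectorize the RNA-protein pair to be predicted to obtain an RNA sequence representation vector and a protein sequence representation vector of the RNA-protein pair to be predicted;
    an interaction prediction module, configured to obtain an interaction prediction value for the RNA-protein pair to be predicted using an interaction prediction model, based on the sequence features of the RNA-protein pair to be predicted and on the RNA sequence representation vector and the protein sequence representation vector of the RNA-protein pair to be predicted; and
    an interaction determination module, configured to determine the interaction between the RNA and the protein according to the interaction prediction value.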
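  23. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 21.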
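  24. An electronic device, characterized by comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 21 by executing the executable instructions.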
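As a non-limiting illustration of the variance-based construction of the original sequence feature set recited in claims 14 to 17, the following Python sketch enumerates every k-mer over an alphabet, counts its occurrences in each sequence, and keeps the k-mers with the largest variance. Several points are assumptions rather than statements of the claimed method: the variance formula is published only as an image, so the sketch assumes the population form s^2 = ((x_1 - m)^2 + ... + (x_n - m)^2) / n; the "preset condition" is assumed to be "the `top` largest variances"; and, for brevity, counts are taken over RNA sequences only, whereas the claims count occurrences per RNA-protein pair (protein sequences could be handled the same way with an amino-acid alphabet). All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def count_kmers(seq: str, k: int) -> Counter:
    """Occurrences of each overlapping k-mer in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_variances(sequences: list[str], alphabet: str, k: int) -> dict[str, float]:
    """Variance of the per-sequence occurrence count of every possible k-mer.

    `sequences` must be non-empty.
    """
    n = len(sequences)
    per_sequence = [count_kmers(seq, k) for seq in sequences]
    variances = {}
    for letters in product(alphabet, repeat=k):  # every arrangement of basic units
        kmer = "".join(letters)
        xs = [counts[kmer] for counts in per_sequence]  # occurrences per sequence
        m = sum(xs) / n  # total occurrences divided by n gives the mean
        # Population variance (dividing by n is an assumption).
        variances[kmer] = sum((x - m) ** 2 for x in xs) / n
    return variances


def select_features(variances: dict[str, float], top: int) -> list[str]:
    """Keep the `top` k-mers with the largest variance (ties broken alphabetically)."""
    return sorted(variances, key=lambda kmer: (-variances[kmer], kmer))[:top]


# Toy data set of two RNA sequences, for illustration only.
variances = kmer_variances(["AAAA", "ACGU"], "ACGU", 2)
print(select_features(variances, 2))  # ['AA', 'AC']
```

A k-mer whose count barely changes across the data set has a variance near zero and carries little information for telling pairs apart, which is one plausible reason to rank candidate features by variance.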
PCT/CN2021/121103 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium, and electronic device WO2023044931A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
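CN202180002693.6A CN116897396A (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium, and electronic device
PCT/CN2021/121103 WO2023044931A1 (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium, and electronic device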

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/121103 WO2023044931A1 (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2023044931A1 (zh) 2023-03-30

Family

ID=85719932

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121103 WO2023044931A1 (zh) 2021-09-27 2021-09-27 RNA-protein interaction prediction method, apparatus, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN116897396A (zh)
WO (1) WO2023044931A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050053999A1 (en) * 2000-11-14 2005-03-10 Gough David A. Method for predicting G-protein coupled receptor-ligand interactions
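CN106446602A (zh) * 2016-09-06 2017-02-22 中南大学 Method and system for predicting RNA binding sites in protein molecules
CN111192631A (zh) * 2020-01-02 2020-05-22 中国科学院计算技术研究所 Method and system for constructing a model for predicting protein-RNA interaction binding sites
CN112270958A (zh) * 2020-10-23 2021-01-26 大连民族大学 Method for predicting miRNA-lncRNA interaction relationships based on hierarchical deep learning
CN113053462A (zh) * 2021-03-11 2021-06-29 同济大学 Method and system for predicting RNA-protein binding preference based on a bidirectional attention mechanism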

Also Published As

Publication number Publication date
CN116897396A (zh) 2023-10-17

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180002693.6

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 17916540

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21958049

Country of ref document: EP

Kind code of ref document: A1