CN111091865B - Method, device, equipment and storage medium for generating MoRFs prediction model - Google Patents

Info

Publication number
CN111091865B
Authority
CN
China
Prior art keywords
morfs
feature vector
protein
sub
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911330914.3A
Other languages
Chinese (zh)
Other versions
CN111091865A (en
Inventor
汤一凡
崔朝辉
赵立军
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201911330914.3A
Publication of CN111091865A
Application granted
Publication of CN111091865B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method, a device, equipment and a storage medium for generating a MoRFs prediction model. The method comprises: obtaining a plurality of MoRFs fragments and a plurality of non-MoRFs fragments, wherein each MoRFs fragment consists of a plurality of first sites and each non-MoRFs fragment comprises a plurality of second sites; extracting a first feature vector corresponding to each first site and a second feature vector corresponding to each second site; and training a pre-constructed initial prediction model with the first feature vectors and the second feature vectors to generate a target prediction model, wherein the target prediction model is used for predicting whether a site in a protein belongs to a MoRFs fragment. With the target prediction model, the sites on a protein that belong to MoRFs fragments can be predicted conveniently, quickly and accurately.

Description

Method, device, equipment and storage medium for generating MoRFs prediction model
Technical Field
The application relates to the technical field of bioinformatics, and in particular to a method and a device for generating a prediction model of Molecular Recognition Features (MoRFs), and to a method, a device, equipment and a storage medium for predicting MoRFs.
Background
Generally, some proteins fold under natural conditions into a specific three-dimensional structure, and their biological functions can be analyzed and determined from that structure. Other proteins cannot form a definite three-dimensional structure under natural conditions and are called intrinsically disordered proteins (IDPs). Because the three-dimensional structure of an IDP is indeterminate, its biological function cannot be determined by structural analysis.
In IDPs, MoRFs can convert disordered protein sequences into ordered ones and represent the binding sites between the IDPs and other proteins, so they can be used to analyze the biological functions of the IDPs. Determining the MoRFs in IDPs is therefore of great significance for analyzing the biological functions of IDPs. Based on this, it is desirable to provide a method capable of rapidly and accurately identifying the MoRFs in IDPs, so as to analyze and determine the biological functions of the IDPs.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present application provide a method, a device, equipment and a storage medium for generating a MoRFs prediction model, with which it can be identified conveniently, quickly and accurately whether each site in an IDP belongs to a MoRFs fragment.
In a first aspect, a method for generating a MoRFs prediction model is provided, including:
obtaining a plurality of molecular recognition feature (MoRFs) fragments and a plurality of non-MoRFs fragments, wherein each MoRFs fragment consists of a plurality of first sites, and each non-MoRFs fragment comprises a plurality of second sites;
extracting a first feature vector corresponding to each first site and a second feature vector corresponding to each second site;
and training a pre-constructed initial prediction model by using the first feature vectors and the second feature vectors to generate a target prediction model, wherein the target prediction model is used for predicting whether a site in a protein belongs to a MoRFs fragment.
Optionally, the number of the first feature vectors is the same as the number of the second feature vectors.
Optionally, the obtaining a plurality of MoRFs segments and a plurality of non-MoRFs segments includes:
screening a plurality of the MoRFs fragments from a sequence library of intrinsically disordered proteins (IDPs);
and selecting a plurality of non-MoRFs fragments, each separated from every MoRFs fragment by a first preset length.
Optionally, the extracting a first feature vector corresponding to each first site and a second feature vector corresponding to each second site includes:
for each MoRFs fragment, obtaining, by using a protein comparison tool, a first position-specific scoring matrix (PSSM) corresponding to the protein in which the MoRFs fragment is located;
taking each first site in the MoRFs fragment as a center and extending outward by a second preset length based on the first PSSM to obtain a first sub-feature vector corresponding to each first site; and taking each second site in the non-MoRFs fragment as a center and extending outward by the second preset length based on the first PSSM to obtain a second sub-feature vector corresponding to each second site;
obtaining a third sub-feature vector of the protein in which the MoRFs fragment is located according to the amino acid occurrence frequencies and physicochemical properties of that protein;
obtaining the first feature vector corresponding to each first site based on the third sub-feature vector and the first sub-feature vector corresponding to that first site; and obtaining the second feature vector corresponding to each second site based on the third sub-feature vector and the second sub-feature vector corresponding to that second site.
Optionally, the method further comprises:
obtaining a protein to be predicted, wherein the protein to be predicted comprises N sites, and N is an integer greater than 1;
extracting an ith feature vector corresponding to an ith site of the protein to be predicted, wherein i = 1, 2, …, N;
and obtaining an ith prediction result according to the ith feature vector and the target prediction model, wherein the ith prediction result indicates whether the ith site belongs to a MoRFs fragment.
Optionally, the extracting an ith feature vector corresponding to an ith site of the protein to be predicted includes:
obtaining a second PSSM corresponding to the protein to be predicted by using the protein comparison tool, and, taking the ith site as a center, extending outward by the second preset length based on the second PSSM to obtain a fourth sub-feature vector corresponding to the ith site;
obtaining a fifth sub-feature vector of the protein to be predicted according to the amino acid occurrence frequencies and physicochemical properties of the protein to be predicted;
obtaining the ith feature vector corresponding to the ith site based on the fourth sub-feature vector and the fifth sub-feature vector;
the obtaining of the ith prediction result according to the ith feature vector and the target prediction model specifically includes:
inputting the ith feature vector into the target prediction model, and obtaining the ith prediction result as its output.
In a second aspect, an apparatus for generating a MoRFs prediction model is further provided, including:
a first obtaining module, configured to obtain a plurality of molecular recognition feature (MoRFs) fragments and a plurality of non-MoRFs fragments, wherein each MoRFs fragment consists of a plurality of first sites, and each non-MoRFs fragment comprises a plurality of second sites;
a first extraction module, configured to extract a first feature vector corresponding to each first site and a second feature vector corresponding to each second site;
and a generating module, configured to train a pre-constructed initial prediction model by using the first feature vectors and the second feature vectors to generate a target prediction model, wherein the target prediction model is used for predicting whether a site in a protein belongs to a MoRFs fragment.
Optionally, the number of the first feature vectors is the same as the number of the second feature vectors.
Optionally, the first obtaining module includes:
a first obtaining unit, configured to screen a plurality of the MoRFs fragments from a sequence library of intrinsically disordered proteins (IDPs);
and a second obtaining unit, configured to select a plurality of non-MoRFs fragments, each separated from every MoRFs fragment by a first preset length.
Optionally, the first extraction module includes:
a third obtaining unit, configured to obtain, for each of the MoRFs fragments and by using a protein comparison tool, a first position-specific scoring matrix (PSSM) corresponding to the protein in which the MoRFs fragment is located;
a fourth obtaining unit, configured to take each first site in the MoRFs fragment as a center and extend outward by a second preset length based on the first PSSM to obtain a first sub-feature vector corresponding to each first site; and to take each second site in the non-MoRFs fragment as a center and extend outward by the second preset length based on the first PSSM to obtain a second sub-feature vector corresponding to each second site;
a fifth obtaining unit, configured to obtain a third sub-feature vector of the protein in which each MoRFs fragment is located according to the amino acid occurrence frequencies and physicochemical properties of that protein;
a sixth obtaining unit, configured to obtain the first feature vector corresponding to each first site based on the third sub-feature vector and the first sub-feature vector corresponding to that first site; and to obtain the second feature vector corresponding to each second site based on the third sub-feature vector and the second sub-feature vector corresponding to that second site.
Optionally, the apparatus further comprises:
a second obtaining module, configured to obtain a protein to be predicted, wherein the protein to be predicted comprises N sites, and N is an integer greater than 1;
a second extraction module, configured to extract an ith feature vector corresponding to an ith site of the protein to be predicted, wherein i = 1, 2, …, N;
and a third obtaining module, configured to obtain an ith prediction result according to the ith feature vector and the target prediction model, wherein the ith prediction result indicates whether the ith site belongs to a MoRFs fragment.
Optionally, the second extraction module includes:
a seventh obtaining unit, configured to obtain, by using the protein comparison tool, a second PSSM corresponding to the protein to be predicted, and, taking the ith site as a center, extend outward by the second preset length based on the second PSSM to obtain a fourth sub-feature vector corresponding to the ith site;
an eighth obtaining unit, configured to obtain a fifth sub-feature vector of the protein to be predicted according to the amino acid occurrence frequencies and physicochemical properties of the protein to be predicted;
a ninth obtaining unit, configured to obtain, based on the fourth sub-feature vector and the fifth sub-feature vector, the ith feature vector corresponding to the ith site;
the third obtaining module is specifically configured to:
input the ith feature vector into the target prediction model, and obtain the ith prediction result as its output.
In a third aspect, an apparatus for generating a MoRFs prediction model is further provided, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method provided in the first aspect according to instructions in the program code.
In a fourth aspect, a storage medium is further provided, where the storage medium is used to store program codes, and the program codes are used to execute the method provided in the first aspect.
Compared with the prior art, the application has the following advantages:
in the embodiments of the application, a plurality of MoRFs fragments and a plurality of non-MoRFs fragments are first obtained, wherein each MoRFs fragment consists of a plurality of first sites and each non-MoRFs fragment comprises a plurality of second sites; a first feature vector corresponding to each first site and a second feature vector corresponding to each second site are then extracted; and a pre-constructed initial prediction model is trained with the plurality of first feature vectors and the plurality of second feature vectors to generate a target prediction model, which is used for predicting whether a site in a protein belongs to a MoRFs fragment. Thus, with the method provided by the embodiments of the application, the sites on a protein that belong to MoRFs fragments can be predicted conveniently, quickly and accurately simply by extracting the feature vector of each site on the protein and applying the trained target prediction model, so that the MoRFs fragments are identified. This is particularly useful for IDPs, whose three-dimensional structure is indeterminate: the biological functions of an IDP can be determined from the MoRFs fragments on it, which provides a data basis for determining those functions quickly and accurately.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings.
Fig. 1 is a schematic flowchart of a method for generating a MoRFs prediction model according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of an example of implementing step 101 provided in an embodiment of the present application;
fig. 3 is a flowchart illustrating an example of implementing step 102 provided by the embodiment of the present application;
fig. 4 is a schematic flowchart of a method for predicting MoRFs according to an embodiment of the present disclosure;
fig. 5 is a flowchart illustrating an example of implementing step 402 provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a generating apparatus of a MoRFs prediction model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a generating device of a MoRFs prediction model according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, the biological function of a protein is usually determined by analyzing its three-dimensional structure, but this is not possible for IDPs, whose three-dimensional structure is indeterminate. The inventors found that MoRFs fragments are usually present on IDPs and that a MoRFs fragment can convert a disordered IDP sequence into an ordered one, thereby revealing the biological function of the IDP. Identifying MoRFs fragments is therefore of great significance for the analysis, classification and other studies of IDPs.
However, the MoRFs fragments on a protein cannot be accurately identified at present. Based on this, the embodiments of the present application provide a method for generating a MoRFs prediction model: a plurality of MoRFs fragments and a plurality of non-MoRFs fragments are obtained, wherein each MoRFs fragment consists of a plurality of first sites and each non-MoRFs fragment comprises a plurality of second sites; a first feature vector corresponding to each first site and a second feature vector corresponding to each second site are extracted; and a pre-constructed initial prediction model is then trained with the plurality of first feature vectors and the plurality of second feature vectors to generate a target prediction model, which is used for predicting whether a site in a protein belongs to a MoRFs fragment. Thus, with the method provided by the embodiments of the application, the sites on a protein that belong to MoRFs fragments can be predicted conveniently, quickly and accurately simply by extracting the feature vector of each site on the protein and applying the trained target prediction model, so that the MoRFs fragments are identified. This is particularly useful for IDPs, whose three-dimensional structure is indeterminate: the biological functions of an IDP can be determined from the MoRFs fragments on it, which provides a data basis for determining those functions quickly and accurately.
In the present embodiment, a site refers to an amino acid in a protein sequence, that is, each amino acid in a protein sequence is referred to as a site (which may also be referred to as a residue).
Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for generating a MoRFs prediction model according to an embodiment of the present disclosure. Referring to fig. 1, in this embodiment, the method may specifically include the following steps 101 to 103:
Step 101, obtaining a plurality of molecular recognition feature (MoRFs) fragments and a plurality of non-MoRFs fragments, wherein each MoRFs fragment consists of a plurality of first sites, and each non-MoRFs fragment comprises a plurality of second sites.
It will be appreciated that the biological function of a protein can be determined by analyzing the MoRFs fragments on the protein. In general, an IDP may include one or more MoRFs fragments, each typically comprising 10 to 70 first sites. The plurality of MoRFs fragments serve as a data basis for generating the training samples used to train the initial prediction model.
To make the training samples comprehensive, sites that are not on MoRFs fragments must participate in training in addition to the first sites on the MoRFs fragments. Therefore, a plurality of non-MoRFs fragments are also obtained in step 101, where a non-MoRFs fragment is a segment of the protein other than a MoRFs fragment, and a site included in a non-MoRFs fragment is referred to as a second site.
It can be understood that, in order to balance the training samples and make the trained target prediction model more robust, when the plurality of MoRFs fragments and the plurality of non-MoRFs fragments are obtained in step 101, the total number of first sites included in all the MoRFs fragments may be made equal to the total number of second sites included in all the non-MoRFs fragments. In this way, half of the data used for training the MoRFs prediction model comes from first sites known to belong to MoRFs fragments, and the other half comes from second sites known not to belong to MoRFs fragments, so that the trained model can, to a certain extent, predict MoRFs sites more accurately.
As an example, the step 101 of acquiring a plurality of MoRFs segments and a plurality of non-MoRFs segments may be specifically implemented in the manner shown in fig. 2 described below. Referring to fig. 2, for example, the following steps 1011 to 1012 may be included:
Step 1011, screening a plurality of MoRFs fragments from a sequence library of intrinsically disordered proteins (IDPs);
It will be appreciated that, since identifying MoRFs fragments is particularly critical for IDPs, samples from an IDPs sequence library are selected for training in this embodiment. The IDPs sequence library may be, for example, the DisProt version 8.0 library of intrinsically disordered protein sequences.
In a specific implementation, IDPs verified by manual experiments and proofreading against the biological literature can be obtained from the IDPs sequence library, and the MoRFs fragments are then determined from these IDPs. For example, 364 IDPs can be obtained from the DisProt version 8.0 library, and 702 MoRFs fragments, comprising 15,542 sites in total, are determined from the 364 IDPs.
Step 1012, selecting a plurality of non-MoRFs fragments, each separated from every MoRFs fragment by a first preset length.
The first preset length may be, for example, 12 sites, and may be set according to requirements, which is not limited in this embodiment.
For example, assume that IDP 1, IDP 2 and IDP 3 each include 200 sites, and that three MoRFs fragments are obtained in step 1011, one on each IDP: MoRFs fragment 1 includes the 30 sites from the 10th to the 39th site of IDP 1, MoRFs fragment 2 includes the 51 sites from the 50th to the 100th site of IDP 2, and MoRFs fragment 3 includes the 37 sites from the 130th to the 166th site of IDP 3. Assuming that the first preset length is 24 sites, step 1012 determines: for IDP 1, non-MoRFs fragment 1 is sites 64 to 200; for IDP 2, non-MoRFs fragment 2 is sites 1 to 25 and 125 to 200; for IDP 3, non-MoRFs fragment 3 is sites 1 to 105 and 191 to 200. Thus, the determined MoRFs fragments include 30 + 51 + 37 = 118 first sites, and the non-MoRFs fragments include 137 + 25 + 76 + 105 + 10 = 353 second sites.
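The arithmetic of this example can be reproduced with a short sketch. It assumes 1-based, inclusive site numbering and one MoRFs fragment per protein, as in the example; the function name `non_morf_regions` is hypothetical and does not come from the application.

```python
from typing import List, Tuple

def non_morf_regions(protein_len: int, morf: Tuple[int, int], gap: int) -> List[Tuple[int, int]]:
    """Return the 1-based inclusive ranges that lie more than `gap` sites
    away from the MoRFs fragment `morf` = (start, end)."""
    start, end = morf
    regions = []
    left_end = start - gap - 1      # last usable site before the left margin
    if left_end >= 1:
        regions.append((1, left_end))
    right_start = end + gap + 1     # first usable site after the right margin
    if right_start <= protein_len:
        regions.append((right_start, protein_len))
    return regions

# The three IDPs of the example: 200 sites each, first preset length of 24 sites.
morfs = [(10, 39), (50, 100), (130, 166)]
regions = [non_morf_regions(200, m, 24) for m in morfs]

first_sites = sum(end - start + 1 for start, end in morfs)
second_sites = sum(end - start + 1 for per_idp in regions for start, end in per_idp)
print(regions)
print(first_sites, second_sites)
```

With these inputs the sketch yields the same ranges and the same totals (118 first sites, 353 second sites) as the text.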
To balance the training samples, the total number of first sites included in the MoRFs fragments may be counted, and the same number of sites may be randomly selected from the second sites to participate in training the MoRFs prediction model; the segments in which the selected second sites are located are recorded as the non-MoRFs fragments obtained in step 1012. For example, 118 of the 353 second sites in the above example are randomly selected to participate in training the MoRFs prediction model.
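A minimal sketch of this balancing step follows. It assumes that each site is represented as a `(protein_id, position)` pair and that sampling is uniform without replacement; both the representation and the function name `balance_second_sites` are illustrative assumptions, not details given in the application.

```python
import random
from typing import List, Tuple

Site = Tuple[str, int]  # hypothetical representation: (protein id, 1-based position)

def balance_second_sites(first_sites: List[Site], second_sites: List[Site], seed: int = 0) -> List[Site]:
    """Randomly keep as many second sites as there are first sites.

    random.Random.sample draws without replacement and raises ValueError
    if fewer second sites are available than first sites.
    """
    rng = random.Random(seed)
    return rng.sample(second_sites, len(first_sites))

# Sizes taken from the example in the text: 118 first sites, 353 second sites.
first = [("IDP", i) for i in range(118)]
second = [("IDP", 1000 + i) for i in range(353)]
kept = balance_second_sites(first, second)
print(len(kept))
```

Fixing the seed only makes the sketch reproducible; the application does not specify how the random selection is performed.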
It can be seen that, with the implementation shown in fig. 2, step 101 obtains a plurality of MoRFs fragments including the first sites and a plurality of non-MoRFs fragments including the second sites, which provides a data basis for subsequently building rich and complete training samples and training an accurate MoRFs prediction model.
Step 102, extracting a first feature vector corresponding to each first site and a second feature vector corresponding to each second site.
It can be understood that each first feature vector characterizes its corresponding first site; first feature vectors and first sites are in one-to-one correspondence, so their numbers are equal. Likewise, each second feature vector characterizes its corresponding second site; second feature vectors and second sites are in one-to-one correspondence, so their numbers are equal. If the number of first sites equals the number of second sites, the number of first feature vectors equals the number of second feature vectors.
In a specific implementation, to describe the sites on MoRFs fragments accurately, the characteristics of MoRFs are taken into account: the amino acid occurrence frequencies and physicochemical properties of the protein sequence in which the MoRFs or non-MoRFs fragments are located are fused with the homologous evolution features of each first or second site. This yields a feature vector that represents each site more richly, and thus provides targeted, richer and more complete training samples for training the MoRFs prediction model.
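As an illustration of the amino acid occurrence-frequency part of such a feature, the following sketch computes the relative frequency of each of the 20 standard amino acids in a sequence. The column order `ARNDCQEGHILKMFPSTWYV` is the conventional PSSM order and is an assumption here, as is the function name; the physicochemical-property features are not sketched because this passage does not specify them.

```python
from collections import Counter
from typing import List

# Assumed column order (the conventional order of PSSM columns).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def amino_acid_frequencies(sequence: str) -> List[float]:
    """Return the relative frequency of each standard amino acid in `sequence`.

    Non-standard symbols (for example X) are ignored when normalising.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]

freqs = amino_acid_frequencies("MKISFH")
print(len(freqs), freqs[AMINO_ACIDS.index("M")])
```

Because such a frequency vector describes the whole protein, it would be identical for every site of that protein and would be combined with the per-site PSSM features.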
As an example, step 102 may be implemented in a manner described below with reference to fig. 3. Referring to fig. 3, for example, the following steps 1021 to 1024 may be included:
step 1021, for each MorFs fragment, a first Position Specificity Scoring Matrix (PSSM) corresponding to the protein of the MorFs fragment is obtained by using a protein comparison tool.
It can be understood that, in order to embody the homologous evolution characteristics of the proteins, the PSSM is adopted to analyze and process the proteins where the MoRFs fragments are located, so that the accuracy of the processing result can be greatly improved. It should be noted that, when the non-MoRFs segments are obtained in step 101, screening is performed on the proteins where the MoRFs segments are located, so that it can be determined that the selected MoRFs segments and the non-MoRFs segments belong to the same protein, and the proteins where the MoRFs segments are located cover all the MoRFs segments and the non-MoRFs segments.
In a specific implementation, the implementation process of step 1021 may specifically include: s11, searching a protein sequence database for homologous proteins of the protein where the MoRFs fragment is located; s12, performing multi-sequence comparison on the protein of the MoRFs fragment and the amino acid sequence of the homologous protein to obtain a first PSSM of the protein of the MoRFs fragment.
The protein sequence database refers to a database for analyzing biological information using computer functions. The amino acid sequences are compared using computer algorithms to predict the structure and function of the protein. For example: the protein sequence database may be a non-redundant protein sequence database comprising 152,910,397 proteins, and the information on the amino acid sequence corresponding to each protein may include, for example: the amino acid at each position in the amino acid sequence, whether the amino acid sequence has the function of binding with saccharide, whether the amino acid sequence has the function of binding with lipid, and the like.
It is understood that a homologous protein refers to a protein from a different species of organism with similar corresponding amino acid sequences.
As an example, a comparison tool of HMM-HMM of Homology detection iteration (English: homogenous detection by iterative HMM-HMM compare, HHblits for short) may be used as the "protein comparison tool" in this step 1021 to obtain the first PSSM corresponding to the protein in which each MoRFs fragment is located.
As another example, the specific process of obtaining the first PSSM of the protein in which the MoRFs fragment is located may also include: firstly, searching a plurality of homologous proteins of the protein of the MoRFs fragment from a protein sequence database, and acquiring the amino acid sequences of the homologous proteins from the protein sequence database; subsequently, the amino acid sequence of the retrieved homologous protein and the amino acid sequence of the protein in which the MoRFs fragment is present can be subjected to multiple sequence alignment to obtain a first PSSM of the amino acid sequence of the protein in which the MoRFs fragment is present.
The multiple sequence alignment may specifically employ the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST for short). Assuming that PSI-BLAST is configured with a cut-off value of 0.001 and a maximum of 4 iterations, and that the amino acid sequences of the homologous proteins and the amino acid sequence of the protein in which the MoRFs fragment is located are aligned with PSI-BLAST, the resulting first PSSM may be, for example, as shown in Table 1 below:
TABLE 1 first PSSM
Figure BDA0002329529740000101
Figure BDA0002329529740000111
Wherein the horizontal labels "A, R, N, D, C, Q, E, …, V" represent the 20 amino acids that make up protein sequences; the vertical labels "1M, 2K, 3I, 4S, 5F, 6H, …" represent the position number of each position in the amino acid sequence of the protein together with the amino acid at that position; each number in the body of the table is a position-specific score, which indicates how likely the column's amino acid is to occur at that position (also referred to as the degree of propensity or conservation) and typically ranges from -13 to +13. For example, the "-6" at the intersection of the second row and the third column of the table (the bold, underlined score) indicates that the score for amino acid R occurring at the first position of the amino acid sequence of the protein in which the MoRFs fragment is located is -6. As another example, the "-4" at the intersection of the fifth row and the fifth column of the table (the bold, underlined score) indicates that the score for amino acid D occurring at the fourth position of that sequence is -4.
It can be understood that, if the length of the protein in which the MoRFs fragment is located is n, the first PSSM corresponding to the protein in which the MoRFs fragment is located is obtained as an n × 20 matrix.
Step 1022, taking each first locus in the MoRFs segment as a center, and extending a second preset length outwards based on the first PSSM to obtain a first sub-feature vector corresponding to each first locus; and taking each second locus in the non-MoRFs segment as a center, and outwards expanding a second preset length based on the first PSSM to obtain a second sub-feature vector corresponding to each second locus.
It is understood that step 1022 is performed once for each first location in each MoRFs segment to obtain a first sub-feature vector corresponding to the first location. Similarly, step 1022 is performed once for each second locus in each non-MoRFs segment to obtain a second sub-feature vector corresponding to the second locus.
For example, assuming that the length of the protein in which the MoRFs fragments are located is 20, the first PSSM is:
Figure BDA0002329529740000112
Suppose further that the second preset length is 2 and that the MoRFs segment spans the third row to the twelfth row, that is, the first loci are the positions corresponding to the 3rd row, the 4th row, …, and the 12th row. When the first locus is the position corresponding to the 3rd row, the corresponding first sub-feature vector is obtained by taking the 3rd row as the center and extending two rows upward and two rows downward, which may be expressed as: $[(a_1, b_1, c_1, d_1, e_1), (a_2, b_2, c_2, d_2, e_2), (a_3, b_3, c_3, d_3, e_3), \ldots, (a_{20}, b_{20}, c_{20}, d_{20}, e_{20})]$. When the first locus is the position corresponding to the 4th row, the corresponding first sub-feature vector is obtained by taking the 4th row as the center and extending two rows upward and two rows downward, which may be expressed as: $[(b_1, c_1, d_1, e_1, f_1), (b_2, c_2, d_2, e_2, f_2), (b_3, c_3, d_3, e_3, f_3), \ldots, (b_{20}, c_{20}, d_{20}, e_{20}, f_{20})]$. The first sub-feature vectors of the other first loci are generated in the same way as the two described above and are not described again here. Each first sub-feature vector is a feature vector of (2 × 2 + 1) × 20 = 100 dimensions.
The second preset length may also be 5, in which case the first sub-feature vector is a feature vector of (5 × 2 + 1) × 20 = 220 dimensions. The second preset length may be chosen according to the experience of the skilled person and is not particularly limited in the embodiments of the present application.
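The window extraction of step 1022 can be illustrated with a minimal sketch. The function name `window_features` and the zero-filling of rows that fall outside the matrix are assumptions made for illustration; the text does not say how loci near either end of the protein are handled. The output is grouped by PSSM column, matching the ordering $(a_1, b_1, \ldots), (a_2, b_2, \ldots), \ldots$ used in the example above.

```python
def window_features(pssm, center, half_width):
    """Build the sub-feature vector for one locus from an n x 20 PSSM.

    `center` is a 0-based row index and `half_width` is the second preset
    length. Rows outside the matrix are replaced by zeros, which is an
    assumption: the source text does not specify edge handling.
    """
    n_cols = len(pssm[0])
    rows = [
        pssm[r] if 0 <= r < len(pssm) else [0] * n_cols
        for r in range(center - half_width, center + half_width + 1)
    ]
    # Group values column by column: all window rows for column 1, then column 2, ...
    return [row[j] for j in range(n_cols) for row in rows]


# Toy 20 x 20 matrix (hypothetical values) whose entries encode row and column.
pssm = [[100 * r + c for c in range(20)] for r in range(20)]
vec = window_features(pssm, center=2, half_width=2)  # 3rd row, rows 1 to 5
```

With a second preset length of 2 the result has (2 × 2 + 1) × 20 = 100 entries, and with 5 it has 220, in line with the dimensions given above.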
For the implementation of "taking each second locus in the non-MoRFs segment as a center and extending the second preset length outwards based on the first PSSM to obtain a second sub-feature vector corresponding to each second locus", refer to the above description of determining the first sub-feature vector corresponding to a first locus; it is not described again here.
The first sub-feature vector and the second sub-feature vector have the same dimension, and the dimension may be set according to the requirements on protein features. For example, the first sub-feature vector and the second sub-feature vector may each be a 220-dimensional feature vector.
Since the first locus is known to be a position on a MoRFs segment, the first sub-feature vector corresponding to the first locus can indicate that the locus is a position on a MoRFs segment. Similarly, since the second locus is known to be a position on a non-MoRFs segment, the second sub-feature vector corresponding to the second locus can indicate that the locus is a position on a non-MoRFs segment and does not belong to a MoRFs segment.
Step 1023, acquiring a third sub-feature vector of the protein in which each MoRFs fragment is located according to the amino acid occurrence frequency and the physicochemical properties of the protein in which each MoRFs fragment is located.
Here, the physicochemical properties are those of the protein, for example the hydrophilicity and hydrophobicity of its amino acids. The amino acid occurrence frequency is the frequency with which a given amino acid occurs in the protein sequence.
In a specific implementation, step 1023 may specifically include:
S21, calculating the normalization results of the hydrophilicity and the hydrophobicity of the 20 amino acids according to the following formula (1):
Figure BDA0002329529740000131
where R denotes any one of the 20 amino acids; $P_{R,k}$ denotes the hydrophilicity or hydrophobicity of amino acid R, for example, when k = 0, $P_{R,k}$ denotes the hydrophilicity of amino acid R, and when k = 1, $P_{R,k}$ denotes the hydrophobicity of amino acid R; $P_k$ denotes the average hydrophilicity or hydrophobicity of the 20 amino acids; $S_k$ denotes the standard deviation of the hydrophilicity or hydrophobicity of the 20 amino acids; and $N_{R,k}$ denotes the normalized hydrophilicity or hydrophobicity of each of the 20 amino acids.
S22, setting the distance lag = 1, 2, …, λ, where λ is a preset parameter, for example 10, and calculating the physicochemical parameter $\theta_{lag}$ of the protein in which the MoRFs fragment is located for each lag. Specifically, $\theta_{lag}$ can be calculated by the following formula (2):
Figure BDA0002329529740000132
where n denotes the length of the protein in which the MoRFs fragment is located, i denotes a position index in the protein sequence, $X_i$ denotes the amino acid at position i, and $N_{X_i,k}$ denotes the normalized hydrophilicity or hydrophobicity of amino acid $X_i$.
S23, calculating the feature $x_u$ of the protein in which the MoRFs fragment is located according to the following formula (3):
Figure BDA0002329529740000133
where u denotes the index of the dimension of the feature vector being generated; $f_j$ denotes the occurrence frequency of amino acid j, and the occurrence frequencies of the 20 amino acids are known; w is a preset parameter, for example w = 0.05. For $x_u$, in one case, for the first 20 dimensions of the feature vector, i.e., u between 1 and 20, $x_u$ is calculated using the first expression of formula (3), where $f_u$ denotes the occurrence frequency in the protein of the amino acid corresponding to the u-th dimension; for a given protein sequence $f_u$ is known, so the occurrence frequencies of the 20 amino acids characterize the protein in which the MoRFs fragment is located. In the other case, for the 21st to (20 + λ)th dimensions of the feature vector, i.e., u between 21 and (20 + λ), $x_u$ is calculated using the second expression of formula (3), so that λ physicochemical parameters characterize the protein in which the MoRFs fragment is located.
It can be seen that for each value u = 1, 2, …, 20 + λ, the corresponding feature $x_u$ can be obtained by the above formula (3).
S24, based on the features $x_u$ of the protein in which the MoRFs fragment is located, obtaining a third sub-feature vector Φ of the protein in which the MoRFs fragment is located, which can be expressed as:
Φ(protein in which the MoRFs fragment is located) = $[x_1, x_2, \ldots, x_{20+\lambda}]$
When λ = 10, the third sub-feature vector is a feature vector of 20 + 10 = 30 dimensions. The third sub-feature vector characterizes the amino acid occurrence frequency and the physicochemical properties of the protein in which the MoRFs fragment is located.
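Formulas (1) to (3) appear only as images in this text, so the following is a sketch under stated assumptions rather than a transcription. It assumes that formula (1) is a z-score, that formula (2) averages the squared differences of the normalized properties over positions i and i + lag (and over the two properties), and that formula (3) divides each term by the sum of the frequencies plus w times the sum of the θ values, as in the widely used pseudo amino acid composition. The function names and the property tables passed in are hypothetical; real hydrophilicity and hydrophobicity scales must be supplied by the caller.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def normalize(prop):
    """Z-score one property table {amino acid: value} (assumed form of formula (1))."""
    values = [prop[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (prop[a] - mean) / std for a in AMINO_ACIDS}


def third_sub_feature_vector(seq, props, lam=10, w=0.05):
    """Return [x_1, ..., x_{20+lam}] for a sequence longer than `lam` residues.

    `props` is a list of property tables, e.g. [hydrophilicity, hydrophobicity].
    """
    n = len(seq)
    norm = [normalize(p) for p in props]
    counts = Counter(seq)
    freqs = [counts[a] / n for a in AMINO_ACIDS]

    # Assumed form of formula (2): mean squared property difference at each lag.
    thetas = []
    for lag in range(1, lam + 1):
        total = 0.0
        for i in range(n - lag):
            total += sum((t[seq[i]] - t[seq[i + lag]]) ** 2 for t in norm) / len(norm)
        thetas.append(total / (n - lag))

    # Assumed form of formula (3): shared denominator for both kinds of feature.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

Under these assumptions the 20 + λ features sum to 1, and λ = 10 gives the 30-dimensional vector mentioned above.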
Thus, step 1023 can be realized through the above steps S21 to S24, and the features of the positions of the MoRFs fragment in the amino acid sequence of the protein in which the positions are located can be obtained, so that a data base is provided for constructing a richer and more targeted training sample.
Step 1024, obtaining a first feature vector corresponding to each first locus based on the third sub-feature vector and the first sub-feature vector corresponding to each first locus; and obtaining a second feature vector corresponding to each second locus based on the third sub-feature vector and the second sub-feature vector corresponding to each second locus.
The protein in which each MoRFs fragment is located corresponds to one third sub-feature vector, and each first locus on the MoRFs fragment corresponds to one first sub-feature vector. Therefore, for each first locus, its first sub-feature vector is fused with the third sub-feature vector of the protein in which the MoRFs fragment is located to obtain the first feature vector corresponding to that first locus. Similarly, the protein in which each non-MoRFs fragment is located corresponds to one third sub-feature vector, and each second locus on the non-MoRFs fragment corresponds to one second sub-feature vector. Therefore, for each second locus, its second sub-feature vector is fused with the third sub-feature vector of the protein in which the non-MoRFs fragment is located to obtain the second feature vector corresponding to that second locus.
For example: MoRFs segment 1 includes a first locus 1, a first locus 2 and a first locus 3. A third sub-feature vector 1 corresponding to protein 1, in which MoRFs fragment 1 is located, is obtained according to step 1023, and a first sub-feature vector 1 corresponding to the first locus 1, a first sub-feature vector 2 corresponding to the first locus 2 and a first sub-feature vector 3 corresponding to the first locus 3 are obtained according to steps 1021 to 1022. Then, in step 1024, the first sub-feature vector 1 and the third sub-feature vector 1 may be fused to obtain a first feature vector 1 corresponding to the first locus 1; the first sub-feature vector 2 and the third sub-feature vector 1 may be fused to obtain a first feature vector 2 corresponding to the first locus 2; and the first sub-feature vector 3 and the third sub-feature vector 1 may be fused to obtain a first feature vector 3 corresponding to the first locus 3.
Another example: non-MoRFs segment 1 includes a second locus 1, a second locus 2 and a second locus 3. A third sub-feature vector 1 corresponding to protein 2, in which non-MoRFs fragment 1 is located, is obtained according to step 1023, and a second sub-feature vector 1 corresponding to the second locus 1, a second sub-feature vector 2 corresponding to the second locus 2 and a second sub-feature vector 3 corresponding to the second locus 3 are obtained according to steps 1021 to 1022. Then, in step 1024, the second sub-feature vector 1 and the third sub-feature vector 1 may be fused to obtain a second feature vector 1 corresponding to the second locus 1; the second sub-feature vector 2 and the third sub-feature vector 1 may be fused to obtain a second feature vector 2 corresponding to the second locus 2; and the second sub-feature vector 3 and the third sub-feature vector 1 may be fused to obtain a second feature vector 3 corresponding to the second locus 3.
Fusing the first sub-feature vector and the third sub-feature vector to obtain the first feature vector may specifically be: concatenating the first sub-feature vector and the third sub-feature vector to obtain the first feature vector, where the concatenation order is not specifically limited. For example: assuming that the first sub-feature vector is a 220-dimensional feature vector A and the third sub-feature vector is a 30-dimensional feature vector B, the first feature vector is a 250-dimensional feature vector C, and the feature vector C may be represented as [A, B] or [B, A].
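The concatenation described above can be sketched in a few lines. The function name `fuse` is hypothetical, and the [A, B] order is just one of the two orders the text permits.

```python
def fuse(window_vec, protein_vec):
    """Concatenate the window-based vector A and the protein-level vector B as [A, B].

    The reverse order [B, A] is equally permitted by the text.
    """
    return list(window_vec) + list(protein_vec)


a = [0.0] * 220  # placeholder 220-dimensional first sub-feature vector
b = [1.0] * 30   # placeholder 30-dimensional third sub-feature vector
c = fuse(a, b)   # 250-dimensional first feature vector
```

Whichever order is chosen, it seems sensible to use the same one for every training sample and later for every vector passed to the trained model, so that each dimension keeps the same meaning.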
In this way, in the manner shown in fig. 3, the first feature vector corresponding to the first location point known to belong to the MoRFs segment and the second feature vector corresponding to the second location point known not to belong to the MoRFs segment can be extracted as training samples for training the MoRFs prediction model, so as to prepare for the subsequent training of the MoRFs prediction model.
Step 103, training a pre-constructed initial prediction model by using the first feature vector and the second feature vector to generate a target prediction model, wherein the target prediction model is used for predicting whether a locus in a protein belongs to a MoRFs fragment.
It is understood that the initial prediction model may be a pre-constructed model for predicting whether a site in a protein belongs to a MoRFs fragment. The initial prediction model may be a classification model whose input is the feature vector corresponding to a site on a protein and whose output takes one of two forms: in one case the output indicates that the site belongs to a MoRFs segment, for example the output is "yes"; in the other case the output indicates that the site does not belong to a MoRFs segment, for example the output is "no".
In specific implementation, the first feature vector corresponding to each first position point and the second feature vector corresponding to each second position point may be input into the initial prediction model, and the initial prediction model may be adjusted by comparing the difference between the actual output result and the target output result. It should be noted that, when training is performed by using the next feature vector 1 in the training sample, the next feature vector 1 needs to be input to the newly adjusted initial prediction model 1 to obtain the actual output result of the training, and the newly adjusted initial prediction model 1 is continuously adjusted by using the difference between the actual output result of the training and the target output result to obtain the newly adjusted initial prediction model 2; when training is carried out by adopting the next characteristic vector 2 in the training sample, the next characteristic vector 2 needs to be input into the newly adjusted initial prediction model 2 to obtain an actual output result of the training, the newly adjusted initial prediction model 2 is continuously adjusted by using the difference between the actual output result of the training and a target output result, and a newly adjusted initial prediction model 3 is obtained; and so on until all the feature vectors (i.e. the first feature vectors corresponding to all the first sites and the second feature vectors corresponding to all the second sites) in the training sample participate in the training of the initial prediction model, or until the prediction accuracy of the newly adjusted initial prediction model reaches a preset accuracy threshold (for example, 98%), and at this time, the newly adjusted initial prediction model is the target prediction model.
As an example, suppose the first feature vector is input into the initial prediction model and the target output result is that the first locus belongs to a MoRFs segment. If the actual output result indicates that the first locus belongs to a MoRFs segment, the actual output result is consistent with the target output result and the initial prediction model is not adjusted; if the actual output result indicates that the first locus does not belong to a MoRFs segment, the actual output result is inconsistent with the target output result and the initial prediction model is adjusted.
As another example, if the second feature vector is input into the initial prediction model, and the target output result is known that the second locus does not belong to a MoRFs segment, if the actual output result indicates that the second locus belongs to a MoRFs segment, the actual output result and the target output result are considered to be inconsistent, and the initial prediction model is adjusted; if the actual output result indicates that the second locus does not belong to the MoRFs fragment, the actual output result is consistent with the target output result, and the initial prediction model is not adjusted.
For example, a support vector machine (SVM) is a generalized linear classifier that performs binary classification of data by supervised learning. It computes the empirical risk with a loss function and adds a regularization term to the solution to optimize the structural risk, which makes it robust; it can also perform nonlinear classification with high accuracy by means of the kernel method. The initial prediction model in the embodiments of the present application may therefore be an SVM.
To make SVM training more accurate, make the generalization ability of the trained SVM as good as possible, and prevent overfitting during training, the soft-margin technique and the kernel-function technique can be used to optimize the generalization performance of the SVM. It can be understood that the kernel function of the SVM addresses the problem that low-dimensional data are linearly inseparable: it maps the low-dimensional data into a high-dimensional space in which they become separable. In the embodiment of the present application, the kernel function of the SVM may be a radial basis function (RBF), as shown in the following formula (4):
Figure BDA0002329529740000171
where x and z are two feature vectors in the training samples, and gamma is the parameter of the RBF kernel function; it determines the distribution of the data after mapping into the new high-dimensional space, that is, gamma mainly governs how low-dimensional data are mapped into the high-dimensional space.
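Formula (4) is available only as an image here. Assuming it is the usual RBF kernel, $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, the kernel value for two feature vectors can be computed as in the sketch below; the function name `rbf_kernel` is chosen for illustration.

```python
from math import exp


def rbf_kernel(x, z, gamma):
    """Standard RBF kernel exp(-gamma * ||x - z||^2), assumed to match formula (4)."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart, faster for larger gamma.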
Applying a hard-margin SVM to a linearly inseparable problem produces classification errors, so a loss function can be introduced on top of margin maximization to construct a new optimization problem. The SVM uses the hinge loss function; after the piecewise values of the hinge loss are handled by introducing slack variables $\xi_i$, the optimization problem of the soft-margin SVM takes the same form as that of the hard-margin SVM and is expressed as follows:
Figure BDA0002329529740000172
$$\text{s.t.}\quad y_i\left(w^{T} x_i + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \ldots, n \qquad \text{formula (6)}$$
Formula (6) is the constraint under which formula (5) is optimized. Here, w in formula (5) and formula (6) denotes the normal vector of the hyperplane; b in formula (6) denotes the intercept of the hyperplane; $x_i$ denotes any feature vector in the training samples; $y_i$ denotes the known class of feature vector $x_i$, i.e., the target output result; and C denotes the regularization coefficient.
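Formula (5) is not legible in this text. Assuming it is the standard soft-margin objective $\tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$, and using the fact that at the optimum each slack variable equals the hinge loss $\max(0, 1 - y_i(w^T x_i + b))$, the objective for a given hyperplane can be evaluated as sketched below. The function name and the toy data are hypothetical.

```python
def soft_margin_objective(w, b, samples, labels, C):
    """Evaluate 0.5 * ||w||^2 + C * sum of hinge losses (assumed form of formula (5)).

    `labels` must be +1 or -1, matching the constraint in formula (6).
    """
    margin_term = 0.5 * sum(wj * wj for wj in w)
    slack_total = 0.0
    for x, y in zip(samples, labels):
        decision = sum(wj * xj for wj, xj in zip(w, x)) + b
        slack_total += max(0.0, 1.0 - y * decision)  # slack variable xi_i
    return margin_term + C * slack_total


# Toy data: the third sample lies inside the margin and contributes slack 0.5.
value = soft_margin_objective(
    w=[1.0, 0.0], b=0.0,
    samples=[[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]],
    labels=[1, -1, 1],
    C=1.0,
)
```

Raising C makes the slack term weigh more heavily, which is the sense in which a larger C tolerates fewer training errors.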
It can be seen that the SVM has two parameters, C and gamma. The larger C is, the more strictly the SVM classifies the training samples and the less it tolerates errors; conversely, the smaller C is, the greater the error tolerance. The larger the gamma value, the higher the dimensionality of the space mapped to and the better the training result, but also the more likely overfitting becomes, i.e., the lower the generalization ability.
Based on this, in the embodiment of the present application, the values of gamma and C are adjusted repeatedly with cross-validation until suitable values of gamma and C are determined, and then prediction of the next feature vector in the training sample is performed. When the training effect on the initial prediction model is evaluated, 5-fold cross-validation can be used: the training sample set (namely, the set of all first feature vectors and all second feature vectors) is divided into 5 training sample subsets; in each round, 4 of the subsets are used for training and the remaining 1 is used for testing, and the test yields a Matthews correlation coefficient (MCC for short) score. The MCC scores of the 5 rounds are averaged to obtain the final cross-validation score.
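The 5-fold procedure can be sketched without committing to any particular SVM library. In the sketch below, `train_and_score` is a hypothetical caller-supplied function that trains a model with one (gamma, C) setting on the training pairs and returns its MCC on the held-out pairs; the contiguous, unshuffled split is a simplification, and shuffling the samples first would normally be advisable.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(samples, labels, train_and_score, k=5):
    """Average the score returned by `train_and_score` over k train/test rounds."""
    scores = []
    for held_out in k_fold_indices(len(samples), k):
        held_out_set = set(held_out)
        train = [(samples[i], labels[i]) for i in range(len(samples)) if i not in held_out_set]
        test = [(samples[i], labels[i]) for i in held_out]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```

Calling `cross_validate` once per candidate (gamma, C) pair and keeping the pair with the highest average score is one straightforward way to carry out the parameter adjustment described above.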
The MCC takes true positives, false positives, true negatives and false negatives into account and is generally regarded as a balanced measure, even when the classes are of very different sizes. The formula for calculating the MCC from the confusion matrix is as follows:
Figure BDA0002329529740000181
where TP denotes true positives, i.e., the target output result indicates that the site belongs to a MoRFs fragment and the actual output result also indicates that the site belongs to a MoRFs fragment; FP denotes false positives, i.e., the target output result indicates that the site does not belong to a MoRFs fragment but the actual output result indicates that it does; TN denotes true negatives, i.e., the target output result indicates that the site does not belong to a MoRFs fragment and the actual output result also indicates that it does not; FN denotes false negatives, i.e., the target output result indicates that the site belongs to a MoRFs fragment but the actual output result indicates that it does not.
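The MCC formula appears only as an image here; the sketch below uses the standard definition, $\mathrm{MCC} = (TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$. Returning 0 when the denominator is zero is a common convention and is an assumption here, not something stated in the text.

```python
from math import sqrt


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumed): report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

A perfect classifier scores 1, a classifier that is always wrong scores -1, and a value near 0 indicates performance no better than chance.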
Therefore, the method for generating a MoRFs prediction model provided by the embodiment of the present application constructs and trains a target prediction model (namely, the generated MoRFs prediction model) that predicts whether a site on a protein belongs to a MoRFs segment. The sites belonging to MoRFs segments can thus be predicted conveniently, quickly and accurately simply by extracting the feature vector of each site on the protein and applying the trained target prediction model, so that the MoRFs segments are identified. This is particularly useful for IDPs, whose three-dimensional spatial structure is not fixed: the biological function of an IDP can be determined from the MoRFs segments on it, so the method provided by the embodiment of the present application provides a data basis for determining the biological functions of IDPs quickly and accurately.
On the basis of the embodiment shown in fig. 1, the embodiment of the present application further provides a method for predicting MoRFs on proteins by using the target prediction model generated in step 103. Fig. 4 shows a flow diagram of a MoRFs prediction method, which may include, for example, the following steps 401 to 403, see fig. 4:
step 401, obtaining a protein to be predicted, wherein the protein to be predicted comprises N loci, and N is an integer greater than 1;
step 402, extracting an ith feature vector corresponding to an ith site of the protein to be predicted, wherein i = 1, 2, …, N;
step 403, obtaining an ith prediction result according to the ith feature vector and the target prediction model, wherein the ith prediction result is used for representing whether the ith site belongs to a MoRFs segment.
It can be understood that, for a protein to be predicted, which includes N sites, if it needs to predict whether each site on the protein to be predicted is a MoRFs by using the target prediction model generated in fig. 1, first, an i-th feature vector of an i-th site on the protein to be predicted needs to be extracted. Referring to fig. 5, step 402 may specifically include:
step 4021, obtaining a second PSSM corresponding to the protein to be predicted by using a protein comparison tool, and outwards expanding a second preset length based on the second PSSM by taking an ith locus as a center to obtain a fourth sub-feature vector corresponding to the ith locus;
step 4022, obtaining a fifth sub-feature vector of the protein to be predicted according to the occurrence frequency and physicochemical properties of the amino acid of the protein to be predicted;
step 4023, acquiring an ith feature vector corresponding to the ith site based on the fourth sub-feature vector and the fifth sub-feature vector.
Wherein, the protein comparison tool is the same as that in the embodiment shown in fig. 3, and the implementation manner of step 4022 may specifically refer to "S21 to S24" in step 1023 in the embodiment shown in fig. 3; the second predetermined length is also the same as the second predetermined length in the embodiment shown in fig. 3. For a specific implementation, reference may be made to the related description of the embodiment shown in fig. 3, which is not described herein again.
It should be noted that, the manner of "obtaining the ith feature vector corresponding to the ith position based on the fourth sub-feature vector and the fifth sub-feature vector" in step 4023 should be consistent with the manner of "obtaining the first feature vector corresponding to each first position based on the third sub-feature vector and the first sub-feature vector corresponding to each first position" and "obtaining the second feature vector corresponding to each second position based on the third sub-feature vector and the second sub-feature vector corresponding to each second position" in step 1024.
In step 403, obtaining the ith prediction result according to the ith feature vector and the target prediction model specifically includes: inputting the ith feature vector into the target prediction model and outputting the ith prediction result.
In some specific implementations, steps 402 to 403 may be performed for each of the N sites of the protein to be predicted, so that each site obtains a corresponding feature vector and, via the target prediction model, a corresponding prediction result. When M consecutive prediction results among the N prediction results each indicate that the corresponding site belongs to a MoRFs segment, the M sites can be determined to constitute one MoRFs segment of the protein to be predicted, where M is generally greater than or equal to 10 and less than or equal to 70.
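Grouping the per-site predictions into segments can be sketched as follows. The function name `find_morfs_segments` is hypothetical, and discarding runs shorter than 10 or longer than 70 sites is an assumption: the text says only that M generally lies in that range.

```python
def find_morfs_segments(predictions, min_len=10, max_len=70):
    """Return (start, end) pairs, 1-based and inclusive, for runs of positive sites.

    `predictions` holds one boolean per site. Runs whose length falls outside
    [min_len, max_len] are dropped, which is an assumed policy.
    """
    segments = []
    run_start = None
    # A trailing False closes any run that reaches the end of the protein.
    for position, is_morf in enumerate(list(predictions) + [False], start=1):
        if is_morf and run_start is None:
            run_start = position
        elif not is_morf and run_start is not None:
            length = position - run_start
            if min_len <= length <= max_len:
                segments.append((run_start, position - 1))
            run_start = None
    return segments


# 12 consecutive positive sites form a segment; the run of 4 is too short.
predictions = [False] * 5 + [True] * 12 + [False] * 3 + [True] * 4
segments = find_morfs_segments(predictions)
```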
Therefore, with the MoRFs prediction method, once the target prediction model has been generated, it is only necessary to extract the feature vectors corresponding to the sites on the protein to be predicted; the trained target prediction model can then accurately predict whether each site on the protein belongs to a MoRFs segment, so that the MoRFs segments are identified. This is particularly useful for IDPs, whose three-dimensional spatial structure is not fixed: the biological function of an IDP can be determined from the MoRFs segments on it, so a data basis is provided for determining the biological functions of IDPs quickly and accurately.
Correspondingly, an embodiment of the present application further provides a device for generating a MoRFs prediction model, as shown in fig. 6, the device may specifically include:
a first obtaining module 601, configured to obtain a plurality of molecular recognition feature (MoRFs) segments and a plurality of non-MoRFs segments, where each of the MoRFs segments is composed of a plurality of first sites, and each of the non-MoRFs segments includes a plurality of second sites;
a first extracting module 602, configured to extract a first feature vector corresponding to each first location and a second feature vector corresponding to each second location;
a generating module 603, configured to train a pre-constructed initial prediction model with the first feature vector and the second feature vector, and generate a target prediction model, where the target prediction model is used to predict whether a locus in a protein belongs to a MoRFs segment.
Optionally, the number of the first feature vectors is the same as the number of the second feature vectors.
Optionally, the first obtaining module 601 includes:
a first obtaining unit, configured to screen a plurality of the MoRFs segments from a sequence library of intrinsically disordered proteins (IDPs);
a second obtaining unit, configured to select a plurality of the non-MoRFs segments, each separated from a MoRFs segment by a first preset length.
Optionally, the first extracting module 602 includes:
a third obtaining unit, configured to obtain, by using a protein comparison tool, a first position-specific scoring matrix (PSSM) corresponding to the protein in which each MoRFs fragment is located;
a fourth obtaining unit, configured to obtain a first sub-feature vector corresponding to each first locus by taking each first locus in the MoRFs segment as a center and extending a second preset length outward based on the first PSSM; and to obtain a second sub-feature vector corresponding to each second locus by taking each second locus in the non-MoRFs segment as a center and extending the second preset length outward based on the first PSSM;
a fifth obtaining unit, configured to obtain a third sub-feature vector of the protein in which the MoRFs fragments are located according to the amino acid occurrence frequency and physicochemical properties of the protein in which each of the MoRFs fragments is located;
a sixth obtaining unit, configured to obtain the first feature vector corresponding to each first locus based on the third sub-feature vector and the first sub-feature vector corresponding to each first locus; and to obtain the second feature vector corresponding to each second locus based on the third sub-feature vector and the second sub-feature vector corresponding to each second locus.
Optionally, the apparatus further comprises:
a second obtaining module, configured to obtain a protein to be predicted, wherein the protein to be predicted comprises N sites, and N is an integer greater than 1;
a second extraction module, configured to extract an ith feature vector corresponding to the ith site of the protein to be predicted, wherein i = 1, 2, …, N;
a third obtaining module, configured to obtain an ith prediction result according to the ith feature vector and the target prediction model, wherein the ith prediction result indicates whether the ith site belongs to a MoRFs fragment.
Optionally, the second extraction module includes:
a seventh obtaining unit, configured to obtain, by using the protein comparison tool, a second PSSM corresponding to the protein to be predicted, and to obtain a fourth sub-feature vector corresponding to the ith site by taking the ith site as a center and extending outward by the second preset length based on the second PSSM;
an eighth obtaining unit, configured to obtain a fifth sub-feature vector of the protein to be predicted according to the amino acid occurrence frequencies and physicochemical properties of the protein to be predicted;
a ninth obtaining unit, configured to obtain the ith feature vector corresponding to the ith site based on the fourth sub-feature vector and the fifth sub-feature vector;
the third obtaining module is specifically configured to:
input the ith feature vector into the target prediction model, and output the ith prediction result.
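The per-site prediction described above amounts to a loop over the N sites of the protein to be predicted. The sketch below assumes the trained model is exposed as a callable returning a score and that a fixed threshold turns the score into a yes/no result; both the callable interface and the `threshold` parameter are assumptions, and the toy feature extractor and model are stand-ins, not the trained target prediction model.

```python
def predict_sites(n_sites, extract_features, model, threshold=0.5):
    """Return one boolean per site i = 1..N: True if site i is predicted
    to belong to a MoRFs fragment."""
    results = []
    for i in range(1, n_sites + 1):
        score = model(extract_features(i))    # ith feature vector -> score
        results.append(score >= threshold)    # ith prediction result
    return results

# Toy stand-ins for the feature extractor and the trained model.
def toy_features(i):
    return [float(i)]

def toy_model(vec):
    return 1.0 if vec[0] > 2 else 0.0

labels = predict_sites(4, toy_features, toy_model)
```

With these stand-ins, sites 3 and 4 are labeled as belonging to a MoRFs fragment and sites 1 and 2 are not.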
The above describes the apparatus for generating the MoRFs prediction model. For its specific implementations and achieved effects, refer to the description of the embodiments of the method for generating the MoRFs prediction model, which is not repeated here.
In addition, an embodiment of the present application further provides a device for generating a MoRFs prediction model, as shown in fig. 7, the device includes a processor 701 and a memory 702:
the memory 702 is used for storing a program code and transmitting the program code to the processor 701;
the processor 701 is configured to execute the method for generating the MoRFs prediction model according to instructions in the program code.
For a specific implementation manner and achieved effects of the generating device of the MoRFs prediction model, reference may be made to the description of the above-mentioned embodiment of the generating method of the MoRFs prediction model, and details are not described here again.
In addition, an embodiment of the present application further provides a storage medium, wherein the storage medium is used for storing program code, and the program code is used for executing the method for generating the MoRFs prediction model.
In the embodiments of the present application, "first" in names such as "first site" and "first feature vector" is used only for identification and does not denote order. The same applies to "second" and the like.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments and the device embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for relevant points, refer to the corresponding descriptions of the method embodiments. The above-described apparatus and device embodiments are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application. It should be noted that a person skilled in the art can make several improvements and modifications without departing from the scope of the present application, and these improvements and modifications should also be considered to fall within the protection scope of the present application.

Claims (9)

1. A method for generating a MoRFs prediction model, characterized by comprising the following steps:
obtaining a plurality of molecular recognition feature (MoRFs) fragments and a plurality of non-MoRFs fragments, wherein each MoRFs fragment consists of a plurality of first sites, and each non-MoRFs fragment comprises a plurality of second sites;
extracting a first feature vector corresponding to each first site and a second feature vector corresponding to each second site;
training a pre-constructed initial prediction model by using the first feature vectors and the second feature vectors to generate a target prediction model, wherein the target prediction model is used for predicting whether a site in a protein belongs to a MoRFs fragment;
wherein the extracting a first feature vector corresponding to each first site and a second feature vector corresponding to each second site comprises:
for each MoRFs fragment, obtaining, by using a protein comparison tool, a first position-specific scoring matrix (PSSM) corresponding to the protein in which the MoRFs fragment is located;
taking each first site in the MoRFs fragment as a center, and extending outward by a second preset length based on the first PSSM to obtain a first sub-feature vector corresponding to each first site; and taking each second site in the non-MoRFs fragment as a center, and extending outward by the second preset length based on the first PSSM to obtain a second sub-feature vector corresponding to each second site;
obtaining a third sub-feature vector of the protein in which the MoRFs fragment is located according to the amino acid occurrence frequencies and physicochemical properties of that protein;
obtaining the first feature vector corresponding to each first site based on the third sub-feature vector and the first sub-feature vector corresponding to that first site; and obtaining the second feature vector corresponding to each second site based on the third sub-feature vector and the second sub-feature vector corresponding to that second site.
2. The method of claim 1, wherein the number of the first feature vectors is the same as the number of the second feature vectors.
3. The method of claim 1 or 2, wherein obtaining the plurality of MoRFs fragments and the plurality of non-MoRFs fragments comprises:
screening a plurality of the MoRFs fragments from a sequence library of intrinsically disordered proteins (IDPs);
and selecting a plurality of non-MoRFs fragments, each separated from one of the MoRFs fragments by a first preset length.
4. The method according to claim 1 or 2, characterized in that the method further comprises:
obtaining a protein to be predicted, wherein the protein to be predicted comprises N sites, and N is an integer greater than 1;
extracting an ith feature vector corresponding to the ith site of the protein to be predicted, wherein i = 1, 2, …, N;
and obtaining an ith prediction result according to the ith feature vector and the target prediction model, wherein the ith prediction result indicates whether the ith site belongs to a MoRFs fragment.
5. The method of claim 4,
the extracting of the ith feature vector corresponding to the ith site of the protein to be predicted comprises:
obtaining a second PSSM corresponding to the protein to be predicted by using the protein comparison tool, and, taking the ith site as a center, extending outward by the second preset length based on the second PSSM to obtain a fourth sub-feature vector corresponding to the ith site;
obtaining a fifth sub-feature vector of the protein to be predicted according to the amino acid occurrence frequencies and physicochemical properties of the protein to be predicted;
obtaining the ith feature vector corresponding to the ith site based on the fourth sub-feature vector and the fifth sub-feature vector;
the obtaining of the ith prediction result according to the ith feature vector and the target prediction model specifically comprises:
inputting the ith feature vector into the target prediction model, and outputting the ith prediction result.
6. An apparatus for generating a MoRFs prediction model, comprising:
a first obtaining module, configured to obtain a plurality of molecular recognition feature (MoRFs) fragments and a plurality of non-MoRFs fragments, wherein each MoRFs fragment consists of a plurality of first sites, and each non-MoRFs fragment comprises a plurality of second sites;
a first extraction module, configured to extract a first feature vector corresponding to each first site and a second feature vector corresponding to each second site;
a generating module, configured to train a pre-constructed initial prediction model by using the first feature vectors and the second feature vectors to generate a target prediction model, wherein the target prediction model is used for predicting whether a site in a protein belongs to a MoRFs fragment;
the first extraction module is specifically configured to:
for each MoRFs fragment, obtain, by using a protein comparison tool, a first position-specific scoring matrix (PSSM) corresponding to the protein in which the MoRFs fragment is located;
take each first site in the MoRFs fragment as a center, and extend outward by a second preset length based on the first PSSM to obtain a first sub-feature vector corresponding to each first site; and take each second site in the non-MoRFs fragment as a center, and extend outward by the second preset length based on the first PSSM to obtain a second sub-feature vector corresponding to each second site;
obtain a third sub-feature vector of the protein in which the MoRFs fragment is located according to the amino acid occurrence frequencies and physicochemical properties of that protein;
obtain the first feature vector corresponding to each first site based on the third sub-feature vector and the first sub-feature vector corresponding to that first site; and obtain the second feature vector corresponding to each second site based on the third sub-feature vector and the second sub-feature vector corresponding to that second site.
7. The apparatus of claim 6, further comprising:
a second obtaining module, configured to obtain a protein to be predicted, wherein the protein to be predicted comprises N sites, and N is an integer greater than 1;
a second extraction module, configured to extract an ith feature vector corresponding to the ith site of the protein to be predicted, wherein i = 1, 2, …, N;
and a third obtaining module, configured to obtain an ith prediction result according to the ith feature vector and the target prediction model, wherein the ith prediction result indicates whether the ith site belongs to a MoRFs fragment.
8. An apparatus for generating a MoRFs prediction model, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1 to 5 according to instructions in the program code.
9. A storage medium for storing program code for performing the method of any one of claims 1 to 5.
CN201911330914.3A 2019-12-20 2019-12-20 Method, device, equipment and storage medium for generating MoRFs prediction model Active CN111091865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911330914.3A CN111091865B (en) 2019-12-20 2019-12-20 Method, device, equipment and storage medium for generating MoRFs prediction model

Publications (2)

Publication Number Publication Date
CN111091865A CN111091865A (en) 2020-05-01
CN111091865B true CN111091865B (en) 2023-04-07

Family

ID=70396623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911330914.3A Active CN111091865B (en) 2019-12-20 2019-12-20 Method, device, equipment and storage medium for generating MoRFs prediction model

Country Status (1)

Country Link
CN (1) CN111091865B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955628A (en) * 2014-04-22 2014-07-30 南京理工大学 Subspace fusion-based protein-vitamin binding location point predicting method
WO2019041333A1 (en) * 2017-08-31 2019-03-07 深圳大学 Method, apparatus, device and storage medium for predicting protein binding sites
CN109635046A (en) * 2019-01-15 2019-04-16 金陵科技学院 A kind of protein molecule name analysis and recognition methods based on CRFs
CN109817275A (en) * 2018-12-26 2019-05-28 东软集团股份有限公司 The generation of protein function prediction model, protein function prediction technique and device
CN110033822A (en) * 2019-03-29 2019-07-19 华中科技大学 Protein coding method and protein post-translational modification site estimation method and system

Also Published As

Publication number Publication date
CN111091865A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN109948478B (en) Large-scale unbalanced data face recognition method and system based on neural network
TWI677852B (en) A method and apparatus, electronic equipment, computer readable storage medium for extracting image feature
CN110827924B (en) Clustering method and device for gene expression data, computer equipment and storage medium
Khan et al. Rafp-pred: Robust prediction of antifreeze proteins using localized analysis of n-peptide compositions
CN106202999B (en) Microorganism high-pass sequencing data based on different scale tuple word frequency analyzes agreement
CN112437053B (en) Intrusion detection method and device
CN112164426A (en) Drug small molecule target activity prediction method and device based on TextCNN
CN112489723B (en) DNA binding protein prediction method based on local evolution information
CN104766219A (en) User recommendation list generation method and system based on taking list as unit
Amilpur et al. Edeepssp: explainable deep neural networks for exact splice sites prediction
CN110956248A (en) Isolated forest-based mass data abnormal value detection algorithm
CN111048145B (en) Method, apparatus, device and storage medium for generating protein prediction model
CN105046106B (en) A kind of Prediction of Protein Subcellular Location method realized with nearest _neighbor retrieval
CN111783088B (en) Malicious code family clustering method and device and computer equipment
CN111177388B (en) Processing method and computer equipment
CN111091865B (en) Method, device, equipment and storage medium for generating MoRFs prediction model
CN111782805A (en) Text label classification method and system
CN104572820B (en) The generation method and device of model, importance acquisition methods and device
CN115936773A (en) Internet financial black product identification method and system
CN111009287B (en) SLiMs prediction model generation method, device, equipment and storage medium
WO2022176293A1 (en) Physical property prediction device and program
CN111026935B (en) Cross-modal retrieval reordering method based on adaptive measurement fusion
CN115497564A (en) Antigen identification model establishing method and antigen identification method
CN109801675B (en) Method, device and equipment for determining protein lipid function
CN112769540A (en) Method, system, equipment and storage medium for diagnosing side channel information leakage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant