CN110349628A - Protein phosphorylation site identification method, system, device and storage medium

Protein phosphorylation site identification method, system, device and storage medium

Info

Publication number
CN110349628A
Authority
CN
China
Prior art keywords
amino acid
phosphorylation site
protein phosphorylation
feature vector
acid sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910569671.2A
Other languages
Chinese (zh)
Other versions
CN110349628B (en
Inventor
李占潮
邹小勇
戴宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Pharmaceutical University
Sun Yat Sen University
Original Assignee
Guangdong Pharmaceutical University
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Pharmaceutical University and Sun Yat Sen University
Priority to CN201910569671.2A
Publication of CN110349628A
Application granted
Publication of CN110349628B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a protein phosphorylation site identification method, system, device and storage medium. The method comprises: obtaining an amino acid sequence fragment of a protein phosphorylation site to be identified; performing logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain a logical binary feature vector of the fragment; performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector; and inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain an identification result for the protein phosphorylation site. Because the invention relies on computation with a random forest model, it can identify large numbers of protein phosphorylation sites quickly, accurately and at low cost, which supports research on phosphorylation mechanisms and on the relationship between phosphorylation and disease. The invention is broadly applicable in the field of protein phosphorylation site identification.

Description

Protein phosphorylation site identification method, system, device and storage medium
Technical field
The present invention relates to the field of protein phosphorylation site identification, and in particular to a protein phosphorylation site identification method, system, device and storage medium.
Background art
Proteins are the agents and executors of biological function in living organisms. The protein produced by gene expression is called a precursor protein and usually has no biological activity; it becomes a protein with a specific biological function only after a series of processing and modification steps. Protein phosphorylation refers to a class of reactions in which, under the catalysis of a protein kinase, the terminal phosphate group of adenosine triphosphate is transferred to a specific amino acid of a substrate protein. It is one of the most common post-translational modifications currently known. Studies have shown that protein phosphorylation plays an important role in cell proliferation, development, differentiation and apoptosis, in cell signal transduction, neural activity, muscle contraction, metabolism and tumorigenesis, and that it is also a main mechanism for regulating and controlling protein function. Identifying protein phosphorylation sites is therefore important for understanding the diverse and complex physiological and pathological processes of living organisms, for the prevention, diagnosis and treatment of disease, and for drug research, development and design. With the rapid development of high-throughput sequencing technologies, massive amounts of protein sequence data have been generated. However, phosphorylation site information has been identified for only a very small fraction of these proteins, which greatly hinders research on protein phosphorylation mechanisms. Moreover, identifying phosphorylation sites experimentally is usually time-consuming, laborious and expensive.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to provide a protein phosphorylation site identification method, system, device and storage medium. The identification method is computational and can identify large numbers of protein phosphorylation sites efficiently, accurately and at low cost.
In a first aspect, an embodiment of the present invention provides a protein phosphorylation site identification method, comprising the following steps:
Obtaining an amino acid sequence fragment of a protein phosphorylation site to be identified;
Performing logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain a logical binary feature vector of the amino acid sequence fragment;
Performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;
Inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain an identification result for the protein phosphorylation site.
Preferably, performing logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain the logical binary feature vector of the amino acid sequence fragment comprises the following steps:
Performing a logical AND operation on the binary codes of each pair of amino acids in the amino acid sequence fragment to obtain a first feature vector set;
Performing a logical OR operation on the binary codes of each pair of amino acids in the amino acid sequence fragment to obtain a second feature vector set;
Performing a logical XOR operation on the binary codes of each pair of amino acids in the amino acid sequence fragment to obtain a third feature vector set;
Concatenating the first feature vector set, the second feature vector set and the third feature vector set end to end to obtain the logical binary feature vector of the amino acid sequence fragment.
Preferably, performing kernel principal component analysis on the logical binary feature vector according to the preset kernel function to obtain the kernel principal component logical binary feature vector comprises the following steps:
Mapping the logical binary feature vector into a high-dimensional space according to the preset kernel function to obtain a kernel matrix in the high-dimensional space;
Calculating the eigenvalues of the kernel matrix and the eigenvectors corresponding to the eigenvalues;
Selecting the eigenvectors corresponding to the k largest eigenvalues and concatenating them end to end to obtain the kernel principal component logical binary feature vector corresponding to the amino acid sequence fragment.
Preferably, the kernel function is a Gaussian kernel function.
Preferably, the random forest model must be trained and tested before it is applied; the specific process comprises the following steps:
According to the protein amino acid sequences corresponding to protein phosphorylation sites in the data source, obtaining the kernel principal component logical binary feature vectors corresponding to the protein phosphorylation sites as positive input samples, and using the protein phosphorylation site information as positive output samples;
Obtaining, from the data source, the protein amino acid sequences corresponding to non-phosphorylation sites, obtaining the kernel principal component logical binary feature vectors corresponding to the non-phosphorylation sites as negative input samples, and using the non-phosphorylation site information as negative output samples;
Selecting part of the positive input samples, negative input samples, positive output samples and negative output samples to train the random forest model;
Selecting the remaining positive input samples, negative input samples and corresponding outputs to test the random forest model.
Preferably, the non-phosphorylation sites are obtained as follows:
Searching for all lysine residues in the protein sequence containing the protein phosphorylation site;
When a lysine residue is determined not to be a protein phosphorylation site annotated in the data source, labeling it as a non-phosphorylation site.
In a second aspect, an embodiment of the present invention provides a protein phosphorylation site identification system, comprising:
A retrieval module, for obtaining an amino acid sequence fragment of a protein phosphorylation site to be identified;
A first vector acquisition module, for performing logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain a logical binary feature vector of the amino acid sequence fragment;
A second vector acquisition module, for performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;
An identification module, for inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain an identification result for the protein phosphorylation site.
In a third aspect, an embodiment of the present invention provides a protein phosphorylation site identification device, comprising:
At least one processor;
At least one memory, for storing at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the protein phosphorylation site identification method.
In a fourth aspect, an embodiment of the present invention provides a storage medium storing processor-executable instructions which, when executed by a processor, are used to perform the protein phosphorylation site identification method.
In a fifth aspect, an embodiment of the present invention provides a protein phosphorylation site identification system, comprising an amino acid sequence acquisition device and a computer device connected to the amino acid sequence acquisition device; wherein,
The amino acid sequence acquisition device is used for acquiring the amino acid sequence fragment corresponding to the protein phosphorylation site to be identified;
The computer device comprises:
At least one processor;
At least one memory, for storing at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the protein phosphorylation site identification method.
Implementing the present invention has the following beneficial effects: the present invention converts the amino acid sequence fragment corresponding to a protein phosphorylation site into a kernel principal component logical binary feature vector and then processes that vector with a random forest model to obtain the identification result for the protein phosphorylation site. Because the identification method is computational, it can identify large numbers of protein phosphorylation sites quickly, accurately and at low cost, which supports research on phosphorylation mechanisms and on the relationship between phosphorylation and disease.
Brief description of the drawings
Fig. 1 is a flow diagram of the steps of a protein phosphorylation site identification method provided by an embodiment of the present invention;
Fig. 2 is a structural block diagram of a protein phosphorylation site identification system provided by an embodiment of the present invention;
Fig. 3 is a structural block diagram of a protein phosphorylation site identification device provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of another protein phosphorylation site identification system provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments. The step numbers in the following embodiments are provided only for ease of explanation and do not restrict the order of the steps; the execution order of the steps in the embodiments can be adapted according to the understanding of those skilled in the art.
As shown in Fig. 1, an embodiment of the present invention provides a protein phosphorylation site identification method, comprising the following steps:
S1, obtaining the amino acid sequence fragment corresponding to a protein phosphorylation site to be identified;
S2, performing logical operations on the binary codes corresponding to the amino acids in the amino acid sequence fragment to obtain the logical binary feature vector corresponding to the amino acid sequence fragment;
S3, performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;
S4, inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the identification result for the protein phosphorylation site.
Specifically, the names of the amino acids in the amino acid sequence fragment corresponding to a protein phosphorylation site and their binary codes are as follows:
Alanine (A): [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
Cysteine (C): [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
Aspartic acid (D): [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
Glutamic acid (E): [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
Phenylalanine (F): [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
Glycine (G): [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
Histidine (H): [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0];
Isoleucine (I): [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0];
Lysine (K): [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0];
Leucine (L): [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0];
Methionine (M): [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0];
Asparagine (N): [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0];
Proline (P): [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0];
Glutamine (Q): [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0];
Arginine (R): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0];
Serine (S): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0];
Threonine (T): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0];
Valine (V): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0];
Tryptophan (W): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0];
Tyrosine (Y): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1].
The logical operations comprise logical AND, logical OR and logical XOR, with the following rules:
Logical AND: And(0,0)=0; And(1,0)=0; And(0,1)=0; And(1,1)=1;
Logical OR: Or(0,0)=0; Or(1,0)=1; Or(0,1)=1; Or(1,1)=1;
Logical XOR: Xor(0,0)=0; Xor(1,0)=1; Xor(0,1)=1; Xor(1,1)=0.
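The encoding table and the logic rules above can be illustrated with a minimal Python sketch. The names AMINO_ACIDS, one_hot, logic_and, logic_or and logic_xor are illustrative assumptions and do not come from the patent; the residue order follows the order of the table above.

```python
# Residue order taken from the encoding table above (A first, Y last).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Return the 20-bit binary code of a single amino acid letter."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]

def logic_and(u, v):
    """Element-wise logical AND of two binary codes."""
    return [a & b for a, b in zip(u, v)]

def logic_or(u, v):
    """Element-wise logical OR of two binary codes."""
    return [a | b for a, b in zip(u, v)]

def logic_xor(u, v):
    """Element-wise logical XOR of two binary codes."""
    return [a ^ b for a, b in zip(u, v)]

a_code, c_code = one_hot("A"), one_hot("C")
print(logic_and(a_code, c_code))  # all zeros: A and C share no set bit
print(logic_or(a_code, c_code))   # bits 0 and 1 are set
print(logic_xor(a_code, a_code))  # all zeros: a code XORed with itself
```

Because each code has exactly one set bit, the AND of two different residues is all zeros, while their OR and XOR coincide.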
Preferably, performing logical operations on the binary codes corresponding to the amino acids in the amino acid sequence fragment to obtain the logical binary feature vector corresponding to the amino acid sequence fragment comprises the following steps:
Performing a logical AND operation on the binary codes of each pair of amino acids in the amino acid sequence fragment to obtain a first feature vector set;
Performing a logical OR operation on the binary codes of each pair of amino acids in the amino acid sequence fragment to obtain a second feature vector set;
Performing a logical XOR operation on the binary codes of each pair of amino acids in the amino acid sequence fragment to obtain a third feature vector set;
Concatenating the first feature vector set, the second feature vector set and the third feature vector set end to end to obtain the logical binary feature vector of the amino acid sequence fragment.
Specifically, take the following amino acid sequence fragment as an example: alanine A, cysteine C, tyrosine Y, aspartic acid D and glutamic acid E, abbreviated ACYDE. The logical AND operation for the fragment ACYDE proceeds as follows: following the logical AND rule, the binary code of A is ANDed with the binary codes of C, Y, D and E in turn; the binary code of C is ANDed with those of Y, D and E; the binary code of Y is ANDed with those of D and E; and the binary code of D is ANDed with that of E. The vectors produced by all of these logical AND operations are concatenated end to end to form the binary feature vector of the logical AND operation. By analogy, the logical OR and logical XOR operations are applied to the fragment ACYDE according to their rules, giving the binary feature vectors of the logical OR and logical XOR operations.
The binary feature vectors are then concatenated end to end in the order logical AND, logical OR, logical XOR, giving the logical binary feature vector corresponding to the fragment ACYDE, BFV_i = [0 1 0 0 ... 0 0 1], which characterizes the protein phosphorylation site amino acid sequence fragment ACYDE.
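The pairwise construction described for ACYDE can be sketched as follows. This is a minimal illustration under stated assumptions: the function name logical_binary_feature_vector and the helper names are hypothetical, and the pair order (A with C, Y, D, E; then C with Y, D, E; and so on) follows the description above.

```python
from itertools import combinations

# Residue order taken from the binary encoding table of the description.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Return the 20-bit binary code of a single amino acid letter."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]

def logical_binary_feature_vector(fragment):
    """Concatenate the AND, OR and XOR results of every residue pair."""
    codes = [one_hot(residue) for residue in fragment]
    # combinations() yields pairs in the order (1st,2nd), (1st,3rd), ..., (4th,5th).
    pairs = list(combinations(codes, 2))
    and_part = [a & b for u, v in pairs for a, b in zip(u, v)]
    or_part = [a | b for u, v in pairs for a, b in zip(u, v)]
    xor_part = [a ^ b for u, v in pairs for a, b in zip(u, v)]
    # Order of concatenation: logical AND, logical OR, logical XOR.
    return and_part + or_part + xor_part

bfv = logical_binary_feature_vector("ACYDE")
print(len(bfv))
```

For a five-residue fragment there are 10 residue pairs, so the vector has 3 × 10 × 20 = 600 entries.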
Preferably, performing kernel principal component analysis on the logical binary feature vector according to the preset kernel function to obtain the kernel principal component logical binary feature vector comprises the following steps:
Mapping the logical binary feature vector into a high-dimensional space according to the preset kernel function to obtain a kernel matrix in the high-dimensional space;
Calculating the eigenvalues of the kernel matrix and the corresponding eigenvectors;
Selecting the eigenvectors corresponding to the k largest eigenvalues and concatenating them end to end to obtain the kernel principal component logical binary feature vector corresponding to the amino acid sequence fragment, where k is a positive integer.
Preferably, the kernel function is a Gaussian kernel function.
Specifically, the preset kernel function Φ(x) maps the logical binary feature vector BFV_i into a high-dimensional space, BFV_i → Φ(BFV_i), and the kernel matrix KM_{i,j} = (Φ(BFV_i), Φ(BFV_j)) is calculated. The kernel matrix is then centered, and the eigenvalues λ_i and corresponding eigenvectors α_i of the centered kernel matrix are calculated. Finally, the eigenvalues are sorted in descending order, and the eigenvectors corresponding to the first k eigenvalues are concatenated end to end to form the kernel principal component logical binary feature vector, which characterizes the sequence information of the protein phosphorylation site. The kernel function is a Gaussian kernel function.
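The kernel principal component step can be sketched in plain Python as below. Several points are assumptions rather than details from the patent: the Gaussian kernel is written in its common textbook form exp(-||x - y||^2 / (2σ^2)) with an arbitrary bandwidth sigma, the centering is the usual double centering of the kernel matrix, the eigenpairs are found by power iteration with deflation (the patent does not say how they are computed), and the per-sample feature is taken as the sample's row in the matrix formed by the k leading eigenvectors. All function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Textbook Gaussian kernel; sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def centered_kernel_matrix(samples, sigma=1.0):
    """Build the kernel matrix KM and apply double centering."""
    n = len(samples)
    km = [[gaussian_kernel(samples[i], samples[j], sigma) for j in range(n)]
          for i in range(n)]
    row_mean = [sum(row) / n for row in km]
    grand_mean = sum(row_mean) / n
    # KM is symmetric, so its column means equal its row means.
    return [[km[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def top_eigenpairs(matrix, k, iterations=1000):
    """Power iteration with deflation for a symmetric PSD matrix."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for c in range(k):
        v = [math.sin(i + 1.0 + c) for i in range(n)]  # deterministic start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining spectrum is numerically zero
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for unit vector v.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def kpca_features(samples, k, sigma=1.0):
    """Row j holds sample j's entries in the k leading eigenvectors."""
    pairs = top_eigenpairs(centered_kernel_matrix(samples, sigma), k)
    return [[vec[j] for _, vec in pairs] for j in range(len(samples))]

toy = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
print(kpca_features(toy, k=1))
```

In the toy data the first two points and the last two points form two clusters, and the leading component assigns opposite signs to the two clusters. In practice a linear algebra library would replace the power iteration.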
Preferably, the random forest model must be trained and tested before it is applied; the specific process comprises the following steps:
According to the protein amino acid sequences corresponding to protein phosphorylation sites in the data source, obtaining the kernel principal component logical binary feature vectors corresponding to the protein phosphorylation sites as positive input samples, and using the protein phosphorylation site information as positive output samples;
Obtaining, from the data source, the protein amino acid sequences corresponding to non-phosphorylation sites, obtaining the kernel principal component logical binary feature vectors corresponding to the non-phosphorylation sites as negative input samples, and using the non-phosphorylation site information as negative output samples;
Selecting part of the positive input samples, negative input samples and corresponding outputs to train the random forest model;
Selecting the remaining positive input samples, negative input samples and corresponding outputs to test the random forest model.
Specifically, the data source includes databases and the literature. The database stores known protein phosphorylation site information, that is, amino acid residues that can be phosphorylated. The protein phosphorylation site information in the protein phosphorylation site database consists of amino acid residues that have been experimentally verified to undergo phosphorylation. Therefore, the kernel principal component logical binary feature vectors corresponding to the protein phosphorylation site information in the database can serve as positive input samples for training and testing the random forest model, and the corresponding site information can serve as positive output samples for training and testing the random forest model.
Non-phosphorylation site information is not stored in the protein phosphorylation site database; it lies outside that database. Each such site indicates that the amino acid cannot be phosphorylated, and this fact is likewise expressed quantitatively as non-phosphorylation site information. Therefore, the non-phosphorylation site information outside the protein phosphorylation site database can serve as negative output samples for training and testing the random forest model, and the corresponding kernel principal component logical binary feature vectors of the non-phosphorylation sites can serve as negative input samples for training and testing the random forest model.
Training and testing the random forest model with both positive and negative samples makes its predictions more accurate and closer to the true situation.
The protein phosphorylation site database is obtained by screening the UniProtKB (The Universal Protein Resource Knowledgebase) database, as follows:
Collecting human protein sequence information, that is, proteins annotated as Homo sapiens in the database;
Collecting the protein sequences and site information in which an amino acid residue is annotated as Phosphoserine; that is, if an amino acid residue of a human protein is annotated as Phosphoserine in the UniProtKB database, the residue is a serine that can be phosphorylated;
Deleting the protein phosphorylation site information whose amino acid residue annotation is ECO:0000250; that is, if the residue is annotated with ECO:0000250, its phosphorylation was inferred from protein sequence alignment rather than determined from specific experimental information. Therefore, to guarantee the reliability of the positive sample data, the phosphorylation site information carrying this annotation is deleted.
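The three screening rules above can be sketched as a simple filter. The record layout below (keys protein, organism, position, annotation, evidence) is a hypothetical simplification invented for this illustration and is not the actual UniProtKB file format; the protein names and the evidence value "OTHER" are placeholders.

```python
# Hypothetical, simplified annotation records (not the real UniProtKB format).
RECORDS = [
    {"protein": "EXAMPLE_1", "organism": "Homo sapiens", "position": 12,
     "annotation": "Phosphoserine", "evidence": "OTHER"},
    {"protein": "EXAMPLE_1", "organism": "Homo sapiens", "position": 40,
     "annotation": "Phosphoserine", "evidence": "ECO:0000250"},
    {"protein": "EXAMPLE_2", "organism": "Mus musculus", "position": 7,
     "annotation": "Phosphoserine", "evidence": "OTHER"},
    {"protein": "EXAMPLE_3", "organism": "Homo sapiens", "position": 5,
     "annotation": "Phosphothreonine", "evidence": "OTHER"},
]

def select_positive_sites(records):
    """Keep human Phosphoserine sites that are not tagged ECO:0000250."""
    return [
        (record["protein"], record["position"])
        for record in records
        if record["organism"] == "Homo sapiens"          # rule 1: human proteins
        and record["annotation"] == "Phosphoserine"      # rule 2: phosphoserine sites
        and "ECO:0000250" not in record["evidence"]      # rule 3: drop similarity-inferred sites
    ]

print(select_positive_sites(RECORDS))
```

Only the first record passes all three rules; the others fail on evidence, organism and annotation respectively.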
Preferably, the non-phosphorylation sites are obtained as follows:
Searching for all lysine residues in the protein sequence containing the protein phosphorylation site;
When a lysine residue is determined not to be a protein phosphorylation site annotated in the data source, labeling it as a non-phosphorylation site.
Specifically, human protein sequences containing phosphorylation site information are randomly selected from the collected data, the positions of all serine residues in the selected protein sequences are determined, and the positions of the phosphorylated serine residues are removed; the remaining positions are taken as non-phosphorylation sites.
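The selection of negative sites described above can be sketched as follows. The function name non_phosphorylation_sites, the 1-based position convention and the toy sequence are assumptions made for illustration only.

```python
def non_phosphorylation_sites(sequence, phospho_positions, residue="S"):
    """Return 1-based positions of `residue` that are not known phosphosites."""
    known = set(phospho_positions)
    return [
        position
        for position, amino_acid in enumerate(sequence, start=1)
        if amino_acid == residue and position not in known
    ]

# Toy sequence with serines at positions 2, 4, 6 and 8; position 4 is
# treated as an annotated phosphorylation site.
print(non_phosphorylation_sites("MSASKSTS", [4]))
```

The residue parameter defaults to serine, matching the paragraph above, and can be changed if another residue type is screened.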
Preferably, the training set and the test set are constructed by five-fold cross-validation. The data set sizes and construction method for training and testing the random forest model are as follows: the positive input and output samples and the negative input and output samples are each randomly divided into five equal parts, so that each part accounts for 20% of the whole positive or negative sample set; one part of the positive input and output samples and one part of the negative input and output samples are randomly selected to form the test set, and the remaining four parts of the positive samples and four parts of the negative samples form the training set; this random selection is repeated five times, ensuring that every part of the positive samples and every part of the negative samples is used as the test set exactly once and as part of the training set four times. The predictive ability of the random forest model is evaluated with parameters such as overall prediction accuracy, sensitivity, specificity, the Matthews correlation coefficient and the area under the receiver operating characteristic curve.
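The five-fold partition described above, in which positive and negative samples are split separately, can be sketched as follows. The function name stratified_five_fold, the label convention (1 for positive, 0 for negative) and the fixed seed are assumptions for illustration.

```python
import random

def stratified_five_fold(labels, n_folds=5, seed=0):
    """Split sample indices into folds, dividing each class separately."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in (1, 0):  # positives first, then negatives
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so folds are equally sized.
        for rank, index in enumerate(indices):
            folds[rank % n_folds].append(index)
    splits = []
    for held_out in range(n_folds):
        test = sorted(folds[held_out])
        train = sorted(i for f in range(n_folds) if f != held_out
                       for i in folds[f])
        splits.append((train, test))
    return splits

labels = [1] * 10 + [0] * 10
for train, test in stratified_five_fold(labels):
    print(len(train), len(test))
```

Each sample lands in the test set exactly once and in the training set in the other four rounds, which matches the scheme in the paragraph above.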
In one embodiment, the number of trees in the random forest is set to 100, and the number of features randomly selected at each tree node is the square root of the total number of features, rounded to an integer. The random forest model is constructed by five-fold cross-validation, and its predictive ability is assessed with overall accuracy, sensitivity, specificity, the Matthews correlation coefficient and the area under the receiver operating characteristic curve. The results are shown in Table 1. The number of kernel principal components starts at 5 and increases in steps of 5 up to 100; as the number of kernel principal components increases, evaluation parameters such as overall prediction accuracy fluctuate within a small range. When the number of kernel principal components is set to 65, the random forest model achieves the highest overall prediction accuracy, 93.21%, and the corresponding sensitivity, specificity, Matthews correlation coefficient and area under the receiver operating characteristic curve are 96.82%, 89.61%, 0.8665 and 0.9835 respectively, all of which are relatively high. Therefore, the final random forest model is the one constructed with 65 kernel principal components, and it is used to identify potential protein phosphorylation sites.
Table 1. Five-fold cross-validation results of the random forest model for different numbers of kernel principal components
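Four of the evaluation parameters named above can be computed from a confusion matrix using their standard definitions, as in the sketch below. The function name evaluate and the counts in the example are illustrative assumptions and are not data from the patent; the area under the receiver operating characteristic curve is omitted because it requires ranked prediction scores rather than counts.

```python
import math

def evaluate(tp, fn, tn, fp):
    """Standard accuracy, sensitivity, specificity and Matthews coefficient."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # fraction of true sites recovered
    specificity = tn / (tn + fp)   # fraction of non-sites rejected
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}

# Made-up counts for illustration: 100 positive and 100 negative samples.
print(evaluate(tp=90, fn=10, tn=80, fp=20))
```

With equally sized positive and negative sets, the overall accuracy equals the mean of sensitivity and specificity.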
As shown in Fig. 2, a protein phosphorylation site identification system comprises:
A retrieval module, for obtaining an amino acid sequence fragment of a protein phosphorylation site to be identified;
A first vector acquisition module, for performing logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain a logical binary feature vector of the amino acid sequence fragment;
A second vector acquisition module, for performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;
An identification module, for inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain an identification result for the protein phosphorylation site.
It can be seen that the content of the above method embodiment applies to this system embodiment; the functions specifically implemented by this system embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same as those achieved by the above method embodiment.
As shown in Fig. 3, an embodiment of the present invention also provides a protein phosphorylation site identification device, comprising:
At least one processor;
At least one memory, for storing at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the protein phosphorylation site identification method.
It can be seen that the content of the above method embodiment applies to this device embodiment; the functions specifically implemented by this device embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same as those achieved by the above method embodiment.
In addition, an embodiment of the present invention also provides a storage medium storing processor-executable instructions which, when executed by a processor, are used to perform the protein phosphorylation site identification method. Likewise, the content of the above method embodiment applies to this storage medium embodiment; the functions specifically implemented by this storage medium embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same as those achieved by the above method embodiment.
As shown in Figure 4, an embodiment of the present invention further provides a protein phosphorylation site recognition system, comprising an amino acid sequence acquisition device and a computer device connected to the amino acid sequence acquisition device, wherein:
The amino acid sequence acquisition device is configured to acquire the amino acid sequence fragment corresponding to the protein phosphorylation site to be identified;
The computer device comprises:
At least one processor;
At least one memory, configured to store at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the protein phosphorylation site recognition method.
Specifically, the computer device may be any of various types of electronic equipment, including but not limited to terminals such as desktop computers and laptop computers.
It can be seen that the content of the above method embodiment applies to this system embodiment; the functions specifically implemented by this system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are likewise the same.
The above describes preferred implementations of the present invention, but the invention is not limited to the above embodiments. Those skilled in the art may make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of this application.

Claims (10)

1. A protein phosphorylation site recognition method, characterized by comprising the following steps:
Obtaining the amino acid sequence fragment of the protein phosphorylation site to be identified;
Performing logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain the logical binary feature vector of the amino acid sequence fragment;
Performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;
Inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the recognition result for the protein phosphorylation site.
2. The protein phosphorylation site recognition method according to claim 1, characterized in that performing logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain the logical binary feature vector of the amino acid sequence fragment comprises the following steps:
Performing a pairwise logical AND operation on the binary codes of the amino acids in the amino acid sequence fragment to obtain a first feature vector set;
Performing a pairwise logical OR operation on the binary codes of the amino acids in the amino acid sequence fragment to obtain a second feature vector set;
Performing a pairwise logical XOR operation on the binary codes of the amino acids in the amino acid sequence fragment to obtain a third feature vector set;
Concatenating the first feature vector set, the second feature vector set and the third feature vector set end to end to obtain the logical binary feature vector of the amino acid sequence fragment.
3. The protein phosphorylation site recognition method according to claim 1, characterized in that performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector comprises the following steps:
Mapping the logical binary feature vector to a high-dimensional space according to the preset kernel function to obtain the kernel matrix of the high-dimensional space;
Calculating the eigenvalues of the kernel matrix and the eigenvectors corresponding to the eigenvalues;
Selecting the eigenvectors corresponding to the k largest eigenvalues and concatenating them end to end to obtain the kernel principal component logical binary feature vector of the amino acid sequence fragment.
4. The protein phosphorylation site recognition method according to claim 3, characterized in that the kernel function is a Gaussian kernel function.
5. The protein phosphorylation site recognition method according to claim 1, characterized in that the method for constructing the random forest model comprises the following steps:
According to the protein amino acid sequences corresponding to protein phosphorylation sites in a database, obtaining the kernel principal component logical binary feature vectors corresponding to the protein phosphorylation sites as positive input samples, and using the protein phosphorylation site information as positive output samples;
Obtaining from the database the protein amino acid sequences corresponding to non-protein-phosphorylation sites, obtaining the kernel principal component logical binary feature vectors corresponding to the non-protein-phosphorylation sites as negative input samples, and using the non-protein-phosphorylation site information as negative output samples;
Selecting a portion of the positive input samples, negative input samples, positive output samples and negative output samples to train the random forest model;
Selecting the remaining positive input samples, negative input samples and their corresponding outputs to test the random forest model.
6. The protein phosphorylation site recognition method according to claim 5, characterized in that the non-protein-phosphorylation sites are obtained by the following method:
Searching for all lysine residues in the protein sequence in which a protein phosphorylation site is located;
When it is determined that a lysine residue is not a protein phosphorylation site annotated in the database, labelling it as a non-protein-phosphorylation site.
7. A protein phosphorylation site recognition system, characterized by comprising:
A retrieval module, configured to obtain the amino acid sequence fragment of the protein phosphorylation site to be identified;
A first vector obtaining module, configured to perform logical operations on the binary codes of the amino acids in the amino acid sequence fragment to obtain the logical binary feature vector of the amino acid sequence fragment;
A second vector obtaining module, configured to perform kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;
A recognition module, configured to input the kernel principal component logical binary feature vector into a random forest model for processing to obtain the recognition result for the protein phosphorylation site.
8. A protein phosphorylation site recognition device, characterized by comprising:
At least one processor;
At least one memory, configured to store at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the protein phosphorylation site recognition method according to any one of claims 1 to 6.
9. A storage medium storing processor-executable instructions, characterized in that the processor-executable instructions, when executed by a processor, are used to perform the protein phosphorylation site recognition method according to any one of claims 1 to 6.
10. A protein phosphorylation site recognition system, characterized by comprising an amino acid sequence acquisition device and a computer device connected to the amino acid sequence acquisition device, wherein:
The amino acid sequence acquisition device is configured to acquire the amino acid sequence fragment corresponding to the protein phosphorylation site to be identified;
The computer device comprises:
At least one processor;
At least one memory, configured to store at least one program;
When the at least one program is executed by the at least one processor, the at least one processor implements the protein phosphorylation site recognition method according to any one of claims 1 to 6.
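The first step of the kernel principal component analysis recited in claims 3 and 4, building the kernel matrix with a Gaussian kernel, can be sketched as follows. This is an illustrative sketch only: the bandwidth parameter sigma, the function names, and the centring step (standard in kernel PCA but not spelled out in the claims) are assumptions. The eigendecomposition and selection of the k largest eigenvalues are omitted and would normally be delegated to a linear algebra library.

```python
import math


def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for all sample pairs."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K


def center_kernel_matrix(K):
    """Centre K in feature space: Kc = K - 1K - K1 + 1K1 (assumed step)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    # K is symmetric, so column means equal row means.
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three toy binary feature vectors standing in for logical binary features.
    samples = [[0, 0], [1, 0], [0, 1]]
    K = gaussian_kernel_matrix(samples, sigma=1.0)
    Kc = center_kernel_matrix(K)
    for row in Kc:
        print([round(v, 4) for v in row])
```

The eigenvectors of the centred matrix belonging to the k largest eigenvalues would then provide the reduced representation passed to the random forest model.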
CN201910569671.2A 2019-06-27 2019-06-27 Protein phosphorylation site recognition method, system, device and storage medium Active CN110349628B (en)

Publications (2)

Publication Number Publication Date
CN110349628A true CN110349628A (en) 2019-10-18
CN110349628B CN110349628B (en) 2021-06-15

Family

ID=68176723

Country Status (1)

Country Link
CN (1) CN110349628B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710365A (en) * 2009-12-14 2010-05-19 重庆大学 Method for calculating and identifying protein kinase phosphorylation specific sites
US20130280238A1 (en) * 2012-04-24 2013-10-24 Laboratory Corporation Of America Holdings Methods and Systems for Identification of a Protein Binding Site
CN105637097A (en) * 2013-08-05 2016-06-01 特韦斯特生物科学公司 De novo synthesized gene libraries
CN103617203A (en) * 2013-11-15 2014-03-05 南京理工大学 Protein-ligand binding site predicting method based on inquiry drive
CN105893787A (en) * 2016-06-21 2016-08-24 南昌大学 Prediction method for protein post-translational modification methylation loci
CN106570336A (en) * 2016-11-10 2017-04-19 中南大学 Method and system for predicting the sulfenylation sulfur sites in cysteine
CN107247873A (en) * 2017-03-29 2017-10-13 电子科技大学 A kind of recognition methods of differential methylation site
CN107817466A (en) * 2017-06-19 2018-03-20 重庆大学 Based on the indoor orientation method for stacking limited Boltzmann machine and random forests algorithm
CN107463795A (en) * 2017-08-02 2017-12-12 南昌大学 A kind of prediction algorithm for identifying tyrosine posttranslational modification site
CN107395196A (en) * 2017-08-23 2017-11-24 郑州轻工业学院 Matrix-vector multiplication double rail logic circuit and its method based on the compound strand displacements of DNA

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MD. MEHEDI HASAN et al.: "Computational identification of microbial phosphorylation sites by the enhanced characteristics of sequence information", 《SCIENTIFIC REPORTS》 *
ZIMO YIN et al.: "New encoding schemes for prediction of protein phosphorylation sites", 《2012 IEEE 6TH INTERNATIONAL CONFERENCE ON SYSTEMS BIOLOGY》 *
胡敏菁: "Construction of a machine learning platform for protein functional site recognition", 《万方数据库》 *
胡青 et al.: "Transformer fault diagnosis method combining kernel principal component analysis and random forest", 《高压电技术》 *
范自柱: "Research on Novel Feature Extraction Algorithms", 31 December 2016, 中国科学技术大学出版社 *
赵云彬 et al.: "Research status and prospects of DNA logic computing models", 《HTTP://WWW.AROCMAG.COM/ARTICLE/02-2019-11-087.HTML》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489789A (en) * 2020-04-21 2020-08-04 华中科技大学 Method for improving mass spectrum phosphorylation modification site identification flux and accuracy
CN111489789B (en) * 2020-04-21 2021-10-15 华中科技大学 Method for improving mass spectrum phosphorylation modification site identification flux and accuracy
CN111696621A (en) * 2020-06-03 2020-09-22 广东药科大学 Protein phosphorylation modification site-disease relation identification method, system, device and storage medium
CN111696621B (en) * 2020-06-03 2023-03-31 广东药科大学 Protein phosphorylation modification site-disease relation identification method, system, device and storage medium
CN112489721A (en) * 2020-11-25 2021-03-12 清华大学 Mirror image protein information storage and coding technology
CN112489721B (en) * 2020-11-25 2021-11-12 清华大学 Mirror image protein information storage and coding technology
CN114927165A (en) * 2022-07-20 2022-08-19 深圳大学 Method, device, system and storage medium for identifying ubiquitination sites

Also Published As

Publication number Publication date
CN110349628B (en) 2021-06-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant