CN110349628A

CN110349628A - Protein phosphorylation site identification method, system, device and storage medium

Info

Publication number: CN110349628A
Application number: CN201910569671.2A
Authority: CN
Inventors: 李占潮; 邹小勇; 戴宗
Original assignee: Guangdong Pharmaceutical University; Sun Yat Sen University
Current assignee: Guangdong Pharmaceutical University; Sun Yat Sen University
Priority date: 2019-06-27
Filing date: 2019-06-27
Publication date: 2019-10-18
Anticipated expiration: 2039-06-27
Also published as: CN110349628B

Abstract

The invention discloses a protein phosphorylation site identification method, system, device and storage medium. The method comprises: obtaining the amino acid sequence segment of a protein phosphorylation site to be identified; performing logical operations on the binary codes of the amino acids in the amino acid sequence segment to obtain the logical binary feature vector of the segment; performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector; and inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the identification result of the protein phosphorylation site. Being based on theoretical calculation with a random forest model, the invention can identify a large amount of protein phosphorylation site information rapidly and accurately at low cost, facilitates research on phosphorylation mechanisms and on the relationship between phosphorylation and disease, and is widely applicable in the field of protein phosphorylation site identification.

Description

Protein phosphorylation site identification method, system, device and storage medium

Technical field

The present invention relates to the field of protein phosphorylation site identification, and more particularly to a protein phosphorylation site identification method, system, device and storage medium.

Background art

Proteins are the agents and executors of the biological functions of living organisms. The protein produced by gene expression is called a precursor protein; it usually has no biological activity and becomes a protein with a specific biological function only after a series of processing and modification steps. Protein phosphorylation refers to a class of reactions in which, under the catalysis of a protein kinase, the terminal phosphate group of adenosine triphosphate is transferred to a specific amino acid of a substrate protein. It is one of the most common types of post-translational modification currently known. Studies have shown that protein phosphorylation plays an important role in cell proliferation, development, differentiation and apoptosis, in cell signal transduction, neural activity, muscle contraction, metabolism and tumorigenesis, and that it is also a main mechanism for regulating and controlling protein function. The identification of protein phosphorylation sites is therefore important for elucidating the diverse and complex physiological and pathological processes of living organisms, for the prevention, diagnosis and treatment of disease, and for drug research, development and design. With the rapid development of high-throughput sequencing technologies, massive amounts of protein sequence data have been produced. However, only a very small amount of protein phosphorylation site information has been identified, which greatly hinders research on protein phosphorylation mechanisms. Moreover, identifying phosphorylation sites by experimental methods is usually time-consuming, laborious and expensive.

Summary of the invention

In view of this, the purpose of the embodiments of the present invention is to provide a protein phosphorylation site identification method, system, device and storage medium. The identification method is based on theoretical calculation and can identify a large amount of protein phosphorylation site information efficiently, accurately and at low cost.

In a first aspect, an embodiment of the present invention provides a protein phosphorylation site identification method, comprising the following steps:

Obtaining the amino acid sequence segment of a protein phosphorylation site to be identified;

Performing logical operations on the binary codes of the amino acids in the amino acid sequence segment to obtain the logical binary feature vector of the amino acid sequence segment;

Performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;

Inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the identification result of the protein phosphorylation site.

Preferably, performing logical operations on the binary codes of the amino acids in the amino acid sequence segment to obtain the logical binary feature vector of the amino acid sequence segment comprises the following steps:

Performing a logical AND operation pairwise on the binary codes of the amino acids in the amino acid sequence segment to obtain a first feature vector set;

Performing a logical OR operation pairwise on the binary codes of the amino acids in the amino acid sequence segment to obtain a second feature vector set;

Performing a logical XOR operation pairwise on the binary codes of the amino acids in the amino acid sequence segment to obtain a third feature vector set;

Concatenating the first feature vector set, the second feature vector set and the third feature vector set end to end to obtain the logical binary feature vector of the amino acid sequence segment.

Preferably, performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain the kernel principal component logical binary feature vector comprises the following steps:

Mapping the logical binary feature vector to a high-dimensional space according to the preset kernel function to obtain a kernel matrix in the high-dimensional space;

Calculating the eigenvalues of the kernel matrix and the eigenvectors corresponding to the eigenvalues;

Selecting the eigenvectors corresponding to the k largest eigenvalues and concatenating them end to end to obtain the kernel principal component logical binary feature vector corresponding to the amino acid sequence segment.

Preferably, the kernel function is a Gaussian kernel function.

Preferably, the random forest model needs to be trained and tested before use, and the specific process comprises the following steps:

According to the protein amino acid sequences corresponding to protein phosphorylation sites in a data bank, obtaining the kernel principal component logical binary feature vectors corresponding to the protein phosphorylation sites as input data positive samples, and taking the protein phosphorylation site information as output data positive samples;

Obtaining from the data bank the protein amino acid sequences corresponding to non-phosphorylation sites, obtaining the kernel principal component logical binary feature vectors corresponding to the non-phosphorylation sites as input data negative samples, and taking the non-phosphorylation site information as output data negative samples;

Selecting part of the input data positive samples, input data negative samples, output data positive samples and output data negative samples to train the random forest model;

Selecting the remaining input data positive samples, input data negative samples and corresponding output results to test the random forest model.

Preferably, the non-phosphorylation sites are obtained as follows:

Searching for all lysine residues in the protein sequence in which a protein phosphorylation site is located;

When it is determined that a lysine residue is not marked as a protein phosphorylation site in the data bank, labeling it as a non-phosphorylation site.

In a second aspect, an embodiment of the present invention provides a protein phosphorylation site identification system, comprising:

An acquisition module, for obtaining the amino acid sequence segment of a protein phosphorylation site to be identified;

A first vector acquisition module, for performing logical operations on the binary codes of the amino acids in the amino acid sequence segment to obtain the logical binary feature vector of the amino acid sequence segment;

A second vector acquisition module, for performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;

An identification module, for inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the identification result of the protein phosphorylation site.

In a third aspect, an embodiment of the present invention provides a protein phosphorylation site identification device, comprising:

At least one processor;

At least one memory, for storing at least one program;

When the at least one program is executed by the at least one processor, the at least one processor is caused to implement the protein phosphorylation site identification method.

In a fourth aspect, an embodiment of the present invention provides a storage medium storing processor-executable instructions which, when executed by a processor, are used to perform the protein phosphorylation site identification method.

In a fifth aspect, an embodiment of the present invention provides a protein phosphorylation site identification system, comprising an amino acid sequence acquisition device and a computer device connected to the amino acid sequence acquisition device; wherein,

The amino acid sequence acquisition device is used for acquiring the amino acid sequence segment corresponding to a protein phosphorylation site to be identified;

The computer device comprises:

At least one processor;

At least one memory, for storing at least one program;

Implementing the present invention has the following beneficial effects: the present invention converts the amino acid sequence segment corresponding to a protein phosphorylation site into a kernel principal component logical binary feature vector and then processes that vector with a random forest model to obtain the identification result of the protein phosphorylation site. The identification method is based on theoretical calculation, can identify a large amount of protein phosphorylation site information rapidly and accurately at low cost, and facilitates research on phosphorylation mechanisms and on the relationship between phosphorylation and disease.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the steps of a protein phosphorylation site identification method provided by an embodiment of the present invention;

Fig. 2 is a structural block diagram of a protein phosphorylation site identification system provided by an embodiment of the present invention;

Fig. 3 is a structural block diagram of a protein phosphorylation site identification device provided by an embodiment of the present invention;

Fig. 4 is a structural block diagram of another protein phosphorylation site identification system provided by an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments. The step numbers in the following embodiments are provided only for ease of explanation and do not limit the order of the steps; the execution order of the steps in the embodiments can be adapted according to the understanding of those skilled in the art.

As shown in Fig. 1, an embodiment of the present invention provides a protein phosphorylation site identification method, comprising the following steps:

S1, obtaining the amino acid sequence segment corresponding to a protein phosphorylation site to be identified;

S2, performing logical operations on the binary codes corresponding to the amino acids in the amino acid sequence segment to obtain the logical binary feature vector corresponding to the amino acid sequence segment;

S3, performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;

S4, inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the identification result of the protein phosphorylation site.

Specifically, the names of the amino acids in the amino acid sequence segment corresponding to a protein phosphorylation site and their corresponding binary codes are as follows:

Alanine A is represented as: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];

Cysteine C is represented as: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];

Aspartic acid D is represented as: [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];

Glutamic acid E is represented as: [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];

Phenylalanine F is represented as: [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0];

Glycine G is represented as: [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0];

Histidine H is represented as: [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0];

Isoleucine I is represented as: [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0];

Lysine K is represented as: [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0];

Leucine L is represented as: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0];

Methionine M is represented as: [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0];

Asparagine N is represented as: [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0];

Proline P is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0];

Glutamine Q is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0];

Arginine R is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0];

Serine S is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0];

Threonine T is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0];

Valine V is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0];

Tryptophan W is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0];

Tyrosine Y is represented as: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1].
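The encoding above is a 20-bit one-hot code in which the position of the single 1 follows the one-letter codes in alphabetical order. A minimal Python sketch of that encoding is given below; the names AMINO_ACIDS, one_hot and encode_segment are illustrative and do not come from the patent text.

```python
from typing import List

# Residue order taken from the encoding list above (one-letter codes, alphabetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue: str) -> List[int]:
    """Return the 20-bit binary code of one amino acid residue."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for letters outside the 20
    return [1 if i == index else 0 for i in range(len(AMINO_ACIDS))]


def encode_segment(segment: str) -> List[List[int]]:
    """Encode every residue of a sequence segment, keeping sequence order."""
    return [one_hot(residue) for residue in segment]


if __name__ == "__main__":
    for residue, code in zip("ACYDE", encode_segment("ACYDE")):
        print(residue, code)
```

Each residue maps to exactly one set bit, so two different residues never share a 1 in the same position.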

The logical operations include the logical AND operation, the logical OR operation and the logical XOR operation, with the following rules:

Logical AND operation: And(0,0) = 0; And(1,0) = 0; And(0,1) = 0; And(1,1) = 1;

Logical OR operation: Or(0,0) = 0; Or(1,0) = 1; Or(0,1) = 1; Or(1,1) = 1;

Logical XOR operation: Xor(0,0) = 0; Xor(1,0) = 1; Xor(0,1) = 1; Xor(1,1) = 0.

Preferably, performing logical operations on the binary codes corresponding to the amino acids in the amino acid sequence segment to obtain the logical binary feature vector corresponding to the amino acid sequence segment comprises the following steps:

Concatenating the first feature vector set, the second feature vector set and the third feature vector set end to end to obtain the logical binary feature vector of the amino acid sequence segment.

Specifically, take the following amino acid sequence segment as an example: alanine A, cysteine C, tyrosine Y, aspartic acid D and glutamic acid E, abbreviated ACYDE. The logical AND operation on the segment ACYDE proceeds as follows: following the rules of the logical AND operation, the binary code of A is ANDed with the binary codes of C, Y, D and E respectively; the binary code of C is ANDed with those of Y, D and E respectively; the binary code of Y is ANDed with those of D and E respectively; and the binary codes of D and E are ANDed. All vectors obtained from the logical AND operations are concatenated end to end to form the binary feature vector of the logical AND operation. By analogy, the logical OR and logical XOR operations are applied to the segment ACYDE according to their rules, giving the binary feature vectors of the logical OR operation and of the logical XOR operation.

In the order logical AND, logical OR, logical XOR, the corresponding binary feature vectors are concatenated end to end to obtain the logical binary feature vector corresponding to the amino acid sequence segment ACYDE, BFV_i = [0 1 0 0 ... 0 0 1], which characterizes the protein phosphorylation site amino acid sequence segment ACYDE.
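The pairwise procedure described for ACYDE can be sketched in Python as follows. This is a minimal illustration under the assumption that the pairs are taken in sequence order exactly as listed in the text (A with C, Y, D, E; C with Y, D, E; and so on); the function names are hypothetical.

```python
from itertools import combinations
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # bit order of the 20-bit codes


def one_hot(residue: str) -> List[int]:
    """20-bit code of a residue (all zeros for an unknown letter)."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def logical_binary_feature_vector(segment: str) -> List[int]:
    """Concatenate the pairwise AND, OR and XOR results, in that order."""
    codes = [one_hot(residue) for residue in segment]
    # combinations() yields (A,C), (A,Y), (A,D), (A,E), (C,Y), ... for "ACYDE"
    pairs = list(combinations(codes, 2))
    and_part = [a & b for x, y in pairs for a, b in zip(x, y)]
    or_part = [a | b for x, y in pairs for a, b in zip(x, y)]
    xor_part = [a ^ b for x, y in pairs for a, b in zip(x, y)]
    return and_part + or_part + xor_part


bfv = logical_binary_feature_vector("ACYDE")
print(len(bfv))  # 10 pairs x 20 bits x 3 operations = 600
```

For a segment of n residues the vector has 3 x 20 x n(n-1)/2 bits, so its length grows quadratically with the segment length.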

Calculating the eigenvalues of the kernel matrix and the corresponding eigenvectors;

Selecting the eigenvectors corresponding to the k largest eigenvalues and concatenating them end to end to obtain the kernel principal component logical binary feature vector corresponding to the amino acid sequence segment, where k is a positive integer.

Preferably, the kernel function is a Gaussian kernel function.
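For reference, the Gaussian kernel is commonly written in the textbook form below, where x and y are two feature vectors. The width parameter σ is an assumption here: the text does not state its value or the exact parameterization used.

```latex
k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right)
```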

Specifically, the mapping Φ(x) defined by the preset kernel function maps the logical binary feature vector BFV_i into a high-dimensional space: BFV_i → Φ(BFV_i). The kernel matrix KM_i,j = (Φ(BFV_i), Φ(BFV_j)) is then calculated and centered to obtain the centered kernel matrix, from which the eigenvalues λ_i and the corresponding eigenvectors α_i are calculated. Finally, the eigenvalues are sorted in descending order, and the eigenvectors corresponding to the first k eigenvalues are concatenated end to end to form the kernel principal component logical binary feature vector, which characterizes the sequence information of the protein phosphorylation site. A Gaussian kernel function is used as the kernel function.
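The kernel principal component analysis step can be sketched with NumPy as follows. This is a minimal illustration, not the patent's implementation. It assumes the standard centering KM - 1_N KM - KM 1_N + 1_N KM 1_N, where every entry of 1_N equals 1/N, and a Gaussian kernel with a width sigma chosen arbitrarily. It also assumes that the feature vector of sample i is read off as row i of the top-k eigenvectors scaled by the square roots of their eigenvalues.

```python
import numpy as np


def gaussian_kernel_matrix(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """KM[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def kernel_pca(X: np.ndarray, k: int, sigma: float = 1.0):
    """Return (features, eigenvalues) for the k largest eigenvalues."""
    n = X.shape[0]
    km = gaussian_kernel_matrix(X, sigma)
    one_n = np.full((n, n), 1.0 / n)
    # Standard centering of the kernel matrix in feature space (assumed form).
    km_centered = km - one_n @ km - km @ one_n + one_n @ km @ one_n
    eigenvalues, eigenvectors = np.linalg.eigh(km_centered)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:k]  # indices of the k largest
    lambdas = eigenvalues[order]
    alphas = eigenvectors[:, order]
    # Row i holds the k components of sample i (eigenvectors scaled by sqrt(lambda)).
    features = alphas * np.sqrt(np.maximum(lambdas, 0.0))
    return features, lambdas


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(30, 600)).astype(float)  # toy stand-in for BFV rows
features, lambdas = kernel_pca(X, k=5, sigma=10.0)
print(features.shape)  # (30, 5)
```

The eigendecomposition uses numpy.linalg.eigh because the centered kernel matrix is symmetric; its eigenvalues come back in ascending order, hence the explicit reordering.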

Selecting part of the input data positive samples, input data negative samples and corresponding output results to train the random forest model;

Specifically, the data bank includes databases and various literature references. The database stores known protein phosphorylation site information, that is, amino acid residues that can be phosphorylated. The protein phosphorylation site information in the protein phosphorylation site database consists of amino acid residues experimentally verified to undergo phosphorylation. Therefore, the kernel principal component logical binary feature vectors corresponding to the protein phosphorylation site information in the protein phosphorylation site database can be used as input data positive samples for training and testing the random forest model, and the corresponding site information can be used as output data positive samples for training and testing the random forest model.

Non-phosphorylation site information does not exist in the protein phosphorylation site database; it is the site information outside that database. Each such site indicates that the amino acid cannot be phosphorylated, and the quantitative representation of this fact is the non-phosphorylation site information. Therefore, the non-phosphorylation site information outside the protein phosphorylation site database can be used as output data negative samples for training and testing the random forest model, and the kernel principal component logical binary feature vectors corresponding to the non-phosphorylation sites can be used as input data negative samples for training and testing the random forest model.

Training and testing the random forest model with both positive and negative data samples makes its predictions more accurate and closer to the real situation.

The protein phosphorylation site database is obtained by screening the UniProtKB (The Universal Protein Resource Knowledgebase) database, as follows:

Collecting human protein sequence information, that is, proteins annotated as Homo sapiens in the database;

Collecting the protein sequences and site information that contain amino acid residues annotated as Phosphoserine; that is, if an amino acid residue of a human protein is annotated as Phosphoserine in the UniProtKB database, the residue is a serine that can be phosphorylated;

Deleting the protein phosphorylation site information whose amino acid residue annotation is ECO:0000250; that is, if an amino acid residue is annotated with ECO:0000250, its phosphorylation was inferred by protein sequence alignment rather than determined from specific experimental information. Therefore, to guarantee the reliability of the positive sample data, the amino acid phosphorylation site information carrying this annotation is deleted.
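The three screening rules above can be sketched as a simple filter. The record layout below (the keys protein, organism, position, annotation and evidence) and the sample values are hypothetical stand-ins for already-parsed annotations; they do not reflect the actual UniProtKB file format, which needs its own parser.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, already-parsed annotation records (illustrative values only).
records: List[Dict[str, Any]] = [
    {"protein": "PROT_A", "organism": "Homo sapiens", "position": 15,
     "annotation": "Phosphoserine", "evidence": "ECO:0000269"},
    {"protein": "PROT_A", "organism": "Homo sapiens", "position": 42,
     "annotation": "Phosphoserine", "evidence": "ECO:0000250"},
    {"protein": "PROT_B", "organism": "Mus musculus", "position": 7,
     "annotation": "Phosphoserine", "evidence": "ECO:0000269"},
    {"protein": "PROT_C", "organism": "Homo sapiens", "position": 99,
     "annotation": "Phosphothreonine", "evidence": "ECO:0000269"},
]


def select_positive_sites(recs: List[Dict[str, Any]]) -> List[Tuple[str, int]]:
    """Keep human Phosphoserine sites that are not annotated with ECO:0000250."""
    selected = []
    for rec in recs:
        if rec["organism"] != "Homo sapiens":
            continue  # rule 1: human proteins only
        if not rec["annotation"].startswith("Phosphoserine"):
            continue  # rule 2: residues annotated as Phosphoserine
        if "ECO:0000250" in rec["evidence"]:
            continue  # rule 3: drop sites inferred by sequence similarity
        selected.append((rec["protein"], rec["position"]))
    return selected


positive_sites = select_positive_sites(records)
print(positive_sites)
```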

Preferably, the non-phosphorylation sites are obtained as follows:

Specifically, human protein sequence data containing phosphorylation site information are randomly selected from the collected data, the positions of all serine residues in the selected protein sequences are determined, and the positions of the phosphorylated serine residues are deleted; the remaining positions are the non-phosphorylation sites.
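A minimal sketch of this negative-site selection for a single sequence is shown below. The function name, the use of 1-based positions and the toy sequence are assumptions made for illustration.

```python
from typing import Iterable, List


def non_phosphorylation_sites(sequence: str, phospho_positions: Iterable[int]) -> List[int]:
    """Return 1-based positions of serines not annotated as phosphorylated."""
    annotated = set(phospho_positions)
    serine_positions = [i for i, residue in enumerate(sequence, start=1) if residue == "S"]
    return [pos for pos in serine_positions if pos not in annotated]


# Toy sequence with serines at positions 2, 4 and 5; position 2 is a known site.
negatives = non_phosphorylation_sites("MSASSK", phospho_positions=[2])
print(negatives)
```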

Preferably, the training set and test set are constructed using five-fold cross-validation. The data set sizes and construction method for training and testing the random forest model are as follows: the input and output data positive samples and the input and output data negative samples are each randomly divided into five equal parts, so that each part accounts for 20% of the whole positive or negative sample data set; one part of the input and output data positive samples and one part of the input and output data negative samples are randomly selected to form the test set, and the remaining four parts of the positive samples and four parts of the negative samples form the training set; this random selection is repeated five times, ensuring that every part of the positive samples and every part of the negative samples is used as the test set once and as the training set four times. The predictive ability of the random forest model is evaluated using parameters such as overall prediction accuracy, sensitivity, specificity, Matthews correlation coefficient and the area under the receiver operating characteristic curve.
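The splitting scheme and four of the evaluation parameters can be sketched with the Python standard library as follows. This is an illustrative sketch: the function names and the seed are assumptions, labels are taken to be 1 for phosphorylation sites and 0 otherwise, and the area under the receiver operating characteristic curve is left out because it needs predicted scores rather than labels.

```python
import math
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def split_into_five(ids: Sequence[int], rng: random.Random) -> List[List[int]]:
    """Shuffle the sample ids and deal them into five near-equal parts."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return [shuffled[i::5] for i in range(5)]


def five_fold_splits(pos_ids: Sequence[int], neg_ids: Sequence[int],
                     seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) id lists; each part is the test set exactly once."""
    rng = random.Random(seed)
    pos_folds = split_into_five(pos_ids, rng)
    neg_folds = split_into_five(neg_ids, rng)
    for i in range(5):
        test = pos_folds[i] + neg_folds[i]
        train = [s for j in range(5) if j != i for s in pos_folds[j] + neg_folds[j]]
        yield train, test


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Overall accuracy, sensitivity, specificity and Matthews correlation."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


splits = list(five_fold_splits(range(10), range(10, 25)))
metrics = evaluate([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(len(splits), metrics)
```

Splitting positives and negatives separately keeps the class ratio the same in every fold, which matches the description of taking one part of each for the test set.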

In one embodiment, the number of trees in the random forest is set to 100, and the number of features randomly selected at each node of a tree is the square root of the total number of features, rounded to an integer. The random forest model is constructed using five-fold cross-validation, and its predictive ability is assessed by overall accuracy, sensitivity, specificity, Matthews correlation coefficient and the area under the receiver operating characteristic curve. The results are shown in Table 1. The number of kernel principal components starts at 5 and increases in steps of 5 up to 100; as the number of kernel principal components increases, evaluation parameters such as overall prediction accuracy fluctuate within a small range. When the number of kernel principal components is set to 65, the random forest model achieves the highest overall prediction accuracy, 93.21%, and the corresponding sensitivity, specificity, Matthews correlation coefficient and area under the receiver operating characteristic curve are 96.82%, 89.61%, 0.8665 and 0.9835 respectively, all relatively high. Therefore, the final random forest model is the one constructed with 65 kernel principal components, and it is used to identify potential protein phosphorylation sites.

Table 1. Five-fold cross-validation results of the random forest model for different numbers of kernel principal components
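The random forest settings of this embodiment (100 trees, square-root feature sampling, five-fold cross-validation, 65 kernel principal components) can be sketched with scikit-learn as follows. The patent does not name a software library, so the use of scikit-learn is an assumption, and the data below are synthetic placeholders rather than the patent's samples; the printed accuracy therefore says nothing about the results in Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for 65 kernel principal components per sample.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)   # 1 = phosphorylation site, 0 = not
X = rng.normal(size=(200, 65))
X[y == 1] += 0.8                   # shift one class so the toy task is learnable

model = RandomForestClassifier(
    n_estimators=100,     # number of trees, as in the embodiment
    max_features="sqrt",  # square root of the feature count at each split
    random_state=0,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

StratifiedKFold is used here because it keeps the positive-to-negative ratio in every fold, mirroring the separate five-way split of positive and negative samples described above.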

As shown in Fig. 2, a protein phosphorylation site identification system comprises:

It can be seen that the content of the above method embodiment applies to this system embodiment; the functions specifically implemented by this system embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same as those achieved by the above method embodiment.

As shown in Fig. 3, an embodiment of the present invention also provides a protein phosphorylation site identification device, comprising:

At least one processor;

At least one memory, for storing at least one program;

It can be seen that the content of the above method embodiment applies to this device embodiment; the functions specifically implemented by this device embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same as those achieved by the above method embodiment.

In addition, an embodiment of the present invention also provides a storage medium storing processor-executable instructions which, when executed by a processor, are used to perform the protein phosphorylation site identification method. Likewise, the content of the above method embodiment applies to this storage medium embodiment; the functions specifically implemented by this storage medium embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are also the same as those achieved by the above method embodiment.

As shown in Fig. 4, an embodiment of the present invention also provides a protein phosphorylation site identification system, comprising an amino acid sequence acquisition device and a computer device connected to the amino acid sequence acquisition device; wherein,

The computer device comprises:

At least one processor;

At least one memory, for storing at least one program;

Specifically, the computer device may be any of various types of electronic equipment, including but not limited to terminals such as desktop computers and laptop computers.

The above is a description of preferred embodiments of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and all such equivalent variations or substitutions fall within the scope defined by the claims of the present application.

Claims

1. A protein phosphorylation site identification method, characterized by comprising the following steps:

Performing logical operations on the binary codes of the amino acids in the amino acid sequence segment to obtain the logical binary feature vector of the amino acid sequence segment;

Performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;

Inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the identification result of the protein phosphorylation site.

2. The protein phosphorylation site identification method according to claim 1, characterized in that performing logical operations on the binary codes of the amino acids in the amino acid sequence segment to obtain the logical binary feature vector of the amino acid sequence segment comprises the following steps:

Performing a logical AND operation pairwise on the binary codes of the amino acids in the amino acid sequence segment to obtain a first feature vector set;

Performing a logical OR operation pairwise on the binary codes of the amino acids in the amino acid sequence segment to obtain a second feature vector set;

Performing a logical XOR operation pairwise on the binary codes of the amino acids in the amino acid sequence segment to obtain a third feature vector set;

3. The protein phosphorylation site identification method according to claim 1, characterized in that performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain the kernel principal component logical binary feature vector comprises the following steps:

Mapping the logical binary feature vector to a high-dimensional space according to the preset kernel function to obtain a kernel matrix in the high-dimensional space;

Selecting the eigenvectors corresponding to the k largest eigenvalues and concatenating them end to end to obtain the kernel principal component logical binary feature vector of the amino acid sequence segment.

4. The protein phosphorylation site identification method according to claim 3, characterized in that the kernel function is a Gaussian kernel function.

5. The protein phosphorylation site identification method according to claim 1, characterized in that the construction method of the random forest model comprises the following steps:

According to the protein amino acid sequences corresponding to protein phosphorylation sites in a data bank, obtaining the kernel principal component logical binary feature vectors corresponding to the protein phosphorylation sites as input data positive samples, and taking the protein phosphorylation site information as output data positive samples;

Obtaining from the data bank the protein amino acid sequences corresponding to non-phosphorylation sites, obtaining the kernel principal component logical binary feature vectors corresponding to the non-phosphorylation sites as input data negative samples, and taking the non-phosphorylation site information as output data negative samples;

Selecting part of the input data positive samples, input data negative samples, output data positive samples and output data negative samples to train the random forest model;

Selecting the remaining input data positive samples, input data negative samples and corresponding output results to test the random forest model.

6. The protein phosphorylation site identification method according to claim 5, characterized in that the non-phosphorylation sites are obtained as follows:

When it is determined that the lysine residue is not marked as a protein phosphorylation site in the data bank, labeling it as a non-phosphorylation site.

7. A protein phosphorylation site identification system, characterized by comprising:

A first vector acquisition module, for performing logical operations on the binary codes of the amino acids in the amino acid sequence segment to obtain the logical binary feature vector of the amino acid sequence segment;

A second vector acquisition module, for performing kernel principal component analysis on the logical binary feature vector according to a preset kernel function to obtain a kernel principal component logical binary feature vector;

An identification module, for inputting the kernel principal component logical binary feature vector into a random forest model for processing to obtain the identification result of the protein phosphorylation site.

8. A protein phosphorylation site identification device, characterized by comprising:

At least one processor;

At least one memory, for storing at least one program;

When the at least one program is executed by the at least one processor, the at least one processor is caused to implement the protein phosphorylation site identification method according to any one of claims 1-6.

9. A storage medium storing processor-executable instructions, characterized in that the processor-executable instructions, when executed by a processor, are used to perform the protein phosphorylation site identification method according to any one of claims 1-6.

10. A protein phosphorylation site identification system, characterized by comprising an amino acid sequence acquisition device and a computer device connected to the amino acid sequence acquisition device; wherein,

The amino acid sequence acquisition device is used for acquiring the amino acid sequence segment corresponding to a protein phosphorylation site to be identified;

The computer device comprises:

At least one processor;

At least one memory, for storing at least one program;