CN110265085A - A method for identifying protein-protein interaction sites - Google Patents

A method for identifying protein-protein interaction sites

Info

Publication number
CN110265085A
Authority
CN
China
Prior art keywords
protein
residue
interaction sites
data
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910686641.XA
Other languages
Chinese (zh)
Inventor
王兵
张欢
汪文艳
周郁明
王彦
程竹明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Technology AHUT
Original Assignee
Anhui University of Technology AHUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Technology (AHUT)
Priority to CN201910686641.XA
Publication of CN110265085A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for identifying protein-protein interaction sites and belongs to the field of bioinformatics analysis. The method first collects protein chain data and preprocesses them, then divides the residues of the preprocessed protein chains into interface residues and non-interface residues. Features are then extracted from databases and fused to obtain a data set, the class imbalance of the data set is treated, and the treated data set is divided into a training set and a test set. An XGBoost model is trained on the training set and is finally used to obtain the protein-protein interaction sites. The invention aims to overcome the deficiency of the prior art, in which predictions of protein-protein interaction sites suffer from varying degrees of false positives and false negatives that make the results difficult to analyze. The invention overcomes this deficiency and improves the accuracy with which protein-protein interaction sites are identified.

Description

A method for identifying protein-protein interaction sites
Technical field
The present invention relates to the technical field of bioinformatics analysis, and more specifically to a method for identifying protein-protein interaction sites.
Background
Proteins are the most important components of every cell, and interactions between proteins are the most basic activity underlying most cellular functions. Protein-protein interactions form an important part of the cell's biochemical reaction network; the protein-protein interaction network is the main means by which biological information is regulated and a key factor in determining cell fate. Studying protein-protein interactions is the basis for understanding life processes and is one of the most important research fields of the post-genomic era. With the implementation of the Human Genome Project, the amount of data in protein sequence databases has risen sharply, and the structures of and interactions between proteins cannot all be measured one by one by experiment. Moreover, the results of experimental screening are usually class-imbalanced, so handling sample imbalance in protein data effectively has become a challenge for researchers. Research on protein interactions plays a very important role in modern molecular biology. Revealing the interaction relationships between proteins and building networks of those relationships has therefore become a hot topic in proteomics research.
A search found an invention titled "Method for quantitatively analyzing interactions between proteins and application thereof" (publication number: CN101982778A; publication date: March 2, 2011). Based on the inventors' finding that Fluc and Rluc do not interact with each other, that scheme constructs a biotinylated Fluc expression vector and an Rluc expression vector. The proteins X and Y to be quantitatively analyzed are inserted into these vectors to obtain a biotinylated Fluc-X fusion protein expression vector and an Rluc-Y fusion protein expression vector. The two fusion proteins are co-expressed, or expressed separately and then incubated together in vitro; the biotinylated Fluc-X is purified with streptavidin-coated magnetic beads, and the activity of the purified Fluc-X and Rluc-Y fusion proteins is then measured by DLR. Because the activities of the purified Fluc and Rluc are closely related to the amount of fusion protein added and to the strength of the interaction between proteins X and Y, the method can quantitatively determine the strength of the interaction between proteins X and Y.
A further invention is titled "A protein interaction detection method with a low false positive rate" (publication number: CN103290091A; publication date: September 11, 2013). That scheme relates to a detection system and belongs to the field of molecular biotechnology. Using bimolecular fluorescence complementation (BiFC) based on fluorescent proteins, it builds the BiFC detection system, which usually requires two single-expression plasmid vectors, into one dual-expression plasmid vector system. This substantially reduces the false positive rate of the BiFC method in protein-protein interaction studies and allows quantitative analysis. The scheme can also be used for gene-library-based screening of unknown protein-protein interactions and for detecting protein-protein interactions in living animals. The shortcoming of that application, however, is that it cannot distinguish between different pests and diseases.
The above methods are experimental methods. They can predict protein-protein interaction sites, but they share the same shortcomings: they are time-consuming and labor-intensive, and they suffer from varying degrees of false positives and false negatives, which makes the results difficult to analyze. How to predict protein-protein interaction sites is therefore a problem that the prior art urgently needs to solve.
Summary of the invention
1. Problem to be solved
The object of the invention is to overcome the deficiency of the prior art, in which predictions of protein-protein interaction sites suffer from varying degrees of false positives and false negatives that make the results difficult to analyze. The invention provides a method for identifying protein-protein interaction sites that can avoid such false positives and false negatives, greatly reduces space and time overhead, and improves the accuracy with which protein-protein interaction sites are identified.
2. Technical solution
To solve the above problem, the invention adopts the following technical solution:
In the method for identifying protein-protein interaction sites of the invention, protein chain data are first collected and preprocessed, and the preprocessed protein chain data are divided into interface residues and non-interface residues. Features of the protein chains are then extracted from databases and fused to obtain a data set, and the imbalance of the data set is treated. The treated data set is divided into a training set and a test set, an XGBoost model is trained on the training set, and the XGBoost model is finally used to obtain the protein-protein interaction sites.
Further, the specific steps are as follows:
1) Collect protein chain data and preprocess them;
2) Divide the preprocessed protein chain data into interface residues and non-interface residues according to the relative accessible surface area of each amino acid and the distance between the α-carbon atoms of two residues;
3) Extract features of the protein chains from databases on the basis of the evolutionary conservation of amino acids;
4) Fuse the extracted features and expand the features of each residue under test to obtain the data set;
5) Treat the imbalance of the data set, divide the treated data set into a training set and a test set, train an XGBoost model on the training set, and finally use the XGBoost model to obtain the protein-protein interaction sites.
Further, the protein chain data are preprocessed as follows: protein chains with fewer than 50 residues are discarded first, then protein chains whose sequence similarity is greater than or equal to 30% are discarded, then outdated protein chain data are discarded, finally yielding non-redundant protein chain data.
Further, the preprocessed protein chain data are divided into interface residues and non-interface residues as follows: if the relative accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area, the residue is classified as a surface residue; among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, the residue is classified as a non-interface residue.
Further, the features of each residue under test are combined in turn with the features of the ten surface residues spatially nearest to it, thereby expanding the dimension of each feature vector.
Further, the imbalance of the data set is treated as follows:
A target sample is first chosen, and the classes of its three nearest samples in the data set are checked using the K-nearest-neighbor rule. If two or more of these samples belong to a class different from that of the target sample, the target sample is regarded as noise and is deleted. The operation is repeated until no noise data remain.
Further, the imbalance of the data set is treated as follows: the number of interface residues is first compared with the number of non-interface residues, the more numerous of the two residue classes is selected, and the data points in the selected class whose IH value, given by the following equation, is greater than 0.7 are deleted:
$IH(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$
where $p(y_i \mid x_i, h)$ is the probability that the mapping function $h$ assigns the label $y_i$ to the input feature vector $x_i$.
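Further, the XGBoost model trained on the training set is formulated as:
$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$
where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction result for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree.
The objective function of the XGBoost model is:
$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$
where $\sum_{i} l(y_i, \hat{y}_i)$ is the error function of the model and $\sum_{k} \Omega(f_k)$ is the regularization term, which represents the complexity of the $K$ trees.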
Further, the obtained protein-protein interaction sites are tested with the test set.
Further, the extracted features are the residue spatial sequence profile, the residue sequence information entropy, the relative entropy, the residue sequence conservation weight, and the residue evolutionary rate.
3. Beneficial effects
Compared with the prior art, the invention has the following beneficial effects:
In the method for identifying protein-protein interaction sites of the invention, preprocessing the sample data avoids the problem of uneven protein chain data quality and makes the subsequent work easier. Features are extracted from databases on the basis of the evolutionary conservation of amino acids and then fused, so that protein interactions are characterized better. Treating the imbalance of the data set balances the data set and thereby improves the accuracy with which protein-protein interaction sites are identified. Obtaining the protein-protein interaction sites with an XGBoost model can avoid varying degrees of false positives and false negatives, greatly reduces space and time overhead, and further improves the prediction of protein-protein interaction sites.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for identifying protein-protein interaction sites of Embodiment 1;
Fig. 2 compares the prediction performance of the sampling techniques combined with XGBoost in Embodiment 1;
Fig. 3 compares the prediction results for protein-protein interaction sites obtained by the method of the invention and by other methods.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. Moreover, the embodiments are not independent of one another and may be combined as needed to achieve a better effect. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
For a further understanding of the invention, it is described in detail below with reference to the drawings and embodiments.
Embodiment 1
In the method for identifying protein-protein interaction sites of the invention, protein chain data are first collected and preprocessed. Preprocessing avoids the problem of uneven protein chain data quality and makes the subsequent work easier. The preprocessed protein chain data are then divided into interface residues and non-interface residues, features of the protein chains are extracted from databases, and the extracted features are fused to obtain a data set. Notably, extracting and fusing the features characterizes protein interactions better. The imbalance of the data set is then treated, and classification prediction is performed on the treated data set to obtain the protein-protein interaction sites.
With reference to Fig. 1, the specific steps of the method are as follows:
1) Collect protein chain data and preprocess them. The preprocessing proceeds as follows: protein chains with fewer than 50 residues are discarded first; protein chains whose sequence similarity is greater than or equal to 30% are then discarded; outdated protein chain data are discarded, finally yielding non-redundant protein chain data. In this embodiment, 170 protein chains are collected and 91 non-redundant protein chains remain after preprocessing.
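The length and redundancy filters of step 1 can be sketched as follows. This is a minimal illustration, not the procedure actually used in the embodiment: the function names, the greedy longest-first order, and the toy `naive_identity` measure (a stand-in for a real alignment-based similarity) are all assumptions, and the removal of outdated chains is omitted.

```python
from typing import Callable, Dict, List


def naive_identity(a: str, b: str) -> float:
    """Toy stand-in for alignment-based similarity (assumption):
    fraction of identical positions, relative to the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def filter_chains(chains: Dict[str, str],
                  identity: Callable[[str, str], float],
                  min_len: int = 50,
                  max_identity: float = 0.30) -> List[str]:
    """Return the IDs of chains that survive both filters."""
    kept: List[str] = []
    # Visit longer chains first so the longest member of a similar group
    # is the one retained (a design choice, not stated in the source).
    for cid in sorted(chains, key=lambda c: (-len(chains[c]), c)):
        seq = chains[cid]
        if len(seq) < min_len:
            continue  # fewer than 50 residues: discard
        if any(identity(seq, chains[k]) >= max_identity for k in kept):
            continue  # similarity >= 30% to a kept chain: discard
        kept.append(cid)
    return kept
```

In practice the similarity would come from a sequence alignment or clustering tool rather than a position-wise comparison.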
2) Divide the preprocessed protein chain data into interface residues and non-interface residues according to the relative accessible surface area of each amino acid and the distance between the α-carbon atoms of two residues. Specifically, if the relative accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area, the residue is classified as a surface residue. Among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, the residue is classified as a non-interface residue. This embodiment contains 10430 surface residues, of which 2299 (22.04%) are interface residues and the remaining 8131 are non-interface residues.
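The two thresholds of step 2 can be illustrated with the short sketch below. It assumes that the α-carbon distance is measured between a residue and the residues of its partner chain, that coordinates are given in nanometres, and that the relative accessible surface area has already been computed; the function and parameter names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def label_residues(rsa: Sequence[float],
                   ca_own: Sequence[Coord],
                   ca_partner: Sequence[Coord],
                   rsa_cutoff: float = 0.16,
                   dist_cutoff_nm: float = 1.2) -> List[Optional[int]]:
    """Label each residue: None = not a surface residue,
    1 = interface residue, 0 = non-interface surface residue."""
    labels: List[Optional[int]] = []
    for r, ca in zip(rsa, ca_own):
        if r < rsa_cutoff:
            labels.append(None)  # below 0.16: not on the surface
            continue
        # Interface if any partner C-alpha lies closer than 1.2 nm.
        near = any(math.dist(ca, p) < dist_cutoff_nm for p in ca_partner)
        labels.append(1 if near else 0)
    return labels
```

For the counts reported in this embodiment, 2299 + 8131 = 10430 and 2299 / 10430 ≈ 22.04%, which matches the stated proportion.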
3) Extract features of the protein chains from databases on the basis of the evolutionary conservation of amino acids. In this embodiment the features are extracted from the HSSP database and the ConSurf Server database. The extracted features are the residue spatial sequence profile, the residue sequence information entropy, the relative entropy, the residue sequence conservation weight, and the residue evolutionary rate. The five features are described as follows:
Residue spatial sequence profile: obtained from a multiple sequence alignment and commonly used in research, it gives the frequency with which each amino acid appears at a given residue position in the primary structure of the protein.
Residue sequence information entropy: a conservation score of sequence variability, measured according to Shannon's information theory;
Relative entropy: the normalized residue sequence information entropy. It is defined in terms of Entropy, the sequence information entropy; d, the total number of amino acid types; and f_i, the frequency with which the i-th amino acid occurs at the sequence position.
Residue sequence conservation weight: the calculated conservation of a position in the protein sequence.
Residue evolutionary rate: used to describe evolutionary information. The Rate4Site algorithm operates on the conservation score of each residue position in the sequence and calculates the conservation of each amino acid position by computing the maximum-likelihood estimate of the evolutionary rate.
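The entropy-based features can be made concrete with a small sketch. It assumes the usual Shannon form with natural logarithms, Entropy = -Σ f_i ln f_i, and a normalization by ln d so that the relative entropy lies between 0 and 1; the exact logarithm base and scaling used by the databases may differ, so both choices are assumptions.

```python
import math
from typing import Sequence


def sequence_entropy(freqs: Sequence[float]) -> float:
    """Shannon entropy of the amino acid frequencies at one position.
    Zero frequencies contribute nothing (0 * ln 0 is taken as 0)."""
    return sum(-f * math.log(f) for f in freqs if f > 0)


def relative_entropy(freqs: Sequence[float]) -> float:
    """Entropy normalized by ln(d), where d is the number of amino acid
    types (assumed normalization); requires d > 1."""
    d = len(freqs)
    return sequence_entropy(freqs) / math.log(d)
```

Under these assumptions a fully conserved position scores 0 and a position where all 20 amino acids are equally frequent scores 1.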
4) Fuse the extracted features and expand the features of each residue under test to obtain the data set. The features of each residue under test are expanded as follows: the features of each residue under test are combined in turn with the features of the ten surface residues spatially nearest to it, thereby expanding the dimension of each feature vector. In this embodiment a 264-dimensional feature vector is finally obtained for each residue. Note that the residues under test are the interface residues and the non-interface residues.
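The neighbor-based expansion of step 4 can be sketched as below. The sketch assumes that every residue has at least ten other surface residues, that spatial proximity is measured between representative coordinates such as α-carbon positions, and that ties are broken by index; the function name is hypothetical. The 264 dimensions are consistent with 24 values per residue (for instance a 20-value sequence profile plus the four scalar features) concatenated over the residue and its ten neighbors, 24 × 11 = 264, although this breakdown is an inference rather than something the text states.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def expand_features(features: Sequence[Sequence[float]],
                    coords: Sequence[Coord],
                    n_neighbors: int = 10) -> List[List[float]]:
    """Concatenate each residue's features with those of its
    n_neighbors spatially nearest surface residues."""
    expanded: List[List[float]] = []
    for i, ci in enumerate(coords):
        # Other residues ordered by distance (index breaks ties).
        order = sorted((j for j in range(len(coords)) if j != i),
                       key=lambda j: (math.dist(ci, coords[j]), j))
        row = list(features[i])
        for j in order[:n_neighbors]:
            row.extend(features[j])
        expanded.append(row)
    return expanded
```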
5) Treat the imbalance of the data set with a sampling technique. The sampling techniques used for the imbalance of the data set proceed as follows:
A. Repeated edited nearest neighbors (RENN)
This algorithm applies the edited nearest neighbors algorithm (Edited Nearest Neighbors) repeatedly. The edited nearest neighbors algorithm proceeds as follows: a target sample is first chosen, and the classes of its three nearest samples in the data set are checked using the K-nearest-neighbor rule. Specifically, the distances between the target sample and all samples in the data set are calculated, and the three samples in the whole data set nearest to the target sample are examined. If two or more of these samples belong to a class different from that of the target sample, the target sample is regarded as noise and is deleted. RENN repeats this algorithm until no further samples can be removed.
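The editing rule described above can be sketched in plain Python. The sketch follows the text literally, so any sample with two or more differently labeled neighbors among its three nearest is removed, and passes are repeated until nothing changes; the Euclidean distance, the index-based tie-breaking, and the `max_iter` safety cap are assumptions, and library implementations may restrict deletion to particular classes.

```python
import math
from typing import List, Sequence


def enn_pass(X: Sequence[Sequence[float]], y: Sequence[int],
             k: int = 3, min_disagree: int = 2) -> List[int]:
    """One edited-nearest-neighbors pass; returns indices to keep."""
    keep: List[int] = []
    for i in range(len(X)):
        neighbors = sorted((j for j in range(len(X)) if j != i),
                           key=lambda j: (math.dist(X[i], X[j]), j))[:k]
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        if disagree < min_disagree:
            keep.append(i)  # not flagged as noise
    return keep


def renn(X: Sequence[Sequence[float]], y: Sequence[int],
         k: int = 3, max_iter: int = 100) -> List[int]:
    """Repeat ENN until no sample is removed; returns original indices."""
    idx = list(range(len(X)))
    for _ in range(max_iter):
        keep = enn_pass([X[i] for i in idx], [y[i] for i in idx], k)
        if len(keep) == len(idx):
            break  # converged: nothing left to remove
        idx = [idx[i] for i in keep]
    return idx
```

The brute-force neighbor search is quadratic in the number of samples, which is acceptable for an illustration but would normally be replaced by an indexed search on a data set of roughly ten thousand residues.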
Alternatively, the imbalance of the data set is treated with the following method:
B. Instance hardness threshold (IHT)
This method uses the concept of instance hardness (IH), which expresses the probability that a data point in the training set is misclassified. Data samples lying on the boundary between two or more classes, and samples with noisy characteristics, have higher IH values, because a learning algorithm would have to overfit in order to classify them. IH is derived from $p(h \mid t)$ by Bayes' theorem, where $h$ denotes the mapping function that maps input features to their associated labels and $t$ denotes the training data. In the invention, the number of interface residues is first compared with the number of non-interface residues, the more numerous of the two residue classes is selected, and the data points in the selected class whose IH value, given by the following equation, is greater than 0.7 are deleted:
$IH(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$
where $p(y_i \mid x_i, h)$ is the probability that the mapping function $h$ assigns the label $y_i$ to the input feature vector $x_i$. The larger $p(y_i \mid x_i, h)$ is, the more likely it is that the correct label is assigned to $x_i$.
Notably, the idea behind IHT comes from Bayes' theorem:
$p(h \mid t) = \frac{p(t \mid h)\, p(h)}{p(t)}$
where $h$ is the function that maps input feature vectors to their corresponding labels and $t$ is the training set. Using Bayes' theorem, and treating the training instances as independent, the concept of instance hardness follows from the decomposition of $p(h \mid t)$:
$p(h \mid t) = \frac{\prod_{i=1}^{|t|} p(y_i \mid x_i, h)\, p(x_i \mid h)\; p(h)}{p(t)}$
Based on this concept, IHT is an undersampling method that removes the data points with higher IH values from the majority class until the data set is balanced.
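The thresholding step of IHT can be sketched as follows. The sketch assumes that a classifier h has already supplied, for every sample, the estimated probability of its true label (`prob_true`, a hypothetical input that would typically come from cross-validated predictions); it then removes majority-class samples whose hardness exceeds 0.7, as described above.

```python
from typing import Dict, Hashable, List, Sequence


def iht_undersample(y: Sequence[Hashable],
                    prob_true: Sequence[float],
                    threshold: float = 0.7) -> List[int]:
    """Return the indices kept after removing hard majority-class samples.

    prob_true[i] is p(y_i | x_i, h), the estimated probability that
    classifier h assigns sample i its true label.
    """
    counts: Dict[Hashable, int] = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)  # the more numerous class

    keep: List[int] = []
    for i, (label, p) in enumerate(zip(y, prob_true)):
        hardness = 1.0 - p  # IH = 1 - p(y_i | x_i, h)
        if label == majority and hardness > threshold:
            continue  # hard majority-class sample: remove
        keep.append(i)
    return keep
```

Minority-class samples are never removed in this sketch, regardless of their hardness, which matches the undersampling intent of the method.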
Notably, imbalanced data bias the model's predictions toward the majority-class samples, so that interface residues are predicted very poorly; treating the imbalance of the data set improves the prediction of interface residues. Here, the majority-class samples are the samples of the more numerous of the two residue classes, interface and non-interface. For example, if the number of interface residues is larger than the number of non-interface residues, the interface residue samples are the majority-class samples.
Further, the treated data set is divided into a training set and a test set by ten-fold cross-validation. In this embodiment the data set is divided into ten subsets, of which nine form the training set and the remaining one forms the test set.
Further, the protein-protein interaction sites are obtained using the training set. Specifically, an XGBoost classification model is first trained on the training set as follows. For a given training set $D = \{(x_i, y_i)\}_{i=1}^{n}$, the $K$ trained classification or regression trees $F = \{f_1(x), f_2(x), \ldots, f_K(x)\}$ each assign an input sample to a leaf node according to the split points of its attribute values, and each leaf node corresponds to a real-valued score $f_k$. When a sample $x_i$ to be predicted is given, the prediction for that sample is the sum of the predictions of all the trees. The model is as follows:
$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$
where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction result for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree.
The objective function of the model may be defined as:
$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$
The optimization objective of the model consists of two parts: the first part, $\sum_{i} l(y_i, \hat{y}_i)$, is the error function of the model; the second part, $\sum_{k} \Omega(f_k)$, is the regularization term of the model and represents the complexity of the $K$ trees.
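The additive form of the model and the two-part objective can be illustrated with a toy sketch. It is not an implementation of XGBoost training: trees are represented as plain callables returning a leaf score, the error function is assumed to be the logistic loss, and the complexity term is assumed to take the form γT + ½λ‖w‖² used in the XGBoost literature (T leaves with weights w). All names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], float]


def predict_score(trees: Sequence[Tree], x: Sequence[float]) -> float:
    """Prediction for x: the sum of the leaf scores of all K trees."""
    return sum(tree(x) for tree in trees)


def objective(trees: Sequence[Tree],
              leaf_weights: Sequence[Sequence[float]],
              X: Sequence[Sequence[float]],
              y: Sequence[int],
              gamma: float = 0.0,
              lam: float = 1.0) -> float:
    """Error term plus regularization term for a fixed set of trees."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-predict_score(trees, xi)))
        p = min(max(p, eps), 1.0 - eps)
        loss += -(yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    # Complexity of each tree: gamma * (number of leaves) + 0.5 * lam * ||w||^2
    reg = sum(gamma * len(w) + 0.5 * lam * sum(v * v for v in w)
              for w in leaf_weights)
    return loss + reg


# Two hand-written stumps standing in for trained trees.
stumps: List[Tree] = [
    lambda x: 0.5 if x[0] > 0 else -0.5,
    lambda x: 0.25 if x[1] > 0 else -0.25,
]
```

With these stumps, a sample whose first two features are both positive receives the score 0.5 + 0.25 = 0.75, which the logistic function then maps to a probability of being an interface residue.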
With reference to Fig. 2, the protein-protein interaction sites are obtained with the above XGBoost classification model. Notably, the XGBoost model classifies residues as interface or non-interface residues, and the accuracy of this classification represents the ability to predict protein-protein interaction sites. Further, the obtained protein-protein interaction sites are evaluated on the test set using the following indices:
Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
Sensitivity: $\frac{TP}{TP + FN}$
Precision: $\frac{TP}{TP + FP}$
Specificity: $\frac{TN}{TN + FP}$
F value: $\frac{2 \times Precision \times Recall}{Precision + Recall}$
MCC value: $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
where TP is the number of true positives, that is, correctly predicted positive samples; TN is the number of true negatives, that is, correctly predicted negative samples; FP is the number of false positives, that is, negative samples predicted as positive; and FN is the number of false negatives, that is, positive samples wrongly predicted as negative. The F value is the weighted harmonic mean of Precision and Recall (sensitivity) and combines the two; a higher F value indicates a more effective method. MCC is a good measure for imbalanced problems. It is essentially the correlation coefficient between the true and predicted values and lies between -1 and 1, where -1 indicates the worst prediction and 1 the best.
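The evaluation indices can be computed directly from the four confusion-matrix counts, as in the sketch below. It uses the standard definitions, takes the F value as the balanced F1 score, and returns 0.0 whenever a denominator is zero; that fallback and the function name are assumptions for illustration.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, precision, specificity, F value, and MCC."""

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # avoid division by zero

    sensitivity = ratio(tp, tp + fn)  # also called recall
    precision = ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": ratio(tn, tn + fp),
        "f_value": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

For example, with TP = 40, TN = 45, FP = 5, and FN = 10, the accuracy is 85/100 = 0.85, the sensitivity is 40/50 = 0.8, and the specificity is 45/50 = 0.9.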
The results of this embodiment in identifying protein-protein interaction sites are detailed in Table 1. The data in Table 1 show that the accuracy of this embodiment reaches up to 80.7%.
Table 1. Classification performance of the XGBoost model with the two sampling techniques
With reference to Fig. 3, XGB denotes the method of the invention, and Li and Luo denote support vector machine (SVM) methods. Compared with the other methods, the method of the invention identifies protein-protein interaction sites more effectively. Further, in the method of the invention, preprocessing the sample data avoids the problem of uneven protein chain data quality and makes the subsequent work easier. To address the feature representation of the sample data, the invention extracts features from databases on the basis of the evolutionary conservation of amino acids and fuses them, so that protein interactions are characterized better. Treating the imbalance of the data set balances the data set and thereby improves the accuracy with which protein-protein interaction sites are identified. Obtaining the protein-protein interaction sites with an XGBoost model can avoid varying degrees of false positives and false negatives, greatly reduces space and time overhead, and further improves the prediction of protein-protein interaction sites.
The invention has been described in detail above with reference to specific exemplary embodiments. It should be understood, however, that various modifications and variations may be made without departing from the scope of the invention as defined by the appended claims. The detailed description and the drawings are to be regarded as illustrative only and not restrictive; any such modifications and variations fall within the scope of the invention described herein. In addition, the background section is intended to explain the state of development and the significance of this technology, and is not intended to limit the invention or its application and fields of application.

Claims (10)

1. a kind of protein-protein interaction sites recognition methods, which is characterized in that first acquire protein chain data and to protein Chain data are pre-processed, then pretreated protein chain data are divided into interface residue and non-interface residue;Then from number According to the feature of extraction protein chain in library, and the feature of extraction is merged to obtain data set, then the imbalance to data set Property handled, then will treated data set is divided into training set and test set, recycle training set training XGBoost mould Type finally obtains protein-protein interaction sites using XGBoost model.
2. a kind of protein-protein interaction sites recognition methods according to claim 1, which is characterized in that specific steps are such as Under:
1) protein chain data are first acquired and protein chain data are pre-processed;
2) by the way that the opposite accessible surface of amino acid is long-pending and the distance between two residue a carbon atoms are by pretreated albumen Matter chain data are divided into interface residue and non-interface residue;
3) feature of protein chain is extracted from database by the evolutionary conservatism of amino acid;
4) feature of extraction is merged, and the feature of each residue to be measured is expanded to obtain data set;
5) disequilibrium is carried out to data set to handle, then by treated, data set is divided into training set and test set, then Using training set training XGBoost model, protein-protein interaction sites finally are obtained using XGBoost model.
3. a kind of protein-protein interaction sites recognition methods according to claim 1, which is characterized in that protein chain Data carry out pretreated process are as follows: abandon the protein chain data for being less than 50 residues first, then to cast out sequence similarity big In the protein chain data for being equal to 30%, then gives up some out-of-date protein chain data, finally obtain the albumen of nonredundancy Matter chain data.
4. a kind of protein-protein interaction sites recognition methods according to claim 1, which is characterized in that after pretreatment Protein chain data be divided into the detailed process of interface residue and non-interface residue are as follows: if the opposite of amino acid residue can connect Touching surface area is at least the 0.16 of its maximum accessible surface product, which is divided into surface residue;In surface residue, if Spacing d < 1.2nm between any two residues alpha carbon atom, the residue are divided into interface residue;If d >=1.2nm, residue It is divided into non-interface residue.
5. The protein-protein interaction site recognition method according to claim 2, characterized in that the features of each residue under test are combined in turn with the features of its ten spatially nearest surface residues, thereby expanding the dimension of each feature.
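A minimal sketch of this neighbourhood expansion, assuming that "spatially nearest" is measured between α-carbon coordinates and that neighbour features are concatenated in order of increasing distance. The function name and the zero padding for residues with fewer than ten neighbours are assumptions, not details from the claim.

```python
import math


def expand_features(coords, feats, k=10):
    """Concatenate each residue's features with those of its k nearest residues.

    coords: alpha-carbon coordinates of the surface residues
    feats:  per-residue feature vectors, all of the same length
    """
    dim = len(feats[0])
    expanded = []
    for i, ci in enumerate(coords):
        # Indices of the other residues, nearest first, truncated to k.
        neighbours = sorted((j for j in range(len(coords)) if j != i),
                            key=lambda j: math.dist(ci, coords[j]))[:k]
        row = list(feats[i])
        for j in neighbours:
            row.extend(feats[j])
        # Pad with zeros when fewer than k neighbours exist (assumption).
        row.extend([0.0] * (dim * (k - len(neighbours))))
        expanded.append(row)
    return expanded
```

With D features per residue, each expanded vector has 11 × D entries: the residue's own features plus those of ten neighbours.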
6. The protein-protein interaction site recognition method according to claim 2, characterized in that the imbalance processing of the data set is as follows:
a target sample is first selected, and the classes of its three nearest samples in the data set are then checked by the K-nearest-neighbor rule; when the class of two or more of these samples differs from that of the selected target sample, the target sample is regarded as noise data and is deleted; this operation is repeated until no noise data remain.
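The cleaning rule in claim 6 resembles the edited-nearest-neighbours technique. The sketch below is one possible reading, assuming Euclidean distance and deletion of all flagged samples at the end of each pass; the claim specifies neither, and the function name is hypothetical.

```python
import math


def edited_nearest_neighbours(X, y, k=3, min_disagree=2):
    """Repeatedly delete samples whose k nearest neighbours mostly disagree."""
    X, y = list(X), list(y)
    while True:
        noisy = set()
        for i in range(len(X)):
            # The k nearest other samples to sample i.
            nbrs = sorted((j for j in range(len(X)) if j != i),
                          key=lambda j: math.dist(X[i], X[j]))[:k]
            # Noise: at least two of the three neighbours have another class.
            if sum(y[j] != y[i] for j in nbrs) >= min_disagree:
                noisy.add(i)
        if not noisy:
            return X, y
        X = [x for i, x in enumerate(X) if i not in noisy]
        y = [c for i, c in enumerate(y) if i not in noisy]
```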
7. The protein-protein interaction site recognition method according to claim 2, characterized in that the imbalance processing of the data set is as follows: the number of interface residues is first compared with the number of non-interface residues, the residue class with the larger number is then selected, and the data points in the selected residue class whose IH value is greater than 0.7 are deleted using the following equation:
$IH(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$
where $p(y_i \mid x_i, h)$ is the probability that the mapping function h assigns the label $y_i$ to the input feature vector $x_i$.
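The instance-hardness rule in claim 7 can be sketched as below. It assumes that the probabilities p(y_i | x_i, h) have already been estimated by some classifier h (for example through cross-validation), which the claim does not specify; the function name and the `proba_true` argument are hypothetical.

```python
from collections import Counter


def instance_hardness_filter(X, y, proba_true, threshold=0.7):
    """Drop majority-class samples whose instance hardness exceeds the threshold.

    proba_true[i] is p(y_i | x_i, h): the probability that classifier h
    assigns sample i its true label (assumed to be supplied by the caller).
    """
    majority = Counter(y).most_common(1)[0][0]
    keep = []
    for i, (label, p) in enumerate(zip(y, proba_true)):
        hardness = 1.0 - p  # IH(<x_i, y_i>) = 1 - p(y_i | x_i, h)
        if label == majority and hardness > threshold:
            continue  # hard majority-class sample: delete it
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

Only the larger class is thinned, so hard samples of the smaller class are kept regardless of their IH value.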
8. The protein-protein interaction site recognition method according to claim 2, characterized in that the formula for training the XGBoost model on the training set is:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where F is the space of all classification and regression trees, $\hat{y}_i$ is the prediction result for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the k-th tree;
the formula of the XGBoost model is:
$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $l$ is the error function of the model and $\Omega$ is the regularization term, which represents the complexity of the K trees.
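To make the roles of the terms in claim 8 concrete, the sketch below evaluates an additive tree prediction and a regularized objective in plain Python. It assumes the standard XGBoost formulation: the prediction is the sum of the K tree scores, the error function is taken to be logistic loss, and the complexity of a tree with T leaves and leaf weights w is γT + ½λ‖w‖². The claim text names neither the loss nor the form of the regularization term, and all function names here are hypothetical. The code evaluates the objective only; it does not train a model.

```python
import math


def predict_raw(trees, x):
    """Additive model: sum of the leaf scores f_k(x) over all K trees."""
    return sum(tree(x) for tree in trees)


def logistic_loss(y, raw):
    """Error term for one sample with label y in {0, 1} (assumed loss)."""
    p = 1.0 / (1.0 + math.exp(-raw))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def tree_complexity(leaf_weights, gamma=1.0, lam=1.0):
    """Assumed regularizer: gamma * T + 0.5 * lambda * sum of squared weights."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(trees, leaf_weights_per_tree, X, y, gamma=1.0, lam=1.0):
    """Total error over the samples plus total complexity of the trees."""
    loss = sum(logistic_loss(yi, predict_raw(trees, xi)) for xi, yi in zip(X, y))
    reg = sum(tree_complexity(w, gamma, lam) for w in leaf_weights_per_tree)
    return loss + reg
```

Each tree is represented here as a callable that returns a leaf score, for example a depth-one stump such as `lambda x: 0.5 if x[0] > 0 else -0.5`.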
9. The protein-protein interaction site recognition method according to any one of claims 1 to 8, characterized in that the obtained protein-protein interaction sites are tested with the test set.
10. The protein-protein interaction site recognition method according to any one of claims 1 to 8, characterized in that the extracted features are the residue spatial sequence profile, residue sequence information entropy, relative entropy, residue sequence conservation weight, and residue evolutionary rate.
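Two of the features listed in claim 10, sequence information entropy and relative entropy, are commonly computed per column of a multiple sequence alignment. The sketch below assumes Shannon entropy in bits, a uniform background distribution over the 20 standard amino acids, and that gap characters are ignored; the claim states none of these details, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column):
    """Amino-acid frequencies in one alignment column, ignoring gaps."""
    counts = Counter(a for a in column if a in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()} if total else {}


def shannon_entropy(column):
    """Sequence information entropy in bits; 0 for a fully conserved column."""
    return -sum(p * math.log2(p) for p in column_frequencies(column).values())


def relative_entropy(column, background=None):
    """Kullback-Leibler divergence of the column from a background distribution."""
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    return sum(p * math.log2(p / background[a])
               for a, p in column_frequencies(column).items())
```

A fully conserved column has entropy 0 and, against a uniform background, the maximum relative entropy of log2(20) bits.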
CN201910686641.XA 2019-07-29 2019-07-29 A kind of protein-protein interaction sites recognition methods Pending CN110265085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910686641.XA CN110265085A (en) 2019-07-29 2019-07-29 A kind of protein-protein interaction sites recognition methods

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910686641.XA CN110265085A (en) 2019-07-29 2019-07-29 A kind of protein-protein interaction sites recognition methods

Publications (1)

Publication Number Publication Date
CN110265085A true CN110265085A (en) 2019-09-20

Family

ID=67912145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910686641.XA Pending CN110265085A (en) 2019-07-29 2019-07-29 A kind of protein-protein interaction sites recognition methods

Country Status (1)

Country Link
CN (1) CN110265085A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010165230A (en) * 2009-01-16 2010-07-29 Pharma Design Inc Method and system for predicting protein-protein interaction as drug target
CN109872781A (en) * 2019-02-26 2019-06-11 哈尔滨工业大学 Drug target recognition methods based on Xgboost

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHANGQING MEI et al.: "Unbalance Data Processing Strategy for Protein Interaction Sites Prediction", 《2018 9TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY IN MEDICINE AND EDUCATION》 *
NAUFAL AZMI VERDIKHA et al.: "Study of Undersampling Method: Instance Hardness Threshold with Various Estimators for Hate Speech Classification", 《IJITEE》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111834010A (en) * 2020-05-25 2020-10-27 重庆工贸职业技术学院 COVID-19 detection false negative identification method based on attribute reduction and XGboost
CN111834010B (en) * 2020-05-25 2023-12-01 重庆工贸职业技术学院 Virus detection false negative identification method based on attribute reduction and XGBoost
CN113611360A (en) * 2021-08-11 2021-11-05 邵阳学院 Protein-protein interaction site prediction method based on deep learning and XGboost
CN115497555A (en) * 2022-08-16 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-species protein function prediction method, device, equipment and storage medium
CN115497555B (en) * 2022-08-16 2024-01-05 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-species protein function prediction method, device, equipment and storage medium
CN115512763A (en) * 2022-09-06 2022-12-23 北京百度网讯科技有限公司 Method for generating polypeptide sequence, method and device for training polypeptide generation model
CN115512763B (en) * 2022-09-06 2023-10-24 北京百度网讯科技有限公司 Polypeptide sequence generation method, and training method and device of polypeptide generation model

Similar Documents

Publication Publication Date Title
CN110265085A (en) A kind of protein-protein interaction sites recognition methods
Bashashati et al. A survey of flow cytometry data analysis methods
Nguyen et al. Learning graph representation via frequent subgraphs
CN112767997A (en) Protein secondary structure prediction method based on multi-scale convolution attention neural network
KR102213670B1 (en) Method for prediction of drug-target interactions
CN112417132B (en) New meaning identification method for screening negative samples by using guest information
CN115472221A (en) Protein fitness prediction method based on deep learning
CN114743600A (en) Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity
CN116206688A (en) Multi-mode information fusion model and method for DTA prediction
KR20220083649A (en) Chemical binding similarity searching method using evolutionary information of protein
CN115206423A (en) Label guidance-based protein action relation prediction method
Dotan et al. Effect of tokenization on transformers for biological sequences
CN113823356A (en) Methylation site identification method and device
Golenko et al. IMPLEMENTATION OF MACHINE LEARNING MODELS TO DETERMINE THE APPROPRIATE MODEL FOR PROTEIN FUNCTION PREDICTION.
CN116401369B (en) Entity identification and classification method for biological product production terms
Murphy et al. Self-supervised learning of cell type specificity from immunohistochemical images
CN112270950A (en) Fusion network drug target relation prediction method based on network enhancement and graph regularization
CN116343915A (en) Construction method of biological sequence integrated classifier and biological sequence prediction classification method
CN112735532B (en) Metabolite identification system based on molecular fingerprint prediction and application method thereof
Aggarwal et al. A review on protein subcellular localization prediction using microscopic images
CN112151109A (en) Semi-supervised learning method for evaluating randomness of biomolecular cross-linking mass spectrometry identification
Altinier et al. An expert system for the classification of serum protein electrophoresis patterns
Dong et al. A region selection model to identify unknown unknowns in image datasets
Gholap et al. Content-based tissue image mining
Subhashini et al. PREDICTING SUBCELLULAR LOCALIZATION OF PROTEINS WITH MULTIPLE SITES USING THRESHOLD ML-KNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination