CN110265085A - A method for identifying protein-protein interaction sites - Google Patents
- Publication number
- CN110265085A (application number CN201910686641.XA)
- Authority
- CN
- China
- Prior art keywords
- protein
- residue
- interaction sites
- data
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses a method for identifying protein-protein interaction sites and belongs to the field of bioinformatics analysis. The method comprises the following steps: protein chain data are first collected and preprocessed, and the residues of the preprocessed chains are divided into interface residues and non-interface residues; features are then extracted from databases and fused to obtain a data set; the class imbalance of the data set is treated, and the treated data set is divided into a training set and a test set; an XGBoost model is trained on the training set, and the protein-protein interaction sites are finally obtained with the XGBoost model. The invention aims to overcome a deficiency of the prior art, in which predictions of protein-protein interaction sites carry varying degrees of false positives and false negatives that make the results difficult to interpret, and it can improve the accuracy with which protein-protein interaction sites are identified.
Description
Technical field
The present invention relates to the technical field of bioinformatics analysis, and more specifically to a method for identifying protein-protein interaction sites.
Background art
Proteins are the most important components of all cells, and interactions between proteins are the most basic activity in most cellular functions. Protein-protein interactions form an important part of the network of cellular biochemical reactions; the protein-protein interaction network is the main carrier of biological information regulation and a key factor in determining cell fate. Studying protein-protein interactions is the basis for understanding life activities and is one of the most important research fields of the post-genomic era. With the implementation of the Human Genome Project, the amount of data in protein sequence databases has risen sharply, and the structures of proteins and the interactions between them cannot all be determined one by one by experiment. In addition, the results of experimental screening usually suffer from sample imbalance, and handling this imbalance effectively in protein data is a challenge for researchers. Research on protein interactions plays a very important role in modern molecular biology. Revealing the interactions between proteins and building networks of these interactions has therefore become a hot topic in proteomics research.
A search of the prior art found an invention titled "Method for quantitatively analyzing interactions between proteins and application thereof" (publication number CN101982778A; publication date 2 March 2011). Based on the inventors' finding that Fluc and Rluc do not interact with each other, that scheme constructs a biotinylated Fluc expression vector and an Rluc expression vector. The proteins X and Y to be analyzed quantitatively are inserted into these vectors to obtain a biotinylated Fluc-X fusion protein expression vector and an Rluc-Y fusion protein expression vector. The two fusion proteins are co-expressed, or expressed separately and then incubated together in vitro; the biotinylated Fluc-X is purified with streptavidin-coated magnetic beads, and the DLR activity of the purified Fluc-X and Rluc-Y fusion proteins is then measured by DLR. Because the activities of the purified Fluc and Rluc are closely related both to the amounts of fusion protein added and to the strength of the interaction between proteins X and Y, the method can quantitatively determine the strength of the interaction between proteins X and Y.
Another invention is titled "Protein interaction detection method with a low false positive rate" (publication number CN103290091A; publication date 11 September 2013). That scheme relates to a detection system and belongs to the field of molecular biotechnology. Using bimolecular fluorescence complementation (BiFC) based on fluorescent proteins, it builds the BiFC detection system, which normally requires two single-expression plasmid vectors, into one dual-expression plasmid vector system. This markedly reduces the false positive rate of BiFC in protein-protein interaction studies and allows quantitative analysis. The scheme can also be used for screening unknown protein-protein interactions based on gene libraries and for detecting protein-protein interactions in living animals. However, the shortcoming of that application is that it cannot distinguish between different pests and diseases.
The methods above are experimental methods. They can predict protein-protein interaction sites, but they share the same shortcomings: they are time-consuming and labor-intensive, and they carry varying degrees of false positives and false negatives, which makes the results difficult to interpret. How to predict protein-protein interaction sites is therefore a problem that the prior art urgently needs to solve.
Summary of the invention
1. Problem to be solved
The object of the invention is to overcome a deficiency of the prior art, in which predictions of protein-protein interaction sites carry varying degrees of false positives and false negatives that make the results difficult to interpret. The invention provides a method for identifying protein-protein interaction sites that can avoid these false positives and false negatives, greatly reduces space and time overhead, and improves the accuracy with which protein-protein interaction sites are identified.
2. Technical solution
To solve the above problem, the invention adopts the following technical solution:
In the method for identifying protein-protein interaction sites of the invention, protein chain data are first collected and preprocessed, and the preprocessed protein chain data are divided into interface residues and non-interface residues. Features of the protein chains are then extracted from databases and fused to obtain a data set. The imbalance of the data set is treated, and the treated data set is divided into a training set and a test set. An XGBoost model is trained on the training set, and the protein-protein interaction sites are finally obtained with the XGBoost model.
Further, the specific steps are as follows:
1) Protein chain data are first collected and preprocessed;
2) The preprocessed protein chain data are divided into interface residues and non-interface residues according to the relative accessible surface area of the amino acids and the distance between the α-carbon atoms of two residues;
3) Features of the protein chains are extracted from databases on the basis of the evolutionary conservation of amino acids;
4) The extracted features are fused, and the features of each residue under test are expanded to obtain a data set;
5) The imbalance of the data set is treated, and the treated data set is divided into a training set and a test set; an XGBoost model is trained on the training set, and the protein-protein interaction sites are finally obtained with the XGBoost model.
Further, the protein chain data are preprocessed as follows: protein chains with fewer than 50 residues are discarded first; protein chains whose sequence similarity is greater than or equal to 30% are then removed; finally, outdated protein chain data are discarded, which yields non-redundant protein chain data.
Further, the preprocessed protein chain data are divided into interface residues and non-interface residues as follows: if the relative accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area, the residue is classified as a surface residue. Among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, it is classified as a non-interface residue.
Further, the features of each residue under test are combined in turn with the features of the ten surface residues spatially nearest to it, which expands the feature dimension.
Further, the imbalance of the data set is treated as follows:
A target sample is first selected, and the classes of its three nearest samples in the data set are then checked by the K-nearest-neighbor rule. If two or more of these samples belong to a different class from the target sample, the target sample is regarded as noise and is deleted. The operation is repeated until no noisy data remain.
Further, the imbalance of the data set is treated as follows: the number of interface residues is first compared with the number of non-interface residues, and the more numerous residue class is selected; the data points in the selected class whose IH value is greater than 0.7 are then deleted, using the following equation:
$$\mathrm{IH}(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$$
where $p(y_i \mid x_i, h)$ is the probability that the mapping function $h$ assigns label $y_i$ to the input feature vector $x_i$.
Further, the model trained on the training set is:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F$$
where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree.
The objective function of the XGBoost model is:
$$\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $\sum_{i} l(y_i, \hat{y}_i)$ is the error function of the model and $\sum_{k=1}^{K} \Omega(f_k)$ is the regularization term, which represents the complexity of the $K$ trees.
Further, the obtained protein-protein interaction sites are tested on the test set.
Further, the extracted features are the residue spatial sequence profile, the residue sequence information entropy, the relative entropy, the residue sequence conservation weight and the residue evolutionary rate.
3. Beneficial effects
Compared with the prior art, the invention has the following beneficial effects:
In the method for identifying protein-protein interaction sites of the invention, the sample data are preprocessed, which avoids the problem of uneven quality in the protein chain data and makes the subsequent work easier. Features are extracted from databases on the basis of the evolutionary conservation of amino acids and are fused, so that the interactions of the proteins are characterized better. The imbalance of the data set is further treated to balance the data set, which improves the accuracy with which protein-protein interaction sites are identified. The protein-protein interaction sites are then obtained with an XGBoost model, which can avoid varying degrees of false positives and false negatives, greatly reduces space and time overhead, and further improves the prediction of protein-protein interaction sites.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for identifying protein-protein interaction sites of Embodiment 1;
Fig. 2 compares the prediction performance obtained by combining the sampling techniques of Embodiment 1 with XGBoost;
Fig. 3 compares the predictions of protein-protein interaction sites made by the method of the invention and by other methods.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the invention, not all of them. Moreover, the embodiments are not independent of one another and can be combined as needed to achieve a better effect. The following detailed description of the embodiments shown in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
For a further understanding of the invention, the invention is described in detail below with reference to the drawings and embodiments.
Embodiment 1
In the method for identifying protein-protein interaction sites of the invention, protein chain data are first collected and preprocessed; the preprocessing avoids the problem of uneven quality in the protein chain data and makes the subsequent work easier. The preprocessed protein chain data are then divided into interface residues and non-interface residues, and features of the protein chains are extracted from databases and fused to obtain a data set. Note that extracting and fusing the features gives a better representation of the interactions of the proteins. The imbalance of the data set is then treated, and classification prediction is performed on the treated data set to obtain the protein-protein interaction sites.
With reference to Fig. 1, the specific steps of the method for identifying protein-protein interaction sites of the invention are as follows:
1) Protein chain data are first collected and preprocessed. The preprocessing proceeds as follows: protein chains with fewer than 50 residues are discarded first; protein chains whose sequence similarity is greater than or equal to 30% are then removed; finally, outdated protein chain data are discarded, which yields non-redundant protein chain data. In this embodiment 170 protein chains are collected, and 91 non-redundant protein chains are obtained after preprocessing.
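The length and redundancy filters of step 1) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the greedy order-dependent redundancy filter and the toy `toy_identity` similarity measure are hypothetical and not taken from the patent, which does not say how sequence similarity is computed; the removal of outdated entries is not modelled.

```python
from typing import Callable, Dict, List


def filter_chains(
    chains: Dict[str, str],
    similarity: Callable[[str, str], float],
    min_length: int = 50,
    max_similarity: float = 0.30,
) -> List[str]:
    """Return the IDs of chains kept after the length and redundancy filters."""
    kept: List[str] = []
    for chain_id, seq in chains.items():
        # Discard chains with fewer than 50 residues.
        if len(seq) < min_length:
            continue
        # Discard a chain whose similarity to an already kept chain is >= 30%.
        # (Greedy, order-dependent filtering is an assumption of this sketch.)
        if any(similarity(seq, chains[k]) >= max_similarity for k in kept):
            continue
        kept.append(chain_id)
    return kept


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in: fraction of identical positions, no alignment."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


chains = {
    "short": "ACDE",                 # too short
    "first": "ACDEFGHIKL" * 6,       # kept
    "copy": "ACDEFGHIKL" * 6,        # redundant with "first"
    "other": "WWWWWYYYYY" * 6,       # dissimilar, kept
}
kept = filter_chains(chains, toy_identity)
print(kept)
```

In practice the similarity callable would be replaced by an alignment-based measure; the thresholds 50 and 30% are the ones stated in the text.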
2) The preprocessed protein chain data are divided into interface residues and non-interface residues according to the relative accessible surface area of the amino acids and the distance between the α-carbon atoms of two residues. Specifically, if the relative accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area, the residue is classified as a surface residue. Among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, it is classified as a non-interface residue. In this embodiment there are 10430 surface residues, of which 2299 (22.04%) are interface residues and the remaining 8131 are non-interface residues.
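A minimal sketch of the labelling rule in step 2) is given below. It assumes that the α-carbon distance is measured between a residue and the residues of the partner chain, which the text does not state explicitly, and that coordinates are given in nanometres; `label_residues` and its parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def label_residues(
    rsa: Sequence[float],
    ca_coords: Sequence[Coord],
    partner_ca_coords: Sequence[Coord],
    rsa_cutoff: float = 0.16,
    dist_cutoff_nm: float = 1.2,
) -> List[str]:
    """Label each residue as 'buried', 'interface' or 'non-interface'."""
    labels: List[str] = []
    for r, xyz in zip(rsa, ca_coords):
        # Surface residues have a relative accessible surface area >= 0.16.
        if r < rsa_cutoff:
            labels.append("buried")
            continue
        # Assumption: d is the smallest C-alpha distance to the partner chain.
        d_min = min(math.dist(xyz, p) for p in partner_ca_coords)
        labels.append("interface" if d_min < dist_cutoff_nm else "non-interface")
    return labels


rsa = [0.05, 0.40, 0.40]
ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
partner = [(1.5, 0.0, 0.0)]
labels = label_residues(rsa, ca, partner)
print(labels)

# Consistency check of the counts reported for the embodiment.
assert 2299 + 8131 == 10430
```

Buried residues are labelled separately here only to make the two-stage rule visible; the method itself keeps the surface residues and discards the rest.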
3) Features of the protein chains are extracted from databases on the basis of the evolutionary conservation of amino acids. In this embodiment the features are extracted from the HSSP database and the ConSurf Server database; the extracted features are the residue spatial sequence profile, the residue sequence information entropy, the relative entropy, the residue sequence conservation weight and the residue evolutionary rate. The five features are described as follows:
Residue spatial sequence profile: obtained by multiple sequence alignment and commonly used in research; it gives the frequency with which each amino acid occurs at a given residue position in the primary structure of the protein.
Residue sequence information entropy: a conservation score that measures the variability of the sequence, based on Shannon information theory.
Relative entropy: the normalized residue sequence information entropy.
Residue sequence conservation weight: a calculated measure of the conservation of a position in the protein sequence.
Residue evolutionary rate: describes the evolutionary information of a residue. The Rate4Site algorithm scores the conservation of each residue position in the sequence, computing the conservation of each amino acid position from the maximum likelihood estimate of its evolutionary rate.
In the entropy formulas, Entropy denotes the sequence information entropy, d is the total number of amino acid types, and f_i is the frequency with which the i-th amino acid occurs at the sequence position.
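The two entropy features can be illustrated with a short sketch. The Shannon entropy follows directly from the symbols d and f_i defined above; the natural logarithm and the normalization by ln d are assumptions of this sketch, because the exact formulas are not reproduced in the text and a database such as HSSP may scale its relative entropy differently.

```python
import math
from typing import Sequence


def sequence_entropy(freqs: Sequence[float]) -> float:
    """Shannon entropy -sum(f_i * ln f_i) over the amino acid frequencies."""
    # Terms with f_i == 0 contribute nothing (limit of f * ln f as f -> 0).
    return -sum(f * math.log(f) for f in freqs if f > 0)


def relative_entropy(freqs: Sequence[float], d: int = 20) -> float:
    """Entropy divided by its maximum ln(d); this normalization is assumed."""
    return sequence_entropy(freqs) / math.log(d)


uniform = [0.05] * 20            # every amino acid equally frequent
conserved = [1.0] + [0.0] * 19   # a fully conserved position

print(relative_entropy(uniform))
print(relative_entropy(conserved))
```

A fully conserved position has entropy zero, and a position where all d amino acids are equally frequent reaches the maximum, so the normalized value lies between 0 and 1.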
4) The extracted features are fused, and the features of each residue under test are expanded to obtain the data set. The features of each residue under test are expanded as follows: the features of the residue are combined in turn with the features of the ten surface residues spatially nearest to it, which expands the feature dimension. In this embodiment a 264-dimensional feature vector is finally obtained for each residue. Note that the residues under test are the interface residues and the non-interface residues.
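The neighbor-based expansion of step 4) can be sketched as below. The per-residue feature length of 24 is an inference of this sketch (264 = 11 × 24 would fit a 20-value profile plus four scalar features) and is not stated in the text; measuring spatial proximity by α-carbon distance and the name `expand_features` are likewise assumptions.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def expand_features(
    features: Sequence[Sequence[float]],
    ca_coords: Sequence[Coord],
    n_neighbors: int = 10,
) -> List[List[float]]:
    """Concatenate each residue's features with those of its nearest residues."""
    expanded: List[List[float]] = []
    for i, xyz in enumerate(ca_coords):
        # Rank all other surface residues by distance to residue i.
        others = sorted(
            (j for j in range(len(ca_coords)) if j != i),
            key=lambda j: math.dist(xyz, ca_coords[j]),
        )
        row = list(features[i])
        for j in others[:n_neighbors]:
            row.extend(features[j])   # append neighbors from nearest to farthest
        expanded.append(row)
    return expanded


# Twelve toy surface residues on a line, each with 24 identical feature values.
coords = [(float(i), 0.0, 0.0) for i in range(12)]
features = [[float(i)] * 24 for i in range(12)]
expanded = expand_features(features, coords)
print(len(expanded), len(expanded[0]))
```

Each expanded row starts with the residue's own features, followed by the features of its ten nearest neighbors.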
5) The imbalance of the data set is treated with a sampling technique. Note that the sampling technique used to treat the imbalance is as follows:
A. Repeated Edited Nearest Neighbours (RENN)
This algorithm applies the Edited Nearest Neighbours (ENN) algorithm repeatedly. The steps of ENN are: a target sample is first selected, and the classes of its three nearest samples in the data set are then checked by the K-nearest-neighbor rule. Specifically, the distance from every sample in the data set to the target sample is computed, and the three samples nearest to the target sample in the whole data set are examined. If two or more of them belong to a different class from the target sample, the target sample is regarded as noise and is deleted. RENN repeats this procedure until no more samples can be removed.
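The RENN procedure described above can be sketched in plain Python. This is an illustrative reimplementation, not the code used in the patent: the function names are hypothetical, Euclidean distance is assumed, every class is eligible for deletion, and all deletions within one pass are decided against the data set as it stood at the start of that pass.

```python
import math
from typing import List, Sequence


def edited_nearest_neighbours(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 3, min_disagree: int = 2
) -> List[int]:
    """Return indices of samples kept after one ENN pass."""
    keep: List[int] = []
    for i in range(len(X)):
        # The k nearest samples to the target sample i.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        disagree = sum(y[j] != y[i] for j in neighbours)
        # Two or more neighbours of another class mark the sample as noise.
        if disagree < min_disagree:
            keep.append(i)
    return keep


def repeated_enn(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 3, max_iter: int = 100
) -> List[int]:
    """Repeat ENN until a pass removes nothing; return surviving indices."""
    idx = list(range(len(X)))
    for _ in range(max_iter):
        keep = edited_nearest_neighbours([X[i] for i in idx], [y[i] for i in idx], k)
        if len(keep) == len(idx):
            break
        idx = [idx[i] for i in keep]
    return idx


X = [(0, 0), (0, 1), (1, 0), (1, 1),          # class 0 cluster
     (10, 10), (10, 11), (11, 10), (11, 11),  # class 1 cluster
     (0.5, 0.5)]                              # class 1 sample inside class 0
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
survivors = repeated_enn(X, y)
print(survivors)
```

In the toy data the isolated class 1 sample has three class 0 neighbours and is removed in the first pass, after which a second pass removes nothing and the loop stops.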
Alternatively, the imbalance of the data set is treated with the following method:
B. Instance Hardness Threshold (IHT)
This method uses instance hardness (IH) to represent the probability that a data point in the training set is misclassified. Data samples that lie on the boundary between two or more classes, or that have noisy features, have higher IH values, because the learning algorithm has to overfit in order to classify them. IH is derived from P(h | t) by Bayes' theorem, where h denotes the mapping function that maps input features to their associated labels and t denotes the training data. In the present invention, the number of interface residues is first compared with the number of non-interface residues and the more numerous residue class is selected; the data points in the selected class whose IH value is greater than 0.7 are then deleted, using the following equation:
$$\mathrm{IH}(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$$
where $p(y_i \mid x_i, h)$ is the probability that the mapping function $h$ assigns label $y_i$ to the input feature vector $x_i$; the larger $p(y_i \mid x_i, h)$ is, the more likely it is that $x_i$ receives its correct label.
It is worth noting that the idea behind IHT derives from Bayes' theorem:
$$P(h \mid t) = \frac{P(t \mid h)\,P(h)}{P(t)}$$
where $h$ is the function that maps input feature vectors to their corresponding label vectors and $t$ is the training set. Decomposing $P(h \mid t)$ with Bayes' theorem yields the concept of instance hardness:
$$P(h \mid t) = \frac{\prod_{i=1}^{|t|} P(y_i \mid x_i, h)\,P(x_i \mid h)\,P(h)}{P(t)}$$
Based on this concept, IHT is used as an undersampling method: the data points with the highest IH values in the majority class are removed until the data set is balanced.
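The thresholding step of IHT can be sketched as follows. The sketch assumes that the probabilities p(y_i | x_i, h) have already been estimated by some classifier h, for example from out-of-fold predictions; the input `p_true` and the function name are hypothetical, and only the fixed threshold of 0.7 is applied, without iterating until the classes are balanced.

```python
from collections import Counter
from typing import List, Sequence


def instance_hardness_filter(
    y: Sequence[int], p_true: Sequence[float], threshold: float = 0.7
) -> List[int]:
    """Return indices kept after removing hard majority-class samples.

    p_true[i] is the estimated probability p(y_i | x_i, h) that the
    classifier h assigns sample i its true label y_i.
    """
    majority = Counter(y).most_common(1)[0][0]   # the more numerous class
    keep: List[int] = []
    for i, (label, p) in enumerate(zip(y, p_true)):
        hardness = 1.0 - p                       # IH = 1 - p(y_i | x_i, h)
        if label == majority and hardness > threshold:
            continue                             # drop hard majority samples only
        keep.append(i)
    return keep


y = [0, 0, 0, 0, 1, 1]
p_true = [0.9, 0.2, 0.8, 0.35, 0.1, 0.6]
kept_idx = instance_hardness_filter(y, p_true)
print(kept_idx)
```

Sample 1 is a majority-class point with IH = 0.8 and is removed, whereas sample 4 has an even higher IH but belongs to the minority class and is kept.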
It is worth noting that imbalanced data biases the model's predictions toward the majority-class samples, so that interface residues are predicted very poorly; treating the imbalance of the data set improves the prediction of interface residues. Here, the majority-class samples are the samples of whichever residue class, interface or non-interface, is more numerous; for example, if there were more interface residues than non-interface residues, the interface residues would be the majority-class samples. Further, the treated data set is divided into a training set and a test set by ten-fold cross-validation: in this embodiment the data set is divided into ten subsets, of which nine form the training set and the remaining one forms the test set. Further, the protein-protein interaction sites are obtained using the training set. Specifically, an XGBoost classification model is first trained on the training set as follows. For a given training set $D = \{(x_i, y_i)\}$, a set of $K$ classification or regression trees $\{f_1(x), f_2(x), \ldots, f_K(x)\}$, each belonging to $F$, is trained. Each tree assigns an input sample to a leaf node according to split points on the attribute values, and each leaf node corresponds to a real-valued score. Given a sample $x_i$ to be predicted, the prediction for that sample is the sum of the predictions of all trees. The model is:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F$$
where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction for $x_i$, and $f_k(x_i)$ is the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree.
The objective function of the model can be defined as:
$$\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
The optimization target of the model has two parts: the first part, $\sum_{i} l(y_i, \hat{y}_i)$, is the error function of the model; the second part, $\sum_{k=1}^{K} \Omega(f_k)$, is the regularization term of the model, which represents the complexity of the $K$ trees.
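The additive prediction and the two-part objective can be made concrete with a toy sketch. It does not use the XGBoost library: the trees are hypothetical one-split stumps, the logistic loss is assumed for the error function, and the complexity term is assumed to take the form γT + ½λ‖w‖² (T leaves with weights w), a form the text itself does not spell out.

```python
import math
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], float]


def make_stump(feature: int, split: float, left: float, right: float) -> Tree:
    """A one-split tree: returns the leaf score for a sample."""
    return lambda x: right if x[feature] > split else left


def predict_score(trees: Sequence[Tree], x: Sequence[float]) -> float:
    """Prediction = sum of the leaf scores of all K trees."""
    return sum(tree(x) for tree in trees)


def objective(
    leaf_weights: Sequence[Sequence[float]],
    y_true: Sequence[int],
    scores: Sequence[float],
    gamma: float = 0.0,
    lam: float = 1.0,
) -> float:
    """Error term plus regularization term (both forms are assumptions)."""
    loss = 0.0
    for y, s in zip(y_true, scores):
        p = 1.0 / (1.0 + math.exp(-s))           # logistic link
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    penalty = sum(
        gamma * len(w) + 0.5 * lam * sum(v * v for v in w) for w in leaf_weights
    )
    return loss + penalty


trees: List[Tree] = [make_stump(0, 0.3, -0.5, 0.5), make_stump(1, 0.1, -0.25, 0.25)]
score = predict_score(trees, (0.4, 0.0))
print(score)

obj = objective([[0.5, -0.5]], [1], [0.0], gamma=1.0, lam=1.0)
print(obj)
```

For the sample (0.4, 0.0) the first stump contributes 0.5 and the second contributes -0.25, so the summed score is 0.25; a positive score corresponds to a predicted probability above 0.5 under the logistic link.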
With reference to Fig. 2, the protein-protein interaction sites are obtained with the XGBoost classification model described above. Note that the XGBoost model classifies residues as interface or non-interface residues, and the accuracy of this classification represents the ability to predict protein-protein interaction sites. Further, the obtained protein-protein interaction sites are tested and evaluated on the test set. The evaluation indices are:
Accuracy: $\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
Sensitivity (Recall): $\mathrm{Sensitivity} = \dfrac{TP}{TP + FN}$
Precision: $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
Specificity: $\mathrm{Specificity} = \dfrac{TN}{TN + FP}$
F value: $F = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
MCC value: $\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
where TP is the number of true positives, that is, positive samples that are predicted correctly; TN is the number of true negatives, that is, negative samples that are predicted correctly; FP is the number of false positives, that is, negative samples that are predicted as positive; and FN is the number of false negatives, that is, positive samples that are wrongly predicted as negative. The F value is the weighted harmonic mean of Precision and Recall and combines the two; a higher F value indicates a more effective method. MCC is a good measure for imbalanced problems; it is essentially the correlation coefficient between the true values and the predicted values and lies between -1 and 1, where -1 indicates the worst prediction and 1 the best.
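The evaluation indices can be computed from the four confusion-matrix counts as in the sketch below. The standard definitions are used, with the F value taken as the balanced harmonic mean of Precision and Recall; the counts in the example are made up for illustration and are not results from the patent, and the function does not guard against zero denominators.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the six evaluation indices from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                 # also called recall
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_value = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f_value": f_value,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MCC uses all four counts, it stays informative when one class is much larger than the other, which is why it is reported alongside accuracy for this imbalanced problem.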
The results of this embodiment for identifying protein-protein interaction sites are detailed in Table 1; the data in Table 1 show that the accuracy of this embodiment reaches up to 80.7%.
Table 1: Classification performance of the XGBoost model based on the two sampling techniques
With reference to Fig. 3, XGB denotes the method of the invention, and Li and Luo denote SVM (support vector machine) methods. Compared with the other methods, the method of the invention identifies protein-protein interaction sites more effectively. Further, in the method of the invention the sample data are preprocessed, which avoids the problem of uneven quality in the protein chain data and makes the subsequent work easier. To address the feature representation of the sample data, the invention extracts features from databases on the basis of the evolutionary conservation of amino acids and fuses them, so that the interactions of the proteins are characterized better. The imbalance of the data set is further treated to balance the data set, which improves the accuracy with which protein-protein interaction sites are identified. The protein-protein interaction sites are then obtained with the XGBoost model, which can avoid varying degrees of false positives and false negatives, greatly reduces space and time overhead, and further improves the prediction of protein-protein interaction sites.
The invention has been described in detail above with reference to specific exemplary embodiments. It should be understood, however, that various modifications and variations can be made without departing from the scope of the invention as defined by the appended claims. The detailed description and the drawings are to be regarded as illustrative only and not restrictive; any such modifications and variations fall within the scope of the invention described herein. In addition, the background art is intended to illustrate the state of development and the significance of this technology, and is not intended to limit the invention or its applications and fields of application.
Claims (10)
1. A method for identifying protein-protein interaction sites, characterized in that protein chain data are first collected and preprocessed, and the preprocessed protein chain data are divided into interface residues and non-interface residues; features of the protein chains are then extracted from databases and fused to obtain a data set; the imbalance of the data set is treated, and the treated data set is divided into a training set and a test set; an XGBoost model is trained on the training set, and the protein-protein interaction sites are finally obtained with the XGBoost model.
2. The method for identifying protein-protein interaction sites according to claim 1, characterized in that the specific steps are as follows:
1) protein chain data are first collected and preprocessed;
2) the preprocessed protein chain data are divided into interface residues and non-interface residues according to the relative accessible surface area of the amino acids and the distance between the α-carbon atoms of two residues;
3) features of the protein chains are extracted from databases on the basis of the evolutionary conservation of amino acids;
4) the extracted features are fused, and the features of each residue under test are expanded to obtain a data set;
5) the imbalance of the data set is treated, and the treated data set is divided into a training set and a test set; an XGBoost model is trained on the training set, and the protein-protein interaction sites are finally obtained with the XGBoost model.
3. The method for identifying protein-protein interaction sites according to claim 1, characterized in that the protein chain data are preprocessed as follows: protein chains with fewer than 50 residues are discarded first; protein chains whose sequence similarity is greater than or equal to 30% are then removed; outdated protein chain data are then discarded, and non-redundant protein chain data are finally obtained.
4. The method for identifying protein-protein interaction sites according to claim 1, characterized in that the preprocessed protein chain data are divided into interface residues and non-interface residues as follows: if the relative accessible surface area of an amino acid residue is at least 0.16 of its maximum accessible surface area, the residue is classified as a surface residue; among the surface residues, if the distance d between the α-carbon atoms of any two residues satisfies d < 1.2 nm, the residue is classified as an interface residue; if d ≥ 1.2 nm, the residue is classified as a non-interface residue.
5. The method for identifying protein-protein interaction sites according to claim 2, characterized in that the features of each residue under test are combined in turn with the features of the ten surface residues spatially nearest to it, which expands the feature dimension.
6. The protein-protein interaction site recognition method according to claim 2, characterized in that the data set is subjected to imbalance processing as follows:
a target sample is first selected, and the classes of its three nearest samples in the data set are then checked using the K-nearest-neighbor rule; when the classes of two or more of these samples differ from that of the selected target sample, the target sample is regarded as noise data and is deleted; the operation is repeated until no noise data remain.
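The noise-removal loop of this claim resembles the edited-nearest-neighbours idea and can be sketched as below. The sketch assumes Euclidean distance, treats every sample as a candidate target, and removes all noisy samples found in one pass before starting the next pass; the claim does not state these details, and the function name is hypothetical.

```python
import math


def remove_noise(X, y, k=3, min_disagree=2):
    """Repeatedly delete samples whose k nearest neighbours mostly disagree."""
    X, y = list(X), list(y)
    while True:
        noisy = set()
        for i in range(len(X)):
            # The k nearest other samples of sample i.
            neighbours = sorted(
                (j for j in range(len(X)) if j != i),
                key=lambda j: math.dist(X[i], X[j]),
            )[:k]
            disagree = sum(1 for j in neighbours if y[j] != y[i])
            if disagree >= min_disagree:
                noisy.add(i)
        if not noisy:
            # No noise data left: stop looping.
            return X, y
        X = [x for i, x in enumerate(X) if i not in noisy]
        y = [c for i, c in enumerate(y) if i not in noisy]
```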
7. The protein-protein interaction site recognition method according to claim 2, characterized in that the data set is subjected to imbalance processing as follows: the number of interface residues is first compared with the number of non-interface residues, and the residue class with the larger number is selected; the following equation is then used to delete the data points in the selected residue class whose IH value is greater than 0.7:
$$IH(\langle x_i, y_i \rangle) = 1 - p(y_i \mid x_i, h)$$
where $p(y_i \mid x_i, h)$ denotes the probability that the mapping function $h$ assigns the label $y_i$ to the input feature vector $x_i$.
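The thresholding step can be illustrated with the sketch below. It assumes that the probability p(yi|xi, h) of the true label has already been estimated for every sample by some classifier h (for example with cross-validation, which the claim does not specify); the function name and its arguments are hypothetical.

```python
from collections import Counter


def iht_undersample(y, p_true, threshold=0.7):
    """Return indices to keep after instance-hardness undersampling.

    y       -- class label of each sample
    p_true  -- estimated probability of the true label, p(y_i | x_i, h)
    """
    # The class with the larger number of samples is the one to thin out.
    majority = Counter(y).most_common(1)[0][0]

    keep = []
    for i, (label, p) in enumerate(zip(y, p_true)):
        hardness = 1.0 - p  # IH(<x_i, y_i>) = 1 - p(y_i | x_i, h)
        if label == majority and hardness > threshold:
            continue  # hard majority-class sample: delete it
        keep.append(i)
    return keep
```

Only samples of the larger class are deleted; hard samples of the smaller class are kept, in line with the wording of the claim.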
8. The protein-protein interaction site recognition method according to claim 2, characterized in that the formula for training the XGBoost model on the training set is:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where $F$ is the space of all classification and regression trees, $\hat{y}_i$ is the prediction result corresponding to $x_i$, and $f_k(x_i)$ denotes the prediction score of the leaf node reached when sample $x_i$ is input to the $k$-th tree;
the formula of the XGBoost model is:
$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $l$ is the error function of the model and $\Omega$ is the regularization term, which represents the complexity of the $K$ trees.
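The additive tree model and the regularized objective referred to in this claim can be illustrated numerically as follows. The sketch represents each tree as a plain Python callable and assumes the regularization term has the commonly used XGBoost form of gamma times the number of leaves plus half of lambda times the sum of squared leaf weights; that form, the parameter names, and the function names are assumptions and are not stated in the claim.

```python
def predict(trees, x):
    """Additive prediction: the sum of the scores f_k(x) over all K trees."""
    return sum(f(x) for f in trees)


def omega(leaf_weights, gamma=0.0, lam=1.0):
    """Assumed complexity of one tree: gamma * T + 0.5 * lam * sum(w^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(trees, leaf_weights_per_tree, X, y, loss, gamma=0.0, lam=1.0):
    """Error term summed over samples plus regularization summed over trees."""
    data_term = sum(loss(yi, predict(trees, xi)) for xi, yi in zip(X, y))
    reg_term = sum(omega(w, gamma, lam) for w in leaf_weights_per_tree)
    return data_term + reg_term


def squared_error(y_true, y_pred):
    """Example error function; a logistic loss would suit classification."""
    return (y_true - y_pred) ** 2
```

In practice the trees would be learned by a gradient boosting library rather than written by hand; the sketch only shows how the prediction and the objective are composed.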
9. The protein-protein interaction site recognition method according to any one of claims 1 to 8, characterized in that the obtained protein-protein interaction sites are verified using a test set.
10. The protein-protein interaction site recognition method according to any one of claims 1 to 8, characterized in that the extracted features are the residue spatial sequence profile, residue sequence information entropy, relative entropy, residue sequence conservation weight, and residue evolutionary rate.
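Two of the listed features, sequence information entropy and relative entropy, can be sketched for a single alignment column as below. The sketch assumes that each column is given as a string of one-letter amino acid codes taken from a multiple sequence alignment, that gaps have already been removed, and that a background frequency is supplied for every amino acid present; the claim does not describe how the features are computed, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_distribution(column):
    """Observed amino acid frequencies in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def shannon_entropy(column):
    """Information entropy of the column in bits (0 for a fully conserved column)."""
    return -sum(p * math.log2(p) for p in column_distribution(column).values())


def relative_entropy(column, background):
    """Relative entropy (Kullback-Leibler divergence) to background frequencies, in bits."""
    return sum(
        p * math.log2(p / background[aa])
        for aa, p in column_distribution(column).items()
    )
```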
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910686641.XA CN110265085A (en) | 2019-07-29 | 2019-07-29 | A kind of protein-protein interaction sites recognition methods |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910686641.XA CN110265085A (en) | 2019-07-29 | 2019-07-29 | A kind of protein-protein interaction sites recognition methods |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110265085A (en) | 2019-09-20 |
Family
ID=67912145
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910686641.XA Pending CN110265085A (en) | 2019-07-29 | 2019-07-29 | A kind of protein-protein interaction sites recognition methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110265085A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010165230A (en) * | 2009-01-16 | 2010-07-29 | Pharma Design Inc | Method and system for predicting protein-protein interaction as drug target |
CN109872781A (en) * | 2019-02-26 | 2019-06-11 | 哈尔滨工业大学 | Drug target recognition methods based on Xgboost |
Non-Patent Citations (2)
Title |
---|
CHANGQING MEI et al.: "Unbalance Data Processing Strategy for Protein Interaction Sites Prediction", 2018 9TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY IN MEDICINE AND EDUCATION * |
NAUFAL AZMI VERDIKHA et al.: "Study of Undersampling Method: Instance Hardness Threshold with Various Estimators for Hate Speech Classification", IJITEE * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111834010A (en) * | 2020-05-25 | 2020-10-27 | 重庆工贸职业技术学院 | COVID-19 detection false negative identification method based on attribute reduction and XGboost |
CN111834010B (en) * | 2020-05-25 | 2023-12-01 | 重庆工贸职业技术学院 | Virus detection false negative identification method based on attribute reduction and XGBoost |
CN113611360A (en) * | 2021-08-11 | 2021-11-05 | 邵阳学院 | Protein-protein interaction site prediction method based on deep learning and XGboost |
CN115497555A (en) * | 2022-08-16 | 2022-12-20 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-species protein function prediction method, device, equipment and storage medium |
CN115497555B (en) * | 2022-08-16 | 2024-01-05 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Multi-species protein function prediction method, device, equipment and storage medium |
CN115512763A (en) * | 2022-09-06 | 2022-12-23 | 北京百度网讯科技有限公司 | Method for generating polypeptide sequence, method and device for training polypeptide generation model |
CN115512763B (en) * | 2022-09-06 | 2023-10-24 | 北京百度网讯科技有限公司 | Polypeptide sequence generation method, and training method and device of polypeptide generation model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110265085A (en) | A kind of protein-protein interaction sites recognition methods | |
Bashashati et al. | A survey of flow cytometry data analysis methods | |
Nguyen et al. | Learning graph representation via frequent subgraphs | |
CN112767997A (en) | Protein secondary structure prediction method based on multi-scale convolution attention neural network | |
KR102213670B1 (en) | Method for prediction of drug-target interactions | |
CN112417132B (en) | New meaning identification method for screening negative samples by using guest information | |
CN115472221A (en) | Protein fitness prediction method based on deep learning | |
CN114743600A (en) | Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity | |
CN116206688A (en) | Multi-mode information fusion model and method for DTA prediction | |
KR20220083649A (en) | Chemical binding similarity searching method using evolutionary information of protein | |
CN115206423A (en) | Label guidance-based protein action relation prediction method | |
Dotan et al. | Effect of tokenization on transformers for biological sequences | |
CN113823356A (en) | Methylation site identification method and device | |
Golenko et al. | IMPLEMENTATION OF MACHINE LEARNING MODELS TO DETERMINE THE APPROPRIATE MODEL FOR PROTEIN FUNCTION PREDICTION. | |
CN116401369B (en) | Entity identification and classification method for biological product production terms | |
Murphy et al. | Self-supervised learning of cell type specificity from immunohistochemical images | |
CN112270950A (en) | Fusion network drug target relation prediction method based on network enhancement and graph regularization | |
CN116343915A (en) | Construction method of biological sequence integrated classifier and biological sequence prediction classification method | |
CN112735532B (en) | Metabolite identification system based on molecular fingerprint prediction and application method thereof | |
Aggarwal et al. | A review on protein subcellular localization prediction using microscopic images | |
CN112151109A (en) | Semi-supervised learning method for evaluating randomness of biomolecular cross-linking mass spectrometry identification | |
Altinier et al. | An expert system for the classification of serum protein electrophoresis patterns | |
Dong et al. | A region selection model to identify unknown unknowns in image datasets | |
Gholap et al. | Content-based tissue image mining | |
Subhashini et al. | PREDICTING SUBCELLULAR LOCALIZATION OF PROTEINS WITH MULTIPLE SITES USING THRESHOLD ML-KNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||