CN108052795A - A method for predicting G-protein coupling specificity based on feature optimization - Google Patents

A method for predicting G-protein coupling specificity based on feature optimization

Info

Publication number
CN108052795A
Authority
CN
China
Prior art keywords
protein
feature
prediction
coupling
specificities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711211883.0A
Other languages
Chinese (zh)
Inventor
江振然
余蔚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201711211883.0A
Publication of CN108052795A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for predicting G-protein coupling specificity based on feature optimization, comprising the following steps: obtaining coupling information of GPCRs and G proteins from a database and constructing a data set of protein sequences to be tested; extracting a variety of biological features from the protein sequence data of the different intracellular and extracellular regions in the data set and integrating them to obtain original feature information; performing feature evaluation and effective feature selection on the original feature information using optimization methods including the mRMR algorithm and the Relief algorithm, and selecting the optimal feature subset related to GPCR/G-protein coupling specificity; constructing an SVM classification model for G-protein coupling specificity prediction using the support vector machine classification method; and obtaining, based on the SVM classification model, sub-prediction models for predicting each coupling specificity of G proteins.

Description

A method for predicting G-protein coupling specificity based on feature optimization
Technical field
The present invention relates to the technical field of predicting the coupling specificity between G-protein-coupled receptors and G proteins, and in particular to a method for predicting G-protein coupling specificity based on multi-feature integration and optimization.
Background art
G-protein-coupled receptors (G-Protein Coupled Receptors, GPCRs) are among the most important drug targets in the pharmaceutical industry today. It is estimated that about 40% of currently marketed drugs act on GPCRs. G proteins are signal transducers that bind guanine nucleotides and have GTPase activity. By coupling with trimeric G proteins, GPCRs play a key role in transmitting extracellular signals into the cell. Detecting the coupling specificity of GPCRs by traditional biochemical experiments is time-consuming, laborious, and costly. In recent years, effective computational methods have been used to predict the coupling specificity between GPCRs and G proteins. Such prediction not only helps to elucidate the important functions of GPCRs in the cell but also helps to explore the mechanisms of cell signal transduction, thereby providing valuable clues for modern drug development. It therefore has important scientific significance and application prospects.
Because GPCRs are transmembrane proteins that cross the membrane seven times, they are difficult to crystallize, so their three-dimensional structures are usually hard to determine with existing X-ray diffraction methods. These membrane proteins also have low solubility in common solvents, so their dynamic structures in solution are not easy to determine by nuclear magnetic resonance either. Early studies of GPCR/G-protein coupling specificity mostly relied on computational methods based on sequence alignment, such as BLAST and ClustalW. The results showed that classical sequence alignment algorithms perform poorly in predicting G-protein coupling specificity. The main reason is that GPCRs coupled to the same G-protein family have relatively low sequence similarity, while some GPCRs with high sequence similarity couple to G proteins of multiple families. The latter phenomenon is called GPCR/G-protein multi-coupling, and it is a difficulty in current research on GPCR/G-protein coupling specificity.
Most current studies on G-protein coupling specificity prediction address only GPCRs that couple to a single one of three classes, Gi/o, Gq/11, and Gs. According to the region chosen for analysis, these studies fall into two main categories. The first category extracts features from the whole GPCR sequence (reference: Ghimire G.D., Imai K., Akazawa F., Tsuji T., Sonoyama M., Mitaku S., Physicochemical properties of GPCR amino acid sequences for understanding GPCR-G-protein coupling, J.Chem.Bio.Inf., 2008, 8(2):49-57). The second category extracts features from the sequences of the GPCR intracellular regions (references: Möller S., Vilo J., Croning M.D.R., Prediction of the coupling specificity of G protein coupled receptors to their G proteins, Bioinformatics, 2001, 17:S174-S181; Sgourakis N.G., Bagos P.G., Papasaikas P.K., Hamodrakas S.J., A method for the prediction of GPCRs coupling specificity to G-proteins using refined profile Hidden Markov Models, BMC Bioinformatics, 2005, 6:104-116).
From an earlier literature survey, we found that most existing studies on G-protein coupling specificity prediction address only GPCRs singly coupled to the three G-protein families Gi/o, Gq/11, and Gs, and that few studies consider coupling with G12/13 or multi-coupling. Moreover, because existing training samples are limited and the coupling mechanism is complex, it is difficult to evaluate the predictive ability of earlier methods systematically, which makes GPCR/G-protein coupling prediction challenging. Although applying new machine learning methods to predict the coupling specificity between GPCRs and G proteins is a current research focus, improving the model algorithms and selecting and optimizing effective features remain the difficulties of current research on G-protein coupling specificity prediction.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for predicting G-protein coupling specificity based on multi-feature fusion and feature optimization.
The method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention comprises the following steps:
Obtaining coupling information of GPCRs and G proteins from a database and constructing a data set of protein sequences to be tested;
Extracting a variety of biological features from the protein sequence data in the data set to obtain original feature information;
Performing feature evaluation and effective feature selection on the original feature information using the sub-prediction models, and selecting the optimal feature set related to GPCR/G-protein coupling specificity;
Combining the above feature optimization information, constructing an SVM classification model for G-protein coupling specificity prediction using the support vector machine (Support Vector Machine, SVM) classifier method;
Obtaining, based on the SVM classification model, sub-prediction models for predicting each coupling specificity of G proteins.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, constructing the data set of protein sequences to be tested includes: obtaining single-coupling information of GPCRs and G proteins from the database and constructing a single-coupling specificity prediction data subset.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, constructing the data set of protein sequences to be tested includes: obtaining multi-coupling information of GPCRs and G proteins from the database and constructing a multi-coupling specificity prediction data subset.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, protein sequences without 7 predicted transmembrane helices are further removed from the data set of protein sequences to be tested.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, protein sequences with a similarity greater than 98% are further removed from the data set of protein sequences to be tested.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, the biological features of the protein sequence data in the data set are extracted from molecular descriptor information; the description information of the molecular descriptors includes atom types, functional groups, autocorrelation indices, 3D molecular descriptors, and molecular properties.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, the biological features of the protein sequence data in the data set are extracted from GO semantic similarity. Gene Ontology (GO) terms include information on molecular function, biological process, and cellular component. The GO semantic similarity between any two target proteins is calculated using the csbl.go R package, three matrices characterizing protein semantic similarity are constructed, and an average semantic similarity is defined for each protein.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, extracting the biological features of the protein sequence data in the data set from the amino acid composition includes the following steps:
The Kiel autocorrelation feature group is selected to characterize the relevant physicochemical properties of a protein. The feature group consists of eight amino acid properties: polarizability parameter, hydrophobicity, average flexibility index, steric parameter, amino acid residue volume, free energy of amino acid solution in water, residue accessible surface area, and relative mutability. The Kiel autocorrelation is calculated as follows:
where N is the length of the protein's amino acid sequence, d is the autocorrelation lag parameter, Pi and Pi+d are the amino acid property values at positions i and i+d, and P̄ is the mean of Pi;
The 20 amino acid composition features and the Kiel autocorrelation features together form a 260-dimensional feature vector that characterizes the biological properties of each protein.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, extracting the biological features of the protein sequence data in the data set from the position-weighted amino acid composition includes the following steps:
Amino acid position information is introduced when the amino acid composition features are extracted; extracting the position-weighted amino acid composition further enriches the information content of the extracted sequence features. The calculation is as follows:
where Length is the length of the whole protein sequence and p is the position of the amino acid in the protein sequence;
The protein sequence can then be represented as a 20-dimensional feature vector:
ωAAC(seq) = [σ(A), ..., σ(a), ..., σ(Y)], a ∈ aa
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, extracting the biological features of the protein sequence data in the data set from the dipeptide composition includes the following steps:
The dipeptide content is the probability with which each element of the dipeptide set occurs among all possible dipeptides in the whole sequence;
that is, the number of times the dipeptide a1a2 occurs in the sequence divided by Length-1, the number of all possible dipeptides in the whole sequence. If the dipeptide content is extracted as the feature, the protein sequence can be expressed as a feature vector of 20 × 20 dimensions.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, the biological features of the protein sequence data in the data set are extracted from the length of each intracellular region, and a transmembrane region prediction tool is used to select the intracellular region sequences of the G-protein-coupled receptors.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, extracting the biological features of the protein sequence data in the data set from the composition, transition, and distribution features includes the following steps:
The amino acid composition-transition-distribution feature encoding method is used to examine eight amino acid physicochemical properties: polarity, charge, polarizability, water solubility, van der Waals volume, hydropathy, secondary structure, and side-chain group. The composition-transition-distribution features extracted from the whole protein sequence have 168 dimensions, and the features extracted from the 4 intracellular regions have 672 dimensions in total.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, the protein sequence data in the data set are selected by whole-sequence selection: the GPCR intracellular region sequences and the whole GPCR sequences are each selected for experiments. The GPCR intracellular regions are generally considered to be the regions that couple with G proteins and to contain information about coupling specificity.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, the protein sequence data in the data set are selected by intracellular region sequence selection, and the intracellular region sequences are selected using TMHMM2.0 or TopPred.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, constructing the SVM classification model for G-protein coupling specificity prediction includes the following steps: the extracted features are integrated by linear combination to obtain a set of feature vectors that can be used for GPCR coupling specificity prediction, and a feature-optimization-based strategy is used to comprehensively analyze the selected feature subsets.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, feature evaluation based on the mRMR algorithm includes the following steps:
Processing and converting the data; calculating the distributions and mutual information between features and between features and the response variable;
Computing mRMR scores for the features and ranking them.
In the method for predicting G-protein coupling specificity based on feature optimization according to claim 1, feature evaluation based on Relief includes the following steps:
A sample R is randomly selected from the training set D, and its nearest neighbor sample H is found among the samples of the same class as R;
The nearest neighbor sample M is found among the samples of a different class from R, and the weight of each feature is updated according to the following rule: if the distance between R and H on a feature is smaller than the distance between R and M on that feature, the feature helps to distinguish nearest neighbors of the same class from those of different classes, and its weight is increased; otherwise, its weight is decreased;
The above process is repeated m times to obtain the average weight of each feature; the larger the weight of a feature, the stronger its classification ability.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, effective feature selection based on the incremental selection algorithm includes the following steps:
Selecting a candidate feature set with an evaluation function that is independent of the classifier;
Applying the classification algorithm to the candidate feature set and selecting the feature subset with classification accuracy as the evaluation criterion.
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, effective feature selection based on the forward search algorithm includes the following step: the feature subset X starts from the empty set, and each time one feature x is selected and added to X so that the criterion function J(X) is optimal.
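The forward search step can be sketched as follows. This is a minimal illustration rather than the patented implementation: `criterion` is a hypothetical callback standing in for J(X) (for example a cross-validated classification accuracy), and stopping when no candidate improves J is an assumption, since the text does not state a stopping rule.

```python
def forward_feature_search(n_features, criterion, max_size=None):
    """Sequential forward selection starting from the empty set.

    criterion: hypothetical callback J(X); takes a list of feature indices
               and returns a score to maximize.
    """
    selected, best_score = [], float("-inf")
    limit = n_features if max_size is None else max_size
    while len(selected) < limit:
        candidates = [f for f in range(n_features) if f not in selected]
        # Try each remaining feature x and keep the one that maximizes J(X + [x]).
        scored = [(criterion(selected + [f]), f) for f in candidates]
        score, feature = max(scored)
        if score <= best_score:  # assumed stopping rule: no candidate improves J
            break
        selected.append(feature)
        best_score = score
    return selected, best_score
```

Because each round re-evaluates J for every remaining feature, the cost grows quadratically with the number of features, so in practice the search would likely be run on a pre-ranked candidate set.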
In the method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention, after the prediction model is constructed, the method further includes verifying the stability and generalization ability of the sub-prediction models.
On the basis of analyzing how combinations of various feature extraction methods and sequence selection methods affect the prediction of G-protein coupling specificity, the method of the present invention first introduces Relief and minimum redundancy maximum relevance (mRMR) for feature evaluation, and then selects the optimal feature subset using IFS (the incremental selection algorithm) and FFS (the forward search algorithm).
The number of known multi-coupling GPCR sequences is relatively small, and classification methods for multi-label problems have always been a difficulty in classification research. For the two different cases of single coupling and multi-coupling, the present invention uses the support vector machine method to build separate classifier models based on multi-feature integration, and achieves satisfactory prediction results. In addition, a variety of related features such as molecular descriptors, GO semantic similarity, and homology information are integrated to achieve effective prediction of multi-coupling.
The present invention uses the mRMR and Relief methods to select and optimize the chosen features, which effectively avoids the noise that diverse features may introduce. The mRMR algorithm removes redundant features while ensuring maximum relevance, which yields a "purest" feature subset (the features differ greatly from each other, and each is highly correlated with the target variable). The Relief algorithm is a feature weighting algorithm that assigns different weights to features according to the correlation of each feature with the class; features whose weight falls below a threshold are removed. In the Relief algorithm, the correlation between a feature and the class is based on the feature's ability to discriminate between samples that are close to each other.
The present invention has the following beneficial effects:
Based on the sequence features of GPCRs and other features, the prediction method of the present invention establishes a multi-feature integration model with good predictive ability and, on this basis, forms a method for selecting and optimizing the feature subsets related to G-protein coupling, effectively avoiding the noisy data that diverse features may introduce. Comparison of different experiments demonstrates the validity of the method of the present invention for predicting G-protein coupling specificity.
Description of the drawings
Fig. 1 is the basic flow chart of predicting GPCR/G-protein coupling specificity with the method of the present invention.
Fig. 2 shows single-coupling feature selection based on the Relief method in the method of the present invention.
Fig. 3 shows single-coupling feature selection based on the mRMR method in the method of the present invention.
Fig. 4 shows multi-coupling feature selection based on the two feature optimization methods in the method of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the following specific embodiments and the drawings. Except for the content specifically mentioned below, the processes, conditions, and experimental methods for implementing the present invention are general knowledge and common knowledge in the art, and the present invention is not particularly limited in these respects.
Fig. 1 shows the flow chart of predicting GPCR/G-protein coupling specificity with multi-feature integration and optimization.
The method for predicting G-protein coupling specificity based on feature optimization proposed by the present invention comprises the following steps:
Obtaining the coupling information relating GPCRs and G proteins from a database and constructing a data set of protein sequences to be tested;
Extracting a variety of biological features from the protein sequence data in the data set to obtain original feature information;
Performing feature evaluation and effective feature selection on the original feature information using the sub-prediction models, and selecting the optimal feature set related to GPCR/G-protein coupling specificity;
Combining the above feature optimization information, constructing an SVM classification model for G-protein coupling specificity prediction using the support vector machine classifier method;
Obtaining, based on the SVM classification model, sub-prediction models for predicting each coupling specificity of G proteins.
In the present invention, constructing the data set of protein sequences to be tested includes: obtaining single-coupling information of GPCRs and G proteins from the database and constructing a single-coupling specificity prediction data subset.
In the present invention, constructing the data set of protein sequences to be tested includes: obtaining multi-coupling information of GPCRs and G proteins from the database and constructing a multi-coupling specificity prediction data subset.
In the present invention, protein sequences without 7 predicted transmembrane helices are further removed from the data set of protein sequences to be tested.
In the present invention, protein sequences with a similarity greater than 98% are further removed from the data set of protein sequences to be tested.
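The two filtering steps above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`id`, `seq`, `pred_helices`) is hypothetical, and `difflib.SequenceMatcher.ratio()` is a crude stand-in for the pairwise sequence similarity that an alignment tool would report. The patent does not say how similarity is computed.

```python
from difflib import SequenceMatcher


def filter_dataset(records, max_similarity=0.98, required_helices=7):
    """Keep sequences with 7 predicted TM helices and drop near-duplicates.

    records: list of dicts with hypothetical keys 'id', 'seq', 'pred_helices'.
    """
    kept = []
    for rec in records:
        # Step 1: discard sequences not predicted to have 7 transmembrane helices.
        if rec["pred_helices"] != required_helices:
            continue
        # Step 2: greedy redundancy removal against the sequences already kept.
        # SequenceMatcher.ratio() is an assumed proxy for alignment similarity.
        redundant = any(
            SequenceMatcher(None, rec["seq"], k["seq"], autojunk=False).ratio()
            > max_similarity
            for k in kept
        )
        if not redundant:
            kept.append(rec)
    return kept
```

The greedy pass keeps the first representative of each group of near-identical sequences, so the result depends on input order.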
In the present invention, the biological features of the protein sequence data in the data set are extracted from molecular descriptor information. The present invention uses the E-Dragon online tool to calculate functional group information; the tool provides description information for a variety of molecular descriptors, including features such as atom types, functional groups, autocorrelation indices, 3D molecular descriptors, and molecular properties.
In the present invention, the biological features of the protein sequence data in the data set are extracted from GO semantic similarity. Gene Ontology (GO) is a widely used ontology in bioinformatics. GO terms include information on molecular function (MF), biological process (BP), and cellular component (CC); all GO terms related to GPCR proteins are taken from the UniProt Knowledgebase. We use the csbl.go R package to calculate the GO semantic similarity between any two target proteins. Finally, we construct three matrices characterizing protein semantic similarity and define an average semantic similarity for each protein.
In the present invention, extracting the biological features of the protein sequence data in the data set from the amino acid composition includes the following steps:
The present invention selects the Kiel autocorrelation feature group in the PROFEAT database to characterize the relevant physicochemical properties of a protein. The feature group consists mainly of eight amino acid properties: polarizability parameter (30 dimensions), hydrophobicity (30 dimensions), average flexibility index (30 dimensions), steric parameter (30 dimensions), amino acid residue volume (30 dimensions), free energy of amino acid solution in water (30 dimensions), residue accessible surface area (30 dimensions), and relative mutability (30 dimensions). The Kiel autocorrelation is calculated as follows:
where N is the length of the protein's amino acid sequence, d is the autocorrelation lag parameter, Pi and Pi+d are the amino acid property values at positions i and i+d, and P̄ is the mean of Pi. Finally, we combine the 20 amino acid composition features and the Kiel autocorrelation features into a 260-dimensional feature vector that characterizes the biological properties of each protein.
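The 260-dimensional vector described above (20 composition values plus 8 properties × 30 lags) can be sketched as follows. The autocorrelation formula itself is not reproduced in the text, so this sketch assumes the Geary form of autocorrelation, which uses the same symbols (N, d, Pi, Pi+d, and the mean of Pi); whether "Kiel autocorrelation" refers to that descriptor is an assumption. The property scales passed in are hypothetical placeholders for real amino acid index tables.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids (20 dimensions)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def geary_autocorrelation(seq, prop, max_lag=30):
    """Geary-type autocorrelation for lags d = 1..max_lag (assumed form).

    prop: dict mapping amino acid letter -> property value (hypothetical scale).
    """
    p = [prop[a] for a in seq]
    n = len(p)
    mean = sum(p) / n
    variance = sum((x - mean) ** 2 for x in p) / (n - 1)
    feats = []
    for d in range(1, max_lag + 1):
        if d >= n or variance == 0:
            feats.append(0.0)  # lag is not defined for this sequence
            continue
        num = sum((p[i] - p[i + d]) ** 2 for i in range(n - d)) / (2 * (n - d))
        feats.append(num / variance)
    return feats


def sequence_features(seq, property_scales, max_lag=30):
    """20 composition values followed by max_lag values per property scale."""
    feats = amino_acid_composition(seq)
    for prop in property_scales:
        feats.extend(geary_autocorrelation(seq, prop, max_lag))
    return feats
```

With eight property scales and `max_lag=30`, `sequence_features` returns 20 + 8 × 30 = 260 values, matching the dimension stated in the text.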
In the present invention, the biological features of the protein sequence data in the data set are extracted from the position-weighted amino acid composition. Using the amino acid composition alone often loses information contained in the protein sequence, such as residue order and length. The present invention therefore introduces amino acid position information when extracting the amino acid composition features; extracting the position-weighted amino acid composition further enriches the information content of the extracted sequence features. The calculation is as follows:
where Length is the length of the whole protein sequence and p is the position of the amino acid in the protein sequence. The protein sequence can therefore be represented as a 20-dimensional feature vector:
ωAAC(seq) = [σ(A), ..., σ(a), ..., σ(Y)], a ∈ aa
In the present invention, extracting the biological features of the protein sequence data in the data set from the dipeptide composition includes the following steps:
The dipeptide content can be described as the probability with which each element of the dipeptide set occurs among all possible dipeptides in the whole sequence;
that is, the number of times the dipeptide a1a2 occurs in the sequence divided by Length-1, the number of all possible dipeptides in the whole sequence. If the dipeptide content is extracted as the feature, the protein sequence can be expressed as a feature vector of 20 × 20 dimensions.
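The dipeptide content described above (occurrence count divided by Length-1) can be sketched as follows. Skipping pairs that contain non-standard residues is an assumption of this sketch, not something the text specifies.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs of standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    total = len(seq) - 1  # number of overlapping dipeptide positions
    if total <= 0:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # assumed: ignore pairs with non-standard residues
            counts[pair] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]
```

For the toy sequence `"ACAC"` there are three dipeptide positions (AC, CA, AC), so the entry for AC is 2/3 and the entry for CA is 1/3.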
In the present invention, extracting the biological features of the protein sequence data in the data set from the length of each intracellular region includes the following steps:
The present invention uses the transmembrane region prediction tool TMHMM2.0 to select the intracellular region sequences of the G-protein-coupled receptors.
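A minimal sketch of deriving the intracellular regions and their lengths from a predicted topology is given below. It assumes a topology string in the style of the TMHMM short output (for example `o11-20i31-40o`, where `i` marks an inside segment, `o` an outside segment, and `a-b` a transmembrane helix); the exact format should be checked against the tool's output, and only the topology value itself should be passed in. The function names are hypothetical.

```python
import re


def intracellular_regions(topology, seq_length):
    """Return 1-based inclusive (start, end) spans of intracellular segments.

    topology: assumed TMHMM-style string such as 'o5-10i15-20o'.
    """
    helices = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]
    sides = re.findall(r"[io]", topology)  # one label per non-membrane segment
    # Segment boundaries: N-terminus, gaps between helices, C-terminus.
    starts = [1] + [end + 1 for _, end in helices]
    ends = [start - 1 for start, _ in helices] + [seq_length]
    return [
        (s, e)
        for side, s, e in zip(sides, starts, ends)
        if side == "i" and e >= s
    ]


def intracellular_lengths(topology, seq_length):
    """Length of each intracellular segment, in sequence order."""
    return [e - s + 1 for s, e in intracellular_regions(topology, seq_length)]
```

For a receptor with an extracellular N-terminus and seven helices, this yields four intracellular segments (three loops and the C-terminal tail), which is consistent with the 4 intracellular regions mentioned in the text.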
In the present invention, the biological features of the protein sequence data in the data set are extracted from the composition, transition, and distribution features, using the amino acid composition-transition-distribution (CTD) feature encoding method. Eight amino acid physicochemical properties were examined in the experiments: polarity, charge, polarizability, water solubility, van der Waals volume, hydropathy, secondary structure, and side-chain group. The CTD features extracted from the whole protein sequence have 168 dimensions, and the features extracted from the 4 intracellular regions have 672 dimensions in total.
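The dimension counts follow from the usual CTD layout: each property contributes 3 composition, 3 transition, and 15 distribution values (21 in total), so 8 properties give 168 values, and 4 regions give 672. The sketch below computes the 21 values for one property. The three-class hydrophobicity grouping is an assumption used for illustration, since the patent does not list its groupings, and the rounding rule for the distribution quantiles is likewise one of several conventions.

```python
import math

# Assumed three-class grouping for one property (hydrophobicity).
HYDROPHOBICITY_GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")


def ctd_features(seq, groups=HYDROPHOBICITY_GROUPS):
    """Composition (3) + transition (3) + distribution (15) = 21 values."""
    label = {aa: g for g, members in enumerate(groups) for aa in members}
    classes = [label[aa] for aa in seq if aa in label]
    n = len(classes)

    # Composition: fraction of residues in each class.
    composition = [classes.count(g) / n for g in range(3)]

    # Transition: fraction of adjacent pairs that switch between two classes.
    transition = []
    for g, h in ((0, 1), (0, 2), (1, 2)):
        switches = sum(1 for a, b in zip(classes, classes[1:]) if {a, b} == {g, h})
        transition.append(switches / (n - 1))

    # Distribution: relative position (%) of the first, 25%, 50%, 75%, and
    # last residue of each class along the sequence.
    distribution = []
    for g in range(3):
        positions = [i + 1 for i, c in enumerate(classes) if c == g]
        if not positions:
            distribution.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, math.ceil(frac * len(positions)))  # k-th occurrence
            distribution.append(100.0 * positions[k - 1] / n)
    return composition + transition + distribution
```

Calling `ctd_features` once per property and concatenating the results over eight properties gives the 168-dimensional whole-sequence vector; repeating that for each of the four intracellular regions gives 672 dimensions.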
In the present invention, the protein sequence data in the data set are selected by whole-sequence selection. The GPCR intracellular region sequences and the whole GPCR sequences are each selected for experiments. The GPCR intracellular regions are the regions that couple with G proteins and contain information about coupling specificity, but because the positions and lengths of the intracellular regions differ from one GPCR sequence to another, extracting the corresponding information is somewhat difficult. In addition, it depends to some extent on the accuracy of the transmembrane region prediction tool. Feature encoding of the whole protein sequence is therefore more common in protein classification prediction research.
In the present invention, the protein sequence data in the data set are selected by intracellular region sequence selection. Tools currently available for transmembrane region prediction include TMHMM2.0 and TopPred. With these tools, features can be extracted in a targeted way; clearly, the prediction results of a classification model based on intracellular regions depend on the accuracy of the transmembrane region prediction tool. The present invention uses the transmembrane region prediction tool TMHMM2.0 to select the intracellular region sequences.
In the present invention, svm classifier model of the structure for the prediction of G-protein coupling specificities includes the following steps:
For the feature after said extracted, the present invention carries out feature integration using the method for linear combination, so as to which obtain can For the feature vector set of GPCR coupling specificities prediction.One of the present invention is mainly characterized by using a kind of based on spy The synthesis of character subset is analyzed comprehensively selected by the strategy progress of sign optimization.
In the present invention, feature evaluation based on the mRMR algorithm includes the following steps:
Maximum relevance minimum redundancy (mRMR) considers not only the correlation between each feature and the label but also the correlation between each feature and the other features. The measure used is mutual information. The mRMR algorithm consists of the following steps: (1) process and transform the raw data; (2) compute the distributions of, and the mutual information between, features and between features and the response variable; (3) compute mRMR scores for the relevant features and rank them.
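The three steps can be sketched as follows for already discretized features. This is a minimal illustration using the difference form of the score (relevance minus mean redundancy); the patent does not state which form it uses, and the function names are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def mrmr_rank(columns, labels):
    """Rank feature indices greedily by relevance minus mean redundancy.

    `columns` is a list of feature columns, each a list of discrete values.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected, remaining = [], list(range(len(columns)))
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Continuous features such as compositions would first need to be binned, which corresponds to step (1).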
In the present invention, feature evaluation based on Relief includes the following steps:
The Relief algorithm is a feature subset selection method based on weight search: each feature is assigned a weight according to its correlation with the class, and features whose weight falls below a threshold are removed. In Relief, the correlation between a feature and the class is based on the feature's ability to distinguish between nearby samples. The algorithm consists of the following steps: (1) Randomly select a sample R from the training set D, then find its nearest neighbor H among the samples of the same class as R. (2) Find the nearest neighbor M among the samples of a different class from R, then update the weight of each feature according to the following rule: if the distance between R and H on a feature is less than the distance between R and M on that feature, the feature helps to distinguish same-class from different-class nearest neighbors, so its weight is increased; otherwise its weight is decreased. (3) Repeat the above process m times to obtain the average weight of each feature. The larger the weight of a feature, the stronger its classification ability.
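A minimal sketch of the weight update described above, for a two-class problem with features assumed to be scaled to [0, 1]. The Manhattan distance used for the nearest-neighbor search and the function name are assumptions of this sketch.

```python
import random

def relief_weights(samples, labels, m, seed=0):
    """Average Relief weight of each feature after m random draws.

    Assumes every class has at least two samples and that feature values
    are scaled to [0, 1].
    """
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    indices = range(len(samples))
    for _ in range(m):
        r = rng.randrange(len(samples))
        # Nearest hit H: closest sample of the same class as R.
        hits = [i for i in indices if i != r and labels[i] == labels[r]]
        # Nearest miss M: closest sample of a different class.
        misses = [i for i in indices if labels[i] != labels[r]]
        h = min(hits, key=lambda i: distance(samples[r], samples[i]))
        mi = min(misses, key=lambda i: distance(samples[r], samples[i]))
        for f in range(n_features):
            # Reward separation from the miss, penalize separation from the hit.
            weights[f] += (abs(samples[r][f] - samples[mi][f])
                           - abs(samples[r][f] - samples[h][f])) / m
    return weights
```

Features whose averaged weight falls below a chosen threshold would then be discarded.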
In the present invention, effective feature selection is carried out with an incremental selection algorithm. Incremental learning means that a learning system can continually learn new knowledge from new samples while retaining most of the knowledge it has already learned. The present invention adopts an incremental feature selection (IFS) algorithm. The algorithm first selects a candidate feature set using a classifier-independent evaluation function, then applies the classification algorithm to the candidate feature set and uses classification accuracy as the evaluation criterion for selecting the feature subset.
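One common reading of IFS is to evaluate nested top-k subsets of a ranked feature list and keep the subset with the highest accuracy. The sketch below follows that reading; the evaluate callback is hypothetical and stands in for, e.g., the cross-validated accuracy of an SVM trained on the given features.

```python
def incremental_feature_selection(ranked, evaluate):
    """Evaluate nested top-k subsets of a ranked feature list.

    `ranked` lists feature indices from best to worst (for example from
    mRMR or Relief). `evaluate` is a caller-supplied, hypothetical function
    mapping a list of feature indices to a classification accuracy.
    Returns the best subset, its score and the full accuracy curve.
    """
    best_subset, best_score, curve = [], float("-inf"), []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate(subset)
        curve.append(score)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score, curve
```

The returned curve is the kind of data a feature-number versus accuracy plot would be drawn from.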
In the present invention, effective feature selection based on a forward search algorithm includes the following steps: the forward feature selection algorithm starts from the empty set and, according to a given rule, repeatedly adds to the selected feature set the feature that is most relevant to the problem under study and that forms the best combination with the features already selected. The main steps are: the feature subset X starts as the empty set; at each iteration, the feature x that makes the evaluation function J(X) optimal is selected and added to X.
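The greedy procedure can be sketched as below. The criterion callback plays the role of J(X) and is hypothetical; stopping as soon as no candidate improves J is an assumption of this sketch, since the text does not give a stopping rule.

```python
def forward_feature_selection(n_features, criterion):
    """Greedy forward search over feature indices 0..n_features-1.

    `criterion` is a caller-supplied, hypothetical function J mapping a
    list of feature indices to a score (higher is better).
    """
    selected, best = [], float("-inf")
    remaining = set(range(n_features))
    while remaining:
        candidate, candidate_score = None, best
        for f in sorted(remaining):
            score = criterion(selected + [f])
            if score > candidate_score:
                candidate, candidate_score = f, score
        if candidate is None:
            break  # no single feature improves J any further
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best
```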
In the present invention, the method further comprises, after the sub-prediction models are built, verifying the stability and generalization ability of the sub-prediction models.
The method proposed by the present invention is implemented in three main steps:
(i) Compile the latest information on coupling between GPCRs and G proteins and construct a reliable data set.
(1) Construction of the single-coupling specificity prediction data set: single-coupling information is extracted from the G-protein/GPCR coupling information compiled from databases such as gpDB, and the 7 transmembrane helix regions of the corresponding protein sequences are predicted. Sequences for which 7 transmembrane helices are not predicted are removed, as are sequences with similarity greater than 98%;
(2) Construction of the multi-coupling specificity prediction data set: multi-coupling information is extracted from the G-protein/GPCR coupling information compiled from databases such as gpDB, and the 7 transmembrane helix regions of the corresponding protein sequences are predicted. Sequences for which 7 transmembrane helices are not predicted are removed, as are sequences with similarity greater than 98%;
The present invention can predict single-coupling specificity and multi-coupling specificity at the same time.
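The two filtering rules used for both data sets can be sketched as follows. The record layout and function names are hypothetical, and the similarity function is only a stand-in: difflib's ratio is used so that the example is self-contained, whereas a real pipeline would more likely use an alignment-based similarity.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in for an alignment-based sequence similarity (assumption).
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def build_dataset(records, threshold=0.98):
    """Filter (name, sequence, predicted_helix_count) records.

    Drops sequences without exactly 7 predicted transmembrane helices and
    sequences more than `threshold` similar to one that was already kept.
    """
    kept = []
    for name, seq, helices in records:
        if helices != 7:
            continue
        if any(similarity(seq, other) > threshold for _, other, _ in kept):
            continue
        kept.append((name, seq, helices))
    return kept
```

Because the filter is greedy, which member of a group of near-identical sequences survives depends on the input order.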
(ii) Analyze the effect of combinations of feature extraction methods and sequence selection methods on predicting G-protein coupling specificity.
The effect of combinations of feature extraction methods and sequence selection methods on predicting G-protein coupling specificity is analyzed and verified. The feature extraction methods include molecular descriptor information, GO semantic similarity, amino acid composition, position-weighted amino acid composition, dipeptide composition, the length of each intracellular region, and composition-transition-distribution (CTD).
The above feature extraction methods can be used in combination, depending on the needs of the specific experiment. The feature selection and optimization of the present invention likewise starts from the existing original features, which are analyzed, selected and combined for use.
Sequence selection consists of choosing either the whole sequence or the intracellular region sequences. In addition, the influence of each of the four intracellular regions on the result of the prediction method was studied separately.
The whole-sequence selection and the intracellular region sequence selection proposed by the present invention can be used alone and/or in combination. In preliminary experiments, each of the above feature extraction methods was analyzed and compared on the intracellular regions of GPCRs from three families; the results are shown in Table 1 below.
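Two of the listed encodings, amino acid composition and dipeptide composition, and their combination by concatenation can be sketched as follows. The function names are illustrative, and concatenation is only one plausible reading of combining feature sets.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def amino_acid_composition(seq):
    """20-dimensional vector of residue frequencies."""
    n = len(seq) or 1  # avoid division by zero for an empty sequence
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """400-dimensional vector of adjacent residue-pair frequencies."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs) or 1
    counts = Counter(pairs)
    return [counts[d] / total for d in DIPEPTIDES]

def combined_features(seq):
    """Concatenate both encodings into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

The same functions can be applied to a whole sequence or to each intracellular region separately.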
Table 1. Comparison of results for different sequence regions and feature extraction methods
Feature                                     Whole sequence    Intracellular sequence
Amino acid composition                      67.80%            89.55%
Position-weighted amino acid composition    76.97%            70.79%
Dipeptide composition                       82.30%            89.34%
Composition-transition-distribution         89.13%            73.77%
(iii) Based on the SVM method, carry out effective feature selection on the selected features using the mRMR and Relief methods. The details are as follows:
Two methods, Relief and mRMR, are used for feature evaluation, and IFS (incremental feature selection) and FFS (forward feature search) are used to choose the optimal feature subset, so as to obtain as much feature information related to GPCR/G-protein coupling specificity as possible.
Figs. 2 to 4 show single-coupling and multi-coupling feature selection based on the two feature optimization methods. It can be seen that the two methods proposed by the present invention effectively select the feature vectors related to the coupling specificity between G-protein-coupled receptors and G proteins. The performance of classifiers based on these features is clearly improved.
Table 2. Comparison of results for different feature encodings on different intracellular regions
Feature                                  ICL1      ICL2      ICL3      C-terminus
Amino acid content                       79.74%    75.27%    65.88%    65.25%
Intracellular region length              79.53%    76.97%    73.35%    67.59%
Amino acid physicochemical properties    83.37%    73.35%    73.35%    68.44%
As shown in Table 2, for each of the above feature extraction methods, the overall accuracy (ACC) obtained on each intracellular region of GPCRs from three families is summarized and compared. The result is best when features are extracted from all four intracellular regions simultaneously. The likely reason is that features extracted from all four intracellular regions have a higher dimension and therefore contain more coupling information. The research of the present invention therefore concentrates on feature extraction from the four intracellular regions. The above is the analysis of the characteristics of the different intracellular regions of GPCRs.
Table 3. Prediction results for GPCR/G-protein single coupling based on mRMR
To further test the effect of feature selection optimization on prediction, the single-coupling and multi-coupling GPCR/G-protein models were trained and tested using the mRMR and Relief methods. As shown in Table 3, five-fold cross-validation was used to estimate the accuracy of the model. Table 3 gives the five-fold cross-validation results based on mRMR and Relief.
Table 4. Prediction results for GPCR/G-protein multi-coupling based on mRMR
Table 4 gives the multi-coupling prediction results based on the mRMR method. The above results show that after the mRMR and Relief methods are added, the prediction performance for both single coupling and multi-coupling improves to varying degrees, which also shows that the method proposed by the present invention is effective.
The results show that after processing with the Relief method, the single-coupling classes Gi/o (156 features), Gq/11 (54 features) and Gs (183 features) all achieved good prediction performance. After processing with the mRMR method, the single-coupling classes Gi/o (54 features), Gq/11 (249 features) and Gs (49 features) all achieved good prediction performance. This shows that the IFS (incremental feature selection) and FFS (forward feature search) algorithms used in the present invention help to choose the optimal feature subset effectively.
To examine the stability and generalization ability of the G-protein coupling specificity prediction model systematically, different test methods were also compared with respect to the validity of the prediction model. In the self-consistency test, the training set and the test set are identical, so the prediction results are high; this nevertheless shows that the model has good stability, that is, a high ability to discriminate the samples used to train it. In the independent test set method, the training set and the test set are two disjoint subsets obtained by randomly dividing the sample set; the prediction results are good, and the single-coupling classification model that was built can be used for prediction. The results of the leave-one-out method (each time one sample is held out as the test set and the other samples form the training set, so that k samples require k rounds of training and testing) and of cross-validation (s samples are randomly selected from the whole training data S as the training set, and the rest are used as the test set) also demonstrate the good performance of the prediction model.
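The leave-one-out and k-fold splits, together with the overall accuracy (ACC), can be sketched as index generators. The function names and the fixed random seed are assumptions; k = 5 matches the five-fold cross-validation used for Table 3.

```python
import random

def leave_one_out(n):
    """Yield (train_indices, test_indices) with one held-out sample each."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def k_fold(n, k=5, seed=0):
    """Yield k (train_indices, test_indices) pairs over a shuffled index set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def accuracy(y_true, y_pred):
    """Overall accuracy (ACC): fraction of correctly predicted samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Each sample appears in exactly one test fold, so averaging the per-fold accuracies gives one estimate over the whole data set.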
Therefore, the feature-optimization-based G-protein coupling specificity prediction method proposed by the present invention can optimize the selection of regional features, thereby obtaining relevant features that describe coupling characteristics accurately and have relatively low dimension, and thus achieves the purpose of effectively predicting potential coupling between G-protein-coupled receptors and G proteins.
The protection scope of the present invention is not limited to the above embodiments. Variations and advantages that those skilled in the art can conceive of without departing from the spirit and scope of the invention are all included in the present invention, and the scope of protection is defined by the appended claims.

Claims (11)

  1. A method for predicting G-protein coupling specificity based on feature optimization, characterized by comprising the following steps:
    obtaining coupling information between GPCRs and G proteins from the gpDB database, and building the protein sequence data set to be tested;
    extracting a plurality of different biological features of the protein sequence data in the data set to obtain original feature information;
    performing feature evaluation and effective feature selection on the original feature information simultaneously using sub-prediction models, and selecting the optimal feature subset related to GPCR/G-protein coupling specificity;
    combining the above feature optimization information, building an SVM classification model for G-protein coupling specificity prediction using the support vector machine classifier method;
    obtaining, based on the SVM classification model, sub-prediction models for predicting each G-protein coupling specificity.
  2. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized in that building the protein sequence data set to be tested includes: obtaining single-coupling information of GPCRs and G proteins from the database, and building a single-coupling specificity prediction data subset.
  3. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized in that building the protein sequence data set to be tested includes: obtaining multi-coupling information of GPCRs and G proteins from the database, and building a multi-coupling specificity prediction data subset.
  4. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, 2 or 3, characterized in that, for the protein sequence data set to be tested, protein sequences for which 7 transmembrane helices are not predicted are further removed.
  5. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, 2 or 3, characterized in that, for the protein sequence data set to be tested, protein sequences with similarity greater than 98% are further removed.
  6. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized in that the biological features of the protein sequence data in the data set are extracted by GO semantic similarity; GO terms cover molecular function (MF), biological process (BP) and cellular component (CC) information; the GO semantic similarity between any two target proteins is calculated using the csbl.go R package, three matrices characterizing protein semantic similarity are constructed, one for each ontology R, and the average semantic similarity of each protein is defined (i = 1, 2, ..., n_g; R = MF, BP, CC).
  7. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized in that the biological features of the protein sequence data in the data set are extracted by the length of each intracellular region, and a transmembrane region prediction tool is used to effectively select the intracellular region sequences of the G-protein-coupled receptor.
  8. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized in that extracting the biological features of the protein sequence data in the data set by composition-transition-distribution includes the following steps:
    using the amino acid composition-transition-distribution feature encoding method, investigating eight amino acid physicochemical properties, namely polarity, charge, polarizability, water solubility, van der Waals volume, hydrophobicity, secondary structure and side-chain group, wherein the composition-transition-distribution features extracted from the whole protein sequence have 168 dimensions, and the features extracted from the 4 intracellular regions total 672 dimensions.
  9. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized in that the protein sequence data in the data set are chosen by whole-sequence selection, GPCR intracellular region sequences and whole GPCR sequences are respectively chosen for experiments, and the GPCR intracellular regions are generally regarded as the main regions coupling with the G protein and contain the information related to coupling specificity.
  10. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized in that building the SVM classification model for G-protein coupling specificity prediction includes the following steps: integrating the extracted features by linear combination, thereby obtaining the set of feature vectors that can be used for GPCR coupling specificity prediction, and using a feature-optimization strategy to comprehensively analyze combinations of the selected feature subsets.
  11. The method for predicting G-protein coupling specificity based on feature optimization according to claim 1, characterized by further comprising, after the sub-prediction models are built: a step of verifying the stability and generalization ability of the sub-prediction models.
CN201711211883.0A 2017-11-28 2017-11-28 A kind of method of the G-protein coupling specificities prediction of feature based optimization Pending CN108052795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711211883.0A CN108052795A (en) 2017-11-28 2017-11-28 A kind of method of the G-protein coupling specificities prediction of feature based optimization

Publications (1)

Publication Number Publication Date
CN108052795A true CN108052795A (en) 2018-05-18

Family

ID=62120713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711211883.0A Pending CN108052795A (en) 2017-11-28 2017-11-28 A kind of method of the G-protein coupling specificities prediction of feature based optimization

Country Status (1)

Country Link
CN (1) CN108052795A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033753A (en) * 2018-06-07 2018-12-18 浙江工业大学 A kind of group's Advances in protein structure prediction based on the assembling of secondary structure segment
CN111710360A (en) * 2020-05-27 2020-09-25 广州大学 Method, system, device and medium for predicting protein sequence
CN113270153A (en) * 2021-05-27 2021-08-17 南华大学 Screening method of compound targeting G protein coupled receptor

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258146A (en) * 2013-05-13 2013-08-21 中国人民解放军第二军医大学 Delaminating and classifying method for G-protein-coupled receptor family
WO2017040520A1 (en) * 2015-08-31 2017-03-09 Hitachi Chemical Co., Ltd. Molecular methods for assessing urothelial disease
CN106960131A (en) * 2017-05-05 2017-07-18 华东师范大学 A kind of drug side-effect Forecasting Methodology based on multi-feature fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
W. YU 等: "Using Feature Selection Technique for Drug-Target Interaction Networks Prediction", 《CURRENT MEDICINAL CHEMISTRY》 *
余蔚明: "药物_靶标相互作用网络预测方法研究", 《中国优秀硕士学位论文全文数据库 医药卫生科技辑》 *
李丹丹: "基于支持向量机的G蛋白偶联特异性预测研究", 《中国优秀硕士学位论文全文数据库 基础科学辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033753A (en) * 2018-06-07 2018-12-18 浙江工业大学 A kind of group's Advances in protein structure prediction based on the assembling of secondary structure segment
CN109033753B (en) * 2018-06-07 2021-06-18 浙江工业大学 Group protein structure prediction method based on secondary structure fragment assembly
CN111710360A (en) * 2020-05-27 2020-09-25 广州大学 Method, system, device and medium for predicting protein sequence
CN111710360B (en) * 2020-05-27 2023-04-25 广州大学 Method, system, device and medium for predicting protein sequence
CN113270153A (en) * 2021-05-27 2021-08-17 南华大学 Screening method of compound targeting G protein coupled receptor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180518