CN108508207A

CN108508207A - The identification method of protein-DNA binding sites

Info

Publication number: CN108508207A
Application number: CN201710245597.XA
Authority: CN
Inventors: 张德强; 司婧娜
Original assignee: Beijing Forestry University
Current assignee: Beijing Forestry University
Priority date: 2017-04-14
Filing date: 2017-04-14
Publication date: 2018-09-07

Abstract

The present invention provides a method for determining protein-DNA binding sites and a device for determining protein-DNA binding sites. The method comprises: splitting the amino acid sequences of a reference protein set and the amino acid sequence of a protein to be tested, respectively, into multiple candidate units each having a predetermined number of amino acids; determining the amino acid attributes of each of the candidate units; and determining the protein-DNA binding sites of the protein to be tested. With the method and device of the present invention, protein-DNA binding sites can be determined accurately; the steps are simple, operation is convenient, and cost is significantly reduced.

Description

The identification method of protein-DNA binding sites

Technical field

The present invention relates to the field of biology. In particular, it relates to a method for identifying protein-DNA binding sites. More specifically, the present invention relates to a method for determining protein-DNA binding sites and a device for determining protein-DNA binding sites.

Background art

Interactions between proteins and DNA are widespread in the life activities of cells. DNA molecules not only serve as genetic material that encodes proteins, but can also bind specific proteins to regulate gene expression. For example, DNA replication, mRNA transcription and modification, and viral infection all involve interactions between DNA and proteins.

However, current methods and devices for determining protein-DNA binding sites still leave considerable room for improvement.

Summary of the invention

The present invention aims to solve, at least to some extent, at least one of the technical problems existing in the prior art. To this end, the present invention provides a method and a device for determining protein-DNA binding sites. With the method and device according to embodiments of the present invention, protein-DNA binding sites can be determined accurately; the steps are simple, operation is convenient, and cost is significantly reduced.

It should be noted that the present invention was completed on the basis of the following findings of the inventors:

At present, methods for determining protein-DNA binding sites mainly include point mutation experiments, DNA mobility shift assays, DNase I footprinting, X-ray diffraction, and nuclear magnetic resonance. However, these experiments take a long time and require a large investment; in particular, some protein-DNA complexes are difficult to obtain. As a result, the annotation of protein functional sites lags far behind the growth of protein sequence and structure information.

In view of this, the inventors found through extensive experiments that certain amino acid attributes significantly affect the binding of amino acids to DNA. Consequently, based on these amino acid attributes of the reference proteins, the protein-DNA binding sites of the reference proteins, and these amino acid attributes of the protein to be tested, the protein-DNA binding sites of the protein to be tested can be determined accurately. Thus, with the method and device according to embodiments of the present invention, protein-DNA binding sites can be determined accurately; the steps are simple, operation is convenient, and cost is significantly reduced.

To this end, in one aspect, the present invention provides a method for determining protein-DNA binding sites. According to an embodiment of the present invention, the method comprises: splitting the amino acid sequences of a reference protein set and the amino acid sequence of a protein to be tested, respectively, into multiple candidate units having a predetermined number of amino acids; determining the amino acid attributes of each of the multiple candidate units, the amino acid attributes comprising at least one selected from the following: average non-binding energy of residues, transfer free energy cap-chx, amino acid composition, short- and medium-range non-binding energy, molecular weight, transfer free energy vap-oct, alpha-helix propensity, high-salt chromatography RF value, average residue volume, cytochrome synthetic protein amino acid composition, principal component III, SD total protein amino acid composition, accessible surface area, amino acid distribution of 18 non-redundant mesophilic protein families, and surface accessibility protein content; and determining the protein-DNA binding sites of the protein to be tested based on the attributes of the amino acids in the candidate units.

The inventors found that the above amino acid attributes significantly affect the binding of amino acids to DNA. Consequently, based on these amino acid attributes of the reference proteins, the protein-DNA binding sites of the reference proteins, and these amino acid attributes of the protein to be tested, the protein-DNA binding sites of the protein to be tested can be determined accurately. In addition, considering that adjacent amino acid residues in a protein sequence may interact, the amino acid sequences of the reference protein set and of the protein to be tested are split into multiple candidate units having a predetermined number of amino acids, which improves the accuracy of the result. Thus, with the method according to embodiments of the present invention, protein-DNA binding sites can be determined accurately; the steps are simple, operation is convenient, and cost is significantly reduced.

According to an embodiment of the invention, the method for above-mentioned determining protein-DNA binding sites can also have following additional Technical characteristic：

According to an embodiment of the invention, the predetermined amino acid numerical value is 19.It utilizes as a result, according to embodiments of the present invention The methods of determination protein-DNA binding sites further accurately determine protein-DNA binding sites.

According to an embodiment of the invention, the reference protein collection contains at least 30 reference proteins.It is sharp as a result, Protein-DNA knots are further accurately determined with the method for determining protein-DNA binding sites according to the ... of the embodiment of the present invention Close site.

According to an embodiment of the invention, the amino acid in the candidate unit has at least one following attribute, is institute State the instruction that amino acid is protein-DNA binding sites：The average non-binding energy of residue is -26.17~-7.59；Transfer free energy Cap-chx is -8.21~1.45；Amino acid group becomes 0.7~8.8；Participate in short- and medium-range it is non-binding can for -14.42~- 5.46；Molecular weight is 75.07~204.24；Transfer free energy vap-oct is -18.6~2.39；Alpha-helix tendentiousness is -0.38 ~1.24；Chromatography RF values with high salt are 0.2~0.97；Residue average external volume is 67.5~237.2；Cytochromes synthetic proteins amino Acid group becomes 1.06~8.36；Principal component III is -0.29~0.49；The amino acid group of SD total proteins becomes 1.15~3.73；It can And surface area is 0~271.6；The mesophilic protein family amino acids distribution of 18 nonredundancies is 1~9.4；And surface accessibility egg Bai Hanliang is 0~0.22.Utilize the method for determining protein-DNA binding sites according to the ... of the embodiment of the present invention further as a result, Accurately determine protein-DNA binding sites.

In another aspect, the present invention provides a device for determining protein-DNA binding sites. According to an embodiment of the present invention, the device comprises: a splitting component, adapted to split the amino acid sequences of a reference protein set and the amino acid sequence of a protein to be tested, respectively, into multiple candidate units having a predetermined number of amino acids; an amino acid attribute determination component, connected to the splitting component and adapted to determine the amino acid attributes of each of the multiple candidate units, the amino acid attributes comprising at least one selected from the following: average non-binding energy of residues, transfer free energy cap-chx, amino acid composition, short- and medium-range non-binding energy, molecular weight, transfer free energy vap-oct, alpha-helix propensity, high-salt chromatography RF value, average residue volume, cytochrome synthetic protein amino acid composition, principal component III, SD total protein amino acid composition, accessible surface area, amino acid distribution of 18 non-redundant mesophilic protein families, and surface accessibility protein content; and a determination component, connected to the amino acid attribute determination component and adapted to determine the protein-DNA binding sites of the protein to be tested based on the attributes of the amino acids in the candidate units. Thus, with the device according to embodiments of the present invention, protein-DNA binding sites can be determined accurately; the steps are simple, operation is convenient, and cost is significantly reduced.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 shows a schematic flowchart of the method for determining protein-DNA binding sites according to an embodiment of the present invention;

Fig. 2 shows a schematic structural diagram of the device for determining protein-DNA binding sites according to an embodiment of the present invention;

Fig. 3 shows a schematic comparison of the actually determined protein-DNA binding sites and the theoretically determined protein-DNA binding sites according to an embodiment of the present invention; and

Fig. 4 shows a schematic analysis of the influence of the predetermined number of amino acids on the result according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. The embodiments described below are exemplary, are intended only to explain the present invention, and should not be construed as limiting it.

It should be noted that the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. Further, in the description of the present invention, unless otherwise stated, "a plurality of" means two or more.

The present invention provides a method and a device for determining protein-DNA binding sites, each of which is described in detail below.

Method for determining protein-DNA binding sites

In one aspect, the present invention provides a method for determining protein-DNA binding sites. According to an embodiment of the present invention, referring to Fig. 1, the method comprises:

S100: Splitting into multiple candidate units

In this step, the amino acid sequences of the reference protein set and the amino acid sequence of the protein to be tested are each split into multiple candidate units having a predetermined number of amino acids.

Considering that adjacent amino acid residues in a protein sequence may interact, the inventors split the amino acid sequences of the reference protein set and of the protein to be tested into multiple candidate units having a predetermined number of amino acids; adding the features of neighboring residues to the prediction features can improve the prediction. Thus, with the method according to embodiments of the present invention, protein-DNA binding sites can be determined accurately; the steps are simple, operation is convenient, and cost is significantly reduced.

According to an embodiment of the present invention, the predetermined number of amino acids is 19. The inventors found that the number of amino acids in each candidate unit (i.e., the predetermined number of amino acids) can significantly affect the accuracy of the result. If the number is too small, the information content is insufficient; if it is too large, subsequent operations slow down and overall efficiency falls. Through in-depth study, the inventors found that results are better when the predetermined number of amino acids is 19.

According to an embodiment of the present invention, the reference protein set contains at least 30 reference proteins. The inventors found that using the amino acid attributes of no fewer than 30 reference proteins, together with their relationship to protein-DNA binding sites, as a reference gives the approach strong generality; by comparing the amino acid attributes of the protein to be tested with those of the reference proteins, protein-DNA binding sites can be determined accurately.
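The splitting step can be illustrated with a minimal sketch. The text does not state whether the candidate units overlap or how sequence ends are handled, so the sketch assumes one unit centred on each residue (a sliding window) and pads the ends with a placeholder character. The function name `split_into_units` and the padding symbol `X` are hypothetical.

```python
from typing import List

PAD = "X"  # hypothetical placeholder for positions beyond the sequence ends


def split_into_units(sequence: str, unit_size: int = 19) -> List[str]:
    """Return one candidate unit per residue, centred on that residue.

    Assumption: units are overlapping sliding windows of odd length.
    """
    if unit_size < 1 or unit_size % 2 == 0:
        raise ValueError("unit_size must be a positive odd number")
    half = unit_size // 2
    padded = PAD * half + sequence + PAD * half
    # Window i starts at padded[i], so its middle position is sequence[i].
    return [padded[i:i + unit_size] for i in range(len(sequence))]


# Toy sequence and a small window so the result is easy to inspect.
units = split_into_units("MKRLSTAVQ", unit_size=5)
print(units[0])   # XXMKR
print(units[4])   # RLSTA
```

With the default `unit_size=19`, each unit would hold the central residue and nine neighbours on either side.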

S200: Determining amino acid attributes

In this step, the attributes of each amino acid in each of the multiple candidate units are determined. The amino acid attributes comprise at least one selected from the following: average non-binding energy of residues, transfer free energy cap-chx, amino acid composition, short- and medium-range non-binding energy, molecular weight, transfer free energy vap-oct, alpha-helix propensity, high-salt chromatography RF value, average residue volume, cytochrome synthetic protein amino acid composition, principal component III, SD total protein amino acid composition, accessible surface area, amino acid distribution of 18 non-redundant mesophilic protein families, and surface accessibility protein content.

The inventors found that the above amino acid attributes significantly affect the binding of amino acids to DNA. Consequently, based on these amino acid attributes of the reference proteins, the protein-DNA binding sites of the reference proteins, and these amino acid attributes of the protein to be tested, the protein-DNA binding sites of the protein to be tested can be determined accurately.
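As an illustration of attaching an attribute value to each residue of a candidate unit, the following sketch uses a plain lookup table for a single attribute, molecular weight. The lookup table is a stand-in chosen for the example and is not the procedure of the invention. The values are commonly tabulated molecular weights of the free amino acids (the last digit varies slightly between sources); the function name `unit_features` and the default value for unknown residues are assumptions.

```python
from typing import Dict, List

# Commonly tabulated molecular weights (g/mol) of the 20 standard amino acids.
MOLECULAR_WEIGHT: Dict[str, float] = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.15,
    "Q": 146.15, "E": 147.13, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.24, "Y": 181.19, "V": 117.15,
}


def unit_features(unit: str, table: Dict[str, float],
                  missing: float = 0.0) -> List[float]:
    """Map each residue of a candidate unit to its attribute value.

    Residues absent from the table (e.g. padding) get `missing`,
    which is an assumption of this sketch.
    """
    return [table.get(residue, missing) for residue in unit]


features = unit_features("XGAWK", MOLECULAR_WEIGHT)
print(features)   # [0.0, 75.07, 89.09, 204.24, 146.19]
```

Further attributes could be handled the same way, with one table per attribute and the per-residue values concatenated into one feature vector per unit.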

According to an embodiment of the present invention, an amino acid in a candidate unit having the following attributes is an indication that the amino acid is a protein-DNA binding site: average non-binding energy of residues of -26.17 to -7.59; transfer free energy cap-chx of -8.21 to 1.45; amino acid composition of 0.7 to 8.8; short- and medium-range non-binding energy of -14.42 to -5.46; molecular weight of 75.07 to 204.24; transfer free energy vap-oct of -18.6 to 2.39; alpha-helix propensity of -0.38 to 1.24; high-salt chromatography RF value of 0.2 to 0.97; average residue volume of 67.5 to 237.2; cytochrome synthetic protein amino acid composition of 1.06 to 8.36; principal component III of -0.29 to 0.49; SD total protein amino acid composition of 1.15 to 3.73; accessible surface area of 0 to 271.6; amino acid distribution of 18 non-redundant mesophilic protein families of 1 to 9.4; and surface accessibility protein content of 0 to 0.22.

It should be noted that, in the present invention, the amino acid attributes are obtained with the PSAIA software.
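The attribute ranges listed above lend themselves to a simple range check, sketched below. The numeric bounds are those quoted in the text; the dictionary keys, the function name `attributes_in_range`, and the choice of inclusive bounds are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Bounds quoted in the text; the keys are hypothetical identifiers.
BINDING_RANGES: Dict[str, Tuple[float, float]] = {
    "avg_nonbinding_energy": (-26.17, -7.59),
    "transfer_free_energy_cap_chx": (-8.21, 1.45),
    "amino_acid_composition": (0.7, 8.8),
    "short_medium_range_nonbinding_energy": (-14.42, -5.46),
    "molecular_weight": (75.07, 204.24),
    "transfer_free_energy_vap_oct": (-18.6, 2.39),
    "alpha_helix_propensity": (-0.38, 1.24),
    "high_salt_chromatography_rf": (0.2, 0.97),
    "avg_residue_volume": (67.5, 237.2),
    "cytochrome_synthetic_protein_composition": (1.06, 8.36),
    "principal_component_iii": (-0.29, 0.49),
    "sd_total_protein_composition": (1.15, 3.73),
    "accessible_surface_area": (0.0, 271.6),
    "mesophilic_family_distribution": (1.0, 9.4),
    "surface_accessibility_protein_content": (0.0, 0.22),
}


def attributes_in_range(values: Dict[str, float]) -> List[str]:
    """Return the names of the supplied attributes that fall within range.

    Bounds are treated as inclusive (an assumption of this sketch).
    """
    hits = []
    for name, value in values.items():
        low, high = BINDING_RANGES[name]
        if low <= value <= high:
            hits.append(name)
    return hits


example = {
    "molecular_weight": 146.19,
    "alpha_helix_propensity": 1.5,      # outside -0.38 to 1.24
    "accessible_surface_area": 120.0,
}
hits = attributes_in_range(example)
print(hits)   # ['molecular_weight', 'accessible_surface_area']
```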

S300: Determining protein-DNA binding sites

In this step, the protein-DNA binding sites of the protein to be tested are determined based on the attributes of the amino acids in the candidate units. According to a specific embodiment of the present invention, the protein-DNA binding sites of the protein to be tested are determined using a kernel-function model. Specifically, a model (kernel function) is built from the amino acid sequences of the reference protein set and their protein-DNA binding sites; the amino acid attributes of the protein to be tested are then substituted into the model, and if the calculated value (called the decision value) falls within the range 0 to 1, the amino acid is determined to be a protein-DNA binding site.

To facilitate understanding, the source code of the model for the method of determining protein-DNA binding sites and a corresponding explanation are given below:
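The text states only that a kernel-function model is built from the reference proteins and that a decision value is computed for each amino acid of the protein to be tested. As a hedged illustration of how a kernel can turn reference examples into a decision value between 0 and 1, the sketch below uses a kernel-weighted average of reference labels with a Gaussian (RBF) kernel. This is a swapped-in stand-in and not the model of the invention; the function names, the `gamma` parameter, and the toy data are all hypothetical.

```python
import math
from typing import List, Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def decision_value(query: Sequence[float],
                   ref_features: List[Sequence[float]],
                   ref_labels: List[int],
                   gamma: float = 0.5) -> float:
    """Kernel-weighted average of reference labels (1 = binding, 0 = not).

    The result always lies in [0, 1]; values near 1 mean the query
    resembles reference binding residues.
    """
    weights = [rbf_kernel(query, f, gamma) for f in ref_features]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # query is far from every reference example
    return sum(w * y for w, y in zip(weights, ref_labels)) / total


# Toy reference set: two binding and two non-binding feature vectors.
ref_features = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
ref_labels = [1, 1, 0, 0]

near_binding = decision_value([0.0, 0.5], ref_features, ref_labels)
near_nonbinding = decision_value([3.0, 3.5], ref_features, ref_labels)
print(near_binding > 0.5, near_nonbinding < 0.5)   # True True
```

In practice the feature vectors would come from the amino acid attributes of each candidate unit, and any cut-off applied to the decision value would have to follow the trained model actually used.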

Device for determining protein-DNA binding sites

In another aspect, the present invention provides a device for determining protein-DNA binding sites. According to an embodiment of the present invention, referring to Fig. 2, the device comprises: a splitting component 100, an amino acid attribute determination component 200, and a determination component 300. Thus, with the device according to embodiments of the present invention, protein-DNA binding sites can be determined accurately; the steps are simple, operation is convenient, and cost is significantly reduced.

Splitting component 100

According to an embodiment of the present invention, the splitting component 100 is adapted to split the amino acid sequences of the reference protein set and the amino acid sequence of the protein to be tested, respectively, into multiple candidate units having a predetermined number of amino acids.

Amino acid attribute determination component 200

According to an embodiment of the present invention, the amino acid attribute determination component 200 is connected to the splitting component 100 and is adapted to determine the amino acid attributes of each of the multiple candidate units. The amino acid attributes comprise at least one selected from the following: average non-binding energy of residues, transfer free energy cap-chx, amino acid composition, short- and medium-range non-binding energy, molecular weight, transfer free energy vap-oct, alpha-helix propensity, high-salt chromatography RF value, average residue volume, cytochrome synthetic protein amino acid composition, principal component III, SD total protein amino acid composition, accessible surface area, amino acid distribution of 18 non-redundant mesophilic protein families, and surface accessibility protein content.

Determination component 300

According to an embodiment of the present invention, the determination component 300 is connected to the amino acid attribute determination component 200 and is adapted to determine the protein-DNA binding sites of the protein to be tested based on the attributes of the amino acids in the candidate units.

Those skilled in the art will appreciate that the features and advantages described above for the method for determining protein-DNA binding sites apply equally to the device for determining protein-DNA binding sites and are not repeated here.
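The chaining of the three components can be sketched in software as follows. This is only an illustration of the data flow from component 100 to component 200 to component 300. The class names, the padding symbol, the toy attribute table, the scoring function, and the 0.5 threshold are all hypothetical and are not taken from the invention.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SplittingComponent:  # plays the role of component 100
    unit_size: int = 19

    def run(self, sequence: str) -> List[str]:
        half = self.unit_size // 2
        padded = "X" * half + sequence + "X" * half  # "X" padding is an assumption
        return [padded[i:i + self.unit_size] for i in range(len(sequence))]


@dataclass
class AttributeComponent:  # plays the role of component 200
    table: Dict[str, float]

    def run(self, units: List[str]) -> List[List[float]]:
        # Unknown residues (including padding) default to 0.0 in this sketch.
        return [[self.table.get(res, 0.0) for res in unit] for unit in units]


@dataclass
class DeterminationComponent:  # plays the role of component 300
    score: Callable[[List[float]], float]
    threshold: float = 0.5  # hypothetical cut-off

    def run(self, features: List[List[float]]) -> List[int]:
        # Return the 0-based positions whose score reaches the threshold.
        return [i for i, f in enumerate(features) if self.score(f) >= self.threshold]


def run_device(sequence: str, splitter: SplittingComponent,
               attributes: AttributeComponent,
               determiner: DeterminationComponent) -> List[int]:
    """Connect the components in series: 100 -> 200 -> 300."""
    return determiner.run(attributes.run(splitter.run(sequence)))


# Toy setup: K and R score 1.0, everything else 0.0; score = window mean.
toy_table = {"K": 1.0, "R": 1.0}
sites = run_device(
    "AKRKA",
    SplittingComponent(unit_size=3),
    AttributeComponent(toy_table),
    DeterminationComponent(score=lambda f: sum(f) / len(f)),
)
print(sites)   # [1, 2, 3]
```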

The solution of the present invention is explained below with reference to embodiments. Those skilled in the art will understand that the following embodiments are merely illustrative of the present invention and should not be taken as limiting its scope. Where specific techniques or conditions are not indicated in the embodiments, they follow the techniques or conditions described in the literature in the art or the product instructions. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.

Embodiment 1

In this embodiment, the protein with PDB ID 5EEA is used as the protein to be tested, and its protein-DNA binding sites are determined as follows:

1. Sixty-two reference proteins (listed in the table below) were collected, and their protein structure information and protein-DNA binding sites were obtained from the PDB website (http://www.rcsb.org/pdb/home/home.do). The amino acid sequence information of the protein to be tested can come from a variety of sources, such as experimental determination or sequencing. The amino acid sequences of each reference protein and of the protein to be tested were split into multiple units of 19 amino acids.

1AAY

1AZQ

1A74

1A02

1BER-a

1BF5

1BHM-a

1BL0

1B3T

1CDW

1CF7-a

1CJG

1CMA

1C0W-b

1DP7

1D02-a

1D66-a

1ECR

1FJL-a

1GAT

1GCC

1GDT-a

1HCQ-a

1HCR

1HDD-c

1HLO-a

1HRY

1HWT-h

1IFL-a

1IGN-a

1IHF

1LMB-4

1MDY-a

1MEY-c

1MHD-a

1MNM

1MSE

1OCT

1PAR-b

1PDN

1PER-1

1PNR

1PUE-e

1PVI-b

1PYI-a

1REP-c

1SRS

1SVC

1TC3

1TF3

1TRO-a

1TSR-b

1UBD

1YRN-a

1YSA

1YUI

1XBR-a

2BOP

2DRP-a

2GLI

2HDC

3CRO-1

2. For each amino acid in each unit, the following attributes were determined: average non-binding energy of residues, transfer free energy cap-chx, amino acid composition, short- and medium-range non-binding energy, molecular weight, transfer free energy vap-oct, alpha-helix propensity, high-salt chromatography RF value, average residue volume, cytochrome synthetic protein amino acid composition, principal component III, SD total protein amino acid composition, accessible surface area, amino acid distribution of 18 non-redundant mesophilic protein families, and surface accessibility protein content.

3. Based on the amino acid attributes of the reference proteins, their protein-DNA binding sites, and the amino acid attributes of the protein to be tested, the protein-DNA binding sites of the protein to be tested were determined.

Fig. 3 compares the protein-DNA binding sites actually determined with the method of the present invention (actual binding sites) with the protein-DNA binding sites determined with a method reported in the literature (theoretical binding sites). As can be seen, the method of the present invention outperforms the literature method on four important indices of prediction performance (prediction accuracy Ac, sensitivity Sn, Matthews correlation coefficient MCC, and area under the ROC curve), and performs on par with the literature method on the specificity index.
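The text does not give formulas for the evaluation indices. Assuming Ac, Sn, Sp, and MCC carry their usual meanings (accuracy, sensitivity, specificity, and Matthews correlation coefficient), they can be computed from the counts of a confusion matrix as sketched below. The function name `evaluate` and the example counts are hypothetical and are not results from the invention.

```python
import math
from typing import Dict


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Ac, Sn, Sp and MCC from confusion-matrix counts.

    tp/fn: binding residues predicted as binding / non-binding.
    tn/fp: non-binding residues predicted as non-binding / binding.
    """
    total = tp + tn + fp + fn
    ac = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": ac, "Sn": sn, "Sp": sp, "MCC": mcc}


# Made-up counts for illustration only.
scores = evaluate(tp=60, tn=65, fp=35, fn=40)
print(scores["Ac"], scores["Sn"], scores["Sp"])   # 0.625 0.6 0.65
```

The area under the ROC curve additionally requires the continuous decision values rather than the thresholded counts, so it is not covered by this sketch.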

Embodiment 2

In this embodiment, the influence of the predetermined number of amino acids on the result is studied.

The predetermined number of amino acids is an odd number. Considering that adjacent amino acid residues in a protein sequence may interact, adding the features of neighboring residues to the prediction features can improve the prediction. If the predetermined number is too small, the information content is insufficient; if it is too large, the program runs more slowly while the prediction model improves little. Selecting a suitable predetermined number of amino acids is therefore an important part of building the model.

To assess the influence of different predetermined numbers of amino acids on prediction, models were built using each of the 11 odd numbers from 3 to 23 as the predetermined number, yielding 11 sets of model evaluation parameters (see the table below). Because Ac, MCC, and AUC are the evaluation parameters of greatest interest, the trend of these three values with the predetermined number was plotted as a line chart, with the predetermined number of amino acids on the abscissa and the evaluation parameter on the ordinate, as shown in Fig. 4.

As can be seen from the table and Fig. 4, as the predetermined number of amino acids increases, the four parameters Ac, Sn, Sp, and MCC generally first rise and then fall, while AUC rises gradually with its growth slowing at later stages. Taking all five evaluation indices together, the model built with a predetermined number of 19 predicts best. Therefore, 19 is used as the predetermined number of amino acids in all subsequent experiments.

Table 1. Influence of different predetermined numbers of amino acids on the prediction model

Predetermined number	Ac	Sn	Sp	MCC	AUC
3	0.568844	0.556183	0.581505	0.13801	0.599206
5	0.572204	0.596129	0.54828	0.145152	0.615707
7	0.606075	0.601075	0.611075	0.212586	0.637525
9	0.611102	0.617903	0.604301	0.223451	0.653098
11	0.617608	0.621129	0.614086	0.236466	0.665952
13	0.621855	0.606183	0.637527	0.244997	0.671059
15	0.623522	0.617903	0.62914	0.248033	0.681997
17	0.623495	0.612903	0.634086	0.248059	0.677847
19	0.626855	0.614516	0.639194	0.25445	0.680827
21	0.609462	0.599624	0.619301	0.21966	0.683184
23	0.617769	0.618065	0.617473	0.237091	0.682246
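The comparison across predetermined numbers can be reproduced from the values in Table 1 with a short script that reports, for each evaluation parameter, the predetermined number at which it peaks. The numbers are copied from the table; the variable and function names are hypothetical.

```python
from typing import Dict, Tuple

METRICS = ("Ac", "Sn", "Sp", "MCC", "AUC")

# Rows of Table 1: predetermined number -> (Ac, Sn, Sp, MCC, AUC).
TABLE_1: Dict[int, Tuple[float, float, float, float, float]] = {
    3: (0.568844, 0.556183, 0.581505, 0.13801, 0.599206),
    5: (0.572204, 0.596129, 0.54828, 0.145152, 0.615707),
    7: (0.606075, 0.601075, 0.611075, 0.212586, 0.637525),
    9: (0.611102, 0.617903, 0.604301, 0.223451, 0.653098),
    11: (0.617608, 0.621129, 0.614086, 0.236466, 0.665952),
    13: (0.621855, 0.606183, 0.637527, 0.244997, 0.671059),
    15: (0.623522, 0.617903, 0.62914, 0.248033, 0.681997),
    17: (0.623495, 0.612903, 0.634086, 0.248059, 0.677847),
    19: (0.626855, 0.614516, 0.639194, 0.25445, 0.680827),
    21: (0.609462, 0.599624, 0.619301, 0.21966, 0.683184),
    23: (0.617769, 0.618065, 0.617473, 0.237091, 0.682246),
}


def best_number_per_metric(table: Dict[int, Tuple[float, ...]]) -> Dict[str, int]:
    """For each metric, return the predetermined number with the highest value."""
    best = {}
    for column, name in enumerate(METRICS):
        best[name] = max(table, key=lambda number: table[number][column])
    return best


best = best_number_per_metric(TABLE_1)
print(best)   # {'Ac': 19, 'Sn': 11, 'Sp': 19, 'MCC': 19, 'AUC': 21}
```

In these values, Ac, Sp, and MCC all peak at 19, while Sn peaks at 11 and AUC at 21, which is consistent with the choice of 19 as an overall compromise.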

In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims

1. A method for determining protein-DNA binding sites, characterized by comprising:

splitting the amino acid sequences of a reference protein set and the amino acid sequence of a protein to be tested, respectively, into multiple candidate units having a predetermined number of amino acids;

determining the amino acid attributes of each of the multiple candidate units, the amino acid attributes comprising at least one selected from the following: average non-binding energy of residues, transfer free energy cap-chx, amino acid composition, short- and medium-range non-binding energy, molecular weight, transfer free energy vap-oct, alpha-helix propensity, high-salt chromatography RF value, average residue volume, cytochrome synthetic protein amino acid composition, principal component III, SD total protein amino acid composition, accessible surface area, amino acid distribution of 18 non-redundant mesophilic protein families, and surface accessibility protein content; and

determining the protein-DNA binding sites of the protein to be tested based on the attributes of the amino acids in the candidate units.

2. The method according to claim 1, characterized in that the predetermined number of amino acids is 19.

3. The method according to claim 1, characterized in that the reference protein set contains at least 30 reference proteins.

4. The method according to claim 1, characterized in that an amino acid in the candidate unit having at least one of the following attributes is an indication that the amino acid is a protein-DNA binding site:

the average non-binding energy of residues is -26.17 to -7.59;

the transfer free energy cap-chx is -8.21 to 1.45;

the amino acid composition is 0.7 to 8.8;

the short- and medium-range non-binding energy is -14.42 to -5.46;

the molecular weight is 75.07 to 204.24;

the transfer free energy vap-oct is -18.6 to 2.39;

the alpha-helix propensity is -0.38 to 1.24;

the high-salt chromatography RF value is 0.2 to 0.97;

the average residue volume is 67.5 to 237.2;

the cytochrome synthetic protein amino acid composition is 1.06 to 8.36;

principal component III is -0.29 to 0.49;

the SD total protein amino acid composition is 1.15 to 3.73;

the accessible surface area is 0 to 271.6;

the amino acid distribution of 18 non-redundant mesophilic protein families is 1 to 9.4; and

the surface accessibility protein content is 0 to 0.22.

5. A device for determining protein-DNA binding sites, characterized by comprising:

a splitting component, adapted to split the amino acid sequences of a reference protein set and the amino acid sequence of a protein to be tested, respectively, into multiple candidate units having a predetermined number of amino acids;

an amino acid attribute determination component, connected to the splitting component and adapted to determine the amino acid attributes of each of the multiple candidate units, the amino acid attributes comprising at least one selected from the following: average non-binding energy of residues, transfer free energy cap-chx, amino acid composition, short- and medium-range non-binding energy, molecular weight, transfer free energy vap-oct, alpha-helix propensity, high-salt chromatography RF value, average residue volume, cytochrome synthetic protein amino acid composition, principal component III, SD total protein amino acid composition, accessible surface area, amino acid distribution of 18 non-redundant mesophilic protein families, and surface accessibility protein content; and

a determination component, connected to the amino acid attribute determination component and adapted to determine the protein-DNA binding sites of the protein to be tested based on the attributes of the amino acids in the candidate units.