CN101307359A - Process for recognising human gene promoter - Google Patents
Process for recognising human gene promoter
- Publication number
- CN101307359A (application number CNA2008100699415A)
- Authority
- CN
- China
- Prior art keywords
- promoter
- human gene
- gene promoter
- variable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for recognizing human gene promoters. It can be used to determine human gene promoter regions and annotate their structure and function, and also to discover new, unknown genes. The method comprises the following steps: (a) establishing a generalized base property score representation system based on principal component analysis; (b) characterizing the structures of human gene promoters and non-promoters with the generalized base property scores; (c) normalizing the descriptor variables of each promoter and non-promoter with the auto-cross covariance method; and (d) building a human gene promoter recognition model with a radial basis function (RBF) kernel support vector machine.
Description
Technical field
The present invention relates to a method for human gene recognition, and in particular to a method for recognizing human gene promoters.
Background art
The completion of the draft human genome sequence has greatly accelerated the analysis of the full set of human genes. The promoter is the key regulatory region for the transcriptional activity of each gene. Determining promoter regions and annotating their structure and function is the basis for understanding gene expression patterns, gene regulatory networks, cell differentiation and development. Promoter prediction is also crucial for discovering new, unknown genes and for improving expression vectors or gene delivery systems in gene therapy. Promoter prediction has attracted wide attention, and prediction programs have been built on different concepts; the underlying principle is that the properties of promoter regions differ from those of other genomic DNA. These concepts include signal-based and content-based approaches. Computational prediction and recognition of promoters is a challenging task: the diversity of promoters and the limited understanding of transcriptional regulation mechanisms make the work very difficult. Homology alignment algorithms are established for comparing nucleotide sequences, but their use in promoter prediction is still in its infancy. Although homologous promoters can be clustered by alignment, the sequence conservation of homologous gene promoter elements is far lower than that of their coding sequences, so in most cases similarity searches provide no useful clues for identifying promoter function (Duret et al., Curr. Opin. Struct. Biol., 1997, 7:399). In addition, many promoters are regulated by multiple signalling pathways, and the need to respond specifically to different stimuli makes promoter organization more complex and varied. Sometimes even promoters regulated by the same signalling pathway may share no sequence homology at all (Kirchhamer et al., Proc. Natl. Acad. Sci. U.S.A., 1996, 93:9322). Furthermore, promoters contain many sequence features such as transcription factor binding sites, but these features are not unique to promoters; they are scattered across the whole genome. Filtering out this abundant noise is a further difficulty for computational promoter prediction in large genomic fragments (Sap et al., Nature, 1989, 340:242; Bohjanen et al., Nucleic Acids Res., 1997, 25:4481; Wang et al., Proc. Natl. Acad. Sci. U.S.A., 1998, 95:492). Some programs describe promoter sequence features using experimentally determined transcription factor binding characteristics and use them as the basis for prediction, but the results are unsatisfactory: both missed detections and false positives are serious.
Summary of the invention
In view of this, to solve the above problems in promoter prediction, the present invention provides a method for recognizing human gene promoters, which can be used to determine human gene promoter regions and annotate their structure and function, and to discover new, unknown genes.
The object of the present invention is achieved by a method for recognizing human gene promoters comprising the following steps:
a) establishing a generalized base property score representation system based on principal component analysis (PCA);
b) characterizing the structures of human gene promoters and non-promoters with the generalized base property scores;
c) normalizing the descriptor variables of each human gene promoter and non-promoter with the auto-cross covariance (ACC) method;
d) building a human gene promoter recognition model with a radial basis function (RBF) kernel support vector machine.
Further, step a) specifically comprises the following steps:
a1) selecting 1209 0D-3D property parameters of the 5 bases;
a2) performing a correlation analysis on the 1209 property parameters and selecting 41 of them;
a3) processing the resulting base property parameters with PCA to obtain 4 principal components;
a4) calculating the score of each principal component and defining the resulting score vectors as the generalized base property scores;
Further, step b) specifically comprises: characterizing the sequences of human gene promoters and non-promoters along the 5′ → 3′ direction with the generalized base property score vector formed by the 4 principal components, each base being represented by a vector of 4 generalized base property scores;
Further, step c) specifically comprises: processing the descriptor variables of each promoter and non-promoter sequence with the auto-cross covariance method, with the lag (step length) l set to 6, so that every sequence has the same number of descriptor variables, and using the variables obtained from the auto-cross covariance processing as the independent variables of the promoter recognition model;
Further, step d) specifically comprises: first defining two indicator values, with "1" denoting promoter samples and "-1" denoting non-promoter samples; using this indicator variable as the dependent variable of the promoter recognition model; and building the human gene promoter recognition model with an RBF-kernel support vector machine.
In the method of the present invention, the selected generalized base property scores are rich in information, have a clear physicochemical meaning and strong characterization ability, give results that are easy to interpret, extend well and are simple to use. Normalizing the descriptor variables of each promoter and non-promoter with the auto-cross covariance method greatly reduces the loss of original variable information while fully accounting for the interactions and mutual influence between neighbouring bases. Through the kernel function technique, the RBF-kernel support vector machine can relate the sequence descriptor variables obtained by the auto-cross covariance transformation to the observed class values well and can effectively prevent over-fitting, so the resulting model generalizes well.
Other advantages, objects and features of the present invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structure particularly pointed out in the following description, the claims and the accompanying drawing.
Description of drawings
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawing, in which:
Fig. 1 is a schematic diagram of the receiver operating characteristic (ROC) analysis of the recognition results of the support vector machine model of the present invention.
Detailed description of the embodiment
The method of the present invention is described in detail below with reference to the drawing, taking human gene promoter recognition as an example. It comprises the following steps:
a) Establish a generalized base property score representation system based on principal component analysis;
1209 property parameters are collected for the 5 bases (A, C, G, T and U), including: constitutional descriptors, functional group counts, atom-centred fragments and molecular properties, the molecular electronegativity distance vector (MEDV), the molecular holographic distance vector (MHDV), topological descriptors, walk and path counts, connectivity indices, information indices, autocorrelations, edge adjacency indices, Burden eigenvalues, topological charge indices, eigenvalue-based indices, Randic molecular profiles, geometrical descriptors, radial distribution function (RDF) descriptors based on different interatomic distances, 3D-MoRSE descriptors (molecular structure representation based on electron diffraction), weighted holistic invariant molecular (WHIM) descriptors, and geometry, topology and atom-weights assembly (GETAWAY) descriptors. Other related properties are also included, such as the highest occupied molecular orbital (HOMO) energy, the dipole moment and the Wiener index.
Principal component analysis is used to compress the number of descriptors. To avoid the harm that severe multicollinearity among the variables does to the principal components, a correlation analysis is first performed on the 1209 original variables. For each group of variables with correlation coefficients greater than or equal to 0.90, one variable is retained according to the magnitude of its loading in the original variable matrix and the others are deleted, leaving 41 variables. These mainly reflect the following information about the bases: average molecular weight, number of multiple bonds, average aromatic polarization, average electrotopological state, total electronic energy, thermodynamic properties, the Moriguchi octanol-water partition coefficient (logP), number of urea derivatives, number of hydrogen bond acceptor atoms (N, O, F), E-state topological parameters, the Kier flexibility index, the highest occupied molecular orbital (HOMO) energy, the molecular holographic distance vector, dipole moment, distortion energy and spatial structure. After the 41 variables are transformed by principal component analysis, the first 4 principal components cumulatively explain 99.99% of the variance of the raw data matrix (5 × 41); the principal component scores after the transformation are given in Table 1. The score matrix of these 4 principal components (5 × 4) can therefore replace the original variable matrix (5 × 41).
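The correlation screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pearson` and `filter_correlated` are hypothetical, the toy columns are made up, and the representative of each correlated group is simply the first column encountered, whereas the text selects it by loading magnitude. Note also that a mean-centred matrix with only 5 rows has rank at most 4, which is consistent with 4 principal components explaining essentially all of the variance.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    norm_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    norm_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (norm_u * norm_v)


def filter_correlated(columns, threshold=0.90):
    """Return the indices of the columns kept after screening.

    Greedy simplification (an assumption of this sketch): a column is kept
    only if |r| < threshold against every column kept so far, so the first
    member of each highly correlated group survives.
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


# Toy descriptor columns, one value per base (A, C, G, T, U); values are invented.
cols = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [3.0, 5.0, 7.0, 9.0, 11.0],   # exactly 2 * column 0 + 1, so r = 1
    [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with column 0
]
kept = filter_correlated(cols)
```

With real data the 1209 columns would be screened in the same way before the surviving columns are passed to PCA.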
Table 1. Scores of the 4 principal components for the 41 property parameters of the 5 bases
Analysis of the loadings of the 4 principal components shows the following. The largest positive contribution to the 1st principal component comes from the 3rd component symmetry directional WHIM index weighted by atomic mass; WHIM descriptors are 3D geometrical descriptors obtained by PCA of the covariance matrix of the weighted atomic coordinates. Next come descriptors based on structural information content. Both classes can be regarded as steric descriptors. The larger negative contributions come from variables such as the Moran autocorrelation descriptor weighted by atomic polarizability and the distortion energy. For the 2nd principal component, the larger positive contributions come from variables such as the unweighted 3D-MoRSE descriptor components, which characterize molecular 3D structure on the basis of electron diffraction, and the electronic energy; the larger negative contributions come from variables such as the sum of topological distances between nitrogen (N) and oxygen (O) atoms. In the 3rd principal component, the variables with large positive loadings are the 2-path Kier alpha-modified shape index and the Kier flexibility index, both of which are topological descriptors; those with large negative loadings are the mean atomic polarizability (scaled on the carbon atom) and the average molecular weight, both of which are constitutional descriptors. The variable with the larger positive loading on the 4th principal component is the 7th component of the molecular holographic distance vector proposed by our research group. This vector divides atoms into 13 atom types and is obtained from the 2D topological structure of the molecule by further defining atomic attributes and relative bond lengths; its 7th component represents the holographic distance between the atomic environments C- and >N-, >P- (where "-", ">" and "<" denote attachment to 1, 2 and 2 non-hydrogen atoms or chemical bonds, respectively). Large negative loadings are shown by the unweighted 3D-MoRSE descriptor components and the Moran autocorrelation descriptor weighted by atomic polarizability. For convenience, these 4 principal component score vectors are called the generalized base property scores. Because the 4 score vectors capture, from multiple angles, most of the information in the 1209 property parameters of the bases, they can be considered for characterizing nucleotide sequences.
b) Characterize the structures of human gene promoters and non-promoters with the generalized base property scores;
565 human gene promoter sequences and 3819 non-promoter sequences (890 exons and 2929 introns) are selected. Each selected sequence is characterized along the 5′ → 3′ direction with the generalized base property score vector formed by the 4 principal components, so that every base in the sequence is represented by a vector of 4 generalized base property scores. A sequence containing n bases is therefore characterized by n × 4 variables.
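As a minimal sketch of this characterization step, each base can be looked up in a score table and the sequence turned into an n × 4 list. The numeric values below are placeholders invented for illustration, because the actual scores of Table 1 are not reproduced in this text; the names `GBP_SCORES` and `encode` are likewise hypothetical.

```python
# Placeholder score table: the real values come from Table 1, which is not
# available here, so every number below is invented for illustration only.
GBP_SCORES = {
    "A": (0.52, -1.10, 0.31, 0.08),
    "C": (-0.75, 0.42, -0.66, 0.90),
    "G": (1.21, 0.37, -0.12, -0.54),
    "T": (-0.49, 0.15, 0.88, -0.21),
    "U": (-0.49, 0.16, -0.41, -0.23),
}


def encode(sequence):
    """Map a 5'->3' base sequence to an n x 4 list of score vectors."""
    encoded = []
    for base in sequence.upper():
        if base not in GBP_SCORES:
            raise ValueError(f"unsupported base: {base!r}")
        encoded.append(GBP_SCORES[base])
    return encoded


matrix = encode("ACGTTG")  # 6 bases -> 6 rows of 4 scores each
```

Because the number of rows depends on the sequence length, sequences of different lengths still have different numbers of variables at this stage.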
c) Normalize the descriptor variables of each human gene promoter and non-promoter with the auto-cross covariance method;
The descriptor variables of each promoter and non-promoter sequence are processed with the auto-cross covariance (ACC) method. Because this method takes into account all interactions between the base parameters at different sites in the sequence, it minimizes information loss during data transformation. Provided that the shortest sequence in the sample set under study has length l + 1, any sequence containing n bases is processed by ACC as follows:
In the formula, l is the lag (step length); i and i + l are the positions of bases in the sequence; a and b are the indices of the descriptor components for the bases at positions i and i + l, respectively; for the generalized base property score vector, a, b = 1, 2, 3, 4. It can be seen that when all possible lags are computed (1, 2, 3, …, l), sequences of different lengths in the sample set all end up with $4^2 \times l$ descriptors after ACC processing. Here the lag l is set to 6, so every sequence is characterized by $4^2 \times 6 = 96$ variables, and the variables obtained from the ACC processing are used as the independent variables of the promoter recognition model.
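The ACC formula itself is not reproduced in this text. A commonly used form that matches the symbol definitions given here is ACC_ab(l) = Σ_{i=1}^{n-l} z_{a,i} · z_{b,i+l} / (n - l), where z_{a,i} is the a-th score of the base at position i. The sketch below assumes that form (in particular the normalization by n - l and the absence of per-sequence centring are assumptions), and the function name `acc_features` is hypothetical.

```python
def acc_features(Z, max_lag=6):
    """Auto-cross covariance features of an n x d score matrix.

    Assumed form: ACC_ab(lag) = sum_i Z[i][a] * Z[i + lag][b] / (n - lag),
    evaluated for lag = 1..max_lag and every ordered pair (a, b).
    Returns a flat list of d * d * max_lag values.
    """
    n = len(Z)
    d = len(Z[0])
    if n <= max_lag:
        raise ValueError("sequence must contain more than max_lag bases")
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(d):
            for b in range(d):
                total = sum(Z[i][a] * Z[i + lag][b] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


# Two toy score matrices of different lengths (values are arbitrary).
short_seq = [[0.1 * (i + j) for j in range(4)] for i in range(7)]
long_seq = [[0.05 * (i - j) for j in range(4)] for i in range(30)]
short_vec = acc_features(short_seq)
long_vec = acc_features(long_seq)
```

Both vectors have 4 × 4 × 6 = 96 entries regardless of sequence length, which is the property the text relies on to give every sequence the same number of variables.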
d) Build a human gene promoter recognition model with an RBF-kernel support vector machine;
First, two indicator values are defined: "1" denotes promoter samples and "-1" denotes non-promoter samples (exons and introns). This indicator variable is used as the dependent variable of the promoter recognition model, and the human gene promoter recognition model is built with an RBF-kernel support vector machine whose parameters are set to C = 200.0 and $K(x, x_i) = \exp(-0.125\,\lVert x - x_i \rVert^2)$. The following statistics are defined: $A_{cc}$ is the percentage of all samples that are predicted correctly; $S_n$ (sensitivity) is the percentage of promoter samples that are predicted correctly; $S_p$ (specificity) is the percentage of non-promoter samples that are predicted correctly; and MCC is the Matthews correlation coefficient. Under leave-one-out cross-validation, the support vector machine model recognizes the 565 promoters and 3819 non-promoters in the training set with $A_{cc}$ = 83.8, $S_n$ = 67.1, $S_p$ = 86.3 and MCC = 0.442; leave-one-fifth-out cross-validation gives $A_{cc}$ = 81.7, $S_n$ = 66.9, $S_p$ = 83.8 and MCC = 0.406. This shows that the model built by generalized base property score characterization, auto-cross covariance normalization and RBF-kernel support vector machine modelling recognizes human gene promoters well. The support vectors obtained with the leave-one-out and leave-one-fifth-out methods account for 62.1% and 68.3% of all samples, respectively; that is, 37.9% and 31.7% of the samples can be safely deleted without affecting the prediction, which further indicates that the support vector classifier generalizes well to new samples.
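To make the kernel setting concrete, the following sketch evaluates the RBF kernel with coefficient 0.125 and the usual support vector machine decision function f(x) = Σ_i (α_i y_i) K(x, x_i) + b. It does not train a model: the support vectors, dual coefficients and bias below are hypothetical values chosen for illustration, and the function names are not taken from the source. In a trained soft-margin model each coefficient α_i y_i would be bounded in magnitude by C = 200.0.

```python
import math


def rbf_kernel(x, xi, gamma=0.125):
    """K(x, xi) = exp(-gamma * ||x - xi||^2)."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, xi))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.125):
    """Decision value f(x) = sum_i dual_coefs[i] * K(x, sv_i) + bias.

    dual_coefs[i] stands for alpha_i * y_i of a trained model.
    """
    return bias + sum(
        coef * rbf_kernel(x, sv, gamma)
        for coef, sv in zip(dual_coefs, support_vectors)
    )


def classify(x, support_vectors, dual_coefs, bias, gamma=0.125):
    """Return 1 (promoter) or -1 (non-promoter) from the sign of f(x)."""
    value = svm_decision(x, support_vectors, dual_coefs, bias, gamma)
    return 1 if value >= 0 else -1


# Hypothetical two-dimensional toy model (real inputs would have 96 features).
toy_svs = [[0.0, 0.0], [4.0, 4.0]]
toy_coefs = [1.0, -1.0]
toy_bias = 0.0
label_near_first = classify([0.5, 0.0], toy_svs, toy_coefs, toy_bias)
label_near_second = classify([4.0, 3.5], toy_svs, toy_coefs, toy_bias)
```

The text does not name the software used. In a library such as scikit-learn, the corresponding settings would presumably be `SVC(kernel="rbf", C=200.0, gamma=0.125)`.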
Further, a receiver operating characteristic (ROC) curve is plotted with $(1 - S_p)$ on the abscissa (X-axis) and sensitivity ($S_n$) on the ordinate (Y-axis); see Fig. 1. The areas under the curve for the leave-one-out and leave-one-fifth-out validations of the model are 0.835 and 0.819, respectively.
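The statistics and the ROC axes can be tied together with a small sketch that takes $S_n$ as the sensitivity on promoters and $S_p$ as the specificity on non-promoters, in line with the ROC axes above. The confusion-matrix counts below are not reported in the source: they are back-calculated from the rounded leave-one-out percentages (67.1% of 565 promoters, 86.3% of 3819 non-promoters) and should be read as an assumption. The function name `classification_stats` is hypothetical.

```python
import math


def classification_stats(tp, fn, tn, fp):
    """Return (Acc, Sn, Sp, MCC); the first three are percentages."""
    total = tp + fn + tn + fp
    acc = 100.0 * (tp + tn) / total
    sn = 100.0 * tp / (tp + fn)   # sensitivity: promoters found
    sp = 100.0 * tn / (tn + fp)   # specificity: non-promoters rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc


# Counts reconstructed from the rounded percentages (an assumption).
n_promoters, n_non_promoters = 565, 3819
tp = round(0.671 * n_promoters)       # correctly recognized promoters
tn = round(0.863 * n_non_promoters)   # correctly recognized non-promoters
fn = n_promoters - tp
fp = n_non_promoters - tn

acc, sn, sp, mcc = classification_stats(tp, fn, tn, fp)
roc_point = (100.0 - sp, sn)  # one operating point: (1 - Sp, Sn) in percent
```

With these reconstructed counts the computed accuracy and MCC agree with the reported leave-one-out values to within rounding, which supports reading $S_n$ as the promoter-side statistic. A full ROC curve would need the decision values of all samples, not just this single operating point.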
To further verify the prediction performance of the invented method for human gene promoters, 100 promoters and 100 intron sequences different from those in the training set were selected from the EPD database (http://www.epd.isb-sib.ch/) and predicted with the RBF-kernel support vector machine model; the results are listed in Table 2. Seven prediction servers were also used to predict the same 200 sequences for comparison. The comparison shows that the method of the invention gives the highest $S_n$ and MCC, indicating a clear advantage in predicting human gene promoters.
Table 2. Comparison of prediction results for human gene promoters
The above is only a preferred embodiment of the present invention and is not intended to limit it. Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.
Claims (5)
1. A method for recognizing human gene promoters, characterized by comprising the following steps:
a) constructing a generalized base property score representation system based on principal component analysis (PCA);
b) characterizing the structures of human gene promoters and non-promoters with the generalized base property scores;
c) normalizing the descriptor variables of each human gene promoter and non-promoter with the auto-cross covariance method;
d) building a human gene promoter recognition model with a radial basis function (RBF) kernel support vector machine.
2. The method for recognizing human gene promoters according to claim 1, characterized in that step a) specifically comprises the following steps:
a1) selecting 1209 0D-3D property parameters of the 5 bases;
a2) performing a correlation analysis on the 1209 property parameters and selecting 41 of them;
a3) processing the resulting base property parameters with PCA to obtain 4 principal components;
a4) calculating the score of each principal component and defining the resulting score vectors as the generalized base property scores.
3. The method for recognizing human gene promoters according to claim 2, characterized in that step b) specifically comprises: characterizing the sequences of human gene promoters and non-promoters with the generalized base property score vector formed by the 4 principal components, each base in a sequence being represented by a vector of 4 generalized base property scores.
4. The method for recognizing human gene promoters according to claim 3, characterized in that step c) specifically comprises: processing the descriptor variables of each promoter and non-promoter sequence with the auto-cross covariance method, with the lag (step length) l set to 6, so that every sequence has the same number of descriptor variables, and using the variables obtained from the auto-cross covariance processing as the independent variables of the promoter recognition model.
5. The method for recognizing human gene promoters according to any one of claims 1 to 4, characterized in that step d) specifically comprises: first defining two indicator values, with "1" denoting promoter samples and "-1" denoting non-promoter samples; using this indicator variable as the dependent variable of the promoter recognition model; and building the human gene promoter recognition model with an RBF-kernel support vector machine.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008100699415A CN101307359A (en) | 2008-07-08 | 2008-07-08 | Process for recognising human gene promoter |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008100699415A CN101307359A (en) | 2008-07-08 | 2008-07-08 | Process for recognising human gene promoter |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101307359A true CN101307359A (en) | 2008-11-19 |
Family
ID=40124037
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008100699415A Pending CN101307359A (en) | 2008-07-08 | 2008-07-08 | Process for recognising human gene promoter |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101307359A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324002A (en) * | 2011-06-03 | 2012-01-18 | 哈尔滨工程大学 | Two-dimensional image representation method of digital image processing-based DNA sequence |
CN102324002B (en) * | 2011-06-03 | 2013-10-30 | 哈尔滨工程大学 | Two-dimensional image representation method of digital image processing-based DNA sequence |
CN104834834A (en) * | 2015-04-09 | 2015-08-12 | 苏州大学张家港工业技术研究院 | Construction method and device of promoter recognition system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pan et al. | Predicting RNA–protein binding sites and motifs through combining local and global deep convolutional neural networks | |
Liu et al. | iRSpot-EL: identify recombination spots with an ensemble learning approach | |
CN107038348B (en) | Drug target prediction method based on protein-ligand interaction fingerprint | |
Degroeve et al. | SpliceMachine: predicting splice sites from high-dimensional local context representations | |
Di Lena et al. | Deep architectures for protein contact map prediction | |
Ji et al. | Identifying time-lagged gene clusters using gene expression data | |
Zhu et al. | Improving protein fold recognition by extracting fold-specific features from predicted residue–residue contacts | |
Iqbal et al. | Assessing the performance of computational predictors for estimating protein stability changes upon missense mutations | |
O'Flanagan et al. | Non-additivity in protein–DNA binding | |
CN111402967B (en) | Method for improving virtual screening capability of docking software based on machine learning algorithm | |
Agostini et al. | SeAMotE: a method for high-throughput motif discovery in nucleic acid sequences | |
CN107194207A (en) | Protein ligands binding site estimation method based on granularity support vector machine ensembles | |
KR101888628B1 (en) | Method and Media of Predicting protein-binding regions in RNA Using Nucleotide Profiles and Compositions | |
Hecker et al. | The adapted Activity-By-Contact model for enhancer–gene assignment and its application to single-cell data | |
Palin et al. | Locating potential enhancer elements by comparative genomics using the EEL software | |
Benegas et al. | GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction | |
Baten et al. | Fast splice site detection using information content and feature reduction | |
CN113823356A (en) | Methylation site identification method and device | |
CN101307359A (en) | Process for recognising human gene promoter | |
Udaka et al. | Empirical evaluation of a dynamic experiment design method for prediction of MHC class I-binding peptides | |
Xiao et al. | PAI-SAE: predicting adenosine to inosine editing sites based on hybrid features by using spare auto-encoder | |
Ye et al. | Interpreting and visualizing ChIP-seq data with the seqMINER software | |
Lu et al. | Prediction for human transcription start site using diversity measure with quadratic discriminant | |
Gopal et al. | A computational investigation of kinetoplastid trans-splicing | |
Mihalek et al. | A structure and evolution-guided Monte Carlo sequence selection strategy for multiple alignment-based analysis of proteins |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Open date: 20081119 |