CN101307359A - Process for recognising human gene promoter - Google Patents


Publication number
CN101307359A
Authority
CN
China
Prior art keywords
promotor
promoter
human gene
gene promoter
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100699415A
Other languages
Chinese (zh)
Inventor
梁桂兆
舒茂
梅虎
杨力
李志良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CNA2008100699415A
Publication of CN101307359A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for recognising human gene promoters, which can be used to locate human gene promoter regions, to annotate their structure and function, and to discover new, unknown genes. The method comprises the following steps: (a) establishing a generalized base property score representation system based on principal component analysis; (b) using the generalized base property scores to characterize the structures of human gene promoters and non-promoters; (c) normalizing the characterization variables of each promoter and non-promoter with the auto cross covariance method; and (d) building a human gene promoter recognition model with a radial basis function kernel support vector machine.

Description

A method for recognising human gene promoters
Technical field
The present invention relates to a method for recognising human genes, and in particular to a method for recognising human gene promoters.
Background art
The completion of the draft human genome has greatly accelerated the analysis of the whole genome. For the transcriptional activity of each gene, the promoter is an important regulatory region. Locating promoter regions and annotating their structure and function is the basis for understanding gene expression patterns, gene regulatory networks, cell differentiation and development. Promoter prediction is important for discovering new, unknown genes and for improving expression vectors or gene delivery systems in gene therapy. Promoter prediction has therefore attracted wide attention. Prediction programs are built on different concepts, but the underlying principle is that the properties of promoter regions differ from those of other genomic DNA; these concepts include signal-based and content-based approaches. Computational prediction and recognition of promoters is a challenging task: the diversity of promoters and the limited understanding of transcriptional regulation make the work very difficult. Homology alignment algorithms are well established for comparing nucleotide sequences, but their application to promoter prediction is still in its infancy. Although homologous promoters can be clustered by alignment, the sequence conservation of homologous promoter elements is far lower than that of the corresponding coding sequences, so in most cases similarity searching provides no useful clue to their function (Duret et al., Curr. Opin. Struct. Biol., 1997, 7:399). In addition, many promoters are regulated by multiple signalling pathways, and the need to respond specifically to different stimuli makes promoter organisation more complex and varied. Sometimes even promoters regulated by the same signalling pathway share no sequence homology at all (Kirchhamer et al., Proc. Natl. Acad. Sci. U.S.A., 1996, 93:9322). Moreover, promoters contain many sequence features such as transcription factor binding sites, but these features are not unique to promoters; they are scattered throughout the genome, and filtering out this abundant noise is a further difficulty for computational promoter prediction in large genomic fragments (Sap et al., Nature, 1989, 340:242; Bohjanen et al., Nucleic Acids Res., 1997, 25:4481; Wang et al., Proc. Natl. Acad. Sci. U.S.A., 1998, 95:492). Some programs describe promoter sequence features using experimentally determined transcription factor binding properties and use them as the basis for prediction, but in practice the results are unsatisfactory, with many missed promoters and many false positives.
Summary of the invention
In view of this, in order to solve the above problems of promoter prediction, the invention provides a method for recognising human gene promoters that can be used to locate human gene promoter regions, to annotate their structure and function, and to discover new, unknown genes.
The object of the invention is achieved as follows. A method for recognising human gene promoters comprises the following steps:
a) establishing a generalized base property score representation system based on principal component analysis;
b) using the generalized base property scores to characterize the structures of human gene promoters and non-promoters;
c) normalizing the characterization variables of each human gene promoter and non-promoter with the auto cross covariance method;
d) building a human gene promoter recognition model with a radial basis function kernel support vector machine.
Further, step a) specifically comprises the following steps:
a1) selecting 1209 0D-3D property parameters of the 5 bases;
a2) performing correlation analysis on the 1209 property parameters and retaining 41 of them;
a3) processing the retained base property parameters by principal component analysis (PCA) to obtain 4 principal components;
a4) calculating the score of each principal component and defining the score vectors as the generalized base property scores.
Further, step b) specifically comprises: characterizing the sequences of human gene promoters and non-promoters along the 5' → 3' direction with the 4 principal components of the generalized base property score vector, each base being represented by a vector of 4 generalized base property scores.
Further, step c) specifically comprises: processing the characterization variables of each promoter and non-promoter sequence by auto cross covariance with the maximum step size l set to 6, so that every sequence has the same number of characterization variables, and using the variables obtained by the auto cross covariance transform as the independent variables of the promoter recognition model.
Further, step d) specifically comprises: first defining two indicator values, "1" for promoter samples and "-1" for non-promoter samples, taking this indicator as the dependent variable of the promoter recognition model, and building the human gene promoter recognition model with a radial basis function kernel support vector machine.
In the method of the invention, the chosen generalized base property scores carry a large amount of information, have a clear physicochemical meaning, characterize sequences effectively, give results that are easy to interpret, extend well and are simple to use. Normalizing the characterization variables of each promoter and non-promoter by auto cross covariance largely avoids loss of the original variable information while taking full account of the interactions and mutual influence between neighbouring bases. Through the kernel function technique, the radial basis function kernel support vector machine relates the auto cross covariance transformed sequence variables to the observed class values well, effectively prevents over-fitting, and yields a model with good generalization performance.
Other advantages, objects and features of the invention will be set forth in part in the description that follows, and in part will be apparent to those skilled in the art on examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by the structure particularly pointed out in the following description, the claims and the accompanying drawing.
Description of drawings
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawing, in which:
Fig. 1 is a receiver operating characteristic (ROC) analysis plot of the recognition results of the support vector machine model of the invention.
Detailed description of the embodiments
The use of the method of the invention for human gene promoter recognition is described in detail below, by way of example, with reference to the accompanying drawing. The method comprises the following steps:
a) establishing a generalized base property score representation system based on principal component analysis;
1209 property parameters of the 5 bases (A, C, G, T and U) were collected, covering: constitutional descriptors, functional group counts, atom-centred fragments and molecular properties, the molecular electronegativity distance vector (MEDV), the molecular holographic distance vector (MHDV), topological descriptors, walk and path counts, connectivity indices, information indices, autocorrelations, edge adjacency indices, Burden eigenvalues, topological charge indices, eigenvalue-based indices, Randic molecular profiles, geometrical descriptors, radial distribution function (RDF) descriptors based on different interatomic distances, 3D-MoRSE descriptors (molecular structure representation based on electron diffraction), weighted holistic invariant molecular (WHIM) descriptors, and geometry, topology and atom-weights assembly (GETAWAY) descriptors. Other related properties were also included, such as the highest occupied molecular orbital (HOMO) energy, the dipole moment and the Wiener index.
Principal component analysis was used to compress the number of descriptors. To prevent severe multicollinearity among the variables from distorting the principal components, a correlation analysis was first performed on the 1209 original variables. Within each group of variables whose correlation coefficient was greater than or equal to 0.90, one variable was retained according to the magnitude of its loading in the original variable matrix and the others were deleted. This left 41 variables, which mainly reflect the following information about the bases: average molecular weight, number of multiple bonds, average aromatic polarization, average electrotopological state, total electronic energy, thermodynamic properties, Moriguchi octanol-water partition coefficient (logP), number of urea derivatives, number of hydrogen-bond acceptor atoms (N, O, F), E-state topological parameter, Kier flexibility index, highest occupied molecular orbital (HOMO) energy, the molecular holographic distance vector, dipole moment, distortion energy and spatial structure. After the principal component transform, the first 4 principal components of the 41 variables cumulatively explain 99.99% of the variance of the original data matrix (5 × 41); the principal component scores are given in Table 1. The 4-component score matrix (5 × 4) can therefore replace the original variable matrix (5 × 41).
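The correlation pruning described above can be sketched as follows. This is a minimal illustration, not the inventors' procedure: the function names are hypothetical, the threshold is applied to the absolute value of the Pearson coefficient (an assumption), and the first variable of each correlated group is kept, because the loading-based choice of the survivor is not described in enough detail to reproduce.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated
    return sxy / sqrt(sxx * syy)

def prune_correlated(columns, threshold=0.90):
    """Return the indices of the columns kept after pruning.

    A column is dropped when |r| >= threshold against any column already
    kept. Assumption: the first column of each correlated group survives.
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept

# Toy data only: 5 rows (one per base) and 4 descriptors, stored column-wise.
toy_columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],      # perfectly correlated with column 0
    [5.0, 1.0, 4.0, 2.0, 3.0],       # weakly correlated with column 0
    [-1.0, -2.0, -3.0, -4.0, -5.0],  # perfectly anti-correlated with column 0
]
kept_indices = prune_correlated(toy_columns)
```

With real data, `columns` would hold the 1209 descriptor columns and the surviving indices would identify the retained variables.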
Table 1. Scores of the 4 principal components of the 41 property parameters of the 5 bases
Figure A20081006994100061
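As a hedged illustration of how a score matrix such as Table 1 arises, the compression step can be written in standard PCA notation. The symbols X, X_c, p_k, P_4, T and λ_k are introduced here for illustration and do not appear in the patent; whether the 41 columns were also scaled to unit variance is not stated, so only column centring is assumed.

$$
X \in \mathbb{R}^{5 \times 41}, \qquad X_c = X - \mathbf{1}\,\bar{x}^{\top}, \qquad \frac{1}{5-1}\, X_c^{\top} X_c \, p_k = \lambda_k \, p_k, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge 0
$$

$$
T = X_c P_4 \in \mathbb{R}^{5 \times 4}, \qquad P_4 = [\, p_1 \;\; p_2 \;\; p_3 \;\; p_4 \,], \qquad \frac{\sum_{k=1}^{4} \lambda_k}{\sum_{k} \lambda_k} \approx 0.9999
$$

Each row of T is the 4-score vector of one base. A column-centred matrix with 5 rows has rank at most 4, so four components can account for essentially all of its variance, which is consistent with the 99.99% reported.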
Analysis of the loadings of the 4 principal components shows the following. The largest positive contribution to the 1st principal component comes from the 3rd-component symmetry directional WHIM index weighted by atomic mass. WHIM descriptors are 3D geometrical descriptors obtained by PCA of the covariance matrix of the weighted atomic coordinates. The next largest contribution comes from descriptors based on structural information content. Both classes can be regarded as steric descriptors. Large negative contributions come from variables such as the Moran autocorrelation descriptor weighted by atomic polarizability and the distortion energy. Large positive contributions to the 2nd principal component come from variables such as the unweighted 3D-MoRSE descriptor components, which characterize the 3D molecular structure on the basis of electron diffraction, and the electronic energy. Large negative contributions come from variables such as the sum of topological distances between nitrogen (N) and oxygen (O) atoms. In the 3rd principal component, the variables with large positive loadings are the 2-path Kier alpha-modified shape index and the Kier flexibility index, both of which are topological descriptors. Also carrying large loadings are the mean atomic polarization (scaled on the carbon atom) and the average molecular weight, both of which are constitutional descriptors. The variable most positively correlated with the loading of the 4th principal component is the 7th component of the molecular holographic distance vector proposed by this research group. The holographic distance vector divides atoms into 13 atom types and then defines atom attribution and relative bond lengths to give descriptors based on the 2D topological structure of the molecule; the 7th component represents the holographic distance between the atomic environment C- and >N-, >P- ("-", ">" and "<" indicate that 1, 2 and 2 non-hydrogen atoms or chemical bonds, respectively, are attached). Large negative correlations are shown by the unweighted 3D-MoRSE descriptor components and the Moran autocorrelation descriptor weighted by atomic polarizability. For convenience, these 4 principal component score vectors are called the generalized base property scores. Because the 4 score vectors combine, from several angles, most of the information in the 1209 base property parameters, they can be tried for characterizing nucleotide sequences.
b) using the generalized base property scores to characterize the structures of human gene promoters and non-promoters;
565 human gene promoter sequences and 3819 non-promoter sequences (890 exons and 2929 introns) were selected. Each selected sequence was characterized along the 5' → 3' direction with the 4 principal components of the generalized base property score vector, each base in the sequence being represented by a vector of 4 generalized base property scores. A sequence containing n bases is thus characterized by n × 4 variables.
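The encoding step can be sketched as a simple table lookup. The score values below are hypothetical placeholders, because Table 1 is available only as an image; `PLACEHOLDER_SCORES` and `encode_sequence` are illustrative names, not part of the patent.

```python
# Hypothetical placeholder scores. These are NOT the values of Table 1,
# which is only available as an image in the source document.
PLACEHOLDER_SCORES = {
    "A": (0.1, 0.2, 0.3, 0.4),
    "C": (-0.1, 0.5, -0.2, 0.0),
    "G": (0.7, -0.3, 0.1, -0.4),
    "T": (-0.6, -0.1, 0.2, 0.3),
    "U": (-0.5, -0.2, 0.2, 0.1),
}

def encode_sequence(seq, scores=PLACEHOLDER_SCORES):
    """Map a sequence read 5' -> 3' to n rows of 4 scores each.

    Raises KeyError for symbols without a score vector (for example N).
    """
    return [scores[base] for base in seq.upper()]

encoded = encode_sequence("acgt")
n_variables = sum(len(row) for row in encoded)  # n x 4 variables in total
```

A sequence of n bases therefore yields n × 4 variables, so sequences of different lengths have different numbers of variables until the next step makes them uniform.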
c) normalizing the characterization variables of each human gene promoter and non-promoter with the auto cross covariance method;
The characterization variables of each promoter and non-promoter sequence were processed by auto cross covariance (ACC). This method considers all interactions between the base parameters at different sites of the sequence, and therefore minimizes information loss during data transformation. If the shortest sequence in the sample set has length l+1, then for any sequence containing n bases the ACC is computed as follows:
$$\mathrm{ACC}_{a,b,l} = \frac{\sum_{i=1}^{n-l} Z_{a,i} \times Z_{b,i+l}}{n-l}, \quad (l = 1, 2, 3, \ldots, l)$$
In the formula, l is the step size (lag); i and i+l are base positions in the sequence; a and b are the descriptor component indices of the i-th and (i+l)-th bases, respectively, and for the generalized base property score vector a, b = 1, 2, 3, 4. It can be seen that when all possible step sizes are computed (l = 1, 2, 3, ..., l), every sequence in the sample set, whatever its length, ends up with 4^2 × l descriptors after ACC processing. Here the maximum step size l is chosen as 6, so each sequence is characterized by 4^2 × 6 = 96 variables, and the variables obtained by the auto cross covariance transform are used as the independent variables of the promoter recognition model.
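The ACC formula above can be sketched in a few lines. The function name `acc_transform`, the parameter `max_lag` and the ordering of the output (by lag, then a, then b) are assumptions made for illustration; the products are averaged exactly as in the formula, without any extra mean-centring.

```python
def acc_transform(z, max_lag=6):
    """Auto cross covariance of an n x d score matrix.

    z is a list of n rows, each holding d descriptor components. For every
    lag l = 1..max_lag and every ordered pair (a, b) of components, the
    product z[i][a] * z[i + l][b] is averaged over the n - l valid
    positions. Returns d * d * max_lag values.
    """
    n = len(z)
    d = len(z[0])
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(d):
            for b in range(d):
                total = sum(z[i][a] * z[i + lag][b] for i in range(n - lag))
                features.append(total / (n - lag))
    return features

# Toy inputs of different lengths (7 and 50 positions, 4 scores each).
seq_short = [[1.0, 2.0, 3.0, 4.0]] * 7
seq_long = [[1.0, 2.0, 3.0, 4.0]] * 50
features_short = acc_transform(seq_short)
features_long = acc_transform(seq_long)
```

Both toy inputs give 4 × 4 × 6 = 96 values, which is the point of the transform: the number of model inputs no longer depends on sequence length.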
d) building a human gene promoter recognition model with a radial basis function kernel support vector machine;
First, two indicator values are defined: "1" denotes a promoter sample and "-1" denotes a non-promoter sample (exon or intron). With this indicator as the dependent variable of the promoter recognition model, the human gene promoter recognition model is built with a radial basis function kernel support vector machine whose parameters are set to C = 200.0 and K(x, x_i) = exp(-0.125||x - x_i||^2). Define A_cc as the percentage of all samples that are predicted correctly, S_n as the percentage of promoter samples predicted correctly (sensitivity), S_p as the percentage of non-promoter samples predicted correctly (specificity), and MCC as the Matthews correlation coefficient. In leave-one-out cross-validation on the 565 promoters and 3819 non-promoters of the training set, the support vector machine model gives A_cc = 83.8, S_n = 67.1, S_p = 86.3 and MCC = 0.442; leave-one-fifth-out (5-fold) cross-validation gives A_cc = 81.7, S_n = 66.9, S_p = 83.8 and MCC = 0.406. This shows that the model built by generalized base property score characterization, auto cross covariance normalization and radial basis function kernel support vector machine modelling recognises human gene promoters well. The support vectors obtained by leave-one-out and leave-one-fifth-out cross-validation account for 62.1% and 68.3% of all samples, respectively; that is, 37.9% and 31.7% of the samples can be safely deleted without affecting the prediction, which further indicates that the support vector classifier generalizes well to new samples.
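The kernel and the resulting decision rule can be sketched as below. This is an illustration under stated assumptions: the exponent is taken with the usual negative sign, the helper names are hypothetical, and the two-support-vector model is invented toy data. Training, where the penalty C = 200.0 is used, is outside the sketch.

```python
from math import exp

GAMMA = 0.125  # kernel width given in the text

def rbf_kernel(x, xi, gamma=GAMMA):
    """K(x, xi) = exp(-gamma * ||x - xi||^2)."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, xi))
    return exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=GAMMA):
    """Return 1 (promoter) or -1 (non-promoter) for feature vector x.

    dual_coefs[i] stands for alpha_i * y_i of support vector i, as any
    SVM trainer would produce.
    """
    score = bias + sum(
        c * rbf_kernel(x, sv, gamma)
        for c, sv in zip(dual_coefs, support_vectors)
    )
    return 1 if score >= 0 else -1

# Hypothetical toy model with two support vectors, for illustration only.
toy_svs = [[0.0, 0.0], [4.0, 4.0]]
toy_coefs = [1.0, -1.0]
label_near_first = svm_decision([0.5, 0.0], toy_svs, toy_coefs, bias=0.0)
label_near_second = svm_decision([4.0, 3.5], toy_svs, toy_coefs, bias=0.0)
```

In the method itself, x would be the 96-dimensional ACC vector of a sequence, and the support vectors and coefficients would come from training on the 565 promoters and 3819 non-promoters.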
Further, a receiver operating characteristic (ROC) curve is drawn with (1 - S_p) on the horizontal axis (X-axis) and sensitivity (S_n) on the vertical axis (Y-axis); see Fig. 1. The areas under the curve of the model for leave-one-out and leave-one-fifth-out cross-validation are 0.835 and 0.819, respectively.
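One way to obtain such an area is to sweep a threshold over the decision values and integrate the resulting curve with the trapezoidal rule. The patent does not say how the area was computed, so the sketch below is an assumption, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule.

    scores: decision values, larger meaning more promoter-like.
    labels: 1 for promoter, -1 for non-promoter.
    The curve plots sensitivity against 1 - specificity.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")
    # Lower the threshold step by step, handling tied scores together.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy decision values: perfectly separated, and completely tied.
auc_perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1])
auc_tied = roc_auc([0.5, 0.5, 0.5, 0.5], [1, -1, 1, -1])
```

A perfectly separating classifier gives an area of 1 and an uninformative one gives 0.5, which is the scale on which the reported areas should be read.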
To further verify the predictive performance of the method for human gene promoters, 100 promoter sequences and 100 intron sequences different from those in the training set were selected from the EPD database (http://www.epd.isb-sib.ch/) and predicted with the radial basis function kernel support vector machine model; the results are listed in Table 2. Seven prediction servers were also used to predict the same 200 sequences for comparison. The comparison shows that the method of the invention gives the highest S_n and MCC, indicating a clear advantage in predicting human gene promoters.
Table 2. Comparison of human gene promoter prediction results
Figure A20081006994100091
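The statistics used in the comparison can be computed from the four confusion-matrix counts. The sketch below takes S_n as the sensitivity for promoters, as in the ROC description, and S_p as the specificity for non-promoters; the function name and the toy counts are illustrative and are not results from the patent.

```python
from math import sqrt

def classification_stats(tp, fn, tn, fp):
    """Return (A_cc, S_n, S_p, MCC); the first three are percentages.

    tp / fn: promoters predicted correctly / missed.
    tn / fp: non-promoters predicted correctly / wrongly called promoters.
    """
    total = tp + fn + tn + fp
    acc = 100.0 * (tp + tn) / total
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

# Toy counts for a 100 + 100 test set: a perfect and a chance-level predictor.
stats_perfect = classification_stats(tp=100, fn=0, tn=100, fp=0)
stats_chance = classification_stats(tp=50, fn=50, tn=50, fp=50)
```

MCC is useful alongside accuracy because it stays near zero for a chance-level predictor even when the two classes are very unequal in size, as they are in the training set.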
The above are only preferred embodiments of the invention and are not intended to limit it. Those skilled in the art can obviously make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (5)

1. A method for recognising human gene promoters, characterized by comprising the following steps:
a) constructing a generalized base property score representation system based on principal component analysis;
b) using the generalized base property scores to characterize the structures of human gene promoters and non-promoters;
c) normalizing the characterization variables of each human gene promoter and non-promoter with the auto cross covariance method;
d) building a human gene promoter recognition model with a radial basis function kernel support vector machine.
2. The method for recognising human gene promoters according to claim 1, characterized in that step a) specifically comprises the following steps:
a1) selecting 1209 0D-3D property parameters of the 5 bases;
a2) performing correlation analysis on the 1209 property parameters and retaining 41 of them;
a3) processing the retained base property parameters by principal component analysis (PCA) to obtain 4 principal components;
a4) calculating the score of each principal component and defining the score vectors as the generalized base property scores.
3. The method for recognising human gene promoters according to claim 2, characterized in that step b) specifically comprises: characterizing the sequences of human gene promoters and non-promoters with the 4 principal components of the generalized base property score vector, each base in the sequence being represented by a vector of 4 generalized base property scores.
4. The method for recognising human gene promoters according to claim 3, characterized in that step c) specifically comprises: processing the characterization variables of each promoter and non-promoter sequence by auto cross covariance with the maximum step size l set to 6, so that every sequence has the same number of characterization variables, and using the variables obtained by the auto cross covariance transform as the independent variables of the promoter recognition model.
5. The method for recognising human gene promoters according to any one of claims 1 to 4, characterized in that step d) specifically comprises: first defining two indicator values, "1" for promoter samples and "-1" for non-promoter samples, taking this indicator as the dependent variable of the promoter recognition model, and building the human gene promoter recognition model with a radial basis function kernel support vector machine.
CNA2008100699415A 2008-07-08 2008-07-08 Process for recognising human gene promoter Pending CN101307359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100699415A CN101307359A (en) 2008-07-08 2008-07-08 Process for recognising human gene promoter

Publications (1)

Publication Number Publication Date
CN101307359A true CN101307359A (en) 2008-11-19

Family

ID=40124037

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100699415A Pending CN101307359A (en) 2008-07-08 2008-07-08 Process for recognising human gene promoter

Country Status (1)

Country Link
CN (1) CN101307359A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324002A (en) * 2011-06-03 2012-01-18 哈尔滨工程大学 Two-dimensional image representation method of digital image processing-based DNA sequence
CN102324002B (en) * 2011-06-03 2013-10-30 哈尔滨工程大学 Two-dimensional image representation method of digital image processing-based DNA sequence
CN104834834A (en) * 2015-04-09 2015-08-12 苏州大学张家港工业技术研究院 Construction method and device of promoter recognition system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20081119