CN101307359A - A method for identifying human gene promoters - Google Patents

Info

Publication number
CN101307359A
Authority
CN
China
Prior art keywords
promoter
human gene
promotor
base
variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100699415A
Other languages
Chinese (zh)
Inventor
梁桂兆
舒茂
梅虎
杨力
李志良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CNA2008100699415A
Publication of CN101307359A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying human gene promoters. It can be used to locate human gene promoter regions, to interpret their structure and function, and to discover new, unknown genes. The method comprises the following steps: a) establishing a characterization system of generalized base property scores based on principal component analysis; b) characterizing the structures of human gene promoters and non-promoters with the generalized base property scores; c) normalizing the characterization variables of each promoter and non-promoter with the auto-cross covariance method; d) building a human gene promoter recognition model with a radial basis function (RBF) kernel support vector machine.

Description

A method for identifying human gene promoters

Technical field

The invention relates to a method for identifying human genes, and in particular to a method for identifying human gene promoters.

Background

The completion of the draft human genome has accelerated the analysis of the entire genome. The promoter is an important regulatory region for the transcriptional activity of each gene. Locating promoter regions and interpreting their structure and function is the basis for understanding gene expression patterns, gene regulatory networks, cell differentiation, and development. Promoter prediction is crucial for discovering new, unknown genes and for improving expression vectors or gene delivery systems in gene therapy. It has attracted wide attention. Prediction programs are built on different concepts, including signal-based and content-based approaches; the underlying principle is that the properties of promoter regions differ from those of other genomic DNA. Computational prediction and recognition of promoters is challenging: the diversity of promoters and the limited understanding of transcriptional regulation mechanisms make the work very difficult.

Homology alignment algorithms are already used to compare nucleotide sequences, but their use for promoter prediction is still in its infancy. Alignment algorithms can cluster homologous promoters, but in most cases the promoter elements of homologous genes are far less conserved in sequence than the coding sequences, so similarity searches give no useful clues for identifying their function (Duret et al., Curr. Opin. Struct. Biol., 1997, 7: 399). In addition, many promoters are regulated by multiple signaling pathways, and the need to respond specifically to different stimuli makes the organization of promoters more complex and diverse. Sometimes even promoters regulated by the same signaling pathway share no sequence homology at all (Kirchhamer et al., Proc. Natl. Acad. Sci. U.S.A., 1996, 93: 9322). Furthermore, promoters contain many sequence features, such as transcription factor binding sites, that are not unique to promoters but are scattered throughout the genome. Filtering out this large number of noise signals is another difficulty for the computational prediction of promoters in large genomic fragments (Sap et al., Nature, 1989, 340: 242; Bohjanen et al., Nucleic Acids Res., 1997, 25: 4481; Wang et al., Proc. Natl. Acad. Sci. U.S.A., 1998, 95: 492). Some programs describe the sequence features of promoters from experimentally determined transcription factor binding properties and use them as the basis for promoter prediction, but the results are not very satisfactory: both missed promoters and false positives are frequent.

Summary of the invention

In view of this, and to solve the problems of promoter prediction described above, the invention provides a method for identifying human gene promoters that can be used to locate human gene promoter regions, to interpret their structure and function, and to discover new, unknown genes.

The object of the invention is achieved by a method for identifying human gene promoters comprising the following steps:

a) establishing a characterization system of generalized base property scores based on principal component analysis;

b) characterizing the structures of human gene promoters and non-promoters with the generalized base property scores;

c) normalizing the characterization variables of each human gene promoter and non-promoter with the auto-cross covariance method;

d) building a human gene promoter recognition model with an RBF kernel support vector machine.

Further, step a) specifically comprises the following steps:

a1) selecting 1209 0D-3D property parameters of the 5 bases;

a2) performing a correlation analysis on the 1209 property parameters and selecting 41 of them;

a3) processing the selected base property parameters by principal component analysis to obtain 4 principal components;

a4) calculating the score of each principal component and defining the score vectors as the generalized base property scores.

Further, step b) specifically comprises: characterizing the sequences of human gene promoters and non-promoters in the 5'→3' direction with the 4 principal components of the generalized base property score vectors, each base being represented by its 4 generalized property scores.

Further, step c) specifically comprises: processing the characterization variables of each promoter and non-promoter sequence by auto-cross covariance with the step size l set to 6, so that every sequence has the same number of characterization variables, and using the resulting variables as the independent variables of the promoter recognition model.

Further, step d) specifically comprises: first defining two indicator values, "1" for promoter samples and "-1" for non-promoter samples, as the dependent variable of the promoter recognition model, and then building the human gene promoter recognition model with an RBF kernel support vector machine.

In the method of the invention, the selected generalized base property scores carry a large amount of information, have a clear physicochemical meaning and strong characterization ability, give results that are easy to interpret, extend well, and are simple to use. Normalizing the characterization variables of each promoter and non-promoter with the auto-cross covariance method largely avoids the loss of information in the original variables while fully accounting for the interactions and mutual influence between neighboring bases. Through the kernel function technique, the RBF kernel support vector machine relates the auto-cross-covariance-transformed sequence variables to the observed class values well and effectively prevents overfitting, and the resulting model generalizes well.

Other advantages, objects, and features of the invention will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structure particularly pointed out in the description, the claims, and the accompanying drawing.

Brief description of the drawings

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawing, in which:

Fig. 1 is a schematic diagram of the receiver operating characteristic (ROC) analysis of the recognition results of the support vector machine model of the invention.

Detailed description

The use of the method of the invention for human gene promoter recognition is described in detail below as an example, with reference to the accompanying drawing. It comprises the following steps:

a) establishing a characterization system of generalized base property scores based on principal component analysis;

1209 property parameters were collected for the 5 bases (A, C, G, T, and U). They include: constitutional descriptors, functional group counts, atom-centred fragments, molecular properties, the molecular electronegativity distance vector (MEDV), the molecular holographic distance vector (MHDV), topological descriptors, walk and path counts, connectivity indices, information indices, autocorrelations, edge adjacency indices, Burden eigenvalues, topological charge indices, eigenvalue-based indices, Randic molecular profiles, geometrical descriptors, radial distribution function (RDF) descriptors based on interatomic distances, 3D-MoRSE descriptors (molecule representation of structures based on electron diffraction), weighted holistic invariant molecular (WHIM) descriptors, and GETAWAY (geometry, topology, and atom-weights assembly) descriptors. Other relevant properties are also included, such as the highest occupied molecular orbital (HOMO) energy, the dipole moment, and the Wiener index.

Principal component analysis was used to reduce the number of descriptors. To keep severe multicollinearity among the variables from distorting the principal components, a correlation analysis was first performed on the 1209 original variables. Within each group of variables whose correlation coefficients were greater than or equal to 0.90, one variable was retained according to its loading in the original variable matrix and the others were deleted, leaving 41 variables. These mainly reflect the following information about the bases: average molecular weight, number of multiple bonds, average aromatic polarizability, mean electrotopological state, total electronic energy, thermodynamic properties, Moriguchi octanol-water partition coefficient (logP), number of urea derivatives, number of hydrogen-bond acceptor atoms (N, O, F), E-state topological parameters, Kier flexibility index, HOMO energy, molecular holographic distance vector, dipole moment, torsional energy, and spatial structure. After principal component analysis of the 41 variables, the first 4 principal components together explain 99.99% of the variance of the original data matrix (5×41). The resulting principal component scores are listed in Table 1. The 4-component score matrix (5×4) can therefore replace the original variable matrix (5×41).
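The correlation pruning that precedes the principal component analysis can be sketched as follows. This is only an illustration under stated assumptions: the function names `pearson` and `prune_correlated` are hypothetical, the toy columns are invented, the absolute value of the correlation is used, and the first variable of each correlated group is kept, whereas the description chooses the retained variable by its loading. The PCA step itself is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def prune_correlated(columns, threshold=0.90):
    """Return the indices of the descriptor columns that survive pruning.

    columns: one list per descriptor, holding one value per base.
    Assumption: a column is dropped when |r| >= threshold against any
    column already kept (first-come-first-kept, not loading-based).
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept

# Invented toy data: 4 descriptors measured on 5 bases.
toy_columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 4.0, 6.2, 8.0, 10.1],      # almost proportional to column 0
    [5.0, 1.0, 4.0, 2.0, 3.0],       # weakly correlated with column 0
    [-1.0, -2.0, -3.0, -4.0, -5.0],  # perfectly anti-correlated with column 0
]
kept_indices = prune_correlated(toy_columns)
print(kept_indices)
```

On the real data the same idea would reduce the 1209 columns to the 41 that feed the PCA.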

Table 1  Scores of the 4 principal components of the 41 property parameters of the 5 bases

Figure A20081006994100061

Analysis of the loadings of the 4 principal components shows the following. The largest positive contribution to the first principal component comes from the third-component symmetry directional WHIM index weighted by atomic masses. WHIM descriptors are 3D geometrical descriptors obtained by PCA of the weighted covariance matrix of the atomic coordinates. The next largest contribution comes from the structural information content descriptors. Both kinds can be regarded as steric descriptors. The larger negative contributions come from variables such as the Moran autocorrelation descriptor weighted by atomic polarizabilities and the torsional energy. For the second principal component, the larger positive contributions come from an unweighted 3D-MoRSE descriptor component, which characterizes the 3D molecular structure on the basis of electron diffraction, and from the electronic energy; the larger negative contributions come from variables such as the sum of topological distances between nitrogen (N) and oxygen (O) atoms. In the third principal component, the variables with larger positive loadings are the 2-path Kier alpha-modified shape index and the Kier flexibility index, both topological descriptors; those with larger negative loadings are the mean atomic polarizability (scaled on the carbon atom) and the average molecular weight, both constitutional descriptors.

The variable most positively correlated with the fourth principal component is the 7th component of the molecular holographic distance vector proposed by our research group. The molecular holographic distance vector is a descriptor based on the 2D molecular topology, obtained by dividing atoms into 13 atom types and further defining atomic properties and relative bond lengths. Its 7th component represents the holographic distance between the atomic environments C- and >N-, >P- (the symbols "-", ">", and "<" denote atoms connected to 1, 2, and 2 non-hydrogen atoms or chemical bonds, respectively). The larger negative correlations come from variables such as an unweighted 3D-MoRSE descriptor component and the Moran autocorrelation descriptor weighted by atomic polarizabilities. For convenience, the 4 principal component score vectors are called the generalized base property scores. Because they combine, from several angles, most of the information in the 1209 property parameters of the bases, they are a candidate for characterizing nucleic acid sequences.

b) characterizing the structures of human gene promoters and non-promoters with the generalized base property scores;

565 human gene promoter sequences and 3819 non-promoter sequences (890 exons and 2929 introns) were selected. The selected sequences were characterized in the 5'→3' direction with the 4 principal components of the generalized base property score vectors, each base in a sequence being represented by its 4 generalized property scores. A sequence containing n bases is thus characterized by n×4 variables.
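The encoding described above amounts to a table lookup per base. The sketch below assumes a hypothetical `encode_sequence` helper and uses placeholder score values, because the actual scores of Table 1 are only available as an image in the publication and are not reproduced here.

```python
# Placeholder scores only: the real values are in Table 1 (an image in the
# source) and are NOT reproduced here. Each base maps to 4 component scores.
PLACEHOLDER_SCORES = {
    "A": (0.10, -0.20, 0.30, -0.40),
    "C": (-0.15, 0.25, -0.35, 0.45),
    "G": (0.50, 0.60, -0.70, -0.80),
    "T": (-0.55, -0.65, 0.75, 0.85),
    "U": (-0.90, 0.95, 0.05, -0.05),
}

def encode_sequence(sequence, scores=PLACEHOLDER_SCORES):
    """Encode a sequence read 5'->3' as n rows of 4 property scores."""
    return [scores[base] for base in sequence.upper()]

encoded = encode_sequence("acgt")
n_variables = sum(len(row) for row in encoded)  # n x 4 variables
print(len(encoded), n_variables)
```

Because n differs between sequences, this raw encoding has a variable length, which is why a length-independent transformation is needed before modeling.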

c) normalizing the characterization variables of each human gene promoter and non-promoter with the auto-cross covariance method;

The characterization variables of each promoter and non-promoter sequence were processed by auto-cross covariance (ACC). This method takes into account all interactions between the base parameters at different positions in the sequence, so it keeps information loss during the data transformation to a minimum. Let the shortest sequence in the sample set have length $l_{\max}+1$. For any sequence of n bases, the ACC is computed as:

$$\mathrm{ACC}_{a,b,l} = \sum_{i=1}^{n-l} \frac{Z_{a,i} \times Z_{b,i+l}}{n-l}, \qquad l = 1, 2, 3, \ldots, l_{\max}$$

where l is the step size (lag); i and i+l are the positions of the bases in the sequence; $Z_{a,i}$ is the value of score component a at position i; and a and b are the descriptor component indices of the i-th and (i+l)-th bases, respectively. For the generalized base property score vectors, a, b = 1, 2, 3, 4. When all step sizes l = 1, 2, 3, ..., $l_{\max}$ are computed, every sequence in the sample set, whatever its length, yields $4^2 \times l_{\max}$ descriptors after ACC processing. Here the maximum step size is set to 6, so each sequence is characterized by $4^2 \times 6 = 96$ variables. The variables obtained by ACC processing are used as the independent variables of the promoter recognition model.
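A minimal sketch of the ACC transformation defined by the formula above, assuming each sequence has already been encoded as n rows of 4 scores. The function name `acc_transform` and the parameter `max_lag` are hypothetical, and the ordering of the output (by lag, then a, then b) is a choice made here, not something the description specifies.

```python
def acc_transform(z, max_lag=6):
    """Auto-cross covariance of an encoded sequence.

    z: list of n rows, each row holding d component scores for one base.
    Returns d * d * max_lag values, ordered by lag, then a, then b.
    """
    n = len(z)
    d = len(z[0])
    if n <= max_lag:
        raise ValueError("sequence must contain more than max_lag bases")
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(d):
            for b in range(d):
                # Sum over positions i of Z[a, i] * Z[b, i + lag].
                total = sum(z[i][a] * z[i + lag][b] for i in range(n - lag))
                features.append(total / (n - lag))
    return features

# Invented toy input: 8 positions, 4 components per position.
toy_z = [[float(i), 1.0, 0.0, 0.0] for i in range(1, 9)]
toy_features = acc_transform(toy_z)
print(len(toy_features))
```

Whatever the sequence length, the output has 4 × 4 × 6 = 96 entries, which is what lets sequences of different lengths share one feature space.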

d) building a human gene promoter recognition model with an RBF kernel support vector machine;

First, two indicator values are defined: "1" denotes a promoter sample and "-1" a non-promoter sample (exons and introns). This indicator serves as the dependent variable of the promoter recognition model. The model was built with an RBF kernel support vector machine with the parameters C = 200.0 and $K(x, x_i) = \exp(-0.125\,\lVert x - x_i \rVert^2)$. The following statistics are used: Acc, the percentage of all samples predicted correctly; Sn (sensitivity), the percentage of promoter samples predicted correctly; Sp (specificity), the percentage of non-promoter samples predicted correctly; and MCC, the Matthews correlation coefficient. Under leave-one-out cross-validation, the support vector machine model gave Acc = 83.8, Sn = 67.1, Sp = 86.3, and MCC = 0.442 on the 565 promoters and 3819 non-promoters of the training set. Leave-one-fifth-out cross-validation gave Acc = 81.7, Sn = 66.9, Sp = 83.8, and MCC = 0.406. This shows that a model built by generalized base property score characterization, auto-cross covariance normalization, and RBF kernel support vector machine modeling can recognize human gene promoters reasonably well. The support vectors obtained with the leave-one-out and leave-one-fifth-out methods made up 62.1% and 68.3% of all samples, respectively; that is, 37.9% and 31.7% of the samples could be safely deleted without affecting predictions on new samples, which further indicates that the support vector classifier generalizes well.
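The kernel and the four statistics can be written out directly. The sketch below is illustrative: the function names are hypothetical, and the confusion-matrix counts are not given in the description but are back-calculated here, as an assumption, from the reported leave-one-out percentages and the class sizes (565 promoters, 3819 non-promoters).

```python
from math import exp, sqrt

def rbf_kernel(x, xi, gamma=0.125):
    """K(x, xi) = exp(-gamma * ||x - xi||^2), with gamma = 0.125 as stated."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, xi))
    return exp(-gamma * squared_distance)

def classification_stats(tp, fn, tn, fp):
    """Return (Acc, Sn, Sp) in percent and MCC, with promoters as positives."""
    total = tp + fn + tn + fp
    acc = 100.0 * (tp + tn) / total
    sn = 100.0 * tp / (tp + fn)   # promoters predicted correctly
    sp = 100.0 * tn / (tn + fp)   # non-promoters predicted correctly
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

# Assumed counts, inferred from the reported percentages (not from the source).
acc, sn, sp, mcc = classification_stats(tp=379, fn=186, tn=3296, fp=523)
print(round(acc, 1), round(sn, 1), round(sp, 1), round(mcc, 3))
```

With these assumed counts the four statistics agree with the reported leave-one-out values to rounding, which is consistent with Sn being measured on promoters and Sp on non-promoters.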

Further, the receiver operating characteristic (ROC) curve was plotted with (1 - Sp) on the horizontal axis (X) and sensitivity (Sn) on the vertical axis (Y); see Fig. 1. The areas under the curve for the leave-one-out and leave-one-fifth-out methods are 0.835 and 0.819, respectively.
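For illustration, the area under such a curve can be computed from decision scores by sweeping a threshold and applying the trapezoidal rule. The function name `roc_auc` and the toy labels and scores are hypothetical; the description does not say how its areas were computed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve (x = 1 - Sp, y = Sn) by the trapezoidal rule.

    labels: +1 for promoter, -1 for non-promoter.
    scores: decision values, larger meaning more promoter-like.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented toy example: 2 promoters, 2 non-promoters.
toy_auc = roc_auc([1, 1, -1, -1], [0.9, 0.4, 0.6, 0.1])
print(toy_auc)
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation of promoters from non-promoters.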

To further verify how well the method predicts human gene promoters, 100 promoter sequences and 100 intron sequences that were not in the training set were selected from the EPD database (http://www.epd.isb-sib.ch/) for prediction. The predictions of the RBF kernel support vector machine model are listed in Table 2. Seven prediction servers were also run on the same 200 sequences for comparison. The method of the invention obtained the highest Sn and MCC, indicating a clear advantage for human gene promoter prediction.

Table 2  Comparison of human gene promoter prediction results

Figure A20081006994100091

The above describes only preferred embodiments of the invention and is not intended to limit it. Those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their equivalents, the invention is intended to cover them as well.

Claims (5)

1. A method for identifying human gene promoters, characterized by comprising the following steps:
a) establishing a characterization system of generalized base property scores based on principal component analysis;
b) characterizing the structures of human gene promoters and non-promoters with the generalized base property scores;
c) normalizing the characterization variables of each human gene promoter and non-promoter with the auto-cross covariance method;
d) building a human gene promoter recognition model with an RBF kernel support vector machine.
2. The method for identifying human gene promoters according to claim 1, characterized in that step a) specifically comprises the following steps:
a1) selecting 1209 0D-3D property parameters of the 5 bases;
a2) performing a correlation analysis on the 1209 property parameters and selecting 41 of them;
a3) processing the selected base property parameters by principal component analysis to obtain 4 principal components;
a4) calculating the score of each principal component and defining the score vectors as the generalized base property scores.
3. The method for identifying human gene promoters according to claim 2, characterized in that step b) specifically comprises: characterizing the sequences of human gene promoters and non-promoters with the 4 principal components of the generalized base property score vectors, each base in a sequence being represented by its 4 generalized property scores.
4. The method for identifying human gene promoters according to claim 3, characterized in that step c) specifically comprises: processing the characterization variables of each promoter and non-promoter sequence by auto-cross covariance with the step size l set to 6, so that every sequence has the same number of characterization variables, and using the variables obtained by auto-cross covariance processing as the independent variables of the promoter recognition model.
5. The method for identifying human gene promoters according to any one of claims 1 to 4, characterized in that step d) specifically comprises: first defining two indicator values, "1" for promoter samples and "-1" for non-promoter samples, as the dependent variable of the promoter recognition model, and building the human gene promoter recognition model with an RBF kernel support vector machine.
CNA2008100699415A 2008-07-08 2008-07-08 A method for identifying human gene promoters Pending CN101307359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100699415A CN101307359A (en) 2008-07-08 2008-07-08 A method for identifying human gene promoters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008100699415A CN101307359A (en) 2008-07-08 2008-07-08 A method for identifying human gene promoters

Publications (1)

Publication Number Publication Date
CN101307359A true CN101307359A (en) 2008-11-19

Family

ID=40124037

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100699415A Pending CN101307359A (en) 2008-07-08 2008-07-08 A method for identifying human gene promoters

Country Status (1)

Country Link
CN (1) CN101307359A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324002A (en) * 2011-06-03 2012-01-18 哈尔滨工程大学 Two-dimensional Image Representation Method of DNA Sequence Based on Digital Image Processing
CN102324002B (en) * 2011-06-03 2013-10-30 哈尔滨工程大学 Two-dimensional image representation method of digital image processing-based DNA sequence
CN104834834A (en) * 2015-04-09 2015-08-12 苏州大学张家港工业技术研究院 Construction method and device of promoter recognition system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20081119