CN103870719B - A kind of process for recognising human gene promoter and system - Google Patents

A kind of process for recognising human gene promoter and system

Info

Publication number
CN103870719B
Authority
CN
China
Prior art keywords
gene sequence
sample gene sequence
promoter
feature
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410140707.2A
Other languages
Chinese (zh)
Other versions
CN103870719A (en)
Inventor
张莉
徐文轩
罗璇
王邦军
杨季文
李凡长
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Application filed by Suzhou University
Priority to CN201410140707.2A
Publication of CN103870719A
Application granted
Publication of CN103870719B

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Micro-Organisms Or Cultivation Processes Thereof (AREA)

Abstract

This application discloses a promoter recognition method. The cytosine and guanine (CG) preference of each of multiple sample gene sequences is counted, and the sample gene sequences are divided into two classes accordingly. The following steps are performed for each class: the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence are extracted, and a corresponding classifier is built from each feature to judge whether the sample gene sequence is a promoter. For sequences recognized as non-promoter sequences, the pentamer composition feature is extracted and a pentamer classifier is built to judge them again; when the recognition result meets a preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes the rigidity feature, CpG island feature, and composition features of genes into account and, through hierarchical recognition, improves the accuracy of the final promoter recognition result.

Description

A kind of process for recognising human gene promoter and system
Technical field
This application relates to the technical field of promoter recognition, and more specifically to a method and system for recognizing human gene promoters.
Background art
Since the draft of the human genome was completed, research on the regulation of human gene expression has become a highly challenging research direction. Promoter recognition plays an important role in the functional annotation of the whole genome, so recognizing human promoters faster and more accurately has become an active research field.
Current promoter prediction mainly proceeds from four directions: the transcription start site, the core promoter region, the transcription factor binding region, and the CpG island of the promoter. A CpG island is defined as follows: CpG dinucleotides are distributed very unevenly in the human genome, and in some segments of the genome CpG occurs at or above the normally expected frequency; these segments are called CpG islands. Emerging promoter recognition methods propose studying the structural features of promoters, such as flexibility, rigidity, and bendability, which are features extracted from three-dimensional structure. These structural features can provide important auxiliary information for a promoter recognition system.
Mei Li et al. proposed a hierarchical feature algorithm based on support vector machines (SVM). The first-level SVM classifier recognizes promoters using the CpG island feature, and the non-promoter sequences identified by the first-level SVM classifier are then examined further by a second-level SVM classifier. That algorithm extracts composition features in the same way for all samples and does not use the structural features of genes. Moreover, the features of different promoter sequences are not the same, so features extracted identically from all samples do not necessarily have the strongest discriminative power. Existing methods therefore suffer from a low recognition rate.
Summary of the invention
In view of this, the application provides a method and system for recognizing human gene promoters, to solve the problem that existing algorithms have a low recognition rate for gene promoters.
To achieve the above object, the following scheme is proposed:
A method for recognizing human gene promoters, including:
receiving a sample set composed of multiple sample gene sequences;
counting the cytosine and guanine (CG) preference of each sample gene sequence to obtain a statistical result;
dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
for each class of sample gene sequences after the division, extracting the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence;
building a rigidity classifier from the rigidity feature, a CpG island classifier from the CpG island feature, and a tetramer classifier from the tetramer composition feature, the three classifiers each judging whether the same sample gene sequence is a promoter and each giving a corresponding first recognition result;
when the three first recognition results meet a first preset condition, determining that the current sample gene sequence is a promoter sequence;
for a sample gene sequence that does not meet the first preset condition, extracting its pentamer composition feature and building a pentamer classifier, the pentamer classifier judging whether that sample gene sequence is a promoter and giving a second recognition result;
when the second recognition result meets a second preset condition, determining that the current sample gene sequence is a promoter sequence, and otherwise a non-promoter sequence.
Preferably, the sample set composed of multiple sample gene sequences is:
$\{(x_i, y_i)\}_{i=1}^{N}$
where $x_i \in \mathbb{R}^L$, $y_i \in \{\text{promoter}, \text{exon}, \text{intron}, \text{3' UTR}\}$, $N$ is the number of samples, and $L$ is the length of a sample gene sequence.
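As a rough illustration of the sample set defined above, the following sketch stores each sample as a (sequence, label) pair and checks that all sequences share the same length L. The names `Sample`, `LABELS`, and `validate_sample_set` are hypothetical, as is the spelling of the label strings; none of them comes from the patent.

```python
from dataclasses import dataclass

# The four labels named in the text; the exact spellings are an assumption.
LABELS = {"promoter", "exon", "intron", "3'UTR"}


@dataclass(frozen=True)
class Sample:
    sequence: str  # x_i, a DNA string over A, C, G, T
    label: str     # y_i, one of LABELS


def validate_sample_set(samples):
    """Return (N, L) if every sample has a known label and the same length L."""
    if not samples:
        raise ValueError("sample set is empty")
    length = len(samples[0].sequence)
    for s in samples:
        if s.label not in LABELS:
            raise ValueError(f"unknown label: {s.label}")
        if len(s.sequence) != length:
            raise ValueError("all sample gene sequences must have length L")
        if set(s.sequence) - set("ACGT"):
            raise ValueError("sequence contains a non-ACGT character")
    return len(samples), length
```

Checking the common length up front is a design choice of this sketch: the position-wise rigidity feature has L-5 entries, so sequences of unequal length would give feature vectors of unequal size.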
Preferably, counting the cytosine and guanine (CG) preference of each sample gene sequence to obtain a statistical result is specifically:
for each sample gene sequence $x_i$, counting the proportion of cytosine C and guanine G content in it, and expressing the sample set after counting as:
[formula not reproduced in source] where $n_C$ is the number of occurrences of cytosine C in the sample gene sequence and $n_G$ is the number of occurrences of guanine G in the sample gene sequence.
Preferably, dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it, is specifically:
setting a threshold $w$; when the counted proportion satisfies the threshold condition (formula not reproduced in source), the sample gene sequence has the CG preference feature, and otherwise it does not.
Preferably, the rigidity feature of each sample gene sequence is extracted as follows:
a trinucleotide model is used to calculate the rigidity feature of each sample gene sequence:
when calculating the rigidity feature at each base position of a sample gene sequence, a window of 6 bases is used and the rigidity parameter values of its four overlapping trinucleotides are summed; the sample gene sequence after rigidity feature extraction is expressed in terms of:
$r_j = \sum_{k=1}^{4} t_k$, where $j$ ($j = 1, 2, \ldots, L-5$) is the position index and $t_k$ is the rigidity parameter value of the $k$-th overlapping trinucleotide in the window starting at the $j$-th base position.
Preferably, the CpG island feature of each sample gene sequence is extracted as follows:
calculating the total cytosine and guanine content CG_con of each sample gene sequence $x_i$ [formula not reproduced in source];
calculating the ratio Obs/Exp of the observed to the expected CpG value of each sample gene sequence $x_i$ [formula not reproduced in source],
where $n_C$, $n_G$, and $n_{CG}$ are the numbers of cytosine C, guanine G, and the dinucleotide CG, respectively;
the sample gene sequence is then expressed by these features [formula not reproduced in source].
Preferably, the tetramer composition feature of each sample gene sequence is extracted as follows:
Let $f_{pr}$ be the frequency with which a tetramer occurs in promoters, and let its frequency in the a-th kind of non-promoter sequence be denoted correspondingly (symbol not reproduced in source), where a = 1 denotes exons, a = 2 denotes introns, and a = 3 denotes 3' UTRs. The tetramer-based KL divergence is then defined as follows: [formula not reproduced in source]
The divergence values, indexed by $i \in [1, 4^4]$, are sorted in descending order.
Let: [formula defining $R_a$ not reproduced in source],
with $n_a \in [1, 4^4]$.
$m$ is increased step by step from 1 to $n_a$ and the corresponding $R_a$ is calculated; if $R_a \ge 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ tetramers are used as the salient features for distinguishing promoters from the a-th kind of non-promoter sequence.
A human gene promoter recognition system, including:
a receiving unit, for receiving a sample set composed of multiple sample gene sequences;
a statistics unit, for counting the cytosine and guanine (CG) preference of each sample gene sequence to obtain a statistical result;
a classification unit, for dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
a feature extraction unit, for extracting, for each class of sample gene sequences after the division, the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence;
a rigidity classifier built from the rigidity feature, a CpG island classifier built from the CpG island feature, and a tetramer classifier built from the tetramer composition feature, the three classifiers each judging whether the same sample gene sequence is a promoter and each giving a corresponding first recognition result;
a first promoter determining unit, for determining that the current sample gene sequence is a promoter sequence when the three first recognition results meet the first preset condition;
a pentamer feature extraction unit, for extracting the pentamer composition feature of a sample gene sequence that does not meet the first preset condition;
a pentamer classifier built from the pentamer composition feature, which judges whether the sample gene sequence that does not meet the first preset condition is a promoter and gives a second recognition result;
a second promoter determining unit, for determining that the current sample gene sequence is a promoter sequence when the second recognition result meets the second preset condition, and otherwise a non-promoter sequence.
Preferably, the feature extraction unit includes:
a rigidity feature extraction unit, for extracting, for each class of sample gene sequences after the division, the rigidity feature of each sample gene sequence;
a CpG island feature extraction unit, for extracting, for each class of sample gene sequences after the division, the CpG island feature of each sample gene sequence;
a tetramer composition feature extraction unit, for extracting, for each class of sample gene sequences after the division, the tetramer composition feature of each sample gene sequence.
Preferably, the tetramer classifier includes:
a first sub-classifier, for judging promoters according to the features of promoters and exons;
a second sub-classifier, for judging promoters according to the features of promoters and introns;
a third sub-classifier, for judging promoters according to the features of promoters and 3' UTRs.
As the above technical scheme shows, in the promoter recognition method disclosed in this application the cytosine and guanine (CG) preference of multiple sample gene sequences is counted and the sequences are divided into two classes, one with the CG preference feature and one without. The following steps are performed for each class: the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence are extracted; a rigidity classifier, a CpG island classifier, and a tetramer classifier are built from these features; the three classifiers each judge whether the same sample gene sequence is a promoter; and the three recognition results are considered together, the current sample gene sequence being determined to be a promoter sequence when the first preset condition is met. For a sample gene sequence that does not meet the first preset condition, its pentamer composition feature is extracted and a pentamer classifier is built to judge it again; when the recognition result meets the second preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes the rigidity feature, CpG island feature, and composition features of genes into account and, through hierarchical recognition, improves the accuracy of the final promoter recognition result.
Brief description of the drawings
To describe the technical solutions in the embodiments of the application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a human gene promoter recognition method disclosed in an embodiment of the application;
Fig. 2 is a structural diagram of a human gene promoter recognition system disclosed in an embodiment of the application;
Fig. 3 is a structural diagram of the feature extraction unit disclosed in an embodiment of the application;
Fig. 4 is a structural diagram of the tetramer classifier disclosed in an embodiment of the application.
Detailed description of the embodiments
The technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the application without creative effort fall within the scope of protection of the application.
Embodiment one
Referring to Fig. 1, Fig. 1 is a flow chart of a human gene promoter recognition method disclosed in an embodiment of the application.
As shown in Fig. 1, the method includes:
Step 101: receive a sample set composed of multiple sample gene sequences;
Step 102: count the cytosine and guanine (CG) preference of each sample gene sequence to obtain a statistical result;
Specifically, the numbers of cytosine (C) and guanine (G) bases in each sample gene sequence are counted; if the proportion of C and G in the whole sample gene sequence is greater than a certain value, the sample gene sequence is considered to have the CG preference feature, and otherwise it does not.
Step 103: divide all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
Step 104: for each class of sample gene sequences after the division, extract the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence;
Specifically, the rigidity feature, CpG island feature, and tetramer composition feature are inherent structural features of a gene. A CpG island is defined as follows: CpG dinucleotides are distributed very unevenly in the human genome, and in some segments of the genome CpG occurs at or above the normally expected frequency; these segments are called CpG islands.
Step 105: build a rigidity classifier from the rigidity feature, a CpG island classifier from the CpG island feature, and a tetramer classifier from the tetramer composition feature; the three classifiers each judge whether the same sample gene sequence is a promoter and each give a corresponding first recognition result;
Specifically, three kinds of classifiers are built from the three kinds of extracted features; each judges whether the sample gene sequence is a promoter and gives a first recognition result, so the three classifiers give three first recognition results in total.
Step 106: when the three first recognition results meet the first preset condition, determine that the current sample gene sequence is a promoter sequence;
Specifically, the three first recognition results are considered together, and when they meet the preset condition the current sample gene sequence is determined to be a promoter sequence.
Step 107: for a sample gene sequence that does not meet the first preset condition, extract its pentamer composition feature and build a pentamer classifier; the pentamer classifier judges whether that sample gene sequence is a promoter and gives a second recognition result;
Step 108: when the second recognition result meets the second preset condition, determine that the current sample gene sequence is a promoter sequence, and otherwise a non-promoter sequence.
Specifically, the sequences judged to be non-promoter sequences in step 106 are examined further, and a sample gene sequence is determined to be a promoter sequence when its recognition result meets the second preset condition.
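The two-level decision of steps 105 to 108 can be sketched as below. The text does not say what the first and second preset conditions are, so this sketch takes them as parameters; the `majority` rule and all function names are hypothetical illustrations and should not be read as the patented conditions.

```python
from typing import Callable, Sequence

# A classifier maps a DNA string to True ("promoter") or False.
Classifier = Callable[[str], bool]


def recognize(
    sequence: str,
    first_stage: Sequence[Classifier],
    second_stage: Classifier,
    first_condition: Callable[[Sequence[bool]], bool],
    second_condition: Callable[[bool], bool] = bool,
) -> str:
    """Two-level decision corresponding to steps 105 to 108."""
    # Steps 105-106: the first-stage classifiers judge the same sequence.
    first_results = [clf(sequence) for clf in first_stage]
    if first_condition(first_results):
        return "promoter"
    # Steps 107-108: sequences rejected so far get a second look.
    second_result = second_stage(sequence)
    return "promoter" if second_condition(second_result) else "non-promoter"


def majority(results: Sequence[bool]) -> bool:
    """Hypothetical first preset condition: more than half vote 'promoter'."""
    return sum(results) * 2 > len(results)
```

For example, `recognize(seq, [rigidity_clf, cpg_clf, tetramer_clf], pentamer_clf, majority)` would accept a sequence at the first level when two of the three classifiers agree, and would otherwise defer to the pentamer classifier; the four classifier names in this call are placeholders for trained models.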
In the promoter recognition method disclosed in this embodiment of the application, the cytosine and guanine (CG) preference of multiple sample gene sequences is counted and the sequences are divided into two classes, one with the CG preference feature and one without. The following steps are performed for each class: the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence are extracted; a rigidity classifier, a CpG island classifier, and a tetramer classifier are built from these features; the three classifiers each judge whether the same sample gene sequence is a promoter; and the three recognition results are considered together, the current sample gene sequence being determined to be a promoter sequence when the first preset condition is met. For a sample gene sequence that does not meet the first preset condition, its pentamer composition feature is extracted and a pentamer classifier is built to judge it again; when the recognition result meets the second preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes the rigidity feature, CpG island feature, and composition features of genes into account and, through hierarchical recognition, improves the accuracy of the final promoter recognition result.
Embodiment two
This embodiment describes in detail how each step of Embodiment one is implemented.
First, the sample set composed of multiple sample gene sequences is defined as:
$\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^L$, $y_i \in \{\text{promoter}, \text{exon}, \text{intron}, \text{3' UTR}\}$, $N$ is the number of samples, and $L$ is the length of a sample gene sequence.
Note that the 3' UTR is the untranslated region at the 3' end of a gene sequence.
Next, the cytosine and guanine (CG) preference of each sample gene sequence is counted, specifically:
For each sample gene sequence $x_i$, the proportion of cytosine C and guanine G content is counted, and the sample set after counting is expressed as:
[formula not reproduced in source] where $n_C$ is the number of occurrences of cytosine C in the sample gene sequence and $n_G$ is the number of occurrences of guanine G in the sample gene sequence.
Based on the CG preference statistics, all sample gene sequences are divided into two classes, one with the CG preference feature and one without. For the division, a threshold $w$ is first set; when the counted proportion satisfies the threshold condition (formula not reproduced in source), the sample gene sequence has the CG preference feature, and otherwise it does not.
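A minimal sketch of the CG preference split follows. Because the formulas are not reproduced in this text, two assumptions are made and should be checked against the original patent: the statistic is taken to be the share of C and G bases, (n_C + n_G) / L, and a sequence is taken to have the preference when that share is strictly greater than w. The function names are hypothetical.

```python
def cg_ratio(sequence: str) -> float:
    """Assumed statistic: share of C and G bases, (n_C + n_G) / L."""
    if not sequence:
        raise ValueError("empty sequence")
    seq = sequence.upper()
    return (seq.count("C") + seq.count("G")) / len(seq)


def split_by_cg_preference(sequences, w):
    """Split sequences into (with CG preference, without CG preference).

    Uses a strict '>' comparison against the threshold w, which is an assumption.
    """
    preferred, not_preferred = [], []
    for seq in sequences:
        (preferred if cg_ratio(seq) > w else not_preferred).append(seq)
    return preferred, not_preferred
```

Each of the two returned groups would then go through feature extraction and classifier training separately, which is the point of the division: features are tuned per class rather than extracted identically for all samples.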
Next, structural features are extracted from the two classes of sample gene sequences after the division:
1. Extraction of the rigidity feature:
A trinucleotide model is used to calculate the rigidity feature of each sample gene sequence:
When calculating the rigidity feature at each base position of a sample gene sequence, a window of 6 bases is used and the rigidity parameter values of its four overlapping trinucleotides are summed; the sample gene sequence after rigidity feature extraction is expressed in terms of:
$r_j = \sum_{k=1}^{4} t_k$, where $j$ ($j = 1, 2, \ldots, L-5$) is the position index and $t_k$ is the rigidity parameter value of the $k$-th overlapping trinucleotide in the window starting at the $j$-th base position.
For example:
For the 6 bases TATAAA, the calculation starts at the first base position T:
$r_T = t_{TAT} + t_{ATA} + t_{TAA} + t_{AAA}$
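The sliding-window sum in the TATAAA example can be sketched as follows. The function takes the trinucleotide rigidity table as an argument, because the text does not list the parameter values; the name `rigidity_profile` is hypothetical.

```python
def rigidity_profile(sequence: str, table: dict) -> list:
    """Return r_j for j = 1..L-5: the sum over four overlapping trinucleotides.

    `table` maps each trinucleotide to its rigidity parameter value.
    """
    seq = sequence.upper()
    profile = []
    for j in range(len(seq) - 5):
        window = seq[j:j + 6]
        # Trinucleotides start at window offsets 0, 1, 2, and 3.
        profile.append(sum(table[window[k:k + 3]] for k in range(4)))
    return profile
```

With a made-up table such as `{"TAT": 1.0, "ATA": 2.0, "TAA": 3.0, "AAA": 4.0}`, the call `rigidity_profile("TATAAA", table)` sums the four entries for TAT, ATA, TAA, and AAA, matching the worked example. These numbers are placeholders, not published rigidity parameters.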
2. Extraction of the CpG island feature:
Calculate the total cytosine C and guanine G content CG_con of each sample gene sequence $x_i$ [formula not reproduced in source];
Calculate the ratio Obs/Exp of the observed to the expected CpG value of each sample gene sequence $x_i$ [formula not reproduced in source],
where $n_C$, $n_G$, and $n_{CG}$ are the numbers of cytosine C, guanine G, and the dinucleotide CG, respectively.
The sample gene sequence after CpG island feature extraction is then expressed by these features [formula not reproduced in source].
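Since the two CpG island formulas are not reproduced in this text, the sketch below falls back on the definitions commonly used for CpG islands: CG_con = (n_C + n_G) / L and Obs/Exp = n_CG * L / (n_C * n_G). Whether the patent uses exactly these forms is an assumption, and the function name is hypothetical.

```python
def cpg_island_features(sequence: str):
    """Return (CG_con, Obs/Exp) under the commonly used definitions (assumed)."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    n_c = seq.count("C")
    n_g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number.
    n_cg = seq.count("CG")
    cg_con = (n_c + n_g) / length
    # Guard against division by zero when C or G is absent.
    obs_exp = n_cg * length / (n_c * n_g) if n_c and n_g else 0.0
    return cg_con, obs_exp
```

Returning 0.0 for Obs/Exp when either base is absent is a choice made here to keep the feature vector numeric; the text does not say how that case is handled.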
3. Extraction of the tetramer composition feature:
Let $f_{pr}$ be the frequency with which a tetramer occurs in promoters, and let its frequency in the a-th kind of non-promoter sequence be denoted correspondingly (symbol not reproduced in source), where a = 1 denotes exons, a = 2 denotes introns, and a = 3 denotes 3' UTRs. The tetramer-based KL divergence is then defined as follows: [formula not reproduced in source]
The divergence values, indexed by $i \in [1, 4^4]$, are sorted in descending order.
Let: [formula defining $R_a$ not reproduced in source],
with $n_a \in [1, 4^4]$.
$m$ is increased step by step from 1 to $n_a$ and the corresponding $R_a$ is calculated; if $R_a \ge 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ tetramers are used as the salient features for distinguishing promoters from the a-th kind of non-promoter sequence.
In the manner described above, the salient features of each sample gene sequence $x_i$ are counted, and the salient feature set can be expressed as:
[formula not reproduced in source] (r = 1 denotes the salient features distinguishing promoters from exons, r = 2 those distinguishing promoters from introns, and r = 3 those distinguishing promoters from 3' UTRs), where the feature vector holds the occurrence frequencies of the first $n_a$ tetramers.
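The selection of salient k-mers can be sketched as below, for 4-base subsequences (tetramers) by default. Several details are assumptions, because the divergence and $R_a$ formulas are not reproduced in this text: each k-mer's contribution is taken to be the symmetric term (f_pr - f_a) * log(f_pr / f_a), which is never negative; $R_a$ is taken to be the cumulative share of the sorted contributions; and a pseudocount is added so that no frequency is zero. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(sequences, k=4, pseudocount=1.0):
    """Relative frequency of every k-mer over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def select_kmers(f_pr, f_a, coverage=0.98):
    """Pick the top-ranked k-mers whose divergence terms reach `coverage`."""
    # Assumed per-k-mer term: symmetric KL contribution, always >= 0.
    terms = {m: (f_pr[m] - f_a[m]) * math.log(f_pr[m] / f_a[m]) for m in f_pr}
    ranked = sorted(terms, key=terms.get, reverse=True)
    total = sum(terms.values())
    if total == 0:
        return []
    selected, running = [], 0.0
    for m in ranked:
        selected.append(m)
        running += terms[m]
        if running / total >= coverage:  # corresponds to R_a >= 98 %
            break
    return selected
```

Running `select_kmers` once per non-promoter kind (exon, intron, 3' UTR) gives three k-mer lists, one for each sub-classifier; calling `kmer_frequencies` with `k=5` gives the pentamer variant used at the second level.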
After feature extraction is finished, steps 105 to 108 of Embodiment one are carried out; they are not repeated here.
Note that the pentamer composition feature in step 107 can be extracted in the same way as the tetramer composition feature described above, as follows:
Likewise, let $f_{pr}$ be the frequency with which a pentamer occurs in promoters, and let its frequency in the a-th kind of non-promoter sequence be denoted correspondingly (symbol not reproduced in source), where a = 1 denotes exons, a = 2 denotes introns, and a = 3 denotes 3' UTRs. The pentamer-based KL divergence is then defined as follows: [formula not reproduced in source]
The divergence values, indexed by $i \in [1, 4^5]$, are sorted in descending order.
Let: [formula defining $R_a$ not reproduced in source],
with $n_a \in [1, 4^5]$.
$m$ is increased step by step from 1 to $n_a$ and the corresponding $R_a$ is calculated; if $R_a \ge 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ pentamers are used as the salient features for distinguishing promoters from the a-th kind of non-promoter sequence.
In the manner described above, the salient features of each sample gene sequence $x_i$ are counted, and the salient feature set can be expressed as:
[formula not reproduced in source] (r = 1 denotes the salient features distinguishing promoters from exons, r = 2 those distinguishing promoters from introns, and r = 3 those distinguishing promoters from 3' UTRs), where the feature vector holds the occurrence frequencies of the first $n_a$ pentamers.
Embodiment three
Referring to Fig. 2, Fig. 2 is a structural diagram of a human gene promoter recognition system disclosed in an embodiment of the application.
As shown in Fig. 2, the system includes:
a receiving unit 21, for receiving a sample set composed of multiple sample gene sequences;
a statistics unit 22, for counting the cytosine and guanine (CG) preference of each sample gene sequence to obtain a statistical result;
a classification unit 23, for dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
a feature extraction unit 24, for extracting, for each class of sample gene sequences after the division, the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence;
a rigidity classifier 25 built from the rigidity feature, a CpG island classifier 26 built from the CpG island feature, and a tetramer classifier 27 built from the tetramer composition feature, where the rigidity classifier 25, the CpG island classifier 26, and the tetramer classifier 27 each judge whether the same sample gene sequence is a promoter and each give a corresponding first recognition result;
a first promoter determining unit 28, for determining that the current sample gene sequence is a promoter sequence when the three first recognition results meet the first preset condition;
a pentamer feature extraction unit 29, for extracting the pentamer composition feature of a sample gene sequence that does not meet the first preset condition;
a pentamer classifier 30 built from the pentamer composition feature, which judges whether the sample gene sequence that does not meet the first preset condition is a promoter and gives a second recognition result;
a second promoter determining unit 31, for determining that the current sample gene sequence is a promoter sequence when the second recognition result meets the second preset condition, and otherwise a non-promoter sequence.
For the specific working process of the above system, refer to the description of the method in Embodiment one. In the promoter recognition system disclosed in this embodiment of the application, the cytosine and guanine (CG) preference of multiple sample gene sequences is counted and the sequences are divided into two classes, one with the CG preference feature and one without. The following steps are performed for each class: the rigidity feature, CpG island feature, and tetramer composition feature of each sample gene sequence are extracted; a rigidity classifier, a CpG island classifier, and a tetramer classifier are built from these features; the three classifiers each judge whether the same sample gene sequence is a promoter; and the three recognition results are considered together, the current sample gene sequence being determined to be a promoter sequence when the first preset condition is met. For a sample gene sequence that does not meet the first preset condition, its pentamer composition feature is extracted and a pentamer classifier is built to judge it again; when the recognition result meets the second preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes the rigidity feature, CpG island feature, and composition features of genes into account and, through hierarchical recognition, improves the accuracy of the final promoter recognition result.
Referring to Fig. 3, Fig. 3 is a structural diagram of the feature extraction unit disclosed in an embodiment of the application. As shown, the feature extraction unit 24 can include:
a rigidity feature extraction unit 241, for extracting, for each class of sample gene sequences after the division, the rigidity feature of each sample gene sequence;
a CpG island feature extraction unit 242, for extracting, for each class of sample gene sequences after the division, the CpG island feature of each sample gene sequence;
a tetramer composition feature extraction unit 243, for extracting, for each class of sample gene sequences after the division, the tetramer composition feature of each sample gene sequence.
Referring to Fig. 4, Fig. 4 is a structural diagram of the tetramer classifier disclosed in an embodiment of the application. As shown, the tetramer classifier 27 includes:
a first sub-classifier 271, for judging promoters according to the features of promoters and exons;
a second sub-classifier 272, for judging promoters according to the features of promoters and introns;
a third sub-classifier 273, for judging promoters according to the features of promoters and 3' UTRs.
Note that the pentamer classifier 30 can also be structured in the same way as the tetramer classifier 27 described above.
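The three-part structure of the classifier in Fig. 4 can be sketched as a wrapper around three sub-classifiers. The text does not state how the three sub-results are combined into one recognition result; the rule used here, promoter only if all three sub-classifiers say promoter, is an assumption, and the class and attribute names are hypothetical.

```python
from typing import Callable, Dict

# Each sub-classifier separates promoters from one kind of non-promoter sequence.
SubClassifier = Callable[[str], bool]


class CompositeKmerClassifier:
    """Tetramer (or pentamer) classifier built from three sub-classifiers."""

    KINDS = ("exon", "intron", "3'UTR")

    def __init__(self, sub_classifiers: Dict[str, SubClassifier]):
        missing = set(self.KINDS) - set(sub_classifiers)
        if missing:
            raise ValueError(f"missing sub-classifiers: {sorted(missing)}")
        self.sub_classifiers = sub_classifiers

    def votes(self, sequence: str) -> Dict[str, bool]:
        """Result of each promoter-versus-kind sub-classifier."""
        return {kind: self.sub_classifiers[kind](sequence) for kind in self.KINDS}

    def __call__(self, sequence: str) -> bool:
        # Assumed combination rule: promoter only if no sub-classifier objects.
        return all(self.votes(sequence).values())
```

Keeping `votes` separate from `__call__` makes it easy to try a different combination rule, such as a majority vote, without touching the sub-classifiers.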
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, the application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for recognizing human gene promoters, characterized by comprising:
receiving a sample set composed of multiple sample gene sequences;
computing statistics of the cytosine-guanine (CG) preference feature of each sample gene sequence to obtain a statistical result;
dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other class not having the CG preference feature;
for each class of sample gene sequences after division, extracting the rigidity feature, CpG island feature, and tetramer (4-mer) composition feature of each sample gene sequence in that class;
building a rigidity classifier from the rigidity feature, a CpG island classifier from the CpG island feature, and a tetramer classifier from the tetramer composition feature, the rigidity classifier, the CpG island classifier, and the tetramer classifier each making a promoter recognition judgment on the same sample gene sequence and each giving a corresponding first recognition result;
when the three first recognition results satisfy a first preset condition, determining that the current sample gene sequence is a promoter sequence;
for a sample gene sequence that does not satisfy the first preset condition, extracting its pentamer (5-mer) composition feature and building a pentamer classifier, the pentamer classifier making a promoter recognition judgment on the sample gene sequence that does not satisfy the first preset condition and giving a second recognition result;
when the second recognition result satisfies a second preset condition, determining that the current sample gene sequence is a promoter sequence, and otherwise determining that it is a non-promoter sequence.
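The two-stage decision recited in claim 1 can be sketched as below. The claim does not define the first and second preset conditions, so this sketch assumes, purely for illustration, that the first condition is "at least two of the three first-stage classifiers vote promoter" and the second is "the pentamer classifier votes promoter". The function name `recognize` and the toy classifiers are hypothetical stand-ins, not trained models.

```python
from typing import Callable, Sequence

Classifier = Callable[[str], bool]  # returns True for "promoter"


def recognize(seq: str,
              first_stage: Sequence[Classifier],
              pentamer_classifier: Classifier,
              min_votes: int = 2) -> str:
    """Hierarchical recognition: three first-stage votes, then a fallback."""
    votes = sum(clf(seq) for clf in first_stage)
    if votes >= min_votes:  # assumed form of the first preset condition
        return "promoter"
    # Sequences rejected in stage one get a second judgment from the
    # pentamer classifier (assumed form of the second preset condition).
    return "promoter" if pentamer_classifier(seq) else "non-promoter"


# Toy classifiers standing in for the rigidity, CpG island, tetramer and
# pentamer classifiers (hypothetical rules chosen only to exercise the flow).
rigidity_clf = lambda s: len(s) > 4
cpg_clf = lambda s: "CG" in s
tetramer_clf = lambda s: s.count("G") >= 2
pentamer_clf = lambda s: s.startswith("TATA")

first_stage = [rigidity_clf, cpg_clf, tetramer_clf]
for s in ("ACGCGG", "TATAAT", "AATT"):
    print(s, recognize(s, first_stage, pentamer_clf))
```

Under the method, this decision would be run separately within each of the two CG preference classes, each with its own classifiers.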
2. The method according to claim 1, characterized in that the sample set composed of multiple sample gene sequences is:
$$\{(x_i, y_i)\}_{i=1}^{N}$$
where $x_i \in R^L$ is a sample gene sequence, $y_i \in \{\text{"promoter"}, \text{"exon"}, \text{"intron"}, \text{"3' UTR"}\}$ is its label, $N$ is the number of samples, and $L$ is the length of a sample gene sequence.
3. The method according to claim 2, characterized in that computing statistics of the cytosine-guanine (CG) preference feature of each sample gene sequence to obtain a statistical result is specifically:
for each sample gene sequence $x_i$, computing the ratio of cytosine C and guanine G content in the sequence, the sample set after the statistics being expressed as:
where $n_C$ is the number of occurrences of cytosine C in the sample gene sequence and $n_G$ is the number of occurrences of guanine G in the sample gene sequence.
4. The method according to claim 3, characterized in that dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other class not having the CG preference feature, is specifically:
setting a threshold $w$; when the statistic of a sample gene sequence satisfies the condition defined by $w$, the sample gene sequence has the CG preference feature; otherwise the sample gene sequence does not have the CG preference feature.
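The statistics-and-split step of claims 3 and 4 might look like the following sketch. The exact statistic and threshold condition are not reproduced in the text, so the code assumes the statistic is the combined C and G fraction $(n_C + n_G)/L$ and that a sequence has the CG preference feature when this fraction is at least $w$. Both choices, the default `w=0.5`, and the function names are assumptions.

```python
from typing import List, Tuple


def cg_statistic(seq: str) -> float:
    """Assumed statistic: fraction of bases that are C or G."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq)


def split_by_cg_preference(seqs: List[str],
                           w: float = 0.5) -> Tuple[List[str], List[str]]:
    """Divide sequences into (has CG preference, lacks CG preference)."""
    with_pref: List[str] = []
    without_pref: List[str] = []
    for s in seqs:
        # Assumed threshold condition: statistic >= w means CG preference.
        (with_pref if cg_statistic(s) >= w else without_pref).append(s)
    return with_pref, without_pref


print(split_by_cg_preference(["CCGG", "ATAT", "ACGT"]))
```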
5. The method according to claim 4, characterized in that the process of extracting the rigidity feature of each sample gene sequence is specifically:
using a trinucleotide model to calculate the rigidity feature of each sample gene sequence:
when calculating the rigidity feature at each base position of a sample gene sequence, a subsequence 6 bases long is used and the rigidity parameter values of its four overlapping trinucleotides are summed, the sample gene sequences after rigidity feature extraction being expressed as:
where $j$ ($j = 1, 2, \ldots, L-5$) is the position index and $t_k$ is the rigidity parameter value of each overlapping trinucleotide at the $j$-th base position.
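A minimal sketch of the sliding-window computation in claim 5 follows. The claim does not list the trinucleotide rigidity parameters, so `PLACEHOLDER_RIGIDITY` below is a made-up table (the C/G count of each trinucleotide) used only so the code runs; it is not measured rigidity data and should be replaced by real parameter values. The function name is likewise hypothetical.

```python
from itertools import product
from typing import Dict, List

# Placeholder values only: each trinucleotide is mapped to its C/G count.
# These are NOT real rigidity parameters.
PLACEHOLDER_RIGIDITY: Dict[str, float] = {
    "".join(t): float(sum(b in "CG" for b in t))
    for t in product("ACGT", repeat=3)
}


def rigidity_profile(seq: str,
                     table: Dict[str, float] = PLACEHOLDER_RIGIDITY) -> List[float]:
    """For each of the L-5 positions, sum the parameter values of the four
    overlapping trinucleotides inside the 6-base window starting there."""
    seq = seq.upper()
    profile = []
    for j in range(len(seq) - 5):
        window = seq[j:j + 6]
        profile.append(sum(table[window[k:k + 3]] for k in range(4)))
    return profile


print(rigidity_profile("ACGTACG"))
```

A sequence of length $L$ yields $L - 5$ values, matching the index range $j = 1, 2, \ldots, L-5$ in the claim.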
6. The method according to claim 4, characterized in that the process of extracting the CpG island feature of each sample gene sequence is specifically:
calculating the total cytosine and guanine content CG_con of each sample gene sequence $x_i$:
$$CG\_con = \frac{n_C + n_G}{L}$$
calculating the ratio Obs/Exp of the observed value to the expected value of CpG for each sample gene sequence $x_i$:
$$Obs/Exp = \frac{n_{CG} \cdot L}{n_C \cdot n_G}$$
where $n_C$, $n_G$, and $n_{CG}$ are the numbers of occurrences of cytosine C, guanine G, and the dinucleotide CG, respectively;
the sample gene sequences are expressed as:
$$X^{CpG} = \{f_{ij}^{CpG}\}_{i=1}^{n}$$
where $f_{ij}^{CpG}$ is the $j$-th CpG island feature of sample gene sequence $x_i$.
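The two CpG island quantities of claim 6 can be computed directly from base counts, as in this sketch. The function name `cpg_island_features` is hypothetical, and returning 0.0 for Obs/Exp when a sequence contains no C or no G is an assumption made here to avoid division by zero; the claim does not address that case.

```python
from typing import Tuple


def cpg_island_features(seq: str) -> Tuple[float, float]:
    """Return (CG_con, Obs/Exp) for one sequence."""
    seq = seq.upper()
    length = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    cg_con = (n_c + n_g) / length
    # Assumed convention: Obs/Exp is 0.0 when the denominator would be zero.
    obs_exp = (n_cg * length) / (n_c * n_g) if n_c and n_g else 0.0
    return cg_con, obs_exp


print(cpg_island_features("ACGCGT"))
```

For "ACGCGT", $n_C = n_G = n_{CG} = 2$ and $L = 6$, so CG_con is $4/6$ and Obs/Exp is $12/4 = 3$.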
7. The method according to claim 4, characterized in that the process of extracting the tetramer composition feature of each sample gene sequence is specifically:
let $f_{pr}$ be the frequency with which tetramers occur in promoters and $f_{np}^{a}$ be the frequency with which tetramers occur in the $a$-th class of non-promoter sequences, where $a=1$ denotes exons, $a=2$ denotes introns, and $a=3$ denotes 3'-UTRs; the tetramer-based KL divergence is then:
$$D_a(f_{pr}, f_{np}^{a}) = \sum_{i=1}^{4^4} f_{pr}(i) \ln \frac{f_{pr}(i)}{f_{np}^{a}(i)}, \quad a = 1, 2, 3$$
Sort the terms $f_{pr}(i) \ln \frac{f_{pr}(i)}{f_{np}^{a}(i)}$ in descending order and denote the sorted values $d_a(m)$, where $m \in [1, 4^4]$.
Let:
$$R_a = \frac{\sum_{m=1}^{n_a} d_a(m)}{D_a(f_{pr}, f_{np}^{a})}, \quad n_a \in [1, 4^4]$$
Increase $m$ step by step from 1 to $n_a$ and calculate the corresponding $R_a$; if $R_a \geq 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ tetramers are taken as the salient features that distinguish promoters from the $a$-th class of non-promoter sequences.
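The selection rule of claim 7 can be sketched as follows: compute the per-tetramer terms of the KL divergence, sort them in descending order, and keep the leading tetramers until their cumulative share of the divergence reaches 98%. The function names, the toy sequences, and the pseudocount of 1 added to every tetramer count (so that no frequency is zero inside the logarithm) are assumptions introduced for this sketch.

```python
import math
from itertools import product
from typing import Dict, List

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 4**4 = 256


def tetramer_frequencies(seqs: List[str], pseudocount: float = 1.0) -> Dict[str, float]:
    """Relative frequency of every tetramer over a set of sequences."""
    counts = dict.fromkeys(TETRAMERS, pseudocount)  # assumed smoothing
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 3):
            kmer = s[i:i + 4]
            if kmer in counts:  # skip windows with non-ACGT symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def select_tetramers(f_pr: Dict[str, float], f_np: Dict[str, float],
                     ratio: float = 0.98) -> List[str]:
    """Keep the top-ranked tetramers whose cumulative KL share reaches ratio."""
    terms = {k: f_pr[k] * math.log(f_pr[k] / f_np[k]) for k in TETRAMERS}
    d_total = sum(terms.values())  # D_a(f_pr, f_np^a)
    if d_total <= 0:  # identical distributions: nothing separates them
        return []
    selected, running = [], 0.0
    for k in sorted(terms, key=terms.get, reverse=True):
        selected.append(k)
        running += terms[k]
        if running / d_total >= ratio:  # R_a >= 98%
            break
    return selected


f_pr = tetramer_frequencies(["CGCGCGCG"] * 5)   # toy "promoter" set
f_np = tetramer_frequencies(["ATATATAT"] * 5)   # toy "exon" set
print(select_tetramers(f_pr, f_np))
```

The same procedure would be repeated for each non-promoter class $a = 1, 2, 3$, giving one selected tetramer set per class.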
8. A system for recognizing human gene promoters, characterized by comprising:
a receiving unit, configured to receive a sample set composed of multiple sample gene sequences;
a statistics unit, configured to compute statistics of the cytosine-guanine (CG) preference feature of each sample gene sequence to obtain a statistical result;
a classification unit, configured to divide all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other class not having the CG preference feature;
a feature extraction unit, configured to extract, for each class of sample gene sequences after division, the rigidity feature, CpG island feature, and tetramer (4-mer) composition feature of each sample gene sequence in that class;
a rigidity classifier built from the rigidity feature, a CpG island classifier built from the CpG island feature, and a tetramer classifier built from the tetramer composition feature, the rigidity classifier, the CpG island classifier, and the tetramer classifier each making a promoter recognition judgment on the same sample gene sequence and each giving a corresponding first recognition result;
a first promoter determination unit, configured to determine that the current sample gene sequence is a promoter sequence when the three first recognition results satisfy a first preset condition;
a pentamer feature extraction unit, configured to extract the pentamer (5-mer) composition feature of a sample gene sequence that does not satisfy the first preset condition;
a pentamer classifier built from the pentamer composition feature, the pentamer classifier making a promoter recognition judgment on the sample gene sequence that does not satisfy the first preset condition and giving a second recognition result;
a second promoter determination unit, configured to determine that the current sample gene sequence is a promoter sequence when the second recognition result satisfies a second preset condition, and otherwise to determine that it is a non-promoter sequence.
9. The system according to claim 8, characterized in that the feature extraction unit includes:
a rigidity feature extraction unit, configured to extract, for each class of sample gene sequences after division, the rigidity feature of each sample gene sequence in that class;
a CpG island feature extraction unit, configured to extract, for each class of sample gene sequences after division, the CpG island feature of each sample gene sequence in that class;
a tetramer composition feature extraction unit, configured to extract, for each class of sample gene sequences after division, the tetramer composition feature of each sample gene sequence in that class.
10. The system according to claim 8, characterized in that the tetramer classifier includes:
a first sub-classifier, configured to make a promoter recognition judgment according to the features of promoters and exons;
a second sub-classifier, configured to make a promoter recognition judgment according to the features of promoters and introns;
a third sub-classifier, configured to make a promoter recognition judgment according to the features of promoters and 3' UTRs.
Application number: CN201410140707.2A. Priority date: 2014-04-09. Filing date: 2014-04-09. Title: A kind of process for recognising human gene promoter and system. Status: Active. Publication: CN103870719B (en).

Publications (2)

Publication Number Publication Date
CN103870719A CN103870719A (en) 2014-06-18
CN103870719B true CN103870719B (en) 2017-06-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant