CN103870719B - Method and system for recognizing human gene promoters - Google Patents
- Publication number
- CN103870719B CN201410140707.2A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Micro-Organisms Or Cultivation Processes Thereof (AREA)
Abstract
This application discloses a promoter recognition method. Cytosine/guanine (CG) preference statistics are computed over multiple sample gene sequences, the sequences are divided into two classes accordingly, and the following steps are performed for each class: the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence are extracted, and corresponding classifiers are built to make a promoter recognition judgment on the sample gene sequence; for sequences recognized as non-promoter sequences, the pentamer composition feature is extracted and a pentamer classifier is built to make a second promoter recognition judgment; when that recognition result meets a preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes full account of the rigidity, CpG island and composition features of genes and, through hierarchical recognition, obtains a more accurate promoter recognition result.
Description
Technical field
The present application relates to the technical field of promoter recognition, and more specifically to a method and system for recognizing human gene promoters.
Background
Since the completion of the draft human genome, the study of human gene expression regulation has become an extremely challenging research direction. Promoter recognition plays an important role in the functional annotation of the whole genome, so how to recognize human promoters faster and better has become an active research field.
At present, promoter prediction mainly proceeds from four directions: identifying the transcription start site of the promoter, the core promoter region, the transcription factor binding region, and the CpG islands of the promoter. A CpG island is defined as follows: CpG dinucleotides are distributed very unevenly in the human genome, and in some segments of the genome CpG occurs at or above the normal frequency; these segments are called CpG islands. Emerging promoter recognition methods propose studying the structural features of promoters, such as flexibility, rigidity and bendability, which are features extracted from three-dimensional space. These structural features can provide important side information for building a promoter recognition system.
Mei Li et al. proposed an algorithm that applies support vector machines (SVM) to features in stages. A first-level SVM classifier recognizes promoters using CpG island features, and the non-promoter sequences separated out by the first-level SVM classifier are then further recognized by a second-level SVM classifier. That algorithm extracts composition features from the same samples throughout and does not use the structural features of genes. Moreover, different promoter sequences have different features, so features extracted from the same samples do not necessarily have the strongest discriminating power. Existing methods therefore suffer from a low recognition rate.
Summary of the invention
In view of this, the present application provides a method and system for recognizing human gene promoters, to solve the problem that existing algorithms have a low recognition rate for gene promoters.
To achieve the above object, the following scheme is proposed:
A method for recognizing human gene promoters, comprising:
receiving a sample set composed of multiple sample gene sequences;
computing the cytosine/guanine (CG) preference feature of each sample gene sequence to obtain a statistical result;
dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
for each class of sample gene sequences after the division, extracting the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence;
building a rigidity classifier from the rigidity feature, a CpG island classifier from the CpG island feature and a tetramer classifier from the tetramer composition feature, the rigidity classifier, the CpG island classifier and the tetramer classifier each making a promoter recognition judgment on the same sample gene sequence and each giving a corresponding first recognition result;
when the three first recognition results meet a first preset condition, determining that the current sample gene sequence is a promoter sequence;
for sample gene sequences that do not meet the first preset condition, extracting their pentamer composition feature and building a pentamer classifier, the pentamer classifier making a promoter recognition judgment on those sample gene sequences and giving a second recognition result;
when the second recognition result meets a second preset condition, determining that the current sample gene sequence is a promoter sequence, and otherwise a non-promoter sequence.
Preferably, the sample set composed of multiple sample gene sequences is:
$\{(x_i, y_i)\}_{i=1}^{N}$
where $x_i \in \mathbb{R}^L$, $y_i \in$ {"promoter", "exon", "intron", "3'UTR"}, $N$ is the number of samples, and $L$ is the length of a sample gene sequence.
Preferably, computing the cytosine/guanine (CG) preference feature of each sample gene sequence to obtain a statistical result is specifically:
for each sample gene sequence $x_i$, computing the proportion of cytosine C and guanine G in it, and recording that proportion alongside the sample in the sample set, where $n_C$ is the number of occurrences of cytosine C in the sample gene sequence and $n_G$ is the number of occurrences of guanine G in the sample gene sequence.
Preferably, dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it, is specifically:
setting a threshold $w$; when the proportion computed for a sample gene sequence exceeds $w$, the sample gene sequence has the CG preference feature; otherwise it does not have the CG preference feature.
Preferably, the rigidity feature of each sample gene sequence is extracted as follows:
a trinucleotide model is used to compute the rigidity feature of each sample gene sequence:
to compute the rigidity feature at each base position of a sample gene sequence, a window of 6 bases is used and the rigidity parameter values of its four overlapping trinucleotides are summed; the sample gene sequence after rigidity feature extraction is expressed as:
$(r_1, r_2, \ldots, r_{L-5})$
where $r_j = \sum_{k=1}^{4} t_k$, $j$ ($j = 1, 2, \ldots, L-5$) is the position index, and $t_k$ is the rigidity parameter value of the $k$-th overlapping trinucleotide at the $j$-th base position.
Preferably, the CpG island feature of each sample gene sequence is extracted as follows:
computing the total cytosine and guanine content CG_con of each sample gene sequence $x_i$:
$\mathrm{CG\_con} = \dfrac{n_C + n_G}{L}$
computing the ratio Obs/Exp of the observed to the expected CpG count of each sample gene sequence $x_i$:
$\mathrm{Obs/Exp} = \dfrac{n_{CG}}{n_C \, n_G} \cdot L$
where $n_C$, $n_G$ and $n_{CG}$ are the numbers of cytosine C, guanine G and the dinucleotide CG, respectively;
each sample gene sequence is then represented by its CpG island features CG_con and Obs/Exp.
Preferably, the tetramer composition feature of each sample gene sequence is extracted as follows:
let $f_{pr}$ be the frequency with which a tetramer occurs in promoters, and take likewise the frequency with which it occurs in the $a$-th class of non-promoter sequences, where $a = 1$ denotes exons, $a = 2$ denotes introns and $a = 3$ denotes 3'UTRs; the Kullback-Leibler (KL) divergence between the two is computed for each tetramer $i \in [1, 4^4]$;
the KL divergence values are sorted in descending order;
with $n_a \in [1, 4^4]$, $m$ is increased step by step from 1 to $n_a$ and the corresponding $R_a$ is computed; if $R_a \geq 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ tetramers are taken as the significant features that distinguish promoters from the $a$-th class of non-promoter sequences.
A system for recognizing human gene promoters, comprising:
a receiving unit, configured to receive a sample set composed of multiple sample gene sequences;
a statistics unit, configured to compute the cytosine/guanine (CG) preference feature of each sample gene sequence to obtain a statistical result;
a classification unit, configured to divide all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
a feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence;
a rigidity classifier built from the rigidity feature, a CpG island classifier built from the CpG island feature, and a tetramer classifier built from the tetramer composition feature, the rigidity classifier, the CpG island classifier and the tetramer classifier each making a promoter recognition judgment on the same sample gene sequence and each giving a corresponding first recognition result;
a first promoter determination unit, configured to determine, when the three first recognition results meet a first preset condition, that the current sample gene sequence is a promoter sequence;
a pentamer feature extraction unit, configured to extract the pentamer composition feature of sample gene sequences that do not meet the first preset condition;
a pentamer classifier built from the pentamer composition feature, the pentamer classifier making a promoter recognition judgment on the sample gene sequences that do not meet the first preset condition and giving a second recognition result;
a second promoter determination unit, configured to determine, when the second recognition result meets a second preset condition, that the current sample gene sequence is a promoter sequence, and otherwise a non-promoter sequence.
Preferably, the feature extraction unit includes:
a rigidity feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the rigidity feature of each sample gene sequence;
a CpG island feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the CpG island feature of each sample gene sequence;
a tetramer composition feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the tetramer composition feature of each sample gene sequence.
Preferably, the tetramer classifier includes:
a first sub-classifier, configured to make a promoter recognition judgment according to the features of promoters and exons;
a second sub-classifier, configured to make a promoter recognition judgment according to the features of promoters and introns;
a third sub-classifier, configured to make a promoter recognition judgment according to the features of promoters and 3'UTRs.
As can be seen from the above technical scheme, the promoter recognition method disclosed in this application computes cytosine/guanine (CG) preference statistics over multiple sample gene sequences and divides them into two classes, one having the CG preference feature and the other not. The following steps are then performed for each class: the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence are extracted; a rigidity classifier is built from the rigidity feature, a CpG island classifier from the CpG island feature and a tetramer classifier from the tetramer composition feature; the three classifiers each make a promoter recognition judgment on the same sample gene sequence, and the three recognition results are considered together; when the first preset condition is met, the current sample gene sequence is determined to be a promoter sequence. For sample gene sequences that do not meet the first preset condition, the pentamer composition feature is extracted and a pentamer classifier is built, which makes a promoter recognition judgment on those sequences; when its recognition result meets the second preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes full account of the rigidity, CpG island and composition features of genes and, through hierarchical recognition, obtains a more accurate promoter recognition result.
Brief description of the drawings
To explain the technical schemes of the embodiments of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for recognizing human gene promoters disclosed in an embodiment of the application;
Fig. 2 is a structural diagram of a system for recognizing human gene promoters disclosed in an embodiment of the application;
Fig. 3 is a structural diagram of the feature extraction unit disclosed in an embodiment of the application;
Fig. 4 is a structural diagram of the tetramer classifier disclosed in an embodiment of the application.
Detailed description
The technical schemes in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the application without creative effort fall within the scope of protection of the application.
Embodiment one
Referring to Fig. 1, Fig. 1 is a flow chart of a method for recognizing human gene promoters disclosed in an embodiment of the application.
As shown in Fig. 1, the method includes:
Step 101: receiving a sample set composed of multiple sample gene sequences;
Step 102: computing the cytosine/guanine (CG) preference feature of each sample gene sequence to obtain a statistical result;
Specifically, the numbers of cytosine (C) and guanine (G) bases in each sample gene sequence are counted. If the proportion of C and G in the whole sample gene sequence exceeds a certain value, the sample gene sequence is considered to have the CG preference feature; otherwise it does not.
Step 103: dividing all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
Step 104: for each class of sample gene sequences after the division, extracting the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence;
Specifically, the rigidity feature, CpG island feature and tetramer composition feature are inherent structural features of genes. A CpG island is defined as follows: CpG dinucleotides are distributed very unevenly in the human genome, and in some segments of the genome CpG occurs at or above the normal frequency; these segments are called CpG islands.
Step 105: building a rigidity classifier from the rigidity feature, a CpG island classifier from the CpG island feature and a tetramer classifier from the tetramer composition feature, the rigidity classifier, the CpG island classifier and the tetramer classifier each making a promoter recognition judgment on the same sample gene sequence and each giving a corresponding first recognition result;
Specifically, three kinds of classifiers are built from the three kinds of extracted features. Each makes a promoter recognition judgment on the sample gene sequence and gives a first recognition result, so the three classifiers give three first recognition results in total.
Step 106: when the three first recognition results meet a first preset condition, determining that the current sample gene sequence is a promoter sequence;
Specifically, the three first recognition results are considered together, and when they meet the preset condition the current sample gene sequence is determined to be a promoter sequence.
Step 107: for sample gene sequences that do not meet the first preset condition, extracting their pentamer composition feature and building a pentamer classifier, the pentamer classifier making a promoter recognition judgment on those sample gene sequences and giving a second recognition result;
Step 108: when the second recognition result meets a second preset condition, determining that the current sample gene sequence is a promoter sequence, and otherwise a non-promoter sequence.
Specifically, the sequences judged in step 106 not to be promoter sequences are recognized further, and when the recognition result meets the second preset condition the sample gene sequence is determined to be a promoter sequence.
The promoter recognition method disclosed in this embodiment computes cytosine/guanine (CG) preference statistics over multiple sample gene sequences and divides them into two classes, one having the CG preference feature and the other not. The following steps are then performed for each class: the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence are extracted; a rigidity classifier is built from the rigidity feature, a CpG island classifier from the CpG island feature and a tetramer classifier from the tetramer composition feature; the three classifiers each make a promoter recognition judgment on the same sample gene sequence, and the three recognition results are considered together; when the first preset condition is met, the current sample gene sequence is determined to be a promoter sequence. For sample gene sequences that do not meet the first preset condition, the pentamer composition feature is extracted and a pentamer classifier is built, which makes a promoter recognition judgment on those sequences; when its recognition result meets the second preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes full account of the rigidity, CpG island and composition features of genes and, through hierarchical recognition, obtains a more accurate promoter recognition result.
Embodiment two
This embodiment describes in detail how each step of Embodiment one is implemented.
First, the sample set composed of multiple sample gene sequences is defined as:
$\{(x_i, y_i)\}_{i=1}^{N}$
where $x_i \in \mathbb{R}^L$, $y_i \in$ {"promoter", "exon", "intron", "3'UTR"}, $N$ is the number of samples, and $L$ is the length of a sample gene sequence.
Note that the 3'UTR is the untranslated region at the 3' end of a gene sequence.
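A minimal sketch of how such a sample set might be held in memory. The class name `Sample`, the helper `make_sample_set`, the exact label strings and the choice to store each sequence as an ACGT string (rather than a numeric vector) are all illustrative assumptions, not part of the application.

```python
from dataclasses import dataclass

# The four classes named in the text; the exact strings are an assumption.
LABELS = ("promoter", "exon", "intron", "3'UTR")


@dataclass(frozen=True)
class Sample:
    sequence: str  # x_i, a DNA string of length L
    label: str     # y_i, one of LABELS


def make_sample_set(pairs, length):
    """Build the sample set, checking that every sequence has length L
    and every label is one of the four known classes."""
    samples = []
    for seq, label in pairs:
        seq = seq.upper()
        if len(seq) != length:
            raise ValueError(f"expected length {length}, got {len(seq)}")
        if set(seq) - set("ACGT"):
            raise ValueError("sequence contains non-ACGT characters")
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        samples.append(Sample(seq, label))
    return samples
```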
Next, the cytosine/guanine (CG) preference feature of each sample gene sequence is computed, specifically:
for each sample gene sequence $x_i$, the proportion of cytosine C and guanine G in it is computed and recorded alongside the sample in the sample set, where $n_C$ is the number of occurrences of cytosine C in the sample gene sequence and $n_G$ is the number of occurrences of guanine G in the sample gene sequence.
Using the CG preference statistics, all sample gene sequences are divided into two classes, one having the CG preference feature and the other not having it. For the division, a threshold $w$ is set first; when the proportion computed for a sample gene sequence exceeds $w$, the sample gene sequence has the CG preference feature; otherwise it does not have the CG preference feature.
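The split described above can be sketched as follows. The sketch assumes that the statistic is the fraction of bases that are C or G, $(n_C + n_G)/L$, and that a sequence has the CG preference feature when this fraction is strictly greater than $w$; the function names and the sample threshold are hypothetical.

```python
def cg_ratio(seq):
    """Fraction of bases in seq that are C or G: (n_C + n_G) / L."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("C") + seq.count("G")) / len(seq)


def split_by_cg_preference(sequences, w):
    """Return (with_preference, without_preference) using threshold w."""
    with_pref, without_pref = [], []
    for seq in sequences:
        # Assumed rule: strictly greater than w means "has CG preference".
        (with_pref if cg_ratio(seq) > w else without_pref).append(seq)
    return with_pref, without_pref
```

For instance, with `w = 0.5` the sequence `GGCC` falls in the first class while `ATAT` and `ACGT` fall in the second.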
Next, the structural features of the two classes of sample gene sequences obtained by the division are extracted:
1. Rigidity feature extraction:
A trinucleotide model is used to compute the rigidity feature of each sample gene sequence:
to compute the rigidity feature at each base position of a sample gene sequence, a window of 6 bases is used and the rigidity parameter values of its four overlapping trinucleotides are summed; the sample gene sequence after rigidity feature extraction is expressed as:
$(r_1, r_2, \ldots, r_{L-5})$
where $r_j = \sum_{k=1}^{4} t_k$, $j$ ($j = 1, 2, \ldots, L-5$) is the position index, and $t_k$ is the rigidity parameter value of the $k$-th overlapping trinucleotide at the $j$-th base position.
For example:
if the 6 bases are TATAAA, the calculation starts from the position of the first base T:
$r_T = t_{TAT} + t_{ATA} + t_{TAA} + t_{AAA}$.
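The TATAAA example can be reproduced with a short sliding-window sketch. The trinucleotide rigidity table is not given in the text, so `toy_params` below contains hypothetical values chosen only to make the arithmetic visible; the function name `rigidity_profile` is likewise an assumption.

```python
def rigidity_profile(seq, params):
    """Return r_1 .. r_{L-5}: for each 6-base window, the sum of the rigidity
    parameters of its four overlapping trinucleotides."""
    seq = seq.upper()
    profile = []
    for j in range(len(seq) - 5):
        window = seq[j:j + 6]
        # Trinucleotides start at offsets 0, 1, 2, 3 inside the window.
        profile.append(sum(params[window[k:k + 3]] for k in range(4)))
    return profile


# Hypothetical parameter values, not taken from any published table.
toy_params = {"TAT": 1.0, "ATA": 2.0, "TAA": 3.0, "AAA": 4.0}
print(rigidity_profile("TATAAA", toy_params))  # [10.0]
```

A sequence of length $L$ yields $L-5$ values, which matches the range of the position index $j$ in the text.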
2. CpG island feature extraction:
The total cytosine C and guanine G content CG_con of each sample gene sequence $x_i$ is computed:
$\mathrm{CG\_con} = \dfrac{n_C + n_G}{L}$
The ratio Obs/Exp of the observed to the expected CpG count of each sample gene sequence $x_i$ is computed:
$\mathrm{Obs/Exp} = \dfrac{n_{CG}}{n_C \, n_G} \cdot L$
where $n_C$, $n_G$ and $n_{CG}$ are the numbers of cytosine C, guanine G and the dinucleotide CG, respectively.
The sample gene sequence after CpG island feature extraction is represented by the two features CG_con and Obs/Exp.
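The two CpG island features can be sketched as follows, assuming the commonly used definitions CG_con $= (n_C + n_G)/L$ and Obs/Exp $= n_{CG} \cdot L / (n_C \cdot n_G)$. Returning 0.0 for Obs/Exp when a sequence contains no C or no G is a choice made here to avoid division by zero, not something the text specifies; the function name is hypothetical.

```python
def cpg_island_features(seq):
    """Return (CG_con, Obs/Exp) for one sequence."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    n_c = seq.count("C")
    n_g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number
    # of CG dinucleotides.
    n_cg = seq.count("CG")
    cg_con = (n_c + n_g) / length
    # Assumed convention: 0.0 when the expected count would be zero.
    obs_exp = (n_cg * length) / (n_c * n_g) if n_c and n_g else 0.0
    return cg_con, obs_exp
```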
3. Tetramer composition feature extraction:
Let $f_{pr}$ be the frequency with which a tetramer occurs in promoters, and take likewise the frequency with which it occurs in the $a$-th class of non-promoter sequences, where $a = 1$ denotes exons, $a = 2$ denotes introns and $a = 3$ denotes 3'UTRs; the Kullback-Leibler (KL) divergence between the two is computed for each tetramer $i \in [1, 4^4]$.
The KL divergence values are sorted in descending order.
With $n_a \in [1, 4^4]$, $m$ is increased step by step from 1 to $n_a$ and the corresponding $R_a$ is computed; if $R_a \geq 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ tetramers are taken as the significant features that distinguish promoters from the $a$-th class of non-promoter sequences.
In this way the significant features of each sample gene sequence $x_i$ are computed. The significant feature sets are indexed by $r$ ($r = 1$ denotes the significant features that distinguish promoters from exons, $r = 2$ those that distinguish promoters from introns, and $r = 3$ those that distinguish promoters from 3'UTRs), and each set consists of the occurrence frequencies of the first $n_a$ tetramers.
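One possible reading of this selection procedure is sketched below. Several details are assumptions rather than statements from the text: each k-mer is scored by the absolute value of its KL term $f_{pr} \log(f_{pr}/f_{non})$, $R_a$ is taken to be the cumulative share of the total score covered by the top $m$ k-mers, and a pseudocount keeps every frequency non-zero. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(sequences, k=4, pseudocount=1.0):
    """Relative frequency of every ACGT k-mer over a list of sequences.
    The pseudocount (an assumption) keeps all frequencies non-zero."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def select_kmers(f_pr, f_non, coverage=0.98):
    """Rank k-mers by |f_pr * log(f_pr / f_non)| and keep the shortest
    prefix of the ranking whose cumulative share reaches `coverage`."""
    scores = {m: abs(f_pr[m] * math.log(f_pr[m] / f_non[m])) for m in f_pr}
    ranked = sorted(scores, key=scores.get, reverse=True)
    total = sum(scores.values())
    if total == 0:
        return []  # identical distributions: nothing discriminates
    selected, running = [], 0.0
    for m in ranked:
        selected.append(m)
        running += scores[m]
        if running / total >= coverage:
            break
    return selected
```

Calling `select_kmers` once per non-promoter class (exons, introns, 3'UTRs) would give three feature sets, one for each value of $a$; passing `k=5` to `kmer_frequencies` gives the pentamer counterpart.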
After feature extraction is finished, steps 105 to 108 of Embodiment one are carried out; they are not repeated here.
Note that the extraction of the pentamer composition feature in step 107 can follow the tetramer composition feature extraction described above, as follows:
Likewise, let $f_{pr}$ be the frequency with which a pentamer occurs in promoters, and take likewise the frequency with which it occurs in the $a$-th class of non-promoter sequences, where $a = 1$ denotes exons, $a = 2$ denotes introns and $a = 3$ denotes 3'UTRs; the Kullback-Leibler (KL) divergence between the two is computed for each pentamer $i \in [1, 4^5]$.
The KL divergence values are sorted in descending order.
With $n_a \in [1, 4^5]$, $m$ is increased step by step from 1 to $n_a$ and the corresponding $R_a$ is computed; if $R_a \geq 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ pentamers are taken as the significant features that distinguish promoters from the $a$-th class of non-promoter sequences.
In this way the significant features of each sample gene sequence $x_i$ are computed. The significant feature sets are indexed by $r$ ($r = 1$ denotes the significant features that distinguish promoters from exons, $r = 2$ those that distinguish promoters from introns, and $r = 3$ those that distinguish promoters from 3'UTRs), and each set consists of the occurrence frequencies of the first $n_a$ pentamers.
Embodiment three
Referring to Fig. 2, Fig. 2 is a structural diagram of a system for recognizing human gene promoters disclosed in an embodiment of the application.
As shown in Fig. 2, the system includes:
a receiving unit 21, configured to receive a sample set composed of multiple sample gene sequences;
a statistics unit 22, configured to compute the cytosine/guanine (CG) preference feature of each sample gene sequence to obtain a statistical result;
a classification unit 23, configured to divide all sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other not having it;
a feature extraction unit 24, configured to extract, for each class of sample gene sequences after the division, the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence;
a rigidity classifier 25 built from the rigidity feature, a CpG island classifier 26 built from the CpG island feature, and a tetramer classifier 27 built from the tetramer composition feature, the rigidity classifier 25, the CpG island classifier 26 and the tetramer classifier 27 each making a promoter recognition judgment on the same sample gene sequence and each giving a corresponding first recognition result;
a first promoter determination unit 28, configured to determine, when the three first recognition results meet a first preset condition, that the current sample gene sequence is a promoter sequence;
a pentamer feature extraction unit 29, configured to extract the pentamer composition feature of sample gene sequences that do not meet the first preset condition;
a pentamer classifier 30 built from the pentamer composition feature, the pentamer classifier 30 making a promoter recognition judgment on the sample gene sequences that do not meet the first preset condition and giving a second recognition result;
a second promoter determination unit 31, configured to determine, when the second recognition result meets a second preset condition, that the current sample gene sequence is a promoter sequence, and otherwise a non-promoter sequence.
For the specific working process of the above system, refer to the method description in Embodiment one. The promoter recognition system disclosed in this embodiment computes cytosine/guanine (CG) preference statistics over multiple sample gene sequences and divides them into two classes, one having the CG preference feature and the other not. The following steps are then performed for each class: the rigidity feature, CpG island feature and tetramer composition feature of each sample gene sequence are extracted; a rigidity classifier is built from the rigidity feature, a CpG island classifier from the CpG island feature and a tetramer classifier from the tetramer composition feature; the three classifiers each make a promoter recognition judgment on the same sample gene sequence, and the three recognition results are considered together; when the first preset condition is met, the current sample gene sequence is determined to be a promoter sequence. For sample gene sequences that do not meet the first preset condition, the pentamer composition feature is extracted and a pentamer classifier is built, which makes a promoter recognition judgment on those sequences; when its recognition result meets the second preset condition, the current sample gene sequence is determined to be a promoter sequence, and otherwise a non-promoter sequence. The application takes full account of the rigidity, CpG island and composition features of genes and, through hierarchical recognition, obtains a more accurate promoter recognition result.
Referring to Fig. 3, Fig. 3 is a structural diagram of the feature extraction unit disclosed in an embodiment of the application. As shown in the figure, the feature extraction unit 24 may include:
a rigidity feature extraction unit 241, configured to extract, for each class of sample gene sequences after the division, the rigidity feature of each sample gene sequence;
a CpG island feature extraction unit 242, configured to extract, for each class of sample gene sequences after the division, the CpG island feature of each sample gene sequence;
a tetramer composition feature extraction unit 243, configured to extract, for each class of sample gene sequences after the division, the tetramer composition feature of each sample gene sequence.
Referring to Fig. 4, Fig. 4 is a structural diagram of the tetramer classifier disclosed in an embodiment of the application. As shown in the figure, the tetramer classifier 27 includes:
a first sub-classifier 271, configured to make a promoter recognition judgment according to the features of promoters and exons;
a second sub-classifier 272, configured to make a promoter recognition judgment according to the features of promoters and introns;
a third sub-classifier 273, configured to make a promoter recognition judgment according to the features of promoters and 3'UTRs.
Note that the pentamer classifier 30 may also be structured in the same way as the tetramer classifier 27 described above.
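The three-sub-classifier structure can be sketched as a small wrapper class. The text does not state how the three sub-classifier outputs are combined, so the rule used here (report a promoter only when all three sub-classifiers do) is an assumption, as are the class and parameter names.

```python
class CompositeKmerClassifier:
    """Wraps three binary sub-classifiers: promoter vs exon, promoter vs
    intron, and promoter vs 3'UTR. Each sub-classifier is a callable that
    maps its own feature vector to True (promoter) or False."""

    def __init__(self, vs_exon, vs_intron, vs_utr3):
        self.sub_classifiers = (vs_exon, vs_intron, vs_utr3)

    def predict(self, feature_sets):
        """feature_sets: one feature vector per sub-classifier, in order."""
        if len(feature_sets) != len(self.sub_classifiers):
            raise ValueError("expected one feature set per sub-classifier")
        # Assumed combination rule: every sub-classifier must say "promoter".
        return all(
            clf(features)
            for clf, features in zip(self.sub_classifiers, feature_sets)
        )
```

The same wrapper could hold pentamer sub-classifiers, since the text allows the pentamer classifier to share this structure.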
Finally, in addition it is also necessary to explanation, herein, such as first and second or the like relational terms be used merely to by
One entity or operation make a distinction with another entity or operation, and not necessarily require or imply these entities or operation
Between there is any this actual relation or order.And, term " including ", "comprising" or its any other variant meaning
Covering including for nonexcludability, so that process, method, article or equipment including a series of key elements not only include that
A little key elements, but also other key elements including being not expressly set out, or also include for this process, method, article or
The intrinsic key element of equipment.In the absence of more restrictions, the key element limited by sentence "including a ...", does not arrange
Except also there is other identical element in the process including the key element, method, article or equipment.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar among the embodiments, reference may be made to one another.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for recognizing human gene promoters, characterized by comprising:
receiving a sample set composed of a plurality of sample gene sequences;
counting the cytosine-guanine (CG) preference feature of each sample gene sequence separately to obtain a statistical result;
dividing all the sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other class not having the CG preference feature;
for each class of sample gene sequences after the division, extracting the rigidity feature, the CpG island feature and the tetrad composition feature of each sample gene sequence in that class;
constructing a rigidity classifier using the rigidity feature, a CpG island classifier using the CpG island feature, and a tetrad classifier using the tetrad composition feature, wherein the rigidity classifier, the CpG island classifier and the tetrad classifier each make a promoter recognition decision on the same sample gene sequence and each give a corresponding first recognition result;
when the three first recognition results satisfy a first preset condition, determining that the current sample gene sequence is a promoter sequence;
for a sample gene sequence that does not satisfy the first preset condition, extracting its pentad composition feature and constructing a pentad classifier, making a promoter recognition decision on that sample gene sequence with the pentad classifier, and giving a second recognition result;
when the second recognition result satisfies a second preset condition, determining that the current sample gene sequence is a promoter sequence, and otherwise determining that it is a non-promoter sequence.
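The two-stage decision recited in claim 1 can be sketched as a small cascade. This is only an illustrative sketch: the claim does not state what the first and second preset conditions are, so the default used here (all three first-stage classifiers must agree, and the second stage simply accepts the pentad classifier's verdict) is an assumption, and the names `recognize_promoter`, `first_stage`, `second_stage` and `first_condition` are hypothetical.

```python
from typing import Callable, Sequence

# A classifier is modelled as any callable that returns True for "promoter".
Classifier = Callable[[str], bool]


def recognize_promoter(
    seq: str,
    first_stage: Sequence[Classifier],
    second_stage: Classifier,
    first_condition: Callable[[Sequence[bool]], bool] = all,
) -> bool:
    """Two-stage cascade: first-stage votes, then a fallback classifier.

    first_stage     : e.g. the rigidity, CpG island and tetrad classifiers
    second_stage    : e.g. the pentad classifier
    first_condition : assumed form of the "first preset condition"
                      (default: every first-stage classifier says promoter)
    """
    first_results = [clf(seq) for clf in first_stage]
    if first_condition(first_results):
        return True
    # Sequences rejected by the first stage are re-examined by the second
    # stage; its verdict is taken as the assumed "second preset condition".
    return second_stage(seq)
```

Passing the condition as a callable keeps the sketch neutral about the actual rule; a majority vote, for example, would be `first_condition=lambda r: sum(r) >= 2`.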
2. The method according to claim 1, characterized in that the sample set composed of a plurality of sample gene sequences is:
$$\{(x_i, y_i)\}_{i=1}^{N}$$
where $x_i \in \mathbb{R}^{L}$ is a sample gene sequence, $y_i \in \{$"promoter", "exon", "intron", "3′ UTR"$\}$, $N$ is the number of samples, and $L$ is the length of a sample gene sequence.
3. The method according to claim 2, characterized in that counting the cytosine-guanine (CG) preference feature of each sample gene sequence separately to obtain a statistical result specifically comprises:
for each sample gene sequence $x_i$, computing the ratio of the cytosine C and guanine G content therein, the sample set after the counting being expressed as:
[formula not reproduced in the source text]
where $n_C$ is the number of occurrences of cytosine C in the sample gene sequence and $n_G$ is the number of occurrences of guanine G in the sample gene sequence.
4. The method according to claim 3, characterized in that dividing all the sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other class not having the CG preference feature, specifically comprises:
setting a threshold $w$; when [condition not reproduced in the source text], the sample gene sequence has the CG preference feature, and otherwise the sample gene sequence does not have the CG preference feature.
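The split into CG-preferring and non-CG-preferring sequences can be illustrated as follows. This is a sketch under stated assumptions: the statistic is taken to be the fraction of bases that are C or G, and a sequence is taken to have the CG preference when that fraction is at least $w$. Neither the exact statistic nor the direction of the comparison survives in the text above, and the function names and the default `w=0.5` are hypothetical.

```python
def cg_statistic(seq: str) -> float:
    """Fraction of bases that are C or G (assumed form of the CG statistic)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def split_by_cg_preference(seqs, w=0.5):
    """Partition sequences into (with CG preference, without CG preference).

    Assumed rule: a sequence has the CG preference when cg_statistic >= w.
    """
    with_pref, without_pref = [], []
    for s in seqs:
        if cg_statistic(s) >= w:
            with_pref.append(s)
        else:
            without_pref.append(s)
    return with_pref, without_pref
```

Each of the two resulting groups would then go through feature extraction and classification separately.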
5. The method according to claim 4, characterized in that the process of extracting the rigidity feature of each sample gene sequence specifically comprises:
computing the rigidity feature of each sample gene sequence with a trinucleotide model:
when computing the rigidity feature at each base position of a sample gene sequence, a window 6 bases long is used and the rigidity parameter values of its four overlapping trinucleotides are summed, the sample gene sequence after rigidity feature extraction being expressed as:
[formula not reproduced in the source text]
where $j$ ($j = 1, 2, \ldots, L-5$) is the position index and $t_k$ is the rigidity parameter value of each overlapping trinucleotide at the $j$-th base position.
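The sliding-window computation in claim 5 can be sketched as below: each window of 6 bases contains four overlapping trinucleotides, and their rigidity parameter values are summed, giving $L-5$ values per sequence. The lookup table `tri_rigidity` is a hypothetical input; the actual trinucleotide rigidity parameters are not given in the text and are not reproduced here.

```python
def rigidity_profile(seq: str, tri_rigidity: dict) -> list:
    """Rigidity value for each 6-base window of `seq`.

    tri_rigidity maps every trinucleotide (e.g. "ACG") to its rigidity
    parameter; the table itself must be supplied by the caller.
    """
    seq = seq.upper()
    profile = []
    for j in range(len(seq) - 5):  # L - 5 windows in total
        window = seq[j:j + 6]
        # The four overlapping trinucleotides start at offsets 0, 1, 2, 3.
        trinucleotides = [window[k:k + 3] for k in range(4)]
        profile.append(sum(tri_rigidity[t] for t in trinucleotides))
    return profile
```

A sequence shorter than 6 bases yields an empty profile, since no complete window fits.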
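6. The method according to claim 4, characterized in that the process of extracting the CpG island feature of each sample gene sequence specifically comprises:
computing the total cytosine and guanine content CG_con of each sample gene sequence $x_i$:
$$\mathrm{CG\_con} = \frac{n_C + n_G}{L}$$
computing the ratio Obs/Exp of the observed to the expected number of CpG dinucleotides for each sample gene sequence $x_i$:
$$\mathrm{Obs/Exp} = \frac{n_{CG} \cdot L}{n_C \cdot n_G}$$
where $n_C$, $n_G$ and $n_{CG}$ denote the numbers of cytosine C, guanine G and the dinucleotide CG, respectively,
the sample gene sequence being expressed as:
[formula not reproduced in the source text]
where [definition not reproduced in the source text].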
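The two CpG island quantities can be computed directly from base counts. The sketch below assumes the commonly used definitions, CG_con $= (n_C + n_G)/L$ and Obs/Exp $= n_{CG} \cdot L / (n_C \cdot n_G)$. The function name and the choice to return 0.0 when a count is zero are illustrative assumptions.

```python
def cpg_island_features(seq: str) -> tuple:
    """Return (CG_con, Obs/Exp) for one sequence.

    CG_con  = (n_C + n_G) / L
    Obs/Exp = n_CG * L / (n_C * n_G), taken as 0.0 when n_C or n_G is zero
    """
    seq = seq.upper()
    length = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    # Count CG dinucleotides (a C immediately followed by a G).
    n_cg = sum(1 for i in range(length - 1) if seq[i:i + 2] == "CG")
    cg_con = (n_c + n_g) / length if length else 0.0
    obs_exp = (n_cg * length) / (n_c * n_g) if n_c and n_g else 0.0
    return cg_con, obs_exp
```

For instance, `cpg_island_features("CGCGAT")` has $n_C = n_G = n_{CG} = 2$ and $L = 6$, so Obs/Exp is $2 \cdot 6 / 4 = 3$.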
7. The method according to claim 4, characterized in that the process of extracting the tetrad composition feature of each sample gene sequence specifically comprises:
letting $f_{pr}$ be the frequency with which a tetrad occurs in promoters and [symbol not reproduced in the source text] be the frequency with which the tetrad occurs in the $a$-th class of non-promoter sequences, where $a = 1$ denotes exons, $a = 2$ denotes introns and $a = 3$ denotes 3′ UTRs, the tetrad-based KL divergence is:
[formula not reproduced in the source text]
sorting [symbol not reproduced in the source text] in descending order and denoting the result $d_a(m)$, where [definition not reproduced in the source text];
letting:
[formula not reproduced in the source text]
increasing $m$ step by step from 1 to $n_a$ and computing the corresponding $R_a$; if $R_a \geq 98\%$ when $m = n_a$, the occurrence frequencies of the first $n_a$ tetrads are taken as the salient features that distinguish promoters from the $a$-th class of non-promoter sequences.
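The selection procedure in claim 7 (rank tetrads by their contribution to a divergence between promoter and non-promoter frequencies, then keep the top-ranked ones until 98% of the total is covered) can be sketched as follows. Because the exact formulas are not available in the text, this sketch makes its own choices, all of them assumptions: it uses the symmetrised per-tetrad term $(f_{pr} - f_a)\log(f_{pr}/f_a)$, which is non-negative for every tetrad so that the cumulative ratio rises monotonically; it adds a pseudocount so that no frequency is zero; and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# All 256 possible tetrads over the DNA alphabet.
TETRADS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetrad_frequencies(seqs, pseudocount=1.0):
    """Relative frequency of each tetrad over a set of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update(s[i:i + 4] for i in range(len(s) - 3))
    total = sum(counts[t] + pseudocount for t in TETRADS)
    return {t: (counts[t] + pseudocount) / total for t in TETRADS}


def select_tetrads(f_pr, f_a, coverage=0.98):
    """Tetrads whose divergence terms make up `coverage` of the total.

    Per-tetrad term (assumed): (f_pr - f_a) * log(f_pr / f_a), always >= 0.
    """
    terms = {t: (f_pr[t] - f_a[t]) * math.log(f_pr[t] / f_a[t]) for t in f_pr}
    total = sum(terms.values())
    if total == 0:
        return []  # identical distributions: nothing discriminates
    ranked = sorted(terms, key=terms.get, reverse=True)
    selected, running = [], 0.0
    for t in ranked:
        selected.append(t)
        running += terms[t]
        if running / total >= coverage:  # cumulative ratio reached the target
            break
    return selected
```

Running `select_tetrads` once per non-promoter class (exon, intron, 3′ UTR) would give one feature subset for each of the three sub-classifiers.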
8. A system for recognizing human gene promoters, characterized by comprising:
a receiving unit, configured to receive a sample set composed of a plurality of sample gene sequences;
a statistics unit, configured to count the cytosine-guanine (CG) preference feature of each sample gene sequence separately to obtain a statistical result;
a classification unit, configured to divide all the sample gene sequences into two classes according to the statistical result, one class having the CG preference feature and the other class not having the CG preference feature;
a feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the rigidity feature, the CpG island feature and the tetrad composition feature of each sample gene sequence in that class;
a rigidity classifier constructed from the rigidity feature, a CpG island classifier constructed from the CpG island feature, and a tetrad classifier constructed from the tetrad composition feature, wherein the rigidity classifier, the CpG island classifier and the tetrad classifier each make a promoter recognition decision on the same sample gene sequence and each give a corresponding first recognition result;
a first promoter determining unit, configured to determine that the current sample gene sequence is a promoter sequence when the three first recognition results satisfy a first preset condition;
a pentad feature extraction unit, configured to extract the pentad composition feature of a sample gene sequence that does not satisfy the first preset condition;
a pentad classifier constructed from the pentad composition feature, the pentad classifier making a promoter recognition decision on the sample gene sequence that does not satisfy the first preset condition and giving a second recognition result;
a second promoter determining unit, configured to determine that the current sample gene sequence is a promoter sequence when the second recognition result satisfies a second preset condition, and otherwise to determine that it is a non-promoter sequence.
9. The system according to claim 8, characterized in that the feature extraction unit comprises:
a rigidity feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the rigidity feature of each sample gene sequence in that class;
a CpG island feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the CpG island feature of each sample gene sequence in that class;
a tetrad composition feature extraction unit, configured to extract, for each class of sample gene sequences after the division, the tetrad composition feature of each sample gene sequence in that class.
10. The system according to claim 8, characterized in that the tetrad classifier comprises:
a first sub-classifier, configured to make a promoter recognition decision based on the features of promoters and exons;
a second sub-classifier, configured to make a promoter recognition decision based on the features of promoters and introns;
a third sub-classifier, configured to make a promoter recognition decision based on the features of promoters and 3′ UTRs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410140707.2A CN103870719B (en) | 2014-04-09 | 2014-04-09 | A kind of process for recognising human gene promoter and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103870719A CN103870719A (en) | 2014-06-18 |
CN103870719B CN103870719B (en) | 2017-06-16 |
Family
ID=50909244
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201410140707.2A CN103870719B (en) | Active | 2014-04-09 | 2014-04-09 | A kind of process for recognising human gene promoter and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103870719B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376234B (en) * | 2014-12-03 | 2017-12-26 | 苏州大学 | promoter recognition method and system |
CN104462870A (en) * | 2015-01-09 | 2015-03-25 | 苏州大学 | Method and device for identifying human gene promoter |
CN104834834A (en) * | 2015-04-09 | 2015-08-12 | 苏州大学张家港工业技术研究院 | Construction method and device of promoter recognition system |
CN105550538B (en) * | 2016-02-03 | 2018-06-01 | 苏州大学 | A kind of process for recognising human gene promoter and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1784498A (en) * | 2003-03-28 | 2006-06-07 | 科根泰克股份有限公司 | Genomic profiling of regulatory factor binding sites |
WO2007050706A2 (en) * | 2005-10-27 | 2007-05-03 | University Of Missouri-Columbia | Dna methylation biomarkers in lymphoid and hematopoietic malignancies |
CN102282542A (en) * | 2008-10-14 | 2011-12-14 | 奇托尔·V·斯里尼瓦桑 | TICC-paradigm to build formally verified parallel software for multi-core chips |
Non-Patent Citations (9)
Title |
---|
DNA sequence and structural properties as predictors of human and mouse promoters;Pelin Akan,et al.,;《Gene》;20080229;第410卷(第1-2期);165-176 * |
DNA structural properties in the classification of genomic transcription regulation elements;Meysman P, et al.,;《Bioinformatics and biology insights》;20121231;第6卷;155-168 * |
G-四联体的生物学功能研究进展;谌錾,等;《生理科学进展》;20101231;第41卷(第5期);329-334 * |
Large-scale human promoter mapping using CpG islands;Ioshikhes I P,et al.,;《nature genetics》;20001231;第26卷;61-63 * |
Structure,function and evolution of CpG island promoters;Antequera F.;《Cellular and Molecular Life Sciences CMLS》;20031231;第60卷(第8期);1647-1658 * |
Towards accurate human promoter recognition: A review of currently used sequence features and classification methods;Zeng J, et al.,;《Briefings in Bioinformatics》;20090616;第10卷(第5期);498-508 * |
人类启动子识别算法研究;梅丽;《中国优秀硕士学位论文全文数据库基础科学辑》;20120515(第05期);A006~14 * |
基于KL散度和BP神经网络的人类基因启动子识别;李文举,等;《辽宁师范大学学报(自然科学版)》;20100331;第33卷(第1期);42-45 * |
基于碱基偏好分析和SVM的植物启动子识别;李文举,等;《辽宁师范大学学报(自然科学版)》;20120630;第35卷(第2期);183~187 * |
Also Published As
Publication number | Publication date |
---|---|
CN103870719A (en) | 2014-06-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |