CN101533484A - Method for forecasting gene transferring horizontally in genome - Google Patents

Method for forecasting gene transferring horizontally in genome

Publication number
CN101533484A
Authority
CN
China
Prior art keywords
gene
genome
transferring horizontally
dimensional space
horizontally
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810101786A
Other languages
Chinese (zh)
Inventor
陈阳
王守觉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Semiconductors of CAS
Original Assignee
Institute of Semiconductors of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Semiconductors of CAS
Priority to CN200810101786A
Publication of CN101533484A

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for predicting horizontally transferred genes in a genome using the principle of biomimetic pattern recognition, which is based on homology continuity. The method comprises: extracting features from gene sequences; transforming each gene into a point in a high-dimensional space; analyzing how the points of same-class samples are distributed in that space; determining the geometric bodies that cover the sample subspace and constructing a network from them; and predicting horizontally transferred genes with the network.

Description

Method for forecasting gene transferring horizontally in genome
Technical field
The present invention relates to a method for predicting horizontally transferred genes, and in particular to a method that uses the principle of biomimetic pattern recognition to construct a training network from genes of the same class and uses that network to predict horizontally transferred genes.
Background art
Horizontal gene transfer (HGT), also called lateral gene transfer (LGT), is the exchange of genetic material between different organisms, or between organelles within a single cell. The organisms involved may belong to the same species but carry different genetic information, or they may be distantly related or even unrelated. As the genomes of humans and other organisms have been sequenced one after another, large numbers of homologous genes have been found in the genomes of different species, even of distantly related organisms, which further confirms how widespread and far-reaching horizontal gene transfer is. Predicting horizontally transferred genes matters both for understanding biological evolution and for estimating, qualitatively and quantitatively, the genetic material exchanged between species. In recent years, DNA molecules with transforming activity and competent cells that can actively take up foreign DNA have been found in large numbers in natural environments, giving a new understanding of the horizontal gene transfer that occurs in the environment. Further study of horizontal gene transfer and its ecological effects will help to reassess genetically engineered organisms and allow genetic engineering and genetically modified organisms to be applied to greater effect.
Many methods now exist for identifying horizontally transferred genes. Typical ones predict them from unusually high BLAST hits between genes of different species, or from phylogenetic trees constructed for the genes; both approaches, however, work well only when abundant genomic data are available. Another class of methods is based on gene sequence features. These methods all rest on the assumption that certain features are characteristic of a given genome, so a gene in that genome that deviates from these features is a horizontally transferred gene. One commonly used method predicts horizontally transferred genes by scoring eight-nucleotide frequencies (W8); it sets its threshold automatically for each genome and achieves a much higher hit rate than earlier algorithms. There is also a prediction method based on the support vector machine (SVM), whose hit rate is higher than that of W8. Neither algorithm, however, achieves a satisfactory hit rate: W8 in particular has a very low hit rate for some bacterial genomes, and the SVM must resort to strand-by-strand prediction to gain even a modest improvement.
Summary of the invention
The object of the present invention is to provide a new method for predicting horizontally transferred genes.
To achieve this object, the invention predicts horizontally transferred genes on the basis of the biomimetic pattern recognition principle: gene sequence features are extracted by a statistical method, each gene is transformed into a point in a high-dimensional space, the manifold on which same-class samples are distributed in that space is analyzed, the geometric bodies covering the sample subspace are determined and assembled into a network, and horizontally transferred genes are predicted. The method comprises the following steps:
Step 1: extract gene sequence features by a statistical method;
Step 2: convert all genes in the genome into feature vectors according to step 1, so that each gene is mapped to a point in a high-dimensional space;
Step 3: analyze the distribution, in the high-dimensional space, of the points of the genes in the same genome, determine the coverage of the sample subspace, and construct a training network;
Step 4: predict horizontally transferred genes with the constructed network.
Further, the sequence features are extracted by a statistical method. Many such methods exist, for example the WF method, based on counting base-word frequencies, and the FCU method, based on counting absolute codon usage frequencies.
Further, in transforming a gene into a feature vector: because a gene sequence is composed of A, T, G and C, it suffices to count how often each word formed from these four characters occurs. If the word length is 1, there are 4 possible words and the feature vector has 4 dimensions; if the word length is 2, there are 16 possible words and the feature vector has 16 dimensions. In general, the feature vector has $4^r$ dimensions, where r is the word length.
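The word-counting step can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `word_frequency_vector`, the use of overlapping windows, the normalization to relative frequencies, and the skipping of words that contain symbols other than A, C, G, T are all assumptions made for illustration.

```python
from itertools import product

def word_frequency_vector(seq, r):
    """Relative frequency of every length-r word over A, C, G, T (4**r values)."""
    words = ["".join(p) for p in product("ACGT", repeat=r)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    total = 0
    for k in range(len(seq) - r + 1):   # overlapping windows (assumption)
        word = seq[k:k + r]
        if word in counts:              # skip words with N or other symbols (assumption)
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]

vec = word_frequency_vector("ACGTAC", 2)
print(len(vec))  # 16 components for word length 2
```

For word length r the returned list always has 4**r entries, matching the dimension stated above, regardless of which words actually occur in the sequence.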
Further, analyzing the distribution of the genes' points in the high-dimensional space mainly means computing the Euclidean distances between the points and determining the order of the sample points.
Further, the sample subspace is covered with geometric bodies.
Further, the geometric bodies covering the sample subspace are generally formed as the topological product of a simplex of some dimension and a hypersphere. For example, a one-dimensional simplex is a line segment, and its topological product with a hypersphere has a sausage-like shape; this geometric body is therefore called a hyper-sausage neuron.
Further, predicting horizontally transferred genes with the constructed network means: when a gene in the test sample is covered by the network, the gene is not a horizontally transferred gene; when a gene in the test sample is not covered by the network, the gene is a horizontally transferred gene.
The invention applies biomimetic pattern recognition (BPR) to the prediction of horizontally transferred genes: gene sequence features are extracted by a statistical method, each gene is transformed into a point in a high-dimensional space, the manifold on which same-class samples are distributed in that space is analyzed, the geometric bodies covering the sample subspace are determined and assembled into a network, and the network is used to predict horizontally transferred genes. The results are better than those of the W8 and SVM methods, with an improved hit rate.
Description of drawings
Fig. 1 is a schematic diagram, in two-dimensional space, of hyper-sausage neurons with different radii;
Fig. 2 is a flowchart of the algorithm proposed by the invention.
Embodiment
To make the object, technical solution and advantages of the invention clearer, the invention is described in more detail below with reference to a specific embodiment and the accompanying drawings.
The invention is a method for predicting horizontally transferred genes using biomimetic pattern recognition theory. First, gene sequence features are extracted by a statistical method; each gene is then transformed into a point in a high-dimensional space; next, the manifold on which same-class samples are distributed in that space is analyzed and the geometric bodies covering the sample subspace are determined; a network is then constructed, and finally the network is used to predict horizontally transferred genes.
For gene sequences, features are extracted by a statistical method. The experiments use the FCU method, based on counting absolute codon usage frequencies, mainly because it captures both the codon usage bias of a gene and the amino acid composition of the protein the gene encodes. The method chiefly counts dinucleotide frequencies (FD), computed as

$$ f_{ij} = \frac{o_{ij}}{n_j} $$

where j = 0, 1, 2, 3. When j = 0, the frequencies of consecutive dinucleotides are counted; when j = 1, the frequencies of the dinucleotide formed by the first two bases of each codon; when j = 2, the frequencies of the dinucleotide formed by the last two bases of each codon; when j = 3, the frequencies of the dinucleotide formed by the first and third bases of each codon. With 16 possible dinucleotides for each of the four values of j, this yields a 64-dimensional vector.
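The four counting modes can be sketched in Python as follows. This is an illustrative reading rather than the patent's implementation: it assumes that i indexes the 16 dinucleotides, that o_ij is the number of times dinucleotide i is observed in mode j, and that n_j is the total number of dinucleotides counted in mode j. The function name `fcu_features`, the sliding window used for j = 0, and the dropping of a trailing incomplete codon are likewise assumptions.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]  # the 16 possible words

def fcu_features(gene):
    """Return a 64-value vector: 16 dinucleotide frequencies for each mode j = 0..3."""
    gene = gene.upper()
    usable = len(gene) - len(gene) % 3                   # drop a trailing partial codon
    codons = [gene[k:k + 3] for k in range(0, usable, 3)]
    modes = [
        [gene[k:k + 2] for k in range(len(gene) - 1)],   # j = 0: consecutive dinucleotides
        [c[0] + c[1] for c in codons],                   # j = 1: codon positions 1 and 2
        [c[1] + c[2] for c in codons],                   # j = 2: codon positions 2 and 3
        [c[0] + c[2] for c in codons],                   # j = 3: codon positions 1 and 3
    ]
    features = []
    for pairs in modes:
        counts = Counter(p for p in pairs if p in DINUCLEOTIDES)  # o_ij (assumed meaning)
        n_j = sum(counts.values())                                # n_j (assumed meaning)
        features.extend(counts[d] / n_j if n_j else 0.0 for d in DINUCLEOTIDES)
    return features

print(len(fcu_features("ATGAAA")))  # 64
```

Each block of 16 values sums to 1 whenever at least one valid dinucleotide is counted in that mode, so the four blocks are comparable across genes of different lengths.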
With this feature extraction method, each gene yields a 64-dimensional vector and therefore corresponds to one point in the 64-dimensional feature space. The distribution of the points of same-class genes in this space is analyzed, and the network is constructed with the following algorithm:
1) Initialize the feature set $S_a$ as empty; let $S_b$ contain all the sample feature vectors used to determine the network structure; initialize the neuron set $S_{HSN}$ as empty;
2) Take any feature vector from $S_b$ and put it into $S_a$;
3) Select a feature vector $P_a$ from $S_a$ and a feature vector $P_b$ from $S_b$ such that $\|P_a - P_b\|$ is minimal, move $P_b$ from $S_b$ into $S_a$, and add the neuron defined by the segment $P_aP_b$ to $S_{HSN}$;
4) Repeat step 3 until $S_b$ is empty; $S_{HSN}$ is then the set of neurons that make up the network.
This algorithm generates a minimum spanning tree.
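The construction above is in effect Prim's algorithm, and a minimal Python sketch may make the set bookkeeping concrete. The function name `build_hsn_segments` and the representation of each neuron as a pair of point indices are assumptions; the brute-force search over all pairs is chosen for clarity, not efficiency.

```python
import math

def build_hsn_segments(points):
    """Grow a minimum spanning tree over the points; each edge is one neuron segment."""
    if not points:
        return []
    s_a = [0]                              # indices already in the tree (S_a)
    s_b = set(range(1, len(points)))       # indices still to be attached (S_b)
    s_hsn = []                             # segments (a, b) making up the network (S_HSN)
    while s_b:
        # closest pair with one end in S_a and the other in S_b
        a, b = min(
            ((i, j) for i in s_a for j in s_b),
            key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
        )
        s_a.append(b)
        s_b.remove(b)
        s_hsn.append((a, b))
    return s_hsn

print(build_hsn_segments([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
```

For n sample points the result always contains n - 1 segments. Starting from index 0 is arbitrary, in line with step 2, which allows any starting vector.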
The hyper-sausage neuron network is formed as the topological product of the generated minimum spanning tree and a hypersphere, and is used to identify horizontally transferred genes. The hyper-sausage neuron model, shown in Fig. 1, is the topological product of a hypersphere and a one-dimensional manifold in the space. Intuitively, this high-dimensional geometric body can be regarded as the union of the regions swept by a hypersphere rolling along the track given by the one-dimensional manifold. For ease of implementation, the one-dimensional manifold can be approximated by a chain of line segments joined end to end. Moving the center of a hypersphere along one of these segments gives a sausage-like basic high-dimensional geometric unit. Connecting adjacent neurons gives a hyper-sausage chain, and each chain can describe the region occupied by one class of samples in the feature space. The model is described by the following equation:
$$ f(X) = \mathrm{sgn}\left( 2^{-\frac{d^2(X,\,\overline{X_1X_2})}{r^2}} - 0.5 \right) $$

where r is the neuron radius, and the distance from point X to the line segment $\overline{X_1X_2}$ is computed as follows:

$$ d^2(X,\overline{X_1X_2}) = \begin{cases} \|X - X_1\|^2, & q(X,X_1,X_2) < 0 \\ \|X - X_2\|^2, & q(X,X_1,X_2) > \|X_1 - X_2\| \\ \|X - X_1\|^2 - q^2(X,X_1,X_2), & \text{otherwise} \end{cases} $$

$$ q(X,X_1,X_2) = \frac{(X - X_1)\cdot(X_2 - X_1)}{\|X_1 - X_2\|} $$

If the test sample is similar to the training samples, f(X) ≥ 0; otherwise f(X) < 0.
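A minimal Python sketch of the neuron and of the coverage test built on it follows. The function names, the handling of a zero-length segment, and the use of a single shared radius r for all neurons are assumptions made for illustration. The sketch takes q as the projection of X - X_1 onto the direction from X_1 to X_2, which is the reading under which the three cases of the distance rule are geometrically consistent.

```python
import math

def segment_distance_sq(x, x1, x2):
    """Squared distance from point x to the segment x1-x2 (piecewise rule)."""
    axis = [b - a for a, b in zip(x1, x2)]       # X2 - X1
    offset = [a - b for a, b in zip(x, x1)]      # X - X1
    length = math.sqrt(sum(c * c for c in axis))
    offset_sq = sum(c * c for c in offset)
    if length == 0.0:                            # degenerate segment (assumption)
        return offset_sq
    q = sum(o * a for o, a in zip(offset, axis)) / length
    if q < 0:                                    # nearest point is X1
        return offset_sq
    if q > length:                               # nearest point is X2
        return sum((a - b) ** 2 for a, b in zip(x, x2))
    return max(0.0, offset_sq - q * q)           # nearest point lies inside the segment

def hsn_output(x, x1, x2, r):
    """sgn(2 ** (-d^2 / r^2) - 0.5): non-negative exactly when d <= r."""
    value = 2.0 ** (-segment_distance_sq(x, x1, x2) / r ** 2) - 0.5
    return (value > 0) - (value < 0)

def is_covered(x, segments, r):
    """A point is covered when at least one neuron of the network accepts it."""
    return any(hsn_output(x, x1, x2, r) >= 0 for x1, x2 in segments)

network = [((0.0, 0.0), (1.0, 0.0))]
print(is_covered((0.5, 0.5), network, 1.0), is_covered((3.0, 0.0), network, 1.0))
```

Under the prediction rule stated in the summary, a test gene whose feature vector is not covered by any neuron would be reported as horizontally transferred.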
The application example of the invention is the prediction of horizontally transferred genes in bacterial genomes. The specific implementation steps are as follows:
1) Select gene data. Because very few horizontally transferred genes are known in bacterial genomes, horizontally transferred genes are simulated by artificially inserting genes into bacterial genomes. Since horizontal transfer events in bacterial genomes do occur in nature, phage genes or bacterial genes are generally chosen as donor genes. In this application, 1615 genes from 27 phage genomes form the donor gene data set, and the recipient gene data sets are taken from Escherichia coli (Escherichia coli K12), Borrelia burgdorferi and Bacillus cereus (Bacillus cereus ZK). All three are common pathogenic bacteria; their genome sequences come from the GenBank database under accession numbers NC_000913, NC_001318 and NC_006274, respectively. Donor genes are chosen at random from the donor data set and inserted into the recipient data set as horizontally transferred genes; the number of donor genes chosen is 2% of the number of recipient genes.
2) Each recipient genome is predicted separately. The hyper-sausage neuron network is trained on the recipient gene samples, and the artificially inserted gene sequences serve as test samples. The horizontally transferred genes to be identified here are those artificially inserted into the bacterial genome, but the bacterial genome also contains horizontally transferred genes of its own. A sound algorithm should predict not only the inserted genes but also the genome's native horizontally transferred genes; for the latter, however, there is no way to judge recognition accuracy. The quality of an algorithm is therefore generally measured by its hit rate, that is, by how many of the artificially inserted genes the algorithm identifies. In this application, the hit rate for each bacterial genome is averaged over 100 insertions:

$$ HT = \frac{1}{100}\sum_{i=1}^{100} HT_i(G) $$

where G denotes a given bacterial genome.
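The averaging can be sketched in Python as below. This is a simplified illustration under stated assumptions: `mean_hit_rate` and its parameters are hypothetical names, `is_hgt` stands for any predictor already trained on the recipient genes, donor genes are drawn without replacement, and the insertion itself is reduced to sampling because only the inserted genes are scored.

```python
import random

def mean_hit_rate(recipient_genes, donor_genes, is_hgt, trials=100, fraction=0.02, seed=0):
    """Average, over repeated random insertions, the share of inserted genes detected."""
    rng = random.Random(seed)                      # fixed seed for repeatability (assumption)
    n_insert = max(1, round(fraction * len(recipient_genes)))
    n_insert = min(n_insert, len(donor_genes))
    rates = []
    for _ in range(trials):
        inserted = rng.sample(donor_genes, n_insert)         # one simulated insertion
        hits = sum(1 for gene in inserted if is_hgt(gene))   # inserted genes flagged as HGT
        rates.append(hits / n_insert)                        # HT_i(G)
    return sum(rates) / len(rates)                           # HT

recipients = list(range(100))       # placeholder recipient genes
donors = list(range(1000, 1050))    # placeholder donor genes
print(mean_hit_rate(recipients, donors, lambda gene: gene >= 1000))  # 1.0
```

The placeholder integers stand in for feature vectors; in practice `is_hgt` would report whether a gene's feature vector falls outside the trained network.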
Table 1 compares BPR with the SVM and W8 methods in predicting horizontally transferred genes in bacterial genomes; the results were obtained by ten-fold cross-validation, and the generalization of the network is 88%. As the table shows, the method greatly improves the hit rate for horizontally transferred genes. For Escherichia coli (Escherichia coli K12) in particular, the hit rate is 42.3 percentage points higher than that of W8 and 30.5 percentage points higher than that of the SVM method, which also uses the FCU method to extract sequence features.

Species                 W8 hit rate   SVM hit rate   BPR hit rate
Escherichia coli        37.5%         49.3%          79.8%
Bacillus cereus         54.1%         57.1%          83.0%
Borrelia burgdorferi    75.8%         76.7%          94.9%

Table 1
3) Practical verification of HGT prediction by the biomimetic pattern recognition algorithm. It has been confirmed that the genome of Enterococcus faecalis contains vancomycin-resistance genes acquired by horizontal transfer, seven in all. Their "locus-tag" identifiers in the NCBI database are EF2293-EF2299, and their positions in the Enterococcus faecalis genome are 2212353-2212961, 2212967-2213995, 2213988-2214959, 2214956-2215783, 2215801-221607, 2216783-2218126 and 2218126-2218788, respectively. These seven genes were used as test samples, and the remaining genes of the Enterococcus faecalis genome were used as training samples to build the network. All seven genes were identified, which further confirms the validity of biomimetic pattern recognition for predicting horizontally transferred genes.
Biomimetic pattern recognition is based on the principle of homology continuity. Gene sequences within a genome have intrinsic features of their own, and these features, mapped into a high-dimensional space, satisfy homology continuity. Predicting horizontally transferred genes from sequence features amounts to finding the genes that deviate from the features of the genome as a whole, so a method based on biomimetic pattern recognition can predict horizontally transferred genes in bacterial genomes well. The invention sets out only this new method of predicting horizontally transferred genes with biomimetic pattern recognition theory; with further research, the method may find wider application in other areas of gene recognition.
The above is only an embodiment of the invention, and the scope of protection of the invention is not limited to it. Any variation or substitution that a person familiar with the art could readily conceive within the technical scope disclosed by the invention falls within the scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims (8)

1. A method for predicting horizontally transferred genes in a genome, characterized in that it comprises the following steps:
1) extracting gene sequence features by a statistical method;
2) converting all genes in the genome into feature vectors according to step 1), each gene being mapped to a point in a high-dimensional space;
3) analyzing the distribution, in the high-dimensional space, of the points of the genes in the same genome, determining the coverage of the sample subspace, and constructing a training network;
4) predicting horizontally transferred genes with the constructed network.
2. The method according to claim 1, characterized in that the statistical method comprises the WF method, based on counting base-word frequencies, and the FCU method, based on counting absolute codon usage frequencies.
3. The method according to claim 1, characterized in that, in the step of transforming a gene into a feature vector, the gene sequence consists of A, T, G and C, and the resulting feature vector has $4^r$ dimensions, where r is the word length.
4. The method according to claim 1, characterized in that analyzing the distribution of the genes' points in the high-dimensional space means analyzing the distribution relationship between the points by computing the Euclidean distances between them and determining the order of the sample points.
5. The method according to claim 1, characterized in that the sample subspace is covered with geometric bodies.
6. The method according to claim 5, characterized in that the geometric bodies covering the sample subspace are formed as the topological product of a simplex of some dimension and a hypersphere.
7. The method according to claim 6, characterized in that the geometric body is a hyper-sausage neuron.
8. The method according to claim 1, characterized in that predicting horizontally transferred genes with the constructed network means: when a gene in the test sample is covered by the network, the gene is not a horizontally transferred gene; when a gene in the test sample is not covered by the network, the gene is a horizontally transferred gene.
CN200810101786A 2008-03-12 2008-03-12 Method for forecasting gene transferring horizontally in genome Pending CN101533484A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810101786A CN101533484A (en) 2008-03-12 2008-03-12 Method for forecasting gene transferring horizontally in genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810101786A CN101533484A (en) 2008-03-12 2008-03-12 Method for forecasting gene transferring horizontally in genome

Publications (1)

Publication Number Publication Date
CN101533484A true CN101533484A (en) 2009-09-16

Family

ID=41104066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810101786A Pending CN101533484A (en) 2008-03-12 2008-03-12 Method for forecasting gene transferring horizontally in genome

Country Status (1)

Country Link
CN (1) CN101533484A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294934A (en) * 2012-02-24 2013-09-11 塔塔咨询服务有限公司 Prediction of horizontally transferred gene
EP2653991A3 (en) * 2012-02-24 2015-03-04 Tata Consultancy Services Limited Prediction of horizontally transferred gene
CN103294934B (en) * 2012-02-24 2018-02-23 塔塔咨询服务有限公司 The prediction of gene transferring horizontally
CN109243529A (en) * 2018-08-28 2019-01-18 福建师范大学 Gene transferring horizontally recognition methods based on local sensitivity Hash
CN109243529B (en) * 2018-08-28 2021-09-07 福建师范大学 Horizontal transfer gene identification method based on locality sensitive hashing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090916