Summary of the invention
To solve the above problems, the invention discloses an automated method for correcting genome-scale metabolic network models that saves time, is more convenient, and yields a more comprehensive and accurate corrected model.
The technical scheme is as follows:
An automated method for correcting a genome-scale metabolic network model, comprising the following steps:
(1) according to the list of gap metabolites in the genome-scale metabolic network model, filling in the species-specific reactions;
(2) determining the direction of each reaction in the model according to the metabolite names in the species-specific reactions;
(3) determining the optimal reaction compartment in the model.
In a further technical scheme, step (1) comprises:
(1A) using MATLAB, converting the draft genome-scale metabolic network model into a computer-readable format and searching for gap metabolites;
(1B) submitting the genome protein sequences of the species to KASS, the automatic annotation server of the KEGG website; the KASS annotation returns the list of pathways in which the protein sequences occur;
(1C) determining the reaction pathway of a gap metabolite in the draft model, and finding that pathway in the pathway list obtained in step (1B);
(1D) obtaining the URL of the metabolic network map according to the reaction pathway of the gap metabolite found in step (1C), sending an HTTP request to the URL, and recording the web page image returned by the server as map T, which contains a metabolic pathway box;
(1E) clicking the metabolic pathway box of map T from step (1D) to enter a page listing all reactions; the page contains the EC numbers of the protein sequences, each EC number corresponds to one specific reaction in map T, and the URL of each EC number points to the specific reaction equation;
(1F) obtaining, from the page, the KO number and the specific reaction equation (Reaction) corresponding to each EC number, creating a new file KO-EC-Reaction.txt, and writing each EC number with its KO number and reaction equation into KO-EC-Reaction.txt;
(1G) reading the content of KO-EC-Reaction.txt from step (1F) line by line, looping over it to extract the reactions that contain the gap metabolite, creating a new file EC-KO-Break.txt, and storing the EC number, KO number and reaction equation of each such reaction in EC-KO-Break.txt;
(1H) determining whether each reaction containing the gap metabolite extracted in step (1G) is a reaction specific to the genome;
(1I) creating a new file new-rec.txt, saving the specific reactions in it, traversing each reaction in new-rec.txt, and checking whether the reaction exists in the draft model; if it does not exist, adding it.
In a further technical scheme, step (1H) specifically comprises the following steps:
(1H1) using web crawler technology, requesting and parsing the page from step (1E), and extracting all coordinates corresponding to the KO number in the page;
(1H2) after locating the square box in which the KO number lies, selecting a pixel inside the box and reading the RGB value of its color;
(1H3) if the value is 0 or 255, there is no color mark and the reaction is judged not to be species-specific; if the value lies between 0 and 255, there is a color mark and the reaction is judged to be species-specific.
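The color test in step (1H3) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name has_color_mark is hypothetical, and applying the rule channel by channel (a pixel counts as marked when any of its R, G, B channels lies strictly between 0 and 255) is an assumption about how the single "value" in the text maps onto three channels. The light green sample pixel is likewise only an illustrative value.

```python
from typing import Tuple


def has_color_mark(rgb: Tuple[int, int, int]) -> bool:
    """Return True if any channel lies strictly between 0 and 255.

    Assumption: pure white (255, 255, 255) and pure black (0, 0, 0)
    count as "no color mark"; any intermediate channel value counts
    as a color mark, i.e. a species-specific reaction.
    """
    return any(0 < channel < 255 for channel in rgb)


# Hypothetical sample pixels, for illustration only.
white_box = (255, 255, 255)        # unmarked box background
light_green_box = (191, 255, 191)  # an example of a tinted box

print(has_color_mark(white_box))        # not species-specific
print(has_color_mark(light_green_box))  # species-specific
```

Under this reading a fully saturated color such as (0, 255, 0) would be treated as unmarked, so the rule only suits maps whose highlight colors are pastel tints.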
In a further technical scheme, step (2) specifically comprises the following steps:
(2A) using web crawler technology, searching the three websites KEGG, MetaCyc and MetRxn by metabolite name for the direction of each reaction, and extracting and saving the direction information of each reaction from the three websites;
(2B) if a reaction is irreversible in both MetaCyc and MetRxn, judging it to be irreversible; otherwise judging it to be reversible.
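The decision rule of step (2B) reduces to a small function. The sketch below assumes the crawled direction information has already been normalized to the strings "reversible" and "irreversible" and stored per database; the function name judge_direction and that dictionary layout are hypothetical.

```python
def judge_direction(directions: dict) -> str:
    """Apply the rule of step (2B).

    directions maps a database name to "reversible" or "irreversible".
    The reaction is irreversible only if both MetaCyc and MetRxn say so;
    a missing entry is treated as "not irreversible" (an assumption).
    """
    both_irreversible = all(
        directions.get(db) == "irreversible" for db in ("MetaCyc", "MetRxn")
    )
    return "irreversible" if both_irreversible else "reversible"


# Illustrative input: KEGG is stored but does not affect the decision.
example = {"KEGG": "reversible", "MetaCyc": "reversible", "MetRxn": "reversible"}
print(judge_direction(example))
```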
In a further technical scheme, step (3) is implemented as follows: the weight of each protein compartment prediction website on the data set for the taxonomic group of the species is calculated, and the compartment results returned by the websites are then combined by weighted calculation to determine the optimal reaction compartment.
In a further technical scheme, step (3) specifically comprises the following steps:
(3A) obtaining the KO number corresponding to each reaction, looking up its GenBank number gb in the KASS annotation results, and finding the corresponding protein sequence in the protein sequence library of the species;
(3B) submitting the protein sequence through the form of each website and obtaining the localization information returned;
(3C) creating a new data set for the taxonomic group of the species, and calculating the weight of each website on the new data set;
(3D) calculating the compartment of each reaction by weighting, determining the optimal compartment, and inserting it into the reaction equation.
In a further technical scheme, step (3C) specifically comprises the following steps:
(3C1) creating a protein data set according to the taxonomic group of the species: 12 reaction compartments are selected for each website and 100 reactions are selected for each compartment, giving a data set of 1200 protein sequences; the similarity between any two protein sequences in the data set is below 25%;
(3C2) counting, for each prediction website, the number of protein sequences correctly predicted in each compartment;
(3C3) calculating the average number of correct identifications for each prediction website: if X = {X1, X2, ..., X12} are the numbers of correct predictions in the 12 compartments for a website, the average number of correct identifications for that website is D = (X1 + X2 + ... + X12)/12;
(3C4) calculating the weights of the 6 prediction websites.
In a further technical scheme, in step (3D), the formula for the weighted calculation is:
where V_i denotes the compartment decision for the i-th protein sequence, and W_n is the weight of the n-th protein compartment prediction website, subject to:
The prediction term denotes the prediction result of the i-th protein sequence on the n-th compartment prediction website, N denotes the number of protein compartment prediction websites selected, and c denotes the number of protein sequences to be predicted. When a decision is made on an input protein sequence, the votes received by each compartment class are sorted, and the sequence is assigned to the class of the compartment with the most votes.
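The weighted decision described in step (3D) can be written in one plausible form as follows. This is a reconstruction under stated assumptions, not the formula of the original filing: the symbol P_{i,n} for the prediction of the i-th sequence on the n-th website, the compartment index k and the indicator function are introduced here for illustration, and the normalization of the weights is inferred from the worked example in the embodiment, where the six weights sum to 1.

```latex
V_i = \arg\max_{k} \sum_{n=1}^{N} W_n \, \mathbb{1}\left[ P_{i,n} = k \right],
\qquad \sum_{n=1}^{N} W_n = 1,
\qquad i = 1, \dots, c
```

Read this way, each website casts a vote of size W_n for the compartment it predicts, and the sequence is assigned to the compartment with the largest total.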
The beneficial effects of the invention are as follows:
The invention proposes a weighted scoring mechanism based on the results of multiple protein compartment predictions, calculates the best predicted protein compartment, and determines specific reactions with high confidence. The model is refined in three steps: filling the metabolic gaps, determining reaction directions, and determining reaction compartments. Correcting a genome-scale metabolic network model with this method saves time and is more convenient, and the corrected model is more comprehensive and accurate.
Detailed description of the embodiments
The invention is further described below with reference to an embodiment.
The correction of the draft genome-scale model of Spathaspora passalidarum NRRL Y-27907 is taken as an example to illustrate how gap reactions in the draft model are filled automatically and how the reaction direction and reaction compartment are determined. The specific steps are as follows:
1. According to the gap metabolite list returned by MATLAB, fill in the reactions specific to Spathaspora passalidarum NRRL Y-27907. Fig. 1 is the flow chart of step 1. It presents the automatic gap-filling procedure as a program flow and shows how each compound and each reaction is examined in turn within the loop.
(1A) The draft genome-scale metabolic network model is imported into MATLAB equipped with the COBRA toolbox and the glpk linear programming solver. The xls2model program converts the model into a computer-readable format, that is, the Excel table of the draft model is read as a stoichiometric S matrix. The S matrix (828 × 984) indicates that the model consists of 828 metabolites and 984 reactions. The gapFind program is then used to search for gap metabolites: there are 44 upstream gap metabolites and 128 downstream gap metabolites.
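To make the idea of a stoichiometric S matrix and a gap metabolite concrete, the following Python sketch builds a toy matrix and lists metabolites that no reaction produces or no reaction consumes. It is a deliberately simplified illustration and not the gapFind algorithm of the COBRA toolbox: all reactions are treated as irreversible, exchange reactions are ignored, and the function names and the two-reaction toy network are hypothetical.

```python
def build_s_matrix(reactions):
    """Build a stoichiometric matrix (rows: metabolites, columns: reactions).

    reactions maps a reaction id to {metabolite: coefficient}, with
    negative coefficients for substrates and positive ones for products.
    """
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    rxn_ids = list(reactions)
    matrix = [[reactions[r].get(m, 0) for r in rxn_ids] for m in metabolites]
    return metabolites, rxn_ids, matrix


def find_gaps(metabolites, matrix):
    """Return (never_produced, never_consumed) metabolite lists."""
    never_produced, never_consumed = [], []
    for name, row in zip(metabolites, matrix):
        if not any(c > 0 for c in row):
            never_produced.append(name)
        if not any(c < 0 for c in row):
            never_consumed.append(name)
    return never_produced, never_consumed


# Toy network (hypothetical): R1: A -> B, R2: B -> C
toy = {"R1": {"A": -1, "B": 1}, "R2": {"B": -1, "C": 1}}
mets, rxns, S = build_s_matrix(toy)
never_produced, never_consumed = find_gaps(mets, S)
print(S)
print(never_produced, never_consumed)
```

In the toy network A is never produced and C is never consumed, which is the kind of dead end that the gap search in step (1A) reports for the real 828 × 984 matrix.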
(1B) The genome protein sequences of Spathaspora passalidarum NRRL Y-27907 are submitted to KASS, the automatic annotation server of the KEGG website, and the KASS annotation returns the pathway list for the protein sequences.
(1C) For a specific gap metabolite, its reaction pathway is determined in the Excel table of the draft model, and that metabolic pathway is located in the pathway list returned by the KASS annotation. Take GLUGSAL[m] from the gap metabolite list found by MATLAB as an example. The reaction pathway (Subsystem) of this metabolite in the Excel table of the draft model is Arginine and Proline Metabolism, and the pathway Arginine and Proline Metabolism is found in the pathway list returned by the KASS annotation.
In Fig. 1, process a is the input of breakpoint compound data, process b obtains the reaction pathway (Subsystem) information of the metabolite corresponding to the breakpoint compound, and process c judges whether that Subsystem information exists. The loop from process a to process c finds the breakpoint compounds for which a reaction pathway exists, and corresponds to step (1C).
(1D) According to the Arginine and Proline Metabolism pathway found in step (1C), the URL of the matching metabolic network map is obtained. An HTTP request is sent to the URL, and the web page image returned by the server is recorded as picture T. Picture T is the structure diagram of the whole metabolic network, and the enzyme numbers shown in green in picture T represent the specific reactions that contain the breakpoint.
Process d in Fig. 1 enters the corresponding map after the reaction pathway (Subsystem) has been found, and corresponds to step (1D).
(1E) Clicking the metabolic pathway box in the upper left corner of picture T from step (1D) opens a page listing all reactions. Each EC number on the page corresponds to one specific reaction in the map, and its URL points to the specific reaction equation.
(1F) Starting from the first EC number, the KO number corresponding to the EC number is obtained from the page. The URL corresponding to the EC number is then opened, and the information inside the <table></table> tags is parsed to obtain the specific reaction equation (Reaction). A new file KO-EC-Reaction.txt is created, and each EC number, its KO number and its reaction equation are written to KO-EC-Reaction.txt.
This step is process e in Fig. 1; the file KO-EC-Reaction.txt stores all reactions in the map.
(1G) Using the full name of the gap metabolite GLUGSAL, L-Glutamate 5-semialdehyde, the content of KO-EC-Reaction.txt from step (1F) is read line by line and traversed in a loop to obtain all reactions containing L-Glutamate 5-semialdehyde. Three reactions containing the gap metabolite are found, as shown in Table 1.
Table 1. Reactions containing the gap metabolite
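The filtering in step (1G) amounts to a line-by-line scan for the metabolite name. The sketch below assumes each record of KO-EC-Reaction.txt is a tab-separated line of KO number, EC number and reaction equation; that layout is not specified in the text and is chosen here only for illustration, as are the function name and the placeholder EC values in the sample records.

```python
def reactions_with_metabolite(lines, metabolite):
    """Return (ko, ec, reaction) tuples whose equation mentions the metabolite.

    ASSUMPTION: each line is "KO<TAB>EC<TAB>Reaction". Matching is a
    case-insensitive substring test on the reaction equation.
    """
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        ko, ec, reaction = fields
        if metabolite.lower() in reaction.lower():
            hits.append((ko, ec, reaction))
    return hits


# Sample records; "EC_x" and "EC_y" are placeholders, not real EC numbers,
# and the second record is an invented reaction used only as a non-match.
sample_lines = [
    "K00294\tEC_x\tL-glutamate 5-semialdehyde + NAD+ + H2O = L-glutamate + NADH + H+\n",
    "K99999\tEC_y\tA + B = C\n",
]
hits = reactions_with_metabolite(sample_lines, "L-Glutamate 5-semialdehyde")
print(hits)
```

A plain substring test also matches longer names that contain the query (for example an N-acetylated derivative), so a production version would compare whole metabolite tokens instead.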
(1H) Determine whether the 3 reactions extracted in step (1G) are specific to Spathaspora passalidarum NRRL Y-27907. The box containing the enzyme number of a specific reaction carries a color mark. Web crawler technology can therefore be used to request and parse the web page from step (1E) and extract all coordinates corresponding to the KO number in the page, that is, the position of the square box containing the enzyme number for that KO number. After the box is located, a pixel inside it is selected and the RGB value of its color is read. If the value is 0 or 255, there is no color mark and the reaction is judged not to be species-specific; if the value lies between 0 and 255, there is a color mark and the reaction is judged to be species-specific.
The KO numbers of the 3 reactions in Table 1 are K00819, K00147 and K00294, in that order. In the source code of the web page from step (1E), the coordinates of K00819 are coords=490,400,536,417; the RGB value read lies between 0 and 255, so the reaction is specific to the species. The coordinates of K00147 are coords=402,473,448,490; the RGB value read lies between 0 and 255, so the reaction is specific to the species. The coordinates of K00294 are coords=823,605,869,622; the RGB value read lies between 0 and 255, so the reaction is specific to the species.
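Turning a coords string such as the ones above into a pixel to sample can be sketched as follows. The sketch assumes the four numbers are the left, top, right and bottom edges of a rectangle, as in an HTML image map area of shape rect. The text only says that a pixel inside the box is chosen, so sampling a point a couple of pixels inside the top-left corner (to stay clear of the enzyme label drawn in the middle of the box) is an assumption, and the function names and the inset parameter are hypothetical.

```python
def parse_coords(attr):
    """Parse "coords=x1,y1,x2,y2" (or just "x1,y1,x2,y2") into four ints."""
    values = attr.split("=", 1)[1] if "=" in attr else attr
    x1, y1, x2, y2 = (int(v) for v in values.split(","))
    return x1, y1, x2, y2


def inner_pixel(box, inset=2):
    """Pick a pixel just inside the top-left corner of the box.

    ASSUMPTION: a small inset avoids both the box border and the text
    label; the source does not say which interior pixel is used.
    """
    x1, y1, x2, y2 = box
    return x1 + inset, y1 + inset


box = parse_coords("coords=490,400,536,417")  # K00819 in the text
print(box)
print(inner_pixel(box))
```

The resulting point would then be looked up in the downloaded map image and its RGB value passed to the color rule of step (1H3).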
(1I) The specific reactions corresponding to the breakpoint compound are saved in the file new_rec.txt. Each reaction in new_rec.txt is traversed and checked against the model: if it already exists, nothing is done; if it does not exist, it is added. Traversing the 3 reactions from step (1H) and checking them against the model, the gap reaction L-glutamate 5-semialdehyde + NAD+ + H2O = L-glutamate + NADH + H+ is filled in.
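The check-then-add logic of step (1I) can be sketched as below. How the method decides that two reactions are the same is not described, so comparing equation strings after removing whitespace and case is an assumption made for this illustration; a real model would more likely compare reaction identifiers or stoichiometry. The function names and sample equations are hypothetical.

```python
def normalize(reaction):
    """Canonical form used for comparison: no whitespace, lower case."""
    return "".join(reaction.split()).lower()


def add_missing_reactions(model_reactions, candidates):
    """Append each candidate not already in the model; return those added."""
    existing = {normalize(r) for r in model_reactions}
    added = []
    for reaction in candidates:
        key = normalize(reaction)
        if key not in existing:
            model_reactions.append(reaction)
            existing.add(key)
            added.append(reaction)
    return added


# Hypothetical model content and candidate list.
model = ["A + B = C"]
added = add_missing_reactions(model, ["a+b=c", "C + D = E"])
print(added)
print(model)
```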
Steps (1G), (1H) and (1I) form the loop of processes g to o in Fig. 1. Specifically, process g first judges whether a reaction contains the breakpoint. If it does not, process n judges whether this reaction is the last one; if so, the whole flow ends; if not, process o reads the next reaction. Processes g, n and o form an inner loop.
If process g in Fig. 1 judges that the reaction contains the breakpoint, process h looks up the coordinates of the reaction in the annotated map and process i reads a coordinate. Process j judges whether the coordinate marks a specific reaction. If so, the reaction is recorded in process p. If not, process l judges whether this is the last coordinate: if it is, the flow goes to process n, that is, back into the inner loop of processes g, n and o; if it is not, process m reads the next coordinate and the judgment is repeated. This loop repeats until all specific reactions have been found, after which process q carries out the model modification. Process r judges whether the model already contains the reaction. If it does, the flow returns to process n, that is, into the inner loop of processes g, n and o, to check the next reaction. If it does not, process s adds the reaction to the model.
In step 1, the HTTP protocol is combined with the Java component HttpClient to request websites and parse their script semantics automatically, obtaining all reactions in the KEGG database that contain the breakpoint compound and comparing them with the draft model. An image processing algorithm is then used to identify the reactions specific to the breakpoint compound, which are filled into the draft network model.
2. The reaction is defined as reversible on the KEGG website. Using the metabolite name L-Glutamate 5-semialdehyde from the reaction, web crawler technology is used to query the two websites MetaCyc and MetRxn. The result returned is reversible, so the reaction is defined as a reversible reaction.
In step 2, the reversibility of the reaction is judged on the basis of the online databases KEGG, MetaCyc and MetRxn.
3. According to the attributes of Spathaspora passalidarum NRRL Y-27907, 6 protein compartment prediction websites for fungi are selected.
(3A) Determine the compartment in which the filled gap reaction L-glutamate 5-semialdehyde + NAD+ + H2O = L-glutamate + NADH + H+ takes place. The KO number corresponding to the reaction is K00294; its GenBank number gb is looked up in the KASS annotation results, and the corresponding protein sequence is found in the protein sequence library of the species.
(3B) The protein sequence is submitted through the form of each website and the localization information returned is obtained. The encoded protein sequence is uploaded to the 6 fungal protein subcellular localization databases in Table 2. The predictions returned are: cello, Epiloc and SLPFA predict cytos; Psort II predicts er; Bacello predicts mito; and Euloc predicts golgi.
Table 2. Six fungal protein subcellular localization databases
(3C) A new data set for the taxonomic group of the species is created, and the weight of each website on the new data set is calculated. This specifically comprises the following steps:
(3C1) creating a protein data set according to the taxonomic group of the species: 12 reaction compartments are selected for each website and 100 reactions are selected for each compartment, giving a data set of 1200 protein sequences; the similarity between any two protein sequences in the data set is below 25%;
(3C2) counting, for each prediction website, the number of protein sequences correctly predicted in each compartment;
(3C3) calculating the average number of correct identifications for each prediction website: if X = {X1, X2, ..., X12} are the numbers of correct predictions in the 12 compartments for a website, the average number of correct identifications for that website is D = (X1 + X2 + ... + X12)/12;
(3C4) calculating the weights of the 6 prediction websites.
(3D) The compartment of each reaction is calculated by weighting. The formula for the weighted calculation is:
where V_i denotes the compartment decision for the i-th protein sequence, and W_n is the weight of the n-th protein compartment prediction website, subject to:
The prediction term denotes the prediction result of the i-th protein sequence on the n-th compartment prediction website, N denotes the number of protein compartment prediction websites selected, and c denotes the number of protein sequences to be predicted. When a decision is made on an input protein sequence, the votes received by each compartment class are sorted, and the sequence is assigned to the class of the compartment with the most votes.
In this embodiment, the RH2427 and PK7579 data sets are integrated into a new data set, and the weights of the 6 websites on the new data set are calculated.
Using the predicted compartments from step (3B) and the weight of each website in Table 3, the formula gives:
Y = 0.208 cytos + 0.146 er + 0.170 cytos + 0.145 mito + 0.132 cytos + 0.199 golgi = 0.51 cytos + 0.146 er + 0.145 mito + 0.199 golgi
Table 3. Weights of the databases
Finally, the optimal compartment is determined and inserted into the reaction equation. Sorting the predictions by the votes received by each compartment class gives 0.51 > 0.199 > 0.146 > 0.145, so the probabilities of the protein sequence being in each compartment rank as cytos > golgi > er > mito, and the best predicted protein compartment is cytos.
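The worked example above can be reproduced with a few lines of Python. The weight and compartment pairs are taken in the order in which they appear in the expression for Y; which of the three cytos weights belongs to which website is not recoverable from the text, so the websites are left unnamed. The function name weighted_vote is hypothetical, and the sketch illustrates weighted voting in general rather than the exact procedure of the invention.

```python
from collections import defaultdict


def weighted_vote(predictions):
    """Sum the weights per predicted compartment and rank the totals.

    predictions is a list of (weight, compartment) pairs, one per website.
    Returns the winning compartment and the full ranking.
    """
    scores = defaultdict(float)
    for weight, compartment in predictions:
        scores[compartment] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0], ranked


# (weight, predicted compartment) in the order used in the expression for Y.
predictions = [
    (0.208, "cytos"),
    (0.146, "er"),
    (0.170, "cytos"),
    (0.145, "mito"),
    (0.132, "cytos"),
    (0.199, "golgi"),
]
best, ranked = weighted_vote(predictions)
print(best)
for compartment, score in ranked:
    print(compartment, round(score, 3))
```

Summing the three cytos weights gives 0.51, so cytos ranks first, followed by golgi, er and mito, matching the ordering stated in the text.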
In step 3, multiple protein compartment localization prediction websites are used for automated breakpoint filling of the genome-scale metabolic network model, and the weighted scoring mechanism over the protein compartment predictions determines specific reactions with high confidence.
Through the 3 automated correction steps above, the filled gap reaction is obtained as L-glutamate 5-semialdehyde[c] + NAD+[c] + H2O[c] = L-glutamate[c] + NADH[c] + H+[c].
Every gap metabolite can be corrected automatically through the 3 steps of this method. After the breakpoints are filled, the model is converted into a computer-readable format (SBML) for simulation analysis. The xls2model program reads the Excel table of the model as a stoichiometric S matrix. The model is extended from the draft model with 828 metabolites and 984 reactions to a refined model with 873 metabolites and 1243 reactions. Using the COBRA toolbox on the MATLAB platform with the glpk linear programming solver, the command [allGaps, rootGaps, downstreamGaps] = gapFind(model, false, false) is entered; no breakpoints remain in the filled model.
The invention combines the HTTP protocol with the Java component HttpClient and uses an image processing algorithm to request websites and parse their script semantics automatically. Based on the online databases KEGG, MetaCyc and MetRxn and multiple protein compartment localization prediction websites, it performs automated breakpoint filling of the genome-scale metabolic network model, and the weighted scoring mechanism over the protein compartment predictions determines specific reactions with high confidence, completing the automated correction of the draft metabolic network model.
The above is only a preferred embodiment of the invention, and the invention is not limited to the above embodiment. It should be understood that other improvements and variations that those skilled in the art derive directly or conceive without departing from the spirit and concept of the invention shall be considered to fall within the scope of protection of the invention.