CN104699997B

CN104699997B - Automatic genomic metabolic network model modifying method

Info

Publication number: CN104699997B
Application number: CN201510131784.6A
Authority: CN
Inventors: 张梁; 吴晓红; 薛卫; 李由然; 李赢; 丁重阳; 石贵阳
Original assignee: Jiangnan University
Current assignee: Guangzhou Kang Lun Biotechnology Co., Ltd.
Priority date: 2015-03-24
Filing date: 2015-03-24
Publication date: 2017-05-10
Anticipated expiration: 2035-03-24
Also published as: CN104699997A

Abstract

The invention discloses an automated method for refining a genome-scale metabolic network model. The method combines the Hypertext Transfer Protocol with the Java HttpClient component and an image processing algorithm to submit requests to websites and parse their script semantics automatically. Gaps in the genome-scale metabolic network model are filled automatically on the basis of online databases, including KEGG (Kyoto Encyclopedia of Genes and Genomes), MetaCyc and MetRxn, and several websites that predict protein subcellular localization. Species-specific reactions of high reliability are determined from the protein localization predictions by a weighted scoring mechanism, which completes the automated refinement of a draft metabolic network model. The method saves time and is convenient, and the refined model is more comprehensive and more accurate.

Description

An automated method for refining a genome-scale metabolic network model

Technical field

The invention belongs to the field of bioinformatics, and in particular relates to a method for automatically refining a genome-scale metabolic network model by mining biological data with a computer and applying an image processing algorithm and a weighted scoring mechanism.

Background art

With the emergence of high-throughput genome sequencing data and the generation of large amounts of biological data, the construction of genome-scale metabolic network models has become a research focus. Building a metabolic network costs a great deal of labor and time, so a large number of automated construction tools have appeared. These tools generally concentrate on building and simulating the draft metabolic network model; only a few address the refinement of the model. Tools that currently offer an automated refinement process for metabolic network models include Model SEED, Pathway Tools, RAVEN and SuBliMinaL.

The construction of a metabolic network model comprises four processes: building the draft model, refining the model, converting it into a mathematical model, and validating the model's predictions. A high-quality metabolic network model must produce simulation results that agree with the organism's actual growth phenotype; otherwise the refinement must be repeated until simulation and phenotype agree. Refinement is the most time-consuming and labor-intensive part of building a metabolic network model, and the few existing refinement tools cannot truly automate the refinement of fungal metabolic network models. Refinement must include filling metabolic gaps, determining reaction directions and assigning reactions to compartments. Model SEED and Pathway Tools automate refinement only for prokaryotic metabolic network models and cannot assign reaction compartments. RAVEN and SuBliMinaL can assign reaction compartments automatically on the basis of the Wolf PSORT protein localization prediction database, but Wolf PSORT is an online protein database whose features are based on amino acid composition. Studies show that protein localization prediction is more accurate when it combines three kinds of features: amino acid composition, dipeptide composition and physicochemical properties.

Summary of the invention

To solve the above problems, the invention discloses an automated method for refining a genome-scale metabolic network model that saves more time, is more convenient, and yields a more comprehensive and accurate refined model.

The technical solution of the invention is as follows:

An automated method for refining a genome-scale metabolic network model, comprising the following steps:

(1) filling in species-specific reactions according to the list of gap metabolites in the genome-scale metabolic network model;

(2) determining the direction of each reaction in the model according to the metabolite names in the species-specific reactions;

(3) determining the optimal reaction compartment in the model.

In a further technical solution, step (1) includes:

(1A) using MATLAB, converting the draft genome-scale metabolic network model into a computer-readable format and searching for metabolite gaps;

(1B) submitting the genome protein sequences of the species to KASS, the automatic annotation server of the KEGG website; the KASS automatic annotation returns the list of pathways in which the protein sequences occur;

(1C) determining the reaction pathway of a gap metabolite in the draft model, and finding that reaction pathway in the pathway list obtained in step (1B);

(1D) obtaining the URL of the map of the gene metabolic network according to the reaction pathway of the gap metabolite found in step (1C), sending an HTTP request to the URL, and recording the web page image returned by the server as map T, which contains a metabolic pathway box;

(1E) clicking the metabolic pathway box of map T from step (1D) to enter a page (denoted page) that contains all reactions; page lists the EC numbers of the protein sequences, each EC number corresponds to one specific reaction in map T, and the URL of each EC number points to the specific reaction equation;

(1F) obtaining the KO number and the specific reaction equation Reaction corresponding to each EC number in page, creating a new text file KO-EC-Reaction.txt, and writing each EC number with its corresponding KO number and reaction equation Reaction to the file KO-EC-Reaction.txt;

(1G) reading the content of the file KO-EC-Reaction.txt from step (1F) line by line, looping over it to extract the reactions that contain the gap metabolite, creating a new file EC-KO-Break.txt, and storing the EC number, KO number and reaction equation Reaction of each reaction that contains the gap metabolite in the file EC-KO-Break.txt;

(1H) determining whether the reactions containing the gap metabolite extracted in step (1G) are specific reactions of the genome;

(1I) creating a new file new-rec.txt, saving the specific reactions in it, traversing each reaction in new-rec.txt and checking whether the reaction exists in the draft model; if it does not exist, it is added.

In a further technical solution, step (1H) specifically includes the following steps:

(1H1) using web crawler technology, submitting and parsing the page page of step (1E), and extracting all coordinates corresponding to the KO numbers in the web page;

(1H2) after locating the rectangular box in which a KO number sits, choosing a pixel inside the box and reading the RGB value of its color;

(1H3) if the value is 0 or 255, there is no color mark and the reaction is judged not to be species-specific; if the value lies between 0 and 255, there is a color mark and the reaction is judged to be species-specific.
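The color test of step (1H3) can be sketched in Python as follows. The text does not say whether "the value" is read per channel or for the pixel as a whole, so this sketch assumes a per-channel reading: a pixel counts as marked when at least one channel lies strictly between 0 and 255. The function name has_color_mark and the sample light-green pixel are illustrative assumptions, not part of the method as filed.

```python
def has_color_mark(rgb):
    """Return True if the pixel is treated as color-marked.

    Assumption: a pixel is unmarked when every channel is exactly 0 or 255
    (for example pure black or pure white), and marked when any channel
    lies strictly between those two extremes.
    """
    return any(0 < channel < 255 for channel in rgb)


white = (255, 255, 255)        # plain background, no mark
black = (0, 0, 0)              # box outline or text, no mark
light_green = (191, 255, 191)  # illustrative highlight color (assumed value)

is_specific = has_color_mark(light_green)
```

Under this reading a fully saturated color such as (0, 255, 0) would be classified as unmarked, so the rule is only suitable when the highlight color is known to contain an intermediate channel value.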

In a further technical solution, step (2) specifically includes the following steps:

(2A) using web crawler technology, searching the three websites KEGG, MetaCyc and MetRxn by metabolite name for the direction of each reaction, and extracting and saving the direction information of each reaction from the three websites;

(2B) if the reaction is irreversible on both the MetaCyc and MetRxn websites, it is judged to be an irreversible reaction; otherwise it is judged to be a reversible reaction.
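The decision rule of step (2B) reduces to a logical AND over the two databases. A minimal Python sketch follows; the function name decide_direction and the boolean inputs are assumptions for illustration, and retrieving the direction information from the websites is outside its scope.

```python
def decide_direction(metacyc_irreversible, metrxn_irreversible):
    """Apply the rule of step (2B).

    The reaction is irreversible only when both MetaCyc and MetRxn
    report it as irreversible; in every other case it is reversible.
    """
    if metacyc_irreversible and metrxn_irreversible:
        return "irreversible"
    return "reversible"


# Both sources agree on irreversibility.
both = decide_direction(True, True)
# The sources disagree, so the reaction is kept reversible.
mixed = decide_direction(True, False)
```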

In a further technical solution, step (3) is implemented as follows: the weight of each protein localization website is calculated on a data set for the taxonomic group to which the species belongs, and the localization results returned by the websites are then combined by weighted calculation to determine the optimal reaction compartment.

In a further technical solution, step (3) specifically includes the following steps:

(3A) for each reaction, obtaining the corresponding KO number, looking up its GenBank number gb in the KASS annotation results, and finding the corresponding protein sequence in the protein sequence library of the species;

(3B) submitting the protein sequence through the form of each corresponding website and obtaining the returned localization information;

(3C) creating a new data set for the taxonomic group of the species and calculating the weight of each website on the new data set;

(3D) computing the compartment of each reaction by weighted calculation, determining the optimal compartment and writing it into the reaction equation.

In a further technical solution, step (3C) specifically includes the following steps:

(3C1) creating a new protein data set according to the taxonomic group of the species: 12 reaction compartments are chosen for each website and 100 protein sequences are chosen for each compartment, giving a data set of 1200 protein sequences; the similarity between any two protein sequences in the data set is less than 25%;

(3C2) counting the number of protein sequences correctly predicted for each compartment by each prediction website;

(3C3) calculating the average number of correct identifications of each prediction website: let X = {X1, X2, ..., X12} be the numbers of correct predictions of a website for the 12 compartments; the average number of correct identifications of that website is then D = (X1 + X2 + ... + X12)/12;

(3C4) weight of 6 prediction websites, is calculated.

In a further technical solution, the weighted calculation in step (3D) uses the formula

$$V_i = \sum_{n=1}^{N} w_n f_n^{i}, \quad i = 1, 2, \ldots, c,$$

where $V_i$ is the compartment decision for the i-th protein sequence and $w_n$ is the weight of the n-th protein localization prediction website, subject to

$$\sum_{n=1}^{N} w_n = 1;$$

$f_n^{i}$ is the prediction for the i-th protein sequence on the n-th prediction website, N is the number of protein localization prediction websites chosen, and c is the number of protein sequences to be predicted. When a decision is made for an input protein sequence, the votes received by each candidate compartment are sorted, and the sequence is assigned to the class of the compartment with the most votes.

The beneficial effects of the invention are:

The invention proposes a weighted scoring mechanism based on the predictions of multiple protein localization websites, computes the optimal predicted protein compartment, and thereby determines species-specific reactions of high reliability. Through three steps, namely filling metabolic gaps, determining reaction directions and assigning reaction compartments, it completes the refinement of the model. Refining a genome-scale metabolic network model with this method saves more time and is more convenient, and the refined model is more comprehensive and accurate.

Description of the drawings

Fig. 1 is the flow chart of step 1 of the present invention.

Detailed description of the embodiments

The invention is further described below with reference to an embodiment.

The refinement of the draft genome-scale model of Spathaspora passalidarum NRRL Y-27907 is taken as an example to illustrate how gap reactions in the draft model are filled automatically and how the direction and compartment of each reaction are determined. The specific steps are as follows:

1. According to the list of gap metabolites returned by MATLAB, the species-specific reactions of Spathaspora passalidarum NRRL Y-27907 are filled in. Fig. 1 is the flow chart of step 1. It presents the automatic gap-filling procedure as a program flow and shows how each compound and each reaction is examined in turn in a loop.

(1A) The draft model of the genome-scale metabolic network is imported into MATLAB equipped with the COBRA toolbox and the glpk linear programming solver, and converted into a computer-readable form by the xls2model program; that is, the Excel table of the draft model is read as a stoichiometric matrix S. The S matrix (828 × 984) indicates that the model consists of 828 metabolites and 984 reactions. Metabolite gaps are then found with the gapFind program: there are 44 upstream metabolite gaps and 128 downstream metabolite gaps.
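The notion of a metabolite gap can be illustrated with a purely structural check on the stoichiometric matrix. The sketch below is written in Python rather than MATLAB and is not the COBRA gapFind routine used in this embodiment; the function name find_dead_ends, the list-of-lists matrix layout and the three-metabolite toy network are assumptions made for illustration only.

```python
def find_dead_ends(S, reversible):
    """Find metabolites that can never be produced or never be consumed.

    S          : list of rows, one row per metabolite, one column per reaction;
                 negative entries are substrates, positive entries are products.
    reversible : list of booleans, one per reaction.

    Returns two lists of row indices: (never_produced, never_consumed).
    This is a simplified structural check, not the optimization-based
    gap search performed by gapFind.
    """
    never_produced, never_consumed = [], []
    for i, row in enumerate(S):
        produced = any(
            c > 0 or (c < 0 and reversible[j]) for j, c in enumerate(row)
        )
        consumed = any(
            c < 0 or (c > 0 and reversible[j]) for j, c in enumerate(row)
        )
        if not produced:
            never_produced.append(i)
        if not consumed:
            never_consumed.append(i)
    return never_produced, never_consumed


# Toy network: R1: A -> B, R2: B -> C, both irreversible.
S = [
    [-1, 0],  # A is only consumed
    [1, -1],  # B is produced by R1 and consumed by R2
    [0, 1],   # C is only produced
]
never_produced, never_consumed = find_dead_ends(S, [False, False])
```

In the toy network A has no producing reaction and C has no consuming reaction, which is the kind of dead end that gap filling tries to resolve by adding reactions.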

(1B) The genome protein sequences of Spathaspora passalidarum NRRL Y-27907 are submitted to KASS, the automatic annotation server of the KEGG website; the KASS automatic annotation returns the pathway list for the protein sequences.

(1C) For a specific gap metabolite, the reaction pathway of the metabolite is determined in the Excel table of the draft model, and the metabolic pathway is then found in the pathway list returned by the KASS annotation. Take GLUGSAL[m], from the list of gap metabolites found by MATLAB, as an example. Its reaction pathway (Subsystem) in the Excel table of the draft model is Arginine and Proline Metabolism, and the metabolic pathway Arginine and Proline Metabolism is found in the pathway list returned by the KASS annotation.

In Fig. 1, process a is the data input of the gap compound, process b obtains the reaction pathway (Subsystem) information of the metabolite corresponding to the gap compound, and process c judges whether the Subsystem information exists. The loop from process a to process c finds the gap compounds for which a reaction pathway exists, and corresponds to step (1C).

(1D) According to the Arginine and Proline Metabolism pathway found in step (1C), the URL of the matching gene metabolic network map is obtained. An HTTP request is sent to the URL, and the web page image returned by the server is recorded as picture T. Picture T is the structural diagram of the whole metabolic network, and the enzyme numbers marked green in picture T indicate the specific reactions containing the gap.

Process d in Fig. 1 enters the corresponding map once the reaction pathway Subsystem has been found, and corresponds to step (1D).

(1E) The metabolic pathway box in the upper left corner of picture T from step (1D) is clicked to enter the page containing all reactions. Each EC number on that page corresponds to one specific reaction in the map, and its URL points to the specific reaction equation.

(1F) Starting from the first EC number, the KO number corresponding to the EC number is obtained from the page; the URL of the EC number is then followed, the information inside the <table></table> tags is parsed, and the specific reaction equation Reaction is obtained. A new file KO-EC-Reaction.txt is created, and each EC number with its corresponding KO number and reaction equation Reaction is written to the file KO-EC-Reaction.txt.

This step is process e in Fig. 1; the file KO-EC-Reaction.txt holds all reactions in the map.

(1G) Using the full name of the gap metabolite GLUGSAL, L-Glutamate 5-semialdehyde, the content of the file KO-EC-Reaction.txt from step (1F) is read line by line and traversed in a loop to obtain all reactions that contain L-Glutamate 5-semialdehyde. Three reactions containing the gap metabolite are found, as shown in Table 1.

Table 1: Reactions containing the gap metabolite
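The line-by-line search of step (1G) can be sketched in Python as below. The layout of KO-EC-Reaction.txt is not specified in the text, so the sketch assumes one reaction per line with three tab-separated fields (KO number, EC number, reaction equation). The function name, the EC placeholders and the second sample line with KO number K99999 are hypothetical and serve only to exercise the filter.

```python
def reactions_containing(lines, metabolite):
    """Return (ko, ec, equation) tuples whose equation names the metabolite.

    Assumed line layout: KO number, EC number and reaction equation,
    separated by tab characters. Matching is a case-insensitive
    substring test on the equation.
    """
    needle = metabolite.lower()
    hits = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not follow the assumed layout
        ko, ec, equation = parts
        if needle in equation.lower():
            hits.append((ko, ec, equation))
    return hits


# In-memory stand-in for the file content; EC fields are placeholders.
sample_lines = [
    "K00294\tEC-placeholder-1\t"
    "L-Glutamate 5-semialdehyde + NAD+ + H2O = L-Glutamate + NADH + H+\n",
    "K99999\tEC-placeholder-2\tA + B = C\n",
]
hits = reactions_containing(sample_lines, "L-Glutamate 5-semialdehyde")
```

A plain substring test would also match a shorter name such as L-Glutamate inside L-Glutamate 5-semialdehyde, so a production version would need to split the equation into individual compound names before comparing.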

(1H) It is determined whether the 3 reactions extracted in step (1G) are specific reactions of Spathaspora passalidarum NRRL Y-27907. The box in which the enzyme number of a specific reaction sits carries a color mark. Web crawler technology is therefore used to submit and parse the web page of step (1E) and to extract all coordinates corresponding to each KO number in the page, that is, the position of the rectangular box containing the enzyme number that corresponds to the KO number. After the box is located, a pixel inside it is chosen and the RGB value of its color is read. If the value is 0 or 255, there is no color mark and the reaction is judged not to be species-specific; if the value lies between 0 and 255, there is a color mark and the reaction is judged to be species-specific.

The KO numbers of the 3 reactions in Table 1 are K00819, K00147 and K00294, in that order. The source code of the web page from step (1E) is opened. The coordinates of K00819 are coords=490,400,536,417 and the RGB value read lies between 0 and 255, so it is a specific reaction of the species. The coordinates of K00147 are coords=402,473,448,490 and the RGB value read lies between 0 and 255, so it is a specific reaction of the species. The coordinates of K00294 are coords=823,605,869,622 and the RGB value read lies between 0 and 255, so it is a specific reaction of the species.
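Extracting the box coordinates from the page source and choosing a pixel inside the box might look like the following Python sketch. The embodiment only gives the coords values; the surrounding area-tag markup, the attribute order assumed by the regular expression, the choice of the box center as the sampled pixel and all function names are assumptions.

```python
import re

# Assumed markup: an image-map <area> tag whose coords attribute precedes a
# title attribute that starts with the KO number.
AREA_RE = re.compile(r'<area[^>]*coords="([\d,]+)"[^>]*title="(K\d{5})[^"]*"')


def box_center(coords):
    """Return the integer center (x, y) of a box given as 'x1,y1,x2,y2'."""
    x1, y1, x2, y2 = (int(v) for v in coords.split(","))
    return (x1 + x2) // 2, (y1 + y2) // 2


def ko_centers(html):
    """Map each KO number found in the markup to the center of its box."""
    return {ko: box_center(coords) for coords, ko in AREA_RE.findall(html)}


# Hypothetical snippet built from the coordinates quoted in the embodiment.
snippet = (
    '<area shape="rect" coords="490,400,536,417" title="K00819">'
    '<area shape="rect" coords="823,605,869,622" title="K00294">'
)
centers = ko_centers(snippet)
```

The pixel at each returned center would then be read from the downloaded map image and passed to the color test.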

(1I) The specific reactions corresponding to the gap compound are saved in the file new_rec.txt. Each reaction in new_rec.txt is traversed and checked against the model: if it already exists, nothing is done; if it does not exist, it is added. The 3 reactions from step (1H) are traversed and checked against the model, and the gap reaction L-glutamate 5-semialdehyde + NAD⁺ + H2O = L-glutamate + NADH + H⁺ is filled in.
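The check-then-add logic of step (1I) amounts to a set-membership test. In the Python sketch below the model is represented simply as a list of reaction-equation strings and reactions are compared by exact string equality; both choices, the function name and the sample reaction strings are simplifying assumptions.

```python
def add_missing_reactions(model_reactions, candidates):
    """Append every candidate reaction that the model does not yet contain.

    model_reactions is modified in place. The list of reactions that were
    actually added is returned. Reactions are compared as plain strings,
    which assumes that equations are written in one canonical form.
    """
    existing = set(model_reactions)
    added = []
    for reaction in candidates:
        if reaction not in existing:
            model_reactions.append(reaction)
            existing.add(reaction)
            added.append(reaction)
    return added


# Toy model with one reaction already present (strings are illustrative).
model = ["A + B = C"]
new_rec = [
    "A + B = C",
    "L-glutamate 5-semialdehyde + NAD+ + H2O = L-glutamate + NADH + H+",
]
added = add_missing_reactions(model, new_rec)
```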

Steps (1G), (1H) and (1I) are the loop of processes g to o in Fig. 1. Specifically, process g first judges whether a reaction contains the gap. If it does not, process n judges whether this reaction is the last one; if it is, the whole flow ends, and if it is not, process o reads the next reaction. Processes g, n and o form a small loop.

If process g in Fig. 1 judges that the reaction contains the gap, process h looks up the coordinates of the reaction in the annotated map and process i reads them. Process j judges whether the coordinates mark a specific reaction. If so, the reaction is recorded in process p. If not, process l judges whether these are the last coordinates; if they are, the flow enters process n, that is, the small loop of processes g, n and o. If they are not the last coordinates, process m reads the next coordinates and judges whether they mark a specific reaction, and this loop repeats until all specific reactions have been found. The flow then enters process q, which modifies the model. Process r judges whether the model already contains the reaction. If it does, the flow returns to process n, that is, the small loop of processes g, n and o, and checks the next reaction. If it does not, process s adds the reaction to the model.

Step 1 combines the Hypertext Transfer Protocol with the Java HttpClient component to submit requests and parse website script semantics automatically, obtains all reactions in the KEGG database that contain the gap compound, and compares them with the draft model; an image processing algorithm is then used to judge which reactions are specific reactions of the gap compound, and these are filled into the draft network model.

2. The reaction is defined as reversible on the KEGG website. Using the metabolite name in the reaction, L-Glutamate 5-semialdehyde, the two websites MetaCyc and MetRxn are queried with web crawler technology. The result returned is reversible, so the reaction is defined as a reversible reaction.

In step 2, whether a reaction is reversible is judged on the basis of the KEGG, MetaCyc and MetRxn online databases.

3. According to the attributes of Spathaspora passalidarum NRRL Y-27907, 6 protein localization prediction websites for fungi are chosen.

(3A) The compartment of the filled gap reaction L-glutamate 5-semialdehyde + NAD⁺ + H2O = L-glutamate + NADH + H⁺ is determined. The KO number of the reaction is K00294; its GenBank number gb is looked up in the KASS annotation results, and the corresponding protein sequence is found in the protein sequence library of the species.

(3B) The protein sequence is submitted through the form of each corresponding website, and the returned localization information is obtained. The encoded protein sequence is uploaded to the 6 fungal protein subcellular localization databases in Table 2. The predictions returned are: cello, Epiloc and SLPFA predict the compartment cytos, Psort II predicts er, Bacello predicts mito, and Euloc predicts golgi.

Table 2: The 6 fungal protein subcellular localization databases

(3C) A new data set is created for the taxonomic group of the species, and the weight of each website is calculated on the new data set. This specifically includes the following steps:

(3C4) weight of 6 prediction websites, is calculated.

(3D) The compartment of each reaction is computed by weighted calculation; the formula of the weighted calculation is:

$$V_i = \sum_{n=1}^{N} w_n f_n^{i}, \quad i = 1, 2, \ldots, c$$

In this embodiment, the RH2427 and PK7579 data sets are merged into a new data set, and the weights of the 6 websites are calculated on the new data set.

From the predicted compartments in step (3B), the calculation with the website weights in Table 3 and the formula gives:

Y = 0.208cytos + 0.146er + 0.170cytos + 0.145mito + 0.132cytos + 0.199golgi = 0.51cytos + 0.146er + 0.145mito + 0.199golgi

Table 3: Weights of the databases

Finally, the optimal compartment is determined and written into the reaction equation. Sorting the votes received by each candidate compartment gives 0.51 > 0.199 > 0.146 > 0.145, so the probability of the protein sequence being in each compartment ranks cytos > golgi > er > mito, and the optimal predicted protein compartment is cytos.
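The weighted vote of this embodiment can be reproduced with a few lines of Python. The weight and compartment pairs below are taken from the expression for Y in the order in which they appear there; which of the three cytos weights belongs to which website cannot be recovered from the text, so the pairs are left unnamed. The function name weighted_vote is an assumption.

```python
from collections import defaultdict


def weighted_vote(votes):
    """Sum the weights per compartment and rank compartments by total.

    votes is a list of (weight, compartment) pairs, one per website.
    Returns a list of (compartment, total_weight), highest total first.
    """
    totals = defaultdict(float)
    for weight, compartment in votes:
        totals[compartment] += weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Pairs in the order of the expression for Y in this embodiment.
votes = [
    (0.208, "cytos"),
    (0.146, "er"),
    (0.170, "cytos"),
    (0.145, "mito"),
    (0.132, "cytos"),
    (0.199, "golgi"),
]
ranking = weighted_vote(votes)
best_compartment = ranking[0][0]
```

The three cytos weights add up to 0.51, which exceeds every other total, so the reaction is assigned to cytos, in agreement with the ranking stated above.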

Step 3 uses multiple protein localization prediction websites to fill gaps in the genome-scale metabolic network model automatically and, through the weighted scoring of the protein localization predictions, determines species-specific reactions of high reliability.

Through the 3 automated refinement steps above, the filled gap reaction is obtained as L-glutamate 5-semialdehyde[c] + NAD⁺[c] + H2O[c] = L-glutamate[c] + NADH[c] + H⁺[c].

Every gap metabolite can be refined automatically through the 3 steps of this method. The gap-filled model is converted into a computer-readable form (SBML) for simulation and analysis. The Excel table of the model is read as a stoichiometric matrix S by the xls2model program. The model grows from a draft model with 828 metabolites and 984 reactions to a refined model with 873 metabolites and 1243 reactions. With the COBRA toolbox on the MATLAB platform and the glpk linear programming solver enabled, the command [allGaps, rootGaps, downstreamGaps] = gapFind(model, false, false) is entered; the gap-filled model contains no gaps.

The invention combines the Hypertext Transfer Protocol with the Java HttpClient component and an image processing algorithm to submit requests and parse website script semantics automatically. On the basis of the KEGG, MetaCyc and MetRxn online databases and multiple protein localization prediction websites, it fills gaps in the genome-scale metabolic network model automatically, and through the weighted scoring of protein localization predictions it determines species-specific reactions of high reliability, completing the automated refinement of the draft metabolic network model.

The above is only a preferred embodiment of the invention, and the invention is not limited to it. Other improvements and variations that a person skilled in the art derives or conceives directly without departing from the spirit and concept of the invention are considered to fall within the scope of protection of the invention.

Claims

1. An automated method for refining a genome-scale metabolic network model, characterized in that it comprises the following steps:

(3) determining the optimal reaction compartment in the model;

step (1) includes:

(1A) using MATLAB, converting the draft genome-scale metabolic network model into a computer-readable format and searching for metabolite gaps;

(1B) submitting the genome protein sequences of the species to KASS, the automatic annotation server of the KEGG website; the KASS automatic annotation returns the list of pathways in which the protein sequences occur;

(1C) determining the reaction pathway of a gap metabolite in the draft model, and finding that reaction pathway in the pathway list obtained in step (1B);

(1D) obtaining the URL of the map of the gene metabolic network according to the reaction pathway of the gap metabolite found in step (1C), sending an HTTP request to the URL, and recording the web page image returned by the server as map T, which contains a metabolic pathway box;

(1E) clicking the metabolic pathway box of map T from step (1D) to enter a page (denoted page) that contains all reactions; page lists the EC numbers of the protein sequences, each EC number corresponds to one specific reaction in map T, and the URL of each EC number points to the specific reaction equation;

(1F) obtaining the KO number and the specific reaction equation Reaction corresponding to each EC number in page, creating a new file KO-EC-Reaction.txt, and writing each EC number with its corresponding KO number and reaction equation Reaction to the file KO-EC-Reaction.txt;

(1G) reading the content of the file KO-EC-Reaction.txt from step (1F) line by line, looping over it to extract the reactions that contain the gap metabolite, creating a new file EC-KO-Break.txt, and storing the EC number, KO number and reaction equation Reaction of each reaction that contains the gap metabolite in the file EC-KO-Break.txt;

(1H) determining whether the reactions containing the gap metabolite extracted in step (1G) are specific reactions of the genome;

2. The automated method for refining a genome-scale metabolic network model according to claim 1, characterized in that step (1H) specifically includes the following steps:

(1H1) using web crawler technology, submitting and parsing the page page of step (1E), and extracting all coordinates corresponding to the KO numbers in the web page;

(1H3) if the value is 0 or 255, there is no color mark and the reaction is judged not to be species-specific; if the value lies between 0 and 255, there is a color mark and the reaction is judged to be species-specific.

3. The automated method for refining a genome-scale metabolic network model according to claim 1, characterized in that step (2) specifically includes the following steps:

(2A) using web crawler technology, searching the three websites KEGG, MetaCyc and MetRxn by metabolite name for the direction of each reaction, and extracting and saving the direction information of each reaction from the three websites;

(2B) if the reaction is irreversible on both the MetaCyc and MetRxn websites, it is judged to be an irreversible reaction; otherwise it is judged to be a reversible reaction.

4. The automated method for refining a genome-scale metabolic network model according to claim 1, characterized in that step (3) is implemented as follows: the weight of each protein localization website is calculated on a data set for the taxonomic group to which the species belongs, and the localization results returned by the websites are then combined by weighted calculation to determine the optimal reaction compartment.

5. The automated method for refining a genome-scale metabolic network model according to claim 4, characterized in that step (3) specifically includes the following steps:

(3A) for each reaction, obtaining the corresponding KO number, looking up its GenBank number gb in the KASS annotation results, and finding the corresponding protein sequence in the protein sequence library of the species;

(3B) submitting the protein sequence through the form of each corresponding website and obtaining the returned localization information;

(3C) creating a new data set for the taxonomic group of the species and calculating the weight of each website on the new data set;

6. The automated method for refining a genome-scale metabolic network model according to claim 5, characterized in that step (3C) specifically includes the following steps:

(3C1) creating a new protein data set according to the taxonomic group of the species: 12 reaction compartments are chosen for each website and 100 protein sequences are chosen for each compartment, giving a data set of 1200 protein sequences; the similarity between any two protein sequences in the data set is less than 25%;

(3C3) calculating the average number of correct identifications of each prediction website: let X = {X1, X2, ..., X12} be the numbers of correct predictions of a website for the 12 compartments; the average number of correct identifications of that website is then D = (X1 + X2 + ... + X12)/12;

(3C4) weight of 6 prediction websites, is calculated.

7. The automated method for refining a genome-scale metabolic network model according to claim 5, characterized in that, in step (3D), the formula of the weighted calculation is:

$$V_i = \sum_{n=1}^{N} w_n f_n^{i}, \quad i = 1, 2, \ldots, c,$$

where $V_i$ is the compartment decision for the i-th protein sequence and $w_n$ is the weight of the n-th protein localization prediction website, subject to:

$$\sum_{n=1}^{N} w_n = 1;$$

$f_n^{i}$ is the prediction for the i-th protein sequence on the n-th prediction website, N is the number of protein localization prediction websites chosen, and c is the number of protein sequences to be predicted; when a decision is made for an input protein sequence, the votes received by each candidate compartment are sorted, and the sequence is assigned to the class of the compartment with the most votes.