AU774973B2

AU774973B2 - Biological data processing

Info

Publication number: AU774973B2
Application number: AU54179/00A
Authority: AU
Inventors: Anthony Kosky; Thodoros Topaloglou
Original assignee: Ore Pharmaceuticals Inc
Current assignee: Ore Pharmaceuticals Inc
Priority date: 1999-06-29
Filing date: 2000-06-28
Publication date: 2004-07-15
Anticipated expiration: 2020-06-28
Also published as: WO2001001294A8; WO2001001294A3; CA2377823A1; WO2001001294A2; AU5417900A; EP1228447A2

Description

WO 01/01294 PCT/IB00/00863

BIOLOGICAL DATA PROCESSING

FIELD OF THE INVENTION

The present invention relates to automated database searching and in particular to automated access to biological databases.

BACKGROUND OF THE INVENTION

One of the tasks performed in biological research is comparison of newly discovered biological data with data stored in databases. Over two hundred public biological databases are available around the world, many on the Internet. In general, databases include a plurality of records which have the form of an object class. The object class is formed of a plurality of fields, often in a hierarchy in which an object class includes one or more sub-object classes which in turn may include sub-sub-object classes. The records may represent, for example, gene sequences and may have fields which include various data about the sequences, such as their length, origin and a view of the sequence. Information is extracted from databases by querying a management system associated with the database. A simple query includes a request to display one or more fields of records which fulfill certain criteria.

The existing databases have different organization methodologies, different fields in each record and different query schemes. In order to access these databases with ease, an Object Protocol Model (OPM) suite of tools was developed. An OPM processor mediates between a user and databases associated with the OPM suite. A common organization methodology is used to represent the data in all the databases accessed via the OPM processor.

Queries addressed to databases via the OPM processor are provided, by a user to the OPM processor, in a structured form expressed in accordance with the common organization methodology. The OPM processor translates the queries from the structured OPM form to query forms compatible with the management systems of the specific databases to which the queries are addressed. The results from the specific databases are returned to the OPM processor which translates the results back to the organization methodology of the OPM suite.

Not only does the OPM suite allow a user to access a plurality of different databases in different forms, it also allows the user to access a plurality of databases using a single query.

For example, a complex query may request to display the records from a first database which have a gene length greater than that of corresponding records of a second database which represent the same organism.

The use of a common organization methodology across databases allows using special tools for more easily generating queries and/or performing more complex queries. For example, a graphic user interface (GUI) of the OPM suite allows the user to prepare a query in a structured manner.

Some of the forms of biological data are complex data structures, such as gene sequences, which require special procedures for manipulation, for example, for performing comparisons. Homology search engines, such as BLAST, are used to compare gene sequences.

When a user wants to compare, for example, all the gene sequences classified in a certain month to one or more groups of gene sequences, the user retrieves all the desired classified gene sequences using OPM. Then, the user passes the retrieved data to a homology sequence server which performs the sequence comparison.

SUMMARY OF THE INVENTION

One aspect of some embodiments of the invention provides a method for accessing data manipulation servers using a structured query format used to query databases. Optionally, the accessing of manipulation servers is integrated with the accessing of database information, for example by manipulating the results of the data access and/or by using the results of the data manipulation as data to be accessed or for restricting queries.

One aspect of some embodiments of the present invention relates to a multi-database query system which receives queries which relate to both database and data manipulation servers, such as homology search engines. The queries relate to the data manipulation servers as if they are database servers, allowing use of any tool of the multi-database query system developed for database queries, on queries which access data manipulation servers. Such tools include, for example, database linking tools, graphic query preparation tools and query optimization tools. By relating to databases and data manipulation servers from a single query, the data manipulation server may process results from the database as they are provided, before the database runs through all its records. Alternatively or additionally, the results of a data manipulation step may be further queried. Thus, the response time required for a complex query may be substantially reduced. Alternatively or additionally, the amount of traffic on a network may be reduced and/or better spread out in time. Also, complex operations may require less user intervention.

In some embodiments of the present invention, the input to and/or output from the data manipulation servers are modeled by structured objects. The modeled input objects may result from processing other sections of the query. The modeled output objects may be further processed by other sections of the query or even further manipulated by other (or the same) manipulation servers.

In some embodiments of the invention, each data manipulation server associated with the query system has a translation server which mediates between the data manipulation server and the query system. The translation server receives commands from the server in a structured query form used by the query system and translates the commands to a form in which the data manipulation server receives commands. The translation server optionally also receives results from the data manipulation server and presents the results to the query system in objects organized according to structured object classes used by the query system.
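By way of illustration only, the mediating role of such a translation server may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of any embodiment: the names `Directive`, `Hit`, `TranslationServer` and `fake_engine`, the dictionary command form and the "accession:length" raw output format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Directive:
    """One structured-query condition, e.g. bc.command = "blastn"."""
    field_name: str
    value: str


@dataclass(frozen=True)
class Hit:
    """Structured result object handed back to the query system."""
    accession: str
    length: int


class TranslationServer:
    """Mediates between the query system and one data manipulation server."""

    def __init__(self, engine: Callable[[Dict[str, str]], List[str]]):
        self._engine = engine  # hypothetical engine taking a command dict

    def translate(self, directives: List[Directive]) -> Dict[str, str]:
        # Structured directives become the engine's native command form.
        return {d.field_name: d.value for d in directives}

    def run(self, directives: List[Directive]) -> List[Hit]:
        raw = self._engine(self.translate(directives))
        # Raw "accession:length" lines become structured objects.
        hits = []
        for line in raw:
            accession, length = line.split(":")
            hits.append(Hit(accession, int(length)))
        return hits


def fake_engine(command: Dict[str, str]) -> List[str]:
    """Stand-in for a homology engine; returns canned raw output."""
    assert command["command"] == "blastn"
    return ["X1:120", "X2:450"]
```

In this sketch the query system only ever sees `Directive` and `Hit` objects, so the engine's native formats stay hidden behind the translation server.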

There is thus provided in accordance with an embodiment of the invention, a multi-database query system which queries a plurality of databases and servers, including an input which receives queries in a structured form, and a translation server which translates at least a part of a received query into commands recognized by a data manipulation server.

Optionally, the system comprises a processor which parses the received query into parts according to the databases and servers to which they relate. Alternatively or additionally, the structured form comprises a form used to query databases. Alternatively or additionally, the input receives a query which relates to at least one database and at least one data manipulation server. Alternatively or additionally, the translation server models results from the data manipulation server into database objects. Alternatively or additionally, the data manipulation server comprises a server which receives input from at least two different sources.

Optionally, the data manipulation server comprises a homology comparison engine.

There is also provided in accordance with an embodiment of the invention, a method of accessing a data manipulation server from a multi-database query system, including providing the query system with a query which includes a first directive assigning a value to at least one field of an input object associated with the data manipulation server and a second directive which determines a value of at least one field of an output object associated with the data manipulation server, and invoking the data manipulation server responsive to the second directive. Optionally, providing the query comprises preparing the query using a graphical interface designed for querying structured databases. Alternatively or additionally, the data manipulation server comprises a homology engine.
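By way of illustration only, the two kinds of directive in this method may be sketched in Python as follows: assigning an input field does not run anything, while reading an output field invokes the server. The names `CallObject` and `reverse_engine` are hypothetical, and the toy engine merely reverses its input; this is a sketch under those assumptions rather than the claimed method itself.

```python
from typing import Callable, Dict, List, Optional


class CallObject:
    """Hypothetical pseudo-database object backed by a processing engine."""

    def __init__(self, engine: Callable[[Dict[str, str]], List[str]]):
        self._engine = engine
        self._inputs: Dict[str, str] = {}
        self._output: Optional[List[str]] = None
        self.invocations = 0

    def assign(self, field_name: str, value: str) -> None:
        # First kind of directive: set an input field; nothing runs yet.
        self._inputs[field_name] = value
        self._output = None  # inputs changed, so any cached output is stale

    @property
    def output(self) -> List[str]:
        # Second kind of directive: reading an output field triggers the engine.
        if self._output is None:
            self.invocations += 1
            self._output = self._engine(dict(self._inputs))
        return self._output


def reverse_engine(inputs: Dict[str, str]) -> List[str]:
    """Toy engine: returns the query sequence reversed."""
    return [inputs["querySeq"][::-1]]
```

Caching the output means that several conditions on the same output object cause only one invocation of the engine.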

There is also provided in accordance with an embodiment of the invention, a method of performing a database search using a multi-database query system, including providing the query system with a query which includes at least one directive related to a database and at least one directive related to a data manipulation server, wherein the directives are stated in an identical structural format, translating the directives into commands recognized by the database and the data manipulation server, and submitting the commands respectively to the data manipulation server and to the database.

Alternatively or additionally, translating the directives comprises identifying, by a query processor, the directives directed to the database and the directives directed to the data manipulation server. Optionally, translating the directives comprises passing the directives to translation servers associated with the database or data manipulation server to which the directives are directed. Alternatively or additionally, the method comprises determining an order for the directives to be processed in and submitting the translated directives to the data manipulation server and to the database according to the determined order.

In some embodiments, the method comprises receiving results from said submission and translating the results into structured objects. Optionally, translating the results into structured objects comprises translating the results to structured objects related to the directives.

Alternatively or additionally, providing a query comprises providing a query in an Object Protocol Model (OPM)-like language.

BRIEF DESCRIPTION OF FIGURES

Particular embodiments of the invention will be described with reference to the following description of embodiments in conjunction with the figures, wherein identical structures, elements or parts which appear in more than one figure are preferably labeled with a same or similar number in all the figures in which they appear, in which:

Fig. 1 is a schematic illustration of a multi-database query system, in accordance with an embodiment of the invention; and

Fig. 2 is a flowchart of the actions performed by the multi-database query system of Fig. 1, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS

Fig. 1 is a schematic illustration of a multi-database query system 20, in accordance with an embodiment of the invention. System 20 mediates between an end-user 22 and a plurality of service providers which include databases 24 and one or more data manipulation servers, such as a homology search engine 26. Error detection processes are another example of data manipulation servers. Engine 26 is a data manipulation server in that it provides processing services and is not primarily used for storing and providing information. In some embodiments of the invention, engine 26 does not store information and a user requesting processing services must provide the information to be processed or must provide a link to a database or file containing the information. Data manipulation servers may receive a single input of data, e.g., error detection processes which receive a single sequence, or a plurality of inputs, e.g., homology engines which compare sequences from two different sources. One of the objects of some embodiments of the invention is to allow end-user 22 to relate to homology engine 26 and/or to other data manipulation servers as if they were databases 24.

Databases 24 may be organized differently from each other and are not generally controllable by a supervisor of system 20. End user 22 provides system 20 with queries in a query-language of system 20, for example a structured query language, such as OPM. In some embodiments of the invention, a single query may be directed to more than one service provider. For example, a single query may be directed to a plurality of databases 24 and to homology engine 26.

In some embodiments of the invention, system 20 comprises a graphical user interface 28 which receives queries in a graphical form and translates them into the system's query language. Alternatively or additionally, system 20 comprises a command-line interface 30 which receives commands from end-user 22 directly in the system's query language or possibly using natural language. Further alternatively or additionally, system 20 comprises a remote-unit interface 32 which receives queries from remote computer units.

System 20 further comprises a multi-database query processor 34 which receives queries from interfaces 28, 30 and/or 32 and processes them, as described hereinbelow. In some embodiments of the invention, query processor 34 and interfaces 28, 30 and/or 32 are implemented in software on a single computer 36 accessible to end-user 22. Alternatively, a distributed configuration is used.

In some embodiments of the invention, system 20 further comprises, for each database 24, an OPM translation server 38 that mediates between processor 34 and the respective service provider. In some embodiments of the invention, translation servers 38 translate queries from the query language of system 20 into query languages supported by the respective database 24. Optionally, translation servers 38 translate query results received from the databases 24 into the structural object classes of system 20.

In a similar manner, system 20 comprises an OPM translation server 42 which mediates between processor 34 and homology engine 26. In some embodiments of the invention, translation server 42 translates query portions from the query language of system 20 into commands supported by homology engine 26. That is, the OPM language allows, in accordance with embodiments of the invention, phrasing queries that access homology engine 26 as a database. Translation server 42 translates query directives, such as limitations, into commands to be performed by homology engine 26. In addition, translation server 42 optionally translates the output from homology engine 26 into structural objects, in accordance with the query language used by system 20. An exemplary structural definition of objects used to access a homology engine from the OPM suite is described in Table 1.

Table 1

SCHEMA blast_srv
DESCRIPTION: "The OPM schema for a queryable blast server"

CONTROLLED VALUE CLASS BlastEngine_Cv
    {"wu_blast", "ncbi_blast 2.0"}
    DEFAULT: "wu_blast"

CONTROLLED VALUE CLASS BlastProgram_Cv
    {"blastn", "blastx", "blastp", "tblastn", "tblastx"}
    DEFAULT: "blastn"

CONTROLLED VALUE CLASS StrandCv
    {[illegible], "bottom", "both"}
    DEFAULT: "both"

CONTROLLED VALUE CLASS SortBy_Cv
    {"pvalue", "count", "highscore", "totalscore"}
    DEFAULT: "pvalue"

CONTROLLED VALUE CLASS GenCode_Cv
    {("Standard or Universal", 1),
     ("Vertebrate Mitochondrial", 2),
     ("Yeast Mitochondrial", 3),
     ("Mold, Protozan, [illegible]", [illegible]),
     ("Invertebrate Mitochondrial", [illegible]),
     ("Ciliate Macronuclear", 6),
     ("Encinodermate Mitochondrial", 9),
     ("Alternative Ciliate Macronuclear", [illegible]),
     ("Eubactrial", 11),
     ("Alternative Yeast", 12),
     ("Ascidian Mitochondrial", 13),
     ("Flatworm Mitochondrial", 14)}
    DEFAULT: "Standard or Universal"
    CODETYPE: SMALLINT

CONTROLLED VALUE CLASS Filter_Cv
    {("none", 0), ([illegible], 1), ([illegible], 2), ("seg+xnu", 3), ("dust", 4)}
    DEFAULT: "none"
    CODETYPE: SMALLINT

CONTROLLED VALUE CLASS Matrix_Cv
    {("blosum62", 0), ([illegible], 1), ("blosum40", 2), ([illegible], 3),
     ([illegible], 4), ([illegible], 6), ("blosum75", 7), ([illegible], 8),
     ([illegible], 9), ("blosum100", 11), ("GONNET", 12),
     ("pam10", 13), ("pam20", 14), ("pam30", [illegible]), ([illegible], 16),
     ([illegible], 17), ([illegible], 18), ([illegible], 19),
     ("pam80", [illegible]), ([illegible], 21), ("pam100", 22),
     ("pam110", 23), ("pam120", 24), ("pam130", [illegible]), ("pam140", 26),
     ("pam150", 27), ("pam160", 28), ("pam170", 29), ("pam180", [illegible]),
     ("pam190", 31), ("pam200", 32), ("pam210", 33), ("pam220", 34),
     ("pam230", [illegible]), ("pam240", 36), ("pam250", 37), ("pam260", 38),
     ("pam270", 39), ("pam280", [illegible]), ("pam290", 41), ("pam300", 42),
     ("pam310", 43), ("pam320", 44), ("pam330", [illegible]), ("pam340", 46),
     ([illegible], 47), ("pam360", 48), ("pam370", 49), ("pam380", [illegible]),
     ("pam390", 51), ("pam400", 52), ("pam410", 53), ("pam420", 54),
     ("pam430", [illegible]), ("pam440", 56), ([illegible], 57)}
    DEFAULT: "blosum62"
    CODETYPE: SMALLINT

CONTROLLED VALUE CLASS DB_Cv
    {"testdb", "localdb", "dbest"}
    DEFAULT: "testdb"

OBJECT CLASS Blast_Call
    DESCRIPTION: "A blast call object represents a particular homology search using a blast engine"
    ID: callId
    ATTRIBUTE callId: INTEGER REQUIRED
    ATTRIBUTE engine: BlastEngine_Cv REQUIRED
    ATTRIBUTE program: BlastProgram_Cv REQUIRED
    ATTRIBUTE query: VARCHAR(2000) REQUIRED
    ATTRIBUTE datasource: DB_Cv REQUIRED
    ATTRIBUTE output: set-of Blast_Output REQUIRED
    ATTRIBUTE matrix: Matrix_Cv OPTIONAL
    ATTRIBUTE strand: StrandCv OPTIONAL
    ATTRIBUTE sortby: SortBy_Cv OPTIONAL
    ATTRIBUTE dbgcode: GenCode_Cv OPTIONAL
    ATTRIBUTE filter: Filter_Cv OPTIONAL
    ATTRIBUTE threshold: REAL OPTIONAL
    ATTRIBUTE alignments: INTEGER OPTIONAL
    ATTRIBUTE scores: INTEGER OPTIONAL
    ATTRIBUTE param_E: REAL OPTIONAL
    ATTRIBUTE param_S: REAL OPTIONAL
    ATTRIBUTE param_E2: REAL OPTIONAL
    ATTRIBUTE param_S2: REAL OPTIONAL
    ATTRIBUTE param_W: INTEGER OPTIONAL
    ATTRIBUTE param_T: INTEGER OPTIONAL
    ATTRIBUTE param_X: INTEGER OPTIONAL
    ATTRIBUTE param_N: INTEGER OPTIONAL
    ATTRIBUTE param_M: INTEGER OPTIONAL
    ATTRIBUTE param_B: INTEGER OPTIONAL
    ATTRIBUTE param_V: INTEGER OPTIONAL

OBJECT CLASS Blast_Output
    DESCRIPTION: "The output of a specific blast call"
    ID: runId
    ATTRIBUTE runId: INTEGER REQUIRED
    ATTRIBUTE program: VARCHAR(8) REQUIRED
    ATTRIBUTE version: VARCHAR(20) REQUIRED
    ATTRIBUTE revision: VARCHAR(20) REQUIRED
    ATTRIBUTE build: VARCHAR(40) REQUIRED
    ATTRIBUTE queryId: VARCHAR(20) REQUIRED
    ATTRIBUTE querySeq: VARCHAR(2000) REQUIRED
    ATTRIBUTE queryLength: INTEGER REQUIRED
    ATTRIBUTE database: DB_Cv REQUIRED
    ATTRIBUTE hits: set-of BlastHits REQUIRED
    ATTRIBUTE dbSize_Seqs: INTEGER REQUIRED
    ATTRIBUTE dbSizeLetters: INTEGER REQUIRED
    ATTRIBUTE dbFile: VARCHAR(80) REQUIRED
    ATTRIBUTE dbReleased: VARCHAR(40) REQUIRED
    ATTRIBUTE dbPosted: VARCHAR(40) REQUIRED
    ATTRIBUTE hitSatE: INTEGER REQUIRED
    ATTRIBUTE searchTime: VARCHAR(40) REQUIRED
    ATTRIBUTE totalTime: VARCHAR(40) REQUIRED
    ATTRIBUTE runDate: VARCHAR(40) REQUIRED
    ATTRIBUTE parameters: set-of OutputParameters REQUIRED

OBJECT CLASS OutputParameters
    ID: paramId
    ATTRIBUTE paramId: INTEGER REQUIRED
    ATTRIBUTE strand: VARCHAR(10) REQUIRED
    ATTRIBUTE frame: VARCHAR(10) REQUIRED
    ATTRIBUTE matrixId: VARCHAR(10) REQUIRED
    ATTRIBUTE matrixName: VARCHAR(10) REQUIRED
    ATTRIBUTE lamdbaUsed: VARCHAR(10) REQUIRED
    ATTRIBUTE KUsed: VARCHAR(10) REQUIRED
    ATTRIBUTE HUsed: VARCHAR(10) REQUIRED
    ATTRIBUTE lamdba_Computed: VARCHAR(10) REQUIRED
    ATTRIBUTE K_Computed: VARCHAR(10) REQUIRED
    ATTRIBUTE H_Computed: VARCHAR(10) REQUIRED
    ATTRIBUTE param_E1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_S1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_W1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_T1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_X1: VARCHAR(10) REQUIRED
    ATTRIBUTE param_E2: VARCHAR(10) REQUIRED
    ATTRIBUTE param_S2: VARCHAR(10) REQUIRED

OBJECT CLASS BlastHeader
    DESCRIPTION: "The header section of BLAST output"
    ID: headerId
    ATTRIBUTE headerId: INTEGER REQUIRED
    ATTRIBUTE program: VARCHAR(8) REQUIRED
    ATTRIBUTE version: VARCHAR(20) REQUIRED
    ATTRIBUTE revision: VARCHAR(20) REQUIRED
    ATTRIBUTE build: VARCHAR(40) REQUIRED
    ATTRIBUTE queryId: VARCHAR(20) REQUIRED
    ATTRIBUTE querySeq: VARCHAR(2000) REQUIRED
    ATTRIBUTE database: DB_Cv REQUIRED
    ATTRIBUTE numOfSequences: INTEGER REQUIRED
    ATTRIBUTE numOfLetters: INTEGER REQUIRED

OBJECT CLASS BlastHits
    DESCRIPTION: "Blast Hits"
    ID: accession
    ATTRIBUTE accession: VARCHAR(12) REQUIRED
    ATTRIBUTE description: VARCHAR(255) REQUIRED
    ATTRIBUTE score: INTEGER REQUIRED
    ATTRIBUTE pvalue: REAL REQUIRED
    ATTRIBUTE num: INTEGER REQUIRED
    ATTRIBUTE length: INTEGER OPTIONAL
    ATTRIBUTE hsp: set-of BlastHSP OPTIONAL

OBJECT CLASS BlastHSP
    ID: hspId
    ATTRIBUTE hspId: INTEGER REQUIRED
    ATTRIBUTE score: INTEGER REQUIRED
    ATTRIBUTE expect: REAL REQUIRED
    ATTRIBUTE pvalue: REAL REQUIRED
    ATTRIBUTE strand1: VARCHAR(1) REQUIRED
    ATTRIBUTE strand2: VARCHAR(1) REQUIRED
    ATTRIBUTE identities: REAL REQUIRED
    ATTRIBUTE positives: REAL REQUIRED
    ATTRIBUTE query (sequence, begin, end): (VARCHAR(500) REQUIRED, INTEGER REQUIRED, INTEGER REQUIRED)
    ATTRIBUTE target (sequence, begin, end): (VARCHAR(500) REQUIRED, INTEGER REQUIRED, INTEGER REQUIRED)
    ATTRIBUTE align: VARCHAR(500) REQUIRED
    ATTRIBUTE t5_begin: INTEGER REQUIRED
    ATTRIBUTE t5_end: INTEGER REQUIRED

The structural definition of Table 1 is written in a language used to define OPM objects, described for example in Chen, Kosky, Markowitz, Szeto, and Topaloglou, 1998, "Advanced Query Mechanisms for Biological Databases", in Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology (ISMB'98), the disclosure of which is incorporated herein by reference.
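By way of illustration only, a small subset of the Table 1 definitions may be mirrored in Python as follows. This is a sketch under stated assumptions: the Python class and member names are hypothetical, only a few attributes are modeled, the intermediate output object is flattened into a direct list of hits, and applying the controlled-value defaults to the call object is a choice made here for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class BlastProgram(Enum):
    """Mirrors CONTROLLED VALUE CLASS BlastProgram_Cv."""
    BLASTN = "blastn"
    BLASTX = "blastx"
    BLASTP = "blastp"
    TBLASTN = "tblastn"
    TBLASTX = "tblastx"


class DataSource(Enum):
    """Mirrors CONTROLLED VALUE CLASS DB_Cv."""
    TESTDB = "testdb"
    LOCALDB = "localdb"
    DBEST = "dbest"


@dataclass
class BlastHit:
    """Subset of OBJECT CLASS BlastHits."""
    accession: str
    score: int
    pvalue: float
    length: Optional[int] = None  # OPTIONAL in Table 1


@dataclass
class BlastCall:
    """Subset of the call object: a few inputs plus flattened output hits."""
    call_id: int
    query: str
    program: BlastProgram = BlastProgram.BLASTN   # DEFAULT: "blastn"
    datasource: DataSource = DataSource.TESTDB    # DEFAULT: "testdb"
    threshold: Optional[float] = None             # OPTIONAL in Table 1
    hits: List[BlastHit] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the VARCHAR(2000) bound on the query attribute.
        if len(self.query) > 2000:
            raise ValueError("query exceeds VARCHAR(2000)")
```

Using enumerations for the controlled value classes rejects values outside the permitted set at construction time, which is the role the controlled value classes play in the schema.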

Alternatively or additionally, a single translation server 38 may be used for more than one service provider. Alternatively or additionally, OPM processor 34 performs some or all of the translation tasks of translation servers 38 and 42. In some embodiments of the invention, OPM servers 38 and 42 are situated on the same computer as their respective service providers 24 and 26. Alternatively, OPM servers 38 and 42 are located on computers proximal to their respective service providers 24 and 26, although translation servers may be located substantially anywhere.

In some embodiments of the invention, a multi-database directory 40 is used by processor 34 to determine to which service provider 24 or 26 each portion of a query is directed. Directory 40 summarizes the contents, organization methodologies and capabilities of databases 24 and engines 26. In some embodiments, a single directory is used for a plurality of query processors 34, such that adding additional service providers to system 20 requires only preparing a respective OPM server for the additional service providers and updating directory 40, while no changes are needed in processors 34.

In some embodiments of the present invention, the various components of system 20 interact using a distributed-object technology, such as the Common Object Request Broker Architecture (CORBA), which is described, for example, in the Web Site of the "Object Management Group" (OMG) at www.omg.org and was available on June 27, 1999. The disclosure of this web site is incorporated herein by reference. In some embodiments of the invention, a plurality of different CORBA interfaces are used in system 20 for different types of interactions between the components of system 20. In one example, a first CORBA interface is used for programming and a second interface is used for object transfer and/or sharing. Optionally, remote-unit interface 32 also comprises a CORBA interface.

Alternatively or additionally, other distributed-object technologies, such as, Microsoft's Component Object Model (COM) or the UNIX environment Remote procedure call (RPC), may be used to allow remote and/or non-remote components of system 20 to interact. Further alternatively or additionally, system 20 may be implemented in its entirety by a single process and/or on a single processor.

Table 2

(1)  SELECT r.fragId, a = h.accessor
(2)  FROM r in local:Fragments
(3)       bc in blast:Blast_Call
(4)       bo in bc.output
(5)       h in bo.summary.sequence
(6)  WHERE r.finished = "today"
(7)    and bc.querySeq = r.sequence
(8)    and bc.command = "blastn"
(9)    and bc.dataSource = "dbEST"
(10)   and h.length > 300

Table 2 illustrates a sample query received by query processor 34 from any of interfaces 28, 30 and 32. The query in table 2 is written according to the OPM query language described, for example, in the ISMB'98 publication referenced hereinabove. This OPM query language allows accessing a plurality of databases 24 from a single query. The query of table 2 relates to both a database 24 and a homology engine 26, the homology engine being accessed as if it were a database.

The query in table 2 is built of three sections. A first section, labeled SELECT, states the fields which are to appear in the output generated responsive to the query. In table 2 these fields are a "fragId" field of a variable r, and an "accessor" field of a variable h (the variables r and h are defined in the second section). A second section, labeled FROM, defines the variables mentioned in the query by stating the database object classes to which they relate.

That is, the second section states which objects are candidates for fulfilling the query.

In table 2, the variable r, for example, corresponds to a "Fragments" object class in a database named "local". In the same way, a dummy variable "bc" corresponds to an object class named "Blast_Call" in a pseudo database "blast". However, unlike variable r, which represents an actual field of data in a database 24, variable "bc" does not represent any such field, and a database "blast" does not actually exist. Rather, when the "blast" database is referred to in a query, processor 34 refers to homology engine 26. In some embodiments of the invention, translation server 42 performs any required translations to the input and output of homology engine 26, such that the homology engine appears to processor 34 as a database. In an exemplary embodiment of the present invention, the entire interface with homology engine 26 is structured in a single translation object, for example, in accordance with the "Blast_Call" object class in table 2, which is defined in Table 1. The translation object includes the input to and output from homology engine 26. For example, the "Blast_Call" object class has fields which relate to the commands to engine 26, such as a "command" field which states the type of command performed by engine 26, a "querySeq" field which states an input sequence to be compared by the engine and a "dataSource" field which states a database of sequences to which the input sequence is compared. In addition, the "Blast_Call" object class has an "output" field into which the output from homology engine 26 is preferably structurally stored. In the query of table 2, a dummy variable, bo, refers to the "output" sub-object, thus simplifying the query statements.

When a query relates to an action, such as a search or a filter, to be performed in a pseudo database, processor 34 first has the respective engine 26 perform any required commands to fill up the output fields of the object representing the pseudo database, e.g., the "Blast_Call", and only then the search is performed. Alternatively or additionally, as output records become available from homology engine 26, they are sent for further processing. In some cases, the records can be processed even before all the fields are available from engine 26. One example of a query optimization as applied to data manipulation servers is that the query translator instructs the engine to prepare only those result fields which are actually required for further processing or display. Another example of optimization is allowing some of the fields to be provided at a later time than other fields. Modifying the order of generation of fields, even between records, may be useful if some fields are required for further data manipulation or for querying against a slow database and are thus time critical. For some types of data manipulation, it may even be useful to start the manipulation with only part of the fields and then repeat the manipulation with the rest of the fields. One example where it is useful to start manipulating before all the fields are available is where the manipulation can be carried out, at least to some extent, without the field or where the value of the field or the range of possible values of the field can be known. Thus, for example, a DNA homology can be failed based on both of the strands not matching, even before it is known which strand needs to be matched. Once the strand information is available, the group of accepted matches can be further limited using that information.
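By way of illustration only, the strand example may be sketched in Python as a two-phase filter: a first pass that runs before the strand is known and discards only candidates that fail on both strands, and a second pass that narrows the survivors once the strand becomes available. The record layout and the names `prefilter` and `refine` are hypothetical.

```python
from typing import Dict, List

# Hypothetical candidate records: which strands matched for each hit.
candidates: List[Dict[str, object]] = [
    {"id": "h1", "top": True, "bottom": False},
    {"id": "h2", "top": False, "bottom": False},
    {"id": "h3", "top": False, "bottom": True},
]


def prefilter(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Phase 1: strand not yet known; drop records failing on both strands."""
    return [r for r in records if r["top"] or r["bottom"]]


def refine(records: List[Dict[str, object]], strand: str) -> List[Dict[str, object]]:
    """Phase 2: strand now known; keep only matches on that strand."""
    return [r for r in records if r[strand]]


partial = prefilter(candidates)   # can run before the strand field arrives
final = refine(partial, "top")    # repeated once the strand field is known
```

The first phase is safe to run early because any record it rejects would be rejected whichever strand is eventually requested.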

Thus, system 20 can have different parts of a query evaluated in parallel, in particular, time consuming parts performed by data manipulation servers. For example, homology engine 26 may begin to operate as records from another part of a query become available, and/or the output from engine 26 may be processed as it is provided, without waiting for all the results.

This parallelism is possible because homology engine 26 is accessed from within the query.

An advantage of some embodiments of the invention is the savings in response time and in communication and CPU resources of complex queries due to this parallelism.
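By way of illustration only, the record-at-a-time hand-off between query parts may be sketched with Python generators. The sketch shows interleaved (pipelined) evaluation in a single thread rather than true parallel execution, and the stage names, the sample rows and the toy "result" are hypothetical.

```python
from typing import Iterator, List, Tuple

log: List[str] = []


def database_rows() -> Iterator[Tuple[str, str]]:
    """Stand-in for a database returning (fragId, sequence) rows one by one."""
    for frag_id, seq in [("f1", "ACGT"), ("f2", "GGCC")]:
        log.append(f"db:{frag_id}")
        yield frag_id, seq


def homology_stage(rows: Iterator[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Stand-in for the engine: handles each row as soon as it arrives."""
    for frag_id, seq in rows:
        log.append(f"engine:{frag_id}")
        yield frag_id, len(seq)  # toy result: the sequence length


results = list(homology_stage(database_rows()))
```

The log records that the engine stage handles the first row before the database stage has produced the second, which is the ordering that allows the response time of the combined query to shrink.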

In some cases, such parallel processing of data manipulation may require the data manipulation server or the data manipulation program itself to be modified to take the timing information into account. In one example, a blast server may associate the actual partial information used with a result record set, so that it can further limit the search results after the fact.

A third section of the query, labeled WHERE, states the conditions to be fulfilled by those objects selected by the query. In table 2 these conditions include that a field named "finished" of the variable r must have a value "today", a field "querySeq" of the variable bc must have a value equal to the value of the field "sequence" of variable r, etc. In this section, the conditions on database objects and on pseudo database objects are stated substantially in the same way.

Fig. 2 is a flowchart of the actions performed in processing a query by system 20, in accordance with an embodiment of the present invention. Upon receiving a query, such as the query in table 2, processor 34 divides (60) the query into parts which are performed by the various service providers 24 and 26. Processor 34 determines, for example using methods known in the art, to which service provider each line in the query is directed. In an exemplary embodiment of the present invention, the determination is performed by reference to directory 40. In the query of table 2, processor 34 determines from the second line that variable r is to be searched in the database 24 named "local". From the third line it is determined that variable bc is to be "searched" in engine 26 named "blast". Therefore, lines 2 and 6 of the query are directed to the database "local" and lines 3, 7, 8 and 9 are directed to homology engine 26. Lines 1, 4, 5 and 10 do not refer to any database and therefore they are processed by processor 34. Processor 34 then determines (62) the cross-dependence of the parts of the query, i.e., which parts require data from other parts and therefore must receive the data from the other parts before they are performed. In table 2, it is determined from line 7 that the query part directed to homology engine 26 requires output from another query part.
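By way of illustration only, the dividing (60) and cross-dependence (62) steps may be sketched in Python for the WHERE lines of table 2. The sketch assumes a directory reduced to a mapping from variable to service provider and conditions reduced to (line number, left variable, right variable) triples; the FROM bindings of lines 2 and 3 are represented only through that mapping, and the names `DIRECTORY`, `partition` and `dependencies` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Variable -> service provider, as a directory lookup might report it.
DIRECTORY: Dict[str, str] = {"r": "local", "bc": "blast"}

# Table 2 lines 6-9 as (line number, left variable, right variable or None).
CONDITIONS: List[Tuple[int, str, Optional[str]]] = [
    (6, "r", None),    # r.finished = "today"
    (7, "bc", "r"),    # bc.querySeq = r.sequence
    (8, "bc", None),   # bc.command = "blastn"
    (9, "bc", None),   # bc.dataSource = "dbEST"
]


def partition(conditions: List[Tuple[int, str, Optional[str]]]) -> Dict[str, List[int]]:
    """Group condition lines by the provider owning the left-hand variable."""
    parts: Dict[str, List[int]] = {}
    for line_no, left, _right in conditions:
        parts.setdefault(DIRECTORY[left], []).append(line_no)
    return parts


def dependencies(conditions: List[Tuple[int, str, Optional[str]]]) -> Dict[str, Set[str]]:
    """A provider depends on another when a condition reads the other's variable."""
    deps: Dict[str, Set[str]] = {p: set() for p in DIRECTORY.values()}
    for _line_no, left, right in conditions:
        if right is not None and DIRECTORY[right] != DIRECTORY[left]:
            deps[DIRECTORY[left]].add(DIRECTORY[right])
    return deps
```

For table 2 the only cross-dependence comes from line 7, so the "blast" part depends on the "local" part and not the other way around.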

Thereafter, processor 34 sends (64) to OPM translation servers 38 and/or 42 a first round of query parts belonging to their respective service providers 24 and 26. The query parts sent in the first round are those which do not require results from other queries. In table 2, the part relating to variable r, lines 2 and 6, are sent to the OPM server 38 of database "local".

These lines designate a query for all the Fragment objects in the database which have a value "today" in their "finished" field. The OPM server translates (66) the received query part into a language recognized by database "local". The translated query part is passed to the database 24 which processes (68) the query and returns (70) the results of the query to the respective OPM server 38. The OPM server 38 translates (72) the results received from the database 24 into the OPM result format and passes the translated results to processor 34.
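The round trip through a translation server (steps 66 to 72) might look roughly like the following Python sketch, in which an in-memory SQLite database stands in for database "local". The table layout, the column names and the list-of-dictionaries result format are assumptions made for illustration; the description does not specify the native language of the database or the OPM result format.

```python
import sqlite3


def translate_query(object_class, conditions):
    """Translate (field, value) conditions into parameterised SQL.

    Assumes the object class maps to a table of the same name. Identifiers
    are interpolated without sanitising, which is acceptable only in a sketch.
    """
    where = " AND ".join(f"{field} = ?" for field, _ in conditions)
    sql = f"SELECT * FROM {object_class}"
    if where:
        sql += f" WHERE {where}"
    return sql, [value for _, value in conditions]


def translate_results(cursor):
    """Convert native rows into a common result format (a list of dicts)."""
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# Stand-in for database "local" with a hypothetical Fragments table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Fragments (frag_id TEXT, finished TEXT, sequence TEXT)")
db.executemany(
    "INSERT INTO Fragments VALUES (?, ?, ?)",
    [("f1", "today", "ACGT"), ("f2", "yesterday", "GGCC"), ("f3", "today", "TTAA")],
)

sql, params = translate_query("Fragments", [("finished", "today")])  # step 66
results = translate_results(db.execute(sql, params))                 # steps 68-72
print(sql)
print(results)
```

Keeping the two translation functions separate from the database call mirrors the division of labour in the text: the service provider only ever sees its own language, and the processor only ever sees the common format.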

If (74) the query includes additional query parts which have not yet been performed, e.g., query parts dependent on results from other queries, steps 64, 66, 68, 70 and 72 are repeated for the additional query parts. In the example of table 2, the query part formed of lines 3, 7, 8 and 9 is passed to the translation server 42 of homology engine 26. The translation server 42 translates (66) the query part into commands performed by homology engine 26. For each sequence of variable r in the output of database "local", translation server 42 sends a "blastn" command to engine 26 to perform a homology comparison between the sequence and the database "dbEST". The results received from engine 26 are summarized (72) by translation server 42 in the "output" field of the "Blast_Call" object.

In some embodiments of the present invention, system 20 begins a second round of processing query parts before the first round, on which the second round depends, is finished. Rather than waiting for the first round to complete, the second round manipulates the records as the first round provides them as results.
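One way to picture this overlap of rounds is with Python generators, where the second round consumes each record as soon as the first round yields it. This is only a sketch: `homology_search` is a hypothetical stand-in for a call to homology engine 26, and the record fields are invented for the example.

```python
events = []  # records the order in which the two rounds do their work


def first_round(fragments):
    """Yield matching records from the first query part one at a time."""
    for frag in fragments:
        if frag["finished"] == "today":
            events.append(f"local -> {frag['frag_id']}")
            yield frag


def homology_search(sequence, database):
    """Hypothetical stand-in for a homology engine; returns one fake hit."""
    return [{"accessor": f"{database}-hit-for-{sequence}", "length": len(sequence)}]


def second_round(records, database):
    """Start a homology search for each record as soon as it arrives."""
    for rec in records:
        events.append(f"blast <- {rec['frag_id']}")
        for hit in homology_search(rec["sequence"], database):
            yield rec["frag_id"], hit["accessor"]


fragments = [
    {"frag_id": "f1", "finished": "today", "sequence": "ACGT"},
    {"frag_id": "f2", "finished": "yesterday", "sequence": "GGCC"},
    {"frag_id": "f3", "finished": "today", "sequence": "TTAA"},
]
pairs = list(second_round(first_round(fragments), "dbEST"))
print(events)
print(pairs)
```

The `events` list shows the two rounds interleaving: the search for the first fragment starts before the first round has examined the remaining fragments.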

Once all the query parts have been handled by their respective service providers 24 and 26, processor 34 performs (76) any remaining operations in the queries and provides (78) the user with the results required in the SELECT section of the query. In the example of table 2, processor 34 performs the comparison in line 10 of the query. Variable h refers to the field "sequence" of the sub-object "summary" of the object "output", which represents the results from the blast comparison. Sequences having a length greater than 300 are selected from the blast results. The user is then provided with the value of the "accessor" field of the variable h and with the value of the "fragld" field of the variable r, for all the objects which fulfill the query.

The above description has focused on BLAST as a homology method; however, other types of homology servers may also be used, for example BLASTX, BLASTN and BLASTP.

Additionally, other types of data manipulation may be provided, for example, error correction, in which a sequence is corrected for various types of errors. Another type of data manipulation server is, for example, a server which guesses a tertiary structure of a protein from its sequence, for example the number of alpha helices or the protein's affinity to a certain DNA sequence.

Alternatively to guessing the structure, the server may provide a grading facility which grades a list of provided sequences for affinity to the protein (or for similarity of their derived protein) or which selects those sequences which have a certain affinity.

As can be appreciated, some of these data manipulation servers require only one input record set while others require more than one input record set. For example, a homology search can compare a first set of records against records in a second database (fixed value) or against a second set of provided records. In some cases, three or more inputs may be provided, for example where a third record set includes a list of rules which apply when comparing the two record sets. In some cases, all the record sets need to be fully specified before the manipulation can be performed. In other cases, only one or possibly not even one of the record sets needs to be fully specified before starting the manipulation. The considerations for optimizing and performing in parallel can be applied to the availability of record sets as well.

In some embodiments of the invention, the definitions of how the data manipulation server operates in the absence of data and/or the relative computation times of its different tasks are stored in directory 40, optionally along with other information useful for optimizing queries which include data manipulation.
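As a rough illustration of the kind of information directory 40 might hold, the Python sketch below stores a relative cost and a partial-input flag per source and uses them for two simple decisions. The field names, the cost values and both helper functions are assumptions made for this example; the description does not define the directory's layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    server: str
    relative_cost: float         # assumed unit: relative cost per input record
    accepts_partial_input: bool  # may the server start before its input is complete?


# Hypothetical directory contents.
DIRECTORY = {
    "local": DirectoryEntry("database 24", 1.0, True),
    "blast": DirectoryEntry("homology engine 26", 500.0, True),
}


def order_by_cost(sources):
    """Order sources so that cheaper servers run before costlier ones."""
    return sorted(sources, key=lambda name: DIRECTORY[name].relative_cost)


def may_start_early(source):
    """Report whether a source can begin work on an incomplete record set."""
    return DIRECTORY[source].accepts_partial_input


print(order_by_cost(["blast", "local"]))
print(may_start_early("blast"))
```

Running the cheap database filter first reduces the number of records that reach the expensive server, which is one plausible use of the stored computation times.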

An advantage of some of the above embodiments is that it is possible to use substantially any tool developed for manipulation of databases to access data manipulation servers. For example, graphic interface 28 may be an interface developed solely for preparing queries for database servers, as described, for example, in Kosky, Chen, Markowitz, and Szeto, E. "Exploring Heterogeneous Biological Databases: Tools and Applications", Proceedings of the 6th International Conference on Extending Database Technology (EDBT'98), Lecture Notes in Computer Science, Vol. 1377, Springer-Verlag, 1998, pp. 499-513, the disclosure of which is incorporated herein by reference. A user may use this interface to prepare sophisticated queries which include access to data manipulation servers, such as homology search engines.

Likewise, optimization tools designed for database queries may be applied, in accordance with the above embodiments, to queries which include reference to data manipulation servers. Such optimization is especially important for queries which reference data manipulation servers because usually these servers require substantially more processing time than databases.

Furthermore, the results of the queries are optionally provided in a single common format which allows use of a single standard output interface to display the results. In addition, variables representing database and pseudo database objects may be linked together using methods for linking databases described, for example, in the EDBT'98 publication referenced hereinabove. These linking methods allow simpler statement of queries and hence more transparency to the user, who does not need to know the structure of the various servers used.

Although the above described embodiments refer to queries which relate to data manipulation servers as to databases, some embodiments of the invention relate to queries which include commands to be performed by data manipulation servers, not necessarily in the manner in which databases are searched. For example, a query may include an explicit command to be carried out by a data manipulation server, such as homology engine 26. Such commands are referred to herein as application specific data type (ASDT) commands.

Table 3
SELECT l = r.fragld, a = h.accessor    (1)
FROM r in local:Fragments              (2)
b in blast:Output                      (3)
h in b.summary.sequence                (4)
WHERE r.finished = "today"             (5)
and r.sequence.blast("dbEST")          (6)
and b.query = r.sequence               (7)
and h.length > 300                     (8)

Table 3 shows a query similar to the query of table 2 in which homology engine 26 is activated using explicit commands written in a format acceptable by OPM processor 34. Line 6 in table 3 is a command to perform "blast" on the "sequence" fields of the possible values of variable r. The blast is performed against a database "dbEST". The results from performing the blast command appear in a variable b which is defined in line 3 of table 3.

In an embodiment of the present invention, when processor 34 encounters an ASDT command, such as the "blast" command on line 6, it first checks with the database involved, the "local" database, whether the database supports the command in the specific syntax. If it does not, processor 34 consults directory 40 to determine a server which has the routine invoked by the command. Processor 34 passes the ASDT command, with whatever data objects to which the command relates, directly to the determined server. Alternatively, the command is passed through translation server 42. The output from the server is optionally passed to processor 34 in a structured form, as described above, so as to allow easy manipulation of the results. In this embodiment, processor 34 does not model homology engine 26 as a database 24, but does access the homology engine from within a complex query which accesses databases.
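The lookup order described above (the database first, then the directory) can be sketched in Python as follows. The `Server` class, the `dispatch` function and `fake_blast` are hypothetical names introduced for illustration, and the structure of the returned output is invented; the sketch shows only the order of the checks.

```python
class Server:
    """A service provider exposing named routines (hypothetical model)."""

    def __init__(self, name, routines):
        self.name = name
        self.routines = routines  # command name -> callable

    def supports(self, command):
        return command in self.routines

    def run(self, command, data, *args):
        return self.routines[command](data, *args)


def dispatch(command, data, args, database, directory):
    """Run a command on the database if supported, else on a directory server."""
    if database.supports(command):
        return database.name, database.run(command, data, *args)
    server = directory.get(command)
    if server is None:
        raise LookupError(f"no server provides routine {command!r}")
    return server.name, server.run(command, data, *args)


def fake_blast(sequence, target):
    """Stand-in routine returning a structured, empty result."""
    return {"query": sequence, "database": target, "summary": []}


local = Server("local", {})  # the database does not support "blast"
engine = Server("homology engine 26", {"blast": fake_blast})
directory = {"blast": engine}  # hypothetical content of directory 40

handled_by, output = dispatch("blast", "ACGT", ("dbEST",), local, directory)
print(handled_by, output)
```

Because the result comes back as a structured object rather than raw text, the processor can apply the remaining conditions of the query to it in the same way as to database results.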

The ASDT commands do not necessarily appear in the WHERE section of the query. Table 4 shows a query in which a command appears in the SELECT section of the query. The command is processed after the query is evaluated, at a stage of presenting the results of the query.

Table 4
SELECT x.gelld, x.image.crop(0,0,200,400).display()
FROM x in Gel
WHERE x.gelld = "gel000111"

In table 4, the "image" fields of the variables x which satisfy the query are passed to a routine "crop", which returns a piece of an image having specified coordinates. The results from the routine "crop" are passed to a routine "display" which displays the result in a desired manner.

The routines referenced by the ASDT commands may be evaluated by a data manipulation server as described above with reference to the blast command evaluated by homology engine 26. Alternatively or additionally, some routines may be situated within processor 34 and/or directory 40. The statement of the commands within a query rather than [...]. In addition, invoking the commands from within the query allows performing the commands [...] and communication resources.

In some cases users accessing databases are frequently interested in attributes which may be extracted from the image of a complex data field, for example, a gel. Such attributes include, for example, the length of an image of the gel, its average intensity or specific lanes of the image. Therefore, some databases have redundant data fields which have values for these attributes. By using ASDT commands these redundant fields are not needed. The routines invoked by the ASDT commands may be stored in the database 24, on a separate data manipulation server, in directory 40 and/or in processor 34.
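The idea of deriving image attributes on demand, instead of storing them in redundant fields, can be sketched in Python as below. The `Image` class is hypothetical, the argument order of `crop` (left, top, right, bottom) is an assumption, since table 4 does not state it, and `display` here merely returns text rather than drawing anything.

```python
class Image:
    """A grayscale image held as a list of rows of pixel intensities."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def crop(self, left, top, right, bottom):
        """Return the sub-image of columns [left, right) and rows [top, bottom)."""
        return Image(row[left:right] for row in self.rows[top:bottom])

    def length(self):
        """Number of pixel rows, computed rather than stored."""
        return len(self.rows)

    def average_intensity(self):
        """Mean pixel value, computed rather than stored."""
        pixels = [p for row in self.rows for p in row]
        return sum(pixels) / len(pixels)

    def display(self):
        """Render the image as aligned text, one line per pixel row."""
        return "\n".join(" ".join(f"{p:3d}" for p in row) for row in self.rows)


gel = Image([[0, 10, 20], [30, 40, 50], [60, 70, 80]])
piece = gel.crop(0, 0, 2, 2)  # chained calls, as in x.image.crop(...).display()
print(piece.display())
print(piece.length(), piece.average_intensity())
```

Since `crop` returns another `Image`, routines can be chained in the manner of table 4, and attributes such as length or average intensity never need a stored field of their own.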

It is noted that the ASDT commands may be invoked implicitly, as described above with reference to Fig. 2. In some embodiments of the invention, for each command, a command data object is defined which includes input and output fields of the command. An access to an output field of the object is translated by system 20 as an implicit invocation of the command.

It will be appreciated that the above described methods may be varied in many ways, including changing the order of steps and the exact implementation used. It should also be appreciated that the above description of methods and apparatus is to be interpreted as including apparatus for carrying out the methods and methods of using the apparatus. Especially, the above methods should be interpreted to describe software for carrying out a complete method as described above, a part thereof or software which modifies an existing software to perform as described above. In addition, the scope of the invention includes such software stored on a computer readable medium, such as a disk, stored in a memory or executing on a computer.

The present invention has been described using non-limiting detailed descriptions of the embodiments thereof that are provided by way of example and are not intended to limit the scope of the invention. It should be understood that features and/or steps described with respect to one embodiment may be used with other embodiments and that not all embodiments of the invention have all of the features and/or steps shown in a particular figure or described with respect to one of the embodiments. Variations of embodiments described will occur to persons of the art.

It is noted that some of the above described embodiments describe the best mode contemplated by the inventors and therefore include structure, acts or details of structures and acts that may not be essential to the invention and which are described as examples. Structure and acts described herein are replaceable by equivalents which perform the same function, even if the structure or acts are different, as known in the art. Therefore, the scope of the invention is limited only by the elements and limitations as used in the claims. When used in the following claims, the terms "comprise", "include", "have" and their conjugates mean "including but not limited to".

Claims

1. A multi-database query system for querying a plurality of biological databases containing biological data, comprising: an input which receives a query in a structured form; a processor for receiving the query and dividing the query into a plurality of query parts, wherein the plurality of query parts corresponds to at least one database of the plurality of biological databases and at least one condition statement; and at least one translation server which translates at least one of the plurality of query parts into commands recognized by a data manipulation server associated with a biological database of the plurality of biological databases and returns results of the query parts to the processor; wherein the processor determines whether the query includes unprocessed parts and, if the query has unprocessed parts, sends at least one unprocessed part to the at least one translation server, repeating the process until all unprocessed parts are processed, and wherein the processor further applies one or more conditions within the at least one condition statement to the processed query and generates a user output meeting the one or more conditions.

2. The system according to claim 1, wherein the translation server models results from the data manipulation server into database objects.

3. The system according to any of claims 1 and 2, wherein the data manipulation server comprises a server that receives input from at least two different sources.

4. The system according to any of claims 1-3, further comprising a directory in communication with the processor, wherein the processor refers to the directory to determine how to divide the query.

5. The system according to any of claims 1-4, wherein the at least one translation server comprises at least two translation servers associated with at least one local biological database and at least one remote database.

6. The system according to claim 5, wherein the query parts are sent to the at least two translation servers in parallel.

7. The system according to any one of claims 1-6, wherein the data manipulation server is a homology search engine.

8. The system according to claim 7, wherein the homology search engine is BLAST.

9. The system according to any one of claims 1-8, wherein the processor further determines cross-dependence of the query parts and, if cross-dependence is found, further divides the query parts into independent parts and dependent parts.

10. The system according to claim 9, wherein the query parts are cross-dependent and are sent to the at least two translation servers sequentially so that independent parts are sent before dependent parts.

11. The system according to any of claims 1-10, wherein the query is formed in an Object Protocol Model (OPM)-like language.

12. In a multi-database query system, a method of querying a plurality of biological databases containing biological data, comprising: inputting a query in a structured form; receiving the query in a processor and dividing the query into a plurality of query parts, wherein the plurality of query parts corresponds to at least one database of the plurality of biological databases and at least one condition statement; using at least one translation server, translating at least one of the plurality of query parts into commands recognized by a data manipulation server associated with a biological database of the plurality of biological databases and returning results of the query parts to the processor; determining whether the query includes unprocessed parts and, if the query has unprocessed parts, sending at least one unprocessed part to the at least one translation server; repeating steps [...] and [...] until all unprocessed parts of the query are processed; applying one or more conditions within the at least one condition statement to the processed query; and generating a user output meeting the one or more conditions.

13. The method according to claim 12, wherein the at least one translation server models results from the data manipulation server into database objects.

14. The method according to either claim 12 or claim 13, wherein the data manipulation server comprises a server that receives input from at least two different sources.

15. The method according to any of claims 12-14, further comprising consulting a directory in communication with the processor to determine how to divide the query.

16. The method according to any of claims 12-15, wherein the at least one translation server comprises at least two translation servers associated with at least one local biological database and at least one remote database.

17. The method according to any one of claims 12-16, wherein the data manipulation server is a homology search engine.

18. The method according to claim 17, wherein the homology search engine is BLAST.

19. The method according to claim 16, wherein the query parts are sent to the at least two translation servers in parallel.

20. The method according to any one of claims 16-19, further comprising determining cross-dependence of the query parts and, if cross-dependence is found, dividing the query parts into independent parts and dependent parts.

21. The method according to any one of claims 16-19, wherein the query parts are cross-dependent, and further comprising sending the query parts to the at least two translation servers sequentially, with independent parts sent before dependent parts.

22. The method according to any of claims 12-21, wherein inputting a query comprises submitting a query in an Object Protocol Model (OPM)-like language.

23. A multi-database query system which queries a plurality of biological databases and servers substantially as herein described.

24. A method of querying a plurality of biological databases using a multi-database query system substantially as herein described.