WO2001001294A2 - Biological data processing - Google Patents

Biological data processing

Info

Publication number
WO2001001294A2
Authority
WO
WIPO (PCT)
Prior art keywords
query
data manipulation
database
server
directives
Prior art date
Application number
PCT/IB2000/000863
Other languages
English (en)
Other versions
WO2001001294A3 (fr)
WO2001001294A8 (fr)
Inventor
Thodoros Topaloglou
Anthony Kosky
Original Assignee
Gene Logic Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gene Logic Inc.
Priority to EP00938960A (EP1228447A2)
Priority to CA002377823A (CA2377823A1)
Priority to US10/018,461 (US6931396B1)
Priority to AU54179/00A (AU774973B2)
Publication of WO2001001294A2
Publication of WO2001001294A8
Publication of WO2001001294A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2452 Query translation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present invention relates to automated database searching and in particular to automated access to biological databases.
  • databases include a plurality of records which have the form of an object class.
  • the object class is formed of a plurality of fields, often in a hierarchy in which an object class includes one or more sub-object classes which in turn may include sub-sub object classes.
  • the records may represent, for example, gene sequences and may have fields which include various data about the sequences, such as their length, origin and a view of the sequence.
  • Information is extracted from databases by querying a management system associated with the database.
  • a simple query includes a request to display one or more fields of records which fulfill certain criteria.
  • OPM Object Protocol Model
  • An OPM processor mediates between a user and databases associated with the OPM suite.
  • a common organization methodology is used to represent the data in all the databases accessed via the OPM processor. Queries addressed to databases via the OPM processor are provided, by a user to the OPM processor, in a structured form expressed in accordance with the common organization methodology.
  • the OPM processor translates the queries from the structured OPM form to query forms compatible with the management systems of the specific databases to which the queries are addressed.
  • the results from the specific databases are returned to the OPM processor which translates the results back to the organization methodology of the OPM suite.
  • the OPM suite allows a user to access a plurality of different databases in different forms, and also allows the user to access a plurality of databases using a single query.
  • a complex query may request to display the records from a first database which have a gene length greater than that of corresponding records of a second database which represent the same organism.
  • GUI graphic user interface
  • Some of the forms of biological data are complex data structures, such as gene sequences, which require special procedures for manipulation, for example, for performing comparisons.
  • Homology search engines, such as BLAST, are used to compare gene sequences.
  • the user retrieves all the desired classified gene sequences using OPM. Then, the user passes the retrieved data to a homology sequence server which performs the sequence comparison.
  • One aspect of some embodiments of the invention provides a method for accessing data manipulation servers using a structured query format used to query databases.
  • the accessing of manipulation servers is integrated with the accessing of database information, for example by manipulating the results of the data access and/or by using the results of the data manipulation as data to be accessed or for restricting queries.
  • One aspect of some embodiments of the present invention relates to a multi-database query system which receives queries which relate to both database and data manipulation servers, such as homology search engines.
  • the queries relate to the data manipulation servers as if they are database servers, allowing use of any tool of the multi-database query system developed for database queries, on queries which access data manipulation servers.
  • tools include, for example, database linking tools, graphic query preparation tools and query optimization tools.
  • the data manipulation server may process results from the database as they are provided before the database runs through all its records. Alternatively or additionally, the results of a data manipulation step may be further queried.
  • the response time required for a complex query may be substantially reduced.
  • the amount of traffic on a network may be reduced and/or better spread out in time. Also, complex operations may require less of a user intervention.
  • each data manipulation server associated with the query system has a translation server which mediates between the data manipulation server and the query system.
  • the translation server receives commands from the server in a structured query form used by the query system and translates the commands to a form in which the data manipulation server receives commands.
  • the translation server optionally also receives results from the data manipulation server and presents the results to the query system in objects organized according to structured object classes used by the query system.
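  • The mediation described above can be sketched as a small adapter. This is a minimal illustration only, not the patent's implementation: the class and function names, the "RUN ... ON ... WITH ..." command format, and the tab-separated result format are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hit:
    """Structured result object handed back to the query system."""
    accession: str
    score: int


class TranslationServer:
    """Mediates between a query system and one data manipulation server."""

    def __init__(self, engine: Callable[[str], str]):
        # `engine` stands in for the real server: command text in, raw text out.
        self.engine = engine

    def to_command(self, directive: Dict[str, str]) -> str:
        # Structured directive -> command in the engine's own (invented) format.
        return "RUN {program} ON {datasource} WITH {query}".format(**directive)

    def to_objects(self, raw: str) -> List[Hit]:
        # Raw tab-separated lines -> structured objects for the query system.
        hits = []
        for line in raw.splitlines():
            accession, score = line.split("\t")
            hits.append(Hit(accession, int(score)))
        return hits

    def run(self, directive: Dict[str, str]) -> List[Hit]:
        return self.to_objects(self.engine(self.to_command(directive)))


def fake_engine(command: str) -> str:
    # Hypothetical engine: returns two canned hits regardless of the command.
    return "X123\t310\nY456\t120"
```

  • The query system only ever sees directives and `Hit` objects; everything specific to the engine is confined to the two translation methods.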
  • a multi-database query system which queries a plurality of databases and servers, including an input which receives queries in a structured form, and a translation server which translates at least a part of a received query into commands recognized by a data manipulation server.
  • the system comprises a processor which parses the received query into parts according to the databases and servers to which they relate.
  • the structured form comprises a form used to query databases.
  • the input receives a query which relates to at least one database and at least one data manipulation server.
  • the translation server models results from the data manipulation server into database objects.
  • the data manipulation server comprises a server which receives input from at least two different sources.
  • the data manipulation server comprises a homology comparison engine.
  • a method of accessing a data manipulation server from a multi-database query system including providing the query system with a query which includes a first directive assigning a value to at least one field of an input object associated with the data manipulation server and a second directive which determines a value of at least one field of an output object associated with the data manipulation server, and invoking the data manipulation server responsive to the second directive.
  • providing the query comprises preparing the query using a graphical interface designed for querying structured databases.
  • the data manipulation server comprises a homology engine.
  • a method of performing a database search using a multi-database query system including providing the query system with a query which includes at least one directive related to a database and at least one directive related to a data manipulation server, wherein the directives are stated in an identical structural format, translating the directives into commands recognized by the database and the data manipulation server, and submitting the commands respectively to the data manipulation server and to the database.
  • the data manipulation server comprises a homology comparison engine.
  • translating the directives comprises identifying, by a query processor, the directives directed to the database and the directives directed to the data manipulation server.
  • translating the directives comprises passing the directives to translation servers associated with the database or data manipulation server to which the directives are directed.
  • the method comprises determining an order for the directives to be processed in and submitting the translated directives to the data manipulation server and to the database according to the determined order.
  • the method comprises receiving results from said submission and translating the results into structured objects.
  • translating the results into structured objects comprises translating the results to structured objects related to the directives.
  • providing a query comprises providing a query in an Object Protocol Model (OPM) format.
  • FIG. 1 is a schematic illustration of a multi-database query system, in accordance with an embodiment of the invention.
  • Fig. 2 is a flowchart of the actions performed by the multi-database query system of Fig. 1, in accordance with an embodiment of the present invention.
  • FIG. 1 is a schematic illustration of a multi-database query system 20, in accordance with an embodiment of the invention.
  • System 20 mediates between an end-user 22, and a plurality of service providers which include databases 24 and one or more data manipulation servers, such as a homology search engine 26. Error detection processes are another example of data manipulation servers.
  • Engine 26 is a data manipulation server in that it provides processing services and is not primarily used for storing and providing information. In some embodiments of the invention, engine 26 does not store information and a user requesting processing services must provide the information to be processed or must provide a link to a database or file containing the information.
  • Data manipulation servers may receive a single input of data, e.g., error detection processes which receive a single sequence, or a plurality of inputs, e.g., homology engines which compare sequences from two different sources.
  • One of the objects of some embodiments of the invention is to allow end-user 22 to relate to homology engine 26 and/or to other data manipulation servers as if they were databases 24.
  • Databases 24 may be organized differently from each other and are not generally controllable by a supervisor of system 20. End user 22 provides system 20 with queries in a query-language of system 20, for example a structured query language, such as OPM.
  • a single query may be directed to more than one service provider. For example, a single query may be directed to a plurality of databases 24 and to homology engine 26.
  • system 20 comprises a graphical user interface 28 which receives queries in a graphical form and translates them into the system's query language.
  • system 20 comprises a command-line interface 30 which receives commands from end-user 22 directly in the system's query language or possibly using natural language.
  • system 20 comprises a remote-unit interface 32 which receives queries from remote computer units.
  • System 20 further comprises a multi-database query processor 34 which receives queries from interfaces 28, 30 and/or 32 and processes them, as described hereinbelow.
  • query processor 34 and interfaces 28, 30 and/or 32 are implemented in software on a single computer 36 accessible to end-user 22. Alternatively, a distributed configuration is used.
  • system 20 further comprises, for each database 24, an OPM translation server 38 that mediates between processor 34 and the respective service provider.
  • translation servers 38 translate queries from the query language of system 20 into query languages supported by the respective database 24.
  • translation servers 38 translate query results received from the databases 24 into the structural object classes of system 20.
  • system 20 comprises an OPM translation server 42 which mediates between processor 34 and homology engine 26.
  • translation server 42 translates query portions from the query language of system 20 into commands supported by homology engine 26. That is, the OPM language allows, in accordance with embodiments of the invention, phrasing queries that access homology engine 26 as a database.
  • Translation server 42 translates query directives, such as limitations, into commands to be performed by homology engine 26.
  • translation server 42 optionally translates the output from homology engine 26 into structural objects, in accordance with the query language used by system 20.
  • An exemplary structural definition of objects used to access a homology engine from the OPM suite is described in Table 1.
  • CONTROLLED VALUE CLASS GenCode_Cv ("Standard or Universal", 1), ("Vertebrate Mitochondrial", 2), ("Yeast Mitochondrial", 3), ("Mold, Protozoan, ..", 4), ("Invertebrate Mitochondrial", 5), ("Ciliate Macronuclear", 6), ("Echinodermate Mitochondrial", 9), ("Alternative Ciliate Macronuclear", 10), ("Eubacterial", 11), ("Alternative Yeast", 12),
  • a blast call object represents a particular homology search using a blast engine
  • ATTRIBUTE program: BlastProgram_Cv REQUIRED
    ATTRIBUTE query: VARCHAR(2000) REQUIRED
  • ATTRIBUTE datasource: DB_Cv REQUIRED
  • ATTRIBUTE param_X: INTEGER OPTIONAL
    ATTRIBUTE param_N: INTEGER OPTIONAL
  • ATTRIBUTE querySeq: VARCHAR(2000) REQUIRED
    ATTRIBUTE queryLength: INTEGER REQUIRED
  • ATTRIBUTE database: DB_Cv REQUIRED
  • ATTRIBUTE dbSize_Seqs: INTEGER REQUIRED
    ATTRIBUTE dbSize_Letters: INTEGER REQUIRED
    ATTRIBUTE dbFile: VARCHAR(80) REQUIRED
    ATTRIBUTE dbReleased: VARCHAR(40) REQUIRED
    ATTRIBUTE dbPosted: VARCHAR(40) REQUIRED
    ATTRIBUTE hitSatE: INTEGER REQUIRED
    ATTRIBUTE searchTime: VARCHAR(40) REQUIRED
    ATTRIBUTE totalTime: VARCHAR(40) REQUIRED
    ATTRIBUTE runDate: VARCHAR(40) REQUIRED
    ATTRIBUTE parameters: set-of [1,] OutputParameters REQUIRED
  • ATTRIBUTE strand: VARCHAR(10) REQUIRED
    ATTRIBUTE frame: VARCHAR(10) REQUIRED
    ATTRIBUTE matrixId: VARCHAR(10) REQUIRED
    ATTRIBUTE matrixName: VARCHAR(10) REQUIRED
    ATTRIBUTE lambda_Used: VARCHAR(10) REQUIRED
    ATTRIBUTE K_Used: VARCHAR(10) REQUIRED
    ATTRIBUTE H_Used: VARCHAR(10) REQUIRED
  • ATTRIBUTE headerId: INTEGER REQUIRED
    ATTRIBUTE program: VARCHAR(8) REQUIRED
    ATTRIBUTE version: VARCHAR(20) REQUIRED
    ATTRIBUTE revision: VARCHAR(20) REQUIRED
    ATTRIBUTE build: VARCHAR(40) REQUIRED
    ATTRIBUTE queryId: VARCHAR(20) REQUIRED
    ATTRIBUTE querySeq: VARCHAR(2000) REQUIRED
    ATTRIBUTE database: DB_Cv REQUIRED
    ATTRIBUTE numOfSequences: INTEGER REQUIRED
    ATTRIBUTE numOfLetters: INTEGER REQUIRED
    ATTRIBUTE program: VARCHAR(8) REQUIRED
    ATTRIBUTE version: VARCHAR(20) REQUIRED
    ATTRIBUTE revision: VARCHAR(20) REQUIRED
    ATTRIBUTE build: VARCHAR(40) REQUIRED
    ATTRIBUTE queryId: VARCHAR(20) REQUIRED
    ATTRIBUTE querySeq: VARCHAR(2000) REQUIRED
  • ATTRIBUTE score: INTEGER REQUIRED
  • ATTRIBUTE pvalue: REAL REQUIRED
    ATTRIBUTE num: INTEGER REQUIRED
  • ATTRIBUTE score: INTEGER REQUIRED
  • ATTRIBUTE pvalue: REAL REQUIRED
    ATTRIBUTE strand1: VARCHAR(1) REQUIRED
  • Table 1. The structural definition of Table 1 is written in a language used to define OPM objects, described for example in Chen, I.A.; Kosky, A.S.; Markowitz, V.M.; Szeto, E.; and Topaloglou, T., 1998. "Advanced Query Mechanisms for Biological Databases" in Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology (ISMB'98), the disclosure of which is incorporated herein by reference.
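  • As a rough illustration of how the input attributes listed in Table 1 might map onto a typed object in a general-purpose language, the sketch below models them as a Python dataclass. The class headers of Table 1 did not survive extraction, so grouping these five attributes into one `BlastCall` class is an assumption, as are the underscore spellings `param_X` and `param_N`; the length check mirrors the VARCHAR(2000) declaration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlastCall:
    """Hypothetical Python counterpart of the input side of a blast call object."""
    program: str                    # BlastProgram_Cv, REQUIRED
    query: str                      # VARCHAR(2000), REQUIRED
    datasource: str                 # DB_Cv, REQUIRED
    param_X: Optional[int] = None   # INTEGER, OPTIONAL
    param_N: Optional[int] = None   # INTEGER, OPTIONAL

    def __post_init__(self) -> None:
        # Enforce the declared maximum length of the query attribute.
        if len(self.query) > 2000:
            raise ValueError("query exceeds VARCHAR(2000)")
```

  • REQUIRED attributes become positional fields without defaults, and OPTIONAL attributes default to `None`.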
  • a single translation server 38 may be used for more than one service provider.
  • OPM processor 34 performs some or all of the translation tasks of translation servers 38 and 42.
  • OPM servers 38 and 42 are situated on the same computer as their respective service providers 24 and 26.
  • OPM servers 38 and 42 are located on computers proximal to their respective service providers 24 and 26, although translation servers may be located substantially anywhere.
  • a multi-database directory 40 is used by processor 34 to determine to which service provider 24 and 26, the portions of a query are directed.
  • Directory 40 summarizes the contents, organization methodologies and capabilities of databases 24 and engines 26.
  • a single directory is used for a plurality of query processors 34, such that adding additional service providers to system 20 requires only preparing a respective OPM server for the additional service providers and updating directory 40, while no changes are needed in processors 34.
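  • A directory of this kind can be pictured as a simple registry. The sketch below is an assumption-laden illustration: `ProviderEntry`, `Directory`, and their fields are hypothetical names, and the real directory 40 also records organization methodologies and capabilities that are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ProviderEntry:
    """What the directory knows about one service provider (illustrative)."""
    kind: str                       # "database" or "data_manipulation"
    translation_server: str         # which translation server mediates access
    object_classes: Set[str] = field(default_factory=set)


class Directory:
    """Maps the names used in queries to registered service providers."""

    def __init__(self) -> None:
        self._entries: Dict[str, ProviderEntry] = {}

    def register(self, name: str, entry: ProviderEntry) -> None:
        # Adding a provider only touches the directory, not the processors.
        self._entries[name] = entry

    def provider_for(self, name: str) -> ProviderEntry:
        try:
            return self._entries[name]
        except KeyError:
            raise LookupError(f"no service provider registered as {name!r}") from None
```

  • Because query processors only call `provider_for`, registering a new provider requires no change on their side, which is the property the paragraph above describes.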
  • the various components of system 20 interact using a distributed-object technology, such as the Common Object Request Broker Architecture (CORBA), which is described, for example, on the web site of the "Object Management Group" (OMG) at www.omg.org, as available on June 27, 1999. The disclosure of this web site is incorporated herein by reference.
  • CORBA Common Object Request Broker Architecture
  • a plurality of different CORBA interfaces are used in system 20 for different types of interactions between the components of system 20.
  • a first CORBA interface is used for programming and a second interface is used for object transfer and/or sharing.
  • remote-unit interface 32 also comprises a CORBA interface.
  • system 20 may be implemented in its entirety by a single process and/or on a single processor.
  • COM Microsoft's Component Object Model
  • RPC Remote procedure call
  • Table 2 illustrates a sample query received by query processor 34 from any of interfaces 28, 30 and 32.
  • the query in table 2 is written according to the OPM query language described, for example, in the ISMB'98 publication referenced hereinabove. This OPM query language allows accessing a plurality of databases 24 from a single query.
  • the query of table 2 relates to both a database 24 and an homology engine 26, the homology engine being accessed as if it were a database.
  • the query in table 2 is built of three sections.
  • a first section labeled SELECT states the fields which are to appear in the output generated responsive to the query.
  • these fields are a "fragId" field of a variable r, and an "accessor" field of a variable h (the variables r and h are defined in the second section).
  • a second section, labeled FROM, defines the variables mentioned in the query by stating the database object classes to which they relate. That is, the second section states which objects are candidates for fulfilling the query.
  • the variable r for example, corresponds to a "Fragments" object class in a database named "local".
  • variable "be" corresponds to an object class named "Blast_Call" in a pseudo database "blast".
  • unlike variable r, which represents an actual field of data in a database 24, variable "be" does not represent any such field, and a database "blast" does not actually exist.
  • processor 34 refers to homology engine 26.
  • translation server 42 performs any required translations to the input and output of homology engine 26, such that the homology engine appears to processor 34 as a database.
  • the entire interface with homology engine 26 is structured in a single translation object, for example, in accordance with the "Blast_Call" object class in table 2, which is defined in Table 1.
  • the translation object includes the input to and output from homology engine 26.
  • the "Blast_Call" object class has fields which relate to the commands to engine 26, such as a "command" field which states the type of command performed by engine 26, a "querySeq" field which states an input sequence to be compared by the engine and a "dataSource" field which states a database of sequences to which the input sequence is compared.
  • the "Blast_Call" object class has an "output" field into which the output from homology engine 26 is preferably structurally stored.
  • a dummy variable, "bo", refers to the sub-object "output", thus simplifying the query statements.
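  • One way to picture a single translation object that holds both the input to and the output from the engine is an object whose output is computed only when it is read. The sketch below is an assumption, not the patent's mechanism: the class name, the lazy property, and the toy engine are invented, and the field is named `program` after Table 1 although the text above calls it "command".

```python
from typing import Callable, List, Optional


class BlastCallObject:
    """Pseudo-database object: input fields plus a lazily filled output."""

    def __init__(self, engine: Callable[[str, str, str], List[str]]):
        self._engine = engine           # hypothetical stand-in for engine 26
        self.program: Optional[str] = None
        self.querySeq: Optional[str] = None
        self.dataSource: Optional[str] = None
        self._output: Optional[List[str]] = None
        self.engine_calls = 0           # bookkeeping for the demonstration

    @property
    def output(self) -> List[str]:
        # Reading the output is what triggers the engine; assigning the
        # input fields alone does nothing. The result is cached, so this
        # sketch assumes the inputs are not changed afterwards.
        if self._output is None:
            if None in (self.program, self.querySeq, self.dataSource):
                raise ValueError("all input fields must be assigned first")
            self.engine_calls += 1
            self._output = self._engine(self.program, self.querySeq, self.dataSource)
        return self._output


def toy_engine(program: str, seq: str, source: str) -> List[str]:
    # Invented behaviour: returns one labelled string instead of real hits.
    return [f"{program}:{source}:{len(seq)}"]
```

  • Conditions that assign input fields and conditions that read the output can then be written in the same style, while the engine runs at most once per object.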
  • When a query relates to an action, such as a search or a filter to be performed in a pseudo database, processor 34 first has the respective engine 26 perform any required commands to fill up the output fields of the object representing the pseudo database, e.g., "Blast_Call", and only then the search is performed. Alternatively or additionally, as the output records become available from homology engine 26 they are sent for further processing. In some cases, the records can be processed even before all the fields are available from engine 26.
  • One example of a query optimization as applied to data manipulation servers is that the query translator instructs the engine to prepare only those result fields which are actually required for further processing or display. Another example of optimization is allowing some of the fields to be provided at a later time than other fields.
  • Modifying the order of generation of fields, even between records, may be useful if some fields are required for further data manipulation or for querying against a slow database and are thus time critical. For some types of data manipulation, it may even be useful to start the manipulation with only part of the fields and then repeat the manipulation with the rest of the fields.
  • One example where it is useful to start manipulating before all the fields are available is where the manipulation can be carried out, at least to some extent, without the field or where the value of the field or the range of possible values of the field can be known.
  • a DNA homology can be failed based on both of the strands not matching, even before it is known which strand needs to be matched. Once the strand information is available, the group of accepted matches can be further limited using that information.
  • system 20 can have different parts of a query evaluated in parallel, in particular, time consuming parts performed by data manipulation servers.
  • homology engine 26 may begin to operate as records from another part of a query become available, and/or the output from engine 26 may be processed as it is provided, without waiting for all the results.
  • This parallelism is possible because homology engine 26 is accessed from within the query.
  • An advantage of some embodiments of the invention is the savings in response time and in communication and CPU resources of complex queries due to this parallelism. In some cases, such parallel processing of data manipulation may require the data manipulation server or the data manipulation program itself to be modified to take the timing information into account.
  • a blast server may associate the actual partial information used with a result record set, so that it can further limit the search results after the fact.
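  • The record-by-record hand-off described above can be illustrated with chained generators. This is a single-threaded pipelining sketch rather than true parallel execution, and every name in it is hypothetical; the "comparison" is replaced by a trivial length computation so that the interleaving is easy to observe in the log.

```python
from typing import Iterable, Iterator, List, Tuple


def database_scan(rows: Iterable[str], log: List[str]) -> Iterator[str]:
    # Stand-in for a database returning records one at a time.
    for row in rows:
        log.append(f"db:{row}")
        yield row


def homology_stage(records: Iterable[str], log: List[str]) -> Iterator[Tuple[str, int]]:
    # Stand-in for the engine: handles each record as soon as it arrives.
    for rec in records:
        log.append(f"engine:{rec}")
        yield rec, len(rec)  # placeholder "score", not a real comparison


def run(rows: List[str]) -> Tuple[List[Tuple[str, int]], List[str]]:
    log: List[str] = []
    results = list(homology_stage(database_scan(rows, log), log))
    return results, log
```

  • Calling `run(["AC", "ACGT"])` produces a log in which database and engine entries alternate, showing that the engine stage does not wait for the scan to finish.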
  • a third section of the query, labeled WHERE, states the conditions to be fulfilled by those objects selected by the query. In table 2 these conditions include that a field named "finished" of the variable r must have a value "today", a field "querySeq" of the variable be must have a value equal to the value of the field "sequence" of variable r, etc.
  • the conditions on database objects and on pseudo database objects are stated substantially in the same way.
  • Fig. 2 is a flowchart of the actions performed in processing a query by system 20, in accordance with an embodiment of the present invention.
  • Upon receiving a query, such as the query in table 2, processor 34 divides (60) the query into parts which are performed by the various service providers 24 and 26.
  • Processor 34 determines, for example using methods known in the art, to which service provider each line in the query is directed. In an exemplary embodiment of the present invention, the determination is performed by reference to directory 40.
  • processor 34 determines from the second line that variable r is to be searched in the database 24 named "local". From the third line it is determined that variable be is to be "searched" in engine 26 named "blast".
  • lines 2 and 6 of the query are directed to the database "local" and lines 3, 7, 8 and 9 are directed to homology engine 26.
  • Lines 1, 4, 5 and 10 do not refer to any database and therefore they are processed by processor 34.
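  • The division step can be sketched as a grouping of query lines by the provider they name. The function and the two mappings below are hypothetical: the per-line targets simply restate the assignment given in the text for the query of table 2, and the directory is reduced to a plain dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def divide_query(line_targets: Dict[int, Optional[str]],
                 directory: Dict[str, str]) -> Dict[str, List[int]]:
    """Group query line numbers by the service provider that handles them."""
    parts: Dict[str, List[int]] = defaultdict(list)
    for line_no in sorted(line_targets):
        target = line_targets[line_no]
        # Lines that name no known database stay with the query processor.
        provider = directory.get(target, "processor")
        parts[provider].append(line_no)
    return dict(parts)


# Assumed restatement of the text: which database each line refers to.
TABLE2_TARGETS = {1: None, 2: "local", 3: "blast", 4: None, 5: None,
                  6: "local", 7: "blast", 8: "blast", 9: "blast", 10: None}
DIRECTORY = {"local": "database 24", "blast": "homology engine 26"}
```

  • With these inputs, `divide_query(TABLE2_TARGETS, DIRECTORY)` groups lines 2 and 6 under the database, lines 3, 7, 8 and 9 under the homology engine, and the rest under the processor.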
  • Processor 34 determines (62) the cross-dependence of the parts of the query, i.e., which parts require data from other parts and therefore must receive the data from the other parts before they are performed.
  • in table 2, it is determined from line 7 that the query part directed to homology engine 26 requires output from another query part.
  • processor 34 sends (64) to OPM translation servers 38 and/or 42 a first round of query parts belonging to their respective service providers 24 and 26.
  • the query parts sent in the first round are those which do not require results from other queries.
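  • Ordering query parts by their cross-dependence amounts to a layered topological sort. The sketch below is one possible way to compute the rounds, under the assumption that each part's dependencies are already known; the function name and the part labels are illustrative only.

```python
from typing import Dict, List, Set


def schedule_rounds(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group query parts into rounds; a part runs once all its inputs exist."""
    remaining = {part: set(needs) for part, needs in deps.items()}
    done: Set[str] = set()
    rounds: List[List[str]] = []
    while remaining:
        # A part is ready when everything it depends on has been produced.
        ready = sorted(p for p, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency between query parts")
        rounds.append(ready)
        done.update(ready)
        for part in ready:
            del remaining[part]
    return rounds
```

  • For the example discussed here, the database part has no dependencies and lands in the first round, the homology part depends on it and lands in the second, and the processor's remaining operations come last.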
  • the part relating to variable r, i.e., lines 2 and 6, is sent to the OPM server 38 of database "local". These lines designate a query for all the Fragment objects in the database which have a value "today" in their "finished" field.
  • the OPM server translates (66) the received query part into a language recognized by database "local".
  • the translated query part is passed to the database 24 which processes (68) the query and returns (70) the results of the query to the respective OPM server 38.
  • the OPM server 38 translates (72) the results received from the database 24 into the OPM result format and passes the translated results to processor 34.
  • if the query includes additional query parts which were not yet performed, e.g., query parts dependent on results from other queries, steps 64, 66, 68, 70 and 72 are repeated for the additional query parts.
  • the query part formed of lines 3, 7, 8 and 9 is passed to the translation server 42 of homology engine 26.
  • the translation server 42 translates (66) the query part into commands performed by homology engine 26.
  • translation server 42 sends a "blastn" command to engine 26 to perform a homology comparison between the sequence and the database "dbEST".
  • the results received from engine 26 are summarized (72) by translation server 42 in the "output" field of the "Blast_Call" object.
  • system 20 begins a second round of processing query parts before a first round on which the second round depends, is finished. Rather, as the first round provides records as results, the second round can manipulate them.
  • processor 34 performs (76) any remaining operations in the queries and provides (78) the user with the results required in the SELECT section of the query.
  • processor 34 performs the comparison in line 10 of the query.
  • Variable h refers to the field "sequence" of the sub-object "summary" of the object "output", which represents the results from the blast comparison. Sequences having a length greater than 300 are selected from the blast results. The user is then provided with the value of the "accessor" field of the variable h and with the value of the "fragId" field of the variable r, for all the objects which fulfill the query.
  • BLAST as a homology method
  • other types of homology servers may also be used, for example BLASTX, BLASTN and BLASTP.
  • other types of data manipulation may be provided, for example, error correction, in which a sequence is corrected for various types of errors.
  • Another type of data manipulation server is, for example, a server which guesses a tertiary structure of a protein from its sequence, for example the number of alpha helices or the protein's affinity to a certain DNA sequence.
  • the server may provide a grading facility which grades a list of provided sequences for affinity to the protein (or for similarity of their derived protein) or which selects those sequences which have a certain affinity.
  • a homology search can compare a first set of records against records in a second database (fixed value) or against a second set of provided records.
  • three or more inputs may be provided, for example where a third record set includes a list of rules which apply when comparing the two record sets.
  • all the record sets need to be fully specified before the manipulation can be performed.
  • only one or possibly not even one of the record sets needs to be fully specified before starting the manipulation.
  • the considerations for optimizing and performing in parallel can be applied to the availability of record sets as well.
  • the definitions of how the data manipulation server operates in the absence of data and/or the relative computation time for different tasks thereby are stored in directory 40, optionally along with other information useful for optimizing queries which include data manipulation.
  • graphic interface 28 may be an interface developed solely for preparing queries for database servers, as described, for example, in Kosky, A.S., Chen, I.A., Markowitz, V.M., and Szeto, E. "Exploring Heterogeneous Biological Databases: Tools and Applications", Proceedings of the 6th International Conference on Extending Database Technology (EDBT'98), Lecture Notes in Computer Science, Vol. 1377, Springer-Verlag, 1998, pp. 499-513, the disclosure of which is incorporated herein by reference.
  • a user may use this interface to prepare sophisticated queries which include access to data manipulation servers, such as homology search engines.
  • optimization tools designed for database queries may be applied, in accordance with the above embodiments, to queries which include reference to data manipulation servers. Such optimization is especially important for queries which reference data manipulation servers because usually these servers require substantially more processing time than databases.
  • results of the queries are optionally provided in a single common format which allows use of a single standard output interface to display the results.
  • variables representing database and pseudo database objects may be linked together using methods for linking databases described, for example, in the EDBT'98 publication referenced hereinabove. These linking methods allow simpler statement of queries and hence more transparency to the user who does not need to know the structure of the various servers used.
  • a query may include an explicit command to be carried out by a data manipulation server, e.g., homology engine 26.
  • such commands are referred to herein as application specific data type (ASDT) commands.
  • Table 3 shows a query similar to the query of table 2 in which homology engine 26 is activated using explicit commands written in a format acceptable by OPM processor 34. Line
  • when processor 34 encounters an ASDT command, such as the "blast" command on line 6, it first checks with the database involved, i.e., the "local" database, whether the database supports the command in the specific syntax. Then, processor 34 consults directory 40 to determine a server which has the routine invoked by the command. Processor 34 passes the ASDT command, together with the data objects to which the command relates, directly to the determined server. Alternatively, the command is passed through translation server 42. The output from the server is optionally passed to processor 34 in a structured form, as described above, so as to allow easy manipulation of the results. In this embodiment, processor 34 does not model homology engine 26 as a database 24, but does access the homology engine from within a complex query which accesses databases.
  • Table 4 shows a query in which a command appears in the SELECT section of the query. The command is processed after the query is evaluated, at a stage of presenting the results of the query.
  • routines referenced by the ASDT commands may be evaluated by a data manipulation server as described above with reference to the blast command evaluated by homology engine 26. Alternatively or additionally, some routines may be situated within processor 34 or in directory 40.
  • stating the commands within a query, rather than invoking them on the results received from a query, is simpler for the user. In addition, invoking the commands from within the query allows the command to be performed before the results are passed to end-user 22. In many cases this conserves substantial communication resources.
  • ASDT commands may also be used to evaluate attributes which may be extracted from the image of a complex data field, for example, a gel.
  • such attributes include, for example, the length of an image of the gel, its average intensity or specific lanes of the image. Some databases therefore carry redundant data fields which hold values for these attributes. With ASDT commands, these redundant fields are not needed.
  • the routines invoked by the ASDT commands may be stored in the database 24, on a separate data manipulation server, in directory 40 and/or in processor 34.
  • ASDT commands may be invoked implicitly as described above with reference to Fig. 2.
  • a command data object is defined which includes input and output fields of the command.
  • An access to an output field of the object is translated by system 20 as an implicit invocation of the command.
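The implicit invocation described above, in which a command data object has input and output fields and reading an output field triggers the command, can be illustrated with a minimal Python sketch. This is not the patent's implementation: the names `CommandObject` and `toy_blast` are hypothetical, and the toy scoring routine merely stands in for a real homology search.

```python
from typing import Any, Callable, Dict, List, Optional


class CommandObject:
    """Hypothetical command data object: inputs are supplied up front,
    and the first read of any output field invokes the command."""

    def __init__(self, routine: Callable[..., Dict[str, Any]], **inputs: Any) -> None:
        self._routine = routine
        self._inputs = inputs
        self._outputs: Optional[Dict[str, Any]] = None
        self.calls = 0  # how many times the routine actually ran

    def __getattr__(self, name: str) -> Any:
        # __getattr__ is only reached for names not found normally,
        # which here means the output fields of the command.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._outputs is None:
            self.calls += 1
            self._outputs = self._routine(**self._inputs)
        try:
            return self._outputs[name]
        except KeyError:
            raise AttributeError(name) from None


def toy_blast(sequence: str, database: List[str]) -> Dict[str, Any]:
    # Stand-in for a homology search (assumption): score each database
    # entry by the number of position-wise matches with the query.
    scores = {s: sum(a == b for a, b in zip(sequence, s)) for s in database}
    best = max(scores, key=lambda s: scores[s])
    return {"best_hit": best, "score": scores[best]}


cmd = CommandObject(toy_blast, sequence="ACGT", database=["ACGA", "TTTT"])
print(cmd.calls)     # nothing has run yet
print(cmd.best_hit)  # reading an output field invokes the command
print(cmd.score)     # the cached outputs are reused
```

Caching the outputs after the first access is a design choice of this sketch. It keeps repeated reads of output fields from re-running a routine that, as the text notes, usually takes far longer than a database lookup.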
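The handling of an explicit ASDT command described above (check the database involved, then consult the directory for a server that has the routine, then pass the command and its data objects to that server) can be sketched as follows. The sketch rests on assumptions: `Server` and `dispatch` are hypothetical names, the directory is modelled as a plain dictionary, the directory is consulted only when the local database does not support the command, and the optional pass through a translation server is omitted.

```python
from typing import Any, Callable, Dict, Tuple

Routine = Callable[[Any], Any]


class Server:
    """Hypothetical database or data manipulation server exposing named routines."""

    def __init__(self, name: str, routines: Dict[str, Routine]) -> None:
        self.name = name
        self.routines = routines

    def supports(self, command: str) -> bool:
        return command in self.routines

    def run(self, command: str, data: Any) -> Any:
        return self.routines[command](data)


def dispatch(command: str, data: Any, local_db: Server,
             directory: Dict[str, Server]) -> Tuple[str, Any]:
    """Return (name of the server that ran the command, its result)."""
    # Step 1: ask the database involved whether it supports the command.
    if local_db.supports(command):
        return local_db.name, local_db.run(command, data)
    # Step 2: otherwise look up a server that has the routine.
    server = directory.get(command)
    if server is None:
        raise LookupError(f"no server registered for command {command!r}")
    # Step 3: pass the command and its data objects to that server.
    return server.name, server.run(command, data)


local = Server("local", {"length": len})
# Toy "blast": sorting stands in for a real homology search (assumption).
homology = Server("homology_engine", {"blast": sorted})
directory = {"blast": homology}

print(dispatch("length", "ACGT", local, directory))
print(dispatch("blast", ["TT", "AA"], local, directory))
```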

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a multi-database query system which queries a plurality of databases and servers. The system comprises an input which receives queries in a structured form and a translation server which translates at least part of a received query into commands recognized by a data manipulation server.
PCT/IB2000/000863 1999-06-29 2000-06-28 Biological data processing WO2001001294A2 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP00938960A EP1228447A2 (fr) 1999-06-29 2000-06-28 Biological data processing
CA002377823A CA2377823A1 (fr) 1999-06-29 2000-06-28 Biological data processing
US10/018,461 US6931396B1 (en) 1999-06-29 2000-06-28 Biological data processing
AU54179/00A AU774973B2 (en) 1999-06-29 2000-06-28 Biological data processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14142499P 1999-06-29 1999-06-29
US60/141,424 1999-06-29

Publications (3)

Publication Number Publication Date
WO2001001294A2 WO2001001294A2 (fr) 2001-01-04
WO2001001294A8 WO2001001294A8 (fr) 2001-05-17
WO2001001294A3 WO2001001294A3 (fr) 2002-05-23

Family

ID=22495630

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2000/000863 WO2001001294A2 (fr) 1999-06-29 2000-06-28 Biological data processing

Country Status (4)

Country Link
EP (1) EP1228447A2 (fr)
AU (1) AU774973B2 (fr)
CA (1) CA2377823A1 (fr)
WO (1) WO2001001294A2 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1349081A1 (fr) * 2002-03-28 2003-10-01 LION Bioscience AG Method and apparatus for querying relational databases
GB2409919A (en) * 2001-07-06 2005-07-13 Livedevices Ltd Accessing a relational database from an embedded computing device
US7444308B2 (en) 2001-06-15 2008-10-28 Health Discovery Corporation Data mining platform for bioinformatics and other knowledge discovery
US7921068B2 (en) 1998-05-01 2011-04-05 Health Discovery Corporation Data mining platform for knowledge discovery from heterogeneous data types and/or heterogeneous data sources

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5495606A (en) * 1993-11-04 1996-02-27 International Business Machines Corporation System for parallel processing of complex read-only database queries using master and slave central processor complexes
US5835755A (en) * 1994-04-04 1998-11-10 At&T Global Information Solutions Company Multi-processor computer system for operating parallel client/server database processes
US5873083A (en) * 1995-10-20 1999-02-16 Ncr Corporation Method and apparatus for extending a relational database management system using a federated coordinator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MATSUDA H ET AL: "QUERYING MOLECULAR BIOLOGY DATABASES BY INTEGRATION USING MULTIAGENTS" IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, INSTITUTE OF ELECTRONICS INFORMATION AND COMM. ENG. TOKYO, JP, vol. E82-D, no. 1, January 1999 (1999-01), pages 199-207, XP000831460 ISSN: 0916-8532 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7542947B2 (en) 1998-05-01 2009-06-02 Health Discovery Corporation Data mining platform for bioinformatics and other knowledge discovery
US7921068B2 (en) 1998-05-01 2011-04-05 Health Discovery Corporation Data mining platform for knowledge discovery from heterogeneous data types and/or heterogeneous data sources
US8126825B2 (en) 1998-05-01 2012-02-28 Health Discovery Corporation Method for visualizing feature ranking of a subset of features for classifying data using a learning machine
US7444308B2 (en) 2001-06-15 2008-10-28 Health Discovery Corporation Data mining platform for bioinformatics and other knowledge discovery
GB2409919A (en) * 2001-07-06 2005-07-13 Livedevices Ltd Accessing a relational database from an embedded computing device
GB2409919B (en) * 2001-07-06 2005-11-23 Livedevices Ltd Improvements relating to internet-connected devices
US8321337B2 (en) 2001-07-06 2012-11-27 Live Devices Limited Internet-connected devices
EP1349081A1 (fr) * 2002-03-28 2003-10-01 LION Bioscience AG Method and apparatus for querying relational databases
WO2003083713A1 (fr) * 2002-03-28 2003-10-09 Lion Bioscience Ag Method and apparatus for querying relational databases

Also Published As

Publication number Publication date
EP1228447A2 (fr) 2002-08-07
AU5417900A (en) 2001-01-31
WO2001001294A3 (fr) 2002-05-23
CA2377823A1 (fr) 2001-01-04
WO2001001294A8 (fr) 2001-05-17
AU774973B2 (en) 2004-07-15

Similar Documents

Publication Publication Date Title
US6041344A (en) Apparatus and method for passing statements to foreign databases by using a virtual package
US5826258A (en) Method and apparatus for structuring the querying and interpretation of semistructured information
US6457003B1 (en) Methods, systems and computer program products for logical access of data sources utilizing standard relational database management systems
US7447624B2 (en) Generation of localized software applications
US7533136B2 (en) Efficient implementation of multiple work areas in a file system like repository that supports file versioning
US7315853B2 (en) Service-oriented architecture for accessing reports in legacy systems
US7409401B2 (en) Method and system for supporting multivalue attributes in a database system
US5859972A (en) Multiple server repository and multiple server remote application virtual client computer
US8290947B2 (en) Federated search
US9367588B2 (en) Method and system for assessing relevant properties of work contexts for use by information services
US7027975B1 (en) Guided natural language interface system and method
US6236997B1 (en) Apparatus and method for accessing foreign databases in a heterogeneous database system
US20050028156A1 (en) Automatic method and system for formulating and transforming representations of context used by information services
US8117194B2 (en) Method and system for performing multilingual document searches
US20080215564A1 (en) Query rewrite
US6931396B1 (en) Biological data processing
US7792857B1 (en) Migration of content when accessed using federated search
US7543004B2 (en) Efficient support for workspace-local queries in a repository that supports file versioning
AU2010241304B2 (en) Systems, methods, and software for retrieving information using multiple query languages
EP1909170B1 (fr) Procédé et système pour générer automatiquement une interface de communication
AU774973B2 (en) Biological data processing
US20040111416A1 (en) System and method for communicating data to a process
Mahoui et al. A dynamic workflow approach for the integration of bioinformatics services
US8301657B1 (en) Set-level database access for performing row-sequential operations
Li et al. WAAC: An End-to-End Web API Automatic Calls Approach for Goal-Oriented Intelligent Services

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

CFP Corrected version of a pamphlet front page
CR1 Correction of entry in section i
WWE Wipo information: entry into national phase

Ref document number: 2377823

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 54179/00

Country of ref document: AU

WWE Wipo information: entry into national phase

Ref document number: 2000938960

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

WWP Wipo information: published in national office

Ref document number: 2000938960

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 10018461

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: JP

WWG Wipo information: grant in national office

Ref document number: 54179/00

Country of ref document: AU