WO1998049632A1 - System and method for entity-based data retrieval - Google Patents

Publication number
WO1998049632A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
field
database
search
type
Prior art date
Application number
PCT/US1998/008243
Other languages
French (fr)
Inventor
Scott Huffman
Catherine Baudin
Robert A. Nado
Original Assignee
Price Waterhouse, Llp
Priority date
Filing date
Publication date
Application filed by Price Waterhouse, Llp
Priority to AU72557/98A
Publication of WO1998049632A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Definitions

  • Databases employing this type of search and schema typically include data warehousing and multi-database SQL systems. Moreover, the problems of creating the database maps from the local to a global schema and of rigidly normalizing the data across all databases are compounded when different databases are created at different times by different users in different locations. Integrating these disparate databases into the global schema thus requires the expenditure of large amounts of time and effort, as well as the resources to continually conform new databases to the global schema.
  • The present invention overcomes the disadvantages of the prior art systems by providing a method and system to search multiple databases.
  • the present invention uses limited meta-information about each database, and various heuristic rules in generating an index database for conducting an entity-based search across multiple databases that have highly heterogeneous data values. This allows a search that is more powerful and precise than a full-text search, but does not require the high costs of strongly typed, data normalized and complete and detailed mapping of each database to a global schema. It is therefore an object of the present invention to provide a system and method for creating an entity-based search for searching across a plurality of databases by classifying data into entity types. The system and method selects a database.
  • The database contains fields, and in general the records of the database store data in those fields.
  • the system and method selects a field, assigns an entity type to the field, and creates an index of the entity type assigned to the field so that a search across the plurality of databases based upon the entity type can be accomplished. It is also another object of this invention to provide a system and method for creating an entity-based search across a plurality of databases by classifying data into entity types wherein the system and method assigns a first entity type to a field in a database; assigns a second entity type to a query; searches the database for the query wherein the search uses the first and second entity types to determine a match; and outputs the result of the search.
  • FIG. 1 is a block diagram of a system depicting various aspects of the invention.
  • FIGS. 2-4 show exemplary databases and their structures.
  • FIG. 5 is a block diagram depicting the interrelationships in one embodiment between the databases of the present invention.
  • FIGS. 6a-6c are a flow chart depicting steps according to one aspect of the present invention.
  • FIGS. 7a-7k show exemplary databases and their structures.
  • FIG. 8 is a flow chart of the field name analysis steps according to one aspect of the present invention.
  • FIG. 9 is a flow chart of the field value analysis steps according to one aspect of the present invention.
  • FIG. 10 is an exemplary query screen form.
  • FIGS. 11 and 12 are exemplary profile output screens.
  • FIG. 1 depicts a system according to one aspect of the invention.
  • The apparatus of FIG. 1 includes Computer 101 for executing program instructions. These program instructions can be stored in RAM/ROM 103 and/or Mass-Storage Device 102.
  • Peripherals 104 may include a monitor, keyboard, and mouse to allow an operator to receive, manipulate and input information.
  • Computer 101 can access Databases 105, which may be local or remote databases or tables, via Network 106.
  • Network 106 can include local area networks, wide area networks, or global networks such as the Internet.
  • Mass Storage Device 102 stores the database, depicted as Collections Database 107, to implement one aspect of the invention.
  • Mass Storage Device 102 is not limited to a single physical disk; it may in fact be composed of several disks, such as a disk cluster or a RAID disk array.
  • Databases 105 include Lotus Notes databases, although other database programs, such as Oracle, Informix, Sybase, Microsoft Access and Claris FileMaker Pro databases, as well as semi-structured Web pages and Internet accessible databases (usually via the World Wide Web), can also be used.
  • Collections Database 107 is comprised of several tables using, in the preferred embodiment, Microsoft Access. Again, other database programs (relational, object-oriented or otherwise) can be used to create Collections Database 107.
  • a database is a collection of related documents.
  • a Lotus Notes collection and a document are analogous to a table and to a row or a record, in a relational database (such as a Microsoft Access database), respectively, and the terms “collection” and “database” will hereinafter be used interchangeably. Accordingly, although the preferred embodiment searches across multiple Lotus Notes databases, it is not limited to Lotus Notes databases, but can also be used with other types of databases and Web pages.
  • the document/row/record stores data in a field or fields.
  • The term "field" can refer not only to the specific data storage element in a document/row/record but also, in the abstract, to the field of a collection or database, of which the field of a document/row/record is an instance.
  • An entity can be a person, such as "Bob Smith," or a company, such as "Apple Computer," or even a skill, such as "SAP Programming."
  • Each entity has attributes associated with it, which can include an entity type, and, optionally, an entity role.
  • entity types include person, company, skills, telephone number, office and dollar entity types.
  • Entity roles further describe and refine the entity attributes and are generally entity type specific. Examples of entity roles for a person entity type can include partner, manager, contact, author and reviewer entity roles. Examples of entity roles for a company entity type can include client and vendor entity roles.
  • An entity such as "Bob Smith" may have a person entity type and can, for example, have an entity role of a partner on an engagement or an entity role of a contact, depending on the collection and the results of analysis by Computer 101.
  • collections can also be classified into collection types (or database types for other non-Notes databases, used hereinafter interchangeably).
  • a collection of telephone numbers can be classified, for example, into a directories collection type, while a collection of time spent on various client matters might be classified as a client-engagements collection type.
  • Field meta-information, such as field names, name patterns, and field values, is used to generate the collection type and the fields' entity type and entity role attributes. This attribute meta-information is subsequently used to conduct an entity-based search across multiple databases more efficiently.
  • Computer 101 analyzes the fields in a collection to determine fields which hold references to entities by using the field meta-information. These fields are then assigned an entity type and, optionally, an entity role, which are then further used to assign a collection type to the collection.
  • FIGS. 2 through 4 depict the structure of exemplary Databases 105 as may have been created by different departments.
  • FIG. 2 shows Accounting Collection 201 that includes the following five fields, set forth in Table 1, for each document:
  • FIG. 3 shows Experts Collection 301, which includes the following five fields, set forth in Table 2, for each document:
  • FIG. 4 shows Sales Collection 401, which includes the following four fields, set forth in Table 3, for each document:
  • FIG. 5 shows Collection Database 107 as comprised of several tables, and their interrelationships, in one embodiment of the present invention. A description of each table is given below.
  • Symbol Table 501 stores symbol information. These symbols have a name and type, stored in the fields SymbolName and SymbolType, respectively. For instance, symbols are stored referring to the different entity types, entity roles and collection types. The fields in Symbol Table 501 are explained in Table 4.
  • Symbol Table 501 is populated with predetermined values for access by Computer 101.
  • FIG. 7a depicts the structure of Symbol Table 501 with sample records and data.
  • Collection Table 502 stores information pertaining to each collection to be searched. It is related to Symbol Table 501, via the MediaTypeID field, to identify the collection's media type, e.g., Notes, Oracle, Sybase, Web, etc. This is shown schematically as Relation 525 in FIG. 5. The fields in Collection Table 502 are explained in Table 5.
  • FIG. 7b depicts the structure of Collection Table 502 with sample records and data.
  • CollectionInfo Table 503 stores information relating specifically to a Collection. It is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 521 in FIG. 5.
  • The fields in CollectionInfo Table 503 are explained in Table 6.
  • FIG. 7c depicts the structure of CollectionInfo Table 503 with sample records and data.
  • WebCollectionInfo Table 504 stores information relating specifically to collections of Web pages. It is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 522 in FIG. 5. The fields in WebCollectionInfo Table 504 are explained in Table 7.
  • WebCollectionInfo Table 504 is a separate table. Other methods are available to store this information; for example, it can be stored together with CollectionInfo Table 503 as one table.
  • FIG. 7d depicts the structure of WebCollectionInfo Table 504 with sample records and data.
  • CollectionFeatures Table 505 stores information about the attributes of a collection. These attributes include, but are not limited to, Collection Type, Secondary Collection Type, Division, and GeographicRegion.
  • the Collection Type attribute is the primary attribute while the Secondary Collection Type may optionally describe the primary Collection Type in greater detail.
  • GeographicRegion attribute can describe whether the collection contains data relating to companies in the United States or to companies in Europe. These attributes are used by Computer 101 to help focus the search results.
  • This table is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 523 in FIG. 5. The fields in CollectionFeatures Table 505 are explained in Table 8.
  • FIG. 7e depicts the structure of CollectionFeatures Table 505 with sample records and data.
  • A table may have an entry in the AttributeID field of CollectionFeatures Table 505 pointing to an entry in Symbol Table 501 indicating that the collection has a "Collection Type," and a ValueID which points to an entry in Symbol Table 501 having the value of "client-engagements," thus indicating that the collection is of a client-engagements collection type.
  • Document Table 506 stores meta-information about each document indexed by Computer 101. It is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 524 in FIG. 5. The fields in Document Table 506 are explained in Table 9.
  • FIG. 7f depicts the structure of Document Table 506 with sample records and data.
  • Fields Table 507 contains information relating to the individual fields of a collection. It is related to Collection Table 502 via the CollectionID field and to Symbol Table 501 via the TypeID and RoleID fields. These relations are shown schematically in FIG. 5 as Relation 526 and Relation 529, respectively. The fields in Fields Table 507 are explained in Table 10.
  • FIG. 7g depicts the structure of Fields Table 507 with sample records and data.
  • SampleFieldValues Table 508 contains sample values for each field in Fields Table 507. These values are used by Computer 101 to infer entity types and entity roles for each field. It is related to Fields Table 507 via the FieldID field. This is shown schematically as Relation 530 in FIG. 5. The fields in SampleFieldValues Table 508 are explained in Table 11. FIG. 7k depicts the structure of SampleFieldValues Table 508.
  • Index Table 509 stores in each record the values of each indexed field of each document that is indexed according to the IndexMark field. It is related to Document Table 506, Symbol Table 501 and Fields Table 507 via the DocumentID, TypeID and RoleID, and FieldID fields. This is shown schematically in FIG. 5 as Relations 531, 532 and 533, respectively. The fields in Index Table 509 are explained in Table 12 and illustrated in FIG. 7h.
  • Relation Table 510 stores the relations used in a profile summary. The fields in Relation Table 510 are explained in Table 13 and illustrated in FIG. 7i.
  • RelationMapping Table 511 stores the detailed information that specifies, for each relation and relevant collection, the fields whose values are used. It is related to Collection Table 502 and Relation Table 510 via the CollectionID and RelationID fields. This is shown schematically in FIG. 5 as Relation 527 and Relation 528, respectively. The fields in RelationMapping Table 511 are explained in Table 14 and illustrated in FIG. 7j.
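  • By way of illustration only, part of the table structure described above can be sketched as a relational schema. The sketch below uses Python's sqlite3 module rather than Microsoft Access, and covers only four of the tables. The column names SymbolName, SymbolType, CollectionID, CollectionName, LocationURL, MediaTypeID, LastIndexedDate, FieldID, TypeID, RoleID, IndexMark, SynopsisMark, DocumentID, TupleID and Synopsis are taken from the description; the key name SymbolID, the column FieldName, all SQL types, and the function name create_collections_db are assumptions, since Tables 4 through 14 are not reproduced here.

```python
import sqlite3

# Assumed schema: column names follow the description where it names them;
# SymbolID, FieldName and every SQL type are illustrative guesses.
SCHEMA = """
CREATE TABLE Symbol (
    SymbolID   INTEGER PRIMARY KEY,
    SymbolName TEXT NOT NULL,
    SymbolType TEXT NOT NULL
);
CREATE TABLE Collection (
    CollectionID    INTEGER PRIMARY KEY,
    CollectionName  TEXT,
    LocationURL     TEXT,
    MediaTypeID     INTEGER REFERENCES Symbol(SymbolID),
    LastIndexedDate TEXT
);
CREATE TABLE Fields (
    FieldID      INTEGER PRIMARY KEY,
    CollectionID INTEGER REFERENCES Collection(CollectionID),
    FieldName    TEXT,
    TypeID       INTEGER REFERENCES Symbol(SymbolID),
    RoleID       INTEGER REFERENCES Symbol(SymbolID),
    IndexMark    TEXT,
    SynopsisMark TEXT
);
CREATE TABLE Document (
    DocumentID   INTEGER PRIMARY KEY,
    CollectionID INTEGER REFERENCES Collection(CollectionID),
    TupleID      TEXT,
    Synopsis     TEXT
);
"""


def create_collections_db(path=":memory:"):
    """Create an empty database holding the sketched tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

  • In this sketch, entity types, entity roles and media types would all be rows of the Symbol table, so that the TypeID, RoleID and MediaTypeID columns can refer to them by key.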
  • Collection Table 502 refers to a collection (which can, for example, be a Notes collection, a relational database, or a Web page) to be analyzed and indexed.
  • The CollectionID is generally a unique, numerical value.
  • Computer 101 then stores the collection's name, location and media type in the CollectionName, LocationURL and MediaTypeID fields, respectively. The remaining fields contain null values.
  • the process of determining which collections should be included in the table could involve Computer 101 searching a network for all available Notes collections, or importing a file containing the collection name and location, or even involving an operator who instructs Computer 101 to select a particular collection.
  • Computer 101 can also scan over existing collections stored in Collection Table 502 to determine if any collection has changed its field structure. If so, then Computer 101 indicates that the collection needs to be reanalyzed and reclassified. This can be done, for instance, by setting the LastIndexedDate field to a null value.
  • the data shown in FIG. 7b is an example of the data stored using Accounting Collection 201, Experts Collection 301, Sales Collection 401, and a Web page.
  • Computer 101 also creates or updates records in CollectionInfo Table 503 (FIG. 7c) and WebCollectionInfo Table 504 (FIG. 7d).
  • CollectionID 1 from Collection Table 502, which refers to Accounting Collection 201
  • Computer 101 creates or updates the relevant record in CollectionInfo Table 503 with data relating to Accounting Collection 201, such as the server name and collection file name, as shown in FIG. 7c.
  • Computer 101 inserts the relevant information into WebInfoID record 1 with a default page depth level of 2.
  • The depth level is the number of levels down to index when the list of links on a Web page is viewed as a hierarchical tree structure.
  • CollectionInfo Table 503 and WebCollectionInfo Table 504 may be populated with the desired collection records and information, and Computer 101 then populates Collection Table 502 from these tables.
  • Computer 101 would have to update the CollectionID field in CollectionInfo Table 503 and WebCollectionInfo Table 504 for new collections once it has created a new record in Collection Table 502.
  • Computer 101 proceeds to step S602 to select a collection to analyze and classify from Collection Table 502. In general, all newly added or structurally modified collections need to be analyzed and classified.
  • Computer 101 can determine if a collection is newly added or structurally modified by examining, for instance, the LastIndexedDate field of Collection Table 502. If this field has a null value, then the collection is considered newly added. Another way would be to use the MediaTypeID field instead of the LastIndexedDate field. Yet another way would be to flag (using a Flag field, not shown) all the newly added or structurally modified collections in Collection Table 502 and use the flags to select the collections for analysis and classification. Computer 101 proceeds to step S603 once it has selected a collection. In step S603, Computer 101 extracts the fields and fields' meta-information from the selected collection.
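  • The null-date test described above can be sketched as follows. This is a minimal illustration only: the rows are hypothetical dictionaries standing in for records of Collection Table 502, a null value is represented by Python's None, and the function names are assumptions.

```python
def mark_for_reanalysis(collection_row):
    """Flag a structurally modified collection by nulling its LastIndexedDate."""
    collection_row["LastIndexedDate"] = None


def collections_needing_analysis(collection_rows):
    """Return the rows whose LastIndexedDate is null (None in this sketch)."""
    return [row for row in collection_rows if row.get("LastIndexedDate") is None]


# Hypothetical rows; the dates and names are invented for the example.
rows = [
    {"CollectionID": 1, "CollectionName": "Accounting", "LastIndexedDate": None},
    {"CollectionID": 2, "CollectionName": "Experts", "LastIndexedDate": "1998-04-01"},
]
pending = collections_needing_analysis(rows)
```

  • A Flag field, as mentioned above, would work the same way with a different predicate.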
  • For illustrative purposes, Computer 101 uses Accounting Collection 201 and Experts Collection 301 to create or update the relevant records in Fields Table 507 (FIG. 7g).
  • In FIG. 7g, sample records and data are depicted which pertain to these two collections.
  • Computer 101 queries Accounting Collection 201 requesting the names and data types of all fields in that collection.
  • each type of database program (Lotus Notes, Oracle, Sybase, Informix, etc.) has a way for Computer 101 to obtain the field information as a function call.
  • Computer 101 creates a new record and stores the CollectionID value, field name and field data type.
  • The SynopsisMark field, if set to "Yes," indicates that the value of this field is to be used as part of a synopsis string to be presented to the user.
  • only values in the Client, Matter and PrtnrName fields of Accounting Collection 201 and only the Name, Number and Expertise fields of Experts Collection 301 will be used to form the synopsis text which will be displayed for each document if selected as a match.
  • Computer 101 also populates SampleFieldValues Table 508 by extracting and storing a number of data entries for each field that will be analyzed, as shown in FIG. 7k. The amount of data to be extracted can vary. For example, Computer 101 can default to retrieving the first 100 entries with data, and more if necessary later on.
  • Computer 101 proceeds to step S604 to select a field to analyze and classify.
  • Computer 101 can determine if a field has been added or modified by examining, for each record, the TypeID and RoleID fields in Fields Table 507. Where the IndexMark value in a record is "Yes" and the TypeID and RoleID fields have a null value, this could indicate that the field in the selected collection needs to be analyzed.
  • Computer 101 in prior steps, for example, in steps S602 or S603, would have set any relevant existing records' TypeID and RoleID field values to a null value, indicating that the field has to be reanalyzed, reclassified and reindexed.
  • Computer 101 uses the IndexMark field only to determine if the field needs to be indexed for later entity-based searching; in the preferred embodiment, all fields are analyzed and classified.
  • Computer 101 analyzes the field to determine the field's entity type attribute.
  • A field's entity type describes the semantic nature of the data in the field. For example, a field with people's names would have a person entity type. Other exemplary entity types include company, skill, telephone number, office and dollar amount.
  • Computer 101 analyzes the field's name and a sample of data stored in the field as a way to determine the field's entity type.
  • FIG. 8 shows in greater detail one embodiment for analyzing a field name.
  • Computer 101 tokenizes the field name. Methods for generating tokens from a given input string are well known in the art. For example, for a field with the name Client, Computer 101 returns as a token [Client].
  • For a field with the name PrtnrName, Computer 101 returns as tokens [Prtnr] and [Name]. In step S802, Computer 101 uses the tokens generated in step S801 to create a list of possible entity types. This list may be weighted depending, for example, on the closeness of a match.
  • The token [Client] for the Client field might be indicative that the field contains a name of an entity type such as a person or company.
  • the token [Prtnr] might be indicative that the field relates to a partner, and thus of a person entity type.
  • Computer 101 maps the tokens to a dictionary of terms, which can be as simple as a list of terms, or as advanced as being a domain dependent list, and obtains a list of possible entity type matches.
  • a heuristic rule used in this step would be that if a field contains a token [By] in the last position of the field name, that might be indicative of an "action" and thus that the field may contain people or company names.
  • the patterns of the tokens are used to infer that a field named ContactPhone is preferably of the entity type telephone while a field named PhoneContact is preferably of the entity type person.
  • a system does not have to implement all these steps, depending on the confidence level desired for a proper attribute classification.
  • the result from this step is another list of possible entity types, which may be weighted, if desired.
  • Computer 101 takes the results of steps S802 and S803 and generates a set of possible entity types. This list can be generated, for example, by listing only the intersection of the results from both the field name and pattern matching analysis routines. This list can then be further processed by removing duplicate entity types and choosing only those entity types whose confidence ranking exceeds a predetermined threshold.
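  • The field name analysis of FIG. 8 can be sketched as follows, as a minimal illustration and not the disclosed implementation. The token dictionary TOKEN_TYPES, its weights, the threshold, the averaging of the two scores, and all function names are assumptions; the description does not give the contents of its dictionary or its weighting scheme. The pattern step is reduced here to a single rule, namely that the last token of the field name determines the type, which reproduces the ContactPhone and PhoneContact example.

```python
import re

# Hypothetical dictionary mapping lower-cased tokens to weighted entity types.
TOKEN_TYPES = {
    "client": {"person": 0.5, "company": 0.5},
    "prtnr": {"person": 0.8},
    "name": {"person": 0.4, "company": 0.4},
    "phone": {"telephone": 0.8},
    "contact": {"person": 0.7},
    "by": {"person": 0.6, "company": 0.6},
}


def tokenize_field_name(name):
    """S801: split a field name on case boundaries, e.g. 'PrtnrName'."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+", name)


def dictionary_candidates(tokens):
    """S802: look every token up in the dictionary, keeping the best weight."""
    scores = {}
    for token in tokens:
        for entity_type, weight in TOKEN_TYPES.get(token.lower(), {}).items():
            scores[entity_type] = max(scores.get(entity_type, 0.0), weight)
    return scores


def pattern_candidates(tokens):
    """S803 (simplified): only the last token of the name is considered."""
    if not tokens:
        return {}
    return dict(TOKEN_TYPES.get(tokens[-1].lower(), {}))


def field_name_candidates(name, threshold=0.3):
    """S804: intersect both results and drop types below the threshold."""
    tokens = tokenize_field_name(name)
    by_dictionary = dictionary_candidates(tokens)
    by_pattern = pattern_candidates(tokens)
    combined = {}
    for entity_type in by_dictionary.keys() & by_pattern.keys():
        score = (by_dictionary[entity_type] + by_pattern[entity_type]) / 2
        if score >= threshold:
            combined[entity_type] = score
    return combined
```

  • With this dictionary, the Client field remains ambiguous between person and company, which is the kind of case the field value analysis is meant to resolve.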
  • Computer 101 also analyzes a sample of the data stored within the field.
  • Computer 101 can use many different methods, either alone or in combination, depending on the level of analysis desired. Different methods of analysis include heuristic matches, keyword-based matches, and formatting criteria.
  • FIG. 9 shows one combination of these field value analysis methods.
  • Computer 101 first uses the possible entity types generated by field name analysis routines (i.e., FIG. 8) and then uses those possible entity types to conduct a field value analysis (i.e., FIG. 9) to further refine the analysis.
  • a heuristic match used in step S901
  • Computer 101 compares the values in a field against typical values for a given entity type.
  • Computer 101 may compare the values in a field against a table or list of person names. A high number of matches would be indicative that the data contains person names. Likewise, a high number of matches while comparing the values in a field against a table or list of office locations within a company, or a table of company names, would be indicative that the field contains an entity type of office or company, respectively. In the preferred embodiment, this can be accomplished by comparing each sample field value for a given field from SampleFieldValues Table 508 to different lists of person names, company names or skills. A high number of matches from a person names list and a low number of matches from the company location list would be indicative, for example, that the field has the attribute of a person entity type.
  • In step S902, the data in the fields are scanned for specific keywords. For instance, the presence of the terms "Inc.", "Corp." or "Co." may be indicative that the field contains company names, and is thus of a company entity type.
  • In step S903, Computer 101 examines how the data in the fields are formatted. As an example, short strings with capitalized words might be indicative of a person's name, and thus of a person entity type.
  • The confidence level of a correct entity type identification varies for each of the different methods. In the methods shown in FIG. 9, the confidence level of a correct identification ranges from a high confidence for the heuristic match to a low confidence level using the formatting criteria.
  • Computer 101 takes the possible entity types generated by field name analysis routines (i.e., FIG. 8) and uses those possible types to conduct a field value analysis (i.e., FIG. 9). Computer 101 then selects the entity type with the highest combined confidence level above a predetermined threshold. If the confidence level for any of the possible types is not above the predetermined threshold, then additional field samples are added to SampleFieldValues Table 508 for additional analysis. When Computer 101 has determined the selected field's entity type, it enters that information in the appropriate record and TypeID field in Fields Table 507. Thus, for the Client field of Accounting Collection 201, the field is classified as the entity type company. Likewise, the PrtnrName field is classified as entity type person.
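  • The field value analysis of FIG. 9 and the threshold selection described above can be sketched as follows. This is an illustration under stated assumptions: the reference lists, the numeric weights (chosen only so that the heuristic match counts most and the formatting criterion least), the threshold, and the function names are all hypothetical. Returning None stands for the case in which no type clears the threshold and more sample values would be fetched.

```python
import re

# Hypothetical reference lists used by the heuristic match (S901).
KNOWN_PERSONS = {"bob smith", "jane doe"}
KNOWN_COMPANIES = {"apple computer", "ibm"}
# Keywords named in the description for company names (S902).
COMPANY_KEYWORDS = re.compile(r"\b(?:Inc|Corp|Co)\.")
# Assumed weights: heuristic match high, formatting criterion low.
WEIGHTS = {"heuristic": 0.6, "keyword": 0.3, "format": 0.1}


def fraction(values, predicate):
    """Share of the sample values that satisfy the predicate."""
    return sum(1 for v in values if predicate(v)) / len(values)


def looks_like_name(value):
    """S903: a short string in which every word is capitalized."""
    words = value.split()
    return 1 <= len(words) <= 4 and all(w[0].isupper() for w in words)


def classify_field(values, candidates, threshold=0.4):
    """Score each candidate type from field name analysis; None if unsure."""
    if not values:
        return None
    scores = {}
    if "person" in candidates:
        scores["person"] = (
            WEIGHTS["heuristic"]
            * fraction(values, lambda v: v.strip().lower() in KNOWN_PERSONS)
            + WEIGHTS["format"] * fraction(values, looks_like_name)
        )
    if "company" in candidates:
        scores["company"] = (
            WEIGHTS["heuristic"]
            * fraction(values, lambda v: v.strip().lower() in KNOWN_COMPANIES)
            + WEIGHTS["keyword"]
            * fraction(values, lambda v: bool(COMPANY_KEYWORDS.search(v)))
        )
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

  • Only the person and company types are scored in this sketch; other entity types such as telephone number or dollar amount would need their own reference lists or format tests.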
  • Entity roles for a company entity type may include client, supplier and vendor.
  • Computer 101 uses in the preferred embodiment the field's entity type, determined in step S605, and the field's name.
  • the token [Prtnr] may be indicative of an entity role of partner. Routines similar to those used in classifying the field's entity type can be used to also classify the field's entity role.
  • Once the entity role has been determined, it is entered into Fields Table 507. After the field's entity role has been determined, Computer 101 proceeds to step S607. If there are more fields in the selected collection to be classified, Computer 101 returns to step S604.
  • a collection's attributes can include a collection type, which is a description of what a collection is. These types can include, for example, directory, newsletter, contact, company data, or client-engagements collections.
  • Computer 101 uses a set of rules to assist in the classification of the collection. For instance, if a collection contains a field with an entity type person (whose role is not a collection editor or author) and a field with an entity type telephone number, then that collection can be classified as of collection type contact.
  • Another rule could be that if a collection contains a field with an entity type of person in the entity role of a mentor or reviewer, and fields with entity types company and dollar amounts, then the collection can be classified as of collection type engagements. Yet another rule could be that if a collection contains fields with entity types company and dollar amounts, but no field of a person entity type, then the collection can be classified as of collection type company data.
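  • The three example rules above can be sketched as a small rule function. This is an illustration only: representing each classified field as an (entity type, entity role) pair, the order in which the rules are tried, and the function name are assumptions not found in the description.

```python
def classify_collection(fields):
    """Apply the example rules to a list of (entity_type, entity_role) pairs.

    A role of None means that no entity role was assigned to the field.
    Returns a collection type, or None when no rule applies.
    """
    types = {entity_type for entity_type, _ in fields}
    person_roles = {role for entity_type, role in fields if entity_type == "person"}

    # Rule 1: a person who is not an editor or author, plus a telephone number.
    has_non_editor_person = any(role not in ("editor", "author") for role in person_roles)
    if has_non_editor_person and "telephone number" in types:
        return "contact"

    # Rule 2: a mentor or reviewer, plus company and dollar amount fields.
    if person_roles & {"mentor", "reviewer"} and {"company", "dollar amount"} <= types:
        return "engagements"

    # Rule 3: company and dollar amount fields, but no person field at all.
    if {"company", "dollar amount"} <= types and "person" not in types:
        return "company data"

    return None
```

  • A collection that matches no rule is left unclassified here; in practice an operator could be asked to assign the type.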
  • FIG. 7e depicts sample records for CollectionFeatures Table 505 indicating that Accounting Collection 201 has a collection type of client-engagements. It is possible that a collection has multiple attributes and thus has multiple associated records in CollectionFeatures Table 505.
  • Computer 101 then proceeds to step S609. If there are additional collections to analyze, then flow proceeds back to step S602, otherwise, flow proceeds to step S610 of FIG. 6b. This completes the field and collection classification steps.
  • Computer 101 can, if desired, engage the operator in confirming or assigning the entity or collection type or entity role. Because these multiple collections are not related to one another in a standard relational format, their data is highly heterogeneous and not strongly typed. Thus, there is no single unique key which identifies the entities across the different collections and documents.
  • Computer 101 will perform a set of normalization and indexing routines.
  • The indexing routine includes determining document meta-information, generating a synopsis of each document and storing original and normalized forms of each value in an indexed field into Index Table 509 (FIG. 7h).
  • Computer 101 selects a collection to index.
  • Computer 101 selects a collection from Collection Table 502.
  • Computer 101 determines if the selected collection has been modified since the last time it was indexed.
  • step S612 the records in Document Table 506 (FIG.
  • Step S613 selects a new or modified document from the selected collection. This, of course, presupposes that new or modified documents were previously identified, such as in step S612 of FIG. 6b. Other ways include selecting those documents which are dated after the date stored in the LastIndexedDate field for that collection.
  • step S614 Computer 101 creates a new record for the document in Document Table 506 and generates a synopsis for that document.
  • a synopsis is a brief description of the document displayed to the user when the document is returned as a hit, i.e., a match to a search query.
  • the synopsis can, for example, take the form of a concatenation of the values in selected fields.
  • Each record in Document Table 506 contains information about each document in the collection to be searched. In particular, information pertaining to the collection and the unique document identifier as maintained by each specific collection's database program is stored in fields CollectionID and TupleID, respectively.
  • the synopsis is generated from a concatenation of the fields for that collection identified to be used in the synopsis, as determined by the values in the SynopsisMark field of Fields Table 507.
  • Fields Table 507 has three fields in Accounting Collection 201 selected to be part of a synopsis, as indicated by the "Yes" value in the SynopsisMark field, specifically the Client, Matter and PrtnrName fields.
  • the values in these fields are concatenated and stored in the Synopsis field in Document Table 506 to facilitate its retrieval.
  • Other ways of generating a synopsis field are well within the skill of one skilled in the art.
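  • The concatenation approach to synopsis generation described above can be sketched as follows. The dictionary representation of the Fields Table 507 records, the FieldName key, the Hours field, the sample values, the "; " separator and the function name are assumptions made for the example.

```python
def build_synopsis(document, field_records, separator="; "):
    """Concatenate the values of all fields whose SynopsisMark is "Yes"."""
    marked = [f["FieldName"] for f in field_records if f.get("SynopsisMark") == "Yes"]
    # Skip fields that are missing or empty in this particular document.
    return separator.join(str(document[name]) for name in marked if document.get(name))


# Hypothetical Fields Table 507 records for Accounting Collection 201.
accounting_fields = [
    {"FieldName": "Client", "SynopsisMark": "Yes"},
    {"FieldName": "Matter", "SynopsisMark": "Yes"},
    {"FieldName": "PrtnrName", "SynopsisMark": "Yes"},
    {"FieldName": "Hours", "SynopsisMark": "No"},
]
sample_document = {"Client": "IBM", "Matter": "Audit", "PrtnrName": "Bob Smith", "Hours": 12}
synopsis = build_synopsis(sample_document, accounting_fields)
```

  • The resulting string would be stored once in the Synopsis field of Document Table 506, so that it need not be rebuilt each time the document is returned as a hit.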
  • In this step, Computer 101 generates normal forms for the values in selected fields (as indicated by the IndexMark field of Fields Table 507) for each document, and stores the original value as well as the normal forms in Index Table 509.
  • normalization is specific to a field's entity type.
  • exemplary steps of normalizing the field's value include: (1) removing any leading "The" terms, expanding abbreviations, such as "Inc." and "Co.", uniformly capitalizing terms, and combining leading initials (e.g., "H B N" becomes "HBN" and "R. A. Nado" becomes "RA Nado"); and (2) looking up the value's leading words
  • normalization can include: (1) breaking names down into components, such as first, middle and last name (accounting, of course, for different variants, e.g., "First Middle Last", "Last, Mr.
  • Computer 101 then stores the unnormalized and normal forms in Index Table 509, as shown in FIG. 7h, and depicted with sample records relating to Accounting Collection 201. Also associated with each record is data identifying the source of the value.
  • One of these normal forms is designated as a canonical form, to be displayed when the value is matched as a "hit", by the appropriate value stored in the Canonical field.
  • a Soundex index can also be generated from the field value and stored in the Soundex field to provide an alternative or enhanced search functionality.
  • other collections may not have the RecID field whose value is to be stored in the TupleID field of Document Table 506, and other means to associate the normal form to the collection and document could then be used.
  • Index Table 509 can have many other forms, depending on the information desired to be retained, as well as the database program used. Referring back to FIG. 6b, once step S615 has been completed, Computer 101 proceeds to step S616 to determine if there are more documents to process. If there are, flow proceeds back to step S613. If not, Computer 101 updates the date in the LastIndexedDate field for the selected collection in Collection Table 502 and then proceeds to step S617. In step S617, Computer 101 determines if there are any additional collections to process. If there are, flow proceeds back to step S610. If not, flow proceeds to FIG. 6c. This completes the normalization and indexing steps. FIG.
  • In step S618, a user enters one or more search strings and identifies the entity type of each search string.
  • a user searching for the text string "IBM" would also identify to Computer 101 the query entity type, which in this example is the company entity type associated with that string.
  • Computer 101 can analyze the search string in a similar fashion to the field analysis and assign attributes to the search string.
  • FIG. 10 shows a sample query input screen. Input Boxes 1001(a), 1002(a) and 1003(a) hold the user's search strings. Entity Type Boxes 1001(b), 1002(b) and 1003(b) hold the respective search string's entity type.
  • search strings can be conjoined ("ANDed" together) to find documents containing all search strings.
  • Other embodiments allow any boolean combination of any number of search strings, an example of which is shown in FIG. 10.
  • Input Box 1001(a) has a text string value of "IBM"
  • Entity Type Box 1001(b) has an entity type value of company.
  • Other optional search qualifiers may include a date range, a collection type, and methods of sorting the resulting matches ("hits") and the number of hits to be returned.
  • the entity-based search can include a full-text search engine to search across the multiple collections not only based on entities, but also in conjunction with a full-text search.
  • a query can consist of not only a search for entity types, but also for text strings.
  • the DocumentID from Document Table 506 can be used by both the full-text and entity searches to identify the documents found.
  • Computer 101 normalizes the search string and looks up the normalized search string against the normalized index entries in Index Table 509.
  • the search string can have multiple normalized forms depending on the entity type. For example, if a document contains the term "Sun microsystems" of the company entity type, that phrase is normalized to both "Sun Microsystems" and "Sun” and stored as two entries in Index Table 509.
  • search string is "Sun microsystems"
  • “Sun Microsystems” using the normalization methods described with respect to step S615) to search for in Index Table 509.
  • Each record with an entry in the field IndexWord in Index Table 509 that matches the "Sun Microsystems" normalized form will be selected. Records with the index term "Sun" will, however, not be matched.
  • the search string was "Sun”
  • the result of the search will match records which contain the term "Sun” and may include records with the index term "Sun Microsystems.” If the search string was of entity type person, then there may be multiple normalized forms to search.
  • the search term "Mike Willis" can be normalized into three search terms: "Willis", "Willis/Michael", and "Willis/M."
  • Any record with an index term matching any of these three normalized search forms will be selected.
  • These index records identify the relevant documents.
  • the group of matched documents is called a match set. It is of course possible to search the collections without the use of Index Table 509 by normalizing the field values and query at runtime, i.e., during the search. This will, however, significantly impact the performance of the search and is generally tenable only for very small collections or groups of collections.
  • Computer 101 generates a match score for each entry in the match set.
  • matches are graded according to their match score. Take, as an example, the search string "Bob Smith”, which normalizes to “Smith/Robert.” This string matches “Smith/Robert” very well, “Smith/R” not as well, and “Smith/Susan” not at all. Moreover, because an entity can be referred to in many different formats, generating a match score requires the application of different techniques and weighting factors and can be type dependent.
  • Words appearing in one string and not the other reduce the match score by an amount proportional to the total number of words.
  • Entity-type specific methods penalize a match less if a general word, such as "Company" is not matched. If the entity type is person, then a match on the normalized forms of the names may be sufficient.
  • the preferred embodiment can use stop words (a list of words and phrases maintained by the system that are to be ignored when generating a match score) to refine the match score.
  • the present invention is not limited to the described weighting and scoring methods. These and other methods are well known in the art and can also be used without departing from the scope or spirit of the present invention.
  • Computer 101 generally relies on the normalization to be strong enough such that a match on the normal form is sufficient without requiring further word-by-word matching.
  • Computer 101 may assign a high match score (e.g., a "1" out of a number from 0 to 1) to the match. If normalization is weak, for instance, where no synonyms were found for a company during normalization, and thus only the first significant word was used, then a match score may be used to determine if a match is strong enough to be reported as a hit. After the match scores are generated, Computer 101 proceeds to step S623 where the match scores are processed. Usually, a minimum threshold is established so that matches with a match score below that threshold are not considered.
  • Computer 101 sorts, if desired, the match sets, and formats the results to be presented to the user as a hit list. Examples of sorting can include sorting by collection type, collection names, entity roles, dates, etc. If a boolean combination of several search strings was used, then a match set for each search string would be generated (steps S619 through S623 may be repeated for each search string), and Computer 101 would combine the sets (taking the union for an OR search, or the intersection for an AND search) and eliminate duplicates to generate the hit list. Flow then proceeds to step S625 to display the hit list to the user.
  • Another aspect of the present invention is the ability to obtain a profile of an entity based on the index entries related to the entity.
  • a profile provides a variety of useful information about an entity, and this information can be drawn from various documents in different collections.
  • a profile of a company could include the company's SIC code, net income, total assets, net revenues and who it is a client of.
  • a profile of a person could include that person's phone number, clients and background.
  • the present invention uses the profiled entity's entity types and, optionally, entity roles, along with predetermined relations between selected collection fields to extract profile information from selected fields of documents in various collections.
  • the values of fields containing profile information are stored in Index Table 509, making it unnecessary to have access to the original collections to retrieve those values.
  • Key to this profiling is the mapping of a "domain” to a "range”, i.e., mapping from one type of thing (the domain) to another type of thing (the range).
  • FIG. 7i there are shown sample records from selected fields of Relation Table 510.
  • this relation maps the company entity type to the background entity type -- that is, if a profile of a particular company (an entity of type company) was requested, the system could also retrieve documents in a collection with fields assigned to the background entity type.
  • FIG. 7j shows sample records of RelationMapping Table 511 which illustrate this mapping of a relation's Domain and Range to pairs of fields, the DomainField and RangeField, respectively, of a collection.
  • An example of such a profile is illustrated in FIG. 11. It is also possible that for any given relation, there are several RelationMapping Table 511 entries, thereby obtaining additional profile information from, for instance, the same collection or different collections. An example of this situation would be a profile, similar to that described with respect to FIG.
  • FIG. 11 shows a simplified screen for the profile data displayed by Computer 101 for a person.
  • the relations that should be part of a given profile are stored as a simple list of RelationIDs.
  • To extract the profile information, Computer 101 first conducts a search of Index Table 509 in a manner as previously described. The result is a first match set of records where the normalized values of the search string match the normalized values stored in Index Table 509 and have the same entity type as the search string. Next, Computer 101 filters out those records from Index Table 509 whose FieldID value is not a member of the DomainField set. The members of the DomainField set consist of the DomainField FieldID values given by RelationMapping Table 511 for the desired profile.
  • Each element in the DomainField set also has associated with it a RangeField set where the members of the RangeField set consist of the RangeField FieldID values for a given DomainField FieldID value, or element.
  • the profile for a person entity type has two RelationMapIDs, 1 and 3, and thus the DomainField set has only one element, the FieldID value 7.
  • the RangeField set for this element consists of two elements, the FieldID values 8 and 9.
  • the remaining records retrieved from Index Table 509 will have matched a normalized form of "Catherine Baudin" and have a FieldID value in the DomainField set (i.e., 7) to form a second match set.
  • For each record in the second match set, Computer 101 retrieves those records from Index Table 509 which also match that second match set record's DocumentID value and have FieldID values that are a member of the RangeField set associated with that second match set record's FieldID value, to form a third match set.
  • the FieldID value at this point is necessarily an element in the DomainField set.
  • the FieldID value of each record in the second match set is 7 (since there is only one element in the DomainField set), and thus, the RangeField set is comprised of FieldID values 8 and 9.
  • Computer 101 therefore only retrieves those records from Index Table 509 whose DocumentID matches the second match set record's DocumentID value and whose FieldID is 8 or 9.
  • the records from the third match set thus contain the desired profile values, which can then be further manipulated, formatted and output for display as shown in FIG. 11.
  • Computer 101 can combine the steps creating the first and second match sets to instead conduct an entity-based search in Index Table 509 for the normalized form(s) of "Catherine Baudin", with the person entity type and a FieldID value in the set of DomainField FieldID values and using the resulting set of records as the second match set for later steps.
  • separate tables are instead created for each type of profile information during the indexing of a collection.
  • FIG. 12 depicts an example of a profile for a company which includes the company's SIC Code.
  • the company profile also contains an "inverse" profile relation.
  • An inverse profile relation uses another existing relation and indicates to Computer 101 to switch that relation's DomainField values with the RangeField values during the search and filtering steps. For instance, the inverse profile information in FIG.
  • a company profile, however, has a Domain of the company entity type, and thus, the inverse relation indicates that the Client relation of Relation Table 510 should be used, but that the DomainField values and RangeField values of the associated RelationMapping Table 511 record should be exchanged for use in the previously described profile search and filtering process of Index Table 509.
  • the partner information from Index Table 509 is subsequently displayed, i.e., the partners' names whose client is the company being profiled.
  • This feature provides the added advantage that inverse relations do not require separate entries in RelationMapping Table 511.
  • Yet another aspect of the invention is the ability to conduct boolean searches using both a full-text search and an entity-based search.
  • Computer 101 conducts both full-text and entity-based searches and combines the search results according to the desired boolean criteria.
  • This type of search could be used, for instance, in a profile search where the full-text search was also conducted to provide information about documents which mention the entity being profiled.
  • Yet another aspect of the invention is the ability to obtain cross-references to other entities by analyzing the results of an entity-based search on a first entity and then finding references to other entities in the documents identified by the first entity-based search. This provides a means to obtain information about what other entities the entity being searched is associated with or related to. These references can then be used, if desired, as the basis for new searches. In the manner described above, the present invention thus provides a system and method to search multiple databases in a focused and efficient manner.
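The profile extraction described above (first, second and third match sets) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the patent's implementation: the `IndexRecord` class, the `extract_profile` function, the normalized form "Baudin/Catherine", the telephone and background values, and the FieldID 4 record are all hypothetical. Only the DomainField FieldID 7 and the RangeField FieldIDs 8 and 9 come from the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    """One simplified row of Index Table 509 (field names are illustrative)."""
    document_id: int
    field_id: int
    type_id: str      # entity type, e.g. "person"
    index_word: str   # a normalized form of the value
    value: str        # the original, unnormalized value


def extract_profile(index, query_forms, entity_type, relation_map):
    """relation_map maps each DomainField FieldID to its set of RangeField FieldIDs."""
    # First match set: same entity type and a matching normalized form.
    first = [r for r in index
             if r.type_id == entity_type and r.index_word in query_forms]
    # Second match set: keep only records whose FieldID is in the DomainField set.
    second = [r for r in first if r.field_id in relation_map]
    # Third match set: records of the same document whose FieldID is in the
    # RangeField set associated with the second-set record's FieldID.
    third = []
    for hit in second:
        range_fields = relation_map[hit.field_id]
        for r in index:
            if (r.document_id == hit.document_id
                    and r.field_id in range_fields
                    and r not in third):
                third.append(r)
    return third


# Hypothetical sample rows; document 2 mentions the person in a non-domain field.
index = [
    IndexRecord(1, 7, "person", "Baudin/Catherine", "Catherine Baudin"),
    IndexRecord(1, 8, "telephone number", "555-0100", "555-0100"),
    IndexRecord(1, 9, "background", "Knowledge Management", "knowledge management"),
    IndexRecord(2, 4, "person", "Baudin/Catherine", "C. Baudin"),
    IndexRecord(2, 8, "telephone number", "555-0199", "555-0199"),
    IndexRecord(3, 7, "person", "Nado/Robert", "Robert A. Nado"),
    IndexRecord(3, 8, "telephone number", "555-0142", "555-0142"),
]
query_forms = {"Baudin", "Baudin/Catherine", "Baudin/C"}
profile = extract_profile(index, query_forms, "person", {7: {8, 9}})
profile_values = [r.value for r in profile]
```

In this sketch only document 1 contributes profile values, because document 2 matches the name in a field outside the DomainField set and document 3 concerns a different person.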

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system and method for searching across multiple databases by classifying the data into entities. One aspect of the invention includes using a Computer 101 to analyze a database's fields and assign those fields specific entity types and entity roles. These are subsequently used to conduct, facilitate, and focus the search.

Description

SYSTEM AND METHOD FOR ENTITY-BASED DATA RETRIEVAL
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to a system and method for searching across multiple databases.
2. Description of the Related Art
The increasing decentralization of databases, both of traditional databases and information sharing database architectures, such as the World Wide Web ("WWW") and Lotus Notes, allows individuals to easily create new databases and add information to new and existing databases on an informal and ongoing basis. Although the retrieval of information from a single database is easily accomplished, harnessing the information contained in the multitude of databases distributed across a corporation or global network requires the ability to search across these multiple databases.
One way to search multiple databases is to create a full-text index database of each database and query the index using a full-text search. This approach requires very little meta-information about each database and allows a user to query multiple databases with little or no knowledge of each database schema being searched. In addition, the user does not need to know the location of each database. The cost of implementing such a full-text index is low and this method can easily accommodate the addition of new databases. However, a full-text search returns results of low precision, does not generally employ data normalization, and is of low expressiveness.
An alternative way to search across multiple databases is to impose a global schema over each individual database. This method requires significant amounts of meta-information about each database and requires that each database be strongly typed, structured and normalized. By imposing these rigid requirements on each database, a query can search over consistent, normalized terms, be highly expressive, and return search results with a high degree of precision. Unfortunately, this method also imposes a high cost by requiring an inflexible global schema, homogeneous data values, data normalization rules, and complete and detailed database maps from each individual database to the global schema.
Databases employing this type of search and schema typically include data warehousing and multi-database SQL systems. Moreover, the problems of creating the database maps from the local to a global schema and of rigidly normalizing the data across all databases are compounded when different databases are created at different times by different users in different locations. Integrating these disparate databases into the global schema thus requires the expenditure of large amounts of time and effort, as well as the resources to continually conform new databases to the global schema.
SUMMARY OF THE INVENTION
In view of the foregoing, the present invention provides a unique system that overcomes the disadvantages of the prior art systems by providing a method and system to search multiple databases. The present invention uses limited meta-information about each database, and various heuristic rules, in generating an index database for conducting an entity-based search across multiple databases that have highly heterogeneous data values. This allows a search that is more powerful and precise than a full-text search, but does not require the high costs of strong typing, data normalization, and complete and detailed mapping of each database to a global schema.
It is therefore an object of the present invention to provide a system and method for creating an entity-based search for searching across a plurality of databases by classifying data into entity types. The system and method selects a database. The database (in general) contains fields, where records of the database store data in the fields. The system and method selects a field, assigns an entity type to the field, and creates an index of the entity type assigned to the field so that a search across the plurality of databases based upon the entity type can be accomplished.
It is also another object of this invention to provide a system and method for creating an entity-based search across a plurality of databases by classifying data into entity types wherein the system and method assigns a first entity type to a field in a database; assigns a second entity type to a query; searches the database for the query wherein the search uses the first and second entity types to determine a match; and outputs the result of the search.
It is yet another object of this invention to provide a system and method for creating an entity-based search across a plurality of databases by classifying data into entity types wherein the system and method assigns a first entity type to a field in a database; assigns a second entity type to a query; searches an index for the query wherein the search uses the first and second entity types to determine a match; and outputs the result of the search.
It is another object of this invention to provide a system and method for extracting meta-information from a database. It is another object of this invention to provide a system and method for heuristically matching query terms to data in a database. These and additional objects of this invention can be obtained by reference to the following detailed description of the preferred embodiments thereof in connection with the attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a system depicting various aspects of the invention.
FIGS. 2-4 show exemplary databases and their structures.
FIG. 5 is a block diagram depicting the interrelationships in one embodiment between the databases of the present invention.
FIGS. 6a-6c are a flow chart depicting steps according to one aspect of the present invention.
FIGS. 7a-7k show exemplary databases and their structures.
FIG. 8 is a flow chart of the field name analysis steps according to one aspect of the present invention.
FIG. 9 is a flow chart of the field value analysis steps according to one aspect of the present invention.
FIG. 10 is an exemplary query screen form.
FIGS. 11 and 12 are exemplary profile output screens.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
FIG. 1 depicts a system according to one aspect of the invention. In particular, the apparatus of FIG. 1 includes Computer 101 for executing program instructions. These program instructions can be stored in RAM/ROM 103 and/or Mass Storage Device 102. Also connected to Computer 101 are Peripherals 104, which may include a monitor, keyboard, and mouse to allow an operator to receive, manipulate and input information. Computer 101 can access Databases 105, which may be local or remote databases or tables, via Network 106. Network 106 can include local area networks, wide area networks, or global networks such as the Internet.
In the preferred embodiment, Mass Storage Device 102 stores the database, depicted as Collections Database 107, to implement one aspect of the invention. Of course, Mass Storage Device 102 is not limited to a single physical disk; it may in fact be composed of several disks, such as a disk cluster, a RAID disk array, etc. In addition, Databases 105 includes Lotus Notes databases, although other database programs, such as Oracle, Informix, Sybase, Microsoft Access and Claris FileMaker Pro databases, as well as semi-structured Web pages and Internet accessible databases (usually via the World Wide Web), can also be used. Collections Database 107 is comprised of several tables using, in the preferred embodiment, Microsoft Access. Again, other database programs (relational, object-oriented or otherwise) can be used to create Collections Database 107.
In Lotus Notes a database is a collection of related documents. A Lotus Notes collection and a document are analogous to a table and to a row or a record, respectively, in a relational database (such as a Microsoft Access database), and the terms "collection" and "database" will hereinafter be used interchangeably.
Accordingly, although the preferred embodiment searches across multiple Lotus Notes databases, it is not limited to Lotus Notes databases, but can also be used with other types of databases and Web pages. The document/row/record stores data in a field or fields. The term "field" can also refer not only to the specific data storage element in a document/row/record, but in the abstract, to the field of a collection or database, of which the field of a document/row/record is an instance of the collection or database field.
To efficiently search across multiple databases the present invention classifies the data into entities and facilitates and focuses searches by using these entities and relationships among the entities. Examples of entities can be a person, such as "Bob Smith" or a company, such as "Apple Computer," or even a skill, such as "SAP Programming." Each entity has attributes associated with it, which can include an entity type, and, optionally, an entity role. Examples of entity types include person, company, skills, telephone number, office and dollar entity types. Entity roles further describe and refine the entity attributes and are generally entity type specific. Examples of entity roles for a person entity type can include partner, manager, contact, author and reviewer entity roles. Examples of entity roles for a company entity type can include client and vendor entity roles. For instance, an entity such as "Bob Smith" may have a person entity type and can, for example, have an entity role of a partner on an engagement or an entity role of a contact, depending on the collection and results of analysis by Computer 101. Using the entity types and entity roles, collections can also be classified into collection types (or database types for other non-Notes databases, used hereinafter interchangeably).
A collection of telephone numbers can be classified, for example, into a directories collection type, while a collection of time spent on various client matters might be classified as a client-engagements collection type.
Field meta-information, such as the field names, name patterns, and field values, is used to further generate these collection type and field entity type and entity role attributes. This attribute meta-information is subsequently used to more efficiently conduct an entity-based search across multiple databases. In particular, Computer 101 analyzes the fields in a collection to determine fields which hold references to entities by using the field meta-information. These fields are then assigned an entity type and, optionally, an entity role, which are then further used to assign a collection type to the collection.
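As a rough illustration of how field names might drive these assignments, the following Python sketch maps field names to an entity type and role and then derives a collection type. Everything in it is an assumption made for illustration: the text does not list the actual name patterns or collection-type rules, so `NAME_PATTERNS`, `classify_field` and `classify_collection` are hypothetical. The field names Client, Matter, PrtnrName and RecID are taken from the Accounting Collection example, while Name and Phone are invented.

```python
import re

# Hypothetical name patterns, checked in order; the first match wins.
NAME_PATTERNS = [
    (re.compile(r"prtnr|partner", re.I), ("person", "partner")),
    (re.compile(r"mgr|manager", re.I), ("person", "manager")),
    (re.compile(r"client|customer", re.I), ("company", "client")),
    (re.compile(r"phone|tel", re.I), ("telephone number", None)),
    (re.compile(r"name$", re.I), ("person", None)),
]


def classify_field(field_name):
    """Return (entity type, entity role), or (None, None) if nothing matches."""
    for pattern, attributes in NAME_PATTERNS:
        if pattern.search(field_name):
            return attributes
    return (None, None)


def classify_collection(field_names):
    """Derive a collection type from the field classifications (assumed rules)."""
    found = {classify_field(name) for name in field_names}
    has_person = any(entity_type == "person" for entity_type, _ in found)
    if ("company", "client") in found and has_person:
        return "client-engagements"
    if has_person and ("telephone number", None) in found:
        return "directories"
    return None


accounting_type = classify_collection(["RecID", "Client", "Matter", "PrtnrName"])
directory_type = classify_collection(["Name", "Phone"])
```

A fuller version would presumably also weigh sample field values, as the text indicates, rather than relying on names alone.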
FIGS. 2 through 4 depict the structure of exemplary Databases 105 as may have been created by different departments. In particular, FIG. 2 shows Accounting Collection 201 that includes the following five fields, set forth in Table 1, for each document:
TABLE 1
Figure imgf000008_0001
FIG. 3 shows Experts Collection 301, which includes the following five fields, set forth in Table 2, for each document:
TABLE 2
Figure imgf000009_0001
FIG. 4 shows Sales Collection 401, which includes the following four fields, set forth in Table 3, for each document:
TABLE 3
Figure imgf000009_0002
These collections are accessible by Computer 101 to be analyzed and searched. Although only three collections are shown here for illustrative purposes, and will be used to more clearly illustrate various aspects of the invention, there may be hundreds or thousands of different databases. With respect to the field RecID in each collection, this field is a unique identifier assigned by the particular database application for each document in the table. This identifier is usually automatically assigned by a database program and can either explicitly be a separate field, as in the examples shown, or internally maintained by each database program. Of course, Databases 105 can be other non-Lotus Notes databases, such as traditional relational databases, and thus, for example, Accounting Collection 201 would consist of a table with five fields in each record.
FIG. 5 shows Collections Database 107 as comprised of several tables, and their interrelationships, in one embodiment of the present invention. A description of each table is given below.
Symbol Table 501 stores symbol information. These symbols have a name and type, stored in the fields SymbolName and SymbolType, respectively. For instance, symbols are stored referring to the different entity types, entity roles and collection types. The fields in Symbol Table 501 are explained in Table 4.
Table 4
Figure imgf000010_0001
By using a unique SymbolID as a pointer, other tables can save storage space by only storing the SymbolID, and not the full text of the SymbolName or SymbolType. Of course, other methods, such as drop down menus and pick lists stored with each table or database application could be used instead. Symbol Table 501 is populated with predetermined values for access by Computer 101. FIG. 7a depicts the structure of Symbol Table 501 with sample records and data.
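The space-saving idea of storing only a SymbolID can be sketched as a small interning structure. This is an illustrative Python sketch, not the Microsoft Access table of the preferred embodiment; the `SymbolTable` class, its method names, and the SymbolType strings used here are assumptions.

```python
class SymbolTable:
    """Assigns one SymbolID per (SymbolName, SymbolType) pair."""

    def __init__(self):
        self._ids = {}    # (SymbolName, SymbolType) -> SymbolID
        self._rows = {}   # SymbolID -> (SymbolName, SymbolType)

    def intern(self, symbol_name, symbol_type):
        """Return the existing SymbolID for the pair, creating one if needed."""
        key = (symbol_name, symbol_type)
        if key not in self._ids:
            symbol_id = len(self._ids) + 1
            self._ids[key] = symbol_id
            self._rows[symbol_id] = key
        return self._ids[key]

    def lookup(self, symbol_id):
        """Resolve a stored SymbolID back to (SymbolName, SymbolType)."""
        return self._rows[symbol_id]


symbols = SymbolTable()
person_id = symbols.intern("person", "entity type")
partner_id = symbols.intern("partner", "entity role")
# Other tables would store only person_id / partner_id, not the full text.
```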
Collection Table 502 stores information pertaining to each collection to be searched. It is related to Symbol Table 501, via the MediaTypeID field, to identify the collection's media type, e.g., Notes, Oracle, Sybase, Web, etc. This is shown schematically as Relation 525 in FIG. 5. The fields in Collection Table 502 are explained in Table 5.
Table 5
Figure imgf000011_0001
FIG. 7b depicts the structure of Collection Table 502 with sample records and data.
CollectionInfo Table 503 stores information relating specifically to a Collection. It is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 521 in FIG. 5. The fields in CollectionInfo Table 503 are explained in Table 6.
Table 6
Figure imgf000011_0002
Although the above table is described for a Notes database, it is not limited to Notes databases. Moreover, other tables can also be used which are media-type specific, and any modifications needed would be well within the skill of one skilled in the art. For example, if the database is not a Notes database but rather an Oracle database, then the fields ViewName and ViewID may not be needed and can be eliminated for a table which is specific to Oracle databases. FIG. 7c depicts the structure of CollectionInfo Table 503 with sample records and data.
WebCollectionInfo Table 504 stores information relating specifically to collections of Web pages. It is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 522 in FIG. 5. The fields in WebCollectionInfo Table 504 are explained in Table 7.
Table 7
Figure imgf000012_0001
In the preferred embodiment, WebCollectionInfo Table 504 is a separate table. Other methods are available to store this information; for example, this information can be stored in CollectionInfo Table 503 as one table. FIG. 7d depicts the structure of WebCollectionInfo Table 504 with sample records and data.
CollectionFeatures Table 505 stores information about the attributes of a collection. These attributes include, but are not limited to, Collection Type, Secondary Collection Type, Division, and GeographicRegion. For example, the Collection Type attribute is the primary attribute while the Secondary Collection Type may optionally describe the primary Collection Type in greater detail. As another example, the GeographicRegion attribute can describe whether the collection contains data relating to companies in the United States or to companies in Europe. These attributes are used by Computer 101 to help focus the search results. This table is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 523 in FIG. 5. The fields in CollectionFeatures Table 505 are explained in Table 8.
Table 8
Figure imgf000013_0001
FIG. 7e depicts the structure of CollectionFeatures Table 505 with sample records and data. As an example, a table may have an entry in the AttributeID field of CollectionFeatures Table 505 pointing to an entry in Symbol Table 501 indicating that the collection has a "Collection Type", and a ValueID which points to an entry in Symbol Table 501 having the value of "client-engagements," thus indicating that the collection is of a client-engagements collection type. Again, it is well within the skill of one skilled in the art to use different tables for different media types, if desired.
Document Table 506 stores meta-information about each document indexed by Computer 101. It is related to Collection Table 502 via the CollectionID field. This is shown schematically as Relation 524 in FIG. 5. The fields in Document Table 506 are explained in Table 9.
Table 9
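The way an AttributeID and ValueID pair resolves through Symbol Table 501 can be sketched with an in-memory SQLite database. This is only an illustration using Python's standard `sqlite3` module in place of Microsoft Access; the table name `Symbol`, the sample rows, the SymbolType strings, the CollectionID values and the helper `collection_features` are assumptions, while the column names follow the field names given in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Symbol (
    SymbolID INTEGER PRIMARY KEY, SymbolName TEXT, SymbolType TEXT);
CREATE TABLE CollectionFeatures (
    CollectionID INTEGER, AttributeID INTEGER, ValueID INTEGER);
-- Hypothetical sample rows.
INSERT INTO Symbol VALUES
    (1, 'Collection Type', 'attribute'),
    (2, 'client-engagements', 'collection type'),
    (3, 'directories', 'collection type');
INSERT INTO CollectionFeatures VALUES (1, 1, 2), (2, 1, 3);
""")


def collection_features(connection, collection_id):
    """Resolve each (AttributeID, ValueID) pair to symbol names for one collection."""
    rows = connection.execute(
        """
        SELECT a.SymbolName, v.SymbolName
        FROM CollectionFeatures AS f
        JOIN Symbol AS a ON a.SymbolID = f.AttributeID
        JOIN Symbol AS v ON v.SymbolID = f.ValueID
        WHERE f.CollectionID = ?
        ORDER BY a.SymbolName
        """,
        (collection_id,),
    ).fetchall()
    return dict(rows)


features = collection_features(conn, 1)
```

Joining the symbol table twice, once per pointer column, keeps CollectionFeatures free of repeated text at the cost of a slightly longer query.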
Figure imgf000014_0001
FIG. 7f depicts the structure of Document Table 506 with sample records and data.
Fields Table 507 contains information relating to the individual fields of a collection. It is related to Collection Table 502 via the CollectionID field and to Symbol Table 501 via the TypeID and RoleID fields. These relations are shown schematically in FIG. 5 as Relation 526 and Relation 529, respectively. The fields in Fields Table 507 are explained in Table 10. FIG. 7g depicts the structure of Fields Table 507 with sample records and data.
Table 10
Figure imgf000015_0001
SampleFieldValues Table 508 contains sample values for each field in Fields Table 507. These values are used by Computer 101 to infer entity types and entity roles for each field. It is related to Fields Table 507 via the FieldID field. This is shown schematically as Relation 530 in FIG. 5. The fields in SampleFieldValues Table 508 are explained in Table 11. FIG. 7k depicts the structure of SampleFieldValues Table 508.
Table 11
Figure imgf000015_0002
Index Table 509 stores in each record the values of each indexed field of each document that is indexed according to the IndexMark field. It is related to Document Table 506, Symbol Table 501 and Fields Table 507 via the DocumentID field, the TypeID and RoleID fields, and the FieldID field, respectively. These relations are shown schematically in FIG. 5 as Relations 531, 532 and 533, respectively. The fields in Index Table 509 are explained in Table 12 and illustrated in FIG. 7h.
Table 12
Figure imgf000016_0001
Relation Table 510 stores the relations used in a profile summary. The fields in Relation Table 510 are explained in Table 13 and illustrated in FIG. 7i.
Table 13
Figure imgf000017_0001
RelationMapping Table 511 stores the detailed information that specifies, for each relation and relevant collection, the fields whose values are used. It is related to Collection Table 502 and Relation Table 510 via the CollectionID and RelationID fields. This is shown schematically in FIG. 5 as Relation 527 and Relation 528, respectively. The fields in RelationMapping Table 511 are explained in Table 14 and illustrated in FIG. 7j.
Table 14
Figure imgf000018_0001
Referring now to FIG. 6a, there is shown a high-level schematic process flow diagram of steps performed by Computer 101 in practicing one aspect of the present invention. In particular, in step S601, Computer 101 begins to populate Collection Table 502 (FIG. 7b). Each record in Collection Table 502 refers to a collection (which can, for example, be a Notes collection, a relational database, or a Web page) to be analyzed and indexed. As each record is created, Computer 101 automatically assigns the record a CollectionID, which is generally a unique numerical value. Computer 101 then stores the collection's name, location and media type in the CollectionName, LocationURL and MediaTypeID fields, respectively. The remaining fields contain null values. The process of determining which collections should be included in the table could involve Computer 101 searching a network for all available Notes collections, importing a file containing the collection names and locations, or even an operator instructing Computer 101 to select a particular collection.
During step S601, Computer 101 can also scan the existing collections stored in Collection Table 502 to determine if any collection has changed its field structure. If so, then Computer 101 indicates that the collection needs to be reanalyzed and reclassified. This can be done, for instance, by setting the LastIndexedDate field to a null value. The data shown in FIG. 7b is an example of the data stored using Accounting Collection 201, Experts Collection 301, Sales Collection 401, and a Web page. Computer 101 also creates or updates records in CollectionInfo Table 503 (FIG. 7c) and WebCollectionInfo Table 504 (FIG. 7d).
For purposes of illustration, using CollectionID = 1 from Collection Table 502, which refers to Accounting Collection 201, Computer 101 creates or updates the relevant record in CollectionInfo Table 503 with data relating to Accounting Collection 201, such as the server name and collection file name, as shown in FIG. 7c. As another example, using CollectionID = 2 from Collection Table 502, which refers to a Web page, Computer 101 inserts the relevant information into WebInfoID record 1 with a default page depth level of 2. The depth level is the number of levels down to index, viewing the links on a Web page as a hierarchical tree structure. Alternatively, CollectionInfo Table 503 and WebCollectionInfo Table 504 may be populated with the desired collection records and information, and Computer 101 then populates Collection Table 502 from these tables. Of course, Computer 101 would have to update the CollectionID field in CollectionInfo Table 503 and WebCollectionInfo Table 504 for new collections once it has created a new record in Collection Table 502.
Next, Computer 101 proceeds to step S602 to select a collection to analyze and classify from Collection Table 502. In general, all newly added or structurally modified collections need to be analyzed and classified. Computer 101 can determine if a collection is newly added or structurally modified by examining, for instance, the LastIndexedDate field of Collection Table 502. If this field has a null value, then the collection is considered newly added. Another way would be to use the MediaTypeID field instead of the LastIndexedDate field. Yet another way would be to flag (using a Flag field, not shown) all the newly added or structurally modified collections in Collection Table 502 and use the flags to select the collections for analysis and classification. Computer 101 proceeds to step S603 once it has selected a collection.
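The selection in step S602 can be illustrated by a minimal Python sketch, assuming that a null LastIndexedDate is modeled as `None`. The `CollectionRecord` class and both function names are hypothetical and serve only to show the null-value convention described above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CollectionRecord:
    """Hypothetical in-memory view of one record of Collection Table 502."""
    collection_id: int
    name: str
    last_indexed_date: Optional[date]  # None models the null value


def collections_to_analyze(records):
    """Step S602: pick collections whose LastIndexedDate is null."""
    return [r for r in records if r.last_indexed_date is None]


def mark_for_reanalysis(record):
    """Step S601: flag a structurally changed collection by nulling the date."""
    record.last_indexed_date = None
```

For example, a collection indexed earlier is skipped until `mark_for_reanalysis` is applied to its record, after which `collections_to_analyze` returns it again.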
In step S603, Computer 101 extracts the fields and the fields' meta-information from the selected collection. Using for illustrative purposes Accounting Collection 201 and Experts Collection 301, Computer 101 creates or updates the relevant records in Fields Table 507 (FIG. 7g). Referring to FIG. 7g, sample records and data are depicted which pertain to these two collections. In particular, Computer 101 queries Accounting Collection 201 requesting the names and data types of all fields in that collection. Generally, each type of database program (Lotus Notes, Oracle, Sybase, Informix, etc.) has a way for Computer 101 to obtain the field information through a function call. For each field, Computer 101 creates a new record and stores the CollectionID value, field name and field data type. Since Accounting Collection 201 has a CollectionID value of 1, the CollectionID field in Fields Table 507 has a value of 1 for all records relating to Accounting Collection 201. If the collection is not newly added, then Computer 101 creates, deletes, or updates the appropriate records. In a similar fashion, Computer 101 populates Fields Table 507 with the data pertaining to Experts Collection 301.
Referring to the IndexMark and SynopsisMark fields in Fields Table 507, Computer 101 may set the values in these fields initially to a "Yes" value, with confirmation by an operator. The IndexMark field, if set to "Yes," indicates that this field should be normalized, indexed and used for entity-based searching. Thus, all fields in both collections, except the RecID fields, will be used. Likewise, the SynopsisMark field, if set to "Yes," indicates that the value of this field is to be used as part of a synopsis string to be presented to the user.
In this example, only values in the Client, Matter and PrtnrName fields of Accounting Collection 201 and only the Name, Number and Expertise fields of Experts Collection 301 will be used to form the synopsis text which will be displayed for each document if selected as a match. In this step, Computer 101 also populates SampleFieldValues Table 508 by extracting and storing a number of data entries for each field that will be analyzed, as shown in FIG. 7k. The amount of data to be extracted can vary. For example, Computer 101 can default to retrieving the first 100 entries with data, and more if necessary later on. This data is used to infer additional meta-information and refine the field analysis and classification.
Next, Computer 101 proceeds to step S604 to select a field to analyze and classify. As in step S602, Computer 101 can determine if a field has been added or modified by examining for each record the TypeID and RoleID fields in Fields Table 507. Where the IndexMark value in a record is "Yes" and the TypeID and RoleID fields have a null value, this could indicate that the field in the selected collection needs to be analyzed. Of course, if the collection has been structurally changed, then Computer 101 in prior steps, for example, in steps S602 or S603, would have set any relevant existing records' TypeID and RoleID field values to a null value indicating that the field has to be reanalyzed, reclassified and reindexed. Using for illustrative purposes Accounting Collection 201's Client field, Computer 101 selects the record with the FieldID value = 2 from Fields Table 507 and proceeds to step S605 to analyze and classify the field. In the preferred embodiment, Computer 101 uses the IndexMark field only to determine if the field needs to be indexed for later entity-based searching; all fields are analyzed and classified. In step S605, Computer 101 analyzes the field to determine the field's entity type attribute.
A field's entity type describes the semantic nature of the data in the field. For example, a field with people's names would have a person entity type. Other exemplary entity types include company, skill, telephone number, office and dollar amount. In the preferred embodiment, Computer 101 analyzes the field's name and a sample of data stored in the field as a way to determine the field's entity type.
FIG. 8 shows in greater detail one embodiment for analyzing a field name. First, in step S801, Computer 101 tokenizes the field name. Methods for generating tokens from a given input string are well known in the art. For example, for a field with the name Client, Computer 101 returns as a token [Client]. For a field with the name PrtnrName, Computer 101 returns as tokens [Prtnr] and [Name]. In step S802, Computer 101 uses the tokens generated in step S801 to create a list of possible entity types. This list may be weighted depending on, for example, the closeness of a match. Using as examples the Client (FieldID = 2) and PrtnrName (FieldID = 4) fields of Accounting Database 201, the token [Client] for the Client field might be indicative that the field contains a name of an entity type such as a person or company. For the field PrtnrName, the token [Prtnr] might be indicative that the field relates to a partner, and thus is of a person entity type. The token [Name] in the same PrtnrName field would again suggest that the field contains the name of a person or company entity type. To generate the entity type list, Computer 101 maps the tokens to a dictionary of terms, which can be as simple as a list of terms or as advanced as a domain-dependent list, and obtains a list of possible entity type matches. In step S803, Computer 101, in the preferred embodiment, uses the patterns of the tokens to further infer information about the entity type of the field.
For example, a heuristic rule used in this step would be that if a field contains a token [By] in the last position of the field name, that might be indicative of an "action" and thus that the field may contain people or company names. As another example, the patterns of the tokens are used to infer that a field named ContactPhone is preferably of the entity type telephone while a field named PhoneContact is preferably of the entity type person. Of course, a system does not have to implement all these steps, depending on the confidence level desired for a proper attribute classification. The result from this step is another list of possible entity types, which may be weighted, if desired. In step S804, Computer 101 takes the results of steps S802 and S803 and generates a set of possible entity types. This list can be generated, for example, by listing only the intersection of the results from both the field name and pattern matching analysis routines. This list can then be further processed by removing duplicate entity types and choosing only those entity types whose confidence ranking exceeds a predetermined threshold.
To further refine the accuracy of the analysis of the field's entity type, Computer 101 also analyzes a sample of the data stored within the field. In its analysis, Computer 101 can use many different methods, either alone or in combination, depending on the level of analysis desired. Different methods of analysis include heuristic matches, keyword-based matches, and formatting criteria. FIG. 9 shows one combination of these field value analysis methods. In the preferred embodiment, Computer 101 first uses the possible entity types generated by field name analysis routines (i.e., FIG. 8) and then uses those possible entity types to conduct a field value analysis (i.e., FIG. 9) to further refine the analysis. In a heuristic match, used in step S901, Computer 101 compares the values in a field against typical values for a given entity type.
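The field name analysis of FIG. 8 (steps S801 through S804) might be sketched in Python as follows. This is a simplified illustration only: the term dictionary and its weights are invented, the tokenizer assumes mixed-case field names, and the only pattern rule implemented is the assumption that the last token acts as the head word (so that ContactPhone and PhoneContact are distinguished). The averaging of the two weights in step S804 is likewise an assumption.

```python
import re

# Hypothetical dictionary of terms: token -> {entity type: weight}.
TERM_DICTIONARY = {
    "client": {"company": 0.6, "person": 0.4},
    "prtnr": {"person": 0.8},
    "name": {"person": 0.5, "company": 0.5},
    "phone": {"telephone number": 0.9},
    "contact": {"person": 0.7},
}


def tokenize(field_name):
    """Step S801: split a mixed-case field name into tokens."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+", field_name)


def dictionary_candidates(tokens):
    """Step S802: map every token to possible entity types, keeping the best weight."""
    candidates = {}
    for token in tokens:
        for etype, weight in TERM_DICTIONARY.get(token.lower(), {}).items():
            candidates[etype] = max(candidates.get(etype, 0.0), weight)
    return candidates


def pattern_candidates(tokens):
    """Step S803 (assumed rule): the last token is treated as the head word."""
    if not tokens:
        return {}
    return dict(TERM_DICTIONARY.get(tokens[-1].lower(), {}))


def analyze_field_name(field_name, threshold=0.5):
    """Step S804: intersect both lists, average the weights, apply a threshold."""
    tokens = tokenize(field_name)
    by_dictionary = dictionary_candidates(tokens)
    by_pattern = pattern_candidates(tokens)
    combined = {
        etype: (by_dictionary[etype] + by_pattern[etype]) / 2
        for etype in by_dictionary.keys() & by_pattern.keys()
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return [(etype, score) for etype, score in ranked if score >= threshold]
```

With these invented weights, ContactPhone yields only the telephone number type and PhoneContact yields only the person type, mirroring the example given for step S803.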
For instance, Computer 101 may compare the values in a field against a table or list of person names. A high number of matches would be indicative that the data contains person names. Likewise, a high number of matches while comparing the values in a field against a table or list of office locations within a company, or a table of company names, would be indicative that the field contains an entity type of office or company, respectively. In the preferred embodiment, this can be accomplished by comparing each sample field value for a given field from SampleFieldValues Table 508 to different lists of person names, company names or skills. A high number of matches from a person names list and a low number of matches from the company location list would be indicative, for example, that the field has the attribute of a person entity type.
In keyword-based matches, step S902, the data in the fields are scanned for specific keywords. For instance, the presence of the terms "Inc.", "Corp." or "Co." may be indicative that the field contains company names, and is thus of a company entity type. In using the formatting criteria method, step S903, Computer 101 examines how the data in the fields are formatted. As an example, short strings with capitalized words might be indicative of a person's name, and thus of a person entity type.
The confidence level of a correct entity type identification varies for each of the different methods. In the methods shown in FIG. 9, the confidence level of a correct identification ranges from a high confidence for the heuristic match to a low confidence level using the formatting criteria. In the preferred embodiment, Computer 101 takes the possible entity types generated by field name analysis routines (i.e., FIG. 8) and uses those possible types to conduct a field value analysis (i.e., FIG. 9). Computer 101 then selects the entity type with the highest combined confidence level above a predetermined threshold.
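One possible way to combine the three field value analysis methods of FIG. 9 is sketched below in Python. The sample lists, the method weights (chosen only to reflect the ordering from high confidence for the heuristic match to low confidence for formatting criteria) and the threshold are assumptions, and the sketch scores only the value analysis rather than a confidence combined with the field name analysis.

```python
# Hypothetical reference lists and weights, invented for this sketch.
KNOWN_VALUES = {
    "person": {"richard smith", "mike willis", "susan jones"},
    "company": {"acme corp.", "globex inc."},
}
COMPANY_KEYWORDS = {"inc.", "corp.", "co."}
METHOD_WEIGHTS = {"heuristic": 0.6, "keyword": 0.3, "format": 0.1}


def fraction(values, predicate):
    """Share of sample values satisfying the predicate."""
    return sum(1 for v in values if predicate(v)) / len(values) if values else 0.0


def looks_like_person_name(value):
    """Step S903: short string made of capitalized words."""
    words = value.split()
    return 1 <= len(words) <= 3 and all(w[0].isupper() for w in words)


def score_entity_type(entity_type, values):
    """Weighted average of the methods that apply to this entity type."""
    known = KNOWN_VALUES.get(entity_type, set())
    # Step S901: heuristic match against typical values.
    evidence = {"heuristic": fraction(values, lambda v: v.strip().lower() in known)}
    if entity_type == "company":
        # Step S902: keyword-based match.
        evidence["keyword"] = fraction(
            values, lambda v: any(w in COMPANY_KEYWORDS for w in v.lower().split())
        )
    if entity_type == "person":
        evidence["format"] = fraction(values, looks_like_person_name)
    total_weight = sum(METHOD_WEIGHTS[m] for m in evidence)
    return sum(METHOD_WEIGHTS[m] * f for m, f in evidence.items()) / total_weight


def classify_field(values, candidate_types, threshold=0.5):
    """Return the best-scoring candidate type, or None if below the threshold."""
    scores = {t: score_entity_type(t, values) for t in candidate_types}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

A `None` result corresponds to the case in which no candidate type reaches the predetermined threshold, so that further samples or operator assistance would be needed.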
If the confidence level for any of the possible types is not above the predetermined threshold, then additional field samples are added to SampleFieldValues Table 508 for additional analysis. When Computer 101 has determined the selected field's entity type, it enters that information in the TypeID field of the appropriate record in Fields Table 507. Thus, for the Client field of Accounting Database 201, the field is classified as the entity type company. Likewise, the PrtnrName field is classified as entity type person. In the situation where it is determined that an entity type cannot be assigned to the field, then an appropriate value indicating this is entered into Fields Table 507. This situation can occur, for example, when the confidence level returned by the field value analysis is still below the predetermined threshold despite repeated increases in sample size, or when a field, such as the RecID field, is known to be irrelevant to an entity-based search. Computer 101 may in this situation request assistance from an operator.
Referring back to FIG. 6a, once the selected field's entity type has been determined, Computer 101 then proceeds to step S606 to classify the selected field's entity role. Entity roles are more specific classifications of an entity type. Entity roles for a person entity type may include manager, partner, author, mentor, reviewer, or editor. Entity roles for a company entity type may include client, supplier and vendor. To identify the field's entity role, Computer 101 uses in the preferred embodiment the field's entity type, determined in step S605, and the field's name. Using the PrtnrName field of Accounting Collection 201 as an example, Computer 101 determines that the token [Prtnr] may be indicative of an entity role of partner. Routines similar to those used in classifying the field's entity type can be used to also classify the field's entity role. Once the entity role has been determined, it is entered into Fields Table 507.
After the field's entity role has been determined, Computer 101 proceeds to step S607. If there are more fields in the selected collection to be classified, Computer 101 returns to step S604. Of course, not all fields need to be classified; once all desired fields have been classified, Computer 101 proceeds to step S608 to assign attributes to the collection. A collection's attributes can include a collection type, which is a description of what a collection is. These types can include, for example, directory, newsletter, contact, company data, or client-engagements collections. In a manner similar to classifying fields, Computer 101 uses a set of rules to assist in the classification of the collection. For instance, if a collection contains a field with an entity type person (whose role is not a collection editor or author) and a field with an entity type telephone number, then that collection can be classified as of collection type contact. Another rule could be that if a collection contains a field with an entity type of person in the entity role of a mentor or reviewer, and fields with entity types company and dollar amounts, then the collection can be classified as of collection type engagements. Yet another rule could be that if a collection contains fields with entity types company and dollar amounts, but no field of a person entity type, then the collection can be classified as of collection type company data.
Once Computer 101 has determined the collection type and possibly the values of other attributes, it then populates CollectionFeatures Table 505 (FIG. 7e). In particular, Computer 101 enters the SymbolID of the particular collection attribute that it assigns to the collection, and the SymbolID of that attribute's value. For example, FIG. 7e depicts sample records for CollectionFeatures Table 505 indicating that Accounting Collection 201 has a collection type of client-engagements.
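The three exemplary collection classification rules of step S608 can be expressed as a short Python sketch. Representing each classified field as an `(entity_type, entity_role)` pair, evaluating the rules in the order they are listed, and returning `None` when no rule applies are all assumptions of this illustration.

```python
def classify_collection(fields):
    """Apply the exemplary rules to a list of (entity_type, entity_role) pairs.

    Returns the collection type of the first rule that matches, or None.
    """
    types = {entity_type for entity_type, _ in fields}
    person_roles = [role for entity_type, role in fields if entity_type == "person"]
    has_company_and_amount = {"company", "dollar amount"} <= types

    # Rule 1: a person who is not the collection's editor or author,
    # together with a telephone number.
    if "telephone number" in types and any(
        role not in ("editor", "author") for role in person_roles
    ):
        return "contact"
    # Rule 2: a mentor or reviewer, together with company and dollar amounts.
    if has_company_and_amount and any(
        role in ("mentor", "reviewer") for role in person_roles
    ):
        return "engagements"
    # Rule 3: company and dollar amounts, but no person field at all.
    if has_company_and_amount and not person_roles:
        return "company data"
    return None
```

Because the first matching rule wins, a collection that satisfies more than one rule is assigned the type of the earlier rule; a weighted scheme could be substituted if several collection types should be reported.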
It is possible that a collection has multiple attributes and thus has multiple associated records in CollectionFeatures Table 505. Computer 101 then proceeds to step S609. If there are additional collections to analyze, then flow proceeds back to step S602; otherwise, flow proceeds to step S610 of FIG. 6b. This completes the field and collection classification steps. At the appropriate steps Computer 101 can, if desired, engage the operator in confirming or assigning the entity or collection type or entity role.
Because these multiple collections are not related to one another in a standard relational format, their data is highly heterogeneous and not strongly typed. Thus, there is no single unique key which identifies the entities across the different collections and documents. Rather, an entity may be referred to in a variety of different data formats, synonyms and abbreviations. Accordingly, the data must next be normalized to allow searching for entities.
Referring now to FIG. 6b, Computer 101 will perform a set of normalization and indexing routines. The indexing routine includes determining document meta-information, generating a synopsis of each document and storing original and normalized forms of each value in an indexed field into Index Table 509 (FIG. 7h). In step S610, Computer 101 selects a collection to index. In the preferred embodiment, Computer 101 selects a collection from Collection Table 502. Next, at step S611, Computer 101 determines if the selected collection has been modified since the last time it was indexed. One way to make this determination is to use the LastIndexedDate field of Collection Table 502. Since most types of databases and collections retain the time that they were last modified, Computer 101 can compare the time in the LastIndexedDate field against the time that the collection reports that it was last modified.
If the time in the LastIndexedDate field is blank or earlier than the time reported by the collection, then the index for the selected collection needs to be updated or, in the case of a new collection, created. If the date stored in the LastIndexedDate field is the same as or later than the date reported by the selected collection, flow continues to step S617. In step S612, the records in Document Table 506 (FIG. 7f) and Index Table 509 relating to the documents in the selected collection which have been deleted or changed are removed. Alternatives to deleting the related records may also include, for example, simply updating the records relating to those documents which have been modified, identified by comparing the Date field in Document Table 506 and the time the collection reports that it was last modified. In the preferred embodiment, records stored in Document Table 506 and Index Table 509 and related to modified or deleted documents are deleted.
Computer 101 next proceeds to step S613 and selects a new or modified document from the selected collection. This, of course, presupposes that new or modified documents were previously identified, such as in step S612 of FIG. 6b. Other ways include selecting those documents which are dated after the date stored in the LastIndexedDate field for that collection. Flow then proceeds to step S614. In this step, Computer 101 creates a new record for the document in Document Table 506 and generates a synopsis for that document. A synopsis is a brief description of the document displayed to the user when the document is returned as a hit, i.e., a match to a search query. The synopsis can, for example, take the form of a concatenation of the values in selected fields. Referring to FIG. 7f, each record in Document Table 506 contains information about each document in the collection to be searched.
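The staleness test of step S611 and the concatenation form of a synopsis in step S614 might be sketched in Python as follows. The function names, the `"; "` separator and the sample document values are assumptions of this sketch; only the field names Client, Matter and PrtnrName are taken from the example collection.

```python
from datetime import datetime


def needs_reindex(last_indexed, last_modified):
    """Step S611: reindex if never indexed (None) or modified after the last indexing."""
    return last_indexed is None or last_indexed < last_modified


def make_synopsis(document, synopsis_fields, separator="; "):
    """Step S614: concatenate the non-empty values of the fields marked for the synopsis."""
    return separator.join(
        str(document[name]) for name in synopsis_fields if document.get(name)
    )


# Hypothetical document from Accounting Collection 201 (values invented).
sample_document = {
    "RecID": 688,
    "Client": "Acme Corp.",
    "Matter": "Audit",
    "PrtnrName": "Mr. Dick Smith",
}
sample_synopsis = make_synopsis(sample_document, ["Client", "Matter", "PrtnrName"])
```

Here `sample_synopsis` is the string "Acme Corp.; Audit; Mr. Dick Smith", which would be stored once at indexing time so that it need not be rebuilt for every hit.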
In particular, information pertaining to the collection and the unique document identifier as maintained by each specific collection's database program is stored in fields CollectionID and TupleID, respectively. For any given document, the synopsis is generated from a concatenation of the fields for that collection identified to be used in the synopsis, as determined by the values in the SynopsisMark field of Fields Table 507. For example, Fields Table 507 has three fields in Accounting Collection 201 selected to be part of a synopsis, as indicated by the "Yes" value in the SynopsisMark field, specifically the Client, Matter and PrtnrName fields. The values in these fields are concatenated and stored in the Synopsis field in Document Table 506 to facilitate retrieval. Other ways of generating a synopsis field are well within the skill of one skilled in the art.
Once a synopsis has been generated and stored, flow proceeds to step S615. In this step, Computer 101 generates normal forms for the values in selected fields (as indicated by the IndexMark field of Fields Table 507) for each document, and stores the original value as well as the normal forms in Index Table 509. In the preferred embodiment of the present invention, normalization is specific to a field's entity type. Thus, if a field has a company entity type, then exemplary steps of normalizing the field's value include: (1) removing any leading "The" terms, expanding abbreviations, such as "Inc." and "Co.", uniformly capitalizing terms, and combining leading initials (e.g., "H B N" becomes "HBN" and "R. A. Nado" becomes "RA Nado"); and (2) looking up the value's leading words
in a synonym table, and if found, fetching a normalized form of the name from the synonym table to use as the normalized form of the name (e.g., "IBM" becomes "International Business Machines"); if not found, using the first "significant" word of the company name as the normalized value. A significant word is a word that is not a stop entry (a word or phrase that should be ignored). A list of stop entries is maintained by the system for access by Computer 101.
For a field with an entity type of person, normalization can include: (1) breaking names down into components, such as first, middle and last name (accounting, of course, for different variants, e.g., "First Middle Last", "Last, Mr. First", etc.); (2) normalizing first and, if desired, middle names, using a synonym table so that, for example, "Dick", "Rick" and "Rich" become "Richard"; and (3) creating a normal form, or forms, for the name based on its components (e.g., "R. Mike Willis" becomes "Willis/R" and "Willis/Michael", stored in the preferred embodiment in two records in Index Table 509, and "Mr. Dick Smith" becomes "Smith/Richard"). For a field with an entity type of skills, the normal form is the first word of the value in the field. Of course, other ways of generating and storing normal forms can be used.
Computer 101 then stores the unnormalized and normal forms in Index Table 509, as shown in FIG. 7h, which is depicted with sample records relating to Accounting Collection 201. Also associated with each record is data identifying the source of the value. For example, the fourth record of Index Table 509 indirectly indicates that the record's normal form "Smith/Richard" is from the PrtnrName field (FieldID = 4), which is from Accounting Collection 201 (CollectionID = 1) and of entity type person, and from the document identified as 688 (from DocumentID = 1 and TupleID = 688) in Accounting Collection 201. The name "R.
Mike Willis" from RecID = 689 from Accounting Collection 201 normalizes, in the example, into two possible normal forms stored in two records with IndexID = 7 and 8 in Index Table 509, as "Willis/R" and "Willis/Michael", respectively. One of these normal forms is designated as a canonical form, to be displayed when the value is matched as a "hit", by the appropriate value stored in the Canonical field. If desired, a Soundex index can also be generated from the field value and stored in the Soundex field to provide an alternative or enhanced search functionality. Of course, other collections may not have the RecID field whose value is to be stored in the TupleID field of Document Table 506, and other means to associate the normal form to the collection and document could then be used. It is understood by one skilled in the art that Index Table 509 can have many other forms, depending on the information desired to be retained, as well as the database program used.
Referring back to FIG. 6b, once step S615 has been completed, Computer 101 proceeds to step S616 to determine if there are more documents to process. If there are, flow proceeds back to step S613. If not, Computer 101 updates the date in the LastIndexedDate field for the selected collection in Collection Table 502 and then proceeds to step S617. In step S617, Computer 101 determines if there are any additional collections to process. If there are, flow proceeds back to step S610. If not, flow proceeds to FIG. 6c. This completes the normalization and indexing steps.
FIG. 6c presents a flow diagram of the steps taken by Computer 101 to process a query. In step S618, a user enters one or more search strings and identifies the entity type of each search string. Thus, for example, a user searching for the text string "IBM" would also identify to Computer 101 the query entity type, which in this example is the company entity type associated with that string.
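The entity-type-specific normalization of step S615 can be illustrated with the following Python sketch. It is a simplified reading of the exemplary steps, not a definitive implementation: the abbreviation, synonym, stop-entry and title tables are invented, and producing one person normal form per given name or initial is an assumption consistent with the "R. Mike Willis" example.

```python
import re

# Hypothetical lookup tables, invented for this sketch.
ABBREVIATIONS = {"inc.": "Incorporated", "co.": "Company", "corp.": "Corporation"}
COMPANY_SYNONYMS = {"IBM": "International Business Machines"}
STOP_ENTRIES = {"the", "a", "an", "of"}
FIRST_NAME_SYNONYMS = {"dick": "Richard", "rick": "Richard", "rich": "Richard",
                       "mike": "Michael", "bob": "Robert"}
TITLES = {"mr.", "ms.", "mrs.", "dr."}


def normalize_company(value):
    """Return one normal form for a company name."""
    words = value.split()
    if words and words[0].lower() == "the":          # drop a leading "The"
        words = words[1:]
    words = [ABBREVIATIONS.get(w.lower(), w) for w in words]
    initials = []                                     # combine leading initials
    while words and re.fullmatch(r"[A-Za-z]\.?", words[0]):
        initials.append(words.pop(0)[0].upper())
    if initials:
        words.insert(0, "".join(initials))
    words = [w if w.isupper() else w.capitalize() for w in words]
    for n in range(len(words), 0, -1):                # look up leading words
        key = " ".join(words[:n])
        if key in COMPANY_SYNONYMS:
            return COMPANY_SYNONYMS[key]
    significant = [w for w in words if w.lower() not in STOP_ENTRIES]
    return significant[0] if significant else " ".join(words)


def normalize_person(value):
    """Return the normal form or forms ("Last/Given") for a person name."""
    if "," in value:                                  # "Last, Mr. First" variant
        last, rest = value.split(",", 1)
        words = rest.split() + [last.strip()]
    else:
        words = value.split()
    words = [w for w in words if w and w.lower() not in TITLES]
    if not words:
        return []
    last = words[-1].capitalize()
    forms = []
    for given in words[:-1]:
        bare = given.rstrip(".")
        if len(bare) == 1:                            # an initial such as "R."
            forms.append(f"{last}/{bare.upper()}")
        else:
            forms.append(f"{last}/{FIRST_NAME_SYNONYMS.get(bare.lower(), bare.capitalize())}")
    return forms or [last]
```

Each returned form would be stored as its own record in Index Table 509 alongside the original value, so that later lookups compare normal forms only.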
In another embodiment, Computer 101 can analyze the search string in a similar fashion to the field analysis and assign attributes to the search string. FIG. 10 shows a sample query input screen. Input Boxes 1001(a), 1002(a) and 1003(a) hold the user's search strings. Entity Type Boxes 1001(b), 1002(b) and 1003(b) hold the respective search string's entity type. If desired, these search strings can be conjoined ("ANDed" together) to find documents containing all search strings. Other embodiments allow any Boolean combination of any number of search strings, an example of which is shown in FIG. 10. Using the example above, Input Box 1001(a) has a text string value of "IBM" and Entity Type Box 1001(b) has an entity type value of company. Other optional search qualifiers may include a date range, a collection type, methods of sorting the resulting matches ("hits") and the number of hits to be returned.
In another embodiment, the entity-based search can include a full-text search engine to search across the multiple collections not only based on entities, but also in conjunction with a full-text search. For efficiency, the collections should be indexed prior to the search. Thus, in this embodiment a query can consist of not only a search for entity types, but also for text strings. The DocumentID from Document Table 506 can be used by both the full-text and entity searches to identify the documents found.
Next, in steps S619 and S620, Computer 101 normalizes the search string and looks up the normalized search string against the normalized index entries in Index Table 509. In the preferred embodiment the search string can have multiple normalized forms depending on the entity type. For example, if a document contains the term "Sun microsystems" of the company entity type, that phrase is normalized to both "Sun Microsystems" and "Sun" and stored as two entries in Index Table 509.
If the search string is "Sun microsystems", then there will only be one normal form, "Sun Microsystems" (using the normalization methods described with respect to step S615), to search for in Index Table 509. Each record with an entry in the field IndexWord in Index Table 509 that matches the "Sun Microsystems" normalized form will be selected. Records with the index term "Sun" will, however, not be matched. In contrast, if the search string was "Sun", then the result of the search will match records which contain the term "Sun" and may include records with the index term "Sun Microsystems." If the search string was of entity type person, then there may be multiple normalized forms to search. For example, the search term "Mike Willis" can be normalized into three search terms: "Willis", "Willis/Michael", and "Willis/M." Of course, other ways of generating and storing normal forms can be used. Any record with an index term matching any of these three normalized search forms will be selected. These index records identify the relevant documents. The group of matched documents is called a match set. It is of course possible to search the collections without the use of Index Table 509 by normalizing the field values and query at runtime, i.e., during the search. This will, however, significantly impact the performance of the search and is generally tenable only for very small collections or groups of collections.
In step S622, Computer 101 generates a match score for each entry in the match set. Because there may not be an exact match between matching entities, matches are graded according to their match score. Take, as an example, the search string "Bob Smith", which normalizes to "Smith/Robert." This string matches "Smith/Robert" very well, "Smith/R" not as well, and "Smith/Susan" not at all.
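The lookup of steps S619 and S620 for a person search string might be sketched in Python as follows, using exact matching of normal forms only. The tuple layout of the index rows, the synonym table and the function names are assumptions; the sample rows are invented apart from the normal forms quoted in the text.

```python
from collections import defaultdict

FIRST_NAME_SYNONYMS = {"mike": "Michael", "bob": "Robert", "dick": "Richard"}

# Hypothetical rows of Index Table 509: (entity type, IndexWord, DocumentID).
INDEX_ROWS = [
    ("person", "Smith/Richard", 1),
    ("person", "Willis/R", 2),
    ("person", "Willis/Michael", 2),
    ("company", "Sun Microsystems", 3),
    ("company", "Sun", 3),
]


def build_index(rows):
    """Group DocumentIDs by (entity type, normal form) for direct lookup."""
    index = defaultdict(set)
    for entity_type, index_word, document_id in rows:
        index[(entity_type, index_word)].add(document_id)
    return index


def person_query_forms(search_string):
    """Step S619: derive the normalized search forms for a person name."""
    words = search_string.split()
    if not words:
        return []
    last = words[-1].capitalize()
    forms = [last]
    if len(words) > 1:
        first = FIRST_NAME_SYNONYMS.get(words[0].lower(), words[0].capitalize())
        forms += [f"{last}/{first}", f"{last}/{first[0]}"]
    return forms


def lookup(index, entity_type, forms):
    """Step S620: collect every document whose index term equals any search form."""
    match_set = set()
    for form in forms:
        match_set |= index.get((entity_type, form), set())
    return match_set
```

Because the index is keyed by normal form, a query costs one dictionary lookup per search form, which is the performance advantage over normalizing field values at runtime.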
Moreover, because an entity can be referred to in many different formats, generating a match score requires the application of different techniques and weighting factors and can be type dependent. For instance, given a person entity type, a string "Bob Smith" is a variation of, and would match well, another string "Smith, Bob." However, given a company entity type, a text string "Apple Computer" is not a variation for the string "Computer, Apple." As another example, a mismatch in a company name of "Co." instead of "Corp." is generally a minor variation. However, for a person, a mismatch of "Mr." and "Ms." is a significant mismatch.
To generate a match score for strings of a company entity type, Computer 101 compares, on a word-by-word basis, each word in the normalized search string against the value in the FieldValue field of Index Table 509. Words appearing in one string and not the other reduce the match score by an amount proportional to the total number of words. Entity-type specific methods penalize a match less if a general word, such as "Company", is not matched. If the entity type is person, then a match on the normalized forms of the names may be sufficient. In addition, the preferred embodiment can use stop words (a list of words and phrases maintained by the system that are to be ignored when generating a match score) to refine the match score. The present invention is not limited to the described weighting and scoring methods. These and other methods are well known in the art and can also be used without departing from the scope or spirit of the present invention.
In the preferred embodiment, Computer 101 generally relies on the normalization to be strong enough such that a match on the normal form is sufficient without requiring further word-by-word matching. In this case, Computer 101 may assign a high match score (e.g., a "1" out of a number from 0 to 1) to the match.
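A word-by-word company match score of the kind described above might look like the following Python sketch. The word lists, the reduced weight for general words, and the exact penalty formula are assumptions chosen for illustration; the specification does not fix them.

```python
# Assumed lists and weight; the specification does not give concrete values.
GENERAL_WORDS = {"company", "co", "corp", "corporation", "inc"}
STOP_WORDS = {"the", "of", "and"}
GENERAL_WORD_WEIGHT = 0.25  # general words are penalized less when unmatched


def _words(text):
    """Lower-case the words, drop trailing punctuation and stop words."""
    cleaned = {w.strip(".,").lower() for w in text.split()}
    cleaned.discard("")
    return cleaned - STOP_WORDS


def company_match_score(query, field_value):
    """Score in [0, 1]; words found in only one string reduce the score
    in proportion to the total number of distinct words."""
    q, f = _words(query), _words(field_value)
    all_words = q | f
    if not all_words:
        return 0.0
    unmatched = q ^ f  # words present in exactly one of the two strings
    penalty = sum(
        GENERAL_WORD_WEIGHT if w in GENERAL_WORDS else 1.0 for w in unmatched
    )
    return max(0.0, 1.0 - penalty / len(all_words))
```

With these assumed weights, "Apple Computer Co." against "Apple Computer Corp." scores higher than "Apple Computer" against "Apple Records", reflecting that a "Co."/"Corp." mismatch is a minor variation.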
If normalization is weak, for instance, where no synonyms were found for a company during normalization, and thus only the first significant word was used, then a match score may be used to determine if a match is strong enough to be reported as a hit.
After the match scores are generated, Computer 101 proceeds to step S623 where the match scores are processed. Usually, a minimum threshold is established so that matches with a match score below that threshold are not considered. If optional collection types and/or date ranges, etc. are used, then matches not fitting these specified options are eliminated from the match set. At step S624, Computer 101 then sorts, if desired, the match sets, and formats the results to be presented to the user as a hit list. Examples of sorting include sorting by collection type, collection name, entity role, date, etc. If a boolean combination of several search strings was used, then a match set for each search string would be generated (steps S619 through S623 may be repeated for each search string) and Computer 101 would combine the sets (taking the union for an OR search, or the intersection for an AND search), and eliminate duplicates to generate the hit list. Flow then proceeds to step S625 to display the hit list to the user.
Another aspect of the present invention is the ability to obtain a profile of an entity based on the index entries related to the entity. A profile provides a variety of useful information about an entity, and this information can be drawn from various documents in different collections. For example, a profile of a company could include the company's SIC code, net income, total assets, net revenues and who it is a client of. A profile of a person could include that person's phone number, clients and background.
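The thresholding and boolean combination of steps S623 and S624 can be illustrated with a short Python sketch. It assumes each match set is a mapping from DocumentID to match score; the default threshold of 0.5 and the choice to keep a document's highest score when sets are combined are assumptions, not details from the specification.

```python
def apply_threshold(match_set, minimum=0.5):
    """Drop matches whose score falls below the minimum threshold."""
    return {doc: score for doc, score in match_set.items() if score >= minimum}


def combine(match_sets, operator="AND"):
    """Intersect (AND) or unite (OR) the per-string match sets.

    Keying by DocumentID eliminates duplicates; each surviving document
    keeps its highest score (an assumed tie-breaking rule).
    """
    if not match_sets:
        return {}
    keys = [set(m) for m in match_sets]
    docs = set.intersection(*keys) if operator == "AND" else set.union(*keys)
    return {d: max(m.get(d, 0.0) for m in match_sets) for d in docs}


def hit_list(combined):
    """Sort DocumentIDs by descending score, then by DocumentID."""
    return sorted(combined, key=lambda d: (-combined[d], d))
```

For example, with one match set per search string, `hit_list(combine([apply_threshold(a), apply_threshold(b)], "OR"))` would give the ordered hit list for an OR query.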
To compile this information, the present invention uses the profiled entity's entity types and, optionally, entity roles, along with predetermined relations between selected collection fields to extract profile information from selected fields of documents in various collections. In the preferred embodiment, the values of fields containing profile information are stored in Index Table 509, making it unnecessary to have access to the original collections to retrieve those values. Key to this profiling is the mapping of a "domain" to a "range", i.e., mapping from one type of thing (the domain) to another type of thing (the range).
To illustrate this concept, referring now to FIG. 7i, there are shown sample records from selected fields of Relation Table 510. Referring to the first two relations shown, the SIC Code and the Work Phone relations, identified as RelationID = 1 and 2, respectively, the SIC Code relation record (RelationID = 1) has in its Domain field the SymbolID referring to the company entity type (SymbolID = 2) and in its Range field the SymbolID referring to the background entity type (SymbolID = 26). Thus, this relation maps the company entity type to the background entity type -- that is, if a profile of a particular company (an entity of type company) was requested, the system could also retrieve documents in a collection with fields assigned to the background entity type. The second example, the relation with RelationID = 2, describes a mapping where a requested profile of a person (Domain = 1) would also retrieve the phone number for that person (Range = 27). In order for the relations set forth in Relation Table 510 to work, there must also be a mapping of a relation's Domain and Range to the specific collections and fields which contain the desired profile information. FIG.
7j shows sample records of RelationMapping Table 511 which illustrate this mapping of a relation's Domain and Range to pairs of fields, the DomainField and RangeField, respectively, of a collection. The first record (RelationMappingID = 1) identifies that the relation being mapped in this record is the "Work Phone" relation (RelationID = 2), with respect to Experts Collection 301 (CollectionID = 3), where the DomainField is the Name field of Experts Collection 301 (DomainField = 7), and the RangeField is the Number field (RangeField = 8) of Experts Collection 301 from which is extracted the Work Phone profile information. In practice, there will be one record for each relation mapped to a pair of fields in a collection, and possibly many relation mapped field pairs for a given profile. For instance, in addition to the relation mapping given above for the Work Phone, RelationMapping Table 511 also has a relation map for an expert's expertise identified by RelationMapID = 3. This relation map identifies that the relation being mapped is the "Expertise" relation (RelationID = 5), with respect to Experts Collection 301 (CollectionID = 3), where the DomainField is the Name field of Experts Collection 301 (DomainField = 7), and the RangeField is the Expertise field (RangeField = 9) of Experts Collection 301 from which is extracted the Expertise profile information. An example of such a profile is illustrated in FIG. 11. It is also possible that for any given relation, there are several RelationMapping Table 511 entries, thereby obtaining additional profile information from, for instance, the same collection or different collections. An example of this situation would be a profile, similar to that described with respect to FIG. 11, with several RelationMapping Table 511 records that extract the Work Phone number from different collections. 
In the preferred embodiment, when a new record is added to RelationMapping Table 511, Computer 101 confirms that the DomainField and RangeField fields are of the entity types specified by the Domain and Range fields of Relation Table 510, respectively.
FIG. 11 shows a simplified screen for the profile data displayed by Computer 101 for a person. The relations that should be part of a given profile are stored as a simple list of RelationIDs. In this example, the profile list of an entity of type person contains the relations specified by RelationID = 2 and 5. Assuming that the user requested a profile on a person entity type, namely "Catherine Baudin" as shown in FIG. 11, her work phone would then be retrieved from a field with the phone entity type, as specified in the Range field of Relation Table 510 (RelationID = 2), and her expertise would be retrieved from a field with the skill entity type (RelationID = 5).
To extract the profile information, Computer 101 first conducts a search of Index Table 509 in a manner as previously described. The result is a first match set of records where the normalized values of the search string match the normalized values stored in Index Table 509 and have the same entity type as the search string. Next, Computer 101 filters out those records from Index Table 509 whose FieldID value is not a member of the DomainField set. The members of the DomainField set consist of the DomainField FieldID values given by RelationMapping Table 511 for the desired profile. Each element in the DomainField set also has associated with it a RangeField set, where the members of the RangeField set consist of the RangeField FieldID values for a given DomainField FieldID value, or element. In the instant example, the profile for a person entity type has two RelationMapIDs, 1 and 3, and thus the DomainField set has only one element, the FieldID value 7. The RangeField set for this element consists of two elements, the FieldID values 8 and 9.
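The construction of the DomainField set and its associated RangeField sets can be sketched in Python. The two rows below use the RelationMapping values given in the text for the Work Phone and Expertise relations; the dictionary layout and the function name are illustrative assumptions.

```python
from collections import defaultdict

# RelationMapping Table 511 rows for the person profile, as given in the text.
RELATION_MAPPING = [
    {"RelationMappingID": 1, "RelationID": 2, "CollectionID": 3,
     "DomainField": 7, "RangeField": 8},   # Work Phone: Name -> Number
    {"RelationMappingID": 3, "RelationID": 5, "CollectionID": 3,
     "DomainField": 7, "RangeField": 9},   # Expertise: Name -> Expertise
]


def field_sets(profile_relation_ids, relation_mapping):
    """Map each DomainField FieldID to its set of RangeField FieldIDs.

    The keys of the result form the DomainField set; each value is the
    RangeField set associated with that element.
    """
    range_fields = defaultdict(set)
    for row in relation_mapping:
        if row["RelationID"] in profile_relation_ids:
            range_fields[row["DomainField"]].add(row["RangeField"])
    return dict(range_fields)
```

For the person profile list (RelationID = 2 and 5), this yields a single DomainField element, 7, whose RangeField set is {8, 9}.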
Thus, continuing with the profile example, the remaining records retrieved from Index Table 509 will have matched a normalized form of "Catherine Baudin" and have a FieldID value in the DomainField set (i.e., 7) to form a second match set. For each record in the second match set, Computer 101 retrieves those records from Index Table 509 which also match that second match set record's DocumentID value and have FieldID values that are a member of the RangeField set associated with that second match set record's FieldID value, to form a third match set. (The FieldID value at this point is necessarily an element in the DomainField set.) In the example, the FieldID value of each record in the second match set is 7 (since there is only one element in the DomainField set), and thus, the RangeField set is comprised of FieldID values 8 and 9. Computer 101 therefore only retrieves those records from Index Table 509 whose DocumentID matches the second match set record's DocumentID value and whose FieldID is 8 or 9. The records from the third match set thus contain the desired profile values, which can then be further manipulated, formatted and output for display as shown in FIG. 11.
It is understood that the steps described above are presented separately for ease of illustration. For example, Computer 101 can combine the steps creating the first and second match sets to instead conduct an entity-based search in Index Table 509 for the normalized form(s) of "Catherine Baudin", with the person entity type and a FieldID value in the set of DomainField FieldID values, and use the resulting set of records as the second match set for later steps.
In another embodiment, rather than extracting the profile information during run-time, separate tables are instead created for each type of profile information during the indexing of a collection.
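The three match sets described above can be traced with a small Python sketch. The index rows are invented placeholder data, the record layout is an assumption, and FieldID 12 stands for some hypothetical field outside the DomainField set; only the FieldID values 7, 8 and 9 come from the example in the text.

```python
# Invented sample rows standing in for Index Table 509.
INDEX = [
    {"DocumentID": 10, "FieldID": 7, "EntityType": "person",
     "IndexWord": "Baudin/Catherine"},
    {"DocumentID": 10, "FieldID": 8, "EntityType": "phone",
     "IndexWord": "555-0100"},
    {"DocumentID": 10, "FieldID": 9, "EntityType": "skill",
     "IndexWord": "Information Retrieval"},
    {"DocumentID": 11, "FieldID": 12, "EntityType": "person",
     "IndexWord": "Baudin/Catherine"},
    {"DocumentID": 11, "FieldID": 8, "EntityType": "phone",
     "IndexWord": "555-0199"},
]


def extract_profile(index, normal_forms, entity_type, range_fields_by_domain):
    """Return the third match set: range-field records for the entity."""
    # First match set: same entity type and a matching normal form.
    first = [r for r in index
             if r["EntityType"] == entity_type and r["IndexWord"] in normal_forms]
    # Second match set: keep only records whose FieldID is a DomainField.
    second = [r for r in first if r["FieldID"] in range_fields_by_domain]
    # Third match set: same document, FieldID in the associated RangeField set.
    third = []
    for hit in second:
        wanted = range_fields_by_domain[hit["FieldID"]]
        third.extend(r for r in index
                     if r["DocumentID"] == hit["DocumentID"]
                     and r["FieldID"] in wanted)
    return third
```

In this sample, the record in document 11 is discarded at the second step because its FieldID is not in the DomainField set, so only the phone and expertise values from document 10 reach the profile.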
For instance, separate Work Phone and Expertise Tables could be created based on the RelationMapping Table 511 when Experts Collection 301 is indexed. The Work Phone Table would, for instance, contain a person's name and phone number for each record. Profile information extraction would then search the Work Phone Table for the desired entity reference, thereby increasing the profile information extraction speeds, but at a cost of requiring additional tables for data storage.
FIG. 12 depicts an example of a profile for a company which includes the company's SIC Code. In addition to the SIC Code, the company profile also contains an "inverse" profile relation. An inverse profile relation uses another existing relation and indicates to Computer 101 to switch that relation's DomainField values with the RangeField values during the search and filtering steps. For instance, the inverse profile information in FIG. 12 is where the company is identified as the "client of" a person or persons. This "inverse" relation (RelationID = 4) is identified by having a value in the Inverse field of Relation Table 510, which points to another relation, in this case, the Client relation (RelationID = 3). As shown, the Client relation has a Domain of a person entity type and a Range of a company entity type. The related record in RelationMapping Table 511 (RelationMappingID = 2) identifies that the DomainField and RangeField are the PrtnrName field and Client field, respectively, of Accounting Table 201. A company profile, however, has a Domain of the company entity type, and thus, the inverse relation indicates that the Client relation of Relation Table 510 should be used, but that the DomainField values and RangeField values of the associated RelationMapping Table 511 record should be exchanged for use in the previously described profile search and filtering process of Index Table 509.
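The handling of an inverse relation can be sketched as a lookup that swaps the DomainField and RangeField of the underlying relation's mappings. In this Python illustration the field names PrtnrName and Client are used in place of numeric FieldIDs, which the text does not give, and the dictionary layout and function name are assumptions.

```python
# Relation Table 510 (selected fields); relation 4 points at relation 3.
RELATIONS = {
    3: {"Name": "Client", "Inverse": None},
    4: {"Name": "Client of", "Inverse": 3},
}

# RelationMapping Table 511 record for the Client relation.
RELATION_MAPPING = [
    {"RelationMappingID": 2, "RelationID": 3,
     "DomainField": "PrtnrName", "RangeField": "Client"},
]


def field_pairs(relation_id, relations, relation_mapping):
    """Return (DomainField, RangeField) pairs for a relation.

    An inverse relation reuses the mappings of the relation it points to,
    with the two fields exchanged, so it needs no mapping rows of its own.
    """
    inverse_of = relations[relation_id]["Inverse"]
    base_id = relation_id if inverse_of is None else inverse_of
    pairs = [(m["DomainField"], m["RangeField"])
             for m in relation_mapping if m["RelationID"] == base_id]
    if inverse_of is not None:
        pairs = [(rng, dom) for dom, rng in pairs]
    return pairs
```

For the company profile, `field_pairs(4, RELATIONS, RELATION_MAPPING)` gives Client as the domain field and PrtnrName as the range field, which is the exchanged pair used in the search and filtering of Index Table 509.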
The partner information from Index Table 509 is subsequently displayed, i.e., the partners' names whose client is the company being profiled. This feature provides the added advantage that inverse relations do not require separate entries in RelationMapping Table 511.
Yet another aspect of the invention is the ability to conduct boolean searches using both a full-text search and an entity-based search. In this embodiment, Computer 101 conducts both full-text and entity-based searches and combines the search results according to the desired boolean criteria. This type of search could be used, for instance, in a profile search where the full-text search was also conducted to provide information about documents which mention the entity being profiled.
Yet another aspect of the invention is the ability to obtain cross-references to other entities by analyzing the results of an entity-based search on a first entity and then finding references to other entities in the documents identified by the first entity-based search. This provides a means to obtain information about what other entities the entity being searched is associated or related with. These references can then be used, if desired, as the basis for new searches.
In the manner described above, the present invention thus provides a system and method to search multiple databases in a focused and efficient manner. While this invention has been described with reference to the preferred embodiments, collections and sample data, other information may also be stored in all described databases and collections, if desired, and other modifications will become apparent to those skilled in the art by study of the specification and drawings. It is thus intended that the following appended claims include such modifications as fall within the spirit and scope of the present invention. What we claim is:

Claims

1. A method for creating an entity-based search for searching across a plurality of databases by classifying data into entity types, comprising: selecting a database from the plurality of databases wherein said database comprises a field; selecting said field; assigning an entity type to said selected field; and creating an index of assigned said entity type of said selected field so that a search across the plurality of databases based upon said entity type can be accomplished.
2. A method according to claim 1 wherein said assigning step includes using said selected field's name to determine said entity type to assign to said selected field.
3. A method according to claims 1 or 2 wherein said assigning step includes using a set of sample data values from said selected field to determine said entity type to assign to said selected field.
4. A method according to claim 1 wherein said assigning step assigns an entity role to said selected field.
5. A method according to claim 1 wherein said assigning step includes analyzing said entity type and assigning a database type to said database and wherein said creating step includes creating an index of said database type assigned to said database so that a search based upon said database type can be accomplished.
6. A method according to claim 4 wherein said assigning step includes analyzing said entity type and said entity role and assigning a database type to said database and wherein said creating step includes creating an index of said database type assigned to said database so that a search based upon said database type can be accomplished.
7. A method creating an entity-based search across a plurality of databases by classifying data into entity types, comprising: assigning a first entity type to a field in a database; assigning a second entity type to a query; searching said database for said query wherein said search uses said first and second entity types to determine a match and generating a result; and outputting the result of said search.
8. A method creating an entity-based search across a plurality of databases by classifying data into entity types, comprising: assigning a first entity type to a field in a database; assigning a second entity type to a query; searching an index for said query wherein said search uses said first and second entity types to determine a match and generating a result; and outputting the result of said search.
9. A method according to claims 7 or 8 wherein said search uses a first normalized value from said field and a second normalized value from said query to determine a match.
10. A method according to claim 9 wherein said first and second normalized values are entity type dependent.
11. A method according to claim 7 wherein said search uses a relation to obtain profile information from said document identified in said search wherein such profile information is output as a profile for said entity.
12. A method according to claim 8 wherein said search uses a relation to obtain profile information from said index wherein such profile information is output as a profile for said entity.
13. A system for creating an entity-based search for searching across a plurality of databases by classifying data into entity types, comprising: means for selecting a database from the plurality of databases wherein said database comprises a field; means for selecting said field; means for assigning an entity type to said selected field; and means for creating an index of assigned said entity type of said selected field so that a search across the plurality of databases based upon said entity type can be accomplished.
14. A system according to claim 13 wherein said means for assigning includes means for using said selected field's name to determine said entity type to assign to said selected field.
15. A system according to claims 13 or 14 wherein said means for assigning includes means for using a set of sample data values from said selected field to determine said entity type to assign to said selected field.
16. A system according to claim 13 wherein said means for assigning assigns an entity role to said selected field.
17. A system according to claim 13 wherein said means for assigning includes means for analyzing said entity type and means for assigning a database type to said database and wherein said means for creating includes means for creating an index of said database type assigned to said database so that a search based upon said database type can be accomplished.
18. A system according to claim 16 wherein said means for assigning includes means for analyzing said entity type and said entity role and means for assigning a database type to said database and wherein said means for creating includes means for creating an index of said database type assigned to said database so that a search based upon said database type can be accomplished.
19. A system for creating an entity-based search across a plurality of databases by classifying data into entity types, comprising: means for assigning a first entity type to a field in a database; means for assigning a second entity type to a query; means for searching said database for said query wherein said search uses said first and second entity types to determine a match and generating a result; and means for outputting the result of said search.
20. A system for creating an entity-based search across a plurality of databases by classifying data into entity types, comprising: means for assigning a first entity type to a field in a database; means for assigning a second entity type to a query; means for searching an index for said query wherein said search uses said first and second entity types to determine a match and generating a result; and means for outputting the result of said search.
21. A system according to claims 19 or 20 wherein said search uses a first normalized value from said field and a second normalized value from said query to determine a match.
22. A system according to claim 21 wherein said first and second normalized values are entity type dependent.
23. A system according to claim 19 wherein said search uses a relation to obtain profile information from said document identified in said search wherein such profile information is output as a profile for said entity.
24. A system according to claim 20 wherein said search uses a relation to obtain profile information from said index wherein such profile information is output as a profile for said entity.
PCT/US1998/008243 1997-04-25 1998-04-23 System and method for entity-based data retrieval WO1998049632A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU72557/98A AU7255798A (en) 1997-04-25 1998-04-23 System and method for entity-based data retrieval

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US84469897A 1997-04-25 1997-04-25
US08/844,698 1997-04-25

Publications (1)

Publication Number Publication Date
WO1998049632A1 true WO1998049632A1 (en) 1998-11-05

Family

ID=25293409

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/008243 WO1998049632A1 (en) 1997-04-25 1998-04-23 System and method for entity-based data retrieval

Country Status (2)

Country Link
AU (1) AU7255798A (en)
WO (1) WO1998049632A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001063483A2 (en) * 2000-02-25 2001-08-30 Bellis Joseph L De Search-on-the-fly/sort-on-the-fly search engine
GB2362004A (en) * 2000-04-19 2001-11-07 Glenn Courtney Smith Data object matching using a classification index
EP1258814A1 (en) * 2001-05-17 2002-11-20 Requisite Technology Inc. Method and apparatus for analyzing the quality of the content of a database
US6631365B1 (en) 2000-03-14 2003-10-07 Requisite Technology, Inc. Method and apparatus for analyzing the quality of the content of a database
US6980984B1 (en) 2001-05-16 2005-12-27 Kanisa, Inc. Content provider systems and methods using structured data
EP1749258A2 (en) * 2004-04-21 2007-02-07 Telcordia Technologies, Inc. Two-stage data validation and mapping for database access
AU785458B2 (en) * 2001-05-16 2007-07-12 Requisite Technology Inc. Method and apparatus for analyzing the quality of the content of a database
US7734680B1 (en) 1999-09-30 2010-06-08 Koninklijke Philips Electronics N.V. Method and apparatus for realizing personalized information from multiple information sources
US7769757B2 (en) 2001-08-13 2010-08-03 Xerox Corporation System for automatically generating queries
US7941446B2 (en) 2001-08-13 2011-05-10 Xerox Corporation System with user directed enrichment
US9477726B2 (en) 2000-02-25 2016-10-25 Vilox Technologies, Llc Search-on-the-fly/sort-on-the-fly search engine for searching databases
US11011256B2 (en) * 2015-04-26 2021-05-18 Inovalon, Inc. System and method for providing an on-demand real-time patient-specific data analysis computing platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5315709A (en) * 1990-12-03 1994-05-24 Bachman Information Systems, Inc. Method and apparatus for transforming objects in data models
GB2294134A (en) * 1994-10-13 1996-04-17 Edward Lea Accessing computer databases

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KATHI HOGSHEAD DAVIS, ADARSH K. ARORA: "A Methology for Translating a Conventional File System into an Entity-Relationship Model", THE 4TH INTERNATIONAL CONFERENCE ON ENTITY-RELATIONSHIP APPROACH, 28 October 1985 (1985-10-28) - 30 October 1985 (1985-10-30), Chicago, Illinois, pages 148 - 159, XP002073739 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7734680B1 (en) 1999-09-30 2010-06-08 Koninklijke Philips Electronics N.V. Method and apparatus for realizing personalized information from multiple information sources
WO2001063483A3 (en) * 2000-02-25 2004-02-26 Bellis Joseph L De Search-on-the-fly/sort-on-the-fly search engine
US9477726B2 (en) 2000-02-25 2016-10-25 Vilox Technologies, Llc Search-on-the-fly/sort-on-the-fly search engine for searching databases
US9507835B2 (en) 2000-02-25 2016-11-29 Vilox Technologies, Llc Search-on-the-fly/sort-on-the-fly search engine for searching databases
WO2001063483A2 (en) * 2000-02-25 2001-08-30 Bellis Joseph L De Search-on-the-fly/sort-on-the-fly search engine
US6631365B1 (en) 2000-03-14 2003-10-07 Requisite Technology, Inc. Method and apparatus for analyzing the quality of the content of a database
GB2362004A (en) * 2000-04-19 2001-11-07 Glenn Courtney Smith Data object matching using a classification index
US6980984B1 (en) 2001-05-16 2005-12-27 Kanisa, Inc. Content provider systems and methods using structured data
AU785458B2 (en) * 2001-05-16 2007-07-12 Requisite Technology Inc. Method and apparatus for analyzing the quality of the content of a database
EP1258814A1 (en) * 2001-05-17 2002-11-20 Requisite Technology Inc. Method and apparatus for analyzing the quality of the content of a database
US7941446B2 (en) 2001-08-13 2011-05-10 Xerox Corporation System with user directed enrichment
US7769757B2 (en) 2001-08-13 2010-08-03 Xerox Corporation System for automatically generating queries
US8219557B2 (en) 2001-08-13 2012-07-10 Xerox Corporation System for automatically generating queries
US8239413B2 (en) 2001-08-13 2012-08-07 Xerox Corporation System with user directed enrichment
EP1749258A2 (en) * 2004-04-21 2007-02-07 Telcordia Technologies, Inc. Two-stage data validation and mapping for database access
US8346794B2 (en) 2004-04-21 2013-01-01 Tti Inventions C Llc Method and apparatus for querying target databases using reference database records by applying a set of reference-based mapping rules for matching input data queries from one of the plurality of sources
EP1749258A4 (en) * 2004-04-21 2011-12-28 Telcordia Licensing Company Llc Two-stage data validation and mapping for database access
US11823777B2 (en) 2015-04-26 2023-11-21 Inovalon, Inc. System and method for providing an on-demand real-time patient-specific data analysis computing platform
US11011256B2 (en) * 2015-04-26 2021-05-18 Inovalon, Inc. System and method for providing an on-demand real-time patient-specific data analysis computing platform

Also Published As

Publication number Publication date
AU7255798A (en) 1998-11-24

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 1998547145

Format of ref document f/p: F

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA