COMPUTER THESAURUS MANAGER
Related Applications:
This application claims the benefit of U.S. Provisional Application No. 60/063,136, filed on October 22, 1997.
Field of the Invention:
This invention relates to computerized information management and, more particularly, to a system and method of managing multiple thesaurus subsets in an integrated manner.
Description of the Related Art:
A thesaurus is a collection of information about a particular field. In its most common usage, "thesaurus" is understood to mean an organization of words according to the concepts they convey, providing synonyms, hierarchical relationships, and the like. Many large organizations, such as multinational corporations, make extensive use of thesauri to facilitate use of a common language within the organization. Oftentimes, such thesauri are like glossaries, providing explanations for acronyms and defining terms of art as they are used within the organization.
As organizations increase in size and as the subject matters they deal with become increasingly complex, the need for thesaurus management tools becomes acute.
Organizations that reach a threshold size find it impractical to maintain a single large thesaurus, as such a thesaurus would contain relatively few terms of interest for any particular group within the organization. For example, the organization's financial managers would have little need to reference terms that may be in daily use by the organization's research and development team.
Accordingly, it has become relatively common for larger organizations to create thesauri at a group-by-group rather than overall level. This provides more pertinent thesauri for each group, but presents a problem where terms are of interest to multiple groups within
the organization. For example, both the financial and the research and development groups of an organization may be interested in terms representing a component that is proposed to be used in a new product. As the number of overlapping entries grows with the number of groups using thesauri, enterprise-wide thesaurus management quickly becomes unwieldy. Consider, for instance, a multinational corporation with thousands of subgroups, a hundred of which reference the same term in their thesauri. If a new meaning for that term arises, or if a new relationship develops with other terms, that new information should somehow be conveyed into each group's particular thesaurus.
Automated thesaurus management has been addressed by a number of commentators. See, e.g., Mandel, Carol A., "Multiple Thesauri in Online Library Bibliographic Systems," United States Library of Congress 1987; Giles-Peters, A., "Experiments in the Mechanical Construction of Cross-Database Thesauri," Proceedings of the 17th International Online Information Meeting, London, December 7-9, 1993, Raitt, D.I. and Jeabes, B. eds., Learned Information (Europe) Ltd, 1993; Silvester, June P. and Klingbiel, Paul H., "An Operational System for Subject Switching Between Controlled Vocabularies (Development of the NASA Lexical Dictionary)," Information Processing & Management (ISSN 0306-4573), v.29 p47-59 January-February 1996; Roulin, Corentin, "Towards the European Education Metathesaurus," Commission of the European Communities, Brussels 1990.
Indeed, international standards have been promulgated to specifically deal with the issue of thesaurus establishment and development. See, e.g., International Organization for Standardization, Ref. No. ISO 2788-1986 (E).
Numerous commercial products have been developed for thesaurus management. The IBM® Thesaurus Administrator/2 and Thesaurus End User System Toolkit/2 provide a computerized system for acquiring, creating, using, and maintaining terminological data thesauri. Products such as this commonly maintain hierarchical relationships among terms, such as broader term, narrower term, part, and instance relationships. These relationships can often be helpful to users. For example, such products inform the user that "house" is a narrower term for "building," or that a "lens" is a part of a "camera." Other such products
include Lexico/2™ from Project Management, Inc., Bethesda, MD; MultiTes from MultiSystems, Miami, FL; TCS [Thesaurus Construction System] from Liu-Palmer, Los Angeles, CA; BiblioTech® PRO from Comstow Information Services, Inc., Harvard, MA; and numerous others. One problem not adequately addressed by any of the known systems is how to efficiently manage multiple thesaurus subsets in an integrated manner, i.e., so that the subset(s) from which information is obtained are essentially transparent to the user.
SUMMARY OF THE INVENTION
In accordance with the present invention, a system for integrated access to information from a number of separate thesaurus subsets includes an input subsystem for accepting a thesaurus request; a processing subsystem that collects, from the subsets, entries that correspond to the request, treating the subsets as a single integrated thesaurus; and an output subsystem for displaying the retrieved entries.
In another aspect of the invention, the system accepts as input information about requested changes and modifications pertaining to a thesaurus request, and implements those changes in a number of thesaurus subsets. In still another aspect of the invention, a request for a change is implemented as a single, integrated request, and the performance of corresponding changes to multiple thesaurus subsets appears to the user as a single, integrated action. Also in accordance with the invention, the processing subsystem integrates the thesaurus subsets as the single integrated thesaurus using concepts to establish relationships among terms in the thesaurus subsets. Furthermore, the processing subsystem addresses the thesaurus subsystems so as to provide integration thereof as the single integrated thesaurus. In one embodiment, the integration is performed without actually merging the thesaurus subsets. Alternatively, the processing subsystem merges the thesaurus subsets to form the single integrated thesaurus.
The processing system may also be configured to either designate, or to receive as input a designation, of the thesaurus subsets that are active thesauri, such that the processing
system will only collect the select entries from the active thesauri. In an additional aspect of the invention, the output subsystem is configured to display the select entries together, irrespective of the subset from which each of the select entries is retrieved. Alternatively, the output subsystem may be configured to display, along with the select entries, an indication of the subset from which each of the select entries is retrieved.
In another aspect of the invention, the processing subsystem further comprises a correlator configured to establish the equality of meaning of terms or concepts across the thesaurus subsets. The correlator determines the meaning of a term with reference to the other terms to which the term being correlated is related. Still also in accordance with the invention, an update processing system is configured to apply an integrity constraint that serves to define a constraint on the relations between thesaurus terms, wherein the input subsystem is operatively coupled to the update processing system so as to accept the integrity constraint.
In one embodiment, the integrity constraints include at least one of the set of: specifying whether a relationship is one-to-one; specifying whether the relationship is one-to-many; specifying whether the relationship is many-to-one; specifying whether a relationship is many-to-many; specifying whether a relationship is transitive; specifying whether the relationship is symmetric; specifying whether the relationship is reflexive; specifying whether the relationship is irreflexive; specifying whether the relationship is among items each of which has corresponding preferred terms.
In a further related aspect of the invention, the subsystem is configured to detect violations of integrity constraints in thesaurus information. In response to detection of a violation of the integrity constraints, the output subsystem is optionally configured to perform at least one of the set of: notifying the user of the violation; executing a correction of the violation; taking responsive action to the violation.
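By way of illustration only, the detection of integrity-constraint violations may be sketched as follows. The sketch assumes that a relationship is represented as a set of (term, term) pairs; that representation, the function names, and the sample data are hypothetical and are not taken from this specification.

```python
from itertools import product

def is_symmetric(pairs):
    """True if (b, a) is present whenever (a, b) is."""
    return all((b, a) in pairs for a, b in pairs)

def is_irreflexive(pairs):
    """True if no term is related to itself."""
    return all(a != b for a, b in pairs)

def is_many_to_one(pairs):
    """True if each left-hand term maps to at most one right-hand term."""
    seen = {}
    for a, b in pairs:
        if seen.setdefault(a, b) != b:
            return False
    return True

def transitivity_violations(pairs):
    """Return the (a, c) pairs implied by transitivity but not recorded."""
    missing = set()
    for (a, b), (c, d) in product(pairs, repeat=2):
        if b == c and (a, d) not in pairs:
            missing.add((a, d))
    return missing

# Hypothetical "broader term" facts, stored as (narrower, broader).
broader = {("house", "building"), ("building", "structure")}
violations = transitivity_violations(broader)
```

In this sketch a non-empty result from `transitivity_violations` is the kind of finding that could either be reported to the user or corrected automatically by adding the missing pairs.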
Also in accordance with the present invention, the processing subsystem is configured to define a preferred term and a non-preferred term for a concept, in response to signaling from the input subsystem; the processing subsystem being further configured to swap the
preferred term and the non-preferred term in response to further signaling from the input subsystem. The processing system may swap the preferred term and the non-preferred term across one or across a plurality of the thesaurus subsets. The processing system may also be configured to validate the changes across one or more thesaurus subsets.
In different aspects of the invention, corresponding methods, data structures, and computer-readable media are employed as described above.
The features and advantages described in the specification are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram showing a thesaurus manager in accordance with the present invention.
Figure 2 is a block diagram illustrating the knowledge base shown in Figure 1.
Figure 3 is a flow diagram illustrating processing in response to a user query, in accordance with the present invention.
Figure 4 is a flow diagram illustrating integrated maintenance of multiple thesaurus subsets, in accordance with the present invention.
Figure 5 is a flow diagram illustrating correlation processing, in accordance with the present invention.
Figure 6 is a flow diagram showing details of processing performed to identify correlation candidates.
DESCRIPTION OF A PREFERRED EMBODIMENT
The figures depict a preferred embodiment of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
Referring now to figure 1, there is shown a thesaurus manager 100 in accordance with the present invention. Thesaurus manager 100 is a preferred embodiment of a computer-implemented system for maintaining an enterprise-wide set of vocabularies for a large organization. For organizations that have operational groups exhibiting great functional and geographical diversity, each such operational group may maintain a thesaurus, or multiple thesauri, for providing the personnel of such group with a reference for the standard vocabulary used within the group. Necessarily, there will be a significant amount of interaction among these groups, with research and development personnel from one such group interfacing with product marketing personnel from another group. As one example, separate thesauri may be maintained for each of the following purposes: Research and Development, Product Literature, Marketing, Litigation, Manufacturing, and Regulatory Compliance. In addition, the organization may acquire external thesauri developed by others; these need to be integrated in the same manner as diverse internal thesaurus systems. The overall thesaurus of the organization may be considered a superset of these specialized thesauri, and it will be expected that many of the specialized thesauri will include similar concepts under either similar or different terms, and similar terms for both similar and different concepts. Thesaurus manager 100 operates to provide maintenance of terms across all such specialized thesauri in a single step, to permit thesauri to be viewed in an integrated manner that permits ready differentiation among inconsistent or otherwise different thesauri, to allow the selection of preferred terms to be used for particular concepts throughout the organization, and to provide a single tool and source of authority for managing the various subset thesauri as a single integrated thesaurus. In addition, because the Thesaurus Manager
allows different terms to be correlated to a particular concept, the Thesaurus Manager allows for the use of different terms throughout the organization, with the Thesaurus Manager providing the link to the same underlying concept. This implementation allows the organization to speak consistently about the same things without enforcing a single standard vocabulary across the organization.
In a preferred embodiment, the various subsystems illustrated in figures 1 and 2 are implemented by software controlling one or more general purpose computers, as described herein. In a preferred embodiment, such software is stored in a conventional data storage device (e.g., disk drive subsystem) and is conventionally transferred to random access memory of the computer(s) as needed for execution.
At the highest level, thesaurus manager 100 consists of system code 110 and a knowledge base 120. The knowledge base 120 serves as the store for thesaurus information and an underlying concept system, while the system code 110 handles all updating and retrieval of thesaurus information, all maintenance and utility functions, and all interaction with external entities.
Several types of entities external to thesaurus manager 100 may interact with it:
• Any authorized user 150 with access to a networked workstation running a world wide web browser or other html interface 140 may access thesaurus manager 100.
• Thesaurus data files 170 may be loaded into thesaurus manager 100, or may be created by outputting a thesaurus subset.
• External programs, e.g., 160, may connect to thesaurus manager 100 and retrieve thesaurus information.
• Finally, thesaurus manager 100 maintains a variety of external logs 180, which record user requests and actions.
In contrast to a traditional thesaurus model in which relationships are constrained to fairly rigid vertical (i.e., parent-child) or horizontal (i.e., sibling) relationships, thesaurus manager 100 supports a rich, expressive relation set and allows users to customize that set to their own needs. Also in contrast to the traditional thesaurus model, in which relationships
occur between terms only, thesaurus manager 100 maintains terms as lexical entries associated with underlying concepts defined or denoted in a logical conceptual language and stored in knowledge base 120, and it is between these underlying concepts that relationships are established. For example, if the organization using thesaurus manager 100 is a pharmaceuticals company, there may be a concept for AGENTS-WHICH-RELIEVE-PAIN and a concept for ORALLY-ADMINISTERED-AGENTS. Corresponding lexical entries might be "analgesic" and "tablet." Relationships connecting "analgesic" and "tablet" in this example are made in thesaurus manager 100 not between those terms per se, but rather between the underlying concepts. A benefit to such an approach is that it permits modeling of more complex types of relationships among concepts than would be possible if term relationships were relied upon. For instance, in addition to the traditional thesaurus relationships of "broader term," "narrower term," "related term," and "scope of term," many other conceptual relationships are available. Continuing with the pharmaceuticals example, relationships of trade names with generic names, trade names with countries of use, trade names with product names, and product names with indications are straightforward to implement and manage. Using an even more specific example, the concept of LAMIVUDINE is linked to the related concepts of EPIVIR, COMBIVIR and ANTI-VIRAL-AGENT. Each such concept is then available for reference and the most appropriate use as desired.
In adapting the traditional thesaurus model to thesaurus manager 100, the classical notions of "Preferred Term" and "Use For" are treated as two kinds of lexical entry that can be associated with a concept. Each concept can have only one "Preferred Term" within a thesaurus subset, but it may have several "Use For" terms.
The preferred embodiment provides an interface that equates the "Preferred Term" to the underlying concept, so that within a thesaurus subset one can use the preferred term string to uniquely identify a concept. The "Use For" terms are present in thesaurus manager 100 as simple strings associated with a Preferred Term/Concept. Thesaurus manager 100 provides the ability to introduce new lexical relationships in addition to "Use For". Lexical relationships record strings that can be
used within thesaurus manager 100 to search for a concept. Such search strings are either Preferred Terms, or Alternate Terms (strings recorded using Use For or any of the other lexical relations added by users). For example, user 150 might create and use these lexical relationships in addition to "Use For": "Former Term," to record out-of-use terms for a concept; "French Term," to record French language terms for a concept; and "Slang Term," to record informally used terms for a concept.
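By way of illustration only, a search that starts from either a Preferred Term or an Alternate Term may be sketched as follows. The dictionary layout, the function name `find_concepts`, the concept identifiers, and the sample alternate strings ("painkiller," "analgésique," "pill") are hypothetical and are not taken from this specification.

```python
def find_concepts(lexicon, query):
    """Return ids of concepts whose preferred or alternate terms equal query."""
    q = query.lower()
    hits = []
    for concept_id, entry in lexicon.items():
        # Search strings are the Preferred Term plus every Alternate Term,
        # whatever lexical relationship recorded it.
        strings = [entry["preferred"]]
        for values in entry["alternates"].values():
            strings.extend(values)
        if any(q == s.lower() for s in strings):
            hits.append(concept_id)
    return hits

# Hypothetical lexical entries for two concepts.
lexicon = {
    "c1": {
        "preferred": "analgesic",
        "alternates": {"Use For": ["painkiller"], "French Term": ["analgésique"]},
    },
    "c2": {"preferred": "tablet", "alternates": {"Slang Term": ["pill"]}},
}
```

Because every lexical relationship is stored the same way, a user-defined relationship such as "Former Term" would become searchable in this sketch simply by adding another key under `alternates`.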
To further illustrate the use of relationships between concepts, consider how relationships are used in thesaurus subsets. Thesaurus information maintained within thesaurus manager knowledge base 121 is divided up into a number of user-defined thesaurus subsets. Several subsets may contain relationships among the same set of concepts, but each subset may use a different relationship to express the link between concepts. For example, a corporate thesaurus subset of thesaurus manager knowledge base 121 may link "Allopurinol" with "Zyloprim" using the relationship [GenericName-Trademark], while a product literature subset links "Allopurinol" with "Zyloprim" using the relationship [UseFor], and while a research and development subset links "Allopurinol" with "Zyloprim" using the relationship [Synonym].
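By way of illustration only, the Allopurinol/Zyloprim example above may be sketched as follows. The `Subset` class, the subset names, and the use of term strings as stand-ins for the underlying concepts are assumptions made for brevity and are not taken from this specification.

```python
from dataclasses import dataclass, field

@dataclass
class Subset:
    """A hypothetical thesaurus subset holding (relation, from, to) facts."""
    name: str
    relations: set = field(default_factory=set)

def relations_between(subsets, a, b):
    """Map each subset name to the relations it uses to link a with b."""
    found = {}
    for s in subsets:
        names = sorted(r for r, x, y in s.relations if (x, y) == (a, b))
        if names:
            found[s.name] = names
    return found

corporate = Subset("Corporate", {("GenericName-Trademark", "Allopurinol", "Zyloprim")})
literature = Subset("ProductLiterature", {("UseFor", "Allopurinol", "Zyloprim")})
research = Subset("ResearchAndDevelopment", {("Synonym", "Allopurinol", "Zyloprim")})

result = relations_between([corporate, literature, research], "Allopurinol", "Zyloprim")
```

Here `result` associates each of the three subsets with the single relationship that subset uses for the same pair.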
SYSTEM CODE 110 COMPONENTS
The primary components of system code 110 are the IO layer 111, retrieval and update modules 112 and 113, reporting subsystem 114, and maintenance subsystem 115. In typical operation, user 150 or external program 160 places a query, or a request for update of thesaurus information, to the thesaurus manager via the IO layer 111, and receives information or confirmation of update via the same module. In the case of a query, the retrieval module 112 obtains from thesaurus knowledge base 121 the information that satisfies the query and passes it back to the user or external program via the HTML (Hypertext Markup Language) module or API (Application Programming Interface), respectively. In the case of an update, update module 113 verifies the validity of the change and changes the thesaurus information in thesaurus knowledge base 121, passing
confirmation back to the user or external program via the HTML or API modules 111.1, 111.2, respectively.
Update module 113
Update module 113 provides these updating functions: (a) loading thesaurus data from thesaurus data files 170 into thesaurus knowledge base 121, (b) editing thesaurus information already present in thesaurus knowledge base 121, (c) copying thesaurus information from one thesaurus subset within thesaurus knowledge base 121 to another such subset, (d) creating, renaming or deleting thesaurus subsets, (e) creating, renaming or killing thesaurus relationships, (f) fixing integrity violations of data within a thesaurus subset, and (g) "correlating", or establishing correspondences between, concepts within one thesaurus subset and those in all others and/or those in the generalized knowledge base 122.
Loading a Thesaurus
During loading, a file in an appropriate input format is interpreted, and the new thesaurus information therein is added to a target thesaurus subset.
Editing thesaurus information
Editing operations include adding new terms, renaming existing terms, deleting terms, adding or deleting relationships between terms, and creating, renaming or deleting thesaurus subsets. All of these actions may be performed across one or more Thesaurus Subsets as a single integrated operation. Naturally, when a change is specified for several thesauri, there may be reasons, unanticipated by the user, why the change cannot be performed in all subsets. For example, suppose the user asks to rename the term "Canine" to "Dog" in all thesaurus subsets. This operation may fail in one subset because there is no term equivalent to "Canine" to rename. In another subset it might fail because "Dog" already is present in that subset as a preferred term. The Update module evaluates each suggested change separately. If all can be performed, all are performed. If any cannot be performed, the Update module requests optional confirmation of the accepted changes, which can be obtained if the edits were initiated through an interface with a human user.
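By way of illustration only, the per-subset evaluation of the "Canine" to "Dog" rename described above may be sketched as follows. The sketch assumes each subset is represented as a set of preferred-term strings; the function names, the `confirm` callback, and the choice to apply nothing when confirmation is declined are assumptions and are not taken from this specification.

```python
def check_rename(subset_terms, old, new):
    """Return None if the rename can be performed, else a reason string."""
    if old not in subset_terms:
        return f"no term {old!r} to rename"
    if new in subset_terms:
        return f"{new!r} already present as a preferred term"
    return None

def rename_everywhere(subsets, old, new, confirm=lambda ok, failed: False):
    """Evaluate the rename per subset; apply all, or ask before a partial apply."""
    problems = {name: check_rename(terms, old, new) for name, terms in subsets.items()}
    failed = {name: why for name, why in problems.items() if why}
    ok = [name for name in subsets if name not in failed]
    # Assumption: when some subsets fail and confirmation is declined,
    # no subset is changed.
    if failed and not confirm(ok, failed):
        return [], failed
    for name in ok:
        subsets[name].remove(old)
        subsets[name].add(new)
    return ok, failed

# Hypothetical subsets: A can be renamed, B lacks "Canine", C already has "Dog".
subsets = {"A": {"Canine", "Cat"}, "B": {"Cat"}, "C": {"Canine", "Dog"}}
applied, failed = rename_everywhere(
    subsets, "Canine", "Dog", confirm=lambda ok, failed: True
)
```

In this usage the rename is applied only in subset A, while the reasons recorded for B and C correspond to the two failure cases given in the example above.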
Copying a Thesaurus Subset
In a preferred embodiment, thesaurus information may be copied from one thesaurus subset to another. A number of parameters control what information is copied: Start Terms, Cutoff Terms, and Cutoff Level. If none of these parameters is provided, the entire content of the source thesaurus is copied to the target thesaurus. If Start Terms are given, copying begins with those terms and proceeds down the "narrower term" hierarchy. If Cutoff Terms are given, no terms more specific than those terms, according to the "narrower term" hierarchy, are copied. Instead of Cutoff Terms, an integer may be specified as a Cutoff Level, which limits the copy operation to terms no more than that many "narrower term" hierarchy levels away either from the Start Terms, if any were specified, or, if no Start Terms were given, from the top terms of the source thesaurus subset.
Operations on Thesaurus Subsets
In a preferred embodiment, new thesaurus subsets may be created. After creation, such subsets are empty. The names and abbreviations for thesaurus subsets may be changed, so long as the new name or abbreviation is not already in use. Finally, an existing thesaurus subset may be deleted. Deleting a subset that contains thesaurus information has the effect of deleting all the contained thesaurus information.
Operations on Thesaurus Relations
In a preferred embodiment, new thesaurus relations may be created. Existing relations may be renamed, so long as the name is not already in use. Finally, thesaurus relations may be killed. If a thesaurus relation is in use in one or more of the thesaurus subsets, killing it has the effect of deleting all thesaurus information represented with that relation.
Repairing Integrity Violations
Integrity checker 118, described below, is a tool for detecting and repairing thesaurus information that violates the set of active integrity rules. Once a violation has been detected, update module 113 is responsible for repairing the violation.
Correlating
Correlator 116 is described in detail below. Once a correlation has been discovered and confirmed, update module 113 handles the merging of the two correlated concepts into one.
Retrieval module 112
Retrieval module 112 provides these retrieval functions: (a) outputting thesaurus information from the Thesaurus Knowledge Base 121 as a new Thesaurus Data File 170, (b) searching for a thesaurus concept starting from any term (lexical entry) or subword of a term, (c) retrieving all, or just particular, thesaurus relationships a term is involved in, (d) retrieving all thesaurus subsets, and (e) retrieving all thesaurus relationships. For (a), (b) and (c), the Retrieval module maintains the notion of Active Thesaurus Subsets, which are set by the user or specified in a query; these retrievals obtain information only from the Active subsets.
Reporting module 114
Reporting module 114 generates and/or retrieves saved reports of user actions, thesaurus and user statistics, and system information, as described in greater detail below.
Maintenance module 115
Maintenance module 115 provides maintenance functions of transcripting of changes to thesaurus information, saving and loading of backups, and user management.
IO Layer 111
The IO Layer 111 handles interaction with users and external programs, and consists of two parts: an HTML (Hypertext Markup Language) module 111.1, which is responsible for interaction with a user 150 via an external World Wide Web browser 140, and an API (Application Programming Interface) 111.2, which is responsible for communication with external programs 160.
HTML module 111.1
The HTML (Hypertext Markup Language) module 111.1 handles all human interaction with thesaurus manager 100. In standard operation, it listens for a connecting external Web browser 140, and when a connection is made, the HTML module generates the
appropriate HTML page for the purpose of interacting with the user. These automatically generated HTML pages contain buttons, type-in boxes, and menus appropriate to the user's task. When the user clicks on a link or submits a form, the HTML module receives that submission, passes control to the necessary subsystem(s) for processing, then generates an HTML page with the results and dispatches it to the external web browser.
Any number of Users 150 may have sessions open with thesaurus manager 100; the HTML subsystem 111.1 maintains the state of each of these connections, so that a user sees only the results of his or her own interactions.
HTML module 111.1 provides a conventional web-browser interface for thesaurus manager 100. In a preferred embodiment, HTML module 111.1 presents User 150 with a number of choices in a graphical user interface. Since the HTML module allows the user to access and control the operation of system code 110, interface features of HTML module 111.1 can be categorized as accessing each of the four main internal subsystems of System Code 110: Retrieval, Update, Reporting and Maintenance. HTML module 111.1 provides standard hyperlink displays, so that information providing active links to related topics appears underlined and in color. For example, the Full Term Display for ZANTAC might provide a collection of synonyms, broader terms, narrower terms, and otherwise conceptually related terms, each underlined to allow the user to click on such term to obtain further information specific to that term. In a preferred embodiment, the subset of thesauri containing ZANTAC-related concepts is also displayed in an underlined fashion, and the user can click on each named thesaurus subset to get more information about that subset (e.g., which organizational unit promulgated it and what purpose it is intended to serve). In addition, even the relation symbols presented by HTML module 111.1, such as "SN" for scope note, "BT" for broader term, and "NT" for narrower term, are underlined links that the user can click to obtain more information about those relation symbols if needed.
In addition to underlining entries, HTML module 111.1 provides small icons that appear next to entries permitting immediate access to different types of displays pertaining to
such entries. For instance, one of the underlined narrower term entries for ZANTAC might be "zantac chewdose tablet;" HTML interface 130 places next to that entry small icons for alphabetical and full record displays of that entry. Thus, the user 150 is able to immediately move from the full-term record for ZANTAC to an alphabetical display of a narrower term, if desired.
Retrieval features of HTML module 111.1
The retrieval features of HTML module 111.1 present thesaurus information as a series of automatically generated HTML pages. In a preferred embodiment, there exists a type-in box where the user may specify a term to examine, and one of four modes for viewing that term: Hierarchical Display, Full Term Display, Alphabetical Display, or Show Siblings Display. There also exists a "Find All" facility for searching for a term across all thesaurus subsets, a "View Relations" facility for examining available relations, a "Thesauri" page which lists all accessible thesaurus subsets, and a "Preferences" page where the user can set preferences affecting the behavior of the HTML module 111.1 retrieval features.
In general, each of the main term displays (Hierarchical, Full Term, Siblings, and
Alphabetical displays) shows thesaurus information about the selected term with hyperlinks to related information from all of the user-chosen "Active Thesauri." Thesaurus information from other thesaurus subsets that are not active is not shown. Each of the four kinds of display shows a union of all Active Thesauri in a single, combined display, as though it were a single (monolithic) thesaurus.
Because these displays show information from potentially many thesauri at one time, they use a special convention for displaying terms which mean the same thing, i.e., use the same underlying concept, but which have different preferred term strings in some of the Active Thesauri. For example, a consumer information thesaurus CI might use the preferred term "Zantac" for the same concept referred to, in a research and development thesaurus RD, as "ranitidine hydrochloride." If the thesauri CI and RD were both active, this term would be displayed as follows:
Zantac | ranitidine hydrochloride (CI, RD)
Each different preferred term is printed, separated from the others by a vertical bar, and at the end, a list of thesaurus Subset annotations is printed.
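By way of illustration only, the display convention just described may be sketched as follows. The sketch assumes each Active Thesaurus is represented as a mapping from a concept identifier to that subset's preferred term; the function name `merged_label` and the identifier "c42" are hypothetical and are not taken from this specification.

```python
def merged_label(concept_id, active_subsets):
    """Format one concept as 'term | term (S1, S2)' across the active subsets."""
    terms, tags = [], []
    for tag, preferred in active_subsets.items():
        term = preferred.get(concept_id)
        if term is None:
            continue  # this subset does not contain the concept
        tags.append(tag)
        if term not in terms:
            terms.append(term)  # list each distinct preferred term once
    return f"{' | '.join(terms)} ({', '.join(tags)})"

# Hypothetical active thesauri CI and RD sharing one underlying concept.
active = {
    "CI": {"c42": "Zantac"},
    "RD": {"c42": "ranitidine hydrochloride"},
}
print(merged_label("c42", active))
# prints: Zantac | ranitidine hydrochloride (CI, RD)
```

When two active subsets use the same preferred term, the sketch prints that term once and lists both subset annotations after it.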
The Hierarchical Display shows the position of the selected term in a hierarchy of more general and more specific terms, according to some hierarchical thesaurus relationship. By default, the "broader term" relation is used as the relation determining the hierarchy, but some other hierarchical relation may be selected for this. Each more general and more specific term is a hyperlink; clicking on that term brings up a hierarchical display focusing on that term. In addition, the Hierarchy Display optionally shows other relationships the selected term is involved in. The relationships themselves are links to pages of information about the relationship, whereas the related terms are each links to a Hierarchy Display focused on that term. Furthermore, the Hierarchy Display shows the "top terms" or major thesaurus partitions the selected term is present in. Also, each term shown on the Hierarchy Display page is annotated with the thesaurus subsets in which it appears. These annotations are hyperlinks; clicking on one displays information about the associated term, but only from that subset. Finally, like the other main term displays, the Hierarchy Display provides small icons next to each displayed term permitting one-click access to the Full Term and Alphabetical displays of that term.
The Full Term Display collects all thesaurus facts about a single term. It shows each relationship the term is involved in, and for each such relationship, a list of the other terms or strings related to the displayed term by that relationship. The relationships are hyperlinks to a page of information about the relation, while the related terms are hyperlinks to a Hierarchical Display about the related term. In addition, small icons that provide one-click access to the Full Term and Alphabetical Displays for that term accompany each related term.
The Full Term Display operates in two modes: Thesauri Separate and Thesauri Merged. When in Thesauri Separate mode, there is a separate section for each thesaurus Subset the term is present in. Information about the term in that Subset appears in that section. When in Thesauri Merged mode (the default), each term that is related to the displayed term is annotated with a list of thesaurus subset symbols, indicating which
Thesaurus Subsets the relationship is present in. These annotations are links; clicking on one displays information about the displayed term, ONLY from that subset.
The Full Term Display also has links to some of the update functionality of HTML module 111.1: "Full Term Edit," "Correlate Concept," "Uncorrelate," "Convert Preferred Terms to Use Fors," "Import Use Fors," and "Swap Preferred Term and Use For."
The Alphabetical Display shows a list of terms, alphabetized, in KWIC (Key Word In Context) format. The user-selected term or string is positioned in the middle, with several terms alphabetically before, and several terms alphabetically after it. Terms in a thesaurus may consist of several words, and KWIC alphabetizes on each subword of the term. For example, a user may choose to display "analgesic," with one Active Thesaurus that contains "analgesic agent" and "oral analgesic." The word "analgesic" by itself is not a term in the Active Thesauri, so the Alphabetical Display will center around the phrase "analgesic would appear here." Following this line will be a line for "oral analgesic", then a line for "analgesic agent." The lines previous to "analgesic would appear here" would be occupied by terms prior to "analgesic" in the alphabet, such as "gastrointestinal agent" (alphabetized by its second word) or "aluminum hydroxide" (alphabetized by its first word). In addition, terms are arranged horizontally so that the subword used for alphabetization begins at the same character position on each line. Each term displayed on the Alphabetical Display is a hyperlink to the Hierarchy Display view of the term. Furthermore, a small icon that provides a hyperlink to the Full Term view of the term accompanies each term. In the preferred embodiment, the Alphabetical Display has a button that toggles the display of alternate terms. When turned off, only Preferred Terms are indexed. When turned on, all of the user-selected Alternate Terms are indexed. As described above, Alternate Terms are strings recorded using Use For or any of the user-defined Lexical Relationships.
The Show Siblings Display shows the selected term with all of its sibling terms in all of the Active Thesauri. Sibling terms are grouped according to the parent term ("broader term") they share with the displayed term. Each sibling term is a hyperlink to a Hierarchical Display page about that sibling term. Additionally, small icons accompany each term that permit one-click access to Full Term and Alphabetical displays about the sibling term. Finally, each term is followed by a list of thesaurus annotations, which are also hyperlinks. The annotations indicate which thesauri the term is a sibling in, and clicking on one of these annotations brings up a page of Full Term information about the sibling term, but only in the thesaurus subset represented by the clicked annotation.
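The grouping of siblings by shared broader term can be illustrated with a short sketch. It is not taken from the described system: the function name siblings_by_parent and the dictionary layout are assumptions made for illustration, and the sketch ignores thesaurus subsets and their annotations.

```python
from collections import defaultdict


def siblings_by_parent(term, broader):
    """Group the siblings of `term` under each broader term they share.

    `broader` (an assumed layout) maps a term to the set of its broader
    terms; a term may have more than one broader term.
    """
    # Invert the broader-term relation to obtain each parent's narrower terms.
    narrower = defaultdict(set)
    for child, parents in broader.items():
        for parent in parents:
            narrower[parent].add(child)

    # One group per parent of the selected term, excluding the term itself.
    return {
        parent: sorted(narrower[parent] - {term})
        for parent in sorted(broader.get(term, ()))
    }


broader = {
    "man": {"male", "human"},
    "woman": {"human"},
    "child": {"human"},
    "bull": {"male"},
    "stallion": {"male"},
}
groups = siblings_by_parent("man", broader)
```

Because the relation is inverted once, every parent of the selected term yields its group in a single pass, which is the work the display saves the user from doing by hand.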
This display collects, on one page, information that would otherwise be at least two clicks away, and sometimes more. For example, the term "man" might have two different broader terms: "male" and "human." Sibling terms of "man" according to the parent term "human" might include "woman" and "child." Sibling terms of "man" according to the parent term "male" might include "bull," "stallion," etc. So in a thesaurus system which is not necessarily a strict tree, but allows terms to have more than one parent, it can often be quite a lot of work to locate all the sibling terms if one does not use the Siblings Display.
Update features of HTML module 111.1
The update features of HTML module 111.1 are provided as a number of automatically generated HTML pages containing conventional HTML forms that can be filled out and submitted by the user to perform changes to thesaurus information. For most update features, the system has the ability to update multiple thesaurus subsets at one time. In one embodiment, there are buttons for "Quick Edit," "Add Term," "Delete Term," "Rename Term," "Create Thesaurus," "Delete Thesaurus," "Rename Thesaurus," "Copy Thesaurus," "Define Relation," "Full Term Edit," "Convert Narrower Terms to Use Fors," "Import Use Fors," "Swap Preferred Term with Use For," "Integrity Check Thesaurus," "Integrity Check Term," "Correlate," "Correlate Concept," and "Load Thesaurus." Each of these causes an HTML page to load, containing the appropriate blanks, radio buttons or checkboxes to allow the user to specify the change to make. Submitting the form causes processing of the change. Each of these features is explained more fully below.
The Quick Edit page supports adding, deleting or editing thesaurus information of already-present terms. It requests a term to edit, a set of thesauri in which to perform the change, a thesaurus relation, and the type of operation (add, delete or edit). Quick Edit uses
the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
Add Term supports adding a term to one or more thesaurus subsets not yet containing the term. It requests a Preferred Term string, a set of thesauri in which to add the term, and (optionally) an existing term to serve as the "broader term" for the new term. Add Term uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
Delete Term supports removal of a term from one or more thesaurus subsets. It can either delete the term and all its narrower terms, or merely "splice out" the term, connecting the former term's narrower terms up to its prior broader terms, depending on user choice. It requests the term to delete and one or more thesauri from which to delete it. Delete Term uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
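The two deletion modes can be sketched as follows. This is an illustrative sketch only: the function delete_term, its splice flag, and the dictionary of broader terms are assumed names and structures, and the term "drug" is hypothetical sample data.

```python
def delete_term(broader, term, splice=True):
    """Return a new broader-term map with `term` removed.

    `broader` (an assumed layout) maps each term to the set of its
    broader terms.
    splice=True:  reconnect the term's narrower terms to its former
                  broader terms ("splice out").
    splice=False: remove the term together with all of its narrower terms.
    """
    result = {t: set(bts) for t, bts in broader.items()}

    if splice:
        parents = result.pop(term, set())
        for bts in result.values():
            if term in bts:
                bts.discard(term)
                bts |= parents          # attach to the former broader terms
        return result

    # Collect the term and everything reachable below it.
    doomed, frontier = {term}, [term]
    while frontier:
        current = frontier.pop()
        for t, bts in result.items():
            if current in bts and t not in doomed:
                doomed.add(t)
                frontier.append(t)
    return {t: bts - doomed for t, bts in result.items() if t not in doomed}


broader = {
    "drug": set(),                           # hypothetical top term
    "analgesic agent": {"drug"},
    "oral analgesic": {"analgesic agent"},
}
spliced = delete_term(broader, "analgesic agent", splice=True)
pruned = delete_term(broader, "analgesic agent", splice=False)
```

In this sketch a narrower term that also has a broader term outside the deleted branch is still removed when splice is False; whether the described system behaves that way is not stated in the text.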
Rename Term supports changing the Preferred Term string for a concept in one or more thesauri. Rename Term uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
Create Thesaurus supports the introduction of a new, empty thesaurus subset. It requests a name and an abbreviation for the new thesaurus. Create Thesaurus uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the change.
Delete Thesaurus supports the removal of an entire thesaurus subset, along with its contents, from thesaurus manager 100. It allows the user to pick the thesaurus to delete from a menu of available thesauri, and uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the change.
Rename Thesaurus supports changing the name and/or abbreviation for a thesaurus subset. It requests either a new name or a new abbreviation or both, and uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
Copy Thesaurus supports copying thesaurus information from one thesaurus subset to another. It requests a source thesaurus and a target thesaurus (from pick menus). Optionally, it also requests Start Terms and Cutoff Terms or a Cutoff Level. These parameters are explained above, under "Copying a Thesaurus Subset." Copy Thesaurus uses the "Copying A Thesaurus Subset" subsystem of update module 113 to verify and perform the changes.
Define Relation supports the addition of a new relation to the set available for use to express thesaurus information. Lexical, Hierarchical, Documentation and Custom relations may be created. Depending on the type of relation, various kinds of information may be requested: Relation name, name of inverse relation, whether the relation is one-to-one, one-to-many, many-to-one, or many-to-many, whether the relation is reflexive, whether the relation is irreflexive, whether the relation is symmetric, whether the relation is transitive, whether it relates two Preferred Terms (i.e., concepts) or relates a Preferred Term to a string. Define Relation uses the "Operations on Thesaurus Relations" subsystem of update module 113 to verify and perform the changes.
Full Term Edit supports the addition, deletion and editing of relationships among existing terms, and among existing terms and strings, across all subsets, in one action. It presents a free-form text box to the user, containing the current thesaurus information for a term in all active thesauri, in a syntax that can be understood by the program. The user may edit this text to add terms to relations already present, or add new relationships between terms, in each of the thesauri in which the term is present. Full Term Edit uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
Convert Narrower Terms to Use Fors allows the user to perform, in one action, what might otherwise be a time-consuming task of designating a particular term the most specific Preferred Term, and converting all its current narrower terms into Use Fors. This facility uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes. This facility provides increased capability for managing the granularity of terms across thesaurus subsets.
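A minimal sketch of this conversion is given below, under stated assumptions: the function name, the two dictionaries, and the choice to carry over the Use Fors of the absorbed terms are all illustrative and are not described in the text. The terms "chewable analgesic" and "oral painkiller" are hypothetical sample data, and the hierarchy is assumed to be free of cycles.

```python
def convert_narrower_to_use_fors(broader, use_for, term):
    """Fold every term below `term` into `term` as Use For strings.

    `broader` maps a term to the set of its broader terms; `use_for` maps
    a preferred term to its set of Use For strings (both assumed layouts).
    Returns new copies; the inputs are left unchanged.
    """
    new_broader = {t: set(bts) for t, bts in broader.items()}
    new_use_for = {t: set(ufs) for t, ufs in use_for.items()}

    # Find all terms below `term`, directly or through other narrower terms.
    descendants, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        for t, bts in broader.items():
            if current in bts and t != term and t not in descendants:
                descendants.add(t)
                frontier.append(t)

    target = new_use_for.setdefault(term, set())
    for d in descendants:
        target.add(d)                          # former preferred term becomes a Use For
        target |= new_use_for.pop(d, set())    # assumption: its Use Fors move too
        new_broader.pop(d, None)               # the term leaves the hierarchy
    return new_broader, new_use_for


broader = {
    "analgesic agent": set(),
    "oral analgesic": {"analgesic agent"},
    "chewable analgesic": {"oral analgesic"},       # hypothetical
}
use_for = {"oral analgesic": {"oral painkiller"}}   # hypothetical
new_broader, new_use_for = convert_narrower_to_use_fors(
    broader, use_for, "analgesic agent"
)
```

After the call, "analgesic agent" is the most specific preferred term in its branch, which matches the stated purpose of coarsening the granularity of a subset in one action.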
Import Use Fors allows the user to quickly import Use For terms from other thesauri. It uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
As vocabularies evolve, it is often the case that a Preferred Term goes out of use and is replaced in common usage by what used to be an Alternate Term. Swap Preferred Term with Use For allows the user to easily choose the terms to swap. It uses the "Editing Thesaurus Information" subsystem of update module 113 to verify and perform the changes.
Integrity Check Thesaurus calls an Integrity Checker subsystem of update module 113 on each fact in the selected thesaurus. It runs until a problem is found, until a time limit is reached, or until the entire thesaurus is checked. If a problem is found, it is fixed automatically by an Integrity Checker subsystem of update module 113 if possible; otherwise the user is presented with a set of repair options.
The "Integrity Checker" 118 may also be called on a single term. In this case, each thesaurus fact concerning the term, in all thesaurus subsets in which it appears, is checked. Problems that can be fixed automatically are fixed; those with several repair options are presented to the user. In each case, the repair is performed by an Integrity Checker subsystem of update module 113.
Correlate supports the establishment of correspondences between all the terms of a thesaurus subset (that can be matched) with terms in other thesaurus subsets and with concepts in the generalized knowledge base 122. It uses correlator 116, a subsystem of system code 110, which is described below.
Correlate Concept allows correlator 116 to be called on a single Preferred Term (i.e., concept), and correspondences to be set up between that term and terms in other thesauri or a concept in the generalized knowledge base 122.
Uncorrelate reverses a correlation. Given a concept which appears in two or more thesauri, or which appears in the generalized knowledge base 122 as well as in one or more thesauri, Uncorrelate breaks the concept apart into separate concepts. This is useful when incorrect correlations have accidentally been performed.
Load Thesaurus supports loading thesaurus information from thesaurus data file 170 into a Thesaurus Subset. It uses the "Loading a Thesaurus Subset" subsystem of update module 113 to perform the load.
Reporting features of HTML module 111.1
The Reporting features of HTML module 111.1 are provided as a set of automatically generated HTML pages, accessible from the "Utilities" page, which present information such as thesaurus statistics, user statistics, and operations reports. Thesaurus statistics include information concerning the number of preferred terms and other lexical entries in each thesaurus, the number of concepts that underlie terms in each thesaurus, the number of facts in each thesaurus, and the creator and creation time of each thesaurus. Thesaurus statistics also include the number of concepts, terms and facts in the integrated thesaurus as a whole. User statistics include, for each month and each user, the number of logins during the month and the number of pages requested, per thesaurus and in total. Operations reports show a log of changes made to the thesaurus information; this may be sorted in various ways (by user, by date, by time, by thesaurus).
Maintenance features of HTML module 111.1
The Maintenance features of HTML module 111.1 are provided as a set of automatically generated HTML pages. In a preferred embodiment, not all aspects of maintenance module 115 are under user control, but those that are can be accessed via links from the "Utilities" page. Maintenance features under administrator control include "Quick State Snapshot," "Backup Thesaurus Information to File," "Manage Users," and "System Information."
API module 111.2
The API module 111.2 provides an Application Programming Interface that permits an external program to submit queries and requests for update to thesaurus manager 100. In one embodiment, API module 111.2 performs retrieval only, but in another embodiment, it performs update in addition to retrieval.
General operation of API module 111.2
The API module 111.2 is a server that uses a stream connection, such as TCP, and SMTP-like commands and responses. The client program sends commands to the server requesting information or update, and the server responds with the information or confirmation of change, or with an indication of why the command could not be performed.
Retrieval functionality of API module 111.2
In the preferred embodiment, retrieval functionality of API module 111.2 includes the ability to (a) test if a string is a known term in a thesaurus subset, (b) retrieve all thesaurus subsets, (c) retrieve all relations, (d) retrieve the thesaurus subsets which contain thesaurus information about a term, (e) retrieve the Narrower Terms, Use Fors, all recursive Narrower Terms, all Use Fors of the term and of all its Narrower Terms, all equivalent terms, or all terms related to a term by a given relation, within some set of thesaurus subsets, and (f) retrieve all the Top Terms (major thesaurus partitions) of a given set of thesaurus subsets. For example, an application program that uses thesaurus information to expand a keyword query for the purpose of retrieving documents might connect to the API module 111.2 to retrieve all Use Fors of all narrower terms of the original, user-provided keywords, and incorporate this information into the search to increase the number of relevant retrievals.
Update functionality of API module 111.2
In one embodiment, update functionality of API module 111.2 includes the ability to (a) create, rename or delete a thesaurus subset, (b) add, rename or delete a term to, in, or from a single thesaurus subset, or, optionally, a set of thesaurus subsets, all in one command, and (c) edit thesaurus information of existing terms in (optionally, multiple) existing thesaurus subsets. For example, an application program that automatically extends a thesaurus subset by accessing online text might connect to the API module 111.2 to test whether a term is known, and add it to the thesaurus if it is not known.
Logging module 117
Logging module 117 is responsible for maintaining logs 180. These logs record two types of information: (a) changes to thesaurus information done by the update module 113,
and (b) Thesaurus Subsets accessed by user 150 via the web interface implemented by HTML module 111.1. Reporting module 114 reads these logs.
Integrity Checker 118
Integrity Checker 118 is a tool for detecting and repairing thesaurus information that violates the set of active integrity rules. The following rules are used in a preferred embodiment:
• No BT/NT cycles are allowed, nor are cycles using any Hierarchical Relation. The program will identify loops or cycles in the chains made of Broader Term/Narrower Term or any other Hierarchical relation links.
• No term is related to itself (by any irreflexive relations). The program will identify self-loops when prohibited.
• No term can be both a preferred term and an alternate "Use For" term. The program will identify strings used both as preferred terms and as "UseFor" terms.
• No two underlying concepts can have the same preferred term. The program will identify distinct concepts that share a common preferred term.
• No two distinct preferred terms can have the same alternate, or "Use For", term. The program will identify distinct concepts that share a common "Use For" string.
• A "top" term must have no broader terms (BTs). The program will identify purported top concepts that have broader terms.
• A word on the stoplist cannot be either a preferred term or an alternate, "Use For" term. The program will identify concepts which have a stoplist word as a preferred term or a "Use For" term.
• Every term, string and relation in a thesaurus must be linked to a known preferred term. The program will identify thesaurus assertions involving concepts that are not actually in that thesaurus. Links (relations) to non-existent preferred terms will be flagged as errors.
• If a relation is one-to-one, many-to-one, or one-to-many, only one term is allowed to be related by that relation. The program will identify multiple related terms for the given relation when prohibited.
• An underlying concept has only one preferred term, no more. The program will identify multiple preferred term strings for a given thesaurus concept.
• BT and RT may not both relate a term to the same term. The program will identify pairs of concepts related by both Broader Term and Related Term.
• A term and its BT cannot both have RT relations to the same term. (No RT-BT-RT triangles.) The program will identify pairs of concepts that are related by Broader Term/Narrower Term and are both related by Related Term to a common third term.
• No term can be both a preferred term and an alternate "Use For" term. The program will identify concepts which have a given string as both a preferred term and a "Use For" term.
• If a term has no BT, it must be a "top". (I.e., No orphans.) The program will identify concepts that have no Broader Terms but are not "tops".
In a preferred embodiment, all of these integrity rules are active, but in an alternate embodiment users choose which rules to apply.
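Three of the rules above (no hierarchical cycles, no string that is both a preferred term and a "Use For" term, and no orphans) can be sketched as simple checks over in-memory data. The function names and dictionary layouts are assumptions chosen for illustration; the text does not describe how Integrity Checker 118 is implemented.

```python
def find_cycle(broader):
    """Return one broader-term cycle as a list of terms, or None.

    `broader` (an assumed layout) maps a term to the set of its broader
    terms. A depth-first search marks terms on the current path as
    IN_PROGRESS; reaching such a term again means a cycle exists.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(term, path):
        state[term] = IN_PROGRESS
        path.append(term)
        for parent in sorted(broader.get(term, ())):
            if state.get(parent) == IN_PROGRESS:
                return path[path.index(parent):] + [parent]
            if parent not in state:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[term] = DONE
        return None

    for term in sorted(broader):
        if term not in state:
            cycle = visit(term, [])
            if cycle:
                return cycle
    return None


def preferred_and_use_for_clashes(preferred_terms, use_for):
    """Strings used both as a preferred term and as a Use For term."""
    alternates = set().union(*use_for.values())
    return sorted(set(preferred_terms) & alternates)


def orphans(broader, top_terms):
    """Terms that have no broader term but are not declared top terms."""
    return sorted(t for t, bts in broader.items() if not bts and t not in top_terms)
```

For instance, `find_cycle({"a": {"b"}, "b": {"c"}, "c": {"a"}})` reports the loop a, b, c, a, and a term that lists itself as its own broader term is reported as a two-element self-loop. Each check only detects a violation; choosing and applying a repair is a separate step.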
Integrity checker 118 operates off the definitions of the thesaurus relations used to express the thesaurus information. When a new thesaurus relation is created using the Define Relation subsystem of HTML module 111.1, described above, the definition of that new relation directs which integrity rules will be applied to it by integrity checker 118.
KNOWLEDGE BASE 120 COMPONENTS
Knowledge base 120 is, in a preferred embodiment, implemented using the Cyc® Knowledge Base available from Cycorp, Inc., of Austin, Texas. This knowledge base includes a generalized knowledge base 122 that consists of approximately one half-million hand entered formulas (or "rules") that are part of human consensus reality knowledge. When used as part of the preferred embodiment, it also includes thesaurus knowledge base 121.
Knowledge Base 120 features
Formal language
The formulas of knowledge base 120 are encoded in a formal language, CycL. Concepts in a formal language are represented by symbols, and these symbols are combined
in meaningful ways to form logical Formulas. Formulas are like sentences - each states some fact about the world. For example, from the concepts of TREE, OUTDOOR-REGION, and PROPERTY-OF-BEING-LOCATED-IN-PLACE, we can form a formula which says "trees are located outdoors." From the concepts of TO-MEAN-SOMETHING, STRING-OF-CHARACTERS, and AUTOMOBILE, we can form a formula which says "one meaning for the string, 'car,' is the concept automobile."
Contexts
In a preferred embodiment, knowledge base 120 is divided into a large number of Contexts, each of which is essentially a bundle of formulas that share a common set of assumptions and which are consistent with each other. A context mechanism allows the knowledge base 120 to independently maintain formulas that are prima facie contradictory, by having them reside in different contexts. For example, in a context about the United Kingdom, there might be a formula which says that driving is done on the left side of the road, whereas in a context that assumes a United States location, there will be a formula stating that driving is done on the right side of the road. Notice that the same concepts
(DRIVING-EVENT, ROADWAY, etc.) are used in these two formulas. If they were in the same context, or in a knowledge base without a context mechanism, they would be contradictory.
Lexicon
In a preferred embodiment, knowledge base 120 includes a Lexicon of over 12,000 root English words. These words are related in knowledge base 120 to the Concepts that are the meanings of the words. An English word may have many meanings, and the Lexicon of knowledge base 120 accounts for this. The Lexicon recognizes any form of a word. For example, Lexicon information maintained for the root word "swim" is sufficient to map any of these strings into the concept for SWIMMING: "swim," "swims," "swam," "swimming." Lexicon information is used by correlator 116.
Feature summary
Whereas in a preferred embodiment, knowledge base 120 is implemented using the Cyc® Knowledge Base, in another embodiment any knowledge base may be used to implement knowledge base 120 so long as it uses a formal representation language that forms Formulas by combining Concepts and possesses a rich store of Concepts and knowledge about those concepts.
Generalized Knowledge Base 122
Generalized knowledge base 122 contains, in a preferred embodiment, a rich store of Concepts and of rules that are part of human consensus reality knowledge. In a preferred embodiment, generalized knowledge base 122 contains thousands of concepts, including various kinds of intelligent agents like people, everyday objects from paperclips to aircraft carriers, anatomical concepts, substances from water to wood to pharmaceuticals, units of measure like inch, a plethora of actions from scratching to lecturing to thundershowers to collisions to surgery, and the like. Those skilled in the art will recognize that the particular selection of such Concepts and rules is a matter of design choice and is not essential to the implementation of thesaurus manager 100.
Thesaurus Knowledge Base 121
Thesaurus knowledge base 121 contains thesaurus information represented using the Concepts and Formulas that are encoded in the formal language of knowledge base 120. In the preferred embodiment, thesaurus information is represented in CycL.
Use of Contexts
In operation, thesaurus knowledge base 121 contains one or more thesaurus subsets 211 as defined by users. Each thesaurus subset is represented as a Context of knowledge base 120. The formulas of such a context express the thesaurus information of that thesaurus subset.
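The idea of one context per thesaurus subset can be illustrated with a small sketch in which formulas are stored as tuples and partitioned by context name. The class ContextStore, the predicate names PREFERRED-TERM and USE-FOR, and the context names are hypothetical and are not CycL; they only show how one shared concept can carry different descriptions in different subsets without conflict.

```python
from collections import defaultdict


class ContextStore:
    """Formulas partitioned by context; each context stands for one subset."""

    def __init__(self):
        self._formulas = defaultdict(set)

    def assert_formula(self, context, formula):
        """Record a (predicate, concept, value) tuple in one context only."""
        self._formulas[context].add(formula)

    def query(self, context, predicate, concept):
        """Values asserted for `concept` under `predicate` in `context`."""
        return sorted(
            value
            for pred, subj, value in self._formulas.get(context, ())
            if pred == predicate and subj == concept
        )


kb = ContextStore()
# The same underlying concept, DOG, described differently in two subsets.
kb.assert_formula("BiologyThesaurus", ("PREFERRED-TERM", "DOG", "Canis familiaris"))
kb.assert_formula("BiologyThesaurus", ("USE-FOR", "DOG", "dog"))
kb.assert_formula("ChildrensThesaurus", ("PREFERRED-TERM", "DOG", "dog"))
kb.assert_formula("ChildrensThesaurus", ("USE-FOR", "DOG", "doggie"))

biology_name = kb.query("BiologyThesaurus", "PREFERRED-TERM", "DOG")
childrens_name = kb.query("ChildrensThesaurus", "PREFERRED-TERM", "DOG")
```

Because every query names a context, the two preferred terms for DOG never meet in the same bundle of formulas.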
In addition, there is a set of general formulas of the following types, which are resident in thesaurus knowledge base 121 in general, and are available within each thesaurus subset 211:
• Formulas which state names and abbreviations, comment strings and other definitional and bookkeeping information for each thesaurus subset
• Formulas which define thesaurus relations, stating whether a relation has an inverse, is many-to-many, many-to-one, one-to-many or one-to-one, is transitive, is symmetric, is reflexive or irreflexive, implies that another kind of relation holds between the so-related terms, is a lexical (string-searchable) relation, is a hierarchical relation, and/or whether sequences of the relation can form a cycle.
• Formulas which state user ID and define permissions for users, e.g., 150.
• Formulas which identify the creator and creation time of each underlying concept.
Use of formal language
Within a context corresponding to a thesaurus Subset 211, thesaurus information is expressed as formulas in the formal language of knowledge base 120. For each Preferred Term, there exists exactly one knowledge base Concept. For example, consider the Preferred Term "Zantac" within a generalized pharmaceutical thesaurus subset, which is represented there by the underlying concept ZANTAC-THE-PRODUCT. One statement in the formal representation language expresses the fact that "The preferred term for the concept 'Zantac, the product' is the string 'Zantac'." Another might express the fact that "The preferred term for the concept 'Zantac in tablet form' is the string 'Zantac Tablet'." These two statements are invisible to user 150, but are used to maintain the equivalence between the preferred term strings "Zantac" and "Zantac Tablet," and the underlying concepts, so that, in the interface supported by HTML code 111.1, the user refers to an underlying concept exclusively by its Preferred Term.
Another set of statements in the formal language of Knowledge Base 120 is visible to the User 150. However, such statements are not visible as formulas; instead, the underlying form of these statements is transformed into thesaurus form for display and manipulation by the user. This class of statements uses the thesaurus relations that are viewable in "View Relations," and which can be used in "Quick Edit" or "Full Term Edit" to add new (or edit or delete) thesaurus information about a Preferred Term. Each thesaurus relation, and its
inverse, if any, is represented by a single Concept in the underlying formal language. For example, in the preferred embodiment, the thesaurus relation "NT" is represented by the concept BROADER-TERM. A statement such as "A narrower term of Zantac is Zantac Tablet" would be visible to the user in thesaurus form both as ["Zantac" NT "Zantac Tablet"] and as ["Zantac Tablet" BT "Zantac"]. This same statement would be represented in the underlying formal language as (BROADER-TERM ZANTAC-TABLET ZANTAC-THE-PRODUCT). Note that, as mentioned earlier, relations are stated between Concepts in thesaurus manager 100, not directly between the Preferred Terms of any thesaurus.
INTEGRATION OF THESAURUS KB 121 AND GENERALIZED KB 122
Referring now to Figure 2, there is shown a more detailed view of knowledge base 120 and its main components thesaurus knowledge base 121 and generalized knowledge base 122. Thesaurus knowledge base 121 contains a number of thesaurus subsets, in this instance Subset A 211, Subset B 212, and Subset C 213. Each of these subsets includes a set of relationships among concepts as illustrated by the links (solid lines, representing relationships) between nodes (dots, representing Concepts) in Figure 2. While some concepts may only be involved in relationships in a single subset, other concepts appear in multiple subsets. The dashed lines between subsets in Figure 2 indicate that the same underlying concept is referred to in each. Finally, some of the concepts used to express thesaurus information in the various thesaurus subsets are the same as pre-existing concepts which are part of the generalized knowledge base 122, as indicated by dashed lines that connect concepts (dots) in the thesaurus subsets to concepts in the generalized knowledge base 122.
In summary, thesaurus manager 100 integrates multiple thesaurus subsets by sharing concepts among subsets. A single such concept may thus have different Preferred Terms, different Alternate Terms, even different knowledge expressed about it in different thesaurus subsets, but these differing descriptions do not conflict, since each is partitioned away from the others by treating the thesaurus subsets as contexts. Thesaurus manager 100 also integrates thesaurus subsets with generalized knowledge base 122 by using the same concepts, where possible and appropriate, in each. For example, the general concept DOG
might appear in several thesauri as well as in the Generalized KB. In a Biology thesaurus, it might have "Canis familiaris" as its preferred term, with "dog" as an alternate term. In a children's thesaurus, "dog" might be the preferred term, with "doggie" as an alternate term. In the Generalized KB, the concept DOG will be involved in formulas such as "dogs are commonly kept as pets by people," "dogs like to eat meat," "young dogs are playful," and so on.
Correlator 116
Correlator 116 is a tool that is used to establish the equality of concepts across thesaurus subsets and the generalized knowledge base 122. The Correlator plays this role in several types of processing done by thesaurus manager 100: (a) during a load of a thesaurus, (b) at the time a new term is added, and (c) when a User 150 invokes correlator 116 via the web interface implemented by HTML module 111.1.
Correlation at load time
When a thesaurus subset is loaded, term definitions are read singly from thesaurus data file 170. A term definition consists of a Preferred Term string together with the relationships that term has to other terms. Correlator 116 is invoked on each Preferred Term string to determine if an existing Thesaurus Subset already refers to a Concept by that same Preferred Term string. If so, the matching, already-existing concept will be used for the loaded term. This is performed automatically, without user interaction (unless a "re-use concepts?" parameter is turned off).
Correlation at Add Term time
When user 150 adds, via the web interface implemented by HTML module 111.1, a new term to a thesaurus subset or to a group of thesaurus subsets 211, correlator 116 is called on the Preferred Term string entered by the user to see if a pre-existing concept might match. Concepts appearing in thesaurus Subsets other than those to which the term is being added, and concepts appearing in the generalized knowledge base 122 are considered as candidates for re-use. If candidates are found, the user will be asked via the web interface to confirm or choose among the candidates. If a candidate concept is chosen, it will be used to represent
the added term. If no concept is chosen, a fresh concept will be generated to represent the added term.
Explicit Correlation
In the preferred embodiment, correlator 116 may be invoked via the web interface supported by HTML code module 111.1. Given a thesaurus subset, correlator 116 visits every concept mentioned in the thesaurus information of that subset, and attempts to find concepts not mentioned in that subset, but instead mentioned in generalized knowledge base
122 or in another thesaurus subset, which could be equal to the visited concept. Correlator
116 may also be invoked on a single concept. Candidate concepts are presented to user 150 via HTML code 111.1. If the user allows the correlation, the two concepts are merged into one.
Method of correlation
Given a Concept or a Preferred Term string appearing in some thesaurus subset, e.g.,
211, correlator 116 finds a set of concepts not currently equivalent, which can be considered as correlation candidates. Correlator 116 judges candidates according to a set of heuristics.
• If the Preferred Term strings of the concepts match, strongly favor the matching concept as a correlation candidate.
• If the Preferred Term string of the starting concept is one of the Alternate Terms of a matching concept, weakly favor the matching concept as a correlation candidate.
• If there is overlap between the Alternate Terms of the starting concept and the Alternate Terms of a matching concept, weakly favor the matching concept as a correlation candidate.
• If the Preferred Term string of the starting concept, when matched with Lexicon information in the knowledge base 120, yields any concepts, moderately favor those matching concepts as correlation candidates.
• If the Alternate Terms of the starting concept, when matched with Lexicon information in the knowledge base 120, yield any concepts, weakly favor those matching concepts as correlation candidates.
• Graph isomorphism between the thesaurus relations the starting concept is involved in, and the thesaurus relations or generalized knowledge base relations a matching concept is involved in weakly favors a correlation between the two concepts.
• String-similarity between Preferred Term strings, Alternate Terms, and/or Lexicon 122.1 entries for concepts weakly favors a correlation between the two concepts.
These heuristics are additive, so several weak endorsements add up to a strong endorsement of a correlation match.
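The additive behavior of these heuristics can be illustrated with a minimal sketch. The numeric values chosen for "weak", "moderate" and "strong", and the function name score_candidates, are assumptions made only for illustration; the description ranks the heuristics but does not assign numbers to them.

```python
from collections import defaultdict

# Hypothetical numeric weights; the description only ranks the heuristics
# as weak, moderate, or strong.
WEAK, MODERATE, STRONG = 1.0, 2.0, 4.0


def score_candidates(endorsements):
    """Sum the weights of all heuristic endorsements per candidate concept.

    `endorsements` is an iterable of (candidate_id, weight) pairs, one pair
    for each heuristic that fired for that candidate.
    """
    totals = defaultdict(float)
    for candidate_id, weight in endorsements:
        totals[candidate_id] += weight
    return dict(totals)


endorsements = [
    ("concept-A", STRONG),  # Preferred Term strings match
    ("concept-B", WEAK),    # Preferred Term matches an Alternate Term
    ("concept-B", WEAK),    # Alternate Terms overlap
    ("concept-B", WEAK),    # Lexicon hit on an Alternate Term
    ("concept-B", WEAK),    # string-similar terms
]
scores = score_candidates(endorsements)
```

With these assumed values, four weak endorsements of concept-B accumulate to the same total as the single strong endorsement of concept-A.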
Referring now to Figure 3, there is shown a flow diagram illustrating processing in response to a user query for thesaurus information, in accordance with the present invention. The user 150, while browsing 305 one of the standard browsing pages displayed by External Web Browser 140, clicks on either a) the name of a term, to show a Hierarchy Display page about the term, b) the Full Term icon, to show a Full Term Display about the term, or c) the Alpha icon, to show the Alphabetical Index centered on the term. (Note that other actions may accomplish the same result as this click, namely typing the name of a term into a "type in" box on the standard page header, and choosing one of the main modes in which to view the term.)
Once the user click has been received, HTML module 111.1 processes 310 that click and dispatches the request to the Retrieval module 112. This module performs a lookup 315 procedure, retrieving thesaurus information about the chosen term from all active thesaurus subsets, in this case Subsets 211, 212, and 214. Subset 213, also depicted, is not among the Active Thesaurus Subsets, so information is not retrieved from that subset. Next, the Retrieval module uses the thesaurus information to build 320 an Output Item, which is passed on to HTML module 111.1 for formatting. HTML module 111.1 formats 325 the information as part of a standard World Wide Web page layout appropriate to the type of display requested by the user. This information is streamed to the External Web Browser 140, where it is displayed so User 150 may view the new browsing page 330.
Not shown in this diagram are the actions of Logging module 117, which records thesauri visited by User 150 for later report generation.
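The lookup step 315 of Figure 3 may be pictured with the following sketch, which gathers information about a term from the active subsets only. The dictionary layout, the relation labels (BT, UF, RT) and the function name lookup are hypothetical and are not taken from the described embodiment.

```python
def lookup(term, subsets, active_ids):
    """Collect thesaurus information about `term` from active subsets only."""
    output_item = {}
    for subset_id in active_ids:
        info = subsets.get(subset_id, {}).get(term)
        if info is not None:
            output_item[subset_id] = info
    return output_item


# Illustrative data: four subsets, of which 213 is not active.
subsets = {
    211: {"widget": {"BT": ["component"]}},
    212: {"widget": {"UF": ["gadget"]}},
    213: {"widget": {"RT": ["tool"]}},
    214: {},
}
item = lookup("widget", subsets, active_ids=[211, 212, 214])
```

Subset 213 holds an entry for the term but contributes nothing to the Output Item because it is not among the active subsets, and Subset 214 is consulted but has no entry.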
Figure 4 illustrates integrated maintenance of multiple thesaurus subsets, in accordance with the present invention. The user 150, while viewing one of the standard browsing pages 405 displayed by External Web Browser 140, clicks on a menu button to perform one of the editing procedures described above. HTML module 111.1 receives and processes this request to edit, by formatting 410 the requested editing page and streaming it to External Web Browser 140. Web Browser 140 then displays this edit page 415, which contains type-in boxes, pick menus and/or buttons as needed for the requested editing procedure. After the user submits his or her choices, the Web Browser 140 dispatches them to HTML module 111.1, which processes 420 the editing instructions into discrete Operations. Each operation concerns a particular thesaurus subset, concerns a particular relation, and performs some particular procedure: add, delete, replace (for thesaurus relations), and create or kill (for thesaurus terms and thesaurus subsets). Note that certain operations may seem atomic in the interface but may generate many of these Operations. Update module 113 verifies 420 each operation according to the integrity constraints present on the relation involved. The update module first confirms 425 that the operation satisfies the integrity constraint present on the relation involved. If all operations are OK 430, update module 113 performs each change 435 in the thesaurus selected for that change. If any operations are not valid 440, HTML module 111.1 formats 445 a verification page 450 that is sent to user 150 via External Web Browser 140. If no operations were valid, processing stops there: the verification page 450 merely contains an explanation of why the operations could not be performed. If at least some of the operations appear to be valid, the user has the option to OK them on the verification page 450.
Sometimes, one or more of the requested operations may be valid if the directives input by User 150 are interpreted in an alternate fashion, e.g., if the user input a "Use For" instead of a "Preferred Term." The verification page 450 will always check with the user before performing the operation in this case.
After the user responds to the questions on the Verification page, the External Web Browser dispatches the page to HTML module 111.1, which processes 455 the verification input. Then Update module 113 actually performs 435 each operation in the thesaurus subset selected by the operation. In the diagram, no changes were requested for Subset 213, so none are performed there. The HTML module 111.1 formats 460 a results page 465 and streams it to the external web browser 140.
Not shown in this diagram are the actions of the Logging 117 and Maintenance 115 modules, which record logs of changes used for report generation and disaster recovery, respectively.
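The verification of discrete Operations described for Figure 4 might be sketched as follows. The Operation fields mirror the three aspects named above (subset, relation, procedure); the field names, the sample "BroaderTerm" constraint, and the rule that relations without a registered constraint are unconstrained are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    subset: int      # thesaurus subset the operation targets
    relation: str    # relation the operation concerns
    procedure: str   # "add", "delete", "replace", "create" or "kill"
    args: tuple      # operands of the relation


def verify(operations, constraints):
    """Split operations into (valid, invalid) using per-relation constraints.

    `constraints` maps a relation name to a predicate over an Operation.
    Relations without an entry are treated as unconstrained (an assumption).
    """
    valid, invalid = [], []
    for op in operations:
        check = constraints.get(op.relation, lambda _op: True)
        (valid if check(op) else invalid).append(op)
    return valid, invalid


# Hypothetical integrity constraint: a term may not be its own broader term.
constraints = {"BroaderTerm": lambda op: op.args[0] != op.args[1]}
ops = [
    Operation(211, "BroaderTerm", "add", ("resistor", "component")),
    Operation(212, "BroaderTerm", "add", ("resistor", "resistor")),
]
valid, invalid = verify(ops, constraints)
# Changes are performed directly only when nothing was rejected; otherwise
# a verification page would be shown to the user first.
apply_directly = not invalid
```

In this example one operation passes and one fails, so the changes would not be applied directly and the user would first be shown the verification page.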
Figure 5 illustrates processing performed in response to a Correlate Concept request from User 150. In this example, User 150 has requested that correlation be performed for a particular underlying concept of a thesaurus Preferred Term present in thesaurus Subset 211, indicated by a gray dot in the figure.
Processing begins when User 150, via a page 505 displayed by External Web Browser 140, issues instructions to begin correlation. HTML module 111.1 interprets and processes 510 the request and transmits it to the Correlator 116. The Correlator 116 attempts to identify 515 correlation candidates. Correlation candidates are other concepts which represent terms in other thesaurus subsets or which are involved in formulas in the Generalized Knowledge Base 122, which are not present in Subset 211, and which are likely to mean the same thing as the starting concept 517 in Subset 211. (Figure 6 shows in more detail how correlation candidates are found.) In this example, two candidates 518, 519 were found: one 519 from Subset 212, and one 518 present in the Generalized Knowledge Base 122. After correlation candidates 518, 519 are found, HTML module 111.1 formats 520 a page 525 which allows User 150 to choose one of the candidates, or alternatively, to type in a Preferred Term (identifying an underlying concept) from another thesaurus to correlate with the starting concept 517. Web Browser 140 displays this page 525. At this point the user may decide not to perform any correlation at all, and may simply go on to another task. If the user wishes to choose one of the suggested concepts 518, 519 to correlate with the starting concept 517, or to enter a particular term to correlate with the starting concept 517 instead, he or she does so at this time and submits his or her choices. In the example illustrated in Figure 5, User 150 chose the concept 519 from Subset 212 to correlate with the starting concept 517.
HTML module 111.1 processes 530 the submitted page, dispatching the instructions to Correlator 116. Correlator 116 then merges 535 the starting concept 517 with the chosen concept 519. The strong dashed line 532 connecting two concepts (dots) in the figure indicates this merging. In correlation merging, one of the two concepts 517,519 is chosen to be kept, and all formulas expressing thesaurus information are removed from the other concept, restated with the "keeper" concept, and added to the Knowledge Base 120. The other concept is then deleted. The result is that there is no change in the thesaurus information or Generalized KB formulas present in the system, but now where there were once two separate concepts 517,519, there is a single concept. In the example illustrated, both concepts 517,519 are only in thesaurus subsets, so it does not matter which concept is chosen to be kept. If one of the two concepts 517,519 to be merged is involved in formulas in the Generalized KB 122, that concept will automatically be chosen for retention.
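The merging step 535 can be sketched as a rewrite of formulas in which every occurrence of the discarded concept is replaced by the keeper. The representation of a formula as a (subset, relation, concept, value) tuple, the concept identifiers, and the function name merge_concepts are hypothetical; only the keeper-selection rule is taken from the description above.

```python
def merge_concepts(formulas, a, b, in_generalized_kb):
    """Merge concepts `a` and `b`; return (keeper, restated formulas).

    `formulas` is a set of (subset, relation, concept, value) tuples.
    A concept involved in Generalized KB formulas is retained; when neither
    or both qualify, the choice is arbitrary and this sketch keeps `a`.
    """
    if b in in_generalized_kb and a not in in_generalized_kb:
        keeper, other = b, a
    else:
        keeper, other = a, b
    # Restate every formula with the keeper in place of the other concept.
    restated = {
        tuple(keeper if part == other else part for part in formula)
        for formula in formulas
    }
    return keeper, restated


formulas = {
    (211, "preferredTerm", "c517", "capacitor"),
    (211, "broaderTerm", "c517", "c100"),
    (212, "preferredTerm", "c519", "capacitor"),
    (212, "relatedTerm", "c519", "c200"),
}
keeper, merged = merge_concepts(formulas, "c517", "c519", in_generalized_kb=set())
```

The number of formulas is unchanged by the merge in this example; only the identity of the concept they mention differs, which matches the statement that the thesaurus information itself does not change.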
Once the concepts are merged, HTML module 111.1 formats 540 a result page 545, which is dispatched to and displayed by Web Browser 140 for User 150 to view. The example depicted shows processing in response to a user's request to find and establish correlations for a particular concept. The Correlator 116 may also be invoked on an entire thesaurus, sweeping through the subset and finding correlations for each concept present in the subset. Processing for each of these concepts is the same as what is depicted here, but interaction for User 150 differs because the Correlator 116 finds candidates for up to 10 starting concepts at a time, or until a time limit is reached. The user 150 then handles all these concepts as a batch.
Figure 6 shows details of the "Identify Correlation Candidates" step 515 of correlation processing. Starting with the concept to be correlated, several lists of concepts are obtained. These are not necessarily obtained in parallel, but the order in which the lists are retrieved does not matter. A list of concepts which have the same or similar Preferred Term string, in another thesaurus subset, as the starting concept is retrieved 602 and is given a strong weight. A list of concepts which have Alternate Terms (terms linked to a concept via "Use For" or one of the other, user-defined, Lexical Relations), in another thesaurus subset, which are the same as or similar to the Preferred Term string of the starting concept, is retrieved 604; these concepts are weighted weakly. A list of concepts which have some Alternate Term that is the same as, or similar to, one of the Alternate Terms of the starting concept is retrieved 606; these concepts are weighted weakly. A list of concepts is retrieved 608 by querying Lexicon 122.1 for concepts that serve as one of the meanings for the Preferred Term string of the starting concept. These concepts are given medium weighting. Finally, another list of concepts is retrieved 610 by querying Lexicon 122.1 for concepts that serve as one of the meanings for any of the Alternate Terms of the starting concept. These concepts are given a weak weight.
These five concept lists are merged 620 or combined additively; i.e., if a concept appears in more than one list, the weights from each list are added together. The resulting list, in which each concept appears only once, associated with the sum of its weights from the five starting lists, is subjected to several filters. First, the correlator 116 ensures 630 that there is no thesaurus overlap. If a concept is present in any of the thesauri of the starting concept, it is removed from the list. For these purposes, the Generalized Knowledge Base is treated as a thesaurus. Therefore, if the starting concept is involved in Generalized Knowledge Base formulas, any candidate concept also involved in Generalized Knowledge Base formulas is also removed from the list.
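The thesaurus overlap filter 630 may be pictured with the sketch below, in which each concept is associated with the set of thesauri it appears in and the Generalized Knowledge Base is simply one more member of that set. The concept names, the membership table and the function name remove_overlapping are hypothetical.

```python
GENERALIZED_KB = "generalized-kb"  # treated as one more thesaurus by the filter


def remove_overlapping(weights, membership, start):
    """Drop candidates that share any thesaurus with the starting concept.

    `weights` maps candidate -> summed weight. `membership` maps a concept to
    the set of thesaurus identifiers it appears in (subset numbers, plus
    GENERALIZED_KB when it is involved in Generalized Knowledge Base formulas).
    """
    start_homes = membership[start]
    return {
        candidate: weight
        for candidate, weight in weights.items()
        if not (membership.get(candidate, set()) & start_homes)
    }


membership = {
    "start": {211, GENERALIZED_KB},
    "cand_kb": {GENERALIZED_KB},
    "cand_212": {212},
    "cand_211": {211, 214},
}
weights = {"cand_kb": 3.0, "cand_212": 5.0, "cand_211": 2.0}
kept = remove_overlapping(weights, membership, "start")
```

Here the starting concept is assumed to be involved in Generalized Knowledge Base formulas, so the candidate found only in the Generalized Knowledge Base is removed along with the candidate that shares Subset 211.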
Next, a graph isomorphism filter is applied 640 to each candidate concept, comparing the relationships the starting concept is involved in with the relationships the candidate concept is involved in. First, a graph is created for each term. Terms in a thesaurus are linked to many other terms through a variety of relationships. These links are used to create a graph over the terms of the thesaurus. For each term, then, the links from that central term to other linking terms constitute a subgraph which serves as a signature for the central term. The subgraph of a candidate is then compared to the subgraph of the term being correlated. The similarity of the subgraphs can be treated as a graph isomorphism problem between the two subgraphs of the terms. The number of links in one subgraph which can be mapped onto an isomorphic link in the other subgraph serves as an indicator of the number of links, or relations, that are shared by the terms.
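A much simplified sketch of the signature comparison follows. It does not solve a general graph isomorphism problem; it assumes that linked terms can be matched by their labels, so that a link is "shared" when both central terms have the same relation to the same neighbor. The relation labels (BT, RT, NT), the sample terms and the function names are illustrative assumptions.

```python
def signature(links, concept):
    """Return the set of (relation, neighbor) links leaving `concept`."""
    return {(rel, dst) for src, rel, dst in links if src == concept}


def shared_links(links_a, concept_a, links_b, concept_b):
    """Count links of concept_a that map onto an identical link of concept_b."""
    return len(signature(links_a, concept_a) & signature(links_b, concept_b))


# Each link is a (source term, relation, target term) triple.
subset_211 = [
    ("capacitor", "BT", "passive component"),
    ("capacitor", "RT", "inductor"),
    ("capacitor", "RT", "dielectric"),
]
subset_212 = [
    ("condenser", "BT", "passive component"),
    ("condenser", "RT", "inductor"),
    ("condenser", "NT", "electrolytic condenser"),
]
n = shared_links(subset_211, "capacitor", subset_212, "condenser")
```

Two of the three links of each term coincide in this example, and that count could then be used to raise the candidate's weight slightly.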
Finally, a string similarity filter is applied 650 to each candidate concept. If the Preferred Term string or Alternate Term strings for a candidate concept are not equal, yet are string-similar, to the Preferred Term string or Alternate Term strings of the starting concept, the weight for that candidate concept is increased slightly. In a preferred embodiment, the string-similarity routine looks for missing, substituted or transposed letters as well as singular vs. plural, common variants in spelling (e.g., British vs. American English word endings) and differing verb conjugations. After these filters have been applied, concepts that exceed a certain, parameterized weight cutoff 660 are returned as correlation candidates.
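One way to detect missing, substituted or transposed letters is an edit distance that counts adjacent transpositions as a single edit. The sketch below uses the optimal string alignment distance for this purpose; the choice of that particular measure, the threshold of one edit and the function names are assumptions, and the sketch does not attempt the handling of verb conjugations mentioned above.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def string_similar(a, b, max_distance=1):
    """True when the strings are not equal but differ by few edits."""
    a, b = a.casefold(), b.casefold()
    return a != b and osa_distance(a, b) <= max_distance
```

Under these assumptions, string_similar("color", "colour") and string_similar("term", "terms") both hold, while identical strings are not reported as merely similar.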
From the above description, it will be apparent that the invention disclosed herein provides a novel and advantageous thesaurus management system and process providing integrated access to, and management of, multiple thesaurus subsets. The foregoing discussion discloses and describes merely exemplary methods and embodiments of the present invention. As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.