WO2006031466A2 - Functionality and system for converting data from a first to a second form - Google Patents


Info

Publication number
WO2006031466A2
WO2006031466A2 (PCT/US2005/031303)
Authority
WO
WIPO (PCT)
Prior art keywords
rules
data
set forth
terms
search
Prior art date
Application number
PCT/US2005/031303
Other languages
French (fr)
Other versions
WO2006031466A3 (en)
Inventor
Edward A. Green
Kevin L. Markey
Luis Rivas
Mark Kreider
Alec Sharp
Original Assignee
Silver Creek Systems, Inc.
Priority date
Filing date
Publication date
Priority claimed from US10/931,789 external-priority patent/US7865358B2/en
Priority claimed from US10/970,372 external-priority patent/US8396859B2/en
Priority claimed from US11/151,596 external-priority patent/US7536634B2/en
Application filed by Silver Creek Systems, Inc. filed Critical Silver Creek Systems, Inc.
Publication of WO2006031466A2 publication Critical patent/WO2006031466A2/en
Publication of WO2006031466A3 publication Critical patent/WO2006031466A3/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language

Definitions

  • the present invention relates generally to machine-based tools for use in converting data from one form to another and, in particular, to a framework for efficiently developing conversion rules as well as accessing and applying external information to improve such conversions.
  • the invention further relates to applying public or private rules for structuring or understanding data ("schema”) to new data so as to reduce start-up efforts and costs associated with configuring such machine-based tools.
  • linguistic differences may be due to the use of different languages or, within a single language, due to terminology, proprietary names, abbreviations, idiosyncratic phrasings or structures and other matter that is specific to a location, region, business entity or unit, trade, organization or the like.
  • Also included among linguistic differences for present purposes are different currencies, different units of weights and measures and other systematic differences. Syntax relates to the phrasing, ordering and organization of terms as well as grammatical and other rules relating thereto. Differences in format may relate to data structures or conventions associated with a database or other application and associated tools.
  • Some examples of conversion environments include: importing data from one or more legacy systems into a target system; correlating or interpreting an external input (such as a search query) in relation to one or more defined collections of information; correlating or interpreting an external input in relation to one or more external documents, files or other sources of data; facilitating exchanges of information between systems; and translating words, phrases or documents.
  • a machine-based tool attempts to address differences in linguistics, syntax and/or formats between the input and target environments. It will be appreciated in this regard that the designations "input” and "target” are largely a matter of convenience and are process specific. That is, for example, in the context of facilitating exchanges of information between systems, which environment is the input environment and which is the target depends on which way a particular conversion is oriented and can therefore change.
  • One difficulty associated with machine-based conversion tools relates to properly handling context dependent conversions.
  • properly converting an item under consideration depends on understanding something about the context in which the item is used. For example, in the context of product descriptions, an attribute value of "one inch” might denote one inch in length, one inch in radius or some other dimension depending on the product under consideration.
  • walking functions differently in the phrase “walking shoe” than in "walking to work.”
  • understanding something about the context of an item under consideration may facilitate conversion.
  • search engines are used in a variety of contexts to allow a user of a data terminal, e.g., a computer, PDA or data enabled phone, to search stored data for items of interest.
  • search engines are used for research, for on-line shopping, and for acquiring business information.
  • On-line catalog searching is illustrative.
  • On-line sales are an increasingly important opportunity for many businesses. To encourage and accommodate on-line purchasing, some companies have devoted considerable resources to developing search tools that help customers identify products of interest. This is particularly important for businesses that have an extensive product line, for example, office supply companies.
  • One type of search engine is the product category search engine.
  • the available products are grouped by categories and subcategories.
  • a user can then enter a product category term, or select a term from a pull-down window or the like, to access a list of available products.
  • These search engines are very useful for customers that have considerable experience or expertise by which to understand the structure of the product space of interest.
  • the product category may not be obvious or may not be the most convenient way to identify a product. For example, a customer wishing to purchase Post-It notes may not be able to readily identify the category in which that product is grouped or may not want to work through a series of menus to narrow a search down to the desired product.
  • Websites often accommodate keyword searching.
  • To execute a keyword search, the user enters a term to identify the product of interest, often a trademark or portion of a trademark.
  • a conventional search engine can then access a database to identify hits or, in some cases, near hits. This allows a customer with a particular product in mind to quickly identify the product, even if the customer cannot or does not wish to identify the product category for that product.
  • a customer needing to order appointment books may enter the popular trademark "Daytimer." If Daytimer appointment books are not carried or are not currently available at the site, the search results may indicate that there is no match, even though other appointment books, e.g., At-A-Glance brand books, are available. This, of course, is a lost sales opportunity for the business.
  • the present invention is directed to a computer-based tool and associated methodology for transforming electronic information so as to facilitate communications between different semantic environments and access to information across semantic boundaries.
  • the present invention is related to enabling sharing of knowledge developed in a process of configuring a transformation utility that provides the structure for transforming such electronic information.
  • the present invention may be implemented in the context of a system where subject matter experts (SMEs) develop a semantic metadata model (SMM) for facilitating data transformation.
  • the SMM utilizes contextual information and standardized rules and terminology to improve transformation accuracy.
  • the present invention allows for sharing of knowledge developed in this regard so as to facilitate development of a matrix of transformation rules ("transformation rules matrix").
  • Such a transformation system and the associated knowledge sharing technology are described in turn below.
  • the invention is applicable with respect to a wide variety of content including sentences, word strings, noun phrases, and abbreviations and can even handle misspellings and idiosyncratic or proprietary descriptors.
  • the invention can also manage content with little or no predefined syntax as well as content conforming to standard syntactic rules.
  • the system of the present invention allows for substantially real-time transformation of content and handles bandwidth or content throughputs that support a broad range of practical applications.
  • the invention is applicable to structured content such as business forms or product descriptions as well as to more open content such as information searches outside of a business context. In such applications, the invention provides a system for semantic transformation that works and scales.
  • the invention has particular application with respect to transformation and searching of both business content and non-business content.
  • transformation and searching of business content presents special challenges.
  • the need for better access to business content and business content transformation is expanding.
  • business content is generally characterized by a high degree of structure and reusable "chunks" of content.
  • Such chunks generally represent a core idea, attribute or value related to the business content and may be represented by a character, number, alphanumeric string, word, phrase or the like.
  • this content can generally be classified relative to a taxonomy defining relationships between terms or items, for example, via a hierarchy such as of family (e.g., hardware), genus (e.g., connectors), species (e.g., bolts), subspecies (e.g., hexagonal), etc.
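For illustration only, the following sketch (Python, with hypothetical class names) shows how such a family/genus/species/subspecies lineage might be represented so that classifying a chunk at its most specific level implies the rest of its hierarchy; the patent does not prescribe this particular representation.

```python
# Hypothetical sketch of a hierarchical taxonomy: classifying a chunk at the
# most specific level ("hexagonal") implies its full lineage up to "hardware".
TAXONOMY_PARENTS = {
    "hexagonal": "bolts",      # subspecies -> species
    "bolts": "connectors",     # species    -> genus
    "connectors": "hardware",  # genus      -> family
}

def lineage(classification: str) -> list[str]:
    """Walk parent links from the given classification up to the root."""
    chain = [classification]
    while chain[-1] in TAXONOMY_PARENTS:
        chain.append(TAXONOMY_PARENTS[chain[-1]])
    return chain

print(lineage("hexagonal"))  # ['hexagonal', 'bolts', 'connectors', 'hardware']
```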
  • Non-business content, though typically less structured, is also amenable to normalization and classification.
  • With regard to normalization, terms or chunks with similar potential meanings, including standard synonyms, colloquialisms, specialized jargon and the like, can be standardized to facilitate a variety of transformation and searching functions.
  • chunks of information can be classified relative to taxonomies defined for various subject matters of interest to further facilitate such transformation and searching functions.
  • the present invention takes advantage of the noted characteristics to provide a framework by which locale-specific content can be standardized and classified as intermediate steps in the process for transforming the content from a source semantic environment to a target semantic environment and/or searching for information using locale-specific content.
  • standardization may encompass linguistics and syntax as well as any other matters that facilitate transformation.
  • a method and corresponding apparatus are provided for transforming content from a first semantic environment to a second semantic environment by first converting the input data into an intermediate form.
  • the associated method includes the steps of: providing a computer-based device; using the device to access input content reflecting the first semantic environment and convert at least a portion of the input content into a third semantic environment, thereby defining a converted content; and using the converted content in transforming a communication between a first user system operating in the first semantic environment and a second user system operating in the second semantic environment.
  • the input content may be business content such as a parts listing, invoice, order form, catalogue or the like.
  • This input content may be expressed in the internal terminology and syntax (if any) of the source business.
  • this business content is converted into a standardized content reflecting standardized terminology and syntax.
  • the resulting standardized content has a minimized (reduced) set of content chunks for translation or other transformation and a defined syntax for assisting in transformation.
  • the intermediate, converted content is thus readily amenable to transformation.
  • the processed data chunks may be manually or automatically translated using the defined syntax to enable rapid and accurate translation of business documents across language boundaries.
  • the conversion process is preferably conducted based on a knowledge base developed from analysis of a quantity of information reflecting the first semantic environment.
  • this quantity of information may be supplied as a database of business content received from a business enterprise in its native form.
  • This information is then intelligently parsed into chunks by one or more SMEs using the computer-based tool.
  • the resulting chunks which may be words, phrases, abbreviations or other semantic elements, can then be mapped to standardized semantic elements.
  • the set of standardized elements will be smaller than the set of source elements due to redundancy of designations, misspellings, format variations and the like within the source content.
  • business content is generally characterized by a high level of reusable chunks.
  • the transformation rules matrix or set of mapping rules is considerably compressed in relation to that which would be required for direct transformation from the first semantic environment to the second.
  • the converted semantic elements can then be assembled in accordance with the defined syntax to create a converted content that is readily amenable to manual or at least partially automated translation.
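A minimal sketch of the two-stage conversion described above, assuming purely hypothetical mapping tables: idiosyncratic source chunks are first collapsed to a smaller set of standardized elements, which are then translated and reassembled under a defined syntax.

```python
# Hypothetical two-stage conversion: source chunks are first mapped to a small
# standardized lexicon, then the standardized elements are translated and
# reassembled under a defined syntax. The tables below are illustrative only.
SOURCE_TO_STANDARD = {        # many idiosyncratic source chunks...
    "hex nt": "hexagonal nut",
    "hex. nut": "hexagonal nut",
    "10 mm": "10 millimeter",
}
STANDARD_TO_TARGET = {        # ...collapse to fewer standardized chunks to translate
    "hexagonal nut": "tuerca hexagonal",
    "10 millimeter": "10 milímetros",
}

def transform(source_chunks: list[str]) -> str:
    standardized = [SOURCE_TO_STANDARD.get(c, c) for c in source_chunks]
    translated = [STANDARD_TO_TARGET.get(c, c) for c in standardized]
    # Defined target syntax: "<object>, <attribute value>".
    return ", ".join(translated)

print(transform(["hex nt", "10 mm"]))  # tuerca hexagonal, 10 milímetros
```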
  • A computer-based device is also provided for use in efficiently developing a standardized semantic environment corresponding to a source semantic environment.
  • the associated method includes the steps of: accessing a database of information reflecting a source semantic environment; using the computer-based device to parse at least a portion of the database into a set of source semantic elements and identify individual elements for potential processing; using the device to select one of the source elements and map it to a standardized semantic element; and iteratively selecting and processing additional source elements until a desired portion of the source elements are mapped to standardized elements.
  • the computer-based device may perform a statistical or other analysis of the source database to identify how many times or how often individual elements are present, or may otherwise provide information for use in prioritizing elements for mapping to the standardized lexicon. Additionally, the device may identify what appear to be variations for expressing the same or related information to facilitate the mapping process. Such mapping may be accomplished by associating a source element with a standardized element such that, during transformation, appropriate code can be executed to replace the source element with the associated standardized element. Architecturally, this may involve establishing corresponding tables of a relational database, defining a corresponding XML tagging structure and/or establishing other definitions and logic for handling structured data.
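The statistical analysis mentioned above could, for example, be a simple frequency count used to prioritize which source elements an operator maps first; the records below are invented for illustration.

```python
from collections import Counter

# Hypothetical sketch: count how often each parsed source element occurs so an
# operator can map the highest-impact elements to standardized terms first.
source_records = [
    "hex nt 10mm", "hex. nut 10 mm", "flat washer 8mm", "hex nt 8mm",
]

element_counts = Counter(
    element for record in source_records for element in record.split()
)

# Elements presented in priority order (most frequent first).
for element, count in element_counts.most_common():
    print(f"{count:3d}  {element}")
```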
  • the "standardization" process need not conform to any industry, syntactic, lexicographic or other preexisting standard, but may merely denote an internal standard for mapping of elements. Such a standard may be based in whole or in part on a preexisting standard or may be uniquely defined relative to the source semantic environment. In any case, once thus configured, the system can accurately transform not only known or recognized elements, but also new elements based on the developed knowledge base.
  • the mapping process may be graphically represented on a user interface.
  • The interface preferably displays, on one or more screens (simultaneously or sequentially), information representing source content and a workspace for defining standardized elements relative to source elements.
  • corresponding status information is graphically shown relative to the source content, e.g., by highlighting or otherwise identifying those source elements that have been mapped and/or remain to be mapped. In this manner, an operator can readily select further elements for mapping, determine where he is in the mapping process and determine that the mapping process is complete, e.g., that all or a sufficient portion of the source content has been mapped.
  • The mapping process thus enables an operator to make the most effective use of the time available for mapping and to define a custom transformation "dictionary" that includes a minimized number of standardized terms defined relative to source elements in their native form.
  • contextual information is added to source content prior to transformation to assist in the transformation process.
  • the associated method includes the steps of: obtaining source information in a first form reflecting a first semantic environment; using a computer-based device to generate processed information that includes first content corresponding to the source information and second content, provided by the computer-based device, regarding a context of a portion of the first content; and converting the processed information into a second form reflecting a second semantic environment.
  • the second content may be provided in the form of tags or other context cues that serve to schematize the source information.
  • the second content may be useful in defining phrase boundaries, resolving linguistic ambiguities and/or defining family relationships between source chunks.
  • The result is an information-enriched input for transformation that increases the accuracy and efficiency of the transformation.
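A toy sketch of adding such second content (context tags) to source content; the tag names and the single regular-expression heuristic are assumptions, not the patent's grammar.

```python
import re

# Hypothetical sketch: wrap recognized chunks of a raw source record in simple
# XML-style context tags before transformation, so downstream rules can see
# phrase boundaries and the role each chunk plays.
def add_context(record: str) -> str:
    # Tag a leading product noun phrase and a trailing dimension as an attribute.
    m = re.match(r"(?P<noun>[A-Za-z .]+?)\s+(?P<value>\d+(\.\d+)?\s*(mm|in|inch))$",
                 record.strip())
    if not m:
        return f"<unparsed>{record}</unparsed>"
    return (f"<item><noun>{m.group('noun')}</noun>"
            f"<attribute type='dimension'>{m.group('value')}</attribute></item>")

print(add_context("roller bearing 1 inch"))
# <item><noun>roller bearing</noun><attribute type='dimension'>1 inch</attribute></item>
```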
  • An engine is also provided for transforming certain content of electronic transmissions between semantic environments.
  • a communication is established for transmission between first and second user systems associated with first and second semantic environments, respectively, and transmission of the communication is initiated.
  • a business form may be selected, filled out and addressed.
  • the engine then receives the communication and, in substantially real-time, transforms the content relative to the source semantic environment, thereby providing transformed content.
  • the transmission is completed by conveying the transformed content between the user systems.
  • the engine may be embodied in a variety of different architectures.
  • the engine may be associated with the transmitting user system relative to the communication under consideration, the receiving user system, or at a remote site, e.g., a dedicated transformation gateway.
  • the transformed content may be fully transformed between the first and second semantic environments by the engine, or may be transformed from one of the first and second semantic environments to an intermediate form, e.g., reflecting a standardized semantic environment and/or neutral language. In the latter case, further manual and/or automated processing may be performed in connection with the receiving user system. In either case, such substantially real-time transformation of electronic content marks a significant step towards realizing the ideal of globalization.
  • information is processed using a structure for normalization and classification of locale-specific content.
  • a computer-based processing tool is used to access a communication between first and second data systems, where the first data system operates in a first semantic environment defined by at least one of linguistics and syntax specific to that environment.
  • the processing tool converts at least one term of the communication between the first semantic environment and a second semantic environment and associates a classification with the converted or unconverted term.
  • the classification identifies the term as belonging to the same class as certain other terms based on a shared characteristic, for example, a related meaning (e.g., a synonym or conceptually related term), a common lineage within a taxonomy system (e.g., an industry-standard product categorization system, entity organization chart, scientific or linguistic framework, etc.), or the like.
  • the classification is then used to process the communication.
  • the communication may be directed to and/or received from the first semantic environment.
  • For example, a communication such as a search query may include locale-specific information such as abbreviations, proprietary names, colloquial terminology, or the like.
  • Such a term in the query may first be normalized or cleaned such that the term is converted to a standardized or otherwise defined lexicon. This may involve syntax conversion, linguistic conversion and/or language translation.
  • the converted or unconverted term is classified and the associated classification is used to identify information responsive to the query.
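A minimal sketch of the normalize-then-classify flow just described, assuming hypothetical lookup tables for the lexicon, the classification structure and the searchable catalog.

```python
# Hypothetical sketch: a locale-specific query term is first normalized to a
# defined lexicon, then associated with a classification that drives retrieval.
NORMALIZE = {"sticky pad": "adhesive notepad", "post it": "adhesive notepad"}
CLASSIFY = {"adhesive notepad": "Office Supplies > Paper > Notepads > Adhesive"}
CATALOG = {
    "Office Supplies > Paper > Notepads > Adhesive": [
        "Post-it notes 3x3", "Generic repositionable notes 4x6",
    ],
}

def respond(query_term: str) -> list[str]:
    standard = NORMALIZE.get(query_term.lower(), query_term.lower())
    classification = CLASSIFY.get(standard)
    return CATALOG.get(classification, [])

print(respond("Sticky Pad"))  # items classified under the adhesive notepad node
```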
  • the communication may be directed to the first semantic environment as by an individual or business consumer seeking product information from a company information system.
  • a term may be converted from an external form of the second semantic environment to the first semantic environment.
  • For example, a term of the communication (e.g., "10 mm hexagonal Allen nut") may be converted to an internal product identifier such as a name, number or description (e.g., "hex nut-A").
  • the converted or unconverted term is associated with a classification (e.g., metric fasteners) and the classification is used to process the communication (e.g., by constructing a menu, page or screen with product options of potential interest).
  • the noted process generally involves one or more operators or subject matter experts (SMEs) for developing the knowledge base which reflects a semantic metadata model (SMM) involving a matrix of transformation rules, e.g., relating to standard terminology, grammar and term classification.
  • This process can be time-consuming and is somewhat subjective. That is, the process of developing the SMM generally involves, among other things, mapping individual terms from the source collection to standardized terminology and associating those terms with information identifying their respective classification or position within a defined taxonomy structure. In the case of large source collections, this may involve considerable time. In order to accelerate the process, it would be useful to re-use knowledge that has been previously developed.
  • knowledge may be re-used to import rules developed in connection with another project, or to allow multiple SMEs to work on a given domain with each SME benefiting from, and not being handicapped by, the work of the other(s).
  • the selection of standardized terminology, mapping of terms from the source collection to the standardized terminology and the association of taxonomy tags to individual terms all involve subjective determinations. These subjective determinations can result in inconsistencies or ambiguities in the SMM when knowledge is imported. The elimination of such ambiguities is, of course, an important motivation for developing the SMM.
  • such domains correspond to different subject matter areas presumptively encompassing different terms of the source collection.
  • such different domains may correspond in a business context to different business divisions, product categories or catalog sections.
  • These different domains may be associated with different tags at a high (broad) level of a taxonomy structure.
  • it is often difficult to make neat divisions of domains such that all overlap of terminology is avoided.
  • A method and apparatus are provided for sharing transformation information, i.e., importing transformation information developed in connection with one set of data for use in connection with a second set of data. That is, one part of the transformation process is developing a transformation rules matrix as noted above.
  • This transformation rules matrix is developed by establishing a set of rules in connection with consideration of a first set of terms of a source system.
  • the instant utility is directed to re-using such rules in connection with a second set of terms that may be associated with the same source system (e.g., in the case of multiple SMEs developing a single transformation rules matrix) or may be associated with a different source system.
  • the utility involves providing logic for converting data from a first form to a second form, where the logic is configurable to apply transformation rules (e.g., a transformation rules matrix as noted above) developed for a particular transformation application.
  • the logic is first used in connection with a first set of source data including a set of first terms to develop first transformation information, and to establish a first storage structure (e.g., a database) for indexing the first transformation information to the first terms.
  • the logic is further (e.g., subsequently) used to develop second transformation information for a second set of source data that may be from the same or a different system than the first set of source data.
  • the second terms are different from the first terms.
  • the logic is operative for: establishing a second storage structure (e.g., a database that may be embodied in the same or a different machine than the first storage structure), for relating the second transformation information to the second terms; importing at least a portion of the first transformation information to the second storage structure; and relating, in the second storage structure, the first transformation information to the second terms.
  • transformation information may be developed in connection with a first development effort for developing the first storage structure. That transformation information may be imported and supplemented or otherwise modified in connection with a second development effort for developing the second storage structure. The modified transformation information may then be exported for further use, e.g., in connection with further development of yet another storage structure.
  • the noted utility may be used in any of these transformation information re-use contexts.
  • the logic is preferably operative to address any potential inconsistencies entailed by such re-use.
  • the logic may identify to all "owners" of transformation information any differences between original transformation information and modified transformation information, or apparent naming or rules conflicts so that such potential inconsistencies can be addressed, e.g., by selectively accepting, rejecting, or modifying any differences associated with the modified transformation information or arbitrating conflicts.
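One way such an import with conflict flagging could look, assuming plain dictionaries stand in for the two storage structures; a deployed system would use databases as noted above.

```python
# Hypothetical sketch of importing previously developed transformation rules
# into a second rule store while flagging conflicts for the rule "owners"
# to accept, reject, or arbitrate.
first_rules = {"hex nt": "hexagonal nut", "flt wshr": "flat washer"}
second_rules = {"hex nt": "hex nut", "lk wshr": "lock washer"}

def import_rules(imported: dict, existing: dict):
    conflicts = {}
    for source_term, standard in imported.items():
        if source_term in existing and existing[source_term] != standard:
            # Same source term mapped differently: record the conflict rather
            # than silently overwriting either owner's rule.
            conflicts[source_term] = (existing[source_term], standard)
        else:
            existing.setdefault(source_term, standard)
    return existing, conflicts

merged, conflicts = import_rules(first_rules, second_rules)
print(merged)
print(conflicts)  # {'hex nt': ('hex nut', 'hexagonal nut')}
```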
  • A method and apparatus are provided for allowing multiple SMEs to work in the same domain in connection with developing an SMM.
  • the utility involves: providing logic for use in converting data from a first form to a second form, where the logic is configurable to associate elements of the first form with elements of the second form to develop a transformation model such as an SMM; first operating the logic to establish a first association of elements of the first form with elements of the second form; second operating the logic to establish a second association of elements of the first form with elements of the second form; and third operating the logic to process at least one of the first and second associations to address any inconsistencies associated with the overlap.
  • the transformation process involves mapping a source collection to an SMM, i.e., a form that facilitates transformation of the data.
  • a target form of a particular transformation transaction may also be mapped to the same or a different SMM.
  • multiple SMEs or other users or systems may be involved in establishing associations between a source form and an SMM, between a target form and an SMM, or between SMMs, and the noted utility may be utilized in any of these contexts.
  • the associations may involve developing a transformation rules matrix for mapping terms of the first form to terms of the second form (e.g., from the source collection to standardized terminology) and/or applying tags identifying a position of a term within a taxonomy structure so as to facilitate contextual understanding of the term.
  • first and second operating the logic to establish associations may be executed by different SMEs or other users.
  • users may use intuitive graphical interfaces to relate a term from the source collection to an associated standardized term.
  • graphical interfaces may be used to relate terms to appropriate classifications or positions of a taxonomy structure, e.g., via a drag-and-drop operation.
  • a potential inconsistency in this regard may relate, for example, to associating a given term of a source form with two different terms of an SMM or associating the given term with two different taxonomy tags or overall tag structures, thereby creating a transformation ambiguity.
  • Such potential inconsistencies may be resolved in a number of ways. For example, such potential inconsistencies may be resolved by establishing a definitive transformation rules matrix and automatically rejecting conflicting transformation information. Alternatively, potentially conflicting transformation information may be identified to all concerned users who may arbitrate such potential conflicts on an individual basis. It will be appreciated that such inconsistencies may be addressed in other ways.
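A small sketch of detecting potential inconsistencies between two SMEs' association sets (standardized term plus taxonomy tag); the resolution policy, e.g., automatic rejection versus arbitration, is left outside the sketch, and all mappings shown are invented.

```python
# Hypothetical sketch: two SMEs' association sets are compared; any source term
# given either a different standardized term or a different taxonomy tag is a
# potential transformation ambiguity to be resolved (auto-reject or arbitrate).
sme_a = {"sticky pad": ("adhesive notepad", "Paper/Notepads/Adhesive")}
sme_b = {"sticky pad": ("adhesive notepad", "Paper/Repositionable Notes"),
         "daytimer":   ("appointment book", "Books/Appointment Books")}

def find_ambiguities(a: dict, b: dict) -> dict:
    """Return terms mapped by both SMEs but with differing associations."""
    return {term: (a[term], b[term])
            for term in a.keys() & b.keys() if a[term] != b[term]}

print(find_ambiguities(sme_a, sme_b))
```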
  • A knowledge base is constructed in which item descriptor terms and/or potential search terms are associated with contextual information, allowing the search logic to associate such a term, including a specific, colloquial or otherwise idiosyncratic term, with a subject matter context, so as to enable a more complete search and increase the likelihood of yielding useful results.
  • a method and apparatus (“utility") are provided for use in establishing a searchable data structure where search terms are associated with a subject matter context.
  • the searchable data structure may be, for example, a database system or other data storage resident on a particular machine or distributed across a local or wide area network.
  • the utility involves providing a list of potential search terms pertaining to a subject matter area of interest and establishing a classification structure for the subject matter area of interest.
  • the list of potential search terms may be an existing list that has been developed based on analysis of the subject matter area or may be developed by a subject matter expert or based on monitoring search requests pertaining to the subject matter of interest.
  • the list may be drawn from multiple sources, e.g., starting from existing lists and supplemented by monitoring search requests. It will be appreciated that lists exist in many contexts such as in connection with pay-per-click search engines.
  • the classification structure preferably has a hierarchical form defined by classes, each of which includes one or more sub-classes, and so on.
  • the utility further involves associating each of the potential search terms with the classification structure such that the term is assigned to at least one sub-class and a parent class. For example, such associations may be reflected in an XML tag structure or by any other system for reflecting such metadata structure. In this manner, search terms are provided with a subject matter context for facilitating searching.
  • a search query including the term Daytimer may be interpreted so as to provide search results related more generally to appointment books.
  • Such a search may be implemented iteratively such that the search system first seeks results matching "Daytimer" and, if no responsive information is available, proceeds to the next rung on the classification system, for example, "Appointment Books." Such iterations may be repeated until results are obtained or until a predetermined number of iterations is completed, at which point the system may return an error message such as "no results found."
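A sketch of the iterative fallback just described, with a hypothetical parent-class map and index; "Daytimer" yields results only after climbing to the "Appointment Books" rung.

```python
# Hypothetical sketch of the iterative search: try the entered term first, then
# climb the classification hierarchy until results are found or a maximum
# number of iterations is reached.
PARENT_CLASS = {"Daytimer": "Appointment Books", "Appointment Books": "Office Supplies"}
INDEX = {"Appointment Books": ["At-A-Glance weekly planner", "Generic desk diary"]}

def search(term: str, max_iterations: int = 3):
    current = term
    for _ in range(max_iterations):
        hits = INDEX.get(current)
        if hits:
            return hits
        if current not in PARENT_CLASS:
            break
        current = PARENT_CLASS[current]   # move to the next rung up
    return "no results found"

print(search("Daytimer"))  # falls back to the "Appointment Books" class
```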
  • similar context information may be provided to terms associated with the data to be searched or source data.
  • the utility generally involves providing a list of source data terms defining a subject matter area of interest and establishing a classification structure for the source data terms.
  • the classification structure preferably has a hierarchical form including classes each of which includes one or more sub-classes, and so on.
  • Each of the source terms is associated with the classification structure such that the source term is assigned to at least one of the sub-classes and an associated parent class.
  • context is provided in connection with source data to facilitate searching.
  • a search query including the term "Appointment Book” may retrieve source data pertaining to Daytimer products, even though those products' descriptors may not include the term "Appointment Book.”
  • a data structure is established such that both potential search terms and source data terms are associated with a classification structure.
  • a search query including the term “Daytimer” may be associated with a classification “Appointment Books.”
  • a data item associated with the trademark “At-A-Glance” may be associated with the subject matter classification “Appointment Books.” Consequently, a search query including the term “Daytimer” may return search results including the "At-A-Glance” products of potential interest.
  • a utility for searching stored data using contextual metadata.
  • the utility involves establishing a knowledge base for a given subject matter area, receiving a search request including a first descriptive term, accessing a source data collection using the knowledge base, and responding to the search request using the responsive information.
  • the knowledge base defines an association between a term of the search request and an item of source data based on a classification within a context of the subject matter area. Such a classification may be associated with the search term and/or a source term.
  • A search request may thereby be addressed based on a subject matter context even though the search is entered based on specific search terms and the item of source data is associated with specific source terms.
  • the knowledge base may optionally include additional information related to the subject matter area, such as a system of rules for standardizing terminology and syntax, i.e., a grammar.
  • a data search is facilitated based on a standardization of terms utilized to execute the search. It has been recognized that term searches are complicated by the fact that searchers may enter terms that are misspelled, colloquial, or otherwise idiosyncratic. Similarly, source data may include jargon, abbreviations or other matter that complicates term matching. Accordingly, term searches can be facilitated by standardizing one or both of the search terms and source terms. For example, a user searching for Post-it notes may enter a colloquial term such as "sticky tabs.” This term may be rewritten by a utility according to the present invention, as, for example, "adhesive notepad" or some other selected standard term.
  • Similarly, an entry in a source collection, such as a catalog, may be rewritten to include standard terminology and syntax.
  • the term “Pl notes” may be rewritten as "Post-it notes” and may be associated with the classification "adhesive notepad.”
  • a first order classification of the source term matches the standardized search term, thereby facilitating retrieval of relevant information.
  • such matching is not limited to matching of terms rewritten in standardized form or matching of classifications, but may involve matching a rewritten search term to a classification or vice-versa.
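An illustrative sketch, with invented rewrite tables, of matching a rewritten search term against either a rewritten source descriptor or its first-order classification.

```python
# Hypothetical sketch: both the entered search term and the catalog descriptor
# are rewritten toward standard terms, so a colloquial query matches an
# idiosyncratic source entry through the shared standardized form or class.
QUERY_REWRITE = {"sticky tabs": "adhesive notepad"}
SOURCE_REWRITE = {"PI notes": ("Post-it notes", "adhesive notepad")}

catalog = [{"descriptor": "PI notes", "sku": "PN-3x3"}]

def match(query: str) -> list[str]:
    wanted = QUERY_REWRITE.get(query.lower(), query.lower())
    results = []
    for item in catalog:
        rewritten, classification = SOURCE_REWRITE.get(
            item["descriptor"], (item["descriptor"], None))
        # Match either the rewritten descriptor or its first-order classification.
        if wanted in (rewritten.lower(), classification):
            results.append(item["sku"])
    return results

print(match("sticky tabs"))  # ['PN-3x3']
```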
  • Searching using a data structure of standardized terms and/or associated classifications (e.g., a knowledge base) also enables matches based on relationships beyond literal term overlap.
  • a knowledge base may be constructed such that the classification "pen” or specific pen product records are retrieved in response to a search query including "writing instruments" and "office gifts.”
  • such functionality may facilitate searching of multiple legacy databases, e.g., by an inside or outside party or for advanced database merging functionality.
  • an entity may have information related to a particular product, company or other subject matter in multiple legacy databases, e.g., a product database and an accounting database.
  • legacy databases may employ different conventions, or no consistent conventions, regarding linguistics and syntax for identifying common data items. This complicates searching using conventional database search tools and commands, and can result in incomplete search results.
  • a defined knowledge base can be used to relate a search term to corresponding information of multiple legacy systems, e.g., so that a substantially free-form search query can retrieve relevant information from the multiple legacy systems despite differing forms of that information in those legacy environments.
  • a searchable data system using contextual metadata includes an input port for receiving a search request including a search term, a first storage structure for storing searchable data defining a subject matter of the searchable data system, a second storage structure for storing a knowledge base, and logic for identifying the search term and using the knowledge base to obtain responsive information.
  • the system further comprises an output port for outputting the responsive data, e.g., to the user or an associated network node.
  • the knowledge base relates a potential search term to a defined classification structure of the subject matter of the searchable data system.
  • the classification structure may include classes, sub-classes and so on to define the subject matter to a desired granularity.
  • the logic uses the knowledge base to relate the search term to a determined classification of the classification structure and, in turn, uses the determined classification to access the first storage structure to obtain the responsive data.
  • the present invention is further directed to a machine-based tool and associated logic and methodology for use in converting data from an input form to a target form using context dependent conversion rules.
  • conversions are improved, as ambiguities can be resolved based on context cues.
  • existing public or private schema can be utilized to establish conversion rules for new data thereby leveraging existing structure developed by an entity or otherwise developed for or inherent in a given subject matter context.
  • structure can be imported a priori to a given conversion environment and need not, in all cases, be developed based on a detailed analysis of the new data. That is, structure can be imparted in a top-down fashion to a data set and is not limited to bottom-up evolution from the data.
  • context dependent conversion rules can be efficiently accessed without the need to access a rigid and complex classification structure defining a larger subject matter context.
  • a rule structure developed in this manner can provide a high degree of reusability across different conversion environments for reduced start-up effort and cost.
  • subject matter cues and structure can be based on or adopt existing data structures and metadata elements (e.g., of an existing database or other structured data system) so as to provide further efficiencies and functionality.
  • One approach involves developing a classification structure by which terms under consideration can be mapped to or associated with a particular classification taxonomy. For example, in the context of a database or catalog of business products, a product attribute term may be associated with a parent product classification, which in turn belongs to a grandparent product grouping classification, etc.
  • the associated classification structure may be referred to as a parse tree.
  • Such a classification taxonomy entails certain inefficiencies.
  • For example, very deep parses may be required, reflecting a complicated parse tree. These deep parses require substantial effort and processing resources to develop and implement.
  • the resulting classification structures impose significant rigidity on the associated conversion processes such that it may be difficult to adapt the structures to a new conversion environment or to reuse rules and structures as may be desired.
  • predefined, complex structures have limited ability to leverage context cues that may exist in source structured data or that may otherwise be inferred based on an understanding of the subject matter at issue, thereby failing to realize potential efficiencies.
  • A frame-slot architecture is provided for use in converting information.
  • a frame represents an intersection between a contextual cue recognized by the machine tool, associated content and related constraint information specific to that conversion environment, whereas a slot represents an included chunk of information.
  • a chunk of information such as "1 inch roller bearing” may be recognized by the machine tool logic or grammar as an attribute phrase.
  • the term "1 inch” may then be recognized as an attribute value.
  • “1 inch” represents a radius dimension and not a length, width, height or similar rectilinear designation.
  • Such contextual cues can be inferred from a general, public understanding of the subject matter, i.e., what a roller bearing is. Such understanding is a kind of public schema.
  • an associated private schema may define acceptable values or ranges for this attribute. For example, only certain values or a certain range of values for the attribute at issue may be "legal"; that is, only those values may be acceptable within rules defined by an interested entity.
  • such private schema may be pre-defined and thus available for use in a conversion process prior to any detailed analysis of the data sets at issue. The attribute value can be compared to such constraints to confirm the identification of the attribute phrase or to identify corrupted or nonconforming data.
  • the frame is thus a specification of context or other disambiguating cues at or close to the whole-record level, less sensitive to syntax and more sensitive to the intersection of attributes and their values.
  • a frame functions as a container for grammatical information used to convert data, analogous to a software object.
  • the frame-slot architecture thus can resolve ambiguities without deep parses and yields flexible and more readily reusable syntactic rules.
  • constraint information is readily available, e.g., for attribute values, thus allowing for more confidence in conversions and better recognition of conversion anomalies.
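A highly simplified sketch of the frame-slot idea: the frame for "roller bearing" supplies both the context-dependent interpretation of a bare dimension and the constraint information used to spot nonconforming data. The dictionary layout and the legal range are assumptions.

```python
# Hypothetical frame-slot sketch: a frame pairs a recognized contextual cue
# (the object "roller bearing") with the attribute interpretation it implies
# and with constraint ("legal value") information; slots hold parsed chunks.
FRAMES = {
    "roller bearing": {
        # Within this frame a bare dimension is interpreted as a radius,
        # and a private schema limits acceptable values (in inches).
        "dimension_means": "radius",
        "legal_range_inches": (0.5, 6.0),
    }
}

def fill_slots(object_term: str, value_inches: float) -> dict:
    frame = FRAMES.get(object_term)
    if frame is None:
        return {"object": object_term, "note": "no frame; deeper parse required"}
    lo, hi = frame["legal_range_inches"]
    return {
        "object": object_term,
        "attribute": frame["dimension_means"],
        "value_inches": value_inches,
        "conforming": lo <= value_inches <= hi,  # flags corrupted/nonconforming data
    }

print(fill_slots("roller bearing", 1.0))
```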
  • A method and apparatus ("utility") are provided for converting a semantic element under consideration.
  • the utility involves receiving content associated with a data source and obtaining first information from the content for use in a conversion.
  • the nature of the content depends, for example, on the conversion environment.
  • the content may be structured (e.g., in the case of converting data from a database or other structured source) or unstructured (e.g., in the case of a search query or other textual data source).
  • the first information can be any of a variety of data chunks that are recognized by the utility, for example, an attribute phrase or other chunk including context cues in data or metadata form.
  • the utility uses the first information to obtain second information, from a location external to the content, for use in the conversion, and uses the first and second information in converting the content from a first form to a second form.
  • the second information may include context specific interpretation rules (e.g., "1 inch” means “1 inch in radius”), context specific constraints (e.g., acceptable attribute values must fall between 0.5-6.0 inches) and/or context-specific syntax or format rules (e.g., re-write as "roller bearing - 1 inch radius").
  • a frame-slot architecture can be implemented with attendant advantages as noted above. It will be appreciated that such an architecture can be imposed on data in a top-down fashion or developed from data in a bottom-up fashion. That is, frames may be predefined for a particular subject matter such that data chunks can then be slotted to appropriate frames, or frames can evolve from the data and make use of the data's intrinsic or existing structures. In the latter regard, it will be appreciated that existing databases and structured data often have a high degree of embedded contextual cues that the utility of the present invention can leverage to efficiently define frame-slot architecture.
  • A utility is also provided for converting data from a first form to a second form based on an external schema.
  • the utility involves establishing a number of schema, each of which includes one or more conversion rules for use in converting data within a corresponding context of a subject matter area.
  • a set of data is identified for conversion from the first form to the second form and a particular context of the set of data is determined.
  • a first schema is accessed and a conversion rule of the first schema is used in a process for converting the set of data from the first form to the second form.
  • the schemas are established based on external knowledge of a subject matter area independent of analysis of a particular set of data to be converted.
  • the schema may include one or more public schema including conversion rules generally applicable to the subject matter area independent of any entity or group of entities associated with the set of data.
  • public schema may involve an accepted public definition of a semantic object, e.g., a "flat bar" may be defined as a rectilinear object having a length, width, and thickness where the length is greater than the width which, in turn, is greater than the thickness.
  • the external schema may include one or more private schema, each including conversion rules specific to an entity or group of entities less than the public as a whole.
  • a private schema may define legal attribute values in relation to a product catalog of a company.
  • The schema discussed above involved some relationship between elements included in a single attribute phrase, e.g., an object such as "bar" and an associated attribute such as "flat." It should be appreciated that schema are not limited to such contexts but more broadly encompass public or private rules for structuring or understanding data. Thus, for example, rules may be based on relationships between different objects such as "paint brush," on the one hand, and "bristles," "handle" or "painter" on the other.
  • the set of data to be converted may include, for example, an attribute phrase (or phrases) including a semantic object, an attribute associated with the object and an attribute value for that attribute.
  • This attribute phrase may be identified by parsing a stream of data.
  • the context of the subject matter area may be determined from the semantic object.
  • the attribute phrase includes information potentially identifying the semantic object, attribute and attribute value.
  • Logic may be executed to interpret this information so as to identify the object, attribute and/or attribute value.
  • the object, attribute or attribute value may be compared to a set of objects, attributes or attribute values defined by the first schema. Such a comparison may enable conversion of the set of data from the first form to the second form or may identify an anomaly regarding the set of data.
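A sketch combining a public schema (the accepted flat-bar definition noted above) with a private schema (hypothetical legal thickness values) to either convert an attribute phrase or flag an anomaly.

```python
# Hypothetical sketch: public schema says a flat bar is rectilinear with
# length > width > thickness; a private schema lists an entity's legal
# thickness values. Data is converted only if both schemas are satisfied.
PRIVATE_LEGAL_THICKNESS_MM = {3.0, 5.0, 10.0}

def convert_flat_bar(length_mm: float, width_mm: float, thickness_mm: float) -> dict:
    # Public-schema check: the accepted definition of the object.
    if not (length_mm > width_mm > thickness_mm):
        return {"anomaly": "dimensions do not satisfy the flat-bar definition"}
    # Private-schema check: entity-specific legal attribute values.
    if thickness_mm not in PRIVATE_LEGAL_THICKNESS_MM:
        return {"anomaly": f"non-catalog thickness: {thickness_mm} mm"}
    # Target form: a normalized descriptor string.
    return {"converted": f"flat bar {length_mm:g} x {width_mm:g} x {thickness_mm:g} mm"}

print(convert_flat_bar(100, 20, 5))  # converted
print(convert_flat_bar(100, 20, 4))  # flagged against the private schema
```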
  • the process of establishing the schema may be implemented in a start-up mode for configuration of a machine-based tool.
  • a start-up mode may be employed to configure the tool so as to convert data based on contextual cues inferred from an understanding of the subject matter area.
  • the schema enables conversion of data which was not specifically addressed during configuration.
  • the machine tool is not limited to converting data elements or strings of elements for which context cues have been embedded but can infer contextual cues with respect to new data. In this manner, start-up efforts and costs can be substantially reduced.
  • Figure 1 is a monitor screen shot illustrating a process for developing replacement rules in accordance with the present invention
  • Figure 2 is a monitor screen shot illustrating a process for developing ordering rules in accordance with the present invention
  • Figure 3 is a schematic diagram of the NorTran Server components of a SOLx system in accordance with the present invention
  • FIG. 4 is a flowchart providing an overview of SOLx system configuration in accordance with the present invention.
  • Figures 5-10 are demonstrative monitor screen shots illustrating normalization and translation processes in accordance with the present invention.
  • FIG. 11 is a flowchart of a normalization configuration process in accordance with the present invention.
  • Figure 12 is a flowchart of a translation configuration process in accordance with the present invention.
  • Figure 13 is an illustration of a graphical desktop implementation for monitoring the configuration process in accordance with the present invention.
  • Figure 14 illustrates various network environment alternatives for implementation of the present invention
  • Figure 15 illustrates a conventional network/web interface
  • Figure 16 illustrates a network interface for the SOLx system in accordance with the present invention
  • Figure 17 illustrates a component level structure of the SOLx system in accordance with the present invention
  • Figure 18 illustrates a component diagram of an N-Gram Analyzer of the SOLx system in accordance with the present invention
  • Figure 19 illustrates a taxonomy related to the area of mechanics in accordance with the present invention
  • Figure 20 is a flowchart illustrating a process for constructing a database in accordance with the present invention.
  • Figure 21 is a flowchart illustrating a process for searching a database in accordance with the present invention.
  • Figure 22 is a schematic diagram of a transformation information sharing system in accordance with the present invention.
  • Figures 23-35 are sample user interface screens illustrating transformation information sharing functionality in accordance with the present invention.
  • Figure 36 is a flowchart illustrating an information import and testing process in accordance with the present invention.
  • Figure 37 is a schematic diagram of a search system in accordance with the present invention operating in the startup mode;
  • Figure 38 is a schematic diagram illustrating the mapping of the potential search terms and source terms to a single parse tree in accordance with the present invention;
  • Figures 39 and 40 illustrate graphical user interfaces for mapping terms to a parse tree in accordance with the present invention
  • Figure 41 is a flow chart illustrating a process for mapping terms to a parse tree in accordance with the present invention
  • Figure 42 is a schematic diagram illustrating a search system, in accordance with the present invention, in a use mode;
  • Figure 43 is a flow chart illustrating a process for operating the system of Fig. 42 in the use mode;
  • Figure 44 is a schematic diagram illustrating use of a knowledge base to search multiple legacy systems in accordance with the present invention.
  • Fig. 45 is a schematic diagram of a semantic conversion system in accordance with the present invention
  • Fig. 46 is a flow chart illustrating a semantic conversion process in accordance with the present invention
  • Fig. 47 is a schematic diagram showing an example of a conversion that may be implemented using the system of Fig. 45;
  • Fig. 48 is a schematic diagram illustrating the use of public and private schema in a conversion process in accordance with the present invention.
  • Figs. 49-50B illustrate exemplary user interfaces in accordance with the present invention.
  • The invention is set forth below in the context of a search system involving standardization of source and search terms and the association of classification information with both source terms and search terms, as well as in other conversion contexts.
  • Specific examples are provided in the environment of business information, e.g., searching a website or electronic catalog for products of interest.
  • The discussion below begins by describing, at a functional and system component level, a search system constructed in accordance with the present invention. This description is contained in Section I. Thereafter, in Sections II and III, the underlying framework for term standardization, classification and transformation is described in greater detail, including certain utilities for sharing rule information and development between multiple users and between applications. Finally, Section IV describes a novel frame-slot architecture.

I. SEARCH SYSTEM
  • the search system of the present invention is operable in two modes: the setup mode and the use mode.
  • In the setup mode, the user, generally a subject matter expert as will be described below, performs a number of functions including accessing lists of potential search terms and/or source terms, developing a standardized set or sets of terms, establishing a classification structure, associating the standardized terms with the classification structure and selectively transforming (e.g., translating) the terms as necessary.
  • Figure 37 is a schematic diagram of a search system 3700, in accordance with the present invention, operating in the startup mode.
  • the system 3700 includes a controller 3702 and storage configured to store a term listing 3704, a parse tree structure 3706 and a set of structured standardized terms 3708.
  • Although the system 3700 is illustrated as being implemented on a single platform 3710, it will be appreciated that the functionality of the system 3700 may be distributed over multiple platforms, for example, interconnected by a local or wide area network.
  • the user 3712 uses the controller 3702 to access a previously developed parse tree structure 3706 or to develop the structure 3706.
  • the parse tree structure 3706 generally defines a number of classifications, each generally including one or more sub-classifications that collectively define the subject matter area. Examples will be provided below.
  • the number of layers of classifications and sub-classifications will generally be determined by the user 3712 and is dependent on the nature of the subject matter. In many cases, many such classifications will be available, for example, corresponding to headings and subheadings of a catalog or other pre-existing subdivisions of a subject matter of interest. In other cases, the subject matter expert may develop the classifications and sub-classifications based on an analysis of the subject matter.
  • a term listing 3704 may include potential search terms, source terms from a source data collection or both.
  • In the case of potential search terms, the terms may be obtained from a pre-existing list or may be developed by the user 3712.
  • the potential search terms may be drawn from a stored collection of search terms entered by users in the context of the subject matter of interest. Additional sources may be available, in a variety of contexts, for example, lists that have been developed in connection with administering a pay-per-click search engine. The list may be updated over time based on monitoring search requests.
  • the source term listing may be previously developed or may be developed by the user 3712. For example, in the context of online shopping applications, the source listing may be drawn from an electronic product catalog or other product data base.
  • Standardization refers to mapping of terms from the term listing 3704 to a second set, generally a smaller set, of standardized terms. In this manner, misspellings, abbreviations, colloquial terms, synonyms, different linguistic/syntax conventions of multiple legacy systems and other idiosyncratic matter can be addressed such that the list of standardized terms is substantially reduced in relation to the original term listing 3704. It will be appreciated from the discussion below that such standardization facilitates execution of the searching functionality as well as transformation functions as may be desired in some contexts, e.g., translation.
  • the resulting list of standardized terms can then be mapped to the parse tree structure 3706. As will be described below, this can be executed via a simple drag and drop operation on a graphical user interface.
• An item from a source listing, for example one identifying a particular Post-it note product, may be associated with an appropriate base level classification, for example, "Adhesive Notepad."
  • a term from a potential search term listing such as "Sticky Pad" may be associated with the same base level classification. It will be appreciated that a given term may be associated with more than one base level classification, a given base level classification may be associated with more than one parent classification, etc.
  • a base level classification may be associated with a parent classification, grandparent classification, etc. All of these relationships are inherited when the term under consideration is associated with a base level classification.
  • the result is that the standardized term is associated with a string of classes and sub-classes of the parse tree structure 3706. For example, these relationships may be reflected in an XML tag system or other metadata representation associated with the term.
  • the resulting structured standardized terms are then stored in a storage structure 3708 such as a database.
  • both source terms and potential search terms may be mapped to elements of the same parse tree structure. This is shown in Figure 38. As shown, multiple terms 3802 from the source collection are mapped to the parse tree structure 3800. Similarly, multiple terms from the potential search term listing 3804 are mapped to corresponding elements of the parse tree structure 3800. In this manner, a particular search term entered by a user can be used to identify responsive information from the source collection based on a common classification or sub-classification despite the absence of any overlap between the entered search term and the corresponding items from the source collection. It will be appreciated that it may be desirable to link a given term 3802 or 3804 with more than one classification or classification lineage of the parse tree 3800. This may have particular benefits in connection with matching a particular product or product category to multiple potential search strategies, e.g., mapping "pen” to searches including "writing instrument” or "office gift.”
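• To make the shared mapping concrete, the following is a minimal Python sketch of a parse tree with both a potential search term and a source term attached to the same base level classification. The class and node names are illustrative assumptions and do not represent the actual system's data structures.

```python
class ParseTreeNode:
    """A classification node, e.g. "Office_Supplies" > "Notepads" > "Adhesive"."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

    def lineage(self):
        """Return the classification labels from this node up to the root."""
        node, labels = self, []
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels

# Build a fragment of a parse tree (labels are illustrative).
root = ParseTreeNode("Office_Supplies")
notepads = ParseTreeNode("Notepads", parent=root)
adhesive = ParseTreeNode("Adhesive", parent=notepads)

# Map both a potential search term and a source term to the same base node.
term_to_node = {
    "sticky pad": adhesive,                                 # potential search term
    "3-pack, 3x3 Post-it notes (Pop-up)-Asst'd": adhesive,  # source term
}

# A search for "sticky pad" can retrieve the source item via the shared node,
# even though the two strings share no words.
query_node = term_to_node["sticky pad"]
matches = [t for t, n in term_to_node.items() if n is query_node and t != "sticky pad"]
print(query_node.lineage())  # ['Adhesive', 'Notepads', 'Office_Supplies']
print(matches)
```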
• Figure 39 shows a user interface representing a portion of a parse tree 3900 for a particular subject matter such as the electronic catalog of an office supply warehouse.
  • the user uses the graphical user interface to establish an association between search terms 3902 and 3904 and the parse tree 3900.
• Search term 3902, in this case "sticky pad," is dragged and dropped on the node 3906 of the parse tree 3900 labeled "Adhesive."
  • This node 3906 or classification is a sub-classification of "Notepads" 3908 which is a sub-classification of "Paper
  • Fig. 40 illustrates how the same parse tree 3900 may be used to associate a classification with items from a source collection.
  • a source collection may be drawn from an electronic catalog or other database of the business.
• The source term 4002 denoted "3-pack, 3x3 Post-it notes (Pop-up)-Asst'd" is associated with the same node 3906 as "Sticky Pad" was in the previous example.
  • term 4004 denoted "2005 Daytimer-Weekly-7x10-Blk” is associated with the same node 3914 as potential search term "Daytimer" was in the previous example.
  • such common associations with respect to the parse tree 3900 facilitate searching.
  • the illustrated process 4100 is initiated by developing (4102) a parse tree that defines the subject matter of interest in terms of a number of classifications and sub-classifications. As noted above, such parsing of the subject matter may be implemented with enough levels to divide the subject matter to the desired granularity.
  • the process 4100 then proceeds on two separate paths relating to establishing classifications for potential search terms and classifications for items from the source collection. It will be appreciated that these two paths may be executed in any order or concurrently.
  • the process involves obtaining or developing (4104) a potential search term listing.
  • an existing list may be obtained, a new list may be developed by a subject matter expert, or some combination of these processes may occur.
  • the terms are then mapped (4106) to the parse tree structure such as by a drag and drop operation on a graphical user interface as illustrated above.
• The process 4100 proceeds by obtaining or developing (4108) a source term listing. Again, the source term listing may be obtained from existing sources, developed by a subject matter expert, or some combination of these processes may occur.
  • the individual terms are then mapped (4110) to the parse tree structure, again, for example, by way of a drag and drop operation as illustrated above.
  • the process 4100 may further include the steps of re-writing the potential search terms and source terms in a standardized form.
  • the search system of the present invention is also operative in a use mode. This is illustrated in Fig. 42.
  • the illustrated system 4200 includes input structure 4202 for receiving a search request from a user 4204.
  • the search request may be entered directly at the machine executing the search system, or may be entered at a remote node interconnected to the platform 4206 via a local or wide area network.
  • the nature of the input structure 4202 may vary accordingly.
  • the search request is processed by a controller 4208 to obtain responsive information that is transmitted to the user 4204 via output structure 4210. Again, the nature of the output structure 4210 may vary depending on the specific network implementation.
  • the controller accesses the knowledge base 4212.
  • the knowledge base 4212 includes stored information sufficient to identify a term from the search request, rewrite the term in a standardized form, transform the term if necessary, and obtain the metadata associated with the term that reflects the classification relationships of the term.
  • the controller uses the standardized term together with the classification information to access responsive information from the source data 4214.
• Fig. 43 is a flow chart illustrating a corresponding process 4300.
  • the process 4300 is initiated by receiving (4302) a search request, for example, from a keyboard, graphical user interface or network port.
  • the system is then operative to identify (4304) a search term from the search request.
  • a search term may be entered via a template including predefined Boolean operators or may be entered freeform. Existing technologies allow for identification of search terms thus entered.
  • the search term is then rewritten (4306) in standard form. This may involve correcting misspellings, mapping multiple synonyms to a selected standard term, implementing a predetermined syntax and grammar, etc., as will be described in more detail below.
  • the resulting standard form term is then set (4308) as the current search parameter.
  • the search then proceeds iteratively through the hierarchy of the parse tree structure. Specifically, this is initiated by searching (4310) the source database using the current search parameter. If any results are obtained (4312) these results may be output (4320) to the user. If no results are obtained, the parent classification at the next level of the parse tree is identified (4314). That parent classification is then set (4316) as the current search parameter and the process is repeated. Optionally, the user may be queried (4318) regarding such a classification search.
• The user may be prompted to answer a question such as "no match found -- would you like to search for other products in the same classification?"
  • the logic executed by the process controller may limit such searches to certain levels of the parse tree structure, e.g., no more than three parse levels (parent, grandparent, great grandparent) in order to avoid returning undesired results.
  • such searching may be limited to a particular number of responsive items.
  • the responsive items as presented to the user may be ordered or otherwise prioritized based on relevancy as determined in relation to proximity to the search term in the parse tree structure.
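• As a rough illustration of the iterative search described above, the sketch below walks from a term's base classification up through parent classifications until responsive items are found or a level limit is reached. All names and data are hypothetical.

```python
def hierarchical_search(term, term_to_class, parent_of, class_to_items, max_levels=3):
    """Search source items for a standardized term, widening to parent
    classifications (parent, grandparent, great-grandparent) until results
    are found or the level limit is reached."""
    current = term_to_class.get(term)
    level = 0
    while current is not None and level <= max_levels:
        results = class_to_items.get(current, [])
        if results:
            return results                 # output these results to the user
        current = parent_of.get(current)   # set the parent classification as the parameter
        level += 1
    return []

# Illustrative data echoing the office-supply example above.
term_to_class = {"sticky pad": "Adhesive"}
parent_of = {"Adhesive": "Notepads", "Notepads": "Office_Supplies"}
class_to_items = {"Notepads": ["3-pack, 3x3 Post-it notes (Pop-up)-Asst'd"]}
print(hierarchical_search("sticky pad", term_to_class, parent_of, class_to_items))
```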
  • Fig. 44 illustrates a system 4400 for using a knowledge base 4404 to access information from multiple legacy databases 4401-4403.
• The legacy databases may include, for example, product databases and accounting databases.
  • Those legacy databases may have been developed or populated by different individuals or otherwise include different conventions relating to linguistics and syntax.
  • a first record 4406 of a first legacy database 4401 reflects a particular convention for identifying a manufacturer ("Acme”) and product ("300W AC Elec. Motor . . .”).
  • Record 4407 associated with another legacy database 4403 reflects a different convention including, among other things, a different identification of the manufacturer (“AcmeCorp”) and a misspelling ("Moter").
  • an internal or external user can use the processor 4405 to enter a substantially freeform search request, in this case "Acme Inc. Power Equipment.”
  • a search request may be entered in the hopes of retrieving all relevant information from all of the legacy databases 4401-4403. This is accommodated, in the illustrated embodiment, by processing the search request using the knowledge base 4404.
  • the knowledge base 4404 executes functionality as discussed above and in more detail below relating to standardizing terms, associating terms with a classification structure and the like.
• The knowledge base 4404 may first process the search query to standardize and/or classify the search terms. For example, "Acme, Inc." may be mapped to a standardized manufacturer term, and "power equipment" may be associated with the standardized term or classification "motor."
  • merge functionality may be implemented to identify and prioritize the responsive information provided as search results to the processor 4405. In this manner, searching or merging of legacy data systems is accommodated with minimal additional code.
  • the present invention also accommodates sharing information established in developing a transformation model such as a semantic metadata model (SMM) used in this regard.
  • the invention is preferably implemented in connection with a computer- based tool for facilitating substantially real-time transformation of electronic communications.
  • the invention is useful in a variety of contexts, including transformation of business as well as non-business content and also including transformation of content across language boundaries as well as within a single language environment.
  • transformation of data in accordance with the present invention is not limited to searching applications as described above, but is useful in a variety of applications including translation assistance.
  • a system is described in connection with the transformation of business content from a source language to a target language using a Structured Object Localization expert (SOLx) system.
  • the invention is further described in connection with classification of terminology for enhanced processing of electronic communications in a business or non- business context.
  • the information sharing functionality and structure of the invention is then described.
  • Such applications serve to fully illustrate various aspects of the invention. It will be appreciated, however, that the invention is not limited to such applications.
• This includes a discussion of configuration objectives as well as the normalization, classification and translation processes. Then, the structure of SOLx is described, including a discussion of network environment alternatives as well as the components involved in configuration and run-time operation. In the second section, the information sharing functionality and structure is described. This includes a discussion of the creation, editing and extension of data domains, as well as domain management and multi-user functionality.
  • the information sharing technology of the present invention is preferably implemented in connection with a machine based tool that is configured or trained by one or more SMEs who develop a knowledge base including an SMM.
  • This machine based tool is first described in this Section I.
• The knowledge sharing functionality and structure is described in Section II that follows.
  • the present invention addresses various shortcomings of conventional data transformation, including manual translation and conventional machine translation, especially in the context of handling business content.
  • the present invention is largely automated and is scalable to meet the needs of a broad variety of applications.
• There are several problems associated with typical business content that interfere with the proper functioning of a conventional machine translation system. These include out-of-vocabulary (OOV) words that are not really OOV and covert phrase boundaries.
• If a word to be translated is not in the machine translation system's dictionary, that word is said to be OOV.
• Often, words that actually are in the dictionary in some form are not translated because they are not in the dictionary in the same form in which they appear in the data under consideration.
• For example, particular data may contain many instances of the string "PRNTD CRCT BRD", and the dictionary may contain the entry "PRINTED CIRCUIT BOARD," but since the machine translation system cannot recognize that "PRNTD CRCT BRD" is a form of "PRINTED CIRCUIT BOARD" (even though this may be apparent to a human), the machine translation system fails to translate the term "PRNTD CRCT BRD".
  • the SOLx tool set of the present invention helps turn these "false OOV” terms into terms that the machine translation system can recognize.
  • Conventional language processing systems also have trouble telling which words in a string of words are more closely connected than other sets of words. For example, humans reading a string of words like Acetic Acid Glass Bottle may have no trouble telling that there's no such thing as "acid glass,” or that the word Glass goes together with the word Bottle and describes the material from which the bottle is made.
  • Language processing systems typically have difficulty finding just such groupings of words within a string of words. For example, a language processing system may analyze the string Acetic Acid Glass Bottle as follows:
  • phrase boundaries are often covert - that is, not visibly marked.
  • SOLx tool of the present invention prepares data for translation by finding and marking phrase boundaries in the data. For example, it marks phrase boundaries in the string Acetic Acid Glass Bottle as follows:
  • This simple processing step - simple for a human, difficult for a language processing system - helps the machine translation system deduce the correct subgroupings of words within the input data, and allows it to produce the proper translation.
  • the present invention is based, in part, on the recognition that some content, including business content, often is not easily searchable or analyzable unless a schema is constructed to represent the content.
• There are several issues that a computational system must address to do this correctly. These include: deducing the "core" item; finding the attributes of the item; and finding the values of those attributes.
  • conventional language processing systems have trouble telling which words in a string of words are more closely connected than other sets of words. They also have difficulty determining which word or words in the string represent the "core,” or most central, concept in the string.
  • a conventional language processing system may analyze the string Acetic Acid Glass Bottle as follows:
  • the SOLx system of the present invention allows a user to provide guidance to its own natural language processing system in deducing which sets of words go together to describe values. It also adds one very important functionality that conventional natural language processing systems cannot perform without human guidance.
  • the SOLx system allows you to guide it to match values with specific attribute types.
  • the combination of (1) finding core items, and (2) finding attributes and their values, allows the SOLx system to build useful schemas. As discussed above, covert phrase boundaries interfere with good translation. Schema deduction contributes to preparation of data for machine translation in a very straightforward way: the labels that are inserted at the boundaries between attributes correspond directly to phrase boundaries. In addition to identifying core items and attributes, it is useful to classify an item.
  • either or both of the core item (acetic acid) and its attributes may be associated with classifications. Conveniently, this may be performed after phrase boundaries have been inserted and core items and attributes have been defined.
  • acetic acid may be identified by a taxonomy where acetic acid belongs to the class aqueous solutions, which belongs to the class industrial chemicals and so on.
  • Glass bottle may be identified by a taxonomy where glass bottle (as well as bucket, drum, etc.) belong to the family aqueous solution containers, which in turn belongs to the family packaging and so on.
  • These relationships may be incorporated into the structure of a schema, e.g., in the form of grandparent, parent, sibling, child, grandchild, etc. tags in the case of a hierarchical taxonomy.
  • Such classifications may assist in translation, e.g., by resolving ambiguities, and allow for additional functionality, e.g., improve searching for related items.
• The next section describes a number of objectives of the SOLx system configuration process. All of these objectives relate to manipulating data from its native form to a form more amenable for translation or other localization, i.e., performing an initial transformation to an intermediate form.
  • the SOLx configuration process has a number of objectives, including solving 0OVs and solving covert phrase boundaries based on identification of core items, attribute/value pairs and classification. Additional objectives, as discussed below, relate to taking advantage of reusable content chunks and resolving ambiguities. Many of these objectives are addressed automatically, or are partially automated, by the various SOLx tools described below. The following discussion will facilitate a more complete understanding of the internal functionality of these tools as described below.
  • False OOV words and true 00V words can be discovered at two stages in the translation process: before translation, and after translation.
  • Potential OOV words can be found before translation through use of a Candidate Search Engine as described in detail below.
  • OOV words can be identified after translation through analysis of the translated output. If a word appears in data under analysis in more than one form, the Candidate Search Engine considers the possibility that only one of those forms exists in the machine translation system's dictionary. Specifically, the Candidate Search Engine offers two ways to find words that appear in more than one form prior to submitting data for translation: the full/abbreviated search option; and the case variant search option. Once words have been identified that appear in more than one form, a SOLx operator can force them to appear in just one form through the use of vocabulary adjustment rules.
  • the full/abbreviated search may output pairs of abbreviations and words. Each pair represents a potential false OOV term where it is likely that the unabbreviated form is in-vocabulary.
  • the full/abbreviated search may output both pairs of words and unpaired abbreviations.
  • abbreviations that are output paired with an unabbreviated word are potentially false OOV words, where the full form is likely in-vocabulary.
  • Abbreviations that are output without a corresponding full form may be true OOV words.
  • the machine translation dictionary may therefore be consulted to see if it includes such abbreviations.
  • some entries in a machine translation dictionary may be case sensitive.
  • the SOLx system may implement a case variant search that outputs pairs, triplets, etc. of forms that are composed of the same letters, but appear with different variations of case.
  • the documentation for a given machine translation system can then be consulted to learn which case variant is most likely to be in-vocabulary.
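• A case variant search of the kind described can be approximated by grouping tokens that are identical apart from case; groups with more than one member are candidates for review against the machine translation dictionary. This is a simplified sketch, not the Candidate Search Engine itself.

```python
from collections import defaultdict

def case_variant_search(tokens):
    """Group tokens that differ only in case, e.g. {"Fr", "FR", "fr"}.
    Groups with more than one member are candidate false-OOV variants."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[tok.lower()].add(tok)
    return [sorted(forms) for forms in groups.values() if len(forms) > 1]

tokens = ["Fr", "FR", "fr", "BOTTLE", "bottle", "acid"]
print(case_variant_search(tokens))
# [['FR', 'Fr', 'fr'], ['BOTTLE', 'bottle']]
```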
• Words that are suspected to be OOV can be compared with the set of words in the machine translation dictionary. There are three steps to this procedure: 1) for each word that you suspect is falsely OOV, prepare a list of other forms that that word could take; 2) check the dictionary to see if it contains the suspected false OOV form; 3) check the dictionary to see if it contains one of the other forms of the word that you have identified.
• If the dictionary does not contain the suspected false OOV word and does contain one of the other forms of the word, then that word is falsely OOV and the SOLx operator can force it to appear in the "in-vocabulary" form in the input data as discussed below.
• This is accomplished through the use of a vocabulary adjustment rule.
  • the vocabulary adjustment rule converts the false OOV form to the in-vocabulary form. The process for writing such rules is discussed in detail below.
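• A vocabulary adjustment rule can be thought of as a mapping from a false OOV form to an in-vocabulary form. The sketch below uses a plain lookup table and word-boundary substitution; the rule table and function are hypothetical and do not reflect the actual rule syntax.

```python
import re

# Hypothetical vocabulary adjustment rules: false-OOV form -> in-vocabulary form.
VOCAB_RULES = {
    "PRNTD": "PRINTED",
    "CRCT": "CIRCUIT",
    "BRD": "BOARD",
}

def apply_vocab_rules(text, rules=VOCAB_RULES):
    """Rewrite false-OOV tokens into forms the MT dictionary can recognize."""
    for false_form, full_form in rules.items():
        text = re.sub(rf"\b{re.escape(false_form)}\b", full_form, text)
    return text

print(apply_vocab_rules("PRNTD CRCT BRD, 6-layer"))
# PRINTED CIRCUIT BOARD, 6-layer
```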
• A mistranslated phrase identified in the quality control analysis (described below in relation to the TQE module) which has a low N-gram analyzer (NGA) probability for the transition between two or more pairs of words suggests a covert phrase boundary.
  • Problems related to covert phrase boundaries can also be addressed through modifying a schematic representation of the data under analysis. In this regard, if a covert phrase boundary problem is identified, it is often a result of attribute rules that failed to identify an attribute. This can be resolved by modifying the schema to include an appropriate attribute rule. If a schema has not yet been produced for the data, a schema can be constructed at this time. Once a categorization or attribute rule has been constructed for a phrase that the translator/translation evaluator has identified as poorly translated, then the original text can be re-translated.
  • Covert phrase boundary problems can be addressed by building a schema, and then running the schematized data through a SOLx process that inserts a phrase boundary at the location of every labeling/tagging rule.
  • the core item of a typical business content description is the item that is being sold/described.
  • An item description often consists of its core item and some terms that describe its various attributes.
  • the item that is being described is a drill.
  • the words or phrases Black and Decker, 3/8", and with accessories all give us additional information about the core item, but do not represent the core item itself.
  • the core item in an item description can generally be found by answering the question, what is the item that is being sold or described here?
  • the item that is being described is a drill.
  • the words or phrases Black and Decker, 3/8", and with accessories all indicate something about the core item, but do not represent the core item itself.
  • a subject matter expert (SME) configuring SOLx for a particular application can leverage his domain-specific knowledge by listing the attributes of core items before beginning work with SOLx, and by listing the values of attributes before beginning work with SOLx. Both classification rules and attribute rules can then be prepared before manipulating data with the SOLx system. Domain-specific knowledge can also be leveraged by recognizing core items and attributes and their values during configuration of the SOLx system and writing rules for them as they appear. As the SME works with the data within the SOLx system, he can write rules for the data as the need appears.
  • the Candidate Search Engine can also be used to perform a collocation search that outputs pairs of words that form collocations.
  • Attribute-value pairs can also be identified based on a semantic category search implemented by the SOLx system.
  • the semantic category search outputs groups of item descriptions that share words belonging to a specific semantic category. Words from a specific semantic category that appear in similar item descriptions may represent a value, an attribute, or (in some sense) both.
  • Adjective phrases also exist mixed with adverbs (Av). Table 2 lists some examples.
• The noun phrase four-strand color-coded twisted-pair telephone wire has the pattern NNNAANNN. It is grouped as (fourN (colorN codedA)A (twistedA pairN)N telephoneN wireN). Another way to look at this item is as an object-attribute list.
• The primary word or object is wire: of use type telephone, strand type twisted-pair, color property color-coded, and strand number type four-strand.
• N1AN2.
  • the SOLx system includes tools as discussed in more detail below for identifying reusable chunks, developing rules for translation and storing translated terms/chunks for facilitating substantially real-time transformation of electronic content.
  • Another objective of the configuration process is enabling SOLx to resolve certain ambiguities.
  • Ambiguity exists when a language processing system does not know which of two or more possible analyses of a text string is the correct one.
  • Lexical ambiguity occurs when a language processing system does not know which of two or more meanings to assign to a word.
  • the abbreviation mil can have many meanings, including million, millimeter, military, and Milwaukee. In a million-item database of tools and construction materials, it may occur with all four meanings.
  • lexical ambiguity leads to the problem of the wrong word being used to translate a word in your input. To translate your material, it is useful to expand the abbreviation to each of its different full forms in the appropriate contexts. The user can enable the SOLx system to do this by writing labeling rules that distinguish the different contexts from each other.
  • mil might appear with the meaning million in the context of a weight, with the meaning millimeter in the context of a length, with the meaning military in the context of a specification type (as in the phrase MIL SPEC), and with the meaning Milwaukee in the context of brand of a tool.
  • resolving lexical ambiguity involves a number of issues, including identification of the core item in an item description; identification of values for attributes; and assignment of values to proper attributes.
  • Lexical ambiguity may also be resolved by reference to an associated classification.
  • the classification may be specific to the ambiguous term or a related term, e.g., another term in the same noun phrase.
  • the ambiguous abbreviation "mil” may be resolved by 1) noting that it forms an attribute of an object-attribute list, 2) identifying the associated object (e.g., drill), 3) identifying a classification of the object (e.g., power tool), and 4) applying a rule set for that classification to select a meaning for the term (e.g., mil - Milwaukee). These relationships may be defined by the schema.
  • Structural ambiguity occurs when a language processing system does not know which of two or more labeling rules to use to group together sets of words within an item description. This most commonly affects attribute rules and may require further nesting of parent/child tag relationships for proper resolution. Again, a related classification may assist in resolving structural ambiguity.
  • the various configuration objectives can be addressed in accordance with the present invention by transforming input data from its native form into an intermediate form that is more amenable to translation or other localization/transformation.
• The corresponding process, which is a primary purpose of SOLx system configuration, is termed "normalization."
• Once normalized, the data will include standardized terminology in place of idiosyncratic terms, will reflect various grammar and other rules that assist in further processing, and will include tags that provide context including classification information for resolving ambiguities and otherwise promoting proper transformation.
  • the associated processes are executed using the Normalization Workbench of the SOLx system, as will be described below.
• There are two kinds of rules developed using the Normalization Workbench: grammatical rules and normalization rules.
  • the purpose of a grammatical rule is to group together and label a section of text.
  • the purpose of a normalization rule is to cause a labeled section of text to undergo some change.
  • the Normalization Workbench offers a number of different kinds of normalization rules relating to terminology including: replacement rules, joining rules, and ordering rules.
  • Replacement rules allow the replacement of one kind of text with another kind of text. Different kinds of replacement rules allow the user to control the level of specificity of these replacements.
  • Joining rules allow the user to specify how separated elements should be joined together in the final output.
  • Ordering rules allow the user to specify how different parts of a description should be ordered relative to each other.
• With regard to replacement rules, data might contain instances of the word centimeter written in a variety of ways (e.g., as cm, as cm., or as centimeter), and the user might want to ensure that it always appears as centimeter.
  • the Normalization Workbench implements two different kinds of replacement rules: unguided replacement, and guided replacement.
  • the rule type that is most easily applicable to a particular environment can be selected.
  • Unguided replacement rules allow the user to name a tag/category type, and specify a text string to be used to replace any text that is under that tag.
  • Guided replacement rules allow the user to name a tag/category type, and specify specific text strings to be used to replace specific text strings that are under that tag.
• The format of unguided replacement rules may be, for example:
• In an example applying such a rule, the second line is unchanged; in the first line, foot has been changed to feet.
• For guided replacement rules, this is done by listing a set of possible content strings in which the normalization engine should "look up" the appropriate replacement.
• The format of these rules is:
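• Since the literal rule-file formats are not reproduced here, the following sketch illustrates the two behaviors in ordinary Python rather than in the Normalization Workbench's own syntax: unguided replacement rewrites any text under a named tag to a single fixed string, while guided replacement looks the tagged text up in a table of specific replacements.

```python
def unguided_replace(tag, text, target_tag, fixed_string):
    """Unguided: any text under target_tag is replaced with one fixed string."""
    return fixed_string if tag == target_tag else text

def guided_replace(tag, text, target_tag, lookup):
    """Guided: text under target_tag is replaced only if a specific
    replacement is listed for it in the lookup table."""
    if tag == target_tag and text in lookup:
        return lookup[text]
    return text

# Illustrative (tag, text) pairs as produced by the grammatical rules.
print(unguided_replace("[length_unit]", "ft", "[length_unit]", "feet"))         # feet
length_lookup = {"cm": "centimeter", "cm.": "centimeter"}
print(guided_replace("[length_unit]", "cm.", "[length_unit]", length_lookup))   # centimeter
```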
  • Fig. 1 shows a user interface screen 100 including a left pane 102 and a right pane 104.
  • the left pane 102 displays the grammar rules that are currently in use.
  • the rules are shown graphically, including alternative expressions (in this case) as well as rule relationships and categories. Many alternative expressions or candidates therefor are automatically recognized by the workbench and presented to the user.
  • the right pane 104 reflects the process to update or add a text replacement rule.
  • a grammar rule is selected in the left pane 102. All text that can be recognized by the rule appears in the left column of the table 106 in the right pane 104.
  • the SME then has the option to unconditionally replace all text with the string from the right column of the table 106 or may conditionally enter a replacement string.
  • similar interfaces allow for easy development and implementation of the various rules discussed herein. It will be appreciated that "liter” and "ounce” together with their variants thus are members of the class "volume” and the left pane 102 graphically depicts a portion of a taxonomy associated with a schema.
  • Joining rules allow the user to specify how separated elements should be joined together in the final output. Joining rules can be used to re-join elements that were separated during the process of assigning category labels. The user can also use joining rules to combine separate elements to form single delimited fields.
• The catheter tip configuration JL4 will appear as [catheter_tip_configuration] (J L 4) after its category label is assigned.
  • the customary way to write this configuration is with all three of its elements adjacent to each other. Joining rules allow the user to join them together again.
• The user may wish the members of a particular category to form a single, delimited field. For instance, the user might want the contents of the category label [litter_box] ( plastic hi-impact scratch-resistant ) to appear as plastic, hi-impact, scratch-resistant in order to conserve space in the data description field.
  • Joining rules allow the user to join these elements together and to specify that a comma be used as the delimiting symbol.
  • the delimiter can be absent, in which case the elements are joined immediately adjacent to each other. For example, numbers emerge from the category labeler with spaces between them, so that the number twelve looks like this:
  • a standard normalization rule file supplied with the Normalization Workbench contains the following joining rule:
  • Ordering rules allow the user to specify how different parts of a description should be ordered relative to each other. For instance, input data might contain catheter descriptions that always contain a catheter size and a catheter type, but in varying orders — sometimes with the catheter size before the catheter type, and sometimes with the catheter type before the catheter size:
  • Ordering rules generally have three parts. Beginning with a simple example:
  • Each of those elements is assigned a number, which is written in the format $number in the third part of the rule.
  • the third part of the rule shown in bold below, specifies the order in which those elements should appear in the output:
  • Ordering rules can appear with any number of elements.
  • this rule refers to a category label that contains four elements. The rule switches the position of the first and third elements of its input, while keeping its second and fourth elements in their original positions:
  • Fig. 2 shows an example of a user interface screen 200 that may be used to develop and implement an ordering rule.
  • the screen 200 includes a left pane 202 and a right pane 204.
  • the left pane 202 displays the grammar rules that are currently in use - in this case, ordering rules for container size - as well as various structural productions under each rule.
  • the right pane 204 reflects the process to update or add structural reorganization to the rule.
  • a structural rule is selected using the left pane 202.
  • the right pane 204 can then be used to develop or modify the rule.
  • the elements or "nodes" can be reordered by simple drag-and-drop process. Nodes may also be added or deleted using simple mouse or keypad commands.
  • Ordering rules are very powerful, and have other uses besides order- changing per se. Other uses for ordering rules include the deletion of unwanted material, and the addition of desired material.
  • the undesired material can be omitted from the third part of the rule.
  • the following rule causes the deletion of the second element from the product description:
  • the desired material can be added to the third part of the rule in the desired position relative to the other elements.
  • the following rule causes the string [real_cnx]"-' to be added to the product description:
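• As a rough Python analogue of how an ordering rule can reorder, delete, or add elements, the sketch below treats the rule's third part as a list of element numbers ($1, $2, ...) plus optional added strings. The example data and the added string are hypothetical.

```python
def apply_ordering_rule(elements, output_order, additions=None):
    """Reorder the numbered elements of a labeled description.

    elements     -- the parsed parts, where $1 is elements[0], $2 is elements[1], ...
    output_order -- element numbers in the order they should appear; omitting a
                    number deletes that element from the output.
    additions    -- optional literal strings appended to the output.
    """
    out = [elements[i - 1] for i in output_order]
    if additions:
        out.extend(additions)
    return " ".join(out)

parts = ["6", "French", "JL4", "catheter"]                 # $1 $2 $3 $4
print(apply_ordering_rule(parts, [3, 4, 1, 2]))            # catheter type before catheter size
print(apply_ordering_rule(parts, [1, 2, 4]))               # delete the third element
print(apply_ordering_rule(parts, [1, 2, 3, 4], additions=["(coronary)"]))  # add material
```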
  • the SOLx system also involves normalization rules relating to context cues, including classification and phrasing.
  • the rules that the SOLx system uses to identify contexts and determine the location and boundaries of attribute/value pairs fall into three categories: categorization rules, attribute rules, and analysis rules.
• Categorization rules and attribute rules together form a class of rules known as labeling/tagging rules. Labeling/tagging rules cause the insertion of labels/tags in the output text when the user requests parsed or labeled/tagged texts. They form the structure of the schema in a schematization task, and they become phrase boundaries in a machine translation task.
  • Analysis rules do not cause the insertion of labels/tags in the output. They are inserted temporarily by the SOLx system during the processing of input, and are deleted from the output before it is displayed.
• Although analysis tags are not displayed in the output (SOLx can allow the user to view them if the data is processed in a defined interactive mode), they are very important to the process of determining contexts for vocabulary adjustment rules and for determining where labels/tags should be inserted. The analysis process is discussed in more detail below.
  • [French] is the name assigned to the category of "things that can be forms of the word that expresses the unit of size of catheters" and could just as well have been called [catheter_size_unit], or [Fr], or [french]. The important thing is to give the category a label that is meaningful to the user.
• (Fr), (Fr.), and (French) are the forms that a thing that belongs to the category [French] can take. Although the exact name for the category [French] is not important, it matters much more how these "rule contents" are written. For example, the forms may be case sensitive. That is, (Fr) and (fr) are different forms. If your rule contains the form (Fr), but not the form (fr), then if there is a description like this:
  • all of the indications of catheter size include an integer followed by the unit of catheter size.
  • [catheter_size] is the name assigned to the category of "groups of words that can indicate the size of a catheter;" and could just as well have been called [size], or [catheterSize], or [sizeOfACatheter]. The important thing is to give the category a label that is meaningful to the user.
  • ([real] [French]) is the part of the rule that describes the things that make up a [catheter_size] — that is, something that belongs to the category of things that can be [French], and something that belongs to the categories of things that can be [real] — and what order they have to appear in — in this case, the [real] first, followed by the [French]. In this part of the rule, exactly how things are written is important.
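• A simplified, regex-based sketch of these two rules is shown below: the forms of [French] are listed explicitly (note the case sensitivity), and [catheter_size] is recognized as a [real] followed by a [French]. The label names follow the example above, but the implementation is only an illustration, not the Workbench's rule engine.

```python
import re

# Hypothetical grammar: the forms of [French], and [catheter_size] = [real] [French].
FRENCH = r"(?:Fr\.|Fr|French)"            # case sensitive: (fr) would not match
REAL = r"\d+(?:\.\d+)?"
CATHETER_SIZE = re.compile(rf"(?P<real>{REAL})\s*(?P<french>{FRENCH})\b")

def tag_catheter_sizes(description):
    """Wrap anything matching [real] [French] in a [catheter_size] label."""
    return CATHETER_SIZE.sub(lambda m: f"[catheter_size]({m.group(0)})", description)

print(tag_catheter_sizes("JL4 6 French angioplasty catheter"))
# JL4 [catheter_size](6 French) angioplasty catheter
```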
  • this example has involved a set of rules that allows description of the size of every catheter in a list of descriptions.
  • the SME working with this data might then want to write a set of rules for describing the various catheter types in the list.
  • this example has started with the smallest units of text that could be identified (the different forms of [French]) and worked up from there (to the [catheter_size] category).
  • the SME may have an idea of a higher-level description (i.e., catheter type), but no lower-level descriptions to build it up out of; in this case, the SME may start at the top, and think his way down through a set of rules.
  • each of these descriptions includes some indication of the type of the catheter, shown in bold text below:
  • a catheter type can be described in one of two ways: by the tip configuration of the catheter, and by the purpose of the catheter. So, the SME may write a rule that captures the fact that catheter types can be identified by tip configuration or by catheter purpose.
  • catheter tip configurations can be described in two ways: 1) by a combination of the inventor's name, an indication of which blood vessel the catheter is meant to engage, and by an indication of the length of the curve at the catheter tip; or 2) by the inventor's name alone.
  • the SME can write a rule that indicates these two possibilities in this way: [catheter_tip_configuration]
  • [catheter_tip_configuration] is the category label; ([inventor] [coronary_artery] [curve_size]) and ([inventor]) are the two forms that things that belong to this category can take.
  • the SME will need to write rules for [inventor], [coronary_artery], and [curve_size]. The SME knows that in all of these cases, the possible forms that something that belongs to one of these categories can take are very limited, and can be listed, similarly to the various forms of [French]:
  • the SME has a complete description of the [catheter_tip_configuration] category.
  • the SME is writing a [catheter_tip_configuration] rule because there are two ways that a catheter type can be identified: by the configuration of the catheter's tip, and by the catheter's purpose.
  • the SME has the [catheter_tip_configuration] rule written now and just needs a rule that captures descriptions of a catheter's purpose.
  • the SME is aware that (at least in this limited data set) a catheter's purpose can be directly indicated, e.g. by the word angioplasty, or can be inferred from something else — in this case, the catheter's shape, as in pigtail. So, the SME writes a rule that captures the fact that catheter purpose can be identified by purpose indicators or by catheter shape.
  • the SME needs a rule for describing catheter purpose, and a rule for describing catheter shape. Both of these can be simple in this example:
• Wankers are rules for category labels that should appear in the output of the token normalization process. In one implementation, wankers are written similarly to other rules, except that their category label starts with the symbol >. For example, in the preceding discussion, we wrote the following wanker rules:
  • Chunks of text that have been described by a wanker rule will be tagged in the output of the token normalization process. For example, with the rule set that we have defined so far, including the two wankers, we would see output like the following:
• The other rules are used in this example to define the wanker rules and to recognize their various forms in the input text. Since the other rules are not wankers, their category labels do not appear in the output. If at some point it is desired to make one or more of those other rules' category labels appear in the output, the SME or other operator can cause them to do so by converting those rules to wankers.
  • the foregoing example included two kinds of things in rules.
  • the example included rules that contained other category labels. These "other" category labels are identifiable in the example by the fact that they are always enclosed in square brackets, e.g.,
  • the example also included rules that contained strings of text that had to be written exactly the way that they would appear in the input. These strings are identifiable by the fact that they are directly enclosed by parentheses, e.g.
  • Regular expressions allow the user to specify approximately what a description will look like. Regular expressions can be recognized by the facts that, unlike the other kinds of rule contents, they are not enclosed by parentheses, and they are immediately enclosed by "forward slashes.”
  • the SOLx system of the present invention consists of many components, as will be described below.
  • One of these components is the Natural Language Engine module, or NLE.
  • the NLE module evaluates each item description in data under analysis by means of rules that describe the ways in which core items and their attributes can appear in the data.
• The exact (machine-readable) format that these rules take can vary depending upon the application involved and computing environment. For present purposes, it is sufficient to realize that these rules express relationships like the following (stated in relation to the drill example discussed above):
  • a drill's size may be three eighths of an inch or one half inch
  • the NLE checks each line of the data individually to see if any of the rules seem to apply to that line. If a rule seems to apply, then the NLE inserts a label/tag and marks which string of words that rule seemed to apply to. For example, for the set of rules listed above, then in the item description Black and Decker 3/8" drill with accessories, the NLE module would notice that 3/8" might be a drill size, and would mark it as such. If the user is running the NLE in interactive mode, he may observe something like this in the output:
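• The interactive output itself is not reproduced above; as a minimal sketch of the labeling step on the drill example, assuming a hypothetical [drill_size] attribute rule:

```python
import re

# Hypothetical attribute rule: a drill's size may be 3/8" or 1/2".
DRILL_SIZE = re.compile(r'\b(3/8"|1/2")')

def tag_line(line):
    """Insert a label/tag around any string the drill-size rule applies to."""
    return DRILL_SIZE.sub(lambda m: f"[drill_size]({m.group(1)})", line)

print(tag_line('Black and Decker 3/8" drill with accessories'))
# Black and Decker [drill_size](3/8") drill with accessories
```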
• The performance of the rules can be analyzed in two stages. First, determine whether or not the rules operate adequately. Second, if rules are identified that do not operate adequately, determine why they do not operate adequately.
  • the performance of the rules can be determined by evaluating the adequacy of the translations in the output text.
  • the performance of the rules can be determined by evaluating the adequacy of the schema that is suggested by running the rule set. For any rule type, if a rule has been identified that does not perform adequately, it can be determined why it does not operate adequately by operating the NLE component in interactive mode with output to the screen.
• The test data set can be analyzed to determine whether every item that should be labeled/tagged has been labeled/tagged and whether any item that should not have been labeled/tagged has been labeled/tagged in error.
• The test data set must include both items that should be labeled/tagged and items that should not be tagged.
• Vocabulary adjustment rules operate on data that has been processed by labeling/tagging rules, so troubleshooting the performance of vocabulary adjustment rules requires attention to the operation of the labeling/tagging rules, as well as to the operation of the vocabulary adjustment rules themselves.
  • the data set selected to evaluate the performance of the rules should include: examples of different types of core items, and for each type of core item, examples with different sets of attributes and/or attribute values.
  • Normalization facilitates a variety of further processing options.
  • One important type of processing is translation as noted above and further described below.
  • other types of processing in addition to or instead of translation are enhanced by normalization including database and network searching, document location and retrieval, interest/personality matching, information aggregation for research/analysis, etc.
  • a database and network searching application will now be described. It will be appreciated that this is closely related to the context assisted searching described above. In many cases, it is desirable to allow for searching across semantic boundaries. For example, a potential individual or business consumer may desire to access company product descriptions or listings that may be characterized by abbreviations and other terms, as well as syntax, that are unique to the company or otherwise insufficiently standardized to enable easy access. Additionally, submitting queries for searching information via a network (e.g., LAN, WAN, proprietary or open) is subject to considerable lexicographic uncertainty, even within a single language environment, which uncertainty expands geometrically in the context of multiple languages.
• It is common for a searcher to submit queries that attempt to encompass a range of synonyms or conceptually related terms when attempting to obtain complete search results. However, this requires significant knowledge and skill and is often impractical, especially in a multi-language environment. Moreover, in some cases, a searcher, such as a consumer without specialized knowledge regarding a search area, may be insufficiently knowledgeable regarding a taxonomy or classification structure of the subject matter of interest to execute certain search strategies for identifying information of interest through a process of progressively narrowing the scope of responsive information based on conceptual/class relationships.
  • the left panel 102 of Fig. 1 graphically depicts a portion of a taxonomy where, for example, the units of measure "liter” and "ounce", as well as variants thereof, are subclasses of the class "volume.”
• A searcher entering a query including the term "ounce" (or "oz") may access responsive information from a database or the like including the term "oz" (or "ounce").
• Moreover, metric equivalent items, e.g., items including the term "ml," may be retrieved in response to the query based on tags commonly linking the search term and the responsive item to the class "volume."
  • Fig. 19 illustrates a taxonomy 1900 related to the area of mechanics that may be used in connection with research related to small aircraft runway accidents attributed to following in the wake of larger aircraft.
  • Terms 1902 represent alternative terms that may be normalized by an SME using the present invention, such as an administrator of a government crash investigation database, to the normalized terms 1904, namely, "vorticity" and "wake.”
  • These terms 1904 may be associated with a parent classification 1906 ("wingtip vortices") which in turn is associated with a grandparent classification 1908 ("aerodynamic causes”) and so on.
• Normalization thus allows for mapping of a range of colloquial or scientific search terms into a predefined taxonomy, or for tagging of documents including such terms relative to the taxonomy.
• The taxonomy can then be used to resolve lexicographic ambiguities and to retrieve relevant documents.
  • Fig. 20 is a flowchart illustrating a process 2000 for constructing a database for enhanced searching using normalization and classification.
  • the illustrated process 2000 is initiated by establishing (2002) a taxonomy for the relevant subject matter. This may be performed by an SME and will generally involve dividing the subject matter into conceptual categories and subcategories that collectively define the subject matter. In many cases, such categories may be defined by reference materials or industry standards.
  • the SME may also establish (2004) normalization rules, as discussed above, for normalizing a variety of terms or phrases into a smaller number of normalized terms. For example, this may involve surveying a collection or database of documents to identify sets of corresponding terms, abbreviations and other variants. It will be appreciated that the taxonomy and normalization rules may be supplemented and revised over time based on experience to enhance operation of the system.
  • a document to be stored is received (2004) and parsed (2006) into appropriate chunks, e.g., words or phrases. Normalization rules are then applied (2008) to map the chunks into normalized expressions. Depending on the application, the document may be revised to reflect the normalized expressions, or the normalized expressions may merely be used for processing purposes. In any case, the normalized expressions are then used to define (2010) a taxonomic lineage (e.g., wingtip vortices, aerodynamic causes, etc.) for the subject term and to apply (2012) corresponding tags. The tagged document (2014) is then stored and the tags can be used to retrieve, print, display, transmit, etc., the document or a portion thereof. For example, the database may be searched based on classification or a term of a query may be normalized and the normalized term may be associated with a classification to identify responsive documents.
  • the SOLx paradigm is to use translators to translate repeatable complex terms and phrases, and translation rules to link these phrases together. It uses the best of both manual and machine translation.
  • the SOLx system uses computer technology for repetitive or straightforward applications, and uses people for the complex or special-case situations.
  • the NorTran (Normalization/Translation) server is designed to support this paradigm.
  • Figure 3 represents a high-level architecture of the NorTran platform 300. Each module is discussed below as it relates to the normalization/classification process. A more detailed description is provided below in connection with the overall SOLx schematic diagram description for configuration and run-time operation.
  • the GUI 302 is the interface between the subject matter expert (SME) or human translator (HT) and the core modules of the NorTran server.
• The N-Gram filter 304 for the N-gram analysis defines the parameters used in the N-gram program.
• The N-gram program is the key statistical tool for identifying the key recurring terms and phrases of the original content.
  • the N-Gram and other statistical tools module 306 is a set of parsing and statistical tools that analyze the original content for significant terms and phrases. The tools parse for the importance of two or more words or tokens as defined by the filter settings.
  • the output is a sorted list of terms with the estimated probabilities of the importance of the term in the totality of the content. The goal is to aggregate the largest re-usable chunks and have them directly classified and translated.
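• An N-gram analysis of this kind can be approximated by counting recurring token sequences across item descriptions; frequent n-grams are candidate reusable chunks. The sketch below is illustrative only, and the sample descriptions are hypothetical.

```python
from collections import Counter

def ngram_counts(descriptions, n=2):
    """Count n-grams across item descriptions; frequent n-grams are candidate
    reusable chunks for normalization and pre-translation."""
    counts = Counter()
    for desc in descriptions:
        tokens = desc.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

descriptions = [
    "Emile Henry lasagna dish blue",
    "Emile Henry pie dish red",
    "Emile Henry loaf dish blue",
]
for gram, freq in ngram_counts(descriptions).most_common(3):
    print(" ".join(gram), freq)   # e.g. "Emile Henry 3" appears first
```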
• The chunking, classification, assembly and grammar rules set 308 relates the pieces from one language to another. For example, as discussed earlier, two noun phrases N1N2 are mapped in Spanish as N2 'de' N1. Rules may need to be added or existing ones modified by the translator. The rules are used by the translation engine with the dictionaries and the original content (or the normalized content) to reassemble the content in its translated form.
  • the rules/grammar base language pairs and translation engine 310 constitute a somewhat specialized machine translation (MT) system.
  • the translation engine portion of this system may utilize any of various commercially available translation tools with appropriate configuration of its dictionaries. Given that the translation process is not an exact science and that round trip processes (translations from A to B to A) rarely work, a statistical evaluation is likely the best automatic tool to assess the acceptability of the translations.
  • the Translation Accuracy Analyzer 312 assesses words not translated, heuristics for similar content, baseline analysis from human translation and other criteria.
• The chunking and translation editor 314 functions much like a translator's workbench. This tool has access to the original content; it helps the SME create normalized content if required; the normalized content and dictionaries help the translator create the translated terms and phrase dictionary; and, when that repository is created, it helps the translator fill in any missing terms in the translation of the original content.
  • a representation of the chunking functionality of this editor is shown in the example in Table 3.
• The first column lists the original content from a parts list of cooking dishes.
• The terms (A), etc., are dimensional measurements that are not relevant to the discussion.
  • the second column lists the chunked terms from an N-gram analysis; the third column lists the frequency of each term in the original content set.
  • the fourth column is the number associated with the chunk terms in column 2.
• The fifth column is the representation of the first column in terms of the sequence of chunked content.
  • a classification lineage is also associated with each chunk to assist in translation, e.g., by resolving ambiguities.
  • Table 5 shows the Original Content and the Translated Content that is created by assembling the Translated Normalized Terms in Table 4 according to the Chunked Original Content sequence in Table 3.
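  • A rough sketch of this chunked representation follows; the item descriptions and chunk inventory are invented stand-ins for the cooking-dish content of Tables 3-5:

```python
# Number each recurring chunk and re-express every item description as a
# sequence of chunk indices (cf. the columns of Table 3).
items = [
    "Emile Henry oval baking dish",
    "Emile Henry square baking dish",
]
chunks = ["Emile Henry", "baking dish", "oval", "square"]

frequency = {c: sum(item.count(c) for item in items) for c in chunks}
chunk_id = {c: i + 1 for i, c in enumerate(chunks)}

def as_chunk_sequence(item):
    found = []
    for c in sorted(chunks, key=len, reverse=True):  # longest chunks first
        if c in item:
            found.append((item.index(c), chunk_id[c]))
    return [cid for _, cid in sorted(found)]

print("chunk frequencies:", frequency)
for item in items:
    print(item, "->", as_chunk_sequence(item))
```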
  • The Normalized Special Terms and Phrases repository 316 contains chunked content in a form that supports manual translation. It is free of unusual acronyms and misspellings and strives for consistency; in Table 3, for example, Emile Henry was also listed as E. Henry. Reuse of terms is maximized.
  • the Special Terms and Phrases Translation Dictionary repository 318 is the translated normalized terms and phrases content. It is the specialty dictionary for the client content.
  • Other translation dictionaries 320 may be any of various commercially available dictionary tools and/or SOLx developed databases. They may be general terms dictionaries, industry specific, SOLx acquired content, or any other knowledge that helps automate the process.
  • One of the tenets of the SOLx process is that the original content need not be altered.
  • SOLx uses a set of meta or non-persistent stores so that the translations are based on the normalized meta content 322.
  • Tags reflecting classification information may also be kept here.
  • the above discussion suggests a number of processes that may be implemented for the automatic translation of large databases of structured content.
  • One implementation of these processes is illustrated in the flowchart of Fig. 4 and is summarized below. It will be appreciated that these processes and the ordering thereof can be modified.
  • The firm's IT organization extracts (400) the content from their IT systems, ideally with a part number or other unique key.
  • one of the key SOLx features is that the client need not restructure or alter the original content in their IT databases.
  • restructuring benefits localization efforts by reducing the translation set up time and improving the translation accuracy.
  • One of these modifications is to adopt a 'normalized' or fixed syntactic, semantic, and grammatical description of each content entry.
  • Translators can then translate (406) the internationalized important terms and phrases.
  • This translated content forms a dictionary of specialty terms and phrases. In essence, this translated content corresponds to the important and re-usable chunks.
  • the translator may need to specify the gender alternatives, plural forms, and other language specific information for the special terms and phrases dictionary. Referring again to an example discussed above, translators would probably supply the translation for (four-strand), (color-coded), (twisted-pair), telephone, and wire. This assumes that each term was used repeatedly. Any other entry that uses (color-coded) or wire would use the pre-translated term.
  • language specific rules are used to define (410) the assembly of translated content pieces.
  • the types of rules described above define the way the pre-translated chunks are reassembled. If, in any one description, the grammatical structure is believed to be more complicated than the pre-defined rule set, then the phrase is translated in its entirety.
  • the original content (on a per item basis) is then mapped (412) against the dictionaries.
  • the line item content is parsed and the dictionaries are searched for the appropriate chunked and more general terms (content chunks to translated chunks).
  • All terms in the dictionaries map to a single line item in the content database, i.e., a single product description.
  • This is the first function of the translation engine.
  • the classification information may be used to assist in this mapping and to resolve ambiguities.
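  • A minimal sketch of this mapping step is shown below; the greedy longest-match strategy, function names and sample translations are assumptions made for illustration rather than the patent's actual algorithm:

```python
# Greedy longest-match lookup of pre-translated chunks for one line item.
specialty_dictionary = {
    "four-strand": "de cuatro hilos",
    "color-coded": "codificado por colores",
    "twisted-pair": "de par trenzado",
    "telephone wire": "alambre telefonico",
}

def map_line_item(description, dictionary):
    tokens = description.lower().split()
    mapped, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for j in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in dictionary:
                mapped.append((phrase, dictionary[phrase]))
                i = j
                break
        else:
            mapped.append((tokens[i], None))  # chunk with no dictionary entry
            i += 1
    return mapped

print(map_line_item("Four-strand color-coded twisted-pair telephone wire",
                    specialty_dictionary))
```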
  • a software translation engine then assembles (414) the translated pieces against the language rules. Input into the translation engine includes the original content, the translation or assembly rules, and the translated pieces.
  • a translation tool will enable a translator to monitor the process and directly intercede if required. This could include adding a new chunk to the specialty terms database, or overriding the standard terms dictionaries.
  • a statistically based software tool assesses (416) the potential accuracy of the translated item.
  • One of the difficulties of translation is that when something is translated from one language to another and then retranslated back to the first, the original content is rarely reproduced. Ideally, one hopes it is close, but rarely will it be exact. The reason for this is there is not a direct inverse in language translation.
  • Each language pair has a circle of 'confusion' or acceptability. In other words, there is a propagation of error in the translation process. Short of looking at every translated phrase, the best that can be hoped for in an overall sense is a statistical evaluation.
  • Translators may re-edit (418) the translated content as required. Since the content is stored in a database that is indexed to the original content on an entry-by-entry basis, any entry may be edited and re-stored if this process leads to an unsatisfactory translation. Although not explicitly described, there are terms such as proper nouns, trade names, special terms, etc., that are never translated. These invariant terms would be identified in the above process. Similarly, converted entries such as metrics would be handled through a metrics conversion process. The process thus discussed uses both human and machine translation in a different way than traditionally employed. This process, with the correct software systems in place, should generate much of the accuracy associated with manual translation. Further, this process should function without manual intervention once sufficient content has been pre-translated.
  • the first step is to import the source structured content file.
  • This will be a flat file with the proper character encoding, e.g., UTF-8. There will generally be one item description per line. Some basic formatting of the input may be done at this point.
  • Fig. 6 shows the normalized form of the content on the right and the original content (as imported above) on the left. What is not shown here are the grammars and rules used to perform the normalization. The form of the grammars and rules and how to create them are described above.
  • the normalization rules can enforce this standard form, and the normalized content would reflect this structure.
  • Another very valuable result of the normalization step can be to create a schematic representation of the content.
  • In the phrase analysis step, as illustrated, the user looks for the phrases in the now-normalized content that still need to be translated into the target language.
  • the purpose of Phrase Analysis, and in fact, the next several steps, is to create a translation dictionary that will be used by machine translation.
  • The value in creating the translation dictionary is that only the phrases need translation, not the complete body of text, providing a large savings in the time and cost of translation.
  • The Phrase Analyzer only shows here the phrases for which it does not already have a translation. Some of these phrases we do not want to translate, which leads to the next step.
  • an SME reviews this phrase data and determines which phrases should be translated.
  • a professional translator and/or machine tool translates the phrases (Figs. 8 - 9) from the source language, here English, to the target language, here Spanish, using any associated classification information.
  • a SOLx user interface could be used to translate the phrases, or the phrases are sent out to a professional translator as a text file for translation.
  • the translated text is returned as a text file and loaded into SOLx.
  • the translated phrases become the translation dictionary that is then used by the machine translation system.
  • the machine translation system uses the translation dictionary created above as the source for domain specific vocabulary.
  • the SOLx system greatly increases the quality of the output from the machine translation system.
  • the SOLx system can also then provide an estimation of the quality of the translation result (Fig. 10). Good translations would then be loaded into the run-time localization system for use in the source system architecture. Bad translations would be used to improve the normalization grammars and rules, or the translation dictionary.
  • The grammars, rules, and translation dictionary form a model of the content. Once the model of the content is complete, a very high proportion of the translations are of good quality.
  • Fig. 11 summarizes the steps of an exemplary normalization configuration process.
  • Fig. 12 summarizes an exemplary translation configuration process.
  • A new SOLx normalization process (1100) is initiated by importing (1102) the content of a source database, or a portion thereof, to be normalized and selecting a quantity of text from that content. For example, a sample of 100 item descriptions may be selected from the source content file, denoted content.txt. A text editor may be used to select the 100 lines. These 100 lines are then saved to a file named samplecontent.txt for purposes of this discussion.
  • the core items in the samplecontent.txt file are then found (1104) using the Candidate Search Engine, for example, by running a words-in-common search.
  • attribute/value information is found (1106) in the samplecontent.txt file using the Candidate Search Engine by running collocation and semantic category searches as described above.
  • the SOLx system can be used to write (1108) attribute rules. The formalism for writing such rules has been discussed above. It is noted that the SOLx system performs much of this work for the user and simple user interfaces can be provided to enable "writing" of these rules without specialized linguistic or detailed code-writing skills.
  • the SOLx system can also be used at this point to write (1110) categorization or classification rules.
  • the translation process 1200 is initiated by acquiring (1202) the total set of item descriptions that you want to translate as a flat file, with a single item description per line. For purposes of the present discussion, it is assumed that the item descriptions are in a file with the name of content.txt. A text editor may be used to setup an associated project configuration file.
  • A sample of 100 item descriptions is selected (1204) from the content.txt file.
  • A text editor may be used to select the 100 lines. These 100 lines are then saved to a file named samplecontent.txt.
  • the translation process continues with finding (1206) candidates for vocabulary adjustment rules in the samplecontent.txt file using the Candidate Search Engine.
  • the Candidate Search Engine may implement a case variant search and full/abbreviated variant search, as well as a classification analysis, at this point in the process.
  • the resulting information can be used to write vocabulary adjustment rules.
  • Vocabulary adjustment rules may be written to convert abbreviated forms to their full forms.
  • Vocabulary adjustment rules are then run (1212) using the Natural Language Engine against the original content. Finally, the coverage of the data set can be analyzed (1214) by evaluating the performance of the vocabulary adjustment rules and of the attribute rules. At this point, if the proper coverage is being achieved by the vocabulary adjustment rules, then the process proceeds to building (1216) a domain-specific dictionary.
  • the SME can run a translation dictionary creation utility. This runs using the rule files created above as input, and produces the initial translation dictionary file.
  • This translation dictionary file contains the words and phrases that were found in the rules.
  • the words and phrases found in the translation dictionary file can then be manually and/or machine translated (1218). This involves extracting a list of all word types using a text editor and then translating the normalized forms manually or through a machine tool such as SYSTRAN. The translated forms can then be inserted into the dictionary file that was previously output.
  • the SME can run (1220) the machine translation module, run the repair module, and run the TQE module.
  • the file outputs from TQE are reviewed (1222) to determine whether the translation results are acceptable.
  • the acceptable translated content can be loaded (1224) into the Localized Content Server (LCS), if desired.
  • the remainder of the translated content can be analyzed (1226) to determine what changes to make to the normalization and translation knowledge bases in order to improve the quality of the translation. Words and phrases that should be deleted during the translation process can be deleted (1228) and part-of-speech labels can be added, if needed.
  • the SME can then create (1230) a file containing the translated words in the source and target languages. Once all of the content is found to be acceptable, the system is fully trained. The good translated content is then loaded into the LCS.
  • FIG. 13 shows an example of such an interface.
  • the graphical desktop 1300 is divided into multiple workspaces, in this case, including workspaces 1302, 1304 and 1306.
  • One workspace 1302 presents the source file content that is in process, e.g., being normalized and translated.
  • a second area 1304, in this example, functions as the normalization workbench interface and is used to perform the various configuration processes such as replacing various abbreviations and expressions with standardized terms or, in the illustrated example, defining a parse tree.
  • workspace 1306 may be provided for accessing other tools such as the Candidate Search Engine which can identify terms for normalization or, as shown, allow for selection of rules.
  • Normalized terms are highlighted relative to the displayed source file in workspace 1302 on a continuously updated basis.
  • the SME can readily determine when all or enough of the source file has been normalized.
  • This translation process is essentially an offline process. It becomes real-time and online when new content is added to the system. In this case, assuming well-developed special-purpose dictionaries and linguistic information already exist, the process can proceed in an automatic fashion.
  • Content, once translated, is stored in a specially indexed look-up database. This database functions as a translation memory repository. With this type of storage environment, the translated content can be scaled to virtually any size and be directly accessed in the e-business process.
  • the associated architecture for supporting both configuration and run-time operation is discussed below.
  • the SOLx system operates in two distinct modes.
  • the "off-line" mode is used to capture knowledge from the SME/translator and knowledge about the intended transformation of the content. This collectively defines a knowledge base.
  • the off-line mode includes implementation of the configuration and translation processes described above. Once the knowledge base has been constructed, the SOLx system can be used in a file in/file out manner to transform content.
  • the SOLx system may be implemented in a variety of business-to- business (B2B) or other frameworks, including those shown in Fig. 14.
  • The Source 1402, the firm that controls the original content 1404, can be interfaced with three types of content processors 1406.
  • The SOLx system 1400 can interface at three levels: with a Local Platform 1408 (associated with the source 1402), with a Target Platform 1410 (associated with a target to whom the communication is addressed or by whom it is otherwise consumed) and with a Global Platform 1412 (separate from the source 1402 and target 1410).
  • a primary B2B model of the present invention focuses on a Source/Seller managing all transformation/localization.
  • The Seller will communicate with other Integration Servers (such as WebMethods) and bare applications in a "Point to Point" fashion; therefore, all locales and data are registered and all localization is done on the seller side. However, all or some of the localization may be managed by the buyer or on a third-party platform such as the global platform.
  • Another model which may be implemented using the global server, would allow two SOLx B2B-enabled servers to communicate in a neutral environment, e.g. English. Therefore, a Spanish and a Japanese system can communicate in English by configuring and registering the local communication in SOLx B2B.
  • a third model would include a local seller communicating directly (via HTTP) with the SOLx B2B enabled Buyer.
  • The SOLx Globalization server consists of two major components: (1) the Document Processing Engine and (2) the Translated Content Server (TCS).
  • the Document Processing Engine is a WebMethods plug-compatible application that manages and dispenses localized content through XML- tagged business objects.
  • the TCS contains language-paired content that is accessed through a cached database.
  • This architecture assures very high-speed access to translated content.
  • This server uses a hash index on the translated content cross-indexed with the original part number or a hash index on the equivalent original content, if there is not a unique part number.
  • a direct link between the original and translated content via the part number (or hash entry) assures retrieval of the correct entry.
  • the indexing scheme also guarantees very fast retrieval times.
  • the process of adding a new localized item to the repository consists of creating the hash index, link to the original item, and its inclusion into the repository.
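  • The add-and-lookup scheme can be sketched roughly as follows; this is an in-memory simplification, since the actual TCS is backed by a cached database, and the choice of hash function here is an assumption:

```python
# Sketch of a translated-content store keyed by part number when available,
# otherwise by a hash of the original description.
import hashlib

class TranslatedContentStore:
    def __init__(self):
        self._by_key = {}

    @staticmethod
    def _key(part_number, original_text):
        if part_number:
            return ("part", part_number)
        digest = hashlib.sha1(original_text.encode("utf-8")).hexdigest()
        return ("hash", digest)

    def add(self, part_number, original_text, translated_text):
        self._by_key[self._key(part_number, original_text)] = {
            "original": original_text,
            "translated": translated_text,
        }

    def lookup(self, part_number=None, original_text=""):
        entry = self._by_key.get(self._key(part_number, original_text))
        return entry["translated"] if entry else None

tcs = TranslatedContentStore()
tcs.add("SKU-1001", "amber glass bottle", "botella de vidrio ambar")
print(tcs.lookup(part_number="SKU-1001"))
```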
  • the TCS will store data in Unicode format.
  • The TCS can be used in a standalone mode where content can be accessed by the SKU or part number of the original item, or through text searches of either the original content or its translated variant. If the hashed index of the translated content is known, it can, of course, be accessed that way. Additionally, the TCS will support SQL-style queries through the standard Oracle SQL query tools.
  • the Document Processing Engine is the software component of the Globalization Server that allows localized content in the TCS to be integrated into typical B2B Web environments and system-to-system transactions.
  • XML is rapidly replacing EDI as the standard protocol for Web-based B2B system- to-system communication.
  • WebMethods is one such adaptor but any such technology may be employed.
  • Figure 15 shows a conventional web system 1500 in which the WebMethods integration server 1502 takes as input SAP-formatted content called an IDOC 1504 from a source back office 1501 via API 1503 and converts it into an XML-formatted document 1506 for transmission over the Web 1508, via optional application server 1510 and HTTP servers 1512, to some other receiver such as a Target back office 1514 or other ERP system.
  • the document 1506 may be transmitted to Target back office 1514 via HTTP servers 1516 and an integration server 1518.
  • FIG. 16 shows the modification of such a system that allows the TCS 1600 containing translated content to be accessed in a Web environment.
  • original content from the source system 1602 is translated by the NorTran Server 1604 and passed to a TCS repository 1606.
  • a transaction request whether requested from a foreign system or the source system 1602, will pass into the TCS 1600 through the Document Processing Engine 1608. From there, a communication can be transmitted across the Web 1610 via integration server adaptors 1612, an integration server 1614, an optional application server 1616 and HTTP servers 1618.
  • FIG. 17 depicts the major components of one implementation of the SOLx system 1700 and the SOLx normalization/classification processes as discussed above.
  • the NorTran Workbench/Server 1702 is that component of the SOLx system 1700 that, under the control of a SME/translator 1704, creates normalized/translated content.
  • the SOLx Server 1708 is responsible for the delivery of content either as previously cached content or as content that is created from the real-time application of the knowledge bases under control of various SOLx engines.
  • the initial step in either a normalization or translation process is to access legacy content 1710 that is associated with the firms' various legacy systems 1712.
  • the legacy content 1710 may be provided as level 1 commerce data consisting of short descriptive phrases delivered as flat file structures that are used as input into the NorTran Workbench 1702.
  • the NorTran Workbench (NTW) 1702 is used to learn the structure and vocabulary of the content.
  • the NTW user interface 1716 allows the SME 1704 to quickly provide the system 1700 with knowledge about the content.
  • NTW 1702 is used to normalize and translate large quantities of content.
  • One purpose of NTW 1702 is to allow SMEs 1704 to use a visual tool to specify rules for parsing domain data and rules for writing out parsed data in a normalized form.
  • the NTW 1702 allows the SME 1704 to choose data samples from the main domain data, then to select a line at a time from that sample.
  • the SME 1704 can build up parse rules that tell the Natural Language Engine (NLE) 1718 how to parse the domain data.
  • the SME 1704 can then use visual tools to create rules to specify how the parsed data will be assembled for output - whether the data should be reordered, how particular groups of words should be represented, and so on.
  • the NTW 1702 is tightly integrated with the NLE 1718. While the NTW 1702 allows the user to easily create, see, and edit parse rules and normalization rules, the NLE 1718 creates and stores grammars from these rules.
  • The GUI 1716 does not require the SME 1704 to have any background in computational linguistics, natural language processing or other abstract language skills whatsoever.
  • the content SME 1704 must understand what the content really is, and translators must be technical translators.
  • a "butterfly valve" in French does not translate to the French words for butterfly and valve.
  • the CSE 1720 is a system initially not under GUI 1716 control that identifies terms and small text strings that repeat often throughout the data set and are good candidates for the initial normalization process.
  • the SOLx system 1700 provides components and processes that allow the SME 1704 to incorporate the knowledge that he already has into the process of writing rules. However, some domains and data sets are so large and complex that they require normalization of things other than those that the SME 1704 is already aware of. Manually discovering these things in a large data set is time-consuming and tedious.
  • the CSE 1720 allows automatic application of the "rules of thumb” and other heuristic techniques that data analysts apply in finding candidates for rule writing.
  • the CSE component works through the programmatic application of heuristic techniques for the identification of rule candidates. These heuristics were developed from applying knowledge elicitation techniques to two experienced grammar writers. The component is given a body of input data, applies heuristics to that data, and returns a set of rule candidates.
  • the N-Gram Analysis (NGA) lexical based tool 1722 identifies word and string patterns that reoccur in the content. It identifies single and two and higher word phrases that repeat throughout the data set. It is one of the core technologies in the CSE 1720. It is also used to identify those key phrases that should be translated after the content has been normalized.
  • the N-Gram Analysis tool 1722 consists of a basic statistical engine, and a dictionary, upon which a series of application engines rely.
  • the applications are a chunker, a tagger, and a device that recognizes the structure in structured text.
  • Fig. 18 shows the relationships between these layers.
  • One purpose of the base N-Gram Analyzer component 1800 is to contribute to the discovery of the structure in structured text. That structure appears on multiple levels, and each layer of the architecture works on a different level. The levels, from the bottom up, are "words", "terms", "usage", and "dimensions of schema". The following example shows the structure of a typical product description.
  • the word-level of structure is a list of the tokens in the order of their appearance.
  • the word “acetone” is first, then the word “amber”, and so forth.
  • the next two levels of structure connect the words and terms to the goal of understanding the product description.
  • the SOLx system approximates that goal with a schema for understanding.
  • the schema has a simple form that repeats across many kinds of products.
  • the schema for product descriptions looks like a table.
  • Each column of the table is a property that characterizes a product.
  • Each row of the table is a different product. In the cells of the row are the particular values of each property for that product. Different columns may be possible for different kinds of products.
  • This report refers to the columns as "dimensions" of the schema. For other subject matter, the schema may have other forms. This fragment does not consider those other forms.
  • the next level of structure is the usage level. That level classifies each word or term according to the dimension of the schema that it can describe. In the example, “acetone” is a “chemical”; “amber glass” is a material; “bottle” is a “product”; and so forth.
  • the following tagged text shows the usage level of structure of the example in detail.
  • The discovery of structure by N-Gram Analysis is parallel to the discovery of structure by parsing in the Natural Language Engine.
  • the two components are complementary, because each can serve where the other is weak.
  • the NLE parser could discover the structure of the decimal number, "[number](99.5)", saving NGA the task of modeling the grammar of decimal fractions.
  • the statistical model of grammar in NGA can make it unnecessary for human experts to write extensive grammars for NLE to extract a diverse larger-scale grammar. By balancing the expenditure of effort in NGA and NLE, people can minimize the work necessary to analyze the structure of texts.
  • One of the basic parts of the NGA component 1800 is a statistical modeler, which provides the name for the whole component. The statistical idea is to count the sequences of words in a body of text in order to measure the odds that a particular word appears after a particular sequence.
  • The statistical modeler computes the conditional probability of word n, given words 1 through n-1, i.e., P(wn | w1, ..., wn-1).
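  • A minimal sketch of such a statistical modeler, assuming a bigram approximation chosen only for brevity (the actual component is not limited to bigrams), is:

```python
# Count word sequences and estimate P(word_n | preceding words) with a bigram
# approximation: P(w_n | w_1..w_{n-1}) ~= P(w_n | w_{n-1}).
from collections import defaultdict

def train_bigrams(descriptions):
    pair_counts = defaultdict(int)
    prefix_counts = defaultdict(int)
    for text in descriptions:
        tokens = ["<s>"] + text.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            pair_counts[(prev, word)] += 1
            prefix_counts[prev] += 1

    def probability(word, prev):
        if prefix_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, word)] / prefix_counts[prev]

    return probability

p = train_bigrams([
    "acetone amber glass bottle",
    "acetone amber glass jar",
])
print(p("glass", "amber"))   # 1.0
print(p("bottle", "glass"))  # 0.5
```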
  • the dictionary component 1802 captures that kind of information at the levels of words, terms, and usage. Two sources may provide that information. First, a human expert could add words and terms to the dictionary, indicating their usage. Second, the NLE component could tag the text, using its grammar rules, and the NGA component adds the phrases inside the tags to the dictionary, using the name of the tag to indicate the usage.
  • the information in the dictionary complements the information in the statistical model by providing a better interpretation of text when the statistical assumption is inappropriate.
  • the statistical model acts as a fallback analysis when the dictionary does not contain information about particular words and phrases.
  • the chunker 1804 combines the information in the dictionary 1802 and the information in the statistical model to partition a body of texts into phrases. Partitioning is an approximation of parsing that sacrifices some of the details of parsing in order to execute without the grammar rules that parsing requires.
  • the chunker 1804 attempts to optimize the partitions so each cell is likely to contain a useful phrase. One part of that optimization uses the dictionary to identify function words and excludes phrases that would cut off grammatical structures that involve the function words.
  • the chunker can detect new terms for the dictionary in the form of cells of partitions that contain phrases that are not already in the dictionary.
  • the output of the chunker is a list of cells that it used to partition the body of text.
  • the tagger 1806 is an enhanced form of the chunker that reports the partitions instead of the cells in the partitions.
  • When the dictionary provides a usage for a phrase, the tagger prints the phrase with that usage as a tag. Otherwise, the tagger prints the phrase without a tag.
  • the result is text tagged with the usage of the phrases.
  • the structurer 1808 uses the statistical modeler to determine how to divide the text into dimensions of the schema, without requiring a person to write grammar rules.
  • the training data for the structurer's statistical model is a set of tagged texts with explicit "walls" between the dimensions of the schema.
  • the structurer trains by using the N-Gram Analyzer 1800 to compute the conditional probabilities of the walls in the training data.
  • the structurer 1808 operates by first tagging a body of text and then placing walls into the tagged text where they are most probable.
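  • A greatly simplified sketch of the wall-placement idea follows; here the wall decision is a lookup from usage tag to schema dimension, whereas the real structurer learns wall probabilities from walled training examples, and the tag-to-dimension mapping and dimension names shown are assumptions:

```python
# Place "walls" between dimensions of the schema by starting a new dimension
# whenever the next tagged phrase belongs to a different dimension.
TAG_TO_DIMENSION = {
    "chemical": "substance",
    "material": "packaging",
    "product": "packaging",
    "number": "purity",
    "unit": "purity",
}

tagged = [
    ("acetone", "chemical"),
    ("amber glass", "material"),
    ("bottle", "product"),
    ("99.5", "number"),
    ("%", "unit"),
]

def place_walls(tagged_phrases, tag_to_dim):
    dimensions, current, current_dim = [], [], None
    for phrase, tag in tagged_phrases:
        dim = tag_to_dim.get(tag, "other")
        if current and dim != current_dim:
            dimensions.append((current_dim, current))  # wall goes here
            current = []
        current.append(phrase)
        current_dim = dim
    if current:
        dimensions.append((current_dim, current))
    return dimensions

for dim_name, phrases in place_walls(tagged, TAG_TO_DIMENSION):
    print(dim_name, phrases)
```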
  • The candidate heuristics are a series of knowledge bases, much like pre-defined templates, that kick-start the normalization process. They are intended to address pieces of content that pervade user content. Items such as units of measure, power consumption, colors, capacities, etc., are developed as semantic categories 1724.
  • the spell checker 1726 is a conventional module added to SOLx to increase the effectiveness of the normalization.
  • The Grammar & Rules Editor (GRE) 1728 is a text-editing environment that uses many Unix-like tools for the creation of rules and grammars describing the content. It can always be used as a fall-back, but will rarely be necessary when the GUI 1716 is available.
  • the Taxonomy, Schemas, & Grammar Rules module 1730 is the output from either the GRE 1728 or the GUI 1716. It consists of a set of ASCII files that are the input into the natural language parsing engine (NLE) 1718.
  • the NLE 1718 reads a set of grammar and normalization rules from the file system or some other persistent storage medium and compiles them into a set of Rule objects employed by the runtime tokenizer and parser and a set of NormRule objects employed by the normalizer. Once initialized the NLE 1718 will parse and normalize input text one line at a time or may instead process a text input file in batch mode, generating a text output file in the desired form. Configuration and initialization generally requires that a configuration file be specified. The configuration file enumerates the contents of the NLE knowledge base, providing a list of all files containing format, grammar, and normalization rules.
  • NLE 1718 works in three steps: tokenization, parsing, and normalization.
  • Tokenization is based on what sequences of tokens may occur in any top-level phrase parsed by the grammar. Tokens must be delineated by white space unless one or more of such tokens are represented as regular expressions in the grammar, in which case the tokens may be contiguous, undelineated by white space. Tokenization may yield ambiguous results, i.e., identical strings that may be parsed by more than one grammar rule. The parser resolves such ambiguities.
  • the parser is a modified top-down chart parser.
  • Standard chart parsers assume that the input text is already tokenized, scanning the string of tokens and classifying each according to its part of speech or semantic category. This parser omits the scanning operation, replacing it with the prior tokenization step. Like other chart parsers, it recursively predicts those constituents and child constituents that may occur per the grammar rules and tries to match such constituents against tokens that have been extracted from the input string. Unlike the prototypical chart parser, it is unconstrained as to where phrases may begin and end, how often they may occur in an input string, or whether some of the input text might be unable to be parsed.
  • Each parse tree object includes methods for transforming itself according to a knowledge base of normalization rules. Each parse tree object may also emit a String corresponding to text contained by the parse tree or such a String together with a string tag.
  • Normalization includes parse tree transformation and traversal methods for replacing or reordering children (rewrite rules), for unconditional or lookup table based text replacement, for decimal punctuation changes, for joining constituents together with specified delimiters or without white space, and for changing tag labels.
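  • A minimal sketch of these normalization operations on a parse tree is given below; the node structure, rule format and sample rules are assumptions made for illustration:

```python
# Tiny parse-tree node supporting two of the normalization operations noted
# above: lookup-table text replacement at the leaves and reordering of children.
class Node:
    def __init__(self, tag, children=None, text=None):
        self.tag, self.children, self.text = tag, children or [], text

    def normalize(self, replacements, reorder):
        if self.text is not None:
            self.text = replacements.get(self.text.lower(), self.text)
        if self.tag in reorder:
            order = reorder[self.tag]
            self.children.sort(key=lambda c: order.index(c.tag))
        for child in self.children:
            child.normalize(replacements, reorder)
        return self

    def emit(self):
        if self.text is not None:
            return self.text
        return " ".join(child.emit() for child in self.children)

tree = Node("item", [
    Node("attr_resistance", [Node("number", text="100"), Node("ohm", text="ohms")]),
    Node("product", text="res."),
])
replacements = {"res.": "resistor", "ohms": "ohm"}
reorder = {"item": ["product", "attr_resistance"]}
print(tree.normalize(replacements, reorder).emit())  # resistor 100 ohm
```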
  • The Trial Parsed Content 1734 is a set of test samples of either tagged or untagged normalized content. This sample corresponds to a set of rules and grammars that have been parsed. Trial parsed content is the output of a statistical sample of the original input data. When a sequence of content samples parses to a constant level of unparsed input, then the set of grammars and rules is likely to be sufficiently complete that the entire data set may be successfully parsed with a minimum of ambiguities and unparsed components. It is part of the interactive process to build grammars and rules for the normalization of content.
  • A complete tested grammar and rule set 1736 corresponding to the full unambiguous tagging of content is the goal of the normalization process. It ensures that all ambiguous terms or phrases, such as "Mil", which could be either a trade name abbreviation for Milwaukee or an abbreviation for Military, have been defined in a larger context.
  • This set 1736 is then given as input to the NLE Parsing Engine 1738 that computes the final normalized content, and is listed in the figure as Taxonomy Tagged Normalized Content 1732.
  • the custom translation dictionary 1740 is a collection of words and phrases that are first identified through the grammar rule creation process and passed to an external technical translator. This content is returned and is entered into one of the custom dictionaries associated with the machine translation process. There are standard formats that translators typically use for sending translated content.
  • The MTS 1742 may be any of various conventional machine translation products that, given a set of custom dictionaries as well as its standard ones and a string of text in one language, produces a string of text in the desired language.
  • Current languages supported by one such product, marketed under the name SYSTRAN, include: French, Portuguese, English, German, Greek, Spanish, Italian, simplified Chinese, Japanese, and Korean.
  • Output from the MTS is a Translated Content file 1744.
  • One purpose of the Machine Translation Server 1742 is to translate structured texts, such as product descriptions.
  • the state of the art in commercial machine translation is too weak for many practical applications.
  • the MTS component 1742 increases the number of applications of machine translation by wrapping a standard machine translation product in a process that simplifies its task.
  • the simplification that MTS provides comes from its ability to recognize the structure of texts to be translated.
  • the MTS decomposes the text to be translated into its structural constituents, and then applies machine translation to the constituents, where the translation problem is simpler.
  • This approach sacrifices the fidelity of references between constituents in order to translate the individual constituents correctly. For example, adjective inflections could disagree with the gender of their objects, if they occur in different constituents.
  • the compromise results in adequate quality for many new applications in electronic commerce. Future releases of the software will address this issue, because the compromise is driven by expedience.
  • the conditioning component of MTS 1742 uses the NGA component to recognize the structure of each text to be translated. It prepares the texts for translation in a way that exploits the ability of the machine translation system to operate on batches of texts. For example, SYSTRAN can interpret lists of texts delimited by new-lines, given a parameter stating that the document it receives is a parts list. Within each line of text, SYSTRAN can often translate independently between commas, so the conditioning component inserts commas between dimensions of the schema if they are not already present. The conditioning component may completely withhold a dimension from machine translation, if it has a complete translation of that dimension in its dictionary.
  • the machine translation component provides a consistent interface for a variety of machine translation software products, in order to allow coverage of language pairs.
  • The repair component is a simple automated text editor that removes unnecessary words, such as articles, from SYSTRAN's Spanish translations of product descriptions. In general, this component will correct for small-scale stylistic variations among machine translation tools.
  • the Translation Quality Estimation Analyzer (TQA) 1746 merges the structural information from conditioning with the translations from repair, producing a list of translation pairs. If any phrases bypassed machine translation, this merging process gets their translations from the dictionary. After merging, translation quality estimation places each translation pair into one of three categories.
  • the "good” category contains pairs whose source and target texts have acceptable grammar, and the content of the source and target texts agrees.
  • a pair in the "bad” category has a source text with recognizable grammar, but its target grammar is unacceptable or the content of the source text disagrees with the content of the target text.
  • the "ugly" category contains pairs whose source grammar is unfamiliar.
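  • The three-way triage can be sketched as follows; the grammar and content checks are placeholder predicates standing in for the statistical models and dictionary described in the text:

```python
# Sort translation pairs into "good", "bad", and "ugly" buckets.
# The three predicates below stand in for the source-grammar, target-grammar,
# and content-agreement models described in the surrounding text.
def categorize(pair, source_grammar_ok, target_grammar_ok, content_agrees):
    if not source_grammar_ok(pair):
        return "ugly"                       # source grammar unfamiliar
    if target_grammar_ok(pair) and content_agrees(pair):
        return "good"
    return "bad"

pair = ("amber glass bottle", "botella de vidrio ambar")
print(categorize(pair,
                 source_grammar_ok=lambda p: True,
                 target_grammar_ok=lambda p: True,
                 content_agrees=lambda p: True))   # good
```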
  • the feedback loop extracts linguistic knowledge from a person.
  • the person examines the "bad” and “ugly” pairs and takes one of the following actions.
  • the person may define words and terms in the dictionary, indicating their usage.
  • the person may define grammar rules for the NLE component in order to tag some part of the text.
  • The person may correct the translation pair.
  • the person may take the source text, mark it with walls between dimensions of the schema, and place it into the set of examples for training the structure model.
  • An appropriate graphical user interface will make the first and last actions implicit in the third action, so a person will only have to decide whether to write grammars or to correct examples.
  • the translation quality estimation component uses two models from the N-Gram Analyzer that represent the grammar of the source and target texts.
  • the translation quality estimation component also uses a content model that is partially statistical and partially the dictionary. The two parts overlap in their ability to represent the correspondence in content between source and target texts.
  • the dictionary can represent exact correspondences between words and terms.
  • the statistical model can recognize words that occur in one language, but are unnecessary in the other, and other inexact correspondences.
  • the TQA 1746 attempts to define a measure of accuracy for any single translation.
  • the basis for the accuracy estimate is a statistical overlap between the translated content at the individual phrase level, and prior translations that have been manually evaluated.
  • the Normalized Content 1748 and/or Translated Content 1706 can next be cached in the Normalized Content Server and Localized Content Server (LCS) 1752, respectively. This cached data is made available through the SOLx Server 1708.
  • The LCS 1752 is a fast lookup translation cache. There are two parts to the LCS 1752: an API that is called by Java clients (such as a JSP server process) to retrieve translations, and a user interface 1754 that allows the user 1756 to manage and maintain translations in the LCS database 1752.
  • the LCS 1752 is also intended to be used as a standalone product that can be integrated into legacy customer servers to provide translation lookups.
  • the LCS 1752 takes as input source language text, the source locale, and the target locale.
  • the output from LCS 1752 is the target text, if available in the cache, which represents the translation from the source text and source locale, into the target locale.
  • the LCS 1752 is loaded ahead of run-time with translations produced by the SOLx system 1700.
  • the cache is stored in a relational database.
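  • A minimal sketch of the lookup interface follows; this is an in-memory stand-in for the relational cache, the actual LCS exposes a Java API, and the locale codes and sample translation are illustrative:

```python
# Translation cache keyed by (source text, source locale, target locale).
class LocalizedContentServer:
    def __init__(self):
        self._cache = {}

    def load(self, source_text, source_locale, target_locale, target_text):
        self._cache[(source_text, source_locale, target_locale)] = target_text

    def lookup(self, source_text, source_locale, target_locale):
        # Returns None when no translation has been cached for this key.
        return self._cache.get((source_text, source_locale, target_locale))

lcs = LocalizedContentServer()
lcs.load("amber glass bottle", "en_US", "es_MX", "botella de vidrio ambar")
print(lcs.lookup("amber glass bottle", "en_US", "es_MX"))
```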
  • the SOLx Server 1708 provides the customer with a mechanism for run-time access to the previously cached, normalized and translated data.
  • the SOLx Server 1708 also uses a pipeline processing mechanism that not only permits access to the cached data, but also allows true on-the-fly processing of previously unprocessed content.
  • When the SOLx Server encounters content that has not been cached, it performs the normalization and/or translation on the fly.
  • the existing knowledge base of the content structure and vocabulary is used to do the on-the-fly processing.
  • The NCS and LCS user interface 1754 provides a way for SMEs 1756 to search and use normalized 1748 and translated 1706 data.
  • the NCS and LCS data is tied back to the original ERP information via the customer's external key information, typically an item part number.
  • the primary NorTran Workbench engines are also used in the SOLx Server 1708. These include: N-Gram Analyzer 1722, Machine Translation Server 1742, Natural Language Engine 1718, Candidate Search Engine 1720, and Translation Quality Analyzer 1746.
  • The SOLx server 1708 also uses the grammar rules 1754 and custom and standard glossaries 1756 from the Workbench 1702. Integration of the SOLx server 1708 for managing communication between the source/legacy system 1712 and targets via the Web is managed by an integration server 1758 and a workflow control system 1760.
  • Fig. 21 is a flowchart illustrating a process 2100 for searching a database or network using normalization and classification as discussed above.
  • the process 2100 is initiated by establishing (2102) a taxonomy and establishing (2104) normalization rules as discussed above.
  • the taxonomy may define a subject matter area in the case of a specialized search engine or a substantial portion of a language for a more generalized tool.
  • a query is received (2106) and parsed (2108) into chunks.
  • the chunks are then normalized (2110) and classified (2112) using the normalization rules and taxonomy.
  • the classification information may be associated with the chunks via tags, e.g., XML tags.
  • the normalized chunks may be translated (2114 a-c) to facilitate multi-language searching.
  • the process for translating is described in more detail below.
  • One or more search engines are then used (2116 a-c) to perform term searches using the normalized chunks and the classification information.
  • documents that are searched have also been processed using compatible normalization rules and a corresponding taxonomy as discussed above such that responsive documents can be retrieved based on a term match and/or a tag match.
  • the illustrated process 2100 may be advantageously used even in connection with searching unprocessed documents, e.g., by using the normalized chunks and/or terms associated with the classification to perform a conventional term search.
  • the responsive documents may then be normalized and classified (2118 a-c) and translated (2120 a-c) as described in more detail below. Finally, the search results are compiled (2122) for presentation to the searcher. It will be appreciated that normalization and classification of the search query thus facilitates more structured searching of information in a database or network including in a multi-language environment. Normalization and classification also assist in translation by reducing the quantity of terms required to be translated and by using the classification structure to reduce ambiguities.
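  • A rough sketch of such a query path is shown below; normalization and classification are reduced to small lookup tables, and the sample rules, tags and documents are invented for illustration:

```python
# Normalize and classify a query into tagged chunks, then match documents on
# a term match or on all of the query's classification tags matching.
NORMALIZE = {"res.": "resistor", "1/4w": "0.25 W"}
CLASSIFY = {"resistor": "product", "0.25 W": "attr_power"}

def normalize_query(query):
    chunks = []
    for token in query.lower().split():
        term = NORMALIZE.get(token, token)
        chunks.append((term, CLASSIFY.get(term)))
    return chunks

documents = [
    {"text": "resistor 0.25 W carbon film", "tags": {"product", "attr_power"}},
    {"text": "capacitor 10 uF electrolytic", "tags": {"product", "attr_capacitance"}},
]

def search(query):
    chunks = normalize_query(query)
    query_tags = {tag for _, tag in chunks if tag}
    hits = []
    for doc in documents:
        term_hit = any(term in doc["text"] for term, _ in chunks)
        tag_hit = bool(query_tags) and query_tags.issubset(doc["tags"])
        if term_hit or tag_hit:
            hits.append(doc["text"])
    return hits

print(search("res. 1/4W"))   # ['resistor 0.25 W carbon film']
```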
  • Information shared in this manner may include mapping rules for mapping source collection terms to standardized terminology or to a previously developed classification structure or taxonomy.
  • Such sharing of information may be used to provide a head-start in connection with a new knowledge base creation project, to accommodate multiple users or SMEs working on the same subject area or domain (including at the same time) or in various other information sharing contexts.
  • the invention is described below in connection with supporting multiple SMEs developing a SMM that involves working in the same domains or at least one common domain. While this example aptly illustrates the information sharing functionality, it will be appreciated that the invention is not limited to this context.
  • Two issues that are addressed by the Knowledge Builder tool in connection with sharing information are: 1) using or importing only selected information, as may be desired, rather than being limited to using or importing a full knowledge base; and 2) resolving potential conflicts or inconsistencies resulting from multiple users working in a single domain.
  • Figure 22 generally illustrates an architecture for an information sharing environment involving multiple SMEs. For purposes of illustration, this is shown as involving a server-client model involving server 2200 and clients 2202-2204. As will be described in more detail below, certain knowledge base development functionality including information sharing functionality is executed by a Knowledge Builder tool 2206. In the illustrated embodiment, the functionality of this tool is illustrated as being distributed over the server 2200 and client 2202-2204 platforms, however, it will be appreciated that other hardware implementations are possible.
  • the SMEs use graphical interfaces 2208 at the clients 2202-2204, in the illustrated embodiment, to access a project database 2210 and a developing knowledge base 2212, each of which is schematically illustrated, in this example, as residing at the server 2202.
  • the project database 2210 may include, for example, the collection of source data that is to be transformed.
  • the knowledge base 2212 includes classification or taxonomy structure, rules and the like, that have been developed by the SMEs or others.
  • the illustrated clients 2202-2204 also include storage 2214, for storing rules and the like under development, or to temporarily store a version of the knowledge base or portions thereof, as will be described in more detail below.
  • the Knowledge Builder tool includes a Domain Management module to address the issue of using or importing only selected information.
  • the Domain Management module segments the various rules in the developing knowledge base into smaller, easily managed compartments. More specifically, the knowledge base may be graphically represented in the familiar form of files and folders.
  • a new knowledge base project is started with at least two domain folders as shown in panel 2300 of a graphical user interface.
  • the knowledge base includes a default domain folder and the common folder 2302.
  • the default domain folder includes phrases and terms that have not been assigned to other domain folders. These phrases and terms appear in the knowledge base tree 2304 under the nodes labeled "Phrase Structure” 2306 and "Terminology" 2308 directly under the "Knowledge Base” node 2310.
  • the common folder does not contain any phrases or terms.
  • the Knowledge Builder tool attempts to automatically place the rules into the appropriate domain folder when they are created. If a domain has not been specified, all created rules are placed in the phrase structure or terminology folders 2306 or 2308 under the knowledge base node 2310.
  • When a new domain is created, the Knowledge Builder tool continues to place rules in the phrase structure or terminology folders 2306 or 2308 until the user manually drags the rules into the new domain. Thereafter, when new rules are created, the Knowledge Builder tool analyzes the new rules to determine whether they are related to the rules in the new folder. If so, the tool will automatically place the newly created rules in the same folder. Such analysis may involve consideration of the associated terminology or any identification of a classification or other taxonomical structure, for example, dependencies and references as described below.
  • Domains can be nested to any level.
  • Whenever a domain is created, the Knowledge Builder tool automatically creates a common folder at the same level. Whenever a subdomain is created, the system creates a sub common folder that is initially empty. If an additional subdomain is created and populated with rules, the Knowledge Builder tool will automatically move rules common to the two subdomains into the sub common folder. The tool moves rules into and out of common domains as additional rules are created and depending on where they are positioned within the domain hierarchy.
  • The user can also move phrase rules from one domain to another. As phrases are moved into a domain, related phrase and terminal rules are also moved, either into the same domain or into the appropriate common domain. For improved efficiency, top-level phrase rules can be moved, thereby implicitly dragging related phrase and terminal rules into a domain.
  • a user can also move domain folders into other domains. When a domain folder is moved, all of the associated rules are also moved. This can also create additional common folders. As noted above, information sharing can facilitate creation of new knowledge bases. In this regard, when a new project is created, the user can select a single domain from an existing project to import into the new project. Multiple domains can be imported in this manner with any resulting inconsistencies addressed as discussed below.
  • A fundamental step in the process of knowledge base development is domain creation. Domain creation can be accomplished using the Knowledge Builder tool as follows.
  • Fig. 24 illustrates a graphical user interface 2400 that may be displayed upon launching the Knowledge Builder tool.
  • the graphical user interface 2400 generally includes a knowledge base structure or classification panel 2402, a source collection or project panel 2404, and a taxonomy or parse tree panel 2406. The interoperation of these panels is described below.
  • the user can right-click on the knowledge base node 2408 of the knowledge base panel 2402. This causes a pop-up window 2500 to be displayed as shown in Fig. 25. From the pop-up window 2500, the user selects the create subdomain entry 2502. The user is then prompted to name this first domain as shown in Fig. 26. In the illustrated example, the new domain is named "passives.” As shown in Fig. 27, the knowledge base panel 2700 is then updated to include a folder icon 2702 for the "passives" domain.
  • Rules may be moved into the passives domain after that domain is established. For example, rules may be dragged from their current location in the knowledge base panel to the desired domain folder. Alternatively, rules can be dragged into domain folders using a move rules dialog. To open the move rules dialog, the edit/move rules menu (not shown) is selected and the rule is dragged from the knowledge base tree onto the desired domain in the resulting dialog. The advantage of using the move rules dialog is minimizing scrolling through the knowledge base tree.
  • Domains may be renamed by selecting the appropriate domain, right-clicking and selecting the rename domain menu item. A domain name dialog is then opened as shown above and can be used to enter the new name. Domains may be deleted by selecting the appropriate domain, right-clicking and selecting the delete domain menu item. It should be noted that the associated rules are not deleted; they move to the next level in the knowledge base tree. This may involve moving rules to other domains, other common folders or root folders. In the illustrated implementation, it is not possible to simultaneously delete domains and associated rules by deleting only the domain (though such functionality could optionally be supported). Individual rules are deleted either before or after the domain itself is deleted.
  • domains may be imported into a new project without importing the entire prior project. This allows for more efficient reuse of knowledge previously created.
  • The file/import domains menu item is selected. This opens an import domains dialog box 2800 as shown in Fig. 28.
  • A pull-down menu 2802 can then be utilized to select the project from which the user wishes to import a domain.
  • Panel 2804 of the dialog box 2800 displays the knowledge base tree from the selected project. The desired domain can then be dragged to the target position in the knowledge base tree of the knowledge base panel 2900 as shown in Fig. 29.
  • Each SME may then "check-out" a version of the domain for revision and extension.
  • the revisions and extensions may be analyzed relative to pre-defined rules.
  • the rules may cause the Knowledge Builder tool to accept revisions and extensions that do not result in conflicts relative to the definitive version and reject all other revisions or extensions.
  • revisions and extensions that result in conflicts or inconsistencies may be identified so that they can be resolved by an authorized SME, e.g., by selecting one of the conflicting rules and editing the other to be replaced by or consistent therewith.
  • all conflicts or inconsistencies may be listed or highlighted for arbitration.
  • one of the SMEs may be designated as dominant with respect to a particular project, such that his revisions and extensions are accepted as definitive. Revisions and extensions by other, subservient SMEs would then be rejected or harmonized with the knowledge base of the dominant SME by arbitration rules as discussed above. Further, rather than checking-out and checking back-in domain versions as discussed above, arbitration can be executed in real time as knowledge base development is occurring.
  • For example, if an SME proposes that the term "mil" be rewritten as "milliliter" and a rule already exists (in the same domain or anywhere within the knowledge base, depending on the specific implementation) that requires "mil" to be rewritten as "Milwaukee," the SME may be immediately notified by way of an error message upon entry of the proposed rule.
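  • A minimal sketch of that conflict check, assuming rewrite rules are simple source-to-target pairs, is:

```python
# Reject a proposed rewrite rule if the same source term already rewrites to a
# different target anywhere in the visible knowledge base.
existing_rules = {"mil": "Milwaukee"}

def propose_rule(source_term, target_term, rules):
    current = rules.get(source_term)
    if current is not None and current != target_term:
        raise ValueError(
            f'Conflict: "{source_term}" already rewrites to "{current}", '
            f'cannot also rewrite to "{target_term}"')
    rules[source_term] = target_term

propose_rule("qt", "quart", existing_rules)       # accepted
try:
    propose_rule("mil", "milliliter", existing_rules)
except ValueError as err:
    print(err)                                    # conflict reported to the SME
```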
  • the Knowledge Builder tool executes logic for identifying or preventing conflicts and inconsistencies. This may be based on dependencies and references.
  • a rule dependency is a relationship between two rules. The dependency is a second rule that must be defined in order for the first rule to be valid. Only phrase structure rules have dependencies.
  • A phrase structure rule's dependency set is that set of rules that appear as constituents in its productions. Those dependencies are apparent by inspecting a parse tree of the knowledge base panel. Thus, the rule corresponding to a parent node in a parse tree is said to depend on any rule corresponding to its child nodes.
  • In the example of the referenced figure, [attr_resistance] has dependencies on [number] and [ohm], and [number] has at least a dependency on [period] that is apparent in this particular parse tree.
  • Other parse trees may reveal other [number] dependencies, e.g., [integer] and [fraction].
  • References are the inverse of dependencies.
  • a reference is one of possibly several rules that depends on the current rule.
  • Rules [screw_dimension], [thread_dia], [real], [separator_-], [separator_pound] and [separator_colon] are each referenced by [sae_thread_size], although each may be referenced by other rules, too.
  • the Knowledge Builder tool provides a utility to get a list of references for any rule. The utility is accessed by right-clicking on any rule in the knowledge tree. A menu is then displayed that includes the entry "get references.” By selecting the "get references" item, a display is provided as shown in Fig. 32.
  • a grammar rule that resides in some domain may have dependencies on any of the following objects: any rule that resides in the same domain; any rule that resides in a child or descendant domain; and any rule that resides in a Common domain that is a child of a parent or ancestor domain. Thus, the scope of a domain is limited to that domain, any child or descendant domain, and any Common domain that is the child of a parent or ancestor domain.
  • any rule in FactoryEquipment_and_supplies may have dependencies on all rules in adhesives_and_sealants, chemicals, Common, engines_and_motors, and any other of its subdomains, because they are all its children, as well as on the top-level Common.
  • a rule in FactoryEquipment_and_supplies may not have dependencies on any rules in ComputerEquipment_and_supplies, hardware, or other siblings. Nor may it reference rules at the root level in the "phrase structure" and "terminology" folders immediately under the "knowledge base" node.
  • “chemicals” may not have dependencies on “tools,” “hardware,” or the root domain, but rules in “chemicals” may reference the factory equipment Common and the root Common.
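  • The domain scoping rules above lend themselves to a simple membership test. The sketch below assumes a hypothetical domain tree encoded as a parent map; the domain names mirror the example and the function is illustrative only, not the tool's actual logic.

    # Hypothetical domain tree: each domain maps to its parent (None for the root).
    PARENT = {
        "knowledge base": None,
        "Common": "knowledge base",
        "FactoryEquipment_and_supplies": "knowledge base",
        "ComputerEquipment_and_supplies": "knowledge base",
        "factory Common": "FactoryEquipment_and_supplies",
        "chemicals": "FactoryEquipment_and_supplies",
        "tools": "FactoryEquipment_and_supplies",
    }
    COMMON_DOMAINS = {"Common", "factory Common"}

    def ancestors(domain):
        """Yield the domain and each of its ancestors up to the root."""
        while domain is not None:
            yield domain
            domain = PARENT[domain]

    def dependency_in_scope(rule_domain, dep_domain):
        """A rule may depend on rules in its own domain or a descendant domain, or on rules
        in a Common domain whose parent is an ancestor of the rule's domain."""
        if rule_domain in set(ancestors(dep_domain)):          # same domain or a descendant
            return True
        return dep_domain in COMMON_DOMAINS and PARENT[dep_domain] in set(ancestors(rule_domain))

    print(dependency_in_scope("chemicals", "factory Common"))  # True
    print(dependency_in_scope("chemicals", "Common"))          # True (root Common)
    print(dependency_in_scope("chemicals", "tools"))           # False (sibling)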
  • Fig. 34 generally illustrates a grammar. If a user creates a new domain "PassiveElectronics” and new subdomains “resistors” and “capacitors” under it, a new Common will automatically be inserted under “PassiveElectronics.”
  • the rule [resistor] has no dependencies. If one drags it to "resistors," no other rules will move there. However, if the user were to drag [product_resistor] to "resistors," more than half the rules in the grammar would be automatically moved there, including [res_type], [variable], capacitors and [number], which has references in all three domains and may only be assigned to the Common under the root.
  • Fig. 36 is a flow chart illustrating a process 3600 for augmenting a grammar from component domains. The process 3600 is initiated by opening the current project; the current project includes a source listing that is to be transformed.
  • the user tests 3604 the current project using the current knowledge base and saves the result for regression testing.
  • the user augments 3606 the knowledge base from a grammar in an external project, as described above. In this regard, the user may drag 3608 one or more domains from the external project into the current project.
  • the Knowledge Builder tool checks for inconsistencies and conflicts, for example, based on the dependency and reference listings. Each inconsistency and conflict is identified to the user who responds to each, for example, by harmonizing such inconsistencies and conflicts.
  • the user can then retest 3610 the project using the modified knowledge base and run a further regression test with the previously saved data.
  • the user judges 3612 the result to determine whether the original knowledge base or the modified knowledge base is more effective.
  • the Knowledge Builder tool may also analyze the regression test to identify the sources of progressions and regressions so as to facilitate troubleshooting.
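  • The regression comparison of steps 3604-3612 amounts to diffing two result sets keyed by record. The record format below is hypothetical, and the logic is only a sketch of one way such an analysis might be organized.

    def compare_runs(baseline, retest):
        """Compare saved baseline results against results from the augmented knowledge base."""
        report = {"progressions": [], "regressions": [], "changed": [], "unchanged": []}
        for record_id, old in baseline.items():
            new = retest.get(record_id, {"converted": False, "output": None})
            if new == old:
                report["unchanged"].append(record_id)
            elif new["converted"] and not old["converted"]:
                report["progressions"].append(record_id)   # now converts where it failed before
            elif old["converted"] and not new["converted"]:
                report["regressions"].append(record_id)    # converted before, fails now
            else:
                report["changed"].append(record_id)        # both convert, but the output differs
        return report

    baseline = {1: {"converted": True, "output": "8 oz. ceramic coffee cup"},
                2: {"converted": False, "output": None}}
    retest = {1: {"converted": True, "output": "8 oz. ceramic coffee cup"},
              2: {"converted": True, "output": "16 oz. ceramic coffee cup"}}
    print(compare_runs(baseline, retest))   # record 2 is a progression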
  • the present invention thus allows for knowledge sharing for a variety of purposes including facilitating the process of developing a new knowledge base and allowing multiple users to work on a single project, including simultaneous development involving a common domain. In this manner, the process for developing a knowledge base is significantly streamlined.
  • the present invention generally relates to converting data from a first or source form to a second or target form.
  • Such conversions may be desired in a variety of contexts relating, for example, to importing data into or otherwise populating an information system, processing a search query, exchanging information between information systems and translation.
  • the invention is set forth in the context of particular examples relating to processing a source stream including a product oriented attribute phrase.
  • Such streams may include information identifying a product or product type together with a specification of one or more attributes and associated attribute values.
  • in the source stream, e.g., a search query or product descriptor from a legacy information system, the product may be defined by the phrase "coffee cup" and the implicit attributes of size and material may have attribute values of "8 oz." and "ceramic," respectively.
  • a frame-slot architecture may function independently to define a full conversion model for a given conversion application, or may function in conjunction with one or more parse tree structures to define a conversion model. In the latter regard, the frame-slot architecture and parse tree structures may overlap with respect to subject matter.
  • the above-noted coffee cup example is illustrative in this regard. It may be desired to correlate the source string "8 oz. ceramic coffee cup" to a product database, electronic catalogue, web-based product information or other product listing.
  • a product listing may include a variety of product types, each of which may have associated attributes and grammar rules.
  • the product types and attributes may be organized by one or more parse-tree structures. These parse tree structures, which are described and shown in U.S. Patent Application Serial Number 10/970,372, generally organize a given subject matter into a hierarchy of classes, subclasses, etc., down to the desired level of granularity, and are useful for improving conversion accuracy and improving efficiency in building a grammar among other things.
  • "coffee cup” may fall under a parse tree node “cups” which, in turn falls under a parent node “containers” which falls under "housewares”, etc.
  • the same or another parse tree may group the term “oz.”, or a standardized expression thereof (e.g., defined by a grammar) such as "ounce” under the node “fluid measurements” (ounce may also appear under a heading such as "weights" with appropriate grammar rules for disambiguation) which, in turn, may fall under the parent node "measurements”, etc.
  • while a parse tree structure has certain efficiencies in connection with conversion processes, it also has certain limitations:
  • very deep parses may be required, e.g., in connection with processing terms associated with large data systems.
  • terms are often processed as individual fields of data rather than closer to the whole record level, thereby potentially losing contextual cues that enhance conversion accuracy and missing opportunities to quickly identify content anomalies or implement private schema to define legal attributes or values for a given information object.
  • parse tree processes may impose a rigid structure that limits applicability to a specific subject matter context, thereby limiting reuse of grammar segments.
  • a frame-slot architecture allows for consideration of source stream information at, or closer to, the whole record level. This enables substantial unification of ontology and syntax, e.g., collective consideration of attribute phrases, recognized by the grammar and attribute values contained therein. Moreover, this architecture allows for consideration of contextual cues, within or outside of the content to be converted or other external constraints or other external information.
  • the frame-slot architecture allows for consideration of the source stream "8 oz. coffee cup” in its entirety. In this regard, this stream may be recognized as an attribute phrase, having "coffee cup” as an object.
  • Grammar rules specific to this object or a class including this object or rules of a public schema may allow for recognition that "oz.” means “ounce” and "ounce” in this context is a fluid measure, not a weight measure.
  • a user-defined schema for example, a private schema of the source or target information owner, may limit legal quantity values associated with "ounce” in the context of coffee cups to, for example, "6", “8” and "16". In this case, recognition of "8" by the schema provides increased confidence concerning the conversion.
  • conversely, a failure to match may identify an anomaly, e.g., in the case of mapping records from a legacy data system to a target system, or may identify an imperfect match, e.g., in the case of a search query.
  • the frame-slot architecture thus encompasses a utility for recognizing stream segments, obtaining contextual cues from within or external to the stream, accessing grammar rules specific to the subject matter of the stream segment and converting the stream segment. This may avoid deep parses and allow for greater conversion confidence and accuracy. Moreover, greater grammar flexibility is enabled, thus allowing for a higher degree of potential reuse in other conversion contexts. In addition, executing such processes by reference to a schema enables improved context-related analysis. In short, conversions benefit from surrounding and external context cues in a manner analogous to human processing.
  • the frame-slot architecture may be developed in a top-down or bottom-up fashion.
  • objects, associated attributes and legal attribute values may be defined as schema that are imposed on the data.
  • all of these may be defined based on an analysis of a product inventory or the structure of a legacy information system.
  • the schema may limit the legal values for quantity to 6, 8 and 16. Any information not conforming to the schema would then be identified and processed as an anomaly.
  • the legal values may be defined based on the data. For example, files from a legacy information system may be used to define the legal attribute values which, then, develop as a function of the input information.
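  • The difference between the two approaches can be sketched briefly. The records, attribute names and values below are hypothetical; the point is only that a top-down schema is imposed on the data, while a bottom-up schema is harvested from it.

    # Top-down: legal values are imposed on the data as a schema.
    schema = {("coffee cup", "fluid measurement"): {"6 oz.", "8 oz.", "16 oz."}}

    # Bottom-up: legal values develop as a function of the input information.
    legacy_records = [
        {"object": "coffee cup", "fluid measurement": "8 oz."},
        {"object": "coffee cup", "fluid measurement": "12 oz."},
    ]

    def harvest_legal_values(records, attribute):
        """Collect the attribute values actually observed for each object."""
        values = {}
        for record in records:
            values.setdefault((record["object"], attribute), set()).add(record[attribute])
        return values

    def is_anomaly(record, schema, attribute):
        """Under the top-down schema, any value outside the legal set is an anomaly."""
        return record[attribute] not in schema.get((record["object"], attribute), set())

    print(harvest_legal_values(legacy_records, "fluid measurement"))
    print([is_anomaly(r, schema, "fluid measurement") for r in legacy_records])   # [False, True]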
  • Figure 45 illustrates a system 4500 for implementing such conversion processing.
  • the illustrated system 4500 includes a conversion engine 4502 that is operative to execute various grammar rules and conversion rules for converting source information to a target form.
  • the system 4500 is operative to execute both frame-slot architecture methodology and parse tree structure methodology.
  • a frame-slot architecture may be executed in accordance with the present invention in the absence of a cooperating parse tree environment.
  • the illustrated grammar engine receives inputs and/or provides outputs via a workstation associated with the user interface 4504. For example, in a set-up mode, a user may select terms for processing and create associated relationships and grammar rules via the user interface 4504.
  • a search query may be entered, and search results may be received, via the user interface 4504.
  • the grammar engine 4502 may be resident at the work station associated with the user interface 4504, or may communicate with such a work station via a local or wide area network.
  • the source content 4506 includes the source string to be converted. Depending on the specific application, this content 4506 may come from any of a variety of sources. Thus, in the case of an application involving transferring information from one or more legacy information systems into a target information system, the source content 4506 may be accessed from the legacy systems. In the case of a search engine application, the source content may be derived from a query. In other cases, the source content 4506 may be obtained from a text to be translated or otherwise converted.
  • the source content 4506 may be preprocessed to facilitate conversion or may be in raw form. In the case of preprocessing, the raw content may be supplemented, for example, with markers to indicate phrase boundaries, tags to indicate context information, or other matter.
  • Such matter may be provided in a set-up mode process.
  • some such information may be present in a legacy system and may be used by the conversion engine 4502. It will be appreciated that the sources of the content 4506 and the nature thereof are substantially unlimited.
  • the illustrated conversion engine 4502 performs a number of functions.
  • the engine 4502 is operative to process the source content 4506 to parse the content into potential objects and attributes, identify the associated attribute values, and, in some cases, recognize contextual cues and other matter additional to the content to be transformed that may be present in the source content.
  • the engine 4502 then operates to convert the relevant portion of the source content 4506 using a parse tree structure 4510 and/or a frame-slot architecture 4511, and provides a converted output, e.g., to a user or target system.
  • with regard to the parse tree structure 4510, such a structure is generally developed using the conversion engine 4502 in a set-up mode.
  • the nodes of the parse tree structure 4510 may be defined by someone familiar with the subject matter under consideration or based on an analysis of a data set.
  • certain structure developed in connection with prior conversion applications may be imported to facilitate the set-up process.
  • Such a set-up process is described in U.S. Patent Application Serial Number 10/970,372, which is incorporated herein by reference.
  • this set-up involves defining the hierarchical structure of the tree, populating the various nodes of the tree, developing standardized terminology and syntax and associated grammar and conversion rules associated with the tree and mapping source content variants to the standardized terminology and syntax.
  • the conversion engine 4502 obtains the source content 4506 and identifies potential objects, attributes and attribute values therein.
  • the source content 4506 may be parsed as discussed above.
  • the engine 4502 may obtain contextual cues 4512 to assist in the conversion.
  • cues may be internal or external to the source content 4506. External cues may be based on the identity or structure of a source information system, defined by a schema specific to the frame-slot conversion, or based on information regarding the subject matter under consideration obtained from any external source.
  • information indicating that, when used in connection with "coffee cup” the term “ounce” is a fluid (not a weight) measure may be encoded into metadata of a legacy information system, defined by a private schema developed for the subject conversion application or derived from an analysis of external information sources.
  • the conversion engine is operative to: identify potential objects, attributes and attribute values; process such information in relation to certain stored information concerning the objects, attributes and attribute values; access associated grammar and conversion rules; and convert the information from the source form to a target form.
  • the illustrated system 4500 includes stored object information 4514, stored attribute information 4516 and stored attribute value information 4518.
  • This information may be defined by a public or private schema or by reference to external information regarding the subject matter under consideration.
  • the object information 4514 may include a list of recognized objects for which the frame-slot architecture is applicable together with information associating the object with legal attributes and/or attribute values and other conversion rules associated with that object.
  • the attribute information 4516 may include a definition of legal attributes for the object together with information regarding associated attribute values and associated grammar and conversion rules.
  • the attribute value information 4518 may include a definition of legal attribute values for given attributes together with associated information concerning grammar and conversion rules.
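  • One possible shape for the stored object, attribute and attribute value information 4514, 4516 and 4518 is sketched below. The field names and example content are assumptions for illustration only, not a prescribed data model.

    from dataclasses import dataclass, field
    from typing import Dict, List, Set

    @dataclass
    class AttributeValueInfo:        # cf. stored attribute value information 4518
        legal_values: Set[str]
        conversion_rules: List[str] = field(default_factory=list)

    @dataclass
    class AttributeInfo:             # cf. stored attribute information 4516
        name: str
        values: AttributeValueInfo
        grammar_rules: List[str] = field(default_factory=list)

    @dataclass
    class ObjectInfo:                # cf. stored object information 4514
        name: str
        attributes: Dict[str, AttributeInfo] = field(default_factory=dict)

    coffee_cup = ObjectInfo(
        name="coffee cup",
        attributes={
            "fluid measurement": AttributeInfo(
                name="fluid measurement",
                values=AttributeValueInfo(legal_values={"6 oz.", "8 oz.", "16 oz."}),
                grammar_rules=["'oz.' denotes a fluid (not weight) measure in this context"],
            ),
            "material": AttributeInfo(
                name="material",
                values=AttributeValueInfo(legal_values={"ceramic", "plastic"}),
            ),
        },
    )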
  • Fig. 46 shows a flow chart illustrating a process 4600 that may be implemented by a conversion system such as described above. It will be appreciated that the various process steps illustrated in Fig. 46 may be combined or modified as to sequence or otherwise. Moreover, the illustrated process 4600 relates to a system that executes a parse tree structure as well as a frame-slot architecture. It will be appreciated that a frame-slot architecture in accordance with the present invention may be implemented independent of any associated parse tree structure.
  • the illustrated process 4600 is initiated by receiving (4602) a data stream from a data source.
  • a data stream may be entered by a user or accessed from a legacy or other information system.
  • a segment of the data stream is then identified (4604) for conversion.
  • the segment may comprise an attribute phrase or any other chunk of source data that may be usefully processed in a collective form.
  • Such a segment may be identified as the entirety of an input such as a search query, the entirety or a portion of a file from a legacy or other information system, or based on a prior processing step whereby phrase boundaries have been marked for purposes of conversion processing or based on logic for recognizing attribute phrases or other chunks to be coprocessed.
  • the identified segment is then processed to identify (4606) a potential object within the segment.
  • the object may be identified as the term "cup” or "coffee cup.”
  • the potential object may be identified by comparison of individual terms to a collection of recognized objects or based on a preprocessing step wherein metadata has been associated with the source content to identify components thereof including objects.
  • the potential object is then compared (4608) to a known object list of a frame-slot architecture. As discussed above, within a given subject matter, there may be a defined subset for which frame-slot processing is possible.
  • if the potential object matches a known object, the system accesses (4614) an associated grammar and schema for processing in accordance with the frame-slot architecture. Otherwise, the segment is processed (4612) using a parse tree structure. As a further alternative, if no object is recognized, an error message may be generated or the segment may be highlighted for set-up processing for out-of-vocabulary terms, e.g., so as to expand the vocabulary and associated grammar rules.
  • an attribute associated with the object is then identified (4616).
  • the terms "ceramic” or "8 oz.” may be identified as reflecting attributes. Such identification may be accomplished based on grammar rules or based on metadata associated with such terms by which such terms are associated with particular attribute fields.
  • the associated attribute values are then compared (4618) to legal values.
  • the value of "8 oz." may be compared to a listing of legal values for the attribute "fluid measurement" in the context of "coffee cup." These legal values may be defined by a private schema, for example, limited to the inventory of an entity's product catalog or may be based on other external information (e.g., defining a legal word form based on part of speech). If a match is found (4620), then the attribute phrase is recognized and an appropriate conversion process is executed (4623) in accordance with the associated grammar and conversion rules. The process 4600 then determines whether additional stream information (4624) is available for processing and either processes such additional information or terminates execution. In the case where the attribute value does not match a legal value, anomaly processing is executed (4622).
  • the anomalous attribute value may be verified and added to the legal values listing. For example, in the coffee cup example, if the attribute value is "12 oz.” and that value does not match a previously defined legal value but, in fact, represents a valid inventory entry, the term "12 oz.” (or a standardized version thereof) may be added to the legal values list for the attribute "fluid measurement" in the context of "coffee cup.” Alternatively, further processing may indicate that the attribute value is incorrect. For example, if the attribute value was "6 pack," an error in parsing may be indicated.
  • an appropriate error message may be generated or the segment may be reprocessed to associate an alternate attribute type, e.g., "object quantity,” with the term under consideration.
  • different anomaly processing may be executed. For example, in the case of processing a search query, illegal values may be ignored or closest match algorithms may be executed. Thus, in the case of a query directed to a "12 oz. coffee cup,” search results may be generated or a link may be executed relative to inventory related to coffee cups in general or to 8 and 16 oz. coffee cups. It will be appreciated that many other types of anomaly processing are possible in accordance with the present invention.
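  • The decision path of process 4600 (object lookup, attribute identification, legal value check and anomaly handling) might be approximated as in the sketch below. The object list, legal values and matching logic are simplified assumptions for illustration; a real implementation would rely on the grammar and schema discussed above.

    KNOWN_OBJECTS = {
        "coffee cup": {
            "fluid measurement": {"6 oz.", "8 oz.", "16 oz."},
            "material": {"ceramic", "plastic"},
        }
    }

    def convert_segment(segment):
        """One pass over a segment such as '8 oz. ceramic coffee cup'."""
        obj = next((o for o in KNOWN_OBJECTS if o in segment), None)       # compare to known objects
        if obj is None:
            return {"status": "parse-tree fallback", "segment": segment}   # no frame-slot object (4612)
        remainder = segment.replace(obj, "").strip()
        slots, anomalies = {}, []
        for attribute, legal_values in KNOWN_OBJECTS[obj].items():         # identify attributes (4616)
            match = next((v for v in legal_values if v in remainder), None)
            if match:                                                      # legal value check (4618, 4620)
                slots[attribute] = match
                remainder = remainder.replace(match, "").strip()
        if remainder:
            anomalies.append(remainder)                                    # anomaly processing (4622)
        return {"status": "converted", "object": obj, "slots": slots, "anomalies": anomalies}

    print(convert_segment("8 oz. ceramic coffee cup"))
    print(convert_segment("12 oz. ceramic coffee cup"))   # "12 oz." is not a legal value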
  • the conversion system can implement both a frame-slot architecture and a parse tree structure.
  • the illustrated conversion system 4800 includes a parser 4802 for use in parsing and converting an input stream 4803 from a source 4804 to provide an output stream 4811 in a form for use by a target system 4812.
  • the source stream 4803 includes the content "flat bar (1mm x 1" x 1')."
  • the parser 4802 uses information from a public schema 4806, a private schema 4808 and a grammar 4810.
  • the public schema 4806 may include any of various types of information that is generally applicable to the subject matter and is not specific to any entity or group of entities.
  • Fig. 49 illustrates an example structure 4900 showing how public information related to the subject matter area may be used to define a conversion rule.
  • the structure 4900 includes a dictionary 4904 that forms a portion of the public schema 4902.
  • Panel 4906 shows definitions related to the object "flat bar.”
  • bar is defined as “straight piece that is longer than it is wide” and "flat” is defined as including “major surfaces distinctly greater than minor surfaces.”
  • Such definitions may be obtained from, for example, a general purpose dictionary, a dictionary specific to the subject matter, a subject matter expert or any other suitable source. These definitions are translated to define a rule as shown in panel 4908. Specifically, the associated rule indicates that "length is greater than width and width is greater than thickness.” This rule may then be written into the logic of a machine-based conversion tool. Referring again to Fig. 48, this rule is reflected in file 4807 of public schema 4806.
  • the parser 4802 also receives input information from private schema 4808 in the illustrated example.
  • the private schema 4808 may include conversion rules that are specific to an entity or group of entities less than the public as a whole.
  • the private schema 4808 may define legal values for a given attribute based on a catalog or inventory of an interested entity such as an entity associated with the target system 4812.
  • An associated user interface 5000 is shown in Fig. 50A.
  • the user interface 5000 may be used in a start-up mode to populate the legal values for a given attribute.
  • the user interface is associated with a particular project 5002 such as assembling an electronic catalog.
  • the illustrated user interface 5000 includes a data structure panel 5004, in this case reflecting a parse-tree structure and a frame-slot structure.
  • the interface 5000 further includes a private schema panel 5005.
  • the private schema panel 5005 includes a number of windows 5006 and 5008 that define a product inventory of a target company.
  • a length field 5010 associated with a table for #6 machine screws is used to define legal attribute value 5012 at a node of panel 5004 corresponding to attribute values for #6 machine screws.
  • Associated legal value information is shown as a file 4809 of the private schema 4808 in Fig. 48.
  • Fig. 50B shows a parse tree graphics panel 5022 and a parse tree node map panel 5024.
  • these panels 5022 and 5024 are shown in a stacked arrangement.
  • Panel 5022 shows a parse tree for a particular product descriptor.
  • the product descriptor is shown at the base level 5026 of the parse tree as "ruler 12" 1/16" divisions.”
  • Layers 5028-5030 show parent nodes of the parse tree.
  • both of the chunks "12"" and “1/16”" are associated with the high level node "[length_unit]" reflecting the recognition by a parse tool that each of these chunks indicates a measure of length.
  • a rule for interpreting "length unit” designations in the context of rulers (and, perhaps, other length measuring devices) is encoded under the “ruler” node. As shown, the rule interprets a given "length unit” as indicating “a measuring length” if the associated attribute value is greater than 1 unit of measure (uom) and treats the "length unit” as indicating an "increment” if the associated attribute value is less than 0.25 uom. This provides a certain and structurally efficient mechanism for disambiguating and converting length units in this context.
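  • Expressed as code, the disambiguation rule attributed to the "ruler" node might look like the following sketch. The thresholds come from the text; the function name and the treatment of intermediate values are assumptions.

    def classify_length_unit(value_in_uom):
        """Interpret a [length_unit] chunk in the context of a ruler."""
        if value_in_uom > 1:
            return "measuring length"    # e.g., the 12" chunk
        if value_in_uom < 0.25:
            return "increment"           # e.g., the 1/16" divisions
        return "ambiguous"               # values between 0.25 and 1 uom need a further cue

    print(classify_length_unit(12))      # measuring length
    print(classify_length_unit(1 / 16))  # increment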
  • Grammar 4810 also provides information to the parser 4802.
  • the grammar may provide any of various information defining a lexicon, syntax and an ontology for the conversion process.
  • the grammar may involve definition of standardized terminology as described above.
  • file 4813 associates the standardized terms "inch,” “foot,” and “millimeter” with various alternate forms thereof.
  • the parser 4802 can then use the input from the public schema 4806, private schema 4808 and grammar 4810 to interpret the input stream 4803 to provide an output stream 4811 to the target 4812.
  • the noted input stream 4803 is interpreted as "flat bar - 1' long, 1" wide and 1 mm thick."
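  • The interplay of the grammar's unit standardization and the public schema rule of Fig. 49 can be sketched as follows. The unit table, conversion factors and field names are assumptions for illustration; only the ordering rule (length greater than width greater than thickness) is taken from the example.

    # Grammar (cf. file 4813): alternate unit forms mapped to standardized terms,
    # with factors to millimeters so the dimensions can be compared.
    UNIT_FORMS = {"mm": ("millimeter", 1.0), "millimeter": ("millimeter", 1.0),
                  '"': ("inch", 25.4), "in": ("inch", 25.4),
                  "'": ("foot", 304.8), "ft": ("foot", 304.8)}

    def parse_dimension(text):
        """Split a chunk such as '1mm' into (value, standardized unit, value in mm)."""
        for form, (standard, factor) in sorted(UNIT_FORMS.items(), key=lambda kv: -len(kv[0])):
            if text.endswith(form):
                value = float(text[: -len(form)].strip())
                return value, standard, value * factor
        raise ValueError("unrecognized unit in " + repr(text))

    def interpret_flat_bar(dimensions):
        """Public schema rule (cf. file 4807): for a flat bar, length > width > thickness."""
        parsed = sorted((parse_dimension(d) for d in dimensions), key=lambda p: p[2], reverse=True)
        return {label: "%g %s" % (value, unit)
                for label, (value, unit, _) in zip(["length", "width", "thickness"], parsed)}

    # "flat bar (1mm x 1\" x 1')"  ->  1 foot long, 1 inch wide, 1 millimeter thick
    print(interpret_flat_bar(["1mm", '1"', "1'"]))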
  • referring to Fig. 47, a further example related to a frame-slot architecture 4700 is illustrated.
  • the architecture 4700 is used to process a source stream 4702, in this case, "bearings for transmission - 100milli. bore."
  • this source stream may be a record from a legacy information system or a search query.
  • the processing of this source stream 4702 may utilize various contextual cues.
  • contextual cues may be derived from the content of the source stream 4702 itself.
  • certain metadata cues 4704 may be included in connection with the source stream 4702.
  • legacy information systems such as databases may include a significant amount of structure that can be leveraged in accordance with the present invention.
  • structure may be provided in the form of links of relational databases or similar tags or hooks that define data relationships.
  • contextual information which can vary substantially in form, is generally referred to herein as metadata.
  • the frame-slot architecture 4700 is utilized to identify an object 4706 from the source stream 4702.
  • this may involve identifying a term within the stream 4702 and comparing the term to a list of recognized objects or otherwise using logic to associate an input term with a recognized object. It will be noted that some degree of standardization or conversion, which may involve the use of contextual information, may be performed in this regard.
  • the identified object "roller bearing” does not literally correspond to any particular segment of the stream 4702. Rather, the object “roller bearing” is recognized from the term “bearing” from the stream 4702 together with contextual cues provided by the term “transmission" included within the content of the stream 4702 and, perhaps, from metadata cues 4704. Other sources including external sources of information regarding bearings may be utilized in this regard by logic for matching the stream 4702 to the object 4706.
  • attributes 4708 and attribute values 4714 may be accessed. As discussed above, such information may be derived from public and private schema. For example, an attribute type 4710 may be identified for the object 4706 and corresponding legal attribute values 4712 may be determined. In this case, one attribute associated with the object “roller bearing” is "type” that has legal values of "cylindrical, tapered and spherical.”
  • the stream 4702 may be processed using this information to determine a refined object 4716. In this case, the refined object is determined to be "cylindrical roller bearing.” Again, it will be noted that this refined object 4716 is not literally derived from the stream 4702 but rather, in the illustrated example, is determined based on certain contextual information and certain conversion processes.
  • the stream 4702 is determined to match the attribute value "cylindrical” based on contextual information related to the terms “transmission” and “bore” included within the content of the source stream 4702.
  • Information regarding the attributes 4708 and attribute values 4714 may again be accessed based on this refined object 4716 to obtain further attributes 4718 and associated attribute values 4720.
  • these attributes and attribute values 4718 and 4720 though illustrated as being dependent on the attribute 4710 and attribute value 4712 may alternatively be independent attributes and attribute values associated with the object 4706.
  • the attribute "size parameter" is associated with the legal values “inside diameter” and “outside diameter” based on the refined object “cylindrical roller bearings.”
  • the attribute 4718 and attribute value 4720 are used together with certain contextual cues to define a further refined object 4722.
  • the further refined object 4722 is defined as "cylindrical roller bearing inside diameter.”
  • a selection between the legal value “inside diameter” and “outside diameter” is made based on contextual information provided by the term “bore” included within the content of the stream 4702.
  • information regarding the attributes 4708 and attribute values 4714 can be used to identify a further attribute 4724 and associated legal values 4725.
  • the attribute 4724 is "legal dimensions" and the associated legal values 4725 are defined as "50, 60, 70, 80, 90, 100, 150 . . . ."
  • the input stream 4702 is processed in view of the attribute 4724 and legal values 4725 to define an output 4726 identified as "100mm ID cylindrical roller bearings.”
  • the stream term "100 milli." is found to match the legal value of "100" for the attribute "legal dimensions" in the context of cylindrical roller bearings inside diameter. It will be appreciated that the term "milli." has thus been matched, based on a standardization or conversion process, to the designation "mm." It should be noted in this regard that success in matching the source term "100 milli." to the legal value "100mm" provides further confidence that the conversion was correctly and accurately performed.
  • the input stream 4702 may be rewritten as "100 mm ID cylindrical roller bearing."
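  • The stepwise refinement of Fig. 47 might be approximated by the sketch below. The cue tables and legal dimension list follow the example in the text; everything else, including the matching logic, is a simplifying assumption.

    import re

    TYPE_CUES = {"transmission": "cylindrical"}   # legal types: cylindrical, tapered, spherical
    SIZE_CUES = {"bore": "inside diameter"}       # legal size parameters: inside/outside diameter
    LEGAL_DIMENSIONS_MM = {50, 60, 70, 80, 90, 100, 150}

    def refine(stream):
        """Refine 'bearings for transmission - 100milli. bore' toward a converted output."""
        text = stream.lower()
        if "bearing" not in text:
            return None                                                   # not the object 4706
        bearing_type = next((v for cue, v in TYPE_CUES.items() if cue in text), None)
        size_param = next((v for cue, v in SIZE_CUES.items() if cue in text), None)
        match = re.search(r"(\d+)\s*(?:milli\.?|mm)", text)               # standardize 'milli.' to mm
        dimension = int(match.group(1)) if match else None
        if dimension not in LEGAL_DIMENSIONS_MM:
            return {"anomaly": stream}
        side = "ID" if size_param == "inside diameter" else "OD"
        return "%dmm %s %s roller bearing" % (dimension, side, bearing_type)

    print(refine("bearings for transmission - 100milli. bore"))
    # -> 100mm ID cylindrical roller bearing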
  • the output may be provided by way of linking the user to an appropriate web page or including associated information in a search results page. It will be appreciated that other types of output may be provided in other conversion environments. While various embodiments of the present invention have been described in detail, it is apparent that further modifications and adaptations of the invention will occur to those skilled in the art. However, it is to be expressly understood that such modifications and adaptations are within the spirit and scope of the present invention.


Abstract

Various data conversion applications including a search engine employ a semantic metadata model for improved conversion. A search system in this regard associates contextual metadata with search terms and/or stored terms to facilitate identification of relevant information. In one implementation, a search term is identified (4304) from a received search request. The search term is then rewritten (4306) in standard form and the standard form term is then set (4308) as the current search parameter. A source database is then searched (4310) using the current search parameter. If any results are obtained (4312) these results may be output (4320) to the user. If no results are obtained, a parent classification of the search term is set (4316) as the current search parameter and the process is repeated. The invention thereby provides the ease of use of term searching with the comprehensiveness of category searching.
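The search loop summarized above (steps 4304-4320) can be pictured with a short sketch. The standardization table, taxonomy and product data below are invented stand-ins; the sketch only illustrates falling back from a standardized term to its parent classification until results are found.

    STANDARD_FORMS = {"daytimer": "appointment book", "post-its": "adhesive note"}
    PARENT_CLASS = {"appointment book": "personal organizer", "personal organizer": "office supply",
                    "adhesive note": "note pad", "note pad": "paper product"}
    DATABASE = {"personal organizer": ["At-A-Glance weekly appointment book"],
                "note pad": ["3 x 3 adhesive note pads, 12-pack"]}

    def search(term):
        """Rewrite the term in standard form (4306/4308), search (4310), and on a miss
        substitute the parent classification (4316) until results are obtained."""
        parameter = STANDARD_FORMS.get(term.lower(), term.lower())
        while parameter is not None:
            results = DATABASE.get(parameter)        # search the source data (4310)
            if results:                              # any results? (4312)
                return results                       # output to the user (4320)
            parameter = PARENT_CLASS.get(parameter)  # climb to the parent classification (4316)
        return []

    print(search("Daytimer"))   # falls back through the taxonomy to the appointment books on hand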

Description

FUNCTIONALITY AND SYSTEM FOR CONVERTING DATA FROM A FIRST FORM TO A SECOND FORM
FIELD OF THE INVENTION The present invention relates generally to machine-based tools for use in converting data from one form to another and, in particular, to a framework for efficiently developing conversion rules as well as accessing and applying external information to improve such conversions. In this regard, the invention further relates to applying public or private rules for structuring or understanding data ("schema") to new data so as to reduce start-up efforts and costs associated with configuring such machine-based tools.
BACKGROUND OF THE INVENTION
In a variety of contexts, it is desired to convert data from a first or input form to a second or target form. Such conversions may involve, for example, linguistics, syntax and formats. In this regard, linguistic differences may be due to the use of different languages or, within a single language, due to terminology, proprietary names, abbreviations, idiosyncratic phrasings or structures and other matter that is specific to a location, region, business entity or unit, trade, organization or the like. Also within the purview of linguistic differences for present purposes are different currencies, different units of weights and measures and other systematic differences. Syntax relates to the phrasing, ordering and organization of terms as well as grammatic and other rules relating thereto. Differences in format may relate to data structures or conventions associated with a database or other application and associated tools.
One or more of these differences in form may need to be addressed in connection with a conversion process. Some examples of conversion environments include: importing data from one or more legacy systems into a target system; correlating or interpreting an external input (such as a search query) in relation to one or more defined collections of information; correlating or interpreting an external input in relation to one or more external documents, files or other sources of data; facilitating exchanges of information between systems; and translating words, phrases or documents. In all of these cases, a machine-based tool attempts to address differences in linguistics, syntax and/or formats between the input and target environments. It will be appreciated in this regard that the designations "input" and "target" are largely a matter of convenience and are process specific. That is, for example, in the context of facilitating exchanges of information between systems, which environment is the input environment and which is the target depends on which way a particular conversion is oriented and can therefore change.
One difficulty associated with machine-based conversion tools relates to properly handling context dependent conversions. In such cases, properly converting an item under consideration depends on understanding something about the context in which the item is used. For example, in the context of product descriptions, an attribute value of "one inch" might denote one inch in length, one inch in radius or some other dimension depending on the product under consideration. In the context of translation, the term "walking" functions differently in the phrase "walking shoe" than in "walking to work." Thus, in these examples and many others, understanding something about the context of an item under consideration may facilitate conversion. Although the value of context in disambiguating or otherwise properly converting information is well recognized, limited success has been achieved in applying this notion to machine-based tools.
The difficulties of conversion applications in general and context-related confusion in particular are demonstrated by the example of search engines. Search engines are used in a variety of contexts to allow a user of a data terminal, e.g., a computer, PDA or data enabled phone, to search stored data for items of interest. For example, search engines are used for research, for on-line shopping, and for acquiring business information. The case of on-line catalog searching is illustrative. On-line sales are an increasingly important opportunity for many businesses. To encourage and accommodate on-line purchasing, some companies have devoted considerable resources to developing search tools that help customers identify products of interest. This is particularly important for businesses that have an extensive product line, for example, office supply companies. One type of search engine is the product category search engine. To implement a product category search engine, the available products are grouped by categories and subcategories. A user can then enter a product category term, or select a term from a pull-down window or the like, to access a list of available products. These search engines are very useful for customers that have considerable experience or expertise by which to understand the structure of the product space of interest. However, in many cases, the product category may not be obvious or may not be the most convenient way to identify a product. For example, a customer wishing to purchase Post-It notes may not be able to readily identify the category in which that product is grouped or may not want to work through a series of menus to narrow a search down to the desired product.
In addition or as an alternative to product category searching, web sites often accommodate keyword searching. To execute a keyword search, the user enters a term to identify the product-of-interest; often a trademark or portion of a trademark. A conventional search engine can then access a database to identify hits or, in some cases, near hits. This allows a customer with a particular product in mind to quickly identify the product, even if the customer can not or does not wish to identify the product category for that product.
Unfortunately, keyword searching can result in a failed search, even when products of potential interest are available. For example, a customer needing to order appointment books may enter the popular trademark "Daytimer." If Daytimer appointment books are not carried or are not currently available at the site, the search results may indicate that there is no match, even though other appointment books, e.g., At-A-Glance brand books, are available. This, of course, is a lost sales opportunity for the business.
SUMMARY OF THE INVENTION The present invention is directed to a computer-based tool and associated methodology for transforming electronic information so as to facilitate communications between different semantic environments and access to information across semantic boundaries. In one respect, the present invention is related to enabling sharing of knowledge developed in a process of configuring a transformation utility that provides the structure for transforming such electronic information. As set forth below, the present invention may be implemented in the context of a system where subject matter experts (SMEs) develop a semantic metadata model (SMM) for facilitating data transformation. The SMM utilizes contextual information and standardized rules and terminology to improve transformation accuracy. The present invention allows for sharing of knowledge developed in this regard so as to facilitate development of a matrix of transformation rules ("transformation rules matrix"). Such a transformation system and the associated knowledge sharing technology are described in turn below.
In a preferred implementation, the invention is applicable with respect to a wide variety of content including sentences, word strings, noun phrases, and abbreviations and can even handle misspellings and idiosyncratic or proprietary descriptors. The invention can also manage content with little or no predefined syntax as well as content conforming to standard syntactic rules. Moreover, the system of the present invention allows for substantially real-time transformation of content and handles bandwidth or content throughputs that support a broad range of practical applications. The invention is applicable to structured content such as business forms or product descriptions as well as to more open content such as information searches outside of a business context. In such applications, the invention provides a system for semantic transformation that works and scales.
The invention has particular application with respect to transformation and searching of both business content and non-business content. For the reasons noted above, transformation and searching of business content presents special challenges. At the same time the need for better access to business content and business content transformation is expanding. It has been recognized that business content is generally characterized by a high degree of structure and reusable "chunks" of content. Such chunks generally represent a core idea, attribute or value related to the business content and may be represented by a character, number, alphanumeric string, word, phrase or the like. Moreover, this content can generally be classified relative to a taxonomy defining relationships between terms or items, for example, via a hierarchy such as of family (e.g., hardware), genus (e.g., connectors), species (e.g., bolts), subspecies (e.g., hexagonal), etc.
Non-business content, though typically less structured, is also amenable to normalization and classification. With regard to normalization, terms or chunks with similar potential meanings including standard synonyms, colloquialisms, specialized jargon and the like can be standardized to facilitate a variety of transformation and searching functions. Moreover, such chunks of information can be classified relative to taxonomies defined for various subject matters of interest to further facilitate such transformation and searching functions. Thus, the present invention takes advantage of the noted characteristics to provide a framework by which locale-specific content can be standardized and classified as intermediate steps in the process for transforming the content from a source semantic environment to a target semantic environment and/or searching for information using locale-specific content. Such standardization may encompass linguistics and syntax as well as any other matters that facilitate transformation. The result is that content having little or no syntax is supplied with a standardized syntax that facilitates understanding, the total volume of unique chunks requiring transformation is reduced, ambiguities are resolved and accuracy is commensurately increased and, in general, substantially real-time communication across semantic boundaries is realized. Such classification further serves to resolve ambiguities and facilitate transformation as well as allowing for more efficient searching. For example, the word "butterfly" of the term "butterfly valve" when properly chunked, standardized and associated with tags for identifying a classification relationship, is unlikely to be mishandled. Thus, the system of the present invention does not assume that the input is fixed or static, but recognizes that the input can be made more amenable to transformation and searching, and that such preprocessing is an important key to more fully realizing the potential benefits of globalization.
According to one aspect of the present invention, a method and corresponding apparatus are provided for transforming content from a first semantic environment to a second semantic environment by first converting the input data into an intermediate form. The associated method includes the steps of: providing a computer-based device; using the device to access input content reflecting the first semantic environment and convert at least a portion of the input content into a third semantic environment, thereby defining a converted content; and using the converted content in transforming a communication between a first user system operating in the first semantic environment and a second user system operating in the second semantic environment.
In the context of electronic commerce, the input content may be business content such as a parts listing, invoice, order form, catalogue or the like. This input content may be expressed in the internal terminology and syntax (if any) of the source business. In one implementation, this business content is converted into a standardized content reflecting standardized terminology and syntax. The resulting standardized content has a minimized (reduced) set of content chunks for translation or other transformation and a defined syntax for assisting in transformation. The intermediate, converted content is thus readily amenable to transformation. For example, the processed data chunks may be manually or automatically translated using the defined syntax to enable rapid and accurate translation of business documents across language boundaries.
The conversion process is preferably conducted based on a knowledge base developed from analysis of a quantity of information reflecting the first semantic environment. For example, this quantity of information may be supplied as a database of business content received from a business enterprise in its native form. This information is then intelligently parsed into chunks by one or more SMEs using the computer-based tool. The resulting chunks, which may be words, phrases, abbreviations or other semantic elements, can then be mapped to standardized semantic elements. In general, the set of standardized elements will be smaller than the set of source elements due to redundancy of designations, misspellings, format variations and the like within the source content. Moreover, as noted above, business content is generally characterized by a high level of reusable chunks. Consequently, the transformation rules matrix or set of mapping rules is considerably compressed in relation to that which would be required for direct transformation from the first semantic environment to the second. The converted semantic elements can then be assembled in accordance with the defined syntax to create a converted content that is readily amenable to manual or at least partially automated translation.
According to another aspect of the present invention, a computer- based device is provided for use in efficiently developing a standardized semantic environment corresponding to a source semantic environment. The associated method includes the steps of: accessing a database of information reflecting a source semantic environment; using the computer-based device to parse at least a portion of the database into a set of source semantic elements and identify individual elements for potential processing; using the device to select one of the source elements and map it to a standardized semantic element; and iteratively selecting and processing additional source elements until a desired portion of the source elements are mapped to standardized elements.
In order to allow for more efficient processing, the computer-based device may perform a statistical or other analysis of the source database to identify how many times or how often individual elements are present, or may otherwise provide information for use in prioritizing elements for mapping to the standardized lexicon. Additionally, the device may identify what appear to be variations for expressing the same or related information to facilitate the mapping process. Such mapping may be accomplished by associating a source element with a standardized element such that, during transformation, appropriate code can be executed to replace the source element with the associated standardized element. Architecturally, this may involve establishing corresponding tables of a relational database, defining a corresponding XML tagging structure and/or establishing other definitions and logic for handling structured data. It will be appreciated that the "standardization" process need not conform to any industry, syntactic, lexicographic or other preexisting standard, but may merely denote an internal standard for mapping of elements. Such a standard may be based in whole or in part on a preexisting standard or may be uniquely defined relative to the source semantic environment. In any case, once thus configured, the system can accurately transform not only known or recognized elements, but also new elements based on the developed knowledge base.
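As a loose illustration of the analysis described above, the frequency of candidate source elements can be tallied so that the most common elements are mapped first. The rows and tokenization below are hypothetical; an actual implementation would chunk the content far more carefully.

    from collections import Counter

    source_rows = ["HEX NUT 10MM", "hex nut, 10 mm", "HEX NUT 8MM", "flat washer 10mm"]

    def element_frequencies(rows):
        """Count how often each candidate element occurs, to prioritize mapping effort."""
        counts = Counter()
        for row in rows:
            counts.update(token.strip(",").lower() for token in row.split())
        return counts

    for element, count in element_frequencies(source_rows).most_common(5):
        print(count, element)   # the most frequent elements are mapped to the standardized lexicon first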
The mapping process may be graphically represented on a user interface. The interface preferably displays, on one or more screens
(simultaneously or sequentially), information representing source content and a workspace for defining standardized elements relative to source elements. In one implementation, as source elements are mapped to standardized elements, corresponding status information is graphically shown relative to the source content, e.g., by highlighting or otherwise identifying those source elements that have been mapped and/or remain to be mapped. In this manner, an operator can readily select further elements for mapping, determine where he is in the mapping process and determine that the mapping process is complete, e.g., that all or a sufficient portion of the source content has been mapped. The mapping process thus enables an operator to maximize effective mapping for a given time that is available for mapping and allows an operator to define a custom transformation "dictionary" that includes a minimized number of standardized terms that are defined relative to source elements in their native form. According to another aspect of the present invention, contextual information is added to source content prior to transformation to assist in the transformation process. The associated method includes the steps of: obtaining source information in a first form reflecting a first semantic environment; using a computer-based device to generate processed information that includes first content corresponding the source information and second content, provided by the computer-based device, regarding a context of a portion of the first content; and converting the processed information into a second form reflecting a second semantic environment. The second content may be provided in the form of tags or other context cues that serve to schematize the source information. For example, the second content may be useful in defining phrase boundaries, resolving linguistic ambiguities and/or defining family relationships between source chunks. The result is an information added input for transformation that increases the accuracy and efficiency of the transformation.
According to a further aspect of the present invention, an engine is provided for transforming certain content of electronic transmissions between semantic environments. First, a communication is established for transmission between first and second user systems associated with first and second semantic environments, respectively, and transmission of the communication is initiated. For example, a business form may be selected, filled out and addressed. The engine then receives the communication and, in substantially real-time, transforms the content relative to the source semantic environment, thereby providing transformed content. Finally, the transmission is completed by conveying the transformed content between the user systems.
The engine may be embodied in a variety of different architectures. For example, the engine may be associated with the transmitting user system relative to the communication under consideration, the receiving user system, or at a remote site, e.g., a dedicated transformation gateway. Also, the transformed content may be fully transformed between the first and second semantic environments by the engine, or may be transformed from one of the first and second semantic environments to an intermediate form, e.g., reflecting a standardized semantic environment and/or neutral language. In the latter case, further manual and/or automated processing may be performed in connection with the receiving user system. In either case, such substantially real-time transformation of electronic content marks a significant step towards realizing the ideal of globalization.
According to a still further aspect of the present invention, information is processed using a structure for normalization and classification of locale-specific content. A computer-based processing tool is used to access a communication between first and second data systems, where the first data system operates in a first semantic environment defined by at least one of linguistics and syntax specific to that environment. The processing tool converts at least one term of the communication between the first semantic environment and a second semantic environment and associates a classification with the converted or unconverted term. The classification identifies the term as belonging to the same class as certain other terms based on a shared characteristic, for example, a related meaning (e.g., a synonym or conceptually related term), a common lineage within a taxonomy system (e.g., an industry-standard product categorization system, entity organization chart, scientific or linguistic framework, etc.), or the like.
The classification is then used to process the communication. In this regard, the communication may be directed to and/or received from the first semantic environment. For example, a communication, such as a search query, may be transmitted from the first semantic environment and include locale-specific information such as abbreviations, proprietary names, colloquial terminology, or the like. Such a term in the query may first be normalized or cleaned such that the term is converted to a standardized or otherwise defined lexicon. This may involve syntax conversion, linguistic conversion and/or language translation. The converted or unconverted term is classified and the associated classification is used to identify information responsive to the query.
Conversely, the communication may be directed to the first semantic environment as by an individual or business consumer seeking product information from a company information system. In such a case, a term may be converted from an external form of the second semantic environment to the first semantic environment. For example, a term of the communication (e.g., 10mm hexagonal Allen nut) may be converted to an internal product identifier (name, number, description of the like, e.g., hex nut-A), of the company. The converted or unconverted term is associated with a classification (e.g., metric fasteners) and the classification is used to process the communication (e.g., by constructing a menu, page or screen with product options of potential interest).
It will be appreciated that the noted process generally involves one or more operators or subject matter experts (SMEs) for developing the knowledge base which reflects a semantic metadata model (SMM) involving a matrix of transformation rules, e.g., relating to standard terminology, grammar and term classification. This process can be time-consuming and is somewhat subjective. That is, the process of developing the SMM generally involves, among other things, mapping individual terms from the source collection to standardized terminology and associating those terms with information identifying their respective classification or position within a defined taxonomy structure. In the case of large source collections, this may involve considerable time. In order to accelerate the process, it would be useful to re-use knowledge that has been previously developed. For example, knowledge may be re-used to import rules developed in connection with another project, or to allow multiple SMEs to work on a given domain with each SME benefiting from, and not being handicapped by, the work of the other(s). However, the selection of standardized terminology, mapping of terms from the source collection to the standardized terminology and the association of taxonomy tags to individual terms all involve subjective determinations. These subjective determinations can result in inconsistencies or ambiguities in the SMM when knowledge is imported. The elimination of such ambiguities is, of course, an important motivation for developing the SMM.
To alleviate such concerns in the case of multiple SMEs, it may be possible in some cases to assign different SMEs to different domains. In this regard, such domains correspond to different subject matter areas presumptively encompassing different terms of the source collection. For example, such different domains may correspond in a business context to different business divisions, product categories or catalog sections. These different domains may be associated with different tags, at a high (broad) level, of a taxonomy structure. However, it is often difficult to make neat divisions of domains such that all overlap of terminology is avoided. Moreover, it may be desirable to have multiple SMEs work on a single domain where a large volume of data is at issue. More generally, it would be beneficial to jointly develop and share knowledge rather than enforcing insularity.
Thus, in accordance with another aspect of the present invention, a method and apparatus ("utility") are provided for sharing transformation information, i.e., importing transformation information developed in connection with one set of data for use in connection with a second set of data. That is, one part of the transformation process is developing a transformation rules matrix as noted above. This transformation rules matrix is developed by establishing a set of rules in connection with consideration of a first set of terms of a source system. The instant utility is directed to re-using such rules in connection with a second set of terms that may be associated with the same source system (e.g., in the case of multiple SMEs developing a single transformation rules matrix) or may be associated with a different source system. The utility involves providing logic for converting data from a first form to a second form, where the logic is configurable to apply transformation rules (e.g., a transformation rules matrix as noted above) developed for a particular transformation application. The logic is first used in connection with a first set of source data including a set of first terms to develop first transformation information, and to establish a first storage structure (e.g., a database) for indexing the first transformation information to the first terms. The logic is further (e.g., subsequently) used to develop second transformation information for a second set of source data, including a set of second terms, that may be from the same or a different system as the first set of source data. The second terms are different from the first terms. In this regard, the logic is operative for: establishing a second storage structure (e.g., a database that may be embodied in the same or a different machine than the first storage structure), for relating the second transformation information to the second terms; importing at least a portion of the first transformation information to the second storage structure; and relating, in the second storage structure, the first transformation information to the second terms.
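By way of a purely illustrative sketch (in Python), such re-use of previously developed transformation information might be arranged along the following lines; the names used here (RuleStore, import_rules, the sample terms) are assumptions for illustration only and are not drawn from the specification.

    class RuleStore:
        def __init__(self):
            # maps a source term to previously developed transformation information
            self.rules = {}

        def add_rule(self, source_term, standardized_term):
            self.rules[source_term] = standardized_term

        def import_rules(self, other, second_terms):
            # relate first transformation information to the second terms;
            # terms with no applicable rule are returned for SME attention
            unmatched = []
            for term in second_terms:
                if term in other.rules:
                    self.rules[term] = other.rules[term]
                else:
                    unmatched.append(term)
            return unmatched

    first = RuleStore()                      # first storage structure
    first.add_rule("PRNTD CRCT BRD", "printed circuit board")

    second = RuleStore()                     # second storage structure
    todo = second.import_rules(first, ["PRNTD CRCT BRD", "ACETIC ACID GL BTL"])
    print(second.rules)                      # re-used rule
    print(todo)                              # ['ACETIC ACID GL BTL']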
In practice, the utility is employed to re-use transformation information. For example, transformation information may be developed in connection with a first development effort for developing the first storage structure. That transformation information may be imported and supplemented or otherwise modified in connection with a second development effort for developing the second storage structure. The modified transformation information may then be exported for further use, e.g., in connection with further development of yet another storage structure. The noted utility may be used in any of these transformation information re-use contexts. Moreover, the logic is preferably operative to address any potential inconsistencies entailed by such re-use. For example, the logic may identify to all "owners" of transformation information any differences between original transformation information and modified transformation information, or apparent naming or rules conflicts so that such potential inconsistencies can be addressed, e.g., by selectively accepting, rejecting, or modifying any differences associated with the modified transformation information or arbitrating conflicts. Thus, in accordance with a still further aspect of the present invention, a method and apparatus is provided for allowing multiple SMEs to work in the same domain in connection with developing an SMM. The utility involves: providing logic for use in converting data from a first form to a second form, where the logic is configurable to associate elements of the first form with elements of the second form to develop a transformation model such as an SMM; first operating the logic to establish a first association of elements of the first form with elements of the second form; second operating the logic to establish a second association of elements of the first form with elements of the second form; and third operating the logic to process at least one of the first and second associations to address any inconsistencies associated with the overlap.
In many contexts, the transformation process involves mapping a source collection to an SMM, i.e., a form that facilitates transformation of the data. A target form of a particular transformation transaction may also be mapped to the same or a different SMM. Accordingly, multiple SMEs or other users or systems, may be involved in establishing associations between a source form and an SMM, between a target form and an SMM, or between SMMs, and the noted utility may be utilized in any of these contexts. The associations may involve developing a transformation rules matrix for mapping terms of the first form to terms of the second form (e.g., from the source collection to standardized terminology) and/or applying tags identifying a position of a term within a taxonomy structure so as to facilitate contextual understanding of the term. The noted steps of first and second operating the logic to establish associations may be executed by different SMEs or other users. For example, such users may use intuitive graphical interfaces to relate a term from the source collection to an associated standardized term. Additionally, such graphical interfaces may be used to relate terms to appropriate classifications or positions of a taxonomy structure, e.g., via a drag-and-drop operation.
The multiple users may work on data of the same domain, or that otherwise involve a potential overlap of terms, and may work on the data during overlapping time periods. Thus, for example, it may not be desirable to have users check-out a definitive version of the transformation logic for sequential use and modification. A potential inconsistency in this regard may relate, for example, to associating a given term of a source form with two different terms of an SMM or associating the given term with two different taxonomy tags or overall tag structures, thereby creating a transformation ambiguity. Such potential inconsistencies may be resolved in a number of ways. For example, such potential inconsistencies may be resolved by establishing a definitive transformation rules matrix and automatically rejecting conflicting transformation information. Alternatively, potentially conflicting transformation information may be identified to all concerned users who may arbitrate such potential conflicts on an individual basis. It will be appreciated that such inconsistencies may be addressed in other ways.
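A minimal sketch of how such potential inconsistencies might be surfaced for arbitration is given below; the two example mappings and the function name are hypothetical.

    def find_conflicts(mapping_a, mapping_b):
        # return source terms that the two associations map differently
        conflicts = {}
        for term in set(mapping_a) & set(mapping_b):
            if mapping_a[term] != mapping_b[term]:
                conflicts[term] = (mapping_a[term], mapping_b[term])
        return conflicts

    sme_1 = {"sticky pad": "adhesive notepad", "daytimer": "appointment book"}
    sme_2 = {"sticky pad": "note pad", "hex nut": "hexagonal nut"}

    for term, (a, b) in find_conflicts(sme_1, sme_2).items():
        # flag for arbitration, or reject against a definitive rules matrix
        print(f"transformation ambiguity for '{term}': '{a}' vs. '{b}'")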
Additionally, it has been recognized that there is a need for search logic that provides the ease-of-use of term searching with the comprehensiveness of category searching. Such search logic would be useful for catalog searching or other data system searching applications. In accordance with the present invention, a knowledge base is constructed by which item descriptor terms and/or potential search terms are associated with contextual information by which the search logic can associate such a term, including a specific, colloquial or otherwise idiosyncratic term, with a subject matter context, so as to enable a more complete search to be performed and increase the likelihood of yielding useful results. In accordance with one aspect of the present invention, a method and apparatus ("utility") are provided for use in establishing a searchable data structure where search terms are associated with a subject matter context. The searchable data structure may be, for example, a database system or other data storage resident on a particular machine or distributed across a local or wide area network. The utility involves providing a list of potential search terms pertaining to a subject matter area of interest and establishing a classification structure for the subject matter area of interest. For example, the list of potential search terms may be an existing list that has been developed based on analysis of the subject matter area or may be developed by a subject matter expert or based on monitoring search requests pertaining to the subject matter of interest. Alternatively, the list may be drawn from multiple sources, e.g., starting from existing lists and supplemented by monitoring search requests. It will be appreciated that lists exist in many contexts such as in connection with pay-per-click search engines.
The classification structure preferably has a hierarchical form defined by classes, each of which includes one or more sub-classes, and so on. The utility further involves associating each of the potential search terms with the classification structure such that the term is assigned to at least one sub-class and a parent class. For example, such associations may be reflected in an XML tag structure or by any other system for reflecting such metadata structure. In this manner, search terms are provided with a subject matter context for facilitating searching. Thus, in the Daytimer example noted above, a search query including the term Daytimer may be interpreted so as to provide search results related more generally to appointment books. For example, such a search may be implemented iteratively such that the search system first seeks results matching "Daytimer" and, if no responsive information is available, proceeds to the next rung on the classification system, for example, "Appointment Books." Such iterations may be repeated until results are obtained or until a predetermined number of iterations are completed, at which point the system may return an error message such as "no results found."

In accordance with another aspect of the present invention, similar context information may be provided to terms associated with the data to be searched or source data. The utility generally involves providing a list of source data terms defining a subject matter area of interest and establishing a classification structure for the source data terms. Again, the classification structure preferably has a hierarchical form including classes each of which includes one or more sub-classes, and so on. Each of the source terms is associated with the classification structure such that the source term is assigned to at least one of the sub-classes and an associated parent class. In this manner, context is provided in connection with source data to facilitate searching. Thus, for example, a search query including the term "Appointment Book" may retrieve source data pertaining to Daytimer products, even though those products' descriptors may not include the term "Appointment Book." In a preferred implementation, a data structure is established such that both potential search terms and source data terms are associated with a classification structure. This allows specific items of source data to be matched to specific search terms based on a common subject matter context despite the lack of overlap between the specific search and source terms. Thus, for example, a search query including the term "Daytimer" may be associated with a classification "Appointment Books." Similarly, a data item associated with the trademark "At-A-Glance" may be associated with the subject matter classification "Appointment Books." Consequently, a search query including the term "Daytimer" may return search results including the "At-A-Glance" products of potential interest.
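The iterative behavior described above can be sketched as follows (Python); the class hierarchy, the index of source items and the iteration limit are illustrative assumptions only.

    parent_class = {
        "Daytimer": "Appointment_Books",
        "Appointment_Books": "Organizers",
        "Organizers": "Office_Supplies",
    }
    source_index = {
        # classification -> items of source data carrying that classification
        "Appointment_Books": ["At-A-Glance 2005 Weekly Planner"],
    }

    def iterative_search(term, max_iterations=3):
        current = term
        for _ in range(max_iterations):
            results = source_index.get(current, [])
            if results:
                return results
            if current not in parent_class:
                break
            current = parent_class[current]   # next rung of the classification
        return "no results found"

    print(iterative_search("Daytimer"))       # matched via "Appointment_Books"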
In accordance with a still further aspect of the present invention, a utility is provided for searching stored data using contextual metadata. The utility involves establishing a knowledge base for a given subject matter area, receiving a search request including a first descriptive term, accessing a source data collection using the knowledge base, and responding to the search request using the responsive information. The knowledge base defines an association between a term of the search request and an item of source data based on a classification within a context of the subject matter area. Such a classification may be associated with the search term and/or a source term. A search request may thereby be addressed based on a subject matter context even though the search is entered based on specific search terms and the item of source data is associated with specific source terms. As will be set forth below, the knowledge base may optionally include additional information related to the subject matter area, such as a system of rules for standardizing terminology and syntax, i.e., a grammar.
In accordance with a still further aspect of the present invention, a data search is facilitated based on a standardization of terms utilized to execute the search. It has been recognized that term searches are complicated by the fact that searchers may enter terms that are misspelled, colloquial, or otherwise idiosyncratic. Similarly, source data may include jargon, abbreviations or other matter that complicates term matching. Accordingly, term searches can be facilitated by standardizing one or both of the search terms and source terms. For example, a user searching for Post-it notes may enter a colloquial term such as "sticky tabs." This term may be rewritten by a utility according to the present invention, as, for example, "adhesive notepad" or some other selected standard term. In addition, the term may be associated with a classification as discussed above. Similarly, a source collection, such as a catalog, may include a highly stylized entry for a Post-it note product such as "3-Pk, 3x3 Pl notes (pop-up) -Asst'd." Such an entry may be rewritten to include standard terminology and syntax. In relevant part, the term "Pl notes" may be rewritten as "Post-it notes" and may be associated with the classification "adhesive notepad." Thus, a first order classification of the source term matches the standardized search term, thereby facilitating retrieval of relevant information. As this example illustrates, such matching is not limited to matching of terms rewritten in standardized form or matching of classifications, but may involve matching a rewritten search term to a classification or vice-versa. Such searching using a data structure of standardized terms and/or associated classifications, e.g., a knowledge base, may be used in a variety of contexts. For example, such functionality may facilitate searching of a web site, product database or other data of an entity by an outside party. In this regard, it may be useful to associate a product or product descriptor with multiple, alternative classifications to accommodate various types of search strategies that may be employed. Thus, a knowledge base may be constructed such that the classification "pen" or specific pen product records are retrieved in response to a search query including "writing instruments" and "office gifts."
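A simplified sketch of this standardization step appears below; the rewrite table and classification assignments are examples only, not a definitive rule set.

    standard_form = {
        "sticky tabs": "adhesive notepad",
        "sticky pad": "adhesive notepad",
        "pl notes": "post-it notes",
    }
    classification = {
        "adhesive notepad": "Adhesive_Notepads",
        "post-it notes": "Adhesive_Notepads",
    }

    def standardize(term):
        term = term.lower().strip()
        return standard_form.get(term, term)

    search_term = standardize("Sticky Tabs")          # -> "adhesive notepad"
    source_term = standardize("Pl notes")             # -> "post-it notes"

    # the rewritten terms need not match literally; a shared classification
    # (or a rewritten term matching the other's classification) suffices
    print(classification[search_term] == classification[source_term])   # True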
As a further example, such functionality may facilitate searching of multiple legacy databases, e.g., by an inside or outside party or for advanced database merging functionality. Oftentimes, an entity may have information related to a particular product, company or other subject matter in multiple legacy databases, e.g., a product database and an accounting database. These databases may employ different conventions, or no consistent conventions, regarding linguistics and syntax for identifying common data items. This complicates searching using conventional database search tools and commands, and can result in incomplete search results. In accordance with the present invention, a defined knowledge base can be used to relate a search term to corresponding information of multiple legacy systems, e.g., so that a substantially free form search query can retrieve relevant information from the multiple legacy systems despite differing forms of that information in those legacy environments.
In accordance with yet another aspect of the present invention, a searchable data system using contextual metadata is provided. The searchable data system includes an input port for receiving a search request including a search term, a first storage structure for storing searchable data defining a subject matter of the searchable data system, a second storage structure for storing a knowledge base, and logic for identifying the search term and using the knowledge base to obtain responsive information. The system further comprises an output port for outputting the responsive data, e.g., to the user or an associated network node. The knowledge base relates a potential search term to a defined classification structure of the subject matter of the searchable data system. For example, the classification structure may include classes, sub-classes and so on to define the subject matter to a desired granularity. The logic then uses the knowledge base to relate the search term to a determined classification of the classification structure and, in turn, uses the determined classification to access the first storage structure to obtain the responsive data.
The present invention is further directed to a machine-based tool and associated logic and methodology for use in converting data from an input form to a target form using context dependent conversion rules. In this manner conversions are improved, as ambiguities can be resolved based on context cues. In particular, existing public or private schema can be utilized to establish conversion rules for new data thereby leveraging existing structure developed by an entity or otherwise developed for or inherent in a given subject matter context. In this manner, structure can be imported a priori to a given conversion environment and need not, in all cases, be developed based on a detailed analysis of the new data. That is, structure can be imparted in a top-down fashion to a data set and is not limited to bottom-up evolution from the data. This facilitates greater automation of the development of a grammar for a conversion environment as pre-existing knowledge is leveraged. Moreover, in accordance with the invention, context dependent conversion rules can be efficiently accessed without the need to access a rigid and complex classification structure defining a larger subject matter context. A rule structure developed in this manner can provide a high degree of reusability across different conversion environments for reduced start-up effort and cost. Moreover, subject matter cues and structure can be based on or adopt existing data structures and metadata elements (e.g., of an existing database or other structured data system) so as to provide further efficiencies and functionality.
It has been recognized that conversion processes can benefit from context dependent conversion rules that allow for, inter alia, appropriate resolution of ambiguities. Just as humans can often readily resolve such ambiguities based on an understanding of a surrounding context, machine-based tools can be adapted to identify contextual cues and to access and apply context dependent rules and conversion processes. Such context cues can be reflected, in accordance with the present invention, by a parse-tree structure, a frame-slot architecture or a combination thereof. The present inventors have recognized that the frame-slot architecture has particular advantages for certain applications, but each approach has significant utility as discussed below.
The parse-tree approach involves developing a classification structure by which terms under consideration can be mapped to or associated with a particular classification taxonomy. For example, in the context of a database or catalog of business products, a product attribute term may be associated with a parent product classification, which in turn belongs to a grandparent product grouping classification, etc. The associated classification structure may be referred to as a parse tree. By accessing rules appropriate to this classification structure, conversions can be executed with improved accuracy. This represents a substantial improvement in relation to conventional conversion tools.
However, such a classification taxonomy entails certain inefficiencies. First, in order to encompass a subject matter area of significant size or complexity to a useful degree of classification granularity, very deep parses may be required reflecting a complicated parse tree. These deep parses require substantial effort and processing resources to develop and implement. Moreover, the resulting classification structures impose significant rigidity on the associated conversion processes such that it may be difficult to adapt the structures to a new conversion environment or to reuse rules and structures as may be desired. Moreover, such predefined, complex structures have limited ability to leverage context cues that may exist in source structured data or that may otherwise be inferred based on an understanding of the subject matter at issue, thereby failing to realize potential efficiencies.
In accordance with the present invention, a frame-slot architecture is provided for use in converting information. In this regard, a frame represents an intersection between a contextual cue recognized by the machine tool, associated content and related constraint information specific to that conversion environment, whereas a slot represents an included chunk of information. For example, in the context of product descriptions, a chunk of information such as "1 inch roller bearing" may be recognized by the machine tool logic or grammar as an attribute phrase. The term "1 inch" may then be recognized as an attribute value. In the context of describing a "roller bearing," it may be readily understood that "1 inch" represents a radius dimension and not a length, width, height or similar rectilinear designation. Such contextual cues can be inferred from a general, public understanding of the subject matter, i.e., what a roller bearing is. Such understanding is a kind of public schema. Moreover, an associated private schema may define acceptable values or ranges for this attribute. For example, only certain values or a certain range of values for the attribute at issue may be "legal"; that is, only those values may be acceptable within rules defined by an interested entity. In many cases, such private schema may be pre-defined and thus available for use in a conversion process prior to any detailed analysis of the data sets at issue. The attribute value can be compared to such constraints to confirm the identification of the attribute phrase or to identify corrupted or nonconforming data. The frame is thus a specification of context or other disambiguating cues at or close to the whole-record level, less sensitive to syntax and more sensitive to the intersection of attributes and their values. Thus, a frame functions as a container for grammatical information used to convert data, analogous to a software object. The frame-slot architecture thus can resolve ambiguities without deep parses and yields flexible and more readily reusable syntactic rules. Moreover, constraint information is readily available, e.g., for attribute values, thus allowing for more confidence in conversions and better recognition of conversion anomalies.
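One possible, highly simplified encoding of a frame with a constrained slot is sketched below in Python; the slot name, unit and the assumed 0.5-6.0 inch legal range are illustrative only.

    import re

    roller_bearing_frame = {
        "object": "roller bearing",
        # slot: attribute interpretation and legal value range (private schema)
        "dimension": {"means": "radius", "unit": "inch", "range": (0.5, 6.0)},
    }

    def fill_dimension_slot(frame, attribute_phrase):
        match = re.search(r'([\d.]+)\s*(?:inch(?:es)?|in\.?|")', attribute_phrase)
        if not match:
            return None                          # no attribute value recognized
        value = float(match.group(1))
        slot = frame["dimension"]
        low, high = slot["range"]
        if low <= value <= high:
            return {slot["means"]: value, "unit": slot["unit"]}
        return {"anomaly": f"{value} {slot['unit']} outside legal range {low}-{high}"}

    print(fill_dimension_slot(roller_bearing_frame, "1 inch roller bearing"))
    # {'radius': 1.0, 'unit': 'inch'}
    print(fill_dimension_slot(roller_bearing_frame, "12 inch roller bearing"))
    # {'anomaly': '12.0 inch outside legal range 0.5-6.0'}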
In accordance with one aspect of the present invention, a method and apparatus ("utility") is provided for converting a semantic element under consideration. The utility involves receiving content associated with a data source and obtaining first information from the content for use in a conversion. The nature of the content depends, for example, on the conversion environment. In this regard, the content may be structured (e.g., in the case of converting data from a database or other structured source) or unstructured (e.g., in the case of a search query or other textual data source). The first information can be any of a variety of data chunks that are recognized by the utility, for example, an attribute phrase or other chunk including context cues in data or metadata form.
The utility uses the first information to obtain second information, from a location external to the content, for use in the conversion, and uses the first and second information in converting the content from a first form to a second form. For example, the second information may include context specific interpretation rules (e.g., "1 inch" means "1 inch in radius"), context specific constraints (e.g., acceptable attribute values must fall between 0.5-6.0 inches) and/or context-specific syntax or format rules (e.g., re-write as "roller bearing - 1 inch radius").
In this manner, a frame-slot architecture can be implemented with attendant advantages as noted above. It will be appreciated that such an architecture can be imposed on data in a top-down fashion or developed from data in a bottom-up fashion. That is, frames may be predefined for a particular subject matter such that data chunks can then be slotted to appropriate frames, or frames can evolve from the data and make use of the data's intrinsic or existing structures. In the latter regard, it will be appreciated that existing databases and structured data often have a high degree of embedded contextual cues that the utility of the present invention can leverage to efficiently define frame-slot architecture.
In accordance with another aspect of the present invention, a utility is provided for converting data from a first form to a second form based on an external schema. Specifically, the utility involves establishing a number of schema, each of which includes one or more conversion rules for use in converting data within a corresponding context of a subject matter area. A set of data is identified for conversion from the first form to the second form and a particular context of the set of data is determined. Based on this context, a first schema is accessed and a conversion rule of the first schema is used in a process for converting the set of data from the first form to the second form. The schemas are established based on external knowledge of a subject matter area independent of analysis of a particular set of data to be converted. In this regard, the schema may include one or more public schema including conversion rules generally applicable to the subject matter area independent of any entity or group of entities associated with the set of data. For example, such public schema may involve an accepted public definition of a semantic object, e.g., a "flat bar" may be defined as a rectilinear object having a length, width, and thickness where the length is greater than the width which, in turn, is greater than the thickness. Alternatively or additionally, the external schema may include one or more private schema, each including conversion rules specific to an entity or group of entities less than the public as a whole. For example, such a private schema may define legal attribute values in relation to a product catalog of a company. The examples of schema noted above involved some relationship between elements included in a single attribute phrase, e.g., an object such as "bar" and an associated attribute such as "flat." It should be appreciated that schema are not limited to such contexts but more broadly encompass public or private rules for structuring or understanding data. Thus, for example, rules may be based on relationships between different objects such as "paint brush," on the one hand, and "bristles," "handle" or "painter" on the other.
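For illustration only, the public definition of a "flat bar" given above, together with a hypothetical private constraint on thickness values, could be encoded along the following lines; the specific legal values are assumptions.

    def is_flat_bar(length, width, thickness):
        # public schema: a rectilinear object with length > width > thickness
        return length > width > thickness

    LEGAL_THICKNESSES = (0.125, 0.25, 0.5)   # hypothetical private schema values

    def check_flat_bar(length, width, thickness):
        if not is_flat_bar(length, width, thickness):
            return "does not satisfy the public definition of a flat bar"
        if thickness not in LEGAL_THICKNESSES:
            return "meets the public definition, but the thickness is not a legal catalog value"
        return "conforms to both the public and the private schema"

    print(check_flat_bar(36.0, 2.0, 0.25))   # conforms
    print(check_flat_bar(2.0, 36.0, 0.25))   # fails the public definition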
The set of data to be converted may include, for example, an attribute phrase (or phrases) including a semantic object, an attribute associated with the object and an attribute value for that attribute. This attribute phrase may be identified by parsing a stream of data. In this regard, the context of the subject matter area may be determined from the semantic object. Thus, the attribute phrase includes information potentially identifying the semantic object, attribute and attribute value. Logic may be executed to interpret this information so as to identify the object, attribute and/or attribute value. In any event, the object, attribute or attribute value may be compared to a set of objects, attributes or attribute values defined by the first schema. Such a comparison may enable conversion of the set of data from the first form to the second form or may identify an anomaly regarding the set of data.
It will be appreciated that the process of establishing the schema may be implemented in a start-up mode for configuration of a machine-based tool. Such a start-up mode may be employed to configure the tool so as to convert data based on contextual cues inferred from an understanding of the subject matter area. In this regard, the schema enables conversion of data which was not specifically addressed during configuration. Thus, the machine tool is not limited to converting data elements or strings of elements for which context cues have been embedded but can infer contextual cues with respect to new data. In this manner, start-up efforts and costs can be substantially reduced.
BRIEF DESCRIPTION OF THE DRAWINGS
For a more complete understanding of the present invention and further advantages thereof, reference is now made to the following detailed description taken in conjunction with the drawings, in which:
Figure 1 is a monitor screen shot illustrating a process for developing replacement rules in accordance with the present invention;
Figure 2 is a monitor screen shot illustrating a process for developing ordering rules in accordance with the present invention;

Figure 3 is a schematic diagram of the NorTran Server components of a SOLx system in accordance with the present invention;
Figure 4 is a flowchart providing an overview of SOLx system configuration in accordance with the present invention;
Figures 5-10 are demonstrative monitor screen shots illustrating normalization and translation processes in accordance with the present invention;
Figure 11 is a flowchart of a normalization configuration process in accordance with the present invention;
Figure 12 is a flowchart of a translation configuration process in accordance with the present invention;
Figure 13 is an illustration of a graphical desktop implementation for monitoring the configuration process in accordance with the present invention;
Figure 14 illustrates various network environment alternatives for implementation of the present invention;

Figure 15 illustrates a conventional network/web interface;
Figure 16 illustrates a network interface for the SOLx system in accordance with the present invention;

Figure 17 illustrates a component level structure of the SOLx system in accordance with the present invention;
Figure 18 illustrates a component diagram of an N-Gram Analyzer of the SOLx system in accordance with the present invention;

Figure 19 illustrates a taxonomy related to the area of mechanics in accordance with the present invention;
Figure 20 is a flowchart illustrating a process for constructing a database in accordance with the present invention;
Figure 21 is a flowchart illustrating a process for searching a database in accordance with the present invention;
Figure 22 is a schematic diagram of a transformation information sharing system in accordance with the present invention;
Figures 23-35 are sample user interface screens illustrating transformation information sharing functionality in accordance with the present invention;
Figure 36 is a flowchart illustrating an information import and testing process in accordance with the present invention;
Figure 37 is a schematic diagram of a search system in accordance with the present invention operating in the startup mode;

Figure 38 is a schematic diagram illustrating the mapping of the potential search terms and source terms to a single parse tree in accordance with the present invention;
Figures 39 and 40 illustrate graphical user interfaces for mapping terms to a parse tree in accordance with the present invention;

Figure 41 is a flow chart illustrating a process for mapping terms to a parse tree in accordance with the present invention;
Figure 42 is a schematic diagram illustrating a search system, in accordance with the present invention, in a use mode;

Figure 43 is a flow chart illustrating a process for operating the system of Fig. 42 in the use mode;

Figure 44 is a schematic diagram illustrating use of a knowledge base to search multiple legacy systems in accordance with the present invention.
Fig. 45 is a schematic diagram of a semantic conversion system in accordance with the present invention;

Fig. 46 is a flow chart illustrating a semantic conversion process in accordance with the present invention;
Fig. 47 is a schematic diagram showing an example of a conversion that may be implemented using the system of Fig. 45;

Fig. 48 is a schematic diagram illustrating the use of public and private schema in a conversion process in accordance with the present invention; and
Figs. 49-50B illustrate exemplary user interfaces in accordance with the present invention.
DETAILED DESCRIPTION
In the following description, the invention is set forth in the context of a search system involving standardization of source and search terms, and the association of classification information with both source terms and search terms and in other conversion contexts. Specific examples are provided in the environment of business information, e.g., searching a website or electronic catalog for products of interest. Although this particular implementation of the invention and this application environment are useful for illustrating the various aspects of the invention, it will be appreciated that the invention is more broadly applicable to a variety of application environments and searching functions. In particular, various aspects of the invention as set forth above may be beneficially used independent of others of these aspects and are not limited to combinative uses as set forth in the discussion that follows. The discussion below begins by describing, at a functional and system component level, a search system constructed in accordance with the present invention. This description is contained in Section I. Thereafter, in Sections II and III, the underlying framework for term standardization, classification and transformation is described in greater detail including certain utilities for sharing rule information and development between multiple users and between applications. Finally, Section IV describes a novel frame-slot architecture.

I. SEARCH SYSTEM
Generally, the search system of the present invention is operable in two modes: the setup mode and the use mode. In the setup mode, the user, generally a subject matter expert as will be described below, performs a number of functions including accessing lists of potential search terms and/or source terms, developing a standardized set or sets of terms, establishing a classification structure, associating the standardized terms with the classification structure and selectively transforming (e.g., translating) the terms as necessary. Figure 37 is a schematic diagram of a search system 3700, in accordance with the present invention, operating in the startup mode. Generally, the system 3700 includes a controller 3702 and storage configured to store a term listing 3704, a parse tree structure 3706 and a set of structured standardized terms 3708. Although the system 3700 is illustrated as being implemented on a single platform 3710, it will be appreciated that the functionality of the system 3700 may be distributed over multiple platforms, for example, interconnected by a local or wide area network.
The user 3712 uses the controller 3702 to access a previously developed parse tree structure 3706 or to develop the structure 3706. In this regard, the parse tree structure 3706 generally defines a number of classifications, each generally including one or more sub-classifications that collectively define the subject matter area. Examples will be provided below. The number of layers of classifications and sub-classifications will generally be determined by the user 3712 and is dependent on the nature of the subject matter. In many cases, many such classifications will be available, for example, corresponding to headings and subheadings of a catalog or other pre-existing subdivisions of a subject matter of interest. In other cases, the subject matter expert may develop the classifications and sub-classifications based on an analysis of the subject matter. The user can then use the controller 3702 to access a term listing 3704 to be processed. As noted above, such a term listing 3704 may include potential search terms, source terms from a source data collection or both. In the case of potential search terms, the terms may be obtained from a pre-existing list or may be developed by the user 3712. For example, the potential search terms may be drawn from a stored collection of search terms entered by users in the context of the subject matter of interest. Additional sources may be available, in a variety of contexts, for example, lists that have been developed in connection with administering a pay-per-click search engine. The list may be updated over time based on monitoring search requests. Similarly, the source term listing may be previously developed or may be developed by the user 3712. For example, in the context of online shopping applications, the source listing may be drawn from an electronic product catalog or other product database.
After accessing the term listing, the user may perform a number of functions including standardization and classification. Standardization refers to mapping of terms from the term listing 3704 to a second set, generally a smaller set, of standardized terms. In this manner, misspellings, abbreviations, colloquial terms, synonyms, different linguistic/syntax conventions of multiple legacy systems and other idiosyncratic matter can be addressed such that the list of standardized terms is substantially reduced in relation to the original term listing 3704. It will be appreciated from the discussion below that such standardization facilitates execution of the searching functionality as well as transformation functions as may be desired in some contexts, e.g., translation.
The resulting list of standardized terms can then be mapped to the parse tree structure 3706. As will be described below, this can be executed via a simple drag and drop operation on a graphical user interface. Thus, an item from a source listing, for example, identifying a particular Post-it note product, may be associated with an appropriate base level classification, for example, "Adhesive Notepad." Similarly, a term from a potential search term listing such as "Sticky Pad" may be associated with the same base level classification. It will be appreciated that a given term may be associated with more than one base level classification, a given base level classification may be associated with more than one parent classification, etc.
As noted above, such a base level classification may be associated with a parent classification, grandparent classification, etc. All of these relationships are inherited when the term under consideration is associated with a base level classification. The result is that the standardized term is associated with a string of classes and sub-classes of the parse tree structure 3706. For example, these relationships may be reflected in an XML tag system or other metadata representation associated with the term. The resulting structured standardized terms are then stored in a storage structure 3708 such as a database.
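A compact sketch of how these inherited classification relationships might be recorded as metadata is shown below; the node names follow the office-supplies example used in the figures, and the tag format is merely one possible representation, not a prescribed one.

    parent = {
        "Adhesive": "Notepads",
        "Notepads": "Paper_Products",
        "Paper_Products": "Office_Supplies",
    }

    def lineage(base_class):
        # walk from the base-level classification up to the root
        chain = [base_class]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return list(reversed(chain))             # root first

    def tagged(term, base_class):
        path = lineage(base_class)
        opening = "".join(f"<{c}>" for c in path)
        closing = "".join(f"</{c}>" for c in reversed(path))
        return opening + term + closing

    print(tagged("Post-it note", "Adhesive"))
    # <Office_Supplies><Paper_Products><Notepads><Adhesive>Post-it note</Adhesive>...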
It will thus be appreciated that, in the illustrated embodiment, both source terms and potential search terms may be mapped to elements of the same parse tree structure. This is shown in Figure 38. As shown, multiple terms 3802 from the source collection are mapped to the parse tree structure 3800. Similarly, multiple terms from the potential search term listing 3804 are mapped to corresponding elements of the parse tree structure 3800. In this manner, a particular search term entered by a user can be used to identify responsive information from the source collection based on a common classification or sub-classification despite the absence of any overlap between the entered search term and the corresponding items from the source collection. It will be appreciated that it may be desirable to link a given term 3802 or 3804 with more than one classification or classification lineage of the parse tree 3800. This may have particular benefits in connection with matching a particular product or product category to multiple potential search strategies, e.g., mapping "pen" to searches including "writing instrument" or "office gift."
An example of this process is shown in Figure 39 with respect to particular search terms. In particular, Figure 39 shows a user interface representing a portion of a parse tree 3900 for a particular subject matter such as the electronic catalog of an office supply warehouse. In this case, the user uses the graphical user interface to establish an association between search terms 3902 and 3904 and the parse tree 3900. Specifically, search term 3902, in this case "sticky pad," is dragged and dropped on the node 3906 of the parse tree 3900 labeled "Adhesive." This node 3906 or classification is a sub-classification of "Notepads" 3908 which is a sub-classification of "Paper Products" 3910 which, finally, is a sub-classification of "Office_Supplies" 3912. Similarly, term 3904, in this case "Daytimer," is associated with classification "Appointment_Books" 3914 which is a sub-classification of "Non-electronic" 3916 which, in turn, is a sub-classification of "Organizers" 3918 which, finally, is a sub-classification of "Office_Supplies" 3912. Data strings 3920 and 3922 illustrate the resulting structured terms reflecting the classification relationships (other syntax, such as standard XML tag syntax, may be used to reflect the classification structure). It will be appreciated that the example of Fig. 39 omits the optional step of term standardization. That is, the potential search term "Sticky Pad" may alternatively first be mapped to a standardized term such as "Post-it note" before being associated with the parse tree. Such standardization will be described in more detail below.
Fig. 40 illustrates how the same parse tree 3900 may be used to associate a classification with items from a source collection. For example, such a source collection may be drawn from an electronic catalog or other database of the business. In this case, the source term 4002 denoted "3-pack, 3x3 Post-it notes (Pop-up)-Asst'd" is associated with the same node 3906 as "Sticky Pad" was in the previous example. Similarly, term 4004 denoted "2005 Daytimer-Weekly-7x10-Blk" is associated with the same node 3914 as potential search term "Daytimer" was in the previous example. As will be appreciated from the discussion below, such common associations with respect to the parse tree 3900 facilitate searching.
This process for establishing a knowledge base may be summarized with respect to the flow chart of Fig. 41. The illustrated process 4100 is initiated by developing (4102) a parse tree that defines the subject matter of interest in terms of a number of classifications and sub-classifications. As noted above, such parsing of the subject matter may be implemented with enough levels to divide the subject matter to the desired granularity. The process 4100 then proceeds on two separate paths relating to establishing classifications for potential search terms and classifications for items from the source collection. It will be appreciated that these two paths may be executed in any order or concurrently. On the potential search term path, the process involves obtaining or developing (4104) a potential search term listing. As noted above, an existing list may be obtained, a new list may be developed by a subject matter expert, or some combination of these processes may occur. The terms are then mapped (4106) to the parse tree structure such as by a drag and drop operation on a graphical user interface as illustrated above. On the source term process line, the process 4100 proceeds by obtaining or developing (4108) a source term listing. Again, the source term listing may be obtained from existing sources, developed by a subject matter expert, or some combination of these processes may occur. The individual terms are then mapped (4110) to the parse tree structure, again, for example, by way of a drag and drop operation as illustrated above. Although not shown, the process 4100 may further include the steps of re-writing the potential search terms and source terms in a standardized form.
The search system of the present invention is also operative in a use mode. This is illustrated in Fig. 42. The illustrated system 4200 includes input structure 4202 for receiving a search request from a user 4204. Depending on the specific network context in which the system 4200 is implemented, the search request may be entered directly at the machine executing the search system, or may be entered at a remote node interconnected to the platform 4206 via a local or wide area network. The nature of the input structure 4202 may vary accordingly. The search request is processed by a controller 4208 to obtain responsive information that is transmitted to the user 4204 via output structure 4210. Again, the nature of the output structure 4210 may vary depending on the specific network implementation.
In the illustrated implementation, in order to obtain the responsive information, the controller accesses the knowledge base 4212. The knowledge base 4212 includes stored information sufficient to identify a term from the search request, rewrite the term in a standardized form, transform the term if necessary, and obtain the metadata associated with the term that reflects the classification relationships of the term. The controller then uses the standardized term together with the classification information to access responsive information from the source data 4214.
Fig. 43 is a flow chart illustrating a corresponding process 4300.
The process 4300 is initiated by receiving (4302) a search request, for example, from a keyboard, graphical user interface or network port. The system is then operative to identify (4304) a search term from the search request. In this regard, any appropriate search query syntax may be supported. For example, a search term may be entered via a template including predefined Boolean operators or may be entered freeform. Existing technologies allow for identification of search terms thus entered.
The search term is then rewritten (4306) in standard form. This may involve correcting misspellings, mapping multiple synonyms to a selected standard term, implementing a predetermined syntax and grammar, etc., as will be described in more detail below. The resulting standard form term is then set (4308) as the current search parameter.
In the illustrated implementation, the search then proceeds iteratively through the hierarchy of the parse tree structure. Specifically, this is initiated by searching (4310) the source database using the current search parameter. If any results are obtained (4312) these results may be output (4320) to the user. If no results are obtained, the parent classification at the next level of the parse tree is identified (4314). That parent classification is then set (4316) as the current search parameter and the process is repeated. Optionally, the user may be queried (4318) regarding such a classification search. For example, the user may be prompted to answer a question such as "no match found -- would you like to search for other products in the same classification?" In addition, the logic executed by the process controller may limit such searches to certain levels of the parse tree structure, e.g., no more than three parse levels (parent, grandparent, great grandparent) in order to avoid returning undesired results. Alternatively or additionally, such searching may be limited to a particular number of responsive items. The responsive items as presented to the user may be ordered or otherwise prioritized based on relevancy as determined in relation to proximity to the search term in the parse tree structure.

It will be appreciated that searching functionality such as discussed above is not limited to searching of a web-site or electronic catalog by outside parties but is more generally useful in a variety of searching and database merging environments. Fig. 44 illustrates a system 4400 for using a knowledge base 4404 to access information from multiple legacy databases 4401-4403. Many organizations have related information stored in a variety of legacy databases, for example, product databases and accounting databases. Those legacy databases may have been developed or populated by different individuals or otherwise include different conventions relating to linguistics and syntax.
In the illustrated example, a first record 4406 of a first legacy database 4401 reflects a particular convention for identifying a manufacturer ("Acme") and product ("300W AC Elec. Motor . . ."). Record 4407 associated with another legacy database 4403 reflects a different convention including, among other things, a different identification of the manufacturer ("AcmeCorp") and a misspelling ("Moter").
In this case, an internal or external user can use the processor 4405 to enter a substantially freeform search request, in this case "Acme Inc. Power Equipment." For example, such a search request may be entered in the hopes of retrieving all relevant information from all of the legacy databases 4401-4403. This is accommodated, in the illustrated embodiment, by processing the search request using the knowledge base 4404. The knowledge base 4404 executes functionality as discussed above and in more detail below relating to standardizing terms, associating terms with a classification structure and the like. Thus, the knowledge base 4404 may first process the search query to standardize and/or classify the search terms. For example, Acme, Inc. may be associated with the standardized term "Acme." The term "Power Equipment" may be associated with the standardized term or classification "motor." Each of these terms/classifications may in turn be associated with the corresponding legacy forms of the databases 4401-4403 to retrieve responsive information from each of the databases. Additional conventional functionality such as merge functionality may be implemented to identify and prioritize the responsive information provided as search results to the processor 4405. In this manner, searching or merging of legacy data systems is accommodated with minimal additional code.
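The legacy-search scenario of Fig. 44 might be reduced to the following toy sketch; all of the table contents, database labels and function names are illustrative assumptions rather than elements of the described system.

    standardize = {"acme inc.": "Acme", "acme corp": "Acme",
                   "power equipment": "motor"}

    legacy_form = {
        # standardized term -> form used in each legacy database
        "Acme":  {"db_4401": "Acme", "db_4403": "AcmeCorp"},
        "motor": {"db_4401": "Elec. Motor", "db_4403": "Elec. Moter"},
    }

    def legacy_queries(free_form_request, phrases):
        request = free_form_request.lower()
        queries = {}
        for phrase in phrases:
            if phrase in request:
                std = standardize[phrase]
                for db, form in legacy_form[std].items():
                    queries.setdefault(db, []).append(form)
        return queries

    print(legacy_queries("Acme Inc. Power Equipment",
                         ["acme inc.", "power equipment"]))
    # {'db_4401': ['Acme', 'Elec. Motor'], 'db_4403': ['AcmeCorp', 'Elec. Moter']}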
From the discussion above, it will be appreciated that substantial effort is involved in transforming data from one form to another, e.g., from a raw list of potential search or source terms to a set or sets of standardized, classified and, perhaps, translated terms. The present invention also accommodates sharing information established in developing a transformation model such as a semantic metadata model (SMM) used in this regard. Such sharing of information allows multiple users to be involved in creating the knowledge base, e.g., at the same time, and allows components of such information to be utilized in starting new knowledge base creation projects.
The invention is preferably implemented in connection with a computer-based tool for facilitating substantially real-time transformation of electronic communications. As noted above, the invention is useful in a variety of contexts, including transformation of business as well as non-business content and also including transformation of content across language boundaries as well as within a single language environment.
It will be appreciated that transformation of data in accordance with the present invention is not limited to searching applications as described above, but is useful in a variety of applications including translation assistance. In the following description, such a system is described in connection with the transformation of business content from a source language to a target language using a Structured Object Localization expert (SOLx) system. The invention is further described in connection with classification of terminology for enhanced processing of electronic communications in a business or non-business context. The information sharing functionality and structure of the invention is then described. Such applications serve to fully illustrate various aspects of the invention. It will be appreciated, however, that the invention is not limited to such applications.
In addition, in order to facilitate a more complete understanding of the present invention and its advantages over conventional machine translation systems, the following description includes considerable discussion of grammar rules and other linguistic formalities. It shall be appreciated that, to a significant degree, these formalities are developed and implemented with the assistance of the SOLx system. Indeed, a primary advantage of the SOLx system is that it is intended for use by subject matter experts, not linguistic experts. Moreover, the SOLx system can handle source data in its native form and does not require substantial database revision within the source system. The SOLx system thereby converts many service industry transformation tasks into tools that can be addressed by in-house personnel or substantially automatically by the SOLx system. The following description is generally divided into two sections. In the first section, certain subjects relevant to the configuration of SOLx are described. This includes a discussion of configuration objectives as well as the normalization, classification and translation processes. Then, the structure of SOLx is described, including a discussion of network environment alternatives as well as the components involved in configuration and run-time operation. In the second section, the information sharing functionality and structure is described. This includes a discussion of the creation, editing and extension of data domains, as well as domain management and multi-user functionality.
II. TRANSFORMATION CONFIGURATION
As noted above, the information sharing technology of the present invention is preferably implemented in connection with a machine-based tool that is configured or trained by one or more SMEs who develop a knowledge base including an SMM. This machine-based tool is first described in this Section II. The knowledge sharing functionality and structure is described in Section III that follows.
A. System Configuration
1. Introduction - Configuration Challenges
The present invention addresses various shortcomings of conventional data transformation, including manual translation and conventional machine translation, especially in the context of handling business content. In the former regard, the present invention is largely automated and is scalable to meet the needs of a broad variety of applications. In the latter regard, there are a number of problems associated with typical business content that interfere with good functioning of a conventional machine translation system. These include out-of-vocabulary (OOV) words that are not really OOV and covert phrase boundaries. When a word to be translated is not in the machine translation system's dictionary, that word is said to be OOV. Often, words that actually are in the dictionary in some form are not translated because they are not in the dictionary in the same form in which they appear in the data under consideration. For example, particular data may contain many instances of the string "PRNTD CRCT BRD", and the dictionary may contain the entry "PRINTED CIRCUIT BOARD," but since the machine translation system cannot recognize that "PRNTD CRCT BRD" is a form of "PRINTED CIRCUIT BOARD" (even though this may be apparent to a human), the machine translation system fails to translate the term "PRNTD CRCT BRD". The SOLx tool set of the present invention helps turn these "false OOV" terms into terms that the machine translation system can recognize.
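A trivial sketch of this kind of pre-translation cleanup is shown below; the abbreviation table is an assumed example and not the SOLx dictionary itself.

    abbreviations = {"PRNTD": "PRINTED", "CRCT": "CIRCUIT", "BRD": "BOARD"}

    def expand(description):
        # rewrite abbreviated tokens so the translation dictionary recognizes them
        return " ".join(abbreviations.get(token, token)
                        for token in description.split())

    print(expand("PRNTD CRCT BRD"))    # -> PRINTED CIRCUIT BOARD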
Conventional language processing systems also have trouble telling which words in a string of words are more closely connected than other sets of words. For example, humans reading a string of words like Acetic Acid Glass Bottle may have no trouble telling that there's no such thing as "acid glass," or that the word Glass goes together with the word Bottle and describes the material from which the bottle is made. Language processing systems typically have difficulty finding just such groupings of words within a string of words. For example, a language processing system may analyze the string Acetic Acid Glass Bottle as follows:
i) Acetic and Acid go together to form a phrase
ii) Acetic Acid and Glass go together to form a phrase
iii) Acetic Acid Glass and Bottle go together to form a phrase
The first item of the analysis is correct, but the remaining two are not, and they can lead to an incorrect analysis of the item description as a whole. This faulty analysis may lead to an incorrect translation. The actual boundaries between phrases in data are known as phrase boundaries. Phrase boundaries are often covert - that is, not visibly marked. The SOLx tool of the present invention, as described in detail below, prepares data for translation by finding and marking phrase boundaries in the data. For example, it marks phrase boundaries in the string Acetic Acid Glass Bottle as follows:
• Acetic Acid | Glass Bottle
This simple processing step - simple for a human, difficult for a language processing system - helps the machine translation system deduce the correct subgroupings of words within the input data, and allows it to produce the proper translation.
The present invention is based, in part, on the recognition that some content, including business content, often is not easily searchable or analyzable unless a schema is constructed to represent the content. There are a number of issues that a computational system must address to do this correctly. These include: deducing the "core" item; finding the attributes of the item; and finding the values of those attributes. As noted above, conventional language processing systems have trouble telling which words in a string of words are more closely connected than other sets of words. They also have difficulty determining which word or words in the string represent the "core," or most central, concept in the string. For example, humans reading a string of words like Acetic Acid Glass Bottle in a catalogue of laboratory supplies may have no trouble telling that the item that is being sold is acetic acid, and that Glass Bottle just describes the container in which it is packaged. For conventional language processing systems, this is not a simple task. As noted above, a conventional language processing system may identify a number of possible word groupings, some of which are incorrect. Such a language processing system may deduce, for example, that the item that is being sold is a bottle, and that the bottle is made of "acetic acid glass." Obviously, this analysis leads to a faulty representation of bottles (and of acetic acid) in a schema and, therefore, is of little assistance in building an electronic catalogue system.
In addition to finding the "core" of an item description, it is also useful to find the groups of words that describe that item. In the following description, the terms by which an item can be described are termed its attributes, and the contents or quantity of an attribute is termed its value. Finding attributes and their values is as difficult for a language processing system as is finding the "core" of an item description. For instance, in the string Acetic Acid Glass Bottle, one attribute of the core item is the package in which it is distributed. The value of this attribute is Glass Bottle. It may also be deemed that one attribute of the core item is the kind of container in which it is distributed. The value of this attribute would be Bottle. One can readily imagine other container types, such as Drum, Bucket, etc., in which acetic acid could be distributed. It happens that the kind of container attribute itself has an attribute that describes the material that the container is made of. The value of this attribute is Glass. Conventional natural language processing systems have trouble determining these sorts of relationships. Continuing with the example above, a conventional language processing system may analyze the string Acetic Acid Glass Bottle as follows:
• Acetic and Acid go together to describe Glass
• Acetic Acid and Glass go together to describe Bottle
This language processing system correctly deduced that Acetic and Acid go together. It incorrectly concluded that Acetic Acid go together to form the value of some attribute that describes a kind of Glass, and also incorrectly concluded that Acetic Acid Glass go together to give the value of some attribute that describes the bottle in question.
The SOLx system of the present invention, as described in detail below, allows a user to provide guidance to its own natural language processing system in deducing which sets of words go together to describe values. It also adds one very important functionality that conventional natural language processing systems cannot perform without human guidance. The SOLx system allows you to guide it to match values with specific attribute types. The combination of (1) finding core items, and (2) finding attributes and their values, allows the SOLx system to build useful schemas. As discussed above, covert phrase boundaries interfere with good translation. Schema deduction contributes to preparation of data for machine translation in a very straightforward way: the labels that are inserted at the boundaries between attributes correspond directly to phrase boundaries. In addition to identifying core items and attributes, it is useful to classify an item. In the example above, either or both of the core item (acetic acid) and its attributes (glass, bottle and glass bottle) may be associated with classifications. Conveniently, this may be performed after phrase boundaries have been inserted and core items and attributes have been defined. For example, acetic acid may be identified by a taxonomy where acetic acid belongs to the class aqueous solutions, which belongs to the class industrial chemicals and so on. Glass bottle may be identified by a taxonomy where glass bottle (as well as bucket, drum, etc.) belong to the family aqueous solution containers, which in turn belongs to the family packaging and so on. These relationships may be incorporated into the structure of a schema, e.g., in the form of grandparent, parent, sibling, child, grandchild, etc. tags in the case of a hierarchical taxonomy. Such classifications may assist in translation, e.g., by resolving ambiguities, and allow for additional functionality, e.g., improve searching for related items.
The next section describes a number of objectives of the SOLx system configuration process. All of these objectives relate to manipulating data from its native form to a form more amenable to translation or other localization, i.e., performing an initial transformation to an intermediate form.
2. Configuration Objectives
Based on the foregoing, it will be appreciated that the SOLx configuration process has a number of objectives, including solving OOVs and solving covert phrase boundaries based on identification of core items, attribute/value pairs and classification. Additional objectives, as discussed below, relate to taking advantage of reusable content chunks and resolving ambiguities. Many of these objectives are addressed automatically, or are partially automated, by the various SOLx tools described below. The following discussion will facilitate a more complete understanding of the internal functionality of these tools as described below.
False OOV words and true OOV words can be discovered at two stages in the translation process: before translation, and after translation. Potential OOV words can be found before translation through use of a Candidate Search Engine as described in detail below. OOV words can be identified after translation through analysis of the translated output. If a word appears in data under analysis in more than one form, the Candidate Search Engine considers the possibility that only one of those forms exists in the machine translation system's dictionary. Specifically, the Candidate Search Engine offers two ways to find words that appear in more than one form prior to submitting data for translation: the full/abbreviated search option; and the case variant search option. Once words have been identified that appear in more than one form, a SOLx operator can force them to appear in just one form through the use of vocabulary adjustment rules.
In this regard, the full/abbreviated search may output pairs of abbreviations and words. Each pair represents a potential false OOV term where it is likely that the unabbreviated form is in-vocabulary. Alternatively, the full/abbreviated search may output both pairs of words and unpaired abbreviations. In this case, abbreviations that are output paired with an unabbreviated word are potentially false OOV words, where the full form is likely in-vocabulary. Abbreviations that are output without a corresponding full form may be true OOV words. The machine translation dictionary may therefore be consulted to see if it includes such abbreviations. Similarly, some entries in a machine translation dictionary may be case sensitive. To address this issue, the SOLx system may implement a case variant search that outputs pairs, triplets, etc. of forms that are composed of the same letters, but appear with different variations of case. The documentation for a given machine translation system can then be consulted to learn which case variant is most likely to be in-vocabulary. To determine if a word is falsely OOV, words that are suspected to be OOV can be compared with the set of words in the machine translation dictionary. There are three steps to this procedure: 1) for each word that you suspect is falsely OOV, prepare a list of other forms that the word could take; 2) check the dictionary to see if it contains the suspected false OOV form; 3) check the dictionary to see if it contains one of the other forms of the word that you have identified. If the dictionary does not contain the suspected false OOV word and does contain one of the other forms of the word, then that word is falsely OOV and the SOLx operator can force it to appear in the "in-vocabulary" form in the input data as discussed below. Generally, this is accomplished through the use of a vocabulary adjustment rule. The vocabulary adjustment rule converts the false OOV form to the in-vocabulary form. The process for writing such rules is discussed in detail below.
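The three-step check described above can be illustrated with a short sketch; the variant generator, the sample dictionary entries, and the function names below are assumptions made for purposes of illustration only.

# Hypothetical check for "false OOV" words: a word is falsely OOV if the
# dictionary lacks the suspect form but contains one of its other forms.
def variant_forms(word: str) -> list:
    """Assumed variant generator: case variants plus a period-stripped form."""
    return [word.lower(), word.upper(), word.capitalize(), word.rstrip(".")]

def classify_oov(suspect: str, mt_dictionary: set) -> str:
    if suspect in mt_dictionary:
        return "in-vocabulary"
    for form in variant_forms(suspect):
        if form in mt_dictionary:
            # A vocabulary adjustment rule can convert suspect -> form.
            return f"false OOV (in-vocabulary form: {form})"
    return "true OOV"

dictionary = {"Fr.", "centimeter", "CIRCUIT"}
print(classify_oov("FR.", dictionary))       # false OOV (in-vocabulary form: Fr.)
print(classify_oov("circuit", dictionary))   # false OOV (in-vocabulary form: CIRCUIT)
print(classify_oov("peelaway", dictionary))  # true OOV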
Problems related to covert phrase boundaries appear as problems of translation. Thus, a problem related to covert phrase boundaries may initially be recognized when a translator/translation evaluator finds related errors in the translated text. A useful objective, then, is to identify these problems as problems related to covert phrase boundaries, rather than as problems with other sources. For example, a translation evaluator may describe problems related to covert phrase boundaries as problems related to some word or words modifying the wrong word or words. Problems related to potential covert phrase boundaries can also be identified via statistical analysis. As discussed below, the SOLx system includes a statistical tool called the N-gram analyzer (NGA) that analyzes databases to determine, among other things, what terms appear most commonly and which terms appear in proximity to one another. A mistranslated phrase identified in the quality control analysis (described below in relation to the TQE module) which has a low NGA probability for the transition between two or more pairs of words suggests a covert phrase boundary. Problems related to covert phrase boundaries can also be addressed through modifying a schematic representation of the data under analysis. In this regard, if a covert phrase boundary problem is identified, it is often a result of attribute rules that failed to identify an attribute. This can be resolved by modifying the schema to include an appropriate attribute rule. If a schema has not yet been produced for the data, a schema can be constructed at this time. Once a categorization or attribute rule has been constructed for a phrase that the translator/translation evaluator has identified as poorly translated, then the original text can be re-translated. If the result is a well-translated phrase, the problem has been identified as one of a covert phrase boundary and the operator may consider constructing more labeling rules for the data under analysis. Covert phrase boundary problems can be addressed by building a schema, and then running the schematized data through a SOLx process that inserts a phrase boundary at the location of every labeling/tagging rule.
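A minimal sketch of the kind of transition statistic described above is shown below; the sample corpus, the maximum-likelihood estimate, and the function names are illustrative assumptions rather than the actual NGA implementation.

# Hypothetical transition-probability estimate: a low P(next | current) between
# two adjacent words suggests a covert phrase boundary between them.
from collections import Counter

corpus = [
    "acetic acid glass bottle",
    "acetic acid plastic bottle",
    "hydrochloric acid glass bottle",
]

unigrams = Counter()
bigrams = Counter()
for line in corpus:
    tokens = line.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def transition_probability(first: str, second: str) -> float:
    """Maximum-likelihood estimate of P(second | first)."""
    if unigrams[first] == 0:
        return 0.0
    return bigrams[(first, second)] / unigrams[first]

print(transition_probability("acetic", "acid"))   # 1.0 -> strong cohesion within the phrase
print(transition_probability("acid", "glass"))    # ~0.67 -> weaker transition; a candidate boundary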
The core item of a typical business content description is the item that is being sold/described. An item description often consists of its core item and some terms that describe its various attributes. The core item in an item description can generally be found by answering the question: what is the item that is being sold or described here? For example, in the item description Black and Decker 3/8" drill with accessories, the item that is being described is a drill. The words or phrases Black and Decker, 3/8", and with accessories all give additional information about the core item, but do not represent the core item itself.
A subject matter expert (SME) configuring SOLx for a particular application can leverage his domain-specific knowledge by listing the attributes of core items before beginning work with SOLx, and by listing the values of attributes before beginning work with SOLx. Both classification rules and attribute rules can then be prepared before manipulating data with the SOLx system. Domain-specific knowledge can also be leveraged by recognizing core items and attributes and their values during configuration of the SOLx system and writing rules for them as they appear. As the SME works with the data within the SOLx system, he can write rules for the data as the need appears. The Candidate Search Engine can also be used to perform a collocation search that outputs pairs of words that form collocations. If one of those words represents a core item, then the other word may represent an attribute, a value, or (in some sense) both. Attribute-value pairs can also be identified based on a semantic category search implemented by the SOLx system. The semantic category search outputs groups of item descriptions that share words belonging to a specific semantic category. Words from a specific semantic category that appear in similar item descriptions may represent a value, an attribute, or (in some sense) both.
Business content is generally characterized by a high degree of structure that facilitates writing phrasing rules and allows for efficient reuse of content "chunks." As discussed above, much content relating to product descriptions and other structured content is not free-flowing sentences, but is an abbreviated structure called a 'noun phrase'. Noun phrases are typically composed of mixtures of nouns (N), adjectives (A), and occasionally prepositions (P). The mixtures of nouns and adjectives may be nested. The following are some simple examples:
Table 1 (examples of noun phrases composed of nouns, adjectives, and prepositions; table image not reproduced in this text)
Adjective phrases also exist mixed with adverbs (Av). Table 2 lists some examples.
Table 2 (examples of adjective phrases mixed with adverbs; table image not reproduced in this text)
The noun phrase four-strand color-coded twisted-pair telephone wire has the pattern NNNAANNN. It is grouped as (fourN strandN)N (colorN codedA)A (twistedA pairN)N telephoneN wireN. Another way to look at this item is as an object-attribute list. The primary word or object is wire; of use type telephone; strand type twisted-pair; color property color-coded; and strand number type four-stranded. The structure is N1AN2N3N4. With this type of compound grouping, each group is essentially independent of any other group. Hence, the translation within each group is performed as an independent phrase and then linked by relatively simple linguistic rules. For example, regroup N1AN2N3N4 as NN3N4 where N = N1AN2. In Spanish this can be translated as NN3N4 -> N4 'de' N3 'de' {N}, where {N} means the translated version of N, and -> means translated as. In Spanish, N1AN2 -> N2A 'de' N1. The phrase then translates as N1AN2N3N4 -> N4 'de' N3 'de' N2A 'de' N1. In addition to defining simple rule sets for associating translated components of noun phrases, there is another factor that leads to the feasibility of automatically translating large component databases. This additional observation is that very few terms are used in creating these databases. For example, databases have been analyzed that have 70,000 part descriptions, yet are made up of only 4,000 words or tokens. Further, individual phrases are used hundreds of times. In other words, if the individual component pieces or "chunks" are translated, and there are simple rules for relating these chunks, then the translation of large parts of the content, in principle, is straightforward. The SOLx system includes tools as discussed in more detail below for identifying reusable chunks, developing rules for translation and storing translated terms/chunks for facilitating substantially real-time transformation of electronic content.
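The reassembly rule discussed above can be illustrated as follows; the glossary entries (rough Spanish renderings, without accents) and the function signature are assumptions made for this example only.

# Hypothetical reassembly of independently translated chunks using the rule
# N1 A N2 N3 N4 -> N4 'de' N3 'de' N2A 'de' N1 described above.
chunk_translations = {
    "four-strand": "cuatro hilos",          # assumed glossary entries
    "color-coded": "codificado por color",
    "twisted-pair": "par trenzado",
    "telephone": "telefono",
    "wire": "alambre",
}

def translate_noun_phrase(n1, a, n2, n3, n4):
    """Apply N1 A N2 N3 N4 -> N4 'de' N3 'de' N2A 'de' N1 (the adjective follows its noun)."""
    def t(chunk):
        return chunk_translations.get(chunk, chunk)
    return f"{t(n4)} de {t(n3)} de {t(n2)} {t(a)} de {t(n1)}"

print(translate_noun_phrase("four-strand", "color-coded", "twisted-pair", "telephone", "wire"))
# -> "alambre de telefono de par trenzado codificado por color de cuatro hilos"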
Another objective of the configuration process is enabling SOLx to resolve certain ambiguities. Ambiguity exists when a language processing system does not know which of two or more possible analyses of a text string is the correct one. There are two kinds of ambiguity in item descriptions: lexical ambiguity and structural ambiguity. When properly configured, the SOLx system can often resolve both kinds of ambiguity.
Lexical ambiguity occurs when a language processing system does not know which of two or more meanings to assign to a word. For example, the abbreviation mil can have many meanings, including million, millimeter, military, and Milwaukee. In a million-item database of tools and construction materials, it may occur with all four meanings. In translation, lexical ambiguity leads to the problem of the wrong word being used to translate a word in your input. To translate your material, it is useful to expand the abbreviation to each of its different full forms in the appropriate contexts. The user can enable the SOLx system to do this by writing labeling rules that distinguish the different contexts from each other. For example, mil might appear with the meaning million in the context of a weight, with the meaning millimeter in the context of a length, with the meaning military in the context of a specification type (as in the phrase MIL SPEC), and with the meaning Milwaukee in the context of brand of a tool. You then write vocabulary adjustment rules to convert the string mil into the appropriate full form in each individual context. In schematization, resolving lexical ambiguity involves a number of issues, including identification of the core item in an item description; identification of values for attributes; and assignment of values to proper attributes.
Lexical ambiguity may also be resolved by reference to an associated classification. The classification may be specific to the ambiguous term or a related term, e.g., another term in the same noun phrase. Thus, for example, the ambiguous abbreviation "mil" may be resolved by 1) noting that it forms an attribute of an object-attribute list, 2) identifying the associated object (e.g., drill), 3) identifying a classification of the object (e.g., power tool), and 4) applying a rule set for that classification to select a meaning for the term (e.g., mil -> Milwaukee). These relationships may be defined by the schema. Structural ambiguity occurs when a language processing system does not know which of two or more labeling rules to use to group together sets of words within an item description. This most commonly affects attribute rules and may require further nesting of parent/child tag relationships for proper resolution. Again, a related classification may assist in resolving structural ambiguity.
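The following sketch illustrates the kind of classification-driven disambiguation described above for the abbreviation mil; the class names and lookup table are assumptions for illustration and do not reflect an actual SOLx rule set.

# Hypothetical context-sensitive expansion of the ambiguous abbreviation "mil":
# the classification of the core item selects which full form applies.
EXPANSION_BY_CLASS = {
    "power tool": "Milwaukee",       # brand context, e.g. a drill description
    "specification": "military",     # e.g. "MIL SPEC" fastener
    "length": "millimeter",
    "weight": "million",
}

def expand_mil(term: str, item_class: str) -> str:
    if term.lower() != "mil":
        return term
    return EXPANSION_BY_CLASS.get(item_class, term)

print(expand_mil("mil", "power tool"))      # Milwaukee
print(expand_mil("mil", "specification"))   # military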
3. Configuration Processes
a. Normalization
As the foregoing discussion suggests, the various configuration objectives (e.g., resolving false OOVs, identifying covert phrase boundaries, taking advantage of reusable chunks and resolving ambiguities) can be addressed in accordance with the present invention by transforming input data from its native form into an intermediate form that is more amenable to translation or other localization/transformation. The corresponding process, which is a primary purpose of SOLx system configuration, is termed "normalization." Once normalized, the data will include standardized terminology in place of idiosyncratic terms, will reflect various grammar and other rules that assist in further processing, and will include tags that provide context including classification information for resolving ambiguities and otherwise promoting proper transformation. The associated processes are executed using the Normalization Workbench of the SOLx system, as will be described below. There are two kinds of rules developed using the Normalization Workbench: grammatical rules, and normalization rules. The purpose of a grammatical rule is to group together and label a section of text. The purpose of a normalization rule is to cause a labeled section of text to undergo some change. Although these rules are discussed in detail below in order to provide a more complete understanding of the present invention, it will be appreciated that these rules are, to a large extent, developed and implemented internally by the various SOLx tools. Accordingly, SOLx operators need not have linguistics expertise to realize the associated advantages.
i) Normalization Rules
The Normalization Workbench offers a number of different kinds of normalization rules relating to terminology including: replacement rules, joining rules, and ordering rules. Replacement rules allow the replacement of one kind of text with another kind of text. Different kinds of replacement rules allow the user to control the level of specificity of these replacements. Joining rules allow the user to specify how separated elements should be joined together in the final output. Ordering rules allow the user to specify how different parts of a description should be ordered relative to each other.
With regard to replacement rules, data might contain instances of the word centimeter written four different ways — as cm, as cm., as c.m., and as centimeter — and the user might want to ensure that it always appears as centimeter. The Normalization Workbench implements two different kinds of replacement rules: unguided replacement, and guided replacement. The rule type that is most easily applicable to a particular environment can be selected. Unguided replacement rules allow the user to name a tag/category type, and specify a text string to be used to replace any text that is under that tag. Guided replacement rules allow the user to name a tag/category type, and specify specific text strings to be used to replace specific text strings that are under that tag. Within the Normalization Workbench logic, the format of unguided replacement rules may be, for example:
[category_type] => 'what to replace its text with'
For instance, the following rule says to find any [foot] category label, and replace the text that it tags with the word feet:
[foot] => 'feet'
If that rule was run against the following input,
Steel piping 6 [foot] foot long
Steel piping 3 [foot] feet long
it would produce the following output:
Steel piping 6 [foot] feet long
Steel piping 3 [foot] feet long
The second line is unchanged; in the first line, foot has been changed to feet.
Guided replacement rules allow the user to name a tag/category type, and specify specific text strings to be used to replace specific text strings that are under that tag. This is done by listing a set of possible content strings in which the normalization engine should "look up" the appropriate replacement. The format of these rules is:
[category_type] :: lookup
'text to replace' => 'text to replace it with'
'other text to replace' => 'text to replace it with'
'more text to replace' => 'text to replace it with'
end lookup
For instance, the following rule says to find any [length_metric] label. If you see mm, mm., m.m., or m. m. beneath it, then replace it with millimeter. If you see cm, cm., c.m., or c. m. beneath it, then replace it with centimeter:
[length_metric] :: lookup
'mm' => 'millimeter'
'mm.' => 'millimeter'
'm.m.' => 'millimeter'
'm. m.' => 'millimeter'
'cm' => 'centimeter'
'cm.' => 'centimeter'
'c.m.' => 'centimeter'
'c. m.' => 'centimeter'
end lookup
If that rule was run against the following input
Stainless steel scalpel handle, [length_metric] ( 5 mm )
[length_metric] ( 5 mm ) disposable plastic scalpel handle
it would produce the following output:
Stainless steel scalpel handle, [length_metric] ( 5 millimeter )
[length_metric] ( 5 millimeter ) disposable plastic scalpel handle
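The effect of such a guided replacement rule may be illustrated as follows; the tag-parsing regular expression and the function name are assumptions made for this sketch rather than the actual normalization engine logic.

# Hypothetical application of a guided replacement rule: text under a named
# category tag is rewritten via a lookup table, as in the [length_metric] rule.
import re

LENGTH_METRIC_LOOKUP = {
    "mm": "millimeter", "mm.": "millimeter", "m.m.": "millimeter",
    "cm": "centimeter", "cm.": "centimeter", "c.m.": "centimeter",
}

def apply_guided_replacement(text: str, category: str, lookup: dict) -> str:
    # Matches spans of the form "[category] ( ... )" and rewrites tokens inside.
    pattern = re.compile(r"\[" + re.escape(category) + r"\]\s*\(([^)]*)\)")
    def rewrite(match: re.Match) -> str:
        tokens = [lookup.get(tok, tok) for tok in match.group(1).split()]
        return f"[{category}] ( {' '.join(tokens)} )"
    return pattern.sub(rewrite, text)

line = "Stainless steel scalpel handle, [length_metric] ( 5 mm )"
print(apply_guided_replacement(line, "length_metric", LENGTH_METRIC_LOOKUP))
# -> "Stainless steel scalpel handle, [length_metric] ( 5 millimeter )"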
From the user's perspective, such replacement rules may be implemented via a simple user interface such as shown in Fig. 1. Fig. 1 shows a user interface screen 100 including a left pane 102 and a right pane 104. The left pane 102 displays the grammar rules that are currently in use. The rules are shown graphically, including alternative expressions (in this case) as well as rule relationships and categories. Many alternative expressions or candidates therefor are automatically recognized by the workbench and presented to the user. The right pane 104 reflects the process to update or add a text replacement rule. In operation, a grammar rule is selected in the left pane 102. All text that can be recognized by the rule appears in the left column of the table 106 in the right pane 104. The SME then has the option to unconditionally replace all text with the string from the right column of the table 106 or may conditionally enter a replacement string. Although not shown in each case below, similar interfaces allow for easy development and implementation of the various rules discussed herein. It will be appreciated that "liter" and "ounce" together with their variants thus are members of the class "volume" and the left pane 102 graphically depicts a portion of a taxonomy associated with a schema.
Joining rules allow the user to specify how separated elements should be joined together in the final output. Joining rules can be used to re-join elements that were separated during the process of assigning category labels. The user can also use joining rules to combine separate elements to form single delimited fields.
Some elements that were originally adjacent in the input may have become separated in the process of assigning them category labels, and it may be desired to re-join them in the output. For example, the catheter tip configuration JL4 will appear as [catheter_tip_configuration] (J L 4) after its category label is assigned. However, the customary way to write this configuration is with all three of its elements adjacent to each other. Joining rules allow the user to join them together again.
The user may wish the members of a particular category to form a single, delimited field. For instance, you might want the contents of the category label [litter_box] ( plastic hi-impact scratch-resistant ) to appear as plastic, hi-impact, scratch-resistant in order to conserve space in your data description field. Joining rules allow the user to join these elements together and to specify that a comma be used as the delimiting symbol.
The format of these rules is:
[category_label] :: join with 'delimiter'
The delimiter can be absent, in which case the elements are joined immediately adjacent to each other. For example, numbers emerge from the category labeler with spaces between them, so that the number twelve looks like this:
[real] ( 1 2 )
A standard normalization rule file supplied with the Normalization Workbench contains the following joining rule:
[real] :: join with ''
This rule causes the numbers to be joined to each other without an intervening space, producing the following output:
[real] ( 12 )
The following rule states that any content that appears with the category label [litter_box] should be joined together with commas:
[litter_box] :: join with ','
If that rule was run against the following input,
[litter_box] ( plastic hi-impact dog-repellant )
[litter_box] ( enamel shatter-resistant )
it would produce the following output:
[litter_box] ( plastic, hi-impact, dog-repellant )
[litter_box] ( enamel, shatter-resistant )
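The following sketch illustrates the effect of a joining rule; again, the tag-parsing approach and function name are assumptions for illustration only, and a delimiter of ", " could be used instead of "," if a space after each comma is desired.

# Hypothetical joining rule: the elements under a category tag are joined
# with a chosen delimiter, as in "[litter_box] :: join with ','".
import re

def apply_joining_rule(text: str, category: str, delimiter: str) -> str:
    pattern = re.compile(r"\[" + re.escape(category) + r"\]\s*\(([^)]*)\)")
    def join(match: re.Match) -> str:
        elements = match.group(1).split()
        return f"[{category}] ( {delimiter.join(elements)} )"
    return pattern.sub(join, text)

print(apply_joining_rule("[litter_box] ( plastic hi-impact dog-repellant )",
                         "litter_box", ","))
# -> "[litter_box] ( plastic,hi-impact,dog-repellant )"
print(apply_joining_rule("[real] ( 1 2 )", "real", ""))
# -> "[real] ( 12 )"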
Ordering rules allow the user to specify how different parts of a description should be ordered relative to each other. For instance, input data might contain catheter descriptions that always contain a catheter size and a catheter type, but in varying orders — sometimes with the catheter size before the catheter type, and sometimes with the catheter type before the catheter size:
[catheter] ( [catheter_size] ( 8Fr ) [catheter_type] ( JL4 ) [item] ( catheter ) )
[catheter] ( [catheter_type] ( JL5 ) [catheter_size] ( 8Fr ) [item] ( catheter ) )
The user might prefer that these always occur in a consistent order, with the catheter size coming first and the catheter type coming second. Ordering rules allow the user to enforce this ordering consistently.
The internal format of ordering rules is generally somewhat more complicated than that of the other types of rules. Ordering rules generally have three parts. Beginning with a simple example:
[catheter] / [catheter_type] [catheter_size] => ( $2 $1 )
The first part of the rule, [catheter] /, specifies that this rule should only be applied to the contents of a [catheter] category label:
[catheter] / [catheter_type] [catheter_size] => ( $2 $1 )
The second part of the rule, [catheter_type] [catheter_size], specifies which labeled elements are to have their orders changed:
[catheter] / [catheter_type] [catheter_size] => ( $2 $1 )
Each of those elements is assigned a number, which is written in the format $number in the third part of the rule. The third part of the rule, ( $2 $1 ), specifies the order in which those elements should appear in the output:
[catheter] / [catheter_type] [catheter_size] => ( $2 $1 )
The order $2 $1 indicates that the element which was originally second (i.e., $2) should be first (since it appears in the leftmost position in the third part of the rule), while the element which was originally first (i.e., $1) should be second (since it appears in the second position from the left in the third part of the rule). Ordering rules can appear with any number of elements. For example, this rule refers to a category label that contains four elements. The rule switches the position of the first and third elements of its input, while keeping its second and fourth elements in their original positions:
[resistor] / [resistance] [tolerance] [wattage] [manufacturer] => ( $3 $2 $1 $4 )
Fig. 2 shows an example of a user interface screen 200 that may be used to develop and implement an ordering rule. The screen 200 includes a left pane 202 and a right pane 204. The left pane 202 displays the grammar rules that are currently in use - in this case, ordering rules for container size - as well as various structural productions under each rule. The right pane 204 reflects the process to update or add structural reorganization to the rule. In operation, a structural rule is selected using the left pane 202. The right pane 204 can then be used to develop or modify the rule. In this case, the elements or "nodes" can be reordered by a simple drag-and-drop process. Nodes may also be added or deleted using simple mouse or keypad commands. Ordering rules are very powerful, and have other uses besides order-changing per se. Other uses for ordering rules include the deletion of unwanted material, and the addition of desired material.
To use an ordering rule to delete material, the undesired material can be omitted from the third part of the rule. For example, the following rule causes the deletion of the second element from the product description:
[notebook] / [item] [academic_field] [purpose] => ( $1 $3 )
If that rule was run against the following input,
[notebook] ( [item] ( notebook ) [academic_field] (linguistics) [purpose] (fieldwork) )
[notebook] ( [item] ( notebook ) [academic_field] (sociology) [purpose] (fieldwork) )
it would produce the following output:
[notebook] ( [item] ( notebook ) [purpose] ( fieldwork ) )
[notebook] ( [item] ( notebook ) [purpose] ( fieldwork ) )
To use an ordering rule to add desired material, the desired material can be added to the third part of the rule in the desired position relative to the other elements. For example, the following rule causes the string [real_cnx]'-' to be added to the product description:
[real] / [integer] [fraction] => ( $1 [real_cnx]'-' $2 )
If that rule was run against the following input,
[real] ( 11/2 )
[real] ( 15/8 )
it would produce the following output:
[real] ( 1 [real_cnx] ( - ) 1/2 )
[real] ( 1 [real_cnx] ( - ) 5/8 )
After final processing, this converts the confusing 11/2 and 15/8 to 1-1/2 ("one and a half") and 1-5/8 ("one and five eighths").
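The behavior of ordering rules, including deletion and insertion of material, may be illustrated with the following sketch; the list-based rule representation is an assumption made for this example and is not the internal SOLx format.

# Hypothetical ordering rule application: the output template lists element
# positions ($1, $2, ...) and may omit elements or insert literal material.
def apply_ordering_rule(elements: list, output_template: list) -> list:
    """elements: labeled values in input order; template: '$n' references or literals."""
    result = []
    for slot in output_template:
        if slot.startswith("$"):
            result.append(elements[int(slot[1:]) - 1])  # $n is 1-based
        else:
            result.append(slot)  # literal material to insert
    return result

# Reorder, loosely following [catheter] / [catheter_type] [catheter_size] => ( $2 $1 )
print(apply_ordering_rule(["JL4", "8Fr"], ["$2", "$1"]))                       # ['8Fr', 'JL4']
# Delete, loosely following [notebook] / [item] [academic_field] [purpose] => ( $1 $3 )
print(apply_ordering_rule(["notebook", "linguistics", "fieldwork"], ["$1", "$3"]))
# Insert, loosely following [real] / [integer] [fraction] => ( $1 [real_cnx]'-' $2 )
print(apply_ordering_rule(["1", "1/2"], ["$1", "-", "$2"]))                    # ['1', '-', '1/2']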
In addition to the foregoing normalization rules relating to terminology, the SOLx system also involves normalization rules relating to context cues, including classification and phrasing. The rules that the SOLx system uses to identify contexts and determine the location and boundaries of attribute/value pairs fall into three categories: categorization rules, attribute rules, and analysis rules. Categorization rules and attribute rules together form a class of rules known as labeling/tagging rules. Labeling/tagging rules cause the insertion of labels/tags in the output text when the user requests parsed or labeled/tagged texts. They form the structure of the schema in a schematization task, and they become phrase boundaries in a machine translation task. Analysis rules do not cause the insertion of labels/tags in the output. The corresponding analysis tags are inserted temporarily by the SOLx system during the processing of input, and are deleted from the output before it is displayed.
Although analysis tags are not displayed in the output (SOLx can allow the user to view them if the data is processed in a defined interactive mode), they are very important to the process of determining contexts for vocabulary adjustment rules and for determining where labels/tags should be inserted. The analysis process is discussed in more detail below.
ii. Grammar Rules
The various rules described above for establishing normalized content are based on grammar rules developed for a particular application. The process for developing grammar rules is set forth in the following discussion. Again, it will be appreciated that the SOLx tools guide an SME through the development of these rules and the SME need not have any expertise in this regard. There are generally two approaches to writing grammar rules, known as "bottom up" and "top down." Bottom-up approaches to writing grammar rules begin by looking for the smallest identifiable units in the text and proceed by building up to larger units made up of cohesive sets of the smaller units. Top-down approaches to writing grammar rules begin by identifying the largest units in the text, and proceed by identifying the smaller cohesive units of which they are made.
Consider the following data for an example of building grammar rules from the bottom up. It consists of typical descriptions of various catheters used in invasive cardiology:
8Fr. JR4 Cordis
8 Fr. JR5 Cordis
8Fr JL4 catheter, Cordis, 6/box
8Fr pigtail 6/box
8 French pigtail catheter, 135 degree
8Fr Sones catheter, reusable
4Fr. LC angioplasty catheter with guidewire and peelaway sheath
Each of these descriptions includes some indication of the (diametric) size of the catheter, shown in bold text below:
8Fr. JR4 Cordis
8 Fr. JR5 Cordis
8Fr JL4 catheter, Cordis, 6/box
8Fr pigtail 6/box
8 French pigtail catheter, 135 degree
8Fr Sones catheter, reusable
4Fr. LC angioplasty catheter with guidewire and peelaway sheath
One can make two very broad generalizations about these indications of catheter size: all of them include a digit, and the digits all seem to be integers.
One can further make two weaker generalizations about these indications of catheter size: all of them include either the letters Fr, or the word French; and if they include the letters Fr, those two letters may or may not be followed by a period. A subject matter expert (SME) operating the SOLx system will know that Fr, Fr., and French are all tokens of the same thing: some indicator of the unit of catheter size. Having noted these various forms in the data, a first rule can be written. It will take the form x can appear as w, y, or z, and this rule will describe the different ways that x can appear in the data under analysis.
The basic fact that the rule is intended to capture is French can appear as Fr, as Fr., or as French.
In the grammar rules formalism, that fact may be indicated like this:
[French] (Fr)
(Fr.)
(French)
[French] is the name assigned to the category of "things that can be forms of the word that expresses the unit of size of catheters" and could just as well have been called [catheter_size_unit], or [Fr], or [french]. The important thing is to give the category a label that is meaningful to the user.
(Fr), (Fr.), and (French) are the forms that a thing that belongs to the category [French] can take. Although the exact name for the category [French] is not important, it matters much more how these "rule contents" are written. For example, the forms may be case sensitive. That is, (Fr) and (fr) are different forms. If your rule contains the form (Fr), but not the form (fr), then if there is a description like this:
8 fr cordis catheter
The fr in the description will not be recognized as expressing a unit of catheter size. Similarly, if your rule contained the form (fr), but not the form (Fr), then Fr would not be recognized. "Upper-case" and "lower-case" distinctions may also matter in this part of a rule.
Returning to the list of descriptions above, a third generalization can be made: all of the indications of catheter size include an integer followed by the unit of catheter size.
This suggests another rule, of the form all x consist of the sequence a followed by b. The basic fact that the rule is intended to capture is: all indications of catheter size consist of a number followed by some form of the category [French].
In the grammar rules formalism, that fact may be indicated like this:
>[catheter_size] ([real] [French])
[catheter_size] is the name assigned to the category of "groups of words that can indicate the size of a catheter;" and could just as well have been called [size], or [catheterSize], or [sizeOfACatheter]. The important thing is to give the category a label that is meaningful to the user.
([real] [French]) is the part of the rule that describes the things that make up a [catheter_size] — that is, something that belongs to the category of things that can be [French], and something that belongs to the categories of things that can be [real] — and what order they have to appear in — in this case, the [real] first, followed by the [French]. In this part of the rule, exactly how things are written is important.
In this rule, the user is able to make use of the rule for [French] that was defined earlier. Similarly, the user is able to make use of the [real] rule for numbers that can generally be supplied as a standard rule with the Normalization Workbench. Rules can make reference to other rules. Furthermore, rules do not have to be defined in the same file to be used together, as long as the parser reads in the file in which they are defined.
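For purposes of illustration, the [catheter_size] rule above might be recognized as sketched below; encoding the grammar rule as a regular expression is an assumption made for this example and is not the parser's actual mechanism.

# Hypothetical recognition of the [catheter_size] rule ([real] followed by a
# form of [French]) in an item description, encoded here as a regular expression.
import re

FRENCH_FORMS = ["Fr.", "Fr", "French"]          # the forms listed in the [French] rule
REAL = r"\d+(?:\.\d+)?"
CATHETER_SIZE = re.compile(
    r"(" + REAL + r")\s*(" + "|".join(re.escape(f) for f in FRENCH_FORMS) + r")"
)

def tag_catheter_size(description: str) -> str:
    return CATHETER_SIZE.sub(r"[catheter_size] (\1 \2)", description)

print(tag_catheter_size("8Fr. JR4 Cordis"))
# -> "[catheter_size] (8 Fr.) JR4 Cordis"
print(tag_catheter_size("8 French pigtail catheter, 135 degree"))
# -> "[catheter_size] (8 French) pigtail catheter, 135 degree"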
So far this example has involved a set of rules that allows description of the size of every catheter in a list of descriptions. The SME working with this data might then want to write a set of rules for describing the various catheter types in the list. Up to this point, this example has started with the smallest units of text that could be identified (the different forms of [French]) and worked up from there (to the [catheter_size] category). Now, the SME may have an idea of a higher-level description (i.e., catheter type), but no lower-level descriptions to build it up out of; in this case, the SME may start at the top, and think his way down through a set of rules. The SME can see that each of these descriptions includes some indication of the type of the catheter, shown in bold text below:
8Fr. JR4 Cordis
8 Fr. JR5 Cordis
8Fr JL4 catheter, Cordis, 6/box
8Fr pigtail 6/box
8 French pigtail catheter, 135 degree
8Fr Sones catheter, reusable
4Fr. angioplasty catheter with guidewire and peelaway sheath
He is aware that a catheter type can be described in one of two ways: by the tip configuration of the catheter, and by the purpose of the catheter. So, the SME may write a rule that captures the fact that catheter types can be identified by tip configuration or by catheter purpose.
In the grammar rules formalism, that fact may be indicated like this:
>[catheter_type]
([catheter_tip_configuration]) ([catheter_purpose])
This involves a rule for describing tip configuration, and a rule for identifying a catheter's purpose.
Starting with tip configuration, the SME knows that catheter tip configurations can be described in two ways: 1) by a combination of the inventor's name, an indication of which blood vessel the catheter is meant to engage, and by an indication of the length of the curve at the catheter tip; or 2) by the inventor's name alone.
The SME can write a rule that indicates these two possibilities in this way:
[catheter_tip_configuration]
([inventor] [coronary_artery] [curve_size]) ([inventor])
In this rule, [catheter_tip_configuration] is the category label; ([inventor] [coronary_artery] [curve_size]) and ([inventor]) are the two forms that things that belong to this category can take. In order to use these rules, the SME will need to write rules for [inventor], [coronary_artery], and [curve_size]. The SME knows that in all of these cases, the possible forms that something that belongs to one of these categories can take are very limited, and can be listed, similarly to the various forms of [French]:
[inventor]
(J)
(Sones)
[coronary_artery]
(L)
(R)
[curve_size]
(3.5)
(4)
(5)
With these rules, the SME has a complete description of the [catheter_tip_configuration] category. Recall that the SME is writing a [catheter_tip_configuration] rule because there are two ways that a catheter type can be identified: by the configuration of the catheter's tip, and by the catheter's purpose. The SME has the [catheter_tip_configuration] rule written now and just needs a rule that captures descriptions of a catheter's purpose. The SME is aware that (at least in this limited data set) a catheter's purpose can be directly indicated, e.g. by the word angioplasty, or can be inferred from something else — in this case, the catheter's shape, as in pigtail. So, the SME writes a rule that captures the fact that catheter purpose can be identified by purpose indicators or by catheter shape.
In the grammar rules formalism, that fact can be indicated like this:
[catheter_purpose]
([catheter_purpose_indicator]) ([catheter_shape])
The SME needs a rule for describing catheter purpose, and a rule for describing catheter shape. Both of these can be simple in this example:
[catheter_purpose_indicator]
(angioplasty)
[catheter_shape]
(pigtail)
With this, a complete set of rules is provided for describing catheter type, from the "top" (i.e., the [catheter_type] rule) "down" (i.e., to the rules for [inventor], [coronary_artery], [curve_size], [catheter_purpose], and [catheter_shape]).
"Top-down" and "bottom-up" approaches to writing grammar rules are both effective, and an SME should use whichever is most comfortable or efficient for a particular data set. The bottom-up approach is generally easier to troubleshoot; the top-down approach is more intuitive for some people. A grammar writer can use some combination of both approaches simultaneously. Grammar rules include a special type of rule called a wanker. Wankers are rules for category labels that should appear in the output of the token normalization process. In one implementation, wankers are written similarly to other rules, except that their category label starts with the symbol >. For example, in the preceding discussion, we wrote the following wanker rules:
>[catheter_size]
([real] [French])
>[catheter_type]
([catheter_tip_configuration]) ([catheter_purpose])
Other rules do not have this symbol preceding the category label, and are not wankers.
Chunks of text that have been described by a wanker rule will be tagged in the output of the token normalization process. For example, with the rule set that we have defined so far, including the two wankers, we would see output like the following:
[catheter_size] (8Fr.) [catheter_type] (JR4) Cordis
[catheter_size] (8 Fr.) [catheter_type] (JR5) Cordis
[catheter_size] (8Fr) [catheter_type] (JL4) catheter, Cordis, 6/box
[catheter_size] (8Fr) [catheter_type] (pigtail) 6/box
[catheter_size] (8 French) [catheter_type] (pigtail) catheter, 135 degree
[catheter_size] (8Fr) [catheter_type] (Sones) catheter, reusable
[catheter_size] (4Fr.) LC [catheter_type] (angioplasty) catheter with guidewire and peelaway sheath
Although the other rules are used in this example to define the wanker rules, and to recognize their various forms in the input text, since the other rules are not wankers, their category labels do not appear in the output. If at some point it is desired to make one or more of those other rules' category labels appear in the output, the SME or other operator can cause them to do so by converting those rules to wankers.
Besides category labels, the foregoing example included two kinds of things in rules. First, the example included rules that contained other category labels. These "other" category labels are identifiable in the example by the fact that they are always enclosed in square brackets, e.g.,
[catheter_purpose]
([catheter_purpose_indicator]) ([catheter_shape])
The example also included rules that contained strings of text that had to be written exactly the way that they would appear in the input. These strings are identifiable by the fact that they are directly enclosed by parentheses, e.g.
[French] (Fr) (Fr.) (French)
There is a third kind of thing that can be used in a rule. These things, called regular expressions, allow the user to specify approximately what a description will look like. Regular expressions can be recognized by the facts that, unlike the other kinds of rule contents, they are not enclosed by parentheses, and they are immediately enclosed by "forward slashes."
Regular expressions in rules look like this:
[angiography_catheter_french_size]
/7|8/
[rocket_engine_size]
/^X\d{2}/
[naval_vessel_hull_number]
/\w+\d+/
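The following sketch illustrates how such regular-expression rule contents behave when matched against candidate strings; the sample strings are assumptions chosen for illustration only.

# Hypothetical evaluation of regular-expression rule contents such as
# /7|8/, /^X\d{2}/, and /\w+\d+/ against candidate strings.
import re

RULES = {
    "angiography_catheter_french_size": r"7|8",
    "rocket_engine_size": r"^X\d{2}",
    "naval_vessel_hull_number": r"\w+\d+",
}

def categories_matching(text: str) -> list:
    """Return the category labels whose regular expression matches the text."""
    return [label for label, pattern in RULES.items() if re.search(pattern, text)]

print(categories_matching("X15"))      # ['rocket_engine_size', 'naval_vessel_hull_number']
print(categories_matching("CVN65"))    # ['naval_vessel_hull_number']
print(categories_matching("8"))        # ['angiography_catheter_french_size']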
Although the foregoing example illustrated specific implementations of specific rules, it will be appreciated that a virtually endless variety of specialized rules may be provided in accordance with the present invention. The SOLx system of the present invention consists of many components, as will be described below. One of these components is the Natural Language Engine module, or NLE. The NLE module evaluates each item description in data under analysis by means of rules that describe the ways in which core items and their attributes can appear in the data. The exact (machine-readable) format that these rules take can vary depending upon the application involved and computing environment. For present purposes, it is sufficient to realize that these rules express relationships like the following (stated in relation to the drill example discussed above):
• Descriptions of a drill include the manufacturer's name, the drill size, and may also include a list of accessories and whether or not it is battery powered.
• A drill's size may be three eighths of an inch or one half inch
• inch may be written as inch or as "
• If inch is written as ", then it may be written with or without a space between the numbers 3/8 or 1/2 and the "
The NLE checks each line of the data individually to see if any of the rules seem to apply to that line. If a rule seems to apply, then the NLE inserts a label/tag and marks which string of words that rule seemed to apply to. For example, for the set of rules listed above, then in the item description Black and Decker 3/8" drill with accessories, the NLE module would notice that 3/8" might be a drill size, and would mark it as such. If the user is running the NLE in interactive mode, he may observe something like this in the output:
[drill_size] (3/8")
In addition to the rules listed above, a complete set of rules describing the ways that item descriptions of drills and their attributes can appear would also include rules for manufacturers' names, accessory lists, and whether or not the drill is battery powered. If the user writes such a set of rules, then in the item description Black and Decker 3/8" drill with accessories, the NLE module will notice and label/tag the following attributes of the description:
[manufacturer_name] (Black and Decker) [drill_size] (3/8")
The performance of the rules can be analyzed in two stages. First, determine whether or not the rules operate adequately. Second, if rules that do not operate adequately are identified, determine why they do not operate adequately.
For translations, the performance of the rules can be determined by evaluating the adequacy of the translations in the output text. For schematization, the performance of the rules can be determined by evaluating the adequacy of the schema that is suggested by running the rule set. For any rule type, if a rule has been identified that does not perform adequately, it can be determined why it does not operate adequately by operating the NLE component in interactive mode with output to the screen. For tagging rules, a test data set can be analyzed to determine whether every item that should be labeled/tagged has been labeled/tagged and whether any item that should not have been labeled/tagged has been labeled/tagged in error.
In order to evaluate the rules in this way, the test data set must include both items that should be labeled/tagged, and items that should not be tagged.
Vocabulary adjustment rules operate on data that has been processed by labeling/tagging rules, so troubleshooting the performance of vocabulary adjustment rules requires attention to the operation of labeling/tagging rules, as well as to the operation of the vocabulary adjustment rules themselves.
In general, the data set selected to evaluate the performance of the rules should include: examples of different types of core items, and for each type of core item, examples with different sets of attributes and/or attribute values.
b. Processing
1. Searching
Normalization facilitates a variety of further processing options. One important type of processing is translation as noted above and further described below. However, other types of processing in addition to or instead of translation are enhanced by normalization including database and network searching, document location and retrieval, interest/personality matching, information aggregation for research/analysis, etc.
For purposes of illustration, a database and network searching application will now be described. It will be appreciated that this is closely related to the context assisted searching described above. In many cases, it is desirable to allow for searching across semantic boundaries. For example, a potential individual or business consumer may desire to access company product descriptions or listings that may be characterized by abbreviations and other terms, as well as syntax, that are unique to the company or otherwise insufficiently standardized to enable easy access. Additionally, submitting queries for searching information via a network (e.g., LAN, WAN, proprietary or open) is subject to considerable lexicographic uncertainty, even within a single language environment, which uncertainty expands geometrically in the context of multiple languages. It is common for a searcher to submit queries that attempt to encompass a range of synonyms or conceptually related terms when attempting to obtain complete search results. However, this requires significant knowledge and skill and is often impractical, especially in a multi-language environment. Moreover, in some cases, a searcher, such as a consumer without specialized knowledge regarding a search area, may be insufficiently knowledgeable regarding a taxonomy or classification structure of the subject matter of interest to execute certain search strategies for identifying information of interest through a process of progressively narrowing the scope of responsive information based on conceptual/class relationships.
It will be observed that the left pane 102 of Fig. 1 graphically depicts a portion of a taxonomy where, for example, the units of measure "liter" and "ounce", as well as variants thereof, are subclasses of the class "volume." Thus, for example, a searcher entering a query including the term "ounce" (or "oz") may access responsive information from a database or the like including the term "oz" (or "ounce"). Moreover, metric equivalent items, e.g., including the term "ml," may be retrieved in response to the query based on tags commonly linking the search term and the responsive item to the class
"volume." In these cases, both normalization (oz = ounce) and classification (<_volume«ounce» «liter»_>) (where the markings <> and «» indicate parent-child tag relationships) are used to enhance the search functionality. Such normalization may involve normalizing a locale-specific search term and/or normalizing terms in a searched database to a normalized form. It will be appreciated that the normalized (or unnormalized) terms may be translated from one language to another, as disclosed herein, to provide a further degree of search functionality.
Moreover, such normalization and classification assisted searches are not limited to the context of product descriptions but may extend to the entirety of any language. In this regard, Fig. 19 illustrates a taxonomy 1900 related to the area of mechanics that may be used in connection with research related to small aircraft runway accidents attributed to following in the wake of larger aircraft. Terms 1902 represent alternative terms that may be normalized by an SME using the present invention, such as an administrator of a government crash investigation database, to the normalized terms 1904, namely, "vorticity" and "wake." These terms 1904 may be associated with a parent classification 1906 ("wingtip vortices") which in turn is associated with a grandparent classification 1908 ("aerodynamic causes") and so on. In this context, normalization allows for mapping of a range of colloquial or scientific search terms into a predefined taxonomy, or for tagging of documents including such terms relative to the taxonomy. The taxonomy can then be used to resolve lexicographic ambiguities and to retrieve relevant documents.
Fig. 20 is a flowchart illustrating a process 2000 for constructing a database for enhanced searching using normalization and classification. The illustrated process 2000 is initiated by establishing (2002) a taxonomy for the relevant subject matter. This may be performed by an SME and will generally involve dividing the subject matter into conceptual categories and subcategories that collectively define the subject matter. In many cases, such categories may be defined by reference materials or industry standards. The SME may also establish (2004) normalization rules, as discussed above, for normalizing a variety of terms or phrases into a smaller number of normalized terms. For example, this may involve surveying a collection or database of documents to identify sets of corresponding terms, abbreviations and other variants. It will be appreciated that the taxonomy and normalization rules may be supplemented and revised over time based on experience to enhance operation of the system.
Once the initial taxonomy and normalization rules have been established, a document to be stored is received (2004) and parsed (2006) into appropriate chunks, e.g., words or phrases. Normalization rules are then applied (2008) to map the chunks into normalized expressions. Depending on the application, the document may be revised to reflect the normalized expressions, or the normalized expressions may merely be used for processing purposes. In any case, the normalized expressions are then used to define (2010) a taxonomic lineage (e.g., wingtip vortices, aerodynamic causes, etc.) for the subject term and to apply (2012) corresponding tags. The tagged document (2014) is then stored and the tags can be used to retrieve, print, display, transmit, etc., the document or a portion thereof. For example, the database may be searched based on classification or a term of a query may be normalized and the normalized term may be associated with a classification to identify responsive documents.
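The flow of Fig. 20 might be sketched, in simplified form, as follows; the normalization rules, lineage table, and function names are hypothetical and serve only to illustrate the parse, normalize, tag, and store sequence.

    # Hypothetical normalization rules and taxonomy fragment for the Fig. 19 example.
    NORM_RULES = {"vortices": "vorticity", "vortex": "vorticity", "turbulence": "wake"}
    LINEAGE = {"vorticity": ["wingtip vortices", "aerodynamic causes"],
               "wake": ["wingtip vortices", "aerodynamic causes"]}

    def tag_document(text):
        # Parse into chunks (here, simple word tokens), normalize each chunk,
        # and attach the taxonomic lineage used for the classification tags.
        tagged = []
        for chunk in text.lower().replace(",", " ").split():
            norm = NORM_RULES.get(chunk, chunk)
            tagged.append({"term": norm, "lineage": LINEAGE.get(norm, [])})
        return tagged

    record = tag_document("accident attributed to wake turbulence")
    # The stored record can now be retrieved by a normalized term ("wake") or by
    # a class tag anywhere in its lineage ("wingtip vortices", "aerodynamic causes").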
2. Translating
The SOLx paradigm is to use translators to translate repeatable complex terms and phrases, and translation rules to link these phrases together. It uses the best of both manual and machine translation. The SOLx system uses computer technology for repetitive or straightforward applications, and uses people for the complex or special-case situations. The NorTran (Normalization/Translation) server is designed to support this paradigm. Figure 3 represents a high-level architecture of the NorTran platform 300. Each module is discussed below as it relates to the normalization/classification process. A more detailed description is provided below in connection with the overall SOLx schematic diagram description for configuration and run-time operation.

The GUI 302 is the interface between the subject matter expert (SME) or human translator (HT) and the core modules of the NorTran server. Through this interface, SMEs and HTs define the filters for content chunking, access classification dictionaries, create the terms and phrases dictionaries, and monitor and edit the translated content. The N-Gram filter 304 defines the parameters used in the N-gram program. The N-gram program is the key statistical tool for identifying recurring terms and phrases of the original content. The N-Gram and other statistical tools module 306 is a set of parsing and statistical tools that analyze the original content for significant terms and phrases. The tools parse for the importance of two or more words or tokens as defined by the filter settings. The output is a sorted list of terms with the estimated probabilities of the importance of the term in the totality of the content. The goal is to aggregate the largest re-usable chunks and have them directly classified and translated.
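By way of illustration only, the kind of recurring-phrase counting performed by such statistical tools might resemble the following sketch; the sample content and function names are assumptions.

    from collections import Counter

    def ngram_counts(lines, n):
        # Count every n-word sequence across the content lines.
        counts = Counter()
        for line in lines:
            tokens = line.split()
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        return counts

    content = ["RECT BAKER EMILE HENRY (A)",
               "SQ BAKER EMILE HENRY (B)",
               "OVAL BAKER E. HENRY (A)"]
    print(ngram_counts(content, 2).most_common(3))
    # Frequent bigrams such as "EMILE HENRY" become candidate chunks for
    # normalization and translation.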
The chunking classification assembly and grammar rules set 308 relates the pieces from one language to another. For example, as discussed earlier, two noun phrases N1 N2 are mapped in Spanish as N2 'de' N1. Rules may need to be added or existing ones modified by the translator. The rules are used by the translation engine with the dictionaries and the original content (or the normalized content) to reassemble the content in its translated form. The rules/grammar base language pairs and translation engine 310 constitute a somewhat specialized machine translation (MT) system. The translation engine portion of this system may utilize any of various commercially available translation tools with appropriate configuration of its dictionaries. Given that the translation process is not an exact science and that round trip processes (translations from A to B to A) rarely work, a statistical evaluation is likely the best automatic tool to assess the acceptability of the translations.
The Translation Accuracy Analyzer 312 assesses words not translated, heuristics for similar content, baseline analysis from human translation and other criteria.
The chunking and translation editor 314 functions much like a translator's workbench. This tool has access to the original content; it helps the SME create normalized content if required; with the normalized content and dictionaries, it helps the translator create the translated terms and phrases dictionary; and, once that repository is created, it helps the translator fill in any missing terms in the translation of the original content. A representation of the chunking functionality of this editor is shown in the example in Table 3.
Table 3
The first column lists the original content from a parts list of cooking dishes. The terms (A), etc., are dimensional measurements that are not relevant to the discussion. The second column lists the chunked terms from an N-gram analysis; the third column lists the frequency of each term in the original content set. The fourth column is the number associated with the chunked terms in column 2. The fifth column is the representation of the first column in terms of the sequence of chunked content. Although not shown, a classification lineage is also associated with each chunk to assist in translation, e.g., by resolving ambiguities.
If the translation of each chunk is stored in another column, and translation rules exist for reassembling the chunks, then the content is translated. It could be listed in another column that would have a direct match or link to the original content. Table 4 lists the normalized and translated normalized content.
Table 4
Finally, Table 5 shows the Original Content and the Translated Content that is created by assembling the Translated Normalized Terms in Table 4 according to the Chunked Original Content sequence in Table 3.
Table 5
This example shows that when appropriately "chunked," machine translation grammar knowledge for noun phrases can be minimized. However, it cannot be eliminated entirely.
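A simplified, hypothetical sketch of this chunk-based assembly is shown below; the chunk numbers, Spanish translations, and the assembly rule are illustrative only and do not reproduce the content of Tables 3 - 5.

    # Hypothetical chunk dictionary: chunk number -> pre-translated chunk (Spanish).
    translated_chunks = {1: "fuente para horno rectangular", 2: "Emile Henry"}

    # An original description expressed as its chunk-number sequence (cf. Table 3).
    chunk_sequence = [1, 2]

    def noun_phrase_rule(n1_translated, n2_translated):
        # Example assembly rule: English "N1 N2" is rendered in Spanish as "N2 de N1".
        return n2_translated + " de " + n1_translated

    def assemble(sequence, chunks):
        # A per-language rule set would control reordering and joining words;
        # here the pre-translated chunks are simply emitted in sequence.
        return " ".join(chunks[n] for n in sequence)

    print(assemble(chunk_sequence, translated_chunks))
    print(noun_phrase_rule("teléfono", "cable"))   # "telephone wire" -> "cable de teléfono"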
Referring to Fig. 3, the Normalized Special Terms and Phrases repository 316 contains chunked content that is in a form that supports manual translation. It is free of unusual acronyms and misspellings, and strives for consistency; in Table 3, for example, Emile Henry was also listed as E. Henry. Consistent usage of terms is maximized.
The Special Terms and Phrases Translation Dictionary repository 318 is the translated normalized terms and phrases content. It is the specialty dictionary for the client content.
Other translation dictionaries 320 may be any of various commercially available dictionary tools and/or SOLx developed databases. They may be general terms dictionaries, industry specific dictionaries, SOLx acquired content, or any other knowledge that helps automate the process. One of the tenets of the SOLx process is that the original content need not be altered. Certainly, there are advantages to making the content as internally consistent as possible, and to defining some form of structure or syntax to make translations easier and more accurate. However, there are situations where a firm's IT department does not want the original content modified in any way. Taking advantage of the benefits of normalized content, but without actually modifying the original, SOLx uses a set of meta or non-persistent stores so that the translations are based on the normalized meta content 322. Tags reflecting classification information may also be kept here.

The above discussion suggests a number of processes that may be implemented for the automatic translation of large databases of structured content. One implementation of these processes is illustrated in the flowchart of Fig. 4 and is summarized below. It will be appreciated that these processes and the ordering thereof can be modified. First, the firm's IT organization extracts (400) the content from their IT systems, ideally with a part number or other unique key. As discussed above, one of the key SOLx features is that the client need not restructure or alter the original content in their IT databases. However, there are reasons to do so. In particular, restructuring benefits localization efforts by reducing the translation set up time and improving the translation accuracy. One of these modifications is to adopt a 'normalized' or fixed syntactic, semantic, and grammatical description of each content entry.
Next, software tools identify (402) the most important terms and phrases. Nearest neighbor, filtered N-gram, and other analysis tools identify the most used and important phrases and terms in the content. The content is analyzed one description or item at a time and re-usable chunks are extracted.
Subject matter experts then "internationalize" (404) the important terms and phrases. These experts "translate" the abbreviations and acronyms, correct misspellings and in general redefine and terms that would be ambiguous for translation. This is a list of normalized terms and phrases. It references the original list of important terms and phrases. The SMEs also associate such terms and phrases with a classification lineage.
Translators can then translate (406) the internationalized important terms and phrases. This translated content forms a dictionary of specialty terms and phrases. In essence, this translated content corresponds to the important and re-usable chunks. Depending on the translation engine used, the translator may need to specify the gender alternatives, plural forms, and other language specific information for the special terms and phrases dictionary. Referring again to an example discussed above, translators would probably supply the translation for (four-strand), (color-coded), (twisted-pair), telephone, and wire. This assumes that each term was used repeatedly. Any other entry that uses (color-coded) or wire would use the pre-translated term.
Other dictionaries for general words and even industry specific nomenclature can then be consulted (408) as available. This same approach could be used for the creation of general dictionaries. However, for purposes of this discussion it is assumed that they already exist.
Next, language specific rules are used to define (410) the assembly of translated content pieces. The types of rules described above define the way the pre-translated chunks are reassembled. If, in any one description, the grammatical structure is believed to be more complicated than the pre-defined rule set, then the phrase is translated in its entirety.
The original content (on a per item basis) is then mapped (412) against the dictionaries. Here, the line item content is parsed and the dictionaries are searched for the appropriate chunked and more general terms (content chunks to translated chunks). Ideally, all terms in the dictionaries map to a single line item in the content database, i.e., a single product description. This is the first function of the translation engine. The classification information may be used to assist in this mapping and to resolve ambiguities.

A software translation engine then assembles (414) the translated pieces against the language rules. Input into the translation engine includes the original content, the translation or assembly rules, and the translated pieces. A translation tool will enable a translator to monitor the process and directly intercede if required. This could include adding a new chunk to the specialty terms database, or overriding the standard terms dictionaries.
A statistically based software tool assesses (416) the potential accuracy of the translated item. One of the difficulties of translation is that when something is translated from one language to another and then retranslated back to the first, the original content is rarely reproduced. Ideally, one hopes it is close, but rarely will it be exact. The reason for this is that there is no direct inverse in language translation. Each language pair has a circle of 'confusion' or acceptability. In other words, there is a propagation of error in the translation process. Short of looking at every translated phrase, the best that can be hoped for in an overall sense is a statistical evaluation.
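By way of illustration, a crude statistical check of this kind might compare the original text with its round-trip version using token overlap, as in the following sketch; the scoring choice is an assumption, not the particular evaluation described herein.

    def round_trip_score(original, back_translation):
        # Crude token-overlap measure between the original text and its
        # back-translation; a low score flags the item for human review.
        a = set(original.lower().split())
        b = set(back_translation.lower().split())
        return len(a & b) / max(len(a | b), 1)

    score = round_trip_score("four-strand color-coded twisted-pair telephone wire",
                             "four-strand color-coded telephone wire twisted pair")
    print(round(score, 2))   # values near 1.0 suggest an acceptable translation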
Translators may re-edit (418) the translated content as required. Since the content is stored in a database that is indexed to the original content on an entry-by-entry basis, any entry may be edited and restored if this process leads to an unsatisfactory translation. Although not explicitly described, there are terms such as proper nouns, trade names, special terms, etc., that are never translated. These invariant terms would be identified in the above process. Similarly, converted entries such as metrics would be handled through a metrics conversion process. The process thus discussed uses both human and machine translation in a different way than traditionally employed. This process, with the correct software systems in place, should generate much of the accuracy associated with manual translation. Further, this process should function without manual intervention once sufficient content has been pre-translated.

The various configuration processes are further illustrated by the screenshots of Figs. 5 - 10. Although these figures depict screenshots, it will be appreciated that these figures would not be part of the user interface as seen by an SME or other operator. Rather, these screenshots are presented here for purposes of illustration and the associated functionality would, to a significant extent, be implemented transparently. These screenshots show the general processing of source content. The steps are importing the data, normalizing the data based on a set of grammars and rules produced by the
SME using the NTW user interface, then analysis of the content to find phrases that need to be translated, building a translation dictionary containing the discovered phrases, translation of the normalized content, and finally, estimation of the quality of the translated content.
The first step, as illustrated in Fig. 5, is to import the source structured content file. This will be a flat text file with the proper character encoding, e.g., UTF-8. There will generally be one item description per line. Some basic formatting of the input may be done at this point.
Fig. 6 shows the normalized form of the content on the right and the original content (as imported above) on the left. What is not shown here are the grammars and rules used to perform the normalization. The form of the grammars and rules, and how to create them, are described above.
In this example, various forms of the word resistor that appear in the original content, for example "RES" or "RESS", have been normalized to the form "resistor". The same is true for "W" being transformed to "watt" and "MW" to "milliwatt". Separation was added between text items, for example, "1/4W" is now "1/4 watt" and "75OHM" is now "75 ohm". Punctuation can also be added or removed, for example, "RES, 35.7" is now "resistor 35.7". Not shown in the screenshot: the order of the text can also be standardized by the normalization rules. For example, if the user always wants a resistor description to be of the form:
resistor <ohms rating> <tolerance> <watts rating>
the normalization rules can enforce this standard form, and the normalized content would reflect this structure.
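A hedged sketch of vocabulary and separation normalization of the kind shown in Fig. 6 follows; the abbreviation table and regular expressions are assumptions, and reordering to the standard form above is omitted for brevity.

    import re

    # Hypothetical vocabulary adjustment table: abbreviation -> full form.
    VOCAB = {"RES": "resistor", "RESS": "resistor", "W": "watt",
             "MW": "milliwatt", "OHM": "ohm"}

    def normalize(description):
        description = description.replace(",", " ")                        # remove commas
        description = re.sub(r"(\d)(OHM|MW|W)\b", r"\1 \2", description)   # "75OHM" -> "75 OHM"
        tokens = [VOCAB.get(t.upper(), t.lower()) for t in description.split()]
        return " ".join(tokens)

    print(normalize("RES, 35.7 OHM 1/4W"))   # -> "resistor 35.7 ohm 1/4 watt"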
Another very valuable result of the normalization step can be to create a schematic representation of the content. In the phrase analysis step, as illustrated, the user is looking for the phrases in the now normalized content that still need to be translated to the target language. The purpose of Phrase Analysis, and in fact of the next several steps, is to create a translation dictionary that will be used by machine translation. The value in creating the translation dictionary is that only the phrases need translation, not the complete body of text, thus providing a substantial savings in time and cost to translate. The Phrase Analyzer shows only those phrases for which it does not already have a translation. Some of these phrases should not be translated, which leads to the next step.
In the filter phrases step as shown in Fig. 7, an SME reviews this phrase data and determines which phrases should be translated. Once the SME has determined which phrases to translate, then a professional translator and/or machine tool translates the phrases (Figs. 8 - 9) from the source language, here English, to the target language, here Spanish, using any associated classification information. A SOLx user interface could be used to translate the phrases, or the phrases are sent out to a professional translator as a text file for translation. The translated text is returned as a text file and loaded into SOLx. The translated phrases become the translation dictionary that is then used by the machine translation system.
The machine translation system uses the translation dictionary created above as the source for domain specific vocabulary. By providing the domain specific vocabulary in the form of the translation dictionary, the SOLx system greatly increases the quality of the output from the machine translation system.
The SOLx system can also then provide an estimation of the quality of the translation result (Fig. 10). Good translations would then be loaded into the run-time localization system for use in the source system architecture. Bad translations would be used to improve the normalization grammars and rules, or the translation dictionary. The grammars, rules, and translation dictionary form a model of the content. Once the model of the content is complete, a very high proportion of the translations are of good quality.
Particular implementations of the above described configuration processes can be summarized by reference to the flowcharts of Figs. 11 - 12. Specifically, Fig. 11 summarizes the steps of an exemplary normalization configuration process and Fig. 12 summarizes an exemplary translation configuration process.
Referring first to Fig. 11, a new SOLx normalization process (1100) is initiated by importing (1102) the content of a source database, or portion thereof, to be normalized and selecting a quantity of text from the source database. For example, a sample of 100 item descriptions may be selected from the source content file, denoted content.txt. A text editor may be used to select the 100 lines. These 100 lines are then saved to a file named samplecontent.txt for purposes of this discussion.
The core items in the samplecontent.txt file are then found (1104) using the Candidate Search Engine, for example, by running a words-in-common search. Next, attribute/value information is found (1106) in the samplecontent.txt file using the Candidate Search Engine by running collocation and semantic category searches as described above. Once the attributes/values have been identified, the SOLx system can be used to write (1108) attribute rules. The formalism for writing such rules has been discussed above. It is noted that the SOLx system performs much of this work for the user and simple user interfaces can be provided to enable "writing" of these rules without specialized linguistic or detailed code-writing skills. The SOLx system can also be used at this point to write (1110) categorization or classification rules. As noted above, such categorization rules are useful in defining a context for avoiding or resolving ambiguities in the transformation process. Finally, the coverage of the data set can be analyzed (1112) to ensure satisfactory run time performance. It will be appreciated that the configuration process yields a tool that can not only translate those "chunks" that were processed during configuration, but can also successfully translate new items based on the knowledge base acquired and developed during configuration. The translation process is summarized below.
Referring to Fig. 12, the translation process 1200 is initiated by acquiring (1202) the total set of item descriptions that are to be translated as a flat file, with a single item description per line. For purposes of the present discussion, it is assumed that the item descriptions are in a file with the name content.txt. A text editor may be used to set up an associated project configuration file.
Next, a sample of 100 item descriptions is selected (1204) from the content.txt file. A text editor may be used to select the 100 lines. These 100 lines are then saved to a file named samplecontent.txt.
The translation process continues with finding (1206) candidates for vocabulary adjustment rules in the samplecontent.txt file using the Candidate Search Engine. The Candidate Search Engine may implement a case variant search and full/abbreviated variant search, as well as a classification analysis, at this point in the process. The resulting information can be used to write vocabulary adjustment rules. Vocabulary adjustment rules may be written to convert abbreviated forms to their full forms.
Next, candidates for labeling/tagging rules are found (1208) in the samplecontent.txt file using the Candidate Search Engine. Labeling/tagging rules may be written to convert semantic category and collocation forms. Attribute rules can then be written (1210) following the steps set forth in the previous flowchart.
Vocabulary adjustment rules are then run (1212) using the Natural Language Engine against the original content. Finally, the coverage of the data set can be analyzed (1214) by evaluating the performance of the vocabulary adjustment rules and of the attribute rules. At this point, if the proper coverage is being achieved by the vocabulary adjustment rules, then the process proceeds to building (1216) a domain-specific dictionary.
Otherwise, a new set of 100 item descriptions can be selected for analysis and the intervening steps are repeated.
To build a domain specific dictionary, the SME can run a translation dictionary creation utility. This runs using the rule files created above as input, and produces the initial translation dictionary file. This translation dictionary file contains the words and phrases that were found in the rules. The words and phrases found in the translation dictionary file can then be manually and/or machine translated (1218). This involves extracting a list of all word types using a text editor and then translating the normalized forms manually or through a machine tool such as SYSTRAN. The translated forms can then be inserted into the dictionary file that was previously output.
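A simplified sketch of building such an initial translation dictionary and identifying entries still needing translation might look as follows; the phrase lists and translations are hypothetical.

    # Hypothetical rule-derived phrase list and partially returned translations.
    normalized_phrases = ["resistor", "watt", "milliwatt", "carbon film"]
    returned = {"resistor": "resistencia", "watt": "vatio"}   # from translator or machine tool

    # Build the initial translation dictionary; untranslated entries remain empty
    # and can be sent out in the next translation round.
    dictionary = {phrase: returned.get(phrase, "") for phrase in normalized_phrases}
    missing = [p for p, t in dictionary.items() if not t]
    print(missing)   # ['milliwatt', 'carbon film']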
Next, the SME can run (1220) the machine translation module, run the repair module, and run the TQE module. The file outputs from TQE are reviewed (1222) to determine whether the translation results are acceptable. The acceptable translated content can be loaded (1224) into the Localized Content Server (LCS), if desired. The remainder of the translated content can be analyzed (1226) to determine what changes to make to the normalization and translation knowledge bases in order to improve the quality of the translation. Words and phrases that should be deleted during the translation process can be deleted (1228) and part-of-speech labels can be added, if needed. The SME can then create (1230) a file containing the translated words in the source and target languages. Once all of the content is found to be acceptable, the system is fully trained. The good translated content is then loaded into the LCS.
It has been found that it is useful to provide graphical feedback during normalization to assist the SME in monitoring progress. Any appropriate user interface may be provided in this regard. Fig. 13 shows an example of such an interface. As shown, the graphical desktop 1300 is divided into multiple workspaces, in this case including workspaces 1302, 1304 and 1306. One workspace 1302 presents the source file content that is in process, e.g., being normalized and translated. A second area 1304, in this example, functions as the normalization workbench interface and is used to perform the various configuration processes such as replacing various abbreviations and expressions with standardized terms or, in the illustrated example, defining a parse tree. Additional workspaces such as workspace 1306 may be provided for accessing other tools such as the Candidate Search Engine, which can identify terms for normalization or, as shown, allow for selection of rules. In the illustrated example, normalized terms are highlighted relative to the displayed source file in workspace 1302 on a currently updated basis. In this manner, the SME can readily determine when all or enough of the source file has been normalized.

In a traditional e-business environment, this translation process essentially is offline. It becomes real-time and online when new content is added to the system. In this case, assuming well-developed special-purpose dictionaries and linguistic information already exist, the process can proceed in an automatic fashion. Content, once translated, is stored in a specially indexed look-up database. This database functions as a memory translation repository. With this type of storage environment, the translated content can be scaled to virtually any size and be directly accessed in the e-business process. The associated architecture for supporting both configuration and run-time operation is discussed below.
B. SOLx Architecture
1. Network Architecture Options
The SOLx system operates in two distinct modes. The "off-line" mode is used to capture knowledge from the SME/translator and knowledge about the intended transformation of the content. This collectively defines a knowledge base. The off-line mode includes implementation of the configuration and translation processes described above. Once the knowledge base has been constructed, the SOLx system can be used in a file in/file out manner to transform content.
The SOLx system may be implemented in a variety of business-to-business (B2B) or other frameworks, including those shown in Fig. 14. Here the Source 1402, the firm that controls the original content 1404, can be interfaced with three types of content processors 1406. The SOLx system 1400 can interface at three levels: with a Local Platform 1408 (associated with the source 1402), with a Target Platform 1410 (associated with a target to whom the communication is addressed or by whom it is otherwise consumed) and with a Global Platform 1412 (separate from the source 1402 and target 1408). A primary B2B model of the present invention focuses on a Source/Seller managing all transformation/localization. The Seller will communicate with other Integration Servers (such as WebMethods) and bare applications in a "Point to Point" fashion; therefore, all locales and data are registered and all localization is done on the seller side. However, all or some of the localization may be managed by the buyer or on a third party platform such as the global platform. Another model, which may be implemented using the global server, would allow two SOLx B2B-enabled servers to communicate in a neutral environment, e.g., English. Therefore, a Spanish and a Japanese system can communicate in English by configuring and registering the local communication in SOLx B2B.
A third model would include a local seller communicating directly (via HTTP) with the SOLx B2B enabled Buyer.
2. Network Interface
Previously, it was discussed how structured content is localized. The next requirement is to rapidly access this content. If there are ongoing requests to access a particular piece of localized content, it may be inefficient to continually translate the original entry. The issues, of course, are speed and potentially quality assurance. One solution is to store the translated content along with links to the original with a very fast retrieval mechanism for accessing the translated content. This is implemented by the SOLx Globalization Server.
The SOLx Globalization Server consists of two major components: (1) the Document Processing Engine and (2) the Translated Content Server (TCS). The Document Processing Engine is a WebMethods plug-compatible application that manages and dispenses localized content through XML-tagged business objects. The TCS contains language-paired content that is accessed through a cached database. This architecture assures very high-speed access to translated content. This server uses a hash index on the translated content cross-indexed with the original part number or, if there is not a unique part number, a hash index on the equivalent original content. A direct link between the original and translated content via the part number (or hash entry) assures retrieval of the correct entry. The indexing scheme also guarantees very fast retrieval times. The process of adding a new localized item to the repository consists of creating the hash index, linking to the original item, and including the item in the repository. The TCS will store data in Unicode format.
The TCS can be used in a standalone mode where content can be accessed by the SKU or part number of the original item, or through text searches of either the original content or its translated variant. If the hashed index of the translated content is known, it can, of course, be accessed that way. Additionally, the TCS will support SQL style queries through the standard Oracle SQL query tools.
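The lookup scheme might be sketched as follows; the hash function and data structures are illustrative assumptions rather than the actual TCS implementation.

    import hashlib

    def content_key(part_number, original_text):
        # Key on the unique part number when available; otherwise key on a hash
        # of the equivalent original content (the hash choice is an assumption).
        if part_number:
            return part_number
        return hashlib.sha1(original_text.encode("utf-8")).hexdigest()

    tcs = {}   # stands in for the cached, cross-indexed repository
    tcs[content_key("PN-1234", "")] = "resistencia 35.7 ohmios 1/4 vatio"
    tcs[content_key(None, "RES, 35.7 OHM 1/4W")] = "resistencia 35.7 ohmios 1/4 vatio"

    print(tcs[content_key("PN-1234", "")])              # retrieval by part number
    print(tcs[content_key(None, "RES, 35.7 OHM 1/4W")]) # retrieval by content hash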
The Document Processing Engine is the software component of the Globalization Server that allows localized content in the TCS to be integrated into typical B2B Web environments and system-to-system transactions. XML is rapidly replacing EDI as the standard protocol for Web-based B2B system-to-system communication. There are a number of core technologies, often called "adaptors" or "integration servers," that translate ERP content, structures, and formats from one system environment to another. WebMethods is one such adaptor, but any such technology may be employed.
Figure 15 shows a conventional web system 1500 where the WebMethods integration server 1502 takes as input SAP-formatted content called an IDOC 1504 from a source back office 1501 via API 1503 and converts it into an XML-formatted document 1506 for transmission over the Web 1508, via optional application server 1510 and HTTP servers 1512, to some other receiver such as a Target back office 1514 or other ERP system. The document 1506 may be transmitted to the Target back office 1514 via HTTP servers 1516 and an integration server 1518.
Figure 16 shows the modification of such a system that allows the TCS 1600 containing translated content to be accessed in a Web environment. In this figure, original content from the source system 1602 is translated by the NorTran Server 1604 and passed to a TCS repository 1606. A transaction request, whether requested from a foreign system or the source system 1602, will pass into the TCS 1600 through the Document Processing Engine 1608. From there, a communication can be transmitted across the Web 1610 via integration server adaptors 1612, an integration server 1614, an optional application server 1616 and HTTP servers 1618.

3. SOLx Component Structure
Figure 17 depicts the major components of one implementation of the SOLx system 1700 and the SOLx normalization/classification processes as discussed above. The NorTran Workbench/Server 1702 is that component of the SOLx system 1700 that, under the control of an SME/translator 1704, creates normalized/translated content. The SOLx Server 1708 is responsible for the delivery of content either as previously cached content or as content that is created from the real-time application of the knowledge bases under control of various SOLx engines. The initial step in either a normalization or translation process is to access legacy content 1710 that is associated with the firm's various legacy systems 1712. The legacy content 1710 may be provided as level 1 commerce data consisting of short descriptive phrases delivered as flat file structures that are used as input into the NorTran Workbench 1702. There are a number of external product and part classification schemas
1714, both proprietary and public. These schemas 1714 relate one class of part in terms of a larger or more general family, a taxonomy of parts for example. These schemas 1714 define the attributes that differentiate one part class from another. For example, in bolts, head style is an attribute for various types of heads such as hex, fillister, Phillips, etc. Using this knowledge in the development of the grammar rules will drastically shorten the time to normalize large quantities of data. Further, it provides a reference to identify many of the synonyms and abbreviations that are used to describe the content. The NorTran Workbench (NTW) 1702 is used to learn the structure and vocabulary of the content. The NTW user interface 1716 allows the SME 1704 to quickly provide the system 1700 with knowledge about the content. This knowledge is captured in the form of content parsing grammars, normalization rules, and the translation dictionary. As the SME 1704 "trains" the system 1700 in this manner, he can test to see how much of the content is understood based on the knowledge acquired so far. Once the structure and vocabulary are well understood, in other words, once acceptable coverage has been gained, the NTW 1702 is used to normalize and translate large quantities of content.
Thus, one purpose of NTW 1702 is to allow SMEs 1704 to use a visual tool to specify rules for parsing domain data and rules for writing out parsed data in a normalized form. The NTW 1702 allows the SME 1704 to choose data samples from the main domain data, then to select a line at a time from that sample. Using visual tools such as drag and drop, and connecting items on a screen to establish relationships, the SME 1704 can build up parse rules that tell the Natural Language Engine (NLE) 1718 how to parse the domain data. The SME 1704 can then use visual tools to create rules to specify how the parsed data will be assembled for output - whether the data should be reordered, how particular groups of words should be represented, and so on. The NTW 1702 is tightly integrated with the NLE 1718. While the NTW 1702 allows the user to easily create, see, and edit parse rules and normalization rules, the NLE 1718 creates and stores grammars from these rules.
Although content parsing grammars, normalization rules, and context tokens constitute the core knowledge created by the SME 1704 using the system 1700, the GUI 1716 does not require the SME 1704 to have any background in computational linguistics, natural language processing or other abstract language skills whatsoever. The content SME 1704 must understand what the content really is, and translators must be technical translators. A "butterfly valve" in French does not translate to the French words for butterfly and valve.
The CSE 1720 is a system initially not under GUI 1716 control that identifies terms and small text strings that repeat often throughout the data set and are good candidates for the initial normalization process.
One purpose of this component is to address issues of scale in finding candidates for grammar and normalization rules. The SOLx system 1700 provides components and processes that allow the SME 1704 to incorporate the knowledge that he already has into the process of writing rules. However, some domains and data sets are so large and complex that they require normalization of things other than those that the SME 1704 is already aware of. Manually discovering these things in a large data set is time-consuming and tedious. The CSE 1720 allows automatic application of the "rules of thumb" and other heuristic techniques that data analysts apply in finding candidates for rule writing.
The CSE component works through the programmatic application of heuristic techniques for the identification of rule candidates. These heuristics were developed from applying knowledge elicitation techniques to two experienced grammar writers. The component is given a body of input data, applies heuristics to that data, and returns a set of rule candidates.
The N-Gram Analysis (NGA) lexically based tool 1722 identifies word and string patterns that recur in the content. It identifies single-word, two-word, and higher-order phrases that repeat throughout the data set. It is one of the core technologies in the CSE 1720. It is also used to identify those key phrases that should be translated after the content has been normalized.
The N-Gram Analysis tool 1722 consists of a basic statistical engine, and a dictionary, upon which a series of application engines rely. The applications are a chunker, a tagger, and a device that recognizes the structure in structured text. Fig. 18 shows the relationships between these layers.
One purpose of the base N-Gram Analyzer component 1800 is to contribute to the discovery of the structure in structured text. That structure appears on multiple levels, and each layer of the architecture works on a different level. The levels from the bottom up are "words", "terms", "usage", and "dimensions of schema". The following example shows the structure of a typical product description.
acetone amber glass bottle, assay > 99.5% color (alpha) < 11
The word-level of structure is a list of the tokens in the order of their appearance. The word "acetone" is first, then the word "amber", and so forth.
The terminology-level of structure is a list of the groups of words that act like a single word. Another way of describing terminology is to say that a group of words is a term when it names a standard concept for the people who work in the subject matter. In the example, "acetone", "amber glass", and "color (alpha)" are probably terms.
The next two levels of structure connect the words and terms to the goal of understanding the product description. The SOLx system approximates that goal with a schema for understanding. When the SOLx system operates on product description texts, the schema has a simple form that repeats across many kinds of products. The schema for product descriptions looks like a table.
Each column of the table is a property that characterizes a product. Each row of the table is a different product. In the cells of the row are the particular values of each property for that product. Different columns may be possible for different kinds of products. This report refers to the columns as "dimensions" of the schema. For other subject matter, the schema may have other forms. This fragment does not consider those other forms.
Returning to the example, the next level of structure is the usage level. That level classifies each word or term according to the dimension of the schema that it can describe. In the example, "acetone" is a "chemical"; "amber glass" is a material; "bottle" is a "product"; and so forth. The following tagged text shows the usage level of structure of the example in detail.
[chemical](acetone) [material](amber glass) [product](bottle) [,](,) [measurement](assay) [>](>) [number](99) [.](.) [number](5) [unit_of_measure](%) [measurement](color (alpha)) [<](<) [number](11)

The top level of structure that SOLx considers for translation consists of the dimensions of the schema. At that level, grammatical sequences of words describe features of the product in some dimensions that are relevant to that product. In the example, "acetone" describes the dimension "product"; "amber glass bottle" describes a "container of product"; and so forth. The following doubly tagged text shows the dimension-level of structure for the example, without identifying the dimensions.
[schema]([chemical](acetone) )
[schema]([material](amber glass) [product](bottle) [,](,) ) [schema]([measurement](assay) [>](>) [number](99) [.](.[) [number](5)
[unit_of_measure](%) )
[schema]([measurement](color (alpha)) [<](<) [number](11))
Given the structure above, it is possible to insert commas into the original text of the example, making it more readable. The following text shows the example with commas inserted.
acetone, amber glass bottle, assay > 99.5%, color (alpha) < 11
This model of the structure of text makes it possible to translate more accurately.
The discovery of structure by N-Gram Analysis is parallel to the discovery of structure by parsing in the Natural Language Engine. The two components are complementary, because each can serve where the other is weak. For example, in the example above, the NLE parser could discover the structure of the decimal number, "[number](99.5)", saving NGA the task of modeling the grammar of decimal fractions. The statistical model of grammar in NGA can make it unnecessary for human experts to write extensive grammars for NLE to extract a diverse larger-scale grammar. By balancing the expenditure of effort in NGA and NLE, people can minimize the work necessary to analyze the structure of texts.

One of the basic parts of the NGA component 1800 is a statistical modeler, which provides the name for the whole component. The statistical idea is to count the sequences of words in a body of text in order to measure the odds that a particular word appears after a particular sequence. In mathematical terms, the statistical modeler computes the conditional probability of word n, given words 1 through n-1:
P(word_n | word_1, word_2, ..., word_n-1)
Using that statistical information about a body of text, it is possible to make reasonable guesses about the structure of text. The first approximation of a reasonable guess is to assume that the most likely structure is also the structure that the author of the text intended. That assumption is easily incorrect, given the variety of human authors, but it is a good starting place for further improvement.
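A toy bigram version of this statistical model is sketched below; the full model conditions on the entire preceding word sequence, so this sketch, with its assumed function names and toy corpus, is an illustration only.

    from collections import Counter

    def train_bigrams(lines):
        # Count word pairs and single-word prefixes across the corpus.
        pair_counts, prefix_counts = Counter(), Counter()
        for line in lines:
            tokens = line.split()
            for prev, word in zip(tokens, tokens[1:]):
                pair_counts[(prev, word)] += 1
                prefix_counts[prev] += 1
        return pair_counts, prefix_counts

    def probability(word, prev, pair_counts, prefix_counts):
        # Estimate P(word | prev), the chance of a word given the preceding word.
        return pair_counts[(prev, word)] / prefix_counts[prev] if prefix_counts[prev] else 0.0

    pairs, prefixes = train_bigrams(["acetone amber glass bottle",
                                     "amber glass jar"])
    print(probability("glass", "amber", pairs, prefixes))   # 1.0 in this toy corpus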
The next improvement toward recognizing the intent of the author is to add some specific information about the subject matter. The dictionary component 1802 captures that kind of information at the levels of words, terms, and usage. Two sources may provide that information. First, a human expert could add words and terms to the dictionary, indicating their usage. Second, the NLE component could tag the text, using its grammar rules, and the NGA component adds the phrases inside the tags to the dictionary, using the name of the tag to indicate the usage.
The information in the dictionary complements the information in the statistical model by providing a better interpretation of text when the statistical assumption is inappropriate. The statistical model acts as a fallback analysis when the dictionary does not contain information about particular words and phrases.
The chunker 1804 combines the information in the dictionary 1802 and the information in the statistical model to partition a body of texts into phrases. Partitioning is an approximation of parsing that sacrifices some of the details of parsing in order to execute without the grammar rules that parsing requires. The chunker 1804 attempts to optimize the partitions so each cell is likely to contain a useful phrase. One part of that optimization uses the dictionary to identify function words and excludes phrases that would cut off grammatical structures that involve the function words.
The chunker can detect new terms for the dictionary in the form of cells of partitions that contain phrases that are not already in the dictionary. The output of the chunker is a list of cells that it used to partition the body of text.
The tagger 1806 is an enhanced form of the chunker that reports the partitions instead of the cells in the partitions. When a phrase in a cell of a partition appears in the dictionary, and the dictionary entry has the usage of the phrase, the tagger prints the phrase with the usage for a tag. Otherwise, the tagger prints the phrase without a tag. The result is text tagged with the usage of the phrases.
The structurer 1808 uses the statistical modeler to determine how to divide the text into dimensions of the schema, without requiring a person to write grammar rules. The training data for the structurer's statistical model is a set of tagged texts with explicit "walls" between the dimensions of the schema. The structurer trains by using the N-Gram Analyzer 1800 to compute the conditional probabilities of the walls in the training data. The structurer 1808 operates by first tagging a body of text and then placing walls into the tagged text where they are most probable.

Referring again to Fig. 17, the candidate heuristics are a series of knowledge bases, much like pre-defined templates, that kick-start the normalization process. They are intended to address pieces of content that pervade user content. Items such as units of measure, power consumption, colors, capacities, etc., are developed as semantic categories 1724.
The spell checker 1726 is a conventional module added to SOLx to increase the effectiveness of the normalization.
The Grammar & Rules Editor (GRE) 1728 is a text-editing environment that uses many Unix-like tools for creation of rules and grammars for describing the content. It can always be used in a "fall-back" situation, but will rarely be necessary when the GUI 1716 is available.
The Taxonomy, Schemas, & Grammar Rules module 1730 is the output from either the GRE 1728 or the GUI 1716. It consists of a set of ASCII files that are the input into the natural language parsing engine (NLE) 1718.
On initialization, the NLE 1718 reads a set of grammar and normalization rules from the file system or some other persistent storage medium and compiles them into a set of Rule objects employed by the runtime tokenizer and parser and a set of NormRule objects employed by the normalizer. Once initialized, the NLE 1718 will parse and normalize input text one line at a time or may instead process a text input file in batch mode, generating a text output file in the desired form. Configuration and initialization generally require that a configuration file be specified. The configuration file enumerates the contents of the NLE knowledge base, providing a list of all files containing format, grammar, and normalization rules.
NLE 1718 works in three steps: tokenization, parsing, and normalization. First, the input text is tokenized into one or more candidate token sequences. Tokenization is based on what sequences of tokens may occur in any top-level phrase parsed by the grammar. Tokens must be delineated by white space unless one or more of such tokens are represented as regular expressions in the grammar, in which case the tokens may be contiguous, undelineated by white space. Tokenization may yield ambiguous results, i.e., identical strings that may be parsed by more than one grammar rule. The parser resolves such ambiguities.
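By way of illustration, a greatly simplified tokenization of this kind might look as follows; the terminal patterns are assumptions standing in for expressions supplied by an actual grammar.

    import re

    # Assumed terminal patterns: fractions/decimals and alphabetic tokens.
    TERMINALS = re.compile(r"\d+(?:/\d+)?(?:\.\d+)?|[A-Za-z%]+")

    def tokenize(text):
        tokens = []
        for field in text.split():               # white-space delineated fields
            matches = TERMINALS.findall(field)   # split contiguous tokens, e.g. "1/4W"
            tokens.extend(matches if matches else [field])
        return tokens

    print(tokenize("RES 35.7OHM 1/4W"))
    # -> ['RES', '35.7', 'OHM', '1/4', 'W']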
The parser is a modified top-down chart parser. Standard chart parsers assume that the input text is already tokenized, scanning the string of tokens and classifying each according to its part-of-speech or semantic category. This parser omits the scanning operation, replacing it with the prior tokenization step. Like other chart parsers, it recursively predicts those constituents and child constituents that may occur per the grammar rules and tries to match such constituents against tokens that have been extracted from the input string. Unlike the prototypical chart parser, it is unconstrained as to where phrases may begin and end, how often they may occur in an input string, or whether some of the input text cannot be parsed. It generates all possible parses that occur, starting at any arbitrary white space delineated point in the input text, and compares all possible parse sequences, selecting the best scoring alternative and generating a parse tree for each. If more than one parse sequence achieves the best score, both parse trees are extracted from the chart and retained. Others are ignored. Output of the chart parser and the scoring algorithm is the set of alternative high scoring parse trees.

Each parse tree object includes methods for transforming itself according to a knowledge base of normalization rules. Each parse tree object may also emit a String corresponding to text contained by the parse tree, or such a String together with a string tag. Most such transformation or emission methods traverse the parse tree in post-order, being applied to a parse tree's children first, then being applied to the tree itself. For example, a toString() method collects the results of toString() for each child and only then concatenates them, returning the parse tree's String representation. Thus, normalization and output are accomplished as a set of traversal methods inherent in each parse tree. Normalization includes parse tree transformation and traversal methods for replacing or reordering children (rewrite rules), for unconditional or lookup table based text replacement, for decimal punctuation changes, for joining constituents together with specified delimiters or without white space, and for changing tag labels.

The Trial Parsed Content 1734 is a set of test samples of either tagged or untagged normalized content. This sample corresponds to a set of rules and grammars that have been parsed. Trial parsed content is the output of a statistical sample of the original input data. When a sequence of content samples parses to a constant level of unparsed input, then the set of grammars and rules is likely to be sufficiently complete that the entire data set may be successfully parsed with a minimum of ambiguities and unparsed components. It is part of the interactive process to build grammars and rules for the normalization of content.
A complete tested grammar and rule set 1736 corresponding to the full unambiguous tagging of content is the goal of the normalization process. It ensures that all ambiguous terms or phrases, such as Mil, which could be either a trade name abbreviation for Milwaukee or an abbreviation for Military, have been defined in a larger context. This set 1736 is then given as input to the NLE Parsing Engine 1738 that computes the final normalized content, listed in the figure as Taxonomy Tagged Normalized Content 1732.
The custom translation dictionary 1740 is a collection of words and phrases that are first identified through the grammar rule creation process and passed to an external technical translator. This content is returned and is entered into one of the custom dictionaries associated with the machine translation process. There are standard formats that translators typically use for sending translated content.
The MTS 1742 may be any of various conventional machine translation products that, given a set of custom dictionaries as well as its standard ones and a string of text in one language, produces a string of text in the desired language. Languages currently supported by one such product marketed under the name SYSTRAN include: French, Portuguese, English, German, Greek, Spanish, Italian, simplified Chinese, Japanese, and Korean. Output from the MTS is a Translated Content file 1744.
One purpose of the Machine Translation Server 1742 is to translate structured texts, such as product descriptions. The state of the art in commercial machine translation is too weak for many practical applications. The MTS component 1742 increases the number of applications of machine translation by wrapping a standard machine translation product in a process that simplifies its task. The simplification that MTS provides comes from its ability to recognize the structure of texts to be translated. The MTS decomposes the text to be translated into its structural constituents, and then applies machine translation to the constituents, where the translation problem is simpler. This approach sacrifices the fidelity of references between constituents in order to translate the individual constituents correctly. For example, adjective inflections could disagree with the gender of their objects, if they occur in different constituents. The compromise results in adequate quality for many new applications in electronic commerce. Future releases of the software will address this issue, because the compromise is driven by expedience.
The conditioning component of MTS 1742 uses the NGA component to recognize the structure of each text to be translated. It prepares the texts for translation in a way that exploits the ability of the machine translation system to operate on batches of texts. For example, SYSTRAN can interpret lists of texts delimited by new-lines, given a parameter stating that the document it receives is a parts list. Within each line of text, SYSTRAN can often translate independently between commas, so the conditioning component inserts commas between dimensions of the schema if they are not already present. The conditioning component may completely withhold a dimension from machine translation, if it has a complete translation of that dimension in its dictionary. The machine translation component provides a consistent interface for a variety of machine translation software products, in order to allow coverage of language pairs.
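A minimal sketch of this conditioning step, assuming the dimension boundaries have already been identified by the structurer, might be:

    def condition_item(dimensions):
        # Join the recognized dimension-level fragments of one item with commas
        # so the machine translation system can translate each independently.
        return ", ".join(d.strip().rstrip(",") for d in dimensions)

    def condition_batch(items):
        # One item per line, matching a "parts list" style input document.
        return "\n".join(condition_item(item) for item in items)

    batch = [["acetone", "amber glass bottle", "assay > 99.5%", "color (alpha) < 11"]]
    print(condition_batch(batch))
    # -> acetone, amber glass bottle, assay > 99.5%, color (alpha) < 11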
The repair component is a simple automated text editor that removes unnecessary words, such as articles, from SYSTRAN's Spanish translations of product descriptions. In general, this component will correct for small-scale stylistic variations among machine translation tools.
The Translation Quality Estimation Analyzer (TQA) 1746 merges the structural information from conditioning with the translations from repair, producing a list of translation pairs. If any phrases bypassed machine translation, this merging process gets their translations from the dictionary. After merging, translation quality estimation places each translation pair into one of three categories. The "good" category contains pairs whose source and target texts have acceptable grammar, and the content of the source and target texts agrees. A pair in the "bad" category has a source text with recognizable grammar, but its target grammar is unacceptable or the content of the source text disagrees with the content of the target text. The "ugly" category contains pairs whose source grammar is unfamiliar.
The feedback loop extracts linguistic knowledge from a person. The person examines the "bad" and "ugly" pairs and takes one of the following actions. The person may define words and terms in the dictionary, indicating their usage. The person may define grammar rules for the NLE component in order to tag some part of the text. The person may correct the translation pair
(if it requires correction), and place it into the set of examples for training the translation quality estimation models. The person may take the source text, mark it with walls between dimensions of the schema, and place it into the set of examples for training the structure model. An appropriate graphical user interface will make the first and last actions implicit in the third action, so a person will only have to decide whether to write grammars or to correct examples.
The translation quality estimation component uses two models from the N-Gram Analyzer that represent the grammar of the source and target texts. The translation quality estimation component also uses a content model that is partially statistical and partially the dictionary. The two parts overlap in their ability to represent the correspondence in content between source and target texts. The dictionary can represent exact correspondences between words and terms. The statistical model can recognize words that occur in one language, but are unnecessary in the other, and other inexact correspondences.
It is well known that the accuracy of machine translations based on standard glossaries is only sufficient to get the gist of the translation. There are no metrics associated with the level of accuracy of any particular translation. The TQA 1746 attempts to define a measure of accuracy for any single translation. The basis for the accuracy estimate is a statistical overlap between the translated content, at the individual phrase level, and prior translations that have been manually evaluated.
The Normalized Content 1748 and/or Translated Content 1706 can next be cached in the Normalized Content Server (NCS) and the Localized Content Server (LCS) 1752, respectively. This cached data is made available through the SOLx Server 1708.
The LCS 1752 is a fast lookup translation cache. There are two parts to the LCS 1752: an API that is called by Java clients (such as a JSP server process) to retrieve translations, and a user interface 1754 that allows the user 1756 to manage and maintain translations in the LCS database 1752.
As well as being the translation memory foundation of the SOLx system 1700, the LCS 1752 is also intended to be used as a standalone product that can be integrated into legacy customer servers to provide translation lookups.
The LCS 1752 takes as input source language text, the source locale, and the target locale. The output from LCS 1752 is the target text, if available in the cache, which represents the translation from the source text and source locale, into the target locale. The LCS 1752 is loaded ahead of run-time with translations produced by the SOLx system 1700. The cache is stored in a relational database.
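A minimal sketch of such a cache lookup is shown below; the key structure, locales, and names are illustrative assumptions.

    # Stand-in for the relational translation cache: keyed on source text,
    # source locale, and target locale.
    translation_cache = {
        ("resistor 35.7 ohm 1/4 watt", "en_US", "es_ES"):
            "resistencia 35.7 ohmios 1/4 vatio",
    }

    def lcs_lookup(source_text, source_locale, target_locale):
        # Return the cached translation, or None on a miss so the caller can
        # fall through to on-the-fly normalization/translation.
        return translation_cache.get((source_text, source_locale, target_locale))

    print(lcs_lookup("resistor 35.7 ohm 1/4 watt", "en_US", "es_ES"))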
The SOLx Server 1708 provides the customer with a mechanism for run-time access to the previously cached, normalized and translated data. The SOLx Server 1708 also uses a pipeline processing mechanism that not only permits access to the cached data, but also allows true on-the-fly processing of previously unprocessed content. When the SOLx Server encounters content that has not been cached, it then performs the normalization and/or translation on the fly. The existing knowledge base of the content structure and vocabulary is used to do the on-the-fly processing.
Additionally, the NCS and LCS user interface 1754 provides a way for SMEs 1756 to search and use normalized 1748 and translated 1706 data. The NCS and LCS data is tied back to the original ERP information via the customer's external key information, typically an item part number.
As shown in Fig. 17, the primary NorTran Workbench engines are also used in the SOLx Server 1708. These include: N-Gram Analyzer 1722, Machine Translation Server 1742, Natural Language Engine 1718, Candidate Search Engine 1720, and Translation Quality Analyzer 1746. The SOLx server 1708 also uses the grammar rules 1754 and custom and standard glossaries 1756 from the Workbench 1702. Integration of the SOLx server 1708, for managing communication between the source/legacy system 1712 and targets via the Web 1758, is handled by an integration server 1758 and a workflow control system 1760.
Fig. 21 is a flowchart illustrating a process 2100 for searching a database or network using normalization and classification as discussed above. The process 2100 is initiated by establishing (2102) a taxonomy and establishing (2104) normalization rules as discussed above. For example, the taxonomy may define a subject matter area in the case of a specialized search engine or a substantial portion of a language for a more generalized tool. Once the taxonomy and normalization rules have been initially established, a query is received (2106) and parsed (2108) into chunks. The chunks are then normalized (2110) and classified (2112) using the normalization rules and taxonomy. The classification information may be associated with the chunks via tags, e.g., XML tags.
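The following simplified sketch illustrates steps 2108-2112 (parsing a query into chunks, normalizing them and attaching classification tags); the rule table, taxonomy paths and chunking strategy are illustrative assumptions only.

```python
# Simplified sketch of query parsing, normalization and classification.
# The rewrite rules and taxonomy below are hypothetical examples.

NORMALIZATION_RULES = {"oz.": "ounce", "cer.": "ceramic"}
TAXONOMY = {"coffee cup": "housewares/containers/cups", "ounce": "measurements/fluid"}

def normalize_and_classify(query):
    chunks = query.lower().split(",")                       # crude chunking for illustration
    results = []
    for chunk in chunks:
        words = [NORMALIZATION_RULES.get(w, w) for w in chunk.split()]
        normalized = " ".join(words)
        tag = next((path for term, path in TAXONOMY.items() if term in normalized), None)
        results.append({"chunk": normalized, "tag": tag})   # tag could be emitted as an XML tag
    return results

print(normalize_and_classify("8 oz. coffee cup, cer."))
```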
At this point, the normalized chunks may be translated (2114 a-c) to facilitate multi-language searching. The process for translating is described in more detail below. One or more search engines are then used (2116 a-c) to perform term searches using the normalized chunks and the classification information. Preferably, documents that are searched have also been processed using compatible normalization rules and a corresponding taxonomy as discussed above such that responsive documents can be retrieved based on a term match and/or a tag match. However, the illustrated process 2100 may be advantageously used even in connection with searching unprocessed documents, e.g., by using the normalized chunks and/or terms associated with the classification to perform a conventional term search. The responsive documents may then be normalized and classified (2118 a-c) and translated (2120 a-c) as described in more detail below. Finally, the search results are compiled (2122) for presentation to the searcher. It will be appreciated that normalization and classification of the search query thus facilitates more structured searching of information in a database or network including in a multi-language environment. Normalization and classification also assist in translation by reducing the quantity of terms required to be translated and by using the classification structure to reduce ambiguities.
III. INFORMATION SHARING

As will be appreciated from the discussion above, preparing the transformation system for a particular application involves significant effort by one or more human operators or SMEs. Such effort relates, inter alia, to mapping of source collection terms to a standardized terminology, associating terms with a classification system or taxonomy, e.g., as reflected in a tag structure, and establishing syntax rules. This is accomplished with the assistance of a tool denoted the Knowledge Builder tool below.
Even with the assistance of the Knowledge Builder tool, this preparation process can be time-consuming and cumbersome. It is therefore desirable to allow for reuse of pre-existing information, for example, previously developed mapping rules for mapping source collection terms to standardized terminology or previously developed classification structure or taxonomy. Such sharing of information may be used to provide a head-start in connection with a new knowledge base creation project, to accommodate multiple users or SMEs working on the same subject area or domain (including at the same time) or in various other information sharing contexts. The invention is described below in connection with supporting multiple SMEs developing a semantic metadata model (SMM) that involves working in the same domains or at least one common domain. While this example aptly illustrates the information sharing functionality, it will be appreciated that the invention is not limited to this context.
Two issues that are addressed by the Knowledge Builder tool in connection with sharing information are: 1) using or importing only selected information, as may be desired, rather than being limited to using or importing a full knowledge base; and 2) resolving potential conflicts or inconsistencies resulting from multiple users working in a single domain. By addressing these issues as discussed below, benefits of information sharing can be efficiently realized.
A. Domain Management
Figure 22 generally illustrates an architecture for an information sharing environment involving multiple SMEs. For purposes of illustration, this is shown as involving a server-client model involving server 2200 and clients 2202-2204. As will be described in more detail below, certain knowledge base development functionality including information sharing functionality is executed by a Knowledge Builder tool 2206. In the illustrated embodiment, the functionality of this tool is illustrated as being distributed over the server 2200 and client 2202-2204 platforms, however, it will be appreciated that other hardware implementations are possible.
The SMEs use graphical interfaces 2208 at the clients 2202-2204, in the illustrated embodiment, to access a project database 2210 and a developing knowledge base 2212, each of which is schematically illustrated, in this example, as residing at the server 2200. The project database 2210 may include, for example, the collection of source data that is to be transformed. The knowledge base 2212 includes classification or taxonomy structure, rules and the like, that have been developed by the SMEs or others. The illustrated clients 2202-2204 also include storage 2214, for storing rules and the like under development, or to temporarily store a version of the knowledge base or portions thereof, as will be described in more detail below.

The Knowledge Builder tool includes a Domain Management module to address the issue of using or importing only selected information. The Domain Management module segments the various rules in the developing knowledge base into smaller, easily managed compartments. More specifically, the knowledge base may be graphically represented in the familiar form of files and folders.
This is illustrated in Fig. 23. In the illustrated example, a new knowledge base project is started with at least two domain folders as shown in panel 2300 of a graphical user interface. Specifically, the knowledge base includes a default domain folder and the common folder 2302. The default domain folder includes phrases and terms that have not been assigned to other domain folders. These phrases and terms appear in the knowledge base tree 2304 under the nodes labeled "Phrase Structure" 2306 and "Terminology" 2308 directly under the "Knowledge Base" node 2310. Initially, the common folder does not contain any phrases or terms.
The Knowledge Builder tool attempts to automatically place the rules into the appropriate domain folder when they are created. If a domain has not been specified, all created rules are placed in the phrase structure or terminology folders 2306 or 2308 under the knowledge base node 2310.
When a new domain is created, the Knowledge Builder tool continues to place rules in the phrase structure or terminology folders 2306 or 2308 until the user manually drags the rules into the new domain. Thereafter, when new rules are created, the Knowledge Builder tool analyzes the new rules to determine whether the new rules are related to the rules in the new folder. If so, the tool will automatically place the newly created rules in the same folder. Such analysis may involve consideration of the associated terminology or any identification of a classification or other taxonomical structure, for example, dependencies and references as described below.
Domains can be nested to any level. When a domain is created, the Knowledge Builder tool automatically creates a common folder at the same level. Whenever a subdomain is created, the system creates a sub common folder that is initially empty. If an additional subdomain is created and populated with rules, the Knowledge Builder tool will automatically move rules common to the two subdomains into the sub common folder. The tool moves rules into and out of common domains as additional rules are created and depending on where they are positioned within the domain hierarchy.
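As a rough illustration of the automatic promotion of shared rules into a common folder described above, the following sketch operates on a simplified data model (plain sets of rule names per subdomain); the representation and function name are assumptions made only for illustration.

```python
# Hedged sketch of promoting rules shared by sibling subdomains into their Common folder.

def promote_common_rules(subdomains):
    """subdomains: mapping of subdomain name -> set of rule names it contains.
    Returns (updated subdomains, common set) with shared rules moved to Common."""
    names = list(subdomains)
    common = set()
    if len(names) >= 2:
        common = set.intersection(*(subdomains[n] for n in names))
        for n in names:
            subdomains[n] -= common          # shared rules no longer live in the subdomain
    return subdomains, common

domains = {"resistors": {"[number]", "[ohm]", "[variable]"},
           "capacitors": {"[number]", "[farad]", "[variable]"}}
print(promote_common_rules(domains))   # [number] and [variable] move to the sub common folder
```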
The user can also move phrase rules from one domain to another. As phrases are moved into a domain, related phrases and terminal rules are also moved either into the same domain or into the appropriate common domain. For improved efficiency, top-level phrase rules can be moved, thereby implicitly dragging related phrase and terminal rules into a domain.
A user can also move domain folders into other domains. When a domain folder is moved, all of the associated rules are also moved. This can also create additional common folders. As noted above, information sharing can facilitate creation of new knowledge bases. In this regard, when a new project is created, the user can select a single domain from an existing project to import into the new project. Multiple domains can be imported in this manner with any resulting inconsistencies addressed as discussed below.
1. DOMAIN CREATION
A fundamental step in the process of knowledge base development is domain creation. Domain creation can be accomplished using the Knowledge
Builder tool. In this regard, Fig. 24 illustrates a graphical user interface 2400 that may be displayed upon launching the Knowledge Builder tool. The graphical user interface 2400 generally includes a knowledge base structure or classification panel 2402, a source collection or project panel 2404, and a taxonomy or parse tree panel 2406. The interoperation of these panels is described below.
To create a domain, the user can right-click on the knowledge base node 2408 of the knowledge base panel 2402. This causes a pop-up window 2500 to be displayed as shown in Fig. 25. From the pop-up window 2500, the user selects the create subdomain entry 2502. The user is then prompted to name this first domain as shown in Fig. 26. In the illustrated example, the new domain is named "passives." As shown in Fig. 27, the knowledge base panel 2700 is then updated to include a folder icon 2702 for the "passives" domain.
2. DOMAIN EDITING
A variety of domain editing functions are supported by the Knowledge Builder tool including functionality for moving rules, renaming domains and deleting domains. In the example discussed above, rules may be moved into the passives domain after that domain is established. For example, rules may be dragged from their current location in the knowledge base panel to the desired domain folder. Alternatively, rules can be dragged into domain folders using a move rules dialog. To open the move rules dialog, the edit/move rules menu (not shown) is selected and the rule is dragged from the knowledge base tree onto the desired domain in the resulting dialog. The advantage of using the move rules dialog is that it minimizes scrolling through the knowledge base tree.
Domains may be renamed by selecting the appropriate domain, right-clicking and selecting the rename domain menu item. A domain name dialog is then opened as shown above and can be used to enter the new name. Domains may be deleted by selecting the appropriate domain, right-clicking and selecting the delete domain menu item. It should be noted that the associated rules are not deleted. They move to the next level in the knowledge base tree. This may involve moving rules to other domains, other common folders or root folders. In the illustrated implementation, it is not possible to simultaneously delete domains and associated rules by deleting only the domain (though such functionality could optionally be supported). Individual rules are deleted either before or after the domain itself is deleted.
3. DOMAIN REUSE
As noted above, one of the advantages of domains is that they may be imported into a new project without importing the entire prior project. This allows for more efficient reuse of knowledge previously created. To import a domain from an existing project, the file/import domains menu item is selected. This opens an import domains dialog box 2800 as shown in Fig. 28. A pull-down menu 2802 can then be utilized to select the project from which the user wishes to import a domain. Panel 2804 of the dialog box 2800 displays the knowledge base tree from the selected project. The desired domain can then be dragged to the target position in the knowledge base tree of the knowledge base panel 2900 as shown in Fig. 29.
B. Multi-User Functionality
The discussion above described how domains can be created, populated and edited. These same processes may be used by multiple SMEs to jointly develop a knowledge base. For example, the developing database may be accessed on a server by multiple SMEs at different workstations via a LAN, WAN or the like. Each of the SMEs may import particular domains on which to work. In order to accommodate such multi-user development, it is useful to provide a mechanism for resolving conflicts or ambiguities. Such conflicts or ambiguities may result from inconsistent mapping of terms, inconsistent rule definitions or the like, which may be identified based on dependency and reference relationships as described below.
There are a number of ways that such conflicts and ambiguities can be avoided or resolved. For example, when one SME selects a domain for editing or extension, other SMEs may be locked out of that domain so as to substantially avoid conflicts and inconsistencies. Such an implementation may be practical in the context of the present invention because the knowledge base is divided into multiple domains, thus allowing for concurrent access to selected portions of the knowledge base. However, it is often desirable to allow multiple SMEs to concurrently work on the same domain, e.g., to more rapidly process a large volume of data. Many architectures are possible for resolving conflicts or ambiguities in the case of multiple SMEs working on a single domain. For example, one definitive version of the domain may be retained, for example, at the server. Each SME may then "check-out" a version of the domain for revision and extension. When a version of the domain is checked back-in, the revisions and extensions may be analyzed relative to pre-defined rules. Thus, the rules may cause the Knowledge Builder tool to accept revisions and extensions that do not result in conflicts relative to the definitive version and reject all other revisions or extensions. Alternatively, revisions and extensions that result in conflicts or inconsistencies may be identified so that they can be resolved by an authorized SME, e.g., by selecting one of the conflicting rules and editing the other to be replaced by or consistent therewith. Similarly, upon importing a domain, all conflicts or inconsistencies may be listed or highlighted for arbitration.
Alternatively, one of the SMEs may be designated as dominant with respect to a particular project, such that his revisions and extensions are accepted as definitive. Revisions and extensions by other, subservient SMEs would then be rejected or harmonized with the knowledge base of the dominant SME by arbitration rules as discussed above. Further, rather than checking-out and checking back-in domain versions as discussed above, arbitration can be executed in real time as knowledge base development is occurring. For example, if an SME proposes that the term "mil" be rewritten as "milliliter" and a rule already exists (in the same domain or anywhere within the knowledge base, depending on the specific implementation) that requires "mil" to be rewritten as "Milwaukee," the SME may be immediately notified by way of an error message upon entry of the proposed rule.
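A minimal sketch of such a real-time arbitration check, using the "mil" example above, might look as follows; the rule representation and function name are assumptions rather than the Knowledge Builder implementation.

```python
# Illustrative sketch: before accepting a proposed rewrite rule, check whether the same
# source term is already mapped to a different target anywhere in the relevant scope.

def check_rewrite_conflict(existing_rules, proposed_term, proposed_rewrite):
    """existing_rules: mapping of source term -> standardized rewrite."""
    current = existing_rules.get(proposed_term)
    if current is not None and current != proposed_rewrite:
        raise ValueError(
            f'Conflict: "{proposed_term}" is already rewritten as "{current}", '
            f'cannot also rewrite it as "{proposed_rewrite}"')
    existing_rules[proposed_term] = proposed_rewrite

rules = {"mil": "Milwaukee"}
try:
    check_rewrite_conflict(rules, "mil", "milliliter")
except ValueError as err:
    print(err)   # the SME is notified immediately, as in the "mil" example above
```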
Regardless of the specific architecture employed, the Knowledge Builder tool executes logic for identifying or preventing conflicts and inconsistencies. This may be based on dependencies and references. A rule dependency is a relationship between two rules. The dependency is a second rule that must be defined in order for the first rule to be valid. Only phrase structure rules have dependencies. A phrase structure rule's dependency set is that set of rules that appear as constituents in its productions. Those dependencies are apparent by inspecting a parse tree of the knowledge base panel. Thus, the rule corresponding to a parent node in a parse tree is said to depend on any rule corresponding to child nodes. In the example of Fig. 30, [attr_resistance] has dependencies on [number] and [ohm], and [number] has at least a dependency on [period] that is apparent in this particular parse tree. Other parse trees may reveal other [number] dependencies, e.g., [integer] and [fraction].
It will be appreciated that one may not be able to see all dependencies in a single parse tree. A phrase structure rule's productions define all possible dependencies. Thus, one can manually edit a rule to view all dependencies. In the example of Fig. 31, [sae_thread_size] has dependencies on [screw_dimension], [thread_dia], [real], [separator_-], [separator_pound] and [separator_colon]. References are the inverse of dependencies. A reference is one of possibly several rules that depends on the current rule. In the example above, rules [screw_dimension], [thread_dia], [real], [separator_-], [separator_pound] and [separator_colon] are each referenced by [sae_thread_size], although each may be referenced by other rules, too. One would have to inspect the entire grammar to determine all references, so the Knowledge Builder tool provides a utility to get a list of references for any rule. The utility is accessed by right-clicking on any rule in the knowledge tree. A menu is then displayed that includes the entry "get references." By selecting the "get references" item, a display is provided as shown in Fig. 32.
In general, terminology rules do not have dependencies, although any one terminology rule may have many references. The rules that govern legal dependencies among the domains are aimed at keeping knowledge contained in each domain as self-sufficient as possible. An exception is that knowledge that is shared among multiple domains is stored in a Common domain. There may be several Common domains in a single project. A Common domain is automatically added to a domain when a child domain is created there.
A grammar rule that resides in some domain may have dependencies on any of the following objects: any rule that resides in the same domain; any rule that resides in a child or descendant domain; and any rule that resides in a Common domain that is a child of a parent or ancestor domain. Thus, the scope of a domain is limited to that domain, any child or descendant domain, and any Common domain that is the child of a parent or ancestor domain.
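The scope rule just stated can be illustrated with the following sketch, which models the domain hierarchy as a simple parent-pointer map and treats Common domains as children named accordingly; this representation and the function names are assumptions made only for illustration.

```python
# Sketch of the dependency-scope rule: a rule in one domain may depend on rules in the
# same domain, in a descendant domain, or in a Common domain that is a child of one of
# its ancestor domains.

def ancestors(domain, parent):
    while domain is not None:
        yield domain
        domain = parent.get(domain)

def is_descendant(candidate, domain, parent):
    return domain in set(ancestors(candidate, parent))

def dependency_is_legal(rule_domain, dep_domain, parent):
    # same domain, or a child/descendant domain
    if dep_domain == rule_domain or is_descendant(dep_domain, rule_domain, parent):
        return True
    # a Common domain whose parent is the rule's domain or one of its ancestors
    if dep_domain.endswith("Common") and parent.get(dep_domain) in set(ancestors(rule_domain, parent)):
        return True
    return False

parent = {"resistors": "PassiveElectronics", "capacitors": "PassiveElectronics",
          "PassiveElectronics:Common": "PassiveElectronics",
          "PassiveElectronics": "root", "Common": "root", "hardware": "root"}
print(dependency_is_legal("resistors", "PassiveElectronics:Common", parent))  # True
print(dependency_is_legal("resistors", "capacitors", parent))                 # False (sibling)
```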
Referring to Fig. 33, any rule in FactoryEquipment_and_supplies may have dependencies on all rules in adhesives_and_sealants, chemicals, Common, engines_and_motors, and any other of its subdomains, because they are all children, as well as on rules in the top-level Common. On the other hand, a rule in FactoryEquipment_and_supplies may not have dependencies on any rules in ComputerEquipment_and_supplies, hardware, or other siblings. Nor may it reference rules at the root level in the "phrase structure" and "terminology" folders immediately under the "knowledge base" node. Likewise, "chemicals" may not have dependencies on "tools," "hardware," or the root domain, but rules in "chemicals" may reference the factory equipment Common and the root Common.
Thus, the assignment of a rule to a domain and other operations on domains are governed by dependencies among rules. Whether an operation is legal, as well as how rules are automatically assigned, is governed by such dependencies. Dependencies among rules are used to govern domain operations in order to preserve grammar consistency and to enable the use of domains as containers for moving knowledge from one project to another. Any branch in a domain hierarchy, if copied to a new project, should act as a consistent and correct grammar. As a result, consistency is constantly maintained by inspecting dependencies among rules when rules or domains are moved, when new productions and new rules are introduced, or when existing rules are edited manually.
This is illustrated by the example of Fig. 34. Fig. 34 generally illustrates a grammar. If a user creates a new domain "PassiveElectronics" and new subdomains "resistors" and "capacitors" under it, a new Common will automatically be inserted under "PassiveElectronics."
In the illustrated example, the rule [resistor] has no dependencies. If one drags it to "resistors," no other rules will move there. However, if the user were to drag [product_resistor] to "resistors," more than half the rules in the grammar would be automatically moved there, including [res_type], [variable], [carbon_film], [resistance], [number], and [ohms], together with any other direct or indirect dependency.
Now, if the user moves [capacitor] to "capacitors," only the one rule moves. If, instead, the rule [product_capacitor] is moved, all the remaining rules move to "capacitors" too. However, several other rules are moved to PassiveElectronics: Common. Those are the rules that are referenced, either directly or indirectly, by rules in both "resistors" and "capacitors," including [variable], [number], [tolerance], and [percent], all of which had first been moved to "resistors." Now consider the grammar rules illustrated in Fig. 35. The user may insert a new domain at the root called "hardware" and then drag [screw_variety] into the new domain. All of the new rules are automatically assigned to "hardware." But [number], previously assigned to PassiveElectronics: Common, is now moved instead to Common under the root. This is because the root is the only ancestor common to "hardware," "resistors" and "capacitors," and [number], which has references in all three domains, may only be assigned to the Common under the root.
Thus, the Knowledge Builder tool uses dependencies and references for a variety of purposes including governing whether an operation is legal. Fig. 36 is a flow chart illustrating a process 3600 for augmenting a grammar from component domains. The process 3600 is initiated by opening (3602) the current project and knowledge base. The current project includes a source listing that is to be transformed. The user then tests (3604) the current project using the current knowledge base and saves the result for regression testing. Next, the user augments (3606) the knowledge base from a grammar in an external project, as described above. In this regard, the user may drag (3608) one or more domains from the external project into the current project. The Knowledge Builder tool then checks for inconsistencies and conflicts, for example, based on the dependency and reference listings. Each inconsistency and conflict is identified to the user, who responds to each, for example, by harmonizing such inconsistencies and conflicts. The user can then retest (3610) the project using the modified knowledge base and run a further regression test with the previously saved data. The user then judges (3612) the result to determine whether the original knowledge base or the modified knowledge base is more effective. The Knowledge Builder tool may also analyze the regression test to identify the sources of progressions and regressions so as to facilitate troubleshooting.

The present invention thus allows for knowledge sharing for a variety of purposes including facilitating the process of developing a new knowledge base and allowing multiple users to work on a single project, including simultaneous development involving a common domain. In this manner, the process for developing a knowledge base is significantly streamlined.
IV. Frame-Slot Architecture
As noted above, the present invention generally relates to converting data from a first or source form to a second or target form. Such conversions may be desired in a variety of contexts relating, for example, to importing data into or otherwise populating an information system, processing a search query, exchanging information between information systems and translation. In this section, the invention is set forth in the context of particular examples relating to processing a source stream including a product oriented attribute phrase. Such streams may include information identifying a product or product type together with a specification of one or more attributes and associated attribute values. For example, the source stream (e.g., a search query or product descriptor from a legacy information system) may include the content "8 oz. ceramic coffee cup." In this case, the product may be defined by the phrase "coffee cup" and the implicit attributes of size and material have attribute values of "8 oz." and "ceramic" respectively.
While such source streams including product oriented attribute phrases provide a useful mechanism for illustrating various aspects of the invention, and in fact represent significant commercial implementations of the invention, it should be appreciated that the invention is not limited to such environments. Indeed, it is believed that the invention is applicable to virtually any other conversion environment with concepts such as product attributes and attribute values replaced, as necessary, by logical constructs appropriate to the subject environment, e.g., part of speech and form. Moreover, as noted above, the conversion rules are not limited to elements of a single attribute phrase or analog, but may involve relationships between objects, including objects set forth in separate phrases. Accordingly, the specific examples below should be understood as exemplifying the invention and not by way of limitation. In a preferred implementation of the invention, at least some conversions are executed with the assistance of a frame-slot architecture. Such a frame-slot architecture may function independently to define a full conversion model for a given conversion application, or may function in conjunction with one or more parse tree structures to define a conversion model. In the latter regard, the frame-slot architecture and parse tree structures may overlap with respect to subject matter.
The above-noted coffee cup example is illustrative in this regard. It may be desired to correlate the source string "8 oz. ceramic coffee cup" to a product database, electronic catalogue, web-based product information or other product listing. Such a product listing may include a variety of product types, each of which may have associated attributes and grammar rules. In this regard, the product types and attributes may be organized by one or more parse-tree structures. These parse tree structures, which are described and shown in U.S. Patent Application Serial Number 10/970,372, generally organize a given subject matter into a hierarchy of classes, subclasses, etc., down to the desired level of granularity, and are useful for improving conversion accuracy and improving efficiency in building a grammar, among other things. In this case, "coffee cup" may fall under a parse tree node "cups" which, in turn, falls under a parent node "containers" which falls under "housewares", etc. Similarly, the same or another parse tree may group the term "oz.", or a standardized expression thereof (e.g., defined by a grammar) such as "ounce" under the node "fluid measurements" (ounce may also appear under a heading such as "weights" with appropriate grammar rules for disambiguation) which, in turn, may fall under the parent node "measurements", etc.
As noted above, such a parse tree structure has certain efficiencies in connection with conversion processes. However, in some cases, very deep parses may be required, e.g., in connection with processing terms associated with large data systems. Moreover, such terms are often processed as individual fields of data rather than closer to the whole record level, thereby potentially losing contextual cues that enhance conversion accuracy and missing opportunities to quickly identify content anomalies or implement private schema to define legal attributes or values for a given information object. Finally, such parse tree processes may impose a rigid structure that limits applicability to a specific subject matter context, thereby limiting reuse of grammar segments.
By contrast, a frame-slot architecture allows for consideration of source stream information at, or closer to, the whole record level. This enables substantial unification of ontology and syntax, e.g., collective consideration of attribute phrases, recognized by the grammar and attribute values contained therein. Moreover, this architecture allows for consideration of contextual cues, within or outside of the content to be converted or other external constraints or other external information. In the coffee cup example, the frame-slot architecture allows for consideration of the source stream "8 oz. coffee cup" in its entirety. In this regard, this stream may be recognized as an attribute phrase, having "coffee cup" as an object. Grammar rules specific to this object or a class including this object or rules of a public schema may allow for recognition that "oz." means "ounce" and "ounce" in this context is a fluid measure, not a weight measure. A user-defined schema, for example, a private schema of the source or target information owner, may limit legal quantity values associated with "ounce" in the context of coffee cups to, for example, "6", "8" and "16". In this case, recognition of "8" by the schema provides increased confidence concerning the conversion. If the value had been "12", which would not comply with the schema in this example, this might serve, for example to quickly identify an anomaly (e.g., in the case of mapping records from a legacy data system to a target system) or identify an imperfect match (e.g., in the case of a search query) so that appropriate action may be taken.
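A hedged sketch of this whole-record, frame-slot style of interpretation for the coffee cup example is shown below; the schema contents, field names and anomaly handling are illustrative assumptions rather than a definitive implementation.

```python
# Illustrative sketch of frame-slot interpretation of "8 oz. ceramic coffee cup": the
# object is recognized first, and object-specific schema then constrains which
# attribute values are legal. All names and values below are hypothetical.

FRAME_SCHEMA = {
    "coffee cup": {
        "size":     {"unit": "ounce (fluid)", "legal_values": {"6", "8", "16"}},
        "material": {"legal_values": {"ceramic", "plastic", "stainless steel"}},
    }
}

def interpret(source, obj, attribute_values):
    frame = FRAME_SCHEMA.get(obj)
    if frame is None:
        return {"status": "no frame; fall back to parse tree"}
    anomalies = [(attr, val) for attr, val in attribute_values.items()
                 if val not in frame[attr]["legal_values"]]
    return {"object": obj, "attributes": attribute_values,
            "status": "anomaly" if anomalies else "ok", "anomalies": anomalies}

print(interpret("8 oz. ceramic coffee cup", "coffee cup", {"size": "8", "material": "ceramic"}))
print(interpret("12 oz. ceramic coffee cup", "coffee cup", {"size": "12", "material": "ceramic"}))
```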
The frame-slot architecture thus encompasses a utility for recognizing stream segments, obtaining contextual cues from within or external to the stream, accessing grammar rules specific to the subject matter of the stream segment and converting the stream segment. This may avoid deep parses and allow for greater conversion confidence and accuracy. Moreover, greater grammar flexibility is enabled, thus allowing for a higher degree of potential reuse in other conversion contexts. In addition, executing such processes by reference to a schema enables improved context-related analysis. In short, conversions benefit from surrounding and external context cues in a manner analogous to human processing.
As noted above, the frame-slot architecture may be developed in a top-down or bottom-up fashion. For example, objects, associated attributes and legal attribute values may be defined as schema that are imposed on the data. In the coffee cup example, all of these may be defined based on an analysis of a product inventory or the structure of a legacy information system. In either case, the schema may limit the legal values for quantity to 6, 8 and 16. Any information not conforming to the schema would then be identified and processed as an anomaly. Conversely, the legal values may be defined based on the data. For example, files from a legacy information system may be used to define the legal attribute values which, then, develop as a function of the input information.
Figure 45 illustrates a system 4500 for implementing such conversion processing. The illustrated system 4500 includes a conversion engine 4502 that is operative to execute various grammar rules and conversion rules for converting source information to a target form. In the illustrated embodiment, the system 4500 is operative to execute both frame-slot architecture methodology and parse tree structure methodology. However, it will be appreciated that a frame-slot architecture may be executed in accordance with the present invention in the absence of a cooperating parse tree environment. The illustrated conversion engine 4502 receives inputs and/or provides outputs via a workstation associated with the user interface 4504. For example, in a set-up mode, a user may select terms for processing and create associated relationships and grammar rules via the user interface 4504. In the context of a search system, a search query may be entered, and search results may be received, via the user interface 4504. In this regard, the conversion engine 4502 may be resident at the workstation associated with the user interface 4504, or may communicate with such a workstation via a local or wide area network.
The source content 4506 includes the source string to be converted. Depending on the specific application, this content 4506 may come from any of a variety of sources. Thus, in the case of an application involving transferring information from one or more legacy information systems into a target information system, the source content 4506 may be accessed from the legacy systems. In the case of a search engine application, the source content may be derived from a query. In other cases, the source content 4506 may be obtained from a text to be translated or otherwise converted. The source content 4506 may be preprocessed to facilitate conversion or may be in raw form. In the case of preprocessing, the raw content may be supplemented, for example, with markers to indicate phrase boundaries, tags to indicate context information, or other matter. Such matter may be provided in a set-up mode process. In addition, some such information may be present in a legacy system and may be used by the conversion engine 4502. It will be appreciated that the sources of the content 4506 and the nature thereof are substantially unlimited. The illustrated conversion engine 4502 performs a number of functions.
In this regard, the engine 4502 is operative to process the source content 4506 to parse the content into potential objects and attributes, identify the associated attribute values, and, in some cases, recognize contextual cues and other matter additional to the content to be transformed that may be present in the source content. The engine 4502 then operates to convert the relevant portion of the source content 4506 using a parse tree structure 4510 and/or a frame-slot architecture 4511, and provides a converted output, e.g., to a user or target system.
With regard to the parse tree structure 4510, such a structure is generally developed using the conversion engine 4502 in a set-up mode. The nodes of the parse tree structure 4510 may be defined by someone familiar with the subject matter under consideration or based on an analysis of a data set. Moreover, certain structure developed in connection with prior conversion applications may be imported to facilitate the set-up process. Such a set-up process is described in U.S. Patent Application Serial Number 10/970,372, which is incorporated herein by reference. At a high level, this set-up involves defining the hierarchical structure of the tree, populating the various nodes of the tree, developing standardized terminology and syntax and associated grammar and conversion rules associated with the tree and mapping source content variants to the standardized terminology and syntax.

In the case of the frame-slot architecture 4511, the conversion engine 4502 obtains the source content 4506 and identifies potential objects, attributes and attribute values therein. In this regard, the source content 4506 may be parsed as discussed above. In addition, the engine 4502 may obtain contextual cues 4512 to assist in the conversion. As noted above, such cues may be internal or external to the source content 4506. External cues may be based on the identity or structure of a source information system, defined by a schema specific to the frame-slot conversion, or based on information regarding the subject matter under consideration obtained from any external source. For example, information indicating that, when used in connection with "coffee cup," the term "ounce" is a fluid (not a weight) measure, may be encoded into metadata of a legacy information system, defined by a private schema developed for the subject conversion application or derived from an analysis of external information sources.
In the context of the frame-slot architecture, the conversion engine is operative to: identify potential objects, attributes and attribute values; process such information in relation to certain stored information concerning the objects, attributes and attribute values; access associated grammar and conversion rules; and convert the information from the source form to a target form. In this regard, the illustrated system 4500 includes stored object information 4514, stored attribute information 4516 and stored attribute value information 4518. This information may be defined by a public or private schema or by reference to external information regarding the subject matter under consideration. For example, the object information 4514 may include a list of recognized objects for which the frame-slot architecture is applicable together with information associating the object with legal attributes and/or attribute values and other conversion rules associated with that object. The attribute information 4516 may include a definition of legal attributes for the object together with information regarding associated attribute values and associated grammar and conversion rules. Finally, the attribute value information 4518 may include a definition of legal attribute values for given attributes together with associated information concerning grammar and conversion rules.
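For illustration, the stored object, attribute and attribute value information (items 4514, 4516 and 4518) might be modeled along the following lines; the field names and example values are assumptions intended only to show how an object links to its legal attributes, values and associated rules.

```python
# Minimal data-model sketch of the stored object, attribute and attribute-value
# information; all names and contents are hypothetical.
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class AttributeValueInfo:          # cf. stored attribute value information 4518
    legal_values: List[str]
    conversion_rules: Dict[str, str] = field(default_factory=dict)  # e.g. "oz." -> "ounce"

@dataclass
class AttributeInfo:               # cf. stored attribute information 4516
    name: str
    values: AttributeValueInfo

@dataclass
class ObjectInfo:                  # cf. stored object information 4514
    name: str
    attributes: List[AttributeInfo]

coffee_cup = ObjectInfo("coffee cup", [
    AttributeInfo("fluid measurement",
                  AttributeValueInfo(["6 ounce", "8 ounce", "16 ounce"], {"oz.": "ounce"}))])
```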
Fig. 46 shows a flow chart illustrating a process 4600 that may be implemented by a conversion system such as described above. It will be appreciated that the various process steps illustrated in Fig. 46 may be combined or modified as to sequence or otherwise. Moreover, the illustrated process 4600 relates to a system that executes a parse tree structure as well as a frame-slot architecture. It will be appreciated that a frame-slot architecture in accordance with the present invention may be implemented independent of any associated parse tree structure.
The illustrated process 4600 is initiated by receiving (4602) a data stream from a data source. Such a data stream may be entered by a user or accessed from a legacy or other information system. A segment of the data stream is then identified (4604) for conversion. For example, the segment may comprise an attribute phrase or any other chunk of source data that may be usefully processed in a collective form. Such a segment may be identified as the entirety of an input such as a search query, the entirety or a portion of a file from a legacy or other information system, or based on a prior processing step whereby phrase boundaries have been marked for purposes of conversion processing or based on logic for recognizing attribute phrases or other chunks to be coprocessed.
In the illustrated process 4600, the identified segment is then processed to identify (4606) a potential object within the segment. In the case of the coffee cup example above, the object may be identified as the term "cup" or "coffee cup." The potential object may be identified by comparison of individual terms to a collection of recognized objects or based on a preprocessing step wherein metadata has been associated with the source content to identify components thereof including objects. The potential object is then compared (4608) to a known object list of a frame-slot architecture. As discussed above, within a given subject matter, there may be a defined subset for which frame-slot processing is possible. In the illustrated process 4600, if a match (4610) is identified, the system then accesses (4614) an associated grammar and schema for processing in accordance with the frame-slot architecture. Otherwise, the segment is processed (4612) using a parse tree structure. As a further alternative, if no object is recognized, an error message may be generated or the segment may be highlighted for set-up processing for out-of-vocabulary terms, e.g., so as to expand the vocabulary and associated grammar rules.
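The branch at steps 4606-4614 can be sketched as follows; the helper functions and the known-object list are placeholders assumed for illustration rather than an actual implementation.

```python
# Sketch of the routing decision: route a segment to frame-slot processing only if its
# identified object is on the known object list; otherwise fall back to the parse tree
# or flag the segment for vocabulary set-up.

KNOWN_FRAME_OBJECTS = {"coffee cup", "roller bearing"}

def route_segment(segment, identify_object, frame_slot_convert, parse_tree_convert):
    obj = identify_object(segment)               # step 4606
    if obj is None:
        return ("out-of-vocabulary", segment)    # candidate for set-up processing
    if obj in KNOWN_FRAME_OBJECTS:               # steps 4608-4610
        return ("frame-slot", frame_slot_convert(segment, obj))   # step 4614
    return ("parse-tree", parse_tree_convert(segment))            # step 4612
```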
In the case of processing using the frame-slot architecture, an attribute associated with the object is then identified (4616). In the coffee cup example, the terms "ceramic" or "8 oz." may be identified as reflecting attributes. Such identification may be accomplished based on grammar rules or based on metadata associated with such terms by which such terms are associated with particular attribute fields. The associated attribute values are then compared (4618) to legal values. For example, the value of "8 oz." may be compared to a listing of legal values for the attribute "fluid measurement" in the context of "coffee cup." These legal values may be defined by a private schema, for example, limited to the inventory of an entity's product catalog, or may be based on other external information (e.g., defining a legal word form based on part of speech). If a match is found (4620), then the attribute phrase is recognized and an appropriate conversion process is executed (4623) in accordance with the associated grammar and conversion rules. The process 4600 then determines whether additional stream information (4624) is available for processing and either processes such additional information or terminates execution. In the case where the attribute value does not match a legal value, anomaly processing is executed (4622). How anomalies are processed generally depends on the application and context. For example, if an anomaly is identified during a set-up process, the anomalous attribute value may be verified and added to the legal values listing. In the coffee cup example, if the attribute value is "12 oz." and that value does not match a previously defined legal value but, in fact, represents a valid inventory entry, the term "12 oz." (or a standardized version thereof) may be added to the legal values list for the attribute "fluid measurement" in the context of "coffee cup." Alternatively, further processing may indicate that the attribute value is incorrect. For example, if the attribute value was "6 pack," an error in parsing may be indicated. In this case, an appropriate error message may be generated or the segment may be reprocessed to associate an alternate attribute type, e.g., "object quantity," with the term under consideration. In other contexts, different anomaly processing may be executed. For example, in the case of processing a search query, illegal values may be ignored or closest match algorithms may be executed. Thus, in the case of a query directed to a "12 oz. coffee cup," search results may be generated or a link may be executed relative to inventory related to coffee cups in general or to 8 and 16 oz. coffee cups. It will be appreciated that many other types of anomaly processing are possible in accordance with the present invention.

In the above examples, the conversion system can implement both a frame-slot architecture and a parse tree structure. This architecture and structure will now be described in more detail. Referring first to Fig. 48, a schematic diagram of a conversion system 4800 in accordance with the present invention is shown. The illustrated conversion system 4800 includes a parser 4802 for use in parsing and converting an input stream 4803 from a source 4804 to provide an output stream 4811 in a form for use by a target system 4812. In this case, the source stream 4803 includes the content "flat bar (1mm x 1" x 1')."
To accomplish the desired conversion, the parser 4802 uses information from a public schema 4806, a private schema 4808 and a grammar 4810. The public schema 4806 may include any of various types of information that is generally applicable to the subject matter and is not specific to any entity or group of entities. In this regard, Fig. 49 illustrates an example structure 4900 showing how public information related to the subject matter area may be used to define a conversion rule. As shown, the structure 4900 includes a dictionary 4904 that forms a portion of the public schema 4902. Panel 4906 shows definitions related to the object "flat bar."
Specifically, "bar" is defined as "straight piece that is longer than it is wide" and "flat" is defined as including "major surfaces distinctly greater than minor surfaces." Such definitions may be obtained from, for example, a general purpose dictionary, a dictionary specific to the subject matter, a subject matter expert or any other suitable source. These definitions are translated to define a rule as shown in panel 4908. Specifically, the associated rule indicates that "length is greater than width and width is greater than thickness." This rule may then be written into the logic of a machine-based conversion tool. Referring again to Fig. 48, this rule is reflected in file 4807 of public schema 4806.
The parser 4802 also receives input information from private schema 4808 in the illustrated example. The private schema 4808 may include conversion rules that are specific to an entity or group of entities less than the public as a whole. For example, the private schema 4808 may define legal values for a given attribute based on a catalog or inventory of an interested entity such as an entity associated with the target system 4812. An associated user interface 5000 is shown in Fig. 50A. For example, the user interface 5000 may be used in a start-up mode to populate the legal values for a given attribute. In this case, the user interface is associated with a particular project 5002 such as assembling an electronic catalog. The illustrated user interface 5000 includes a data structure panel 5004, in this case reflecting a parse-tree structure and a frame-slot structure. The interface 5000 further includes a private schema panel 5005. In this case, the private schema panel 5005 includes a number of windows 5006 and 5008 that define a product inventory of a target company. In this case, a length field 5010 associated with a table for #6 machine screws is used to define legal attribute value 5012 at a node of panel 5004 corresponding to attribute values for #6 machine screws. Associated legal value information is shown as a file 4809 of the private schema 4808 in Fig. 48.
A further example of user interface segments 5020 is shown in Fig. 50B. Specifically, Fig. 50B shows a parse tree graphics panel 5022 and a parse tree node map panel 5024. For purposes of illustration, these panels 5022 and 5024 are shown in a stacked arrangement. However, it should be appreciated that the panels 5022 and 5024 may be otherwise arranged on a user interface screen or provided on separate screens. Panel 5022 shows a parse tree for a particular product descriptor. In this case, the product descriptor is shown at the base level 5026 of the parse tree as "ruler 12" 1/16" divisions." Layers 5028-5030 show parent nodes of the parse tree. Of particular interest, both of the chunks "12"" and "1/16"" are associated with the high-level node "[length_unit]" reflecting the recognition by a parse tool that each of these chunks indicates a measure of length.
If the parse tree structure went no deeper, and no frame-slot logic was available, these two length measures would present an ambiguity.
However, a human reader would readily recognize that, in the context of rulers, "12"" likely represents the overall length of the ruler and "1/16"" most likely represents measurement increments. In the case of a frame-slot architecture, such logic can be captured by a rule that enables the parse tool to recognize and apply such context cues to provide accurate interpretations without deep parses.
In this case, such a rule is reflected within the parse tree node map of panel 5024. Specifically, a rule for interpreting "length unit" designations in the context of rulers (and, perhaps, other length measuring devices) is encoded under the "ruler" node. As shown, the rule interprets a given "length unit" as indicating "a measuring length" if the associated attribute value is greater than 1 unit of measure (uom) and treats the "length unit" as indicating an "increment" if the associated attribute value is less than 0.25 uom. This provides a certain and structurally efficient mechanism for disambiguating and converting length units in this context. Moreover, it is anticipated that such rules will be reusable in other contexts within a project (e.g., for tape measures or straight edges) and in other projects.

Grammar 4810 also provides information to the parser 4802. The grammar may provide any of various information defining a lexicon, syntax and an ontology for the conversion process. In this regard, the grammar may involve definition of standardized terminology as described above. Thus, in the illustrated example, file 4813 associates the standardized terms "inch," "foot," and "millimeter" with various alternate forms thereof.
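A minimal sketch of the ruler length-unit rule described above (a value greater than 1 uom read as the measuring length, a value less than 0.25 uom read as an increment) is given below; apart from those thresholds, the parsing details are assumptions made for illustration.

```python
# Sketch of the disambiguation rule encoded under the "ruler" node.
from fractions import Fraction

def classify_length_unit(value_text):
    value = float(Fraction(value_text))          # handles "12" and "1/16"
    if value > 1:
        return "measuring length"
    if value < 0.25:
        return "increment"
    return "ambiguous"                           # outside the rule's two ranges

print(classify_length_unit("12"))    # measuring length
print(classify_length_unit("1/16"))  # increment
```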
The parser 4802 can then use the input from the public schema 4806, private schema 4808 and grammar 4810 to interpret the input stream 4803 to provide an output stream 4811 to the target 4812. In this case, the noted input stream 4803 is interpreted as "flat bar - 1' long, 1" wide and 1 mm thick."

Referring to Fig. 47, a further example related to a frame-slot architecture 4700 is illustrated. The architecture 4700 is used to process a source stream 4702, in this case, "bearings for transmission~100milli. bore." For example, this source stream may be a record from a legacy information system or a search query. As discussed above, the processing of this source stream 4702 may utilize various contextual cues. As will be discussed in more detail below, such contextual cues may be derived from the content of the source stream 4702 itself. However, it is also noted that certain metadata cues 4704 may be included in connection with the source stream 4702. In this regard, it is noted that legacy information systems such as databases may include a significant amount of structure that can be leveraged in accordance with the present invention. Such structure may be provided in the form of links of relational databases or similar tags or hooks that define data relationships. Such contextual information, which can vary substantially in form, is generally referred to herein as metadata. The frame-slot architecture 4700 is utilized to identify an object 4706 from the source stream 4702. As noted above, this may involve identifying a term within the stream 4702 and comparing the term to a list of recognized objects or otherwise using logic to associate an input term with a recognized object. It will be noted that some degree of standardization or conversion, which may involve the use of contextual information, may be performed in this regard. Thus, in the illustrated example, the identified object "roller bearing" does not literally correspond to any particular segment of the stream 4702. Rather, the object "roller bearing" is recognized from the term "bearing" from the stream 4702 together with contextual cues provided by the term "transmission" included within the content of the stream 4702 and, perhaps, from metadata cues 4704. Other sources including external sources of information regarding bearings may be utilized in this regard by logic for matching the stream 4702 to the object 4706.
Based on the object 4706, information regarding attributes 4708 and attribute values 4714 may be accessed. As discussed above, such information may be derived from public and private schema. For example, an attribute type 4710 may be identified for the object 4706 and corresponding legal attribute values 4712 may be determined. In this case, one attribute associated with the object "roller bearing" is "type" that has legal values of "cylindrical, tapered and spherical." The stream 4702 may be processed using this information to determine a refined object 4716. In this case, the refined object is determined to be "cylindrical roller bearing." Again, it will be noted that this refined object 4716 is not literally derived from the stream 4702 but rather, in the illustrated example, is determined based on certain contextual information and certain conversion processes. Thus, the stream 4702 is determined to match the attribute value "cylindrical" based on contextual information related to the terms "transmission" and "bore" included within the content of the source stream 4702. Information regarding the attributes 4708 and attribute values 4714 may again be accessed based on this refined object 4716 to obtain further attributes 4718 and associated attribute values 4720. It should be noted that these attributes and attribute values 4718 and 4720, though illustrated as being dependent on the attribute 4710 and attribute value 4712 may alternatively be independent attributes and attribute values associated with the object 4706. However, in the illustrated example, the attribute "size parameter" is associated with the legal values "inside diameter" and "outside diameter" based on the refined object "cylindrical roller bearings."
In this case, the attribute 4718 and attribute value 4720 are used together with certain contextual cues to define a further refined object 4722. In this case, the further refined object 4722 is defined as "cylindrical roller bearing inside diameter." A selection between the legal values "inside diameter" and "outside diameter" is made based on contextual information provided by the term "bore" included within the content of the stream 4702. Based on this further refined object 4722, information regarding the attributes 4708 and attribute values 4714 can be used to identify a further attribute 4724 and associated legal values 4725. In this case, the attribute 4724 is "legal dimensions" and associated legal values 4725 are defined as "50, 60, 70, 80, 90, 100, 150 . . . 500." These values are assumed for the purposes of this example to be given in millimeters. In this case, the input stream 4702 is processed in view of the attribute 4724 and legal values 4725 to define an output 4726 identified as "100mm ID cylindrical roller bearings." In this regard, the stream term "100 milli." is found to match the legal value of "100" for the attribute "legal dimensions" in the context of cylindrical roller bearings inside diameter. It will be appreciated that the term "milli." has thus been matched, based on a standardization or conversion process, to the designation "mm." It should be noted in this regard that success in matching the source term "100 milli." to the legal value "100mm" provides further confidence that the conversion was correctly and accurately performed.
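The progressive refinement illustrated in Fig. 47 might be sketched as follows; the cue tables, scoring and legal values are assumptions drawn loosely from the example above and are not a definitive implementation.

```python
# Illustrative sketch: contextual cues in the stream select among legal attribute
# values, progressively refining the object.

LEGAL = {
    "type": {"cylindrical": {"transmission", "bore"}, "tapered": set(), "spherical": set()},
    "size parameter": {"inside diameter": {"bore"}, "outside diameter": set()},
}

def refine(stream_terms, attribute):
    """Pick the legal value whose contextual cues best overlap the stream terms."""
    scored = {val: len(cues & stream_terms) for val, cues in LEGAL[attribute].items()}
    return max(scored, key=scored.get)

terms = {"bearings", "transmission", "bore", "100", "milli."}
bearing_type = refine(terms, "type")                 # "cylindrical"
size_param = refine(terms, "size parameter")         # "inside diameter"
print(f"100mm {'ID' if size_param == 'inside diameter' else 'OD'} {bearing_type} roller bearings")
```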
Various types of outputs reflecting various conversion applications may be provided in this regard. Thus, in the case of converting an input file from a legacy database to an output form of a target information system, the input stream 4702 may be rewritten as "100 mm ID cylindrical roller bearing." In the case where the source stream 4702 represents a search query, the output may be provided by way of linking the user to an appropriate web page or including associated information in a search results page. It will be appreciated that other types of output may be provided in other conversion environments. While various embodiments of the present invention have been described in detail, it is apparent that further modifications and adaptations of the invention will occur to those skilled in the art. However, it is to be expressly understood that such modifications and adaptations are within the spirit and scope of the present invention.

Claims

What is claimed:
1. A method for use in converting data from a source form to a target form, where said target form differs from said source form with respect to one of linguistics and syntax, said method comprising the steps of:
providing logic for converting data from a first form to a second form, said logic being configurable to associate first elements of said first form with second elements of said second form so as to establish a conversion model;
first operating said logic to establish a first association of a first set of said first elements with a second set of said second elements;
second operating said logic to establish a second association of a third set of said first elements with a fourth set of said second elements, wherein said first set and said third set include at least a common one of said first elements thereby defining an overlap; and
third operating said logic to process one of said first association and said second association so as to address any inconsistencies associated with said overlap.
2. A method as set forth in Claim 1, wherein said first form is one of said source form and said target form.
3. A method as set forth in Claim 1, wherein said second form is one of said source form and said target form.
4. A method as set forth in Claim 1, wherein said second form is an intermediate form, different than both said source form and said target form.
5. A method as set forth in Claim 1, wherein said second form comprises a semantic metadata model that provides a standardized basis for data conversion.
6. A method as set forth in Claim 5, wherein said semantic metadata model defines a set of standardized terms where each standardized term corresponds to one or more of said first elements and one or more of said second elements.
7. A method as set forth in Claim 6, wherein said semantic metadata model defines a taxonomy for classifying said standardized terms.
8. A method as set forth in Claim 5, wherein said step of first operating is performed by a first user and said step of second operating is performed by a second user different than said first user whereby multiple users can develop said semantic metadata model and any associated inconsistencies are addressed by said logic.
9. A method as set forth in Claim 8, wherein said first user establishes said first association over time defining a first time period and said second user establishes said second association over time defining a second time period that overlaps, at least in part, said first time period.
10. A method as set forth in Claim 1, wherein said step of third operating comprises modifying one of said first and second associations to match the other of said first and second associations with respect to said overlap.
11. A method as set forth in Claim 1, wherein said step of third operating comprises modifying both of said first and second associations so as to match with respect to said overlap.
12. A method for use in establishing a searchable data structure, comprising the steps of:
providing a list of terms pertaining to a subject matter area of interest;
establishing a classification structure for said subject matter area of interest, said classification structure having a hierarchical form including classes, each corresponding to a subset of said subject matter area, each of which includes one or more sub-classes corresponding to a subset of a respective one of said classes; and
for each one term of said terms, associating said one term with said classification structure such that said one term is assigned to at least one of said sub-classes and at least one of said classes.
13. A method as set forth in Claim 12, wherein said terms comprise potential search terms for use in search requests to access data of said subject matter area of interest.
14. A method as set forth in Claim 12, wherein said terms comprise source terms of a source data collection.
15. A method as set forth in Claim 12, wherein said step of providing a list comprises providing potential search terms for use in search requests and providing source terms of a source data collection.
16. A method as set forth in Claim 12, wherein said step of establishing comprises adopting, at least in part, a pre-existing classification structure for said subject matter area.
17. A method as set forth in Claim 12, wherein said step of establishing comprises using one or more individuals to develop a classification structure for said subject matter area.
18. A method as set forth in Claim 12, wherein said step of associating comprises establishing a database where said one term is related to said one sub-class and said one class such that data identified by said one term can be accessed in said database based on selecting any one of said one term, said sub-class and said class.
19. A method as set forth in Claim 12, wherein said step of associating comprises providing a graphical representation of said classification structure on a graphical user interface and using said graphical representation to effect an association of said one term with a node of said classification structure.
20. A method as set forth in Claim 19, wherein said step of using said graphical representation comprises dragging a first graphical element representing said one term relative to said graphical user interface and dropping said first graphical element at a second graphical element representing said node of said classification structure.
21. A method as set forth in Claim 12, wherein said step of providing a list comprises obtaining a first set of terms in a first form and transforming said first set of terms into a second form to provide a second set of terms, wherein said second form differs from said first form with respect to at least one of linguistics and syntax.
22. A method as set forth in Claim 21, wherein said second set contains fewer terms than said first set.
23. A method as set forth in Claim 12, wherein said step of providing comprises translating said terms from a first language to a second language.
24. A searchable data system, comprising:
an input port for receiving a search request including a search term;
a first storage structure for storing data regarding a subject matter of said searchable data system;
a second storage structure for storing a knowledge base for relating potential search terms to a defined classification structure of said subject matter of said searchable data system;
logic for identifying said search term of said search request, using said knowledge base to relate said search term to a determined classification of said classification structure, and using said determined classification to access said first storage structure to obtain responsive data that is responsive to said search request; and
an output port for outputting said responsive data.
25. A system as set forth in Claim 24, wherein said logic is operative for mapping said search term to a standardized term of a set of predefined standardized terms.
26. A system as set forth in Claim 24, wherein said classification structure includes multiple classifications having parent and child relationships where a child classification corresponds to a subset of an associated parent classification in relation to said subject matter, and wherein said search term is related to at least one child classification and an associated parent classification.
27. A system as set forth in Claim 26, wherein a given child classification is associated with a plurality of parent classifications.
28. A method for use in operating a machine-based tool for converting data from a first form to a second form, comprising the steps of:
establishing, based on external knowledge of a subject matter area independent of analysis of a particular data set to be converted, a number of schema, each including one or more conversion rules for use in converting data within a corresponding context of said subject matter area;
identifying a set of data to be converted from said first form to said second form;
determining a particular context of said set of data;
based on said context, accessing an associated first schema of said number of schema; and
using an included conversion rule of said first schema in a process for converting said set of data from said first form to said second form.
29. A method as set forth in Claim 28, wherein said step of establishing comprises identifying a public schema, including conversion rules generally applicable to said subject matter area independent of any entity or group of entities associated with said set of data, that establishes a structure for understanding at least a portion of the subject matter area.
30. A method as set forth in Claim 29, wherein said public schema involves an accepted public definition of a semantic object.
31. A method as set forth in Claim 28, wherein said step of establishing comprises identifying a private schema, including conversion rules specific to an entity or group of entities less than the public as a whole, that establishes a structure for understanding at least a portion of the subject matter area.
32. A method as set forth in Claim 28, wherein said step of identifying comprises parsing a data stream to obtain an attribute phrase including information potentially defining a semantic object, an attribute of said object and an attribute value of said attribute.
33. A method as set forth in Claim 32, wherein said step of determining comprises associating said semantic object with said particular context.
34. A method as set forth in Claim 32, wherein said step of using comprises executing logic to interpret said information so as to identify said object, attribute or attribute value.
35. A method as set forth in Claim 32, wherein said step of using comprises performing a comparison of said object, attribute or attribute value to a corresponding set of objects, attributes, or attribute values defined by said first schema.
36. A method as set forth in Claim 35, wherein said step of using comprises using said comparison to convert said set of data from said first form to said second form.
37. A method as set forth in Claim 35, wherein said step of using comprises using said comparison to identify an anomaly regarding said set of data.
38. A method as set forth in Claim 32, wherein said step of using comprises identifying legal attribute values for said attribute.
39. A method as set forth in Claim 28, wherein said step of establishing is implemented in a start-up mode for configuration of logic of said machine-based tool so as to convert data based on contextual cues inferred from an understanding of said subject matter area.
40. A method as set forth in Claim 39, wherein said first schema is operative to enable proper conversion of a set of data which was not specifically addressed in said configuration.
41. An apparatus for use in converting data from a first form to a second form, comprising:
an input port for receiving an input including a first content string to be converted; and
a processor operative for analyzing said content string to determine an applicable schema for use in converting at least a portion of said content string, wherein said schema is applicable to less than the whole of a subject matter area including said content string and includes one or more conversion rules for use in converting data from said first form to said second form;
said processor further being operative for using said applicable schema to convert said content string from said first form to said second form and to provide a corresponding output.
42. An apparatus as set forth in Claim 41, wherein said processor is further operative for accessing one or more stored public schema, each said public schema including conversion rules generally applicable to said subject matter area independent of any entity or group of entities associated with said input, that establishes a structure for understanding at least a portion of the subject matter area.
43. An apparatus as set forth in Claim 41, wherein said processor is further operative for accessing one or more stored private schema, each said private schema including conversion rules specific to an entity or group of entities less than the public as a whole, that establishes a structure for understanding at least a portion of the subject matter area.
PCT/US2005/031303 2004-09-01 2005-09-01 Functionality and system for converting data from a first to a second form WO2006031466A2 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US10/931,789 US7865358B2 (en) 2000-06-26 2004-09-01 Multi-user functionality for converting data from a first form to a second form
US10/931,789 2004-09-01
US10/970,372 US8396859B2 (en) 2000-06-26 2004-10-21 Subject matter context search engine
US10/970,372 2004-10-21
US11/151,596 2005-06-13
US11/151,596 US7536634B2 (en) 2005-06-13 2005-06-13 Frame-slot architecture for data conversion

Publications (2)

Publication Number Publication Date
WO2006031466A2 true WO2006031466A2 (en) 2006-03-23
WO2006031466A3 WO2006031466A3 (en) 2006-08-17

Family

ID=36060513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/031303 WO2006031466A2 (en) 2004-09-01 2005-09-01 Functionality and system for converting data from a first to a second form

Country Status (1)

Country Link
WO (1) WO2006031466A2 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4141005A (en) * 1976-11-11 1979-02-20 International Business Machines Corporation Data format converting apparatus for use in a digital data processor
US20050288920A1 (en) * 2000-06-26 2005-12-29 Green Edward A Multi-user functionality for converting data from a first form to a second form
US20030037173A1 (en) * 2000-09-01 2003-02-20 Pace Charles P. System and method for translating an asset for distribution over multi-tiered networks
US20040059705A1 (en) * 2002-09-25 2004-03-25 Wittke Edward R. System for timely delivery of personalized aggregations of, including currently-generated, knowledge

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10002159B2 (en) 2013-03-21 2018-06-19 Infosys Limited Method and system for translating user keywords into semantic queries based on a domain vocabulary
US10565533B2 (en) 2014-05-09 2020-02-18 Camelot Uk Bidco Limited Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
US10896212B2 (en) 2014-05-09 2021-01-19 Camelot Uk Bidco Limited System and methods for automating trademark and service mark searches
US11100124B2 (en) 2014-05-09 2021-08-24 Camelot Uk Bidco Limited Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
US10992488B2 (en) * 2017-12-14 2021-04-27 Elizabeth K. Le System and method for an enhanced focus group platform for a plurality of user devices in an online communication environment
US11544304B2 (en) * 2018-03-27 2023-01-03 Innoplexus Ag System and method for parsing user query
CN109582647A (en) * 2018-11-21 2019-04-05 珠海市新德汇信息技术有限公司 A kind of analysis method and system towards the unstructured instrument of evidence
CN109582647B (en) * 2018-11-21 2022-09-30 珠海市新德汇信息技术有限公司 Unstructured evidence file oriented analysis method and system
US20230267271A1 (en) * 2022-02-24 2023-08-24 Research Factory And Publication Inc. Auto conversion system and method of manuscript format

Also Published As

Publication number Publication date
WO2006031466A3 (en) 2006-08-17

Similar Documents

Publication Publication Date Title
US9311410B2 (en) Subject matter context search engine
US7865358B2 (en) Multi-user functionality for converting data from a first form to a second form
US7225199B1 (en) Normalizing and classifying locale-specific information
US6986104B2 (en) Method and apparatus for normalizing and converting structured content
US8190985B2 (en) Frame-slot architecture for data conversion
US7921367B2 (en) Application generator for data transformation applications
US9201869B2 (en) Contextually blind data conversion using indexed string matching
Heflin et al. Searching the Web with SHOE
US9037613B2 (en) Self-learning data lenses for conversion of information from a source form to a target form
US20060074980A1 (en) System for semantically disambiguating text information
WO2014035539A1 (en) Contextually blind data conversion using indexed string matching
Diefenbach et al. Qanswer KG: designing a portable question answering system over RDF data
US20050177358A1 (en) Multilingual database interaction system and method
WO2006031466A2 (en) Functionality and system for converting data from a first to a second form
US20090164428A1 (en) Self-learning data lenses
US9207917B2 (en) Application generator for data transformation applications
Chang et al. Mining semantics for large scale integration on the web: evidences, insights, and challenges
Bayer et al. Evaluation of an ontology-based knowledge-management-system. A case study of Convera RetrievalWare 8.0
Lutsky Information extraction from documents for automating software testing
Uschold et al. Ontologies for knowledge management
Fernandes Development of a Web-Based Platform for Biomedical Text Mining
Merrill The Babylon project: toward an extensible text-mining platform
Reitter et al. Hybrid natural language processing in a customer-care environment
Corral et al. The SGML standard for a technical documents writing assistance system
Johnson et al. Aviation Technologies and Human Error: a Research Literature Database and Analysis

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1)EPC

122 Ep: pct application non-entry in european phase

Ref document number: 05814943

Country of ref document: EP

Kind code of ref document: A2