EP0998714A1 - System for processing textual inputs using natural language processing techniques - Google Patents

System for processing textual inputs using natural language processing techniques

Info

Publication number
EP0998714A1
EP0998714A1 EP98936899A EP98936899A EP0998714A1 EP 0998714 A1 EP0998714 A1 EP 0998714A1 EP 98936899 A EP98936899 A EP 98936899A EP 98936899 A EP98936899 A EP 98936899A EP 0998714 A1 EP0998714 A1 EP 0998714A1
Authority
EP
European Patent Office
Prior art keywords
logical forms
document
logical
obtaining
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP98936899A
Other languages
German (de)
English (en)
French (fr)
Inventor
Simon H. Corston
William B. Dolan
Lucy H. Vanderwende
Lisa Braden-Harder
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US08/898,652 (US5933822A)
Application filed by Microsoft Corp
Publication of EP0998714A1
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • the present invention deals with processing textual inputs. More specifically, the present invention relates to using natural language processing techniques in order to determine similarity between textual inputs.
  • the present invention is useful in a wide variety of applications, such as information retrieval, machine translation, natural language understanding, document similarity/clustering, etc.
  • the present invention will be described primarily in the context of information retrieval, for illustrative purposes only.
  • information retrieval is a process by which a user finds and retrieves information, relevant to the user, from a large store of information.
  • the amount of information that can be queried and searched is very large.
  • some information retrieval systems are set up to search information on the internet, digital video discs, and other computer databases in general.
  • the information retrieval systems are typically embodied as, for example, internet search engines and library catalog search engines.
  • certain types of information retrieval mechanisms are provided.
  • some operating systems provide a tool by which a user can search all files on a given database or on a computer system based upon certain terms input by the user. Many information retrieval techniques are known.
  • a user input query in such techniques is typically presented as either an explicit user generated query, or an implicit query, such as when a user requests documents which are similar to a set of existing documents.
  • Typical information retrieval systems search documents in the larger data store at either a single word level, or at a term level. Each of the documents is assigned a relevancy (or similarity) score, and the information retrieval system presents a certain subset of the documents searched to the user, typically that subset which has a relevancy score which exceeds a given threshold.
  • English, like other languages, has a rich and complex syntactic and lexico-semantic structure with words whose meanings vary, often widely, based on the specific linguistic context in which they are used, with the context determining in any one instance a given meaning of a word and what word(s) can subsequently appear.
  • words that appear in a textual passage are simply not independent of each other; rather, they are highly inter-dependent. Keyword based search engines totally ignore this fine-grained linguistic structure.
  • a statistical search engine operating on content words “hearts” and “octopus”, or morphological stems thereof, might return or direct a user to a stored document that contains a recipe that has as its ingredients, and hence its content words: “artichoke hearts, squid, onions and octopus”.
  • This engine, given matches in the two content words “octopus” and “hearts”, may determine, based on statistical measures, e.g. including proximity and logical operators, that this document is an excellent match, when, in reality, the document is quite irrelevant to the query.
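The failure mode described above can be illustrated with a minimal keyword-overlap scorer. This is only a sketch: the query wording, the stopword list and the scoring function are assumptions made for illustration and do not describe any particular statistical engine.

```python
import re

# Small illustrative stopword list (an assumption, not taken from the text).
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "how", "many", "does", "have", "has"})


def content_words(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def keyword_score(query, document):
    """Fraction of the query's content words that occur anywhere in the document."""
    q = content_words(query)
    d = content_words(document)
    return len(q & d) / len(q) if q else 0.0


# Hypothetical query whose content words are "hearts" and "octopus".
query = "How many hearts does an octopus have?"
recipe = "Ingredients: artichoke hearts, squid, onions and octopus."

print(keyword_score(query, recipe))  # every query content word is present
```

Because the scorer looks only at which words occur, the recipe receives a perfect score even though it says nothing about how many hearts an octopus has.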
  • the present invention provides a method and apparatus for determining similarity between two textual inputs.
  • a first set of logical forms is obtained for the first textual input, and a second set of logical forms is obtained for the second textual input.
  • the first and second sets of logical forms are compared, and similarity between the first and second textual inputs is determined based on the comparison.
  • a logical form is a directed graph in which words representing text of any arbitrary size are linked by labeled relations.
  • a logical form portrays structural relationships (i.e., syntactic and semantic relationships), particularly argument and/or adjunct relationships, between important words in an input string.
  • This portrayal can take various specific forms, such as a logical form graph or any sub-graph thereof, the latter including, for example, a list of logical form triples, with each of the triples being illustratively of a form "word-relation-word"; any one of these forms can be used with our invention.
  • each textual input is subjected to natural language processing, illustratively morphological, syntactic and semantic, to ultimately produce appropriate logical forms for each sentence in each textual input.
  • the set of logical forms for the first textual input is then compared to the set of logical forms associated with the second textual input in order to ascertain a match between logical forms.
  • Similarity means obtaining some measure for how close two textual inputs are with respect to either semantic and syntactic structure or lexical meaning, or both.
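A minimal sketch of comparing two sets of logical form triples follows. The `Triple` type and the overlap ratio used as the similarity measure are assumptions chosen for illustration; the text requires only "some measure" of closeness and does not prescribe this one. The second input's triples are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One "word-relation-word" logical form triple."""
    head: str
    relation: str
    dependent: str


def similarity(first, second):
    """Shared triples divided by all distinct triples (an assumed measure)."""
    a, b = set(first), set(second)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Triples for a first textual input, in the "word; relation; word" style.
first_input = [Triple("eat", "Dsub", "spider"), Triple("eat", "Dobj", "victim")]
# Hypothetical triples for a second textual input.
second_input = [Triple("eat", "Dsub", "spider"), Triple("eat", "Dobj", "insect")]

print(similarity(first_input, second_input))
```

Any other set-based or graph-based measure could be substituted for the ratio without changing the overall comparison step.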
  • information retrieval systems are based, in part, on natural language processing. Semantic information is used to capture more information about either the documents being searched, or the queries, or both, in order to achieve better performance or accuracy. Generally, such systems use natural language processing techniques in an attempt to match the semantic content of a first textual input (such as the queries) to that of a second textual input (such as the documents being searched). Such systems represent a significant advancement in the art, particularly with respect to obtaining increased precision in the information retrieval process.
  • the input query is converted to one or more logical forms, and the documents retrieved by a search engine are also converted to logical forms.
  • the logical forms for the query are compared against those for the documents. Documents whose logical forms precisely match the logical forms corresponding to the query are ranked and presented to the user.
  • the stringency associated with the above-described matching process is reduced by using paraphrased logical forms.
  • in the information retrieval application, there may be a need to reduce the stringency in the filtering process in order to prevent discarding relevant documents.
  • a document that the query (or keyword search) correctly includes in the recall set will be incorrectly discarded. This can occur when keywords from the query occur in the document, but not in the precise syntactic/semantic relationship required by the logical form generated for the query.
  • Such an incorrectly discarded document can be illustrated by the following example. It should be noted that the example discusses logical form triples, but other subgraphs of a logical form can be used as well. Assume that the query is as follows:
  • the logical form triples generated for the query will be:
      eat; Dsub; spider
      eat; Dobj; victim
  • a relevant document may include the sentence "Many spiders consume their victims . . .".
  • Logical form triples generated for that sentence will be as follows:
  • logical forms may appear with a high degree of frequency in documents in the large data store being searched.
  • Such logical forms may also be commonly present in queries, regardless of the subject matter of the query. For instance, assume that the query is:
  • one or both sets of logical forms are modified, such as by paraphrasing the set of logical forms or suppressing certain logical forms.
  • the modified set or sets of logical forms are used in the matching process.
  • the system filters documents in a document set retrieved from a document store in response to a query.
  • the system obtains a first set of logical forms based on a selected one of the query and the documents in the document set.
  • the system obtains a second set of logical forms based on another of the query and the documents in the document set.
  • the system then uses natural language processing techniques to modify the first logical forms to obtain a modified set of logical forms.
  • the system filters documents in the document set based on a predetermined relationship between the modified set of logical forms and the second set of logical forms.
  • the natural language processing techniques are used to obtain a first set of paraphrased logical forms indicative of paraphrases of the first set of logical forms.
  • the natural language processing techniques suppress a first predetermined class of logical forms to obtain a first suppressed set of logical forms. Filtering is then conducted based upon the set of paraphrased logical forms and/or the suppressed set of logical forms.
  • the query is received and query logical forms are computed based on the query.
  • the query is run and documents are retrieved based on the query.
  • Logical forms are either computed or retrieved from a data store for each document retrieved.
  • High frequency query logical forms are suppressed, and paraphrased logical forms are computed based on the query logical forms.
  • the paraphrased query logical forms are matched against the document logical forms.
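The suppress, paraphrase and match steps above can be sketched as follows. The synonym table, the high-frequency triple and the document triples are all hypothetical values chosen for illustration; the text does not say how paraphrases are generated or which logical forms count as high frequency.

```python
from itertools import product

# Hypothetical synonym table used to paraphrase words in a triple.
SYNONYMS = {"eat": {"consume", "devour"}}
# Hypothetical class of high-frequency triples that do not discriminate
# among documents and are therefore suppressed.
HIGH_FREQUENCY = {("find", "Dobj", "information")}


def suppress(triples):
    """Remove triples that belong to the suppressed class."""
    return {t for t in triples if t not in HIGH_FREQUENCY}


def paraphrase(triples):
    """Expand each triple with synonym substitutions for either word."""
    expanded = set()
    for head, relation, dependent in triples:
        heads = {head} | SYNONYMS.get(head, set())
        dependents = {dependent} | SYNONYMS.get(dependent, set())
        expanded.update((h, relation, d) for h, d in product(heads, dependents))
    return expanded


def keep_document(query_triples, doc_triples):
    """Keep a document if any modified query triple matches a document triple."""
    modified = paraphrase(suppress(query_triples))
    return bool(modified & set(doc_triples))


query_triples = {
    ("eat", "Dsub", "spider"),
    ("eat", "Dobj", "victim"),
    ("find", "Dobj", "information"),
}
# Assumed triples for a document about spiders consuming their victims.
doc_triples = {("consume", "Dsub", "spider"), ("consume", "Dobj", "victim")}

print(bool(query_triples & doc_triples))          # exact matching finds nothing
print(keep_document(query_triples, doc_triples))  # paraphrased matching keeps it
```

With exact matching the document would be discarded; after paraphrasing, the query triples built on "eat" also match triples built on "consume", so the document is retained.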
  • FIG. 1 depicts a very high-level block diagram of information retrieval system 5 in accordance with our present invention
  • FIG. 2 depicts a high-level embodiment of information retrieval system 200, of the type shown in FIG. 1, that utilizes the teachings of our present invention
  • FIG. 3 depicts a block diagram of computer system 300, specifically a client personal computer, that is contained within system 200 shown in FIG. 2;
  • FIG. 4 depicts a very high-level block diagram of application programs 400 that execute within computer 300 shown in FIG. 3;
  • FIGS. 5A-5D depict different corresponding examples of English language sentences of varying complexity and corresponding logical form elements therefor;
  • FIG. 6 depicts the correct alignment of the drawing sheets for FIGS. 6A and 6B;
  • FIGS. 6A and 6B collectively depict a flowchart of our inventive Retrieval process 600
  • FIG. 7 depicts a flowchart of NLP routine 700 that is executed within process 600;
  • FIG. 8A depicts illustrative Matching Logical Form Triple Weighting table 800
  • FIG. 8B graphically depicts logical form triple comparison, document scoring, ranking and selection processes in accordance with our inventive teachings, that occur within blocks 650, 660, 665 and 670, all shown in FIGS. 6A and 6B, for an illustrative query and an illustrative set of three statistically retrieved documents;
  • FIGS. 9A-9C respectively depict three different embodiments of information retrieval systems that incorporate the teachings of our present invention
  • FIG. 9D depicts an alternate embodiment of remote computer (server) 930 shown in FIG. 9C for use in implementing yet another different embodiment of our present invention
  • FIG. 10 depicts the correct alignment of the drawing sheets for FIGS. 10A and 10B;
  • FIGS. 10A and 10B collectively depict yet another embodiment of our present invention wherein the logical form triples for each document are precomputed and stored, along with the document record therefor, for access during a subsequent document retrieval operation;
  • FIG. 11 depicts Triple Generation process 1100 that is performed by Document Indexing engine 1015 shown in FIGS. 10A and 10B;
  • FIG. 12 depicts the correct alignment of the drawing sheets for FIGS. 12A and 12B;
  • FIGS. 12A and 12B collectively depict a flowchart of our inventive Retrieval process 1200 that is executed within computer system 300 shown in FIGS. 10A and 10B;
  • FIG. 13A depicts a flowchart of NLP routine 1300 which is executed within Triple Generation process 1100.
  • FIG. 13B depicts a flowchart of NLP routine 1350 which is executed within Retrieval process 1200.
  • FIG. 14 is a functional block diagram illustrating one embodiment of the present invention.
  • FIG. 15 is a functional block diagram illustrating indexing of documents in accordance with one aspect of the present invention.
  • FIG. 16 is a more detailed block diagram of a retrieval engine in accordance with one aspect of the present invention.
  • FIG. 17 is a flow diagram illustrating operation of the system shown in FIG. 16.
  • FIG. 18 is a flow diagram illustrating natural language processor modification of logical forms in accordance with one aspect of the present invention.
  • FIG. 19 is a more detailed block diagram illustrating natural language processor modification of logical forms in accordance with one aspect of the present invention.
  • the present invention utilizes natural language processing techniques to create sets of logical forms corresponding to a first textual input and a second textual input.
  • the present invention determines similarity between the first and second textual inputs based on a comparison of the sets of logical forms.
  • one or both of the sets of logical forms is modified, such as by obtaining paraphrases or suppressing certain logical forms. While the present invention is contemplated for use in a wide variety of applications, it is described herein, primarily in the context of information retrieval, for the purpose of illustration only.
  • the present invention creates sets of logical forms corresponding to an input query, and corresponding to a document set returned in response to the input query.
  • the present invention also utilizes natural language processing techniques to modify the logical forms corresponding to either the query, the document set, or both.
  • the modified logical forms are expanded to include paraphrases.
  • the modified logical forms are processed to suppress a predetermined class of logical forms which have not proven useful in discriminating among various documents.
  • our teachings of our present invention can be readily utilized in many applications and nearly any information retrieval system to increase the precision of a search engine used therein, regardless of whether that engine is a conventional statistical engine or not.
  • our invention can be utilized to improve precision in retrieving textual information from nearly any type of mass data store, e.g. a database whether stored on magnetic, optical (e.g. a CD-ROM) or other media, and regardless of any particular language in which the textual information exists, e.g. English, Spanish, German and so forth.
  • FIG. 1 depicts a very high-level block diagram of information retrieval system 5 that utilizes our invention.
  • System 5 is formed of conventional retrieval engine 20, e.g. a keyword based statistical retrieval engine, followed by processor 30.
  • Processor 30 utilizes our inventive natural language processing technique, as described below, to filter and re-rank documents produced by engine 20 to yield an ordered set of retrieved documents that are more relevant to a user-supplied query than would otherwise arise.
  • a user supplies a search query to system 5.
  • the query should be in full-text (commonly referred to as "literal") form in order to take full advantage of its semantic content through natural language processing and thus provide an increase in precision over that associated with engine 20 alone.
  • System 5 applies this query both to engine 20 and processor 30.
  • engine 20 searches through dataset 10 of stored documents to yield a set of retrieved documents therefrom.
  • This set of documents (also referred to herein as an "output document set") is then applied, as symbolized by line 25, as an input to processor 30.
  • Within processor 30, as discussed in detail below, each of the documents in the set is subjected to natural language processing, illustratively morphological, syntactic and logical form, to produce logical forms for each sentence in that document.
  • Each such logical form for a sentence encodes, for example, semantic relationships, particularly argument and adjunct structure, between words in a linguistic phrase in that sentence.
  • Processor 30 analyzes the query in an identical fashion to yield a set of corresponding logical forms therefor. Processor 30 then compares the set of forms for the query against the sets of logical forms associated with each of the documents in the set in order to ascertain any match between logical forms in the query set and logical forms for each document. Documents that produce no matches are eliminated from further consideration.
  • Each remaining document that contains at least one logical form which matches the query logical form is retained and heuristically scored by processor 30.
  • each different relation type (e.g., deep subject, deep object, operator and the like) that can occur in a logical form triple is assigned a predefined weight.
  • the total weight (i.e., score) of each such document is, e.g., the sum of the weights of all its uniquely matching triples, i.e. with duplicate matching triples being ignored.
  • processor 30 presents the retained documents to the user rank-ordered based on their score, typically in groups of a predefined number, e.g. five or ten, starting with those documents that have the highest score.
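The filtering, scoring and group-wise presentation just described can be sketched as below. The weight values, the default weight and the sample triples are hypothetical; the text says only that each relation type has a predefined weight, that duplicate matching triples are ignored, and that documents are shown in descending score order in groups of a predefined size.

```python
# Hypothetical per-relation weights (the actual values are not given here).
RELATION_WEIGHTS = {"Dsub": 100, "Dobj": 75}
DEFAULT_WEIGHT = 10  # assumed weight for any relation type not listed above


def matching_triples(query_triples, doc_triples):
    """Unique triples shared by query and document (sets drop duplicates)."""
    return set(query_triples) & set(doc_triples)


def score(query_triples, doc_triples):
    """Sum the relation weights of the uniquely matching triples."""
    return sum(
        RELATION_WEIGHTS.get(relation, DEFAULT_WEIGHT)
        for _head, relation, _dep in matching_triples(query_triples, doc_triples)
    )


def rank_in_groups(query_triples, documents, group_size=5):
    """Drop documents with no match, sort by descending score, split into groups."""
    kept = [
        (score(query_triples, triples), name)
        for name, triples in documents.items()
        if matching_triples(query_triples, triples)
    ]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    names = [name for _score, name in kept]
    return [names[i:i + group_size] for i in range(0, len(names), group_size)]


query = [("have", "Dsub", "octopus"), ("have", "Dobj", "heart")]
documents = {
    "doc_a": [("have", "Dsub", "octopus")],
    "doc_b": [("have", "Dobj", "heart"), ("have", "Dsub", "octopus"),
              ("have", "Dobj", "heart")],      # duplicate is counted once
    "doc_c": [("contain", "Dobj", "squid")],   # no match, so eliminated
}

print(rank_in_groups(query, documents, group_size=5))
```

Here `doc_b` matches both query triples and outranks `doc_a`, while `doc_c` is removed before scoring because it shares no triple with the query.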
  • That context will be an information retrieval system that employs a conventional keyword based statistical Internet search engine to retrieve stored records of English- language documents indexed into a dataset from the world wide web.
  • Each such record generally contains predefined information, as set forth below, for a corresponding document.
  • the record may contain the entire document itself.
  • FIG. 2 depicts a high-level block diagram of a particular embodiment of our invention used in the context of an Internet search engine. Our invention will principally be discussed in detail in the context of this particular embodiment.
  • system 200 contains computer system 300, such as a client personal computer (PC), connected, via network connection 205, through network 210 (here the Internet, though any other such network, e.g. an intranet, could be alternatively used), and network connection 215, to server 220.
  • the server typically contains computer 222 which hosts Internet search engine 225, typified by, e.g., the ALTA VISTA search engine (ALTA VISTA is a registered trademark of Digital Equipment Corporation of Maynard,
  • Each such record typically contains: (a) a web address (commonly referred to as a uniform resource locator -- URL) at which a corresponding document can be accessed by a web browser, (b) predefined content words which appear in that document, along with, in certain engines, a relative address of each such word relative to other content words in that document; (c) a short summary, often just a few lines, of the document or a first few lines of the document; and possibly (d) a description of the document as provided in its hypertext markup language (HTML) description field.
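The record layout listed in items (a) through (d) could be modeled as a small data structure. This is a sketch only: the class name, field names, position encoding and sample values are assumptions and do not reflect the storage format of any actual search engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DocumentRecord:
    """Hypothetical in-memory form of one stored document record."""
    url: str                                   # (a) web address of the document
    # (b) content words mapped to their relative positions in the document
    content_words: Dict[str, List[int]] = field(default_factory=dict)
    summary: str = ""                          # (c) short summary or first lines
    description: Optional[str] = None          # (d) HTML description, if any


# Illustrative values only.
record = DocumentRecord(
    url="http://example.com/octopus.html",
    content_words={"octopus": [0], "hearts": [3]},
    summary="The octopus has three hearts.",
)

print(record.url, sorted(record.content_words))
```

Making `description` optional mirrors the "possibly" in item (d): a record may lack that field without being malformed.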
  • a user stationed at computer system 300 establishes an Internet connection, through, e.g., an associated web browser (such as based on the "Internet Explorer" version 4.0 browser available from the Microsoft Corporation and appropriately modified to include our inventive teachings) executing at this system, to server 220 and particularly to search engine 225 executing thereat. Thereafter, the user enters a query, here symbolized by line 201, to the browser which, in turn, sends the query, via system 300 and through the Internet connection to server 220, to search engine 225. The search engine then processes the query against document records stored within dataset 227 to yield a set of retrieved records, for documents, that the engine determines are relevant to the query.
  • Inasmuch as the manner through which engine 225 actually indexes documents to form document records for storage in data store 227 and the actual analysis which the engine undertakes to select any such stored document record are both irrelevant to the present invention, we will not discuss either of these aspects in any further detail. Suffice it to say that, in response to the query, engine 225 returns a set of retrieved document records, via the Internet connection, back to web browser 420. Browser 420, simultaneously while engine 225 is retrieving documents and/or subsequent thereto, analyzes the query to yield its corresponding set of logical form triples.
  • Once the search engine completes its search and has retrieved a set of document records and has supplied that set to the browser, the corresponding documents (i.e., to form an output document set) are themselves accessed by the browser from associated web servers (the datasets associated therewith collectively forming a "repository" of stored documents; such a repository can also be a stand-alone dataset as well, such as in, e.g., a self-contained CD-ROM based data retrieval application).
  • the browser analyzes each of the accessed documents (i.e., in the output document set) to form a corresponding set of logical form triples for each such document.
  • browser 420, based on matching logical form triples between the query and the retrieved documents, scores each document having such a match and presents the user with those documents, as symbolized by line 203, ranked in terms of descending score, typically in a group of a predefined small number of documents having the highest rankings, then followed, if the user so selects through the browser, by the next such group and so forth until the user has examined a sufficient number of the documents so presented.
  • Though FIG. 2 depicts our invention as illustratively utilizing a network connection to obtain document records and documents from a remote server, our invention is not so limited. As will be discussed in detail below, in conjunction with FIG.
  • FIG. 3 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented.
  • the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer.
  • program modules include routine programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote memory storage devices.
  • an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 320, including processing unit 321 (which may include one or more processors), a system memory 322, and a system bus 323 that couples various system components including the system memory to the processing unit 321.
  • the system bus 323 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • the system memory includes read only memory (ROM) 324 and random access memory (RAM) 325.
  • a basic input/output system (BIOS) 326, containing the basic routine that helps to transfer information between elements within the personal computer 320, such as during start-up, is stored in ROM 324.
  • the personal computer 320 further includes a hard disk drive 327 for reading from and writing to a hard disk (not shown), a magnetic disk drive 328 for reading from or writing to removable magnetic disk 329, and an optical disk drive 330 for reading from or writing to a removable optical disk 331 such as a CD ROM or other optical media.
  • the hard disk drive 327, magnetic disk drive 328, and optical disk drive 330 are connected to the system bus 323 by a hard disk drive interface 332, magnetic disk drive interface 333, and an optical drive interface 334, respectively.
  • the drives and the associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 320.
  • Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 329 and a removable optical disk 331, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memory (ROM), and the like, may also be used in the exemplary operating environment.
  • a number of program modules may be stored on the hard disk, magnetic disk 329, optical disk 331, ROM 324 or RAM 325, including an operating system 335, one or more application programs 336, other program modules 337, and program data 338.
  • a user may enter commands and information into the personal computer 320 through input devices such as a keyboard 340 and pointing device 342.
  • Other input devices may include a microphone, joystick, game pad, satellite dish, scanner, or the like.
  • These and other input devices are often connected to the processing unit 321 through a serial port interface 346 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB).
  • a monitor 347 or other type of display device is also connected to the system bus 323 via an interface, such as a video adapter 348.
  • In addition to the monitor 347, personal computers may typically include other peripheral output devices (not shown), such as speakers and printers.
  • the personal computer 320 may operate in a networked environment using logic connections to one or more remote computers, such as a remote computer
  • the remote computer 349 may be another personal computer, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to the personal computer 320, although only a memory storage device 350 has been illustrated in FIG. 1.
  • the logic connections depicted in FIG. 1 include a local area network (LAN) 351 and a wide area network (WAN) 352.
  • When used in a LAN networking environment, the personal computer 320 is connected to the local area network 351 through a network interface or adapter 353. When used in a WAN networking environment, the personal computer 320 typically includes a modem 354 or other means for establishing communications over the wide area network 352, such as the Internet.
  • the modem 354, which may be internal or external, is connected to the system bus 323 via the serial port interface 346.
  • program modules depicted relative to the personal computer 320, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • FIG. 4 depicts a very high-level block diagram of application programs 400 that execute within computer 300 shown in FIG. 3.
  • These programs include, as shown in FIG. 4, web browser 420 which, for implementing our present invention, comprises retrieval process 600 (which will be discussed below in detail in conjunction with FIGS. 6A and 6B).
  • a user-selected statistical search engine such as the ALTA VISTA search engine
  • process 600 forwards, as symbolized by line 426, the query through the web browser to the search engine.
  • process 600 also internally analyzes the query to produce its corresponding logical form triples which are then locally stored within computer 300.
  • the search engine supplies, as symbolized by line 432, process 600 with a set of statistically retrieved document records.
  • Each of these records includes, as noted above, a web address, specifically a URL, at which that document can be accessed and appropriate command (s) required by a remote web server, at which that document resides, sufficient to download, over the Internet, a computer file containing that document.
  • process 600 receives all the records, this process then sends, via web browser 420 and as symbolized by line 436, the appropriate commands to access and download all the documents specified by the records (i.e., to form the output document set) . These documents are then accessed, in seriatim, from their corresponding web servers and downloaded to web browser 420 and specifically process 600, as symbolized by line 442. Once these documents are downloaded, process 600 analyzes each such document to produce and locally store the corresponding logical form triples therefor.
  • process 600 scores each document that contains at least one matching logical form triple, then ranks these particular documents based on their scores, and finally instructs web browser 400 to present these particular documents, as symbolized by line 446, in ranked order by descending document score on a group-by-group basis to the user.
  • Browser 400 generates a suitable selection button, on a screen of display 380 (see FIG. 3), through which the user can select, by appropriately "clicking" thereon with his or her mouse, to display each successive group of documents, as desired.
  • a logical form is a directed graph in which words representing text of any arbitrary size are linked by labeled relations.
  • a logical form portrays semantic relationships between important words in a phrase, which may include hypernyms and/or synonyms thereof.
  • a logical form can take on any one of a number of different forms, e.g. a logical form graph or any sub-graph thereof such as, for example, a list of logical form triples, each of the triples being illustratively of a form "word-relation-word".
  • FIG. 5A depicts logical form graph 515 and logical form triples 525 for illustrative input string 510, specifically the sentence "The octopus has three hearts."
  • To produce logical form triples for an illustrative input string, e.g. for input string 510, that string is first parsed into its constituent words. Thereafter, using a predefined record (not to be confused with document records employed by a search engine), in a stored lexicon, for each such word, the corresponding records for these constituent words, through predefined grammatical rules, are themselves combined into larger structures or analyses which are then, in turn, combined, again through predefined grammatical rules, to form even larger structures, such as a syntactic parse tree.
  • a logical form graph is then built from the parse tree. Whether a particular rule will be applicable to a particular set of constituents is governed, in part, by presence or absence of certain corresponding attributes and their values in the word records. The logical form graph is then converted into a series of logical form triples.
  • our invention uses such a lexicon having approximately 165,000 head word entries.
  • This lexicon includes various classes of words, such as, e.g., prepositions, conjunctions, verbs, nouns, operators and quantifiers that define syntactic and semantic properties inherent in the words in an input string so that a parse tree can be constructed therefor.
  • a logical form (or, for that matter, any other representation, such as logical form triples or logical form graph within a logical form, capable of portraying a semantic relationship) can be precomputed, while a corresponding document is being indexed, and stored, within, e.g., a record for that document, for subsequent access and use rather than being computed later once that document has been retrieved.
  • precomputation and storage as occurs in another embodiment of our invention discussed in detail below in conjunction with FIGS. 10-13B, drastically and advantageously reduces the amount of natural language processing, and hence execution time associated therewith, required to handle any retrieved document in accordance with our invention.
  • an input string such as sentence 510 shown in FIG. 5A
  • stem forms are used in order to normalize differing word forms, e.g., verb tense and singular-plural noun variations, to a common morphological form for use by a parser.
  • the input string is syntactically analyzed by the parser, using the grammatical rules and attributes in the records of the constituent words, to yield the syntactic parse tree therefor.
  • This tree depicts the structure of the input string, specifically each word or phrase, e.g. noun phrase "The octopus", in the input string, a category of its corresponding grammatical function, e.g., NP for noun phrase, and link(s) to each syntactically related word or phrase therein.
  • For illustrative sentence 510, its associated syntactic parse tree would be:
  • Sentence types include "DECL" (as here) for a declarative sentence, "IMPR" for an imperative sentence and "QUES" for a question. Displayed vertically to the right and below the start node is a first level analysis. This analysis has a head node indicated by an asterisk, typically a main verb (here the word "has"), a premodifier (here the noun phrase "The octopus"), followed by a postmodifier (the noun phrase "three hearts"). Each leaf of the tree contains a lexical term or a punctuation mark.
  • NP designates a noun phrase
  • CHAR denotes a punctuation mark.
  • the syntactic parse tree is then further processed using a different set of rules to yield a logical form graph, such as graph 515 for input string 510.
  • the process of producing a logical form graph involves extracting underlying structure from syntactic analysis of the input string; the logical form graph includes those words that are defined as having a semantic relationship therebetween and the functional nature of the relationship.
  • the "deep" cases or functional roles used to categorize different semantic relationships include:
  • each node in the syntactic parse tree for that string is examined.
  • other semantic roles are used, e.g. as follows :
  • the results of such analysis for input string 510 is logical form graph 515.
  • Those words in the input string that exhibit a semantic relationship therebetween (such as, e.g., "Octopus" and "Have") are shown linked to each other with the relationship therebetween being specified as a linking attribute (e.g. Dsub).
  • This graph typified by graph 515 for input string 510, captures the structure of arguments and adjuncts for each input string.
  • logical form analysis maps function words, such as prepositions and articles, into features or structural relationships depicted in the graph.
  • Logical form analysis also, in one embodiment, resolves anaphora, i.e., defining a correct antecedent relationship between, e.g., a pronoun and a co-referential noun phrase; and detects and depicts proper functional relationships for ellipsis. Additional processing may well occur during logical form analysis in an attempt to cope with ambiguity and/or other linguistic idiosyncrasies. Corresponding logical form triples are then simply read in a conventional manner from the logical form graph and stored as a set. Each triple contains two node words as depicted in the graph linked by a semantic relationship therebetween. For illustrative input string 510, logical form triples 525 result from processing graph 515. Here, logical form triples 525 contain three individual triples that collectively convey the semantic information inherent in input string 510.
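As a minimal sketch of reading triples from a logical form graph, the Python below stores a graph as labeled directed edges and emits one "word-relation-word" triple per edge. The edge list for input string 510 is an assumption made for illustration (the text names only the Dsub link between "Octopus" and "Have"), and the names `Triple`, `LogicalFormGraph` and `read_triples` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Triple(NamedTuple):
    """One logical form triple: two node words joined by a relation."""
    word1: str
    relation: str
    word2: str


@dataclass
class LogicalFormGraph:
    """Directed graph whose edges carry relation labels (hypothetical layout)."""
    edges: list = field(default_factory=list)

    def link(self, word1, relation, word2):
        self.edges.append((word1, relation, word2))


def read_triples(graph):
    """Read one triple per labeled edge and return the triples as a set."""
    return {Triple(w1.upper(), rel, w2.upper()) for w1, rel, w2 in graph.edges}


# Assumed edges for "The octopus has three hearts."
graph = LogicalFormGraph()
graph.link("have", "Dsub", "octopus")
graph.link("have", "Dobj", "heart")
graph.link("heart", "Ops", "three")

triples = read_triples(graph)
```

Storing the result as a set mirrors the statement that the triples are "stored as a set"; a real analysis would derive the edges from the syntactic parse tree rather than from hand-written calls.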
  • A flowchart of our invention utilized in retrieval process 600, as used in the specific embodiment of our invention shown in FIGS. 2, 3 and 4, is collectively depicted in FIGS. 6A and 6B; for which the correct alignment of the drawing sheets for these figures is shown in FIG. 6.
  • the remaining operations shown in these figures are performed by a computer system, e.g. client PC 300 (see FIGS. 2 and 3), and specifically within web browser 420.
  • To simplify understanding, the reader should simultaneously refer to FIGS. 2, 3 and 6A-6B throughout the following discussion.
  • execution proceeds first to block 605.
  • This block, when executed, prompts a user to enter a full-text (literal) query through web browser 420.
  • the query can be in the form of a single question (e.g. "Are there any air-conditioned hotels in Bali?") or a single sentence (e.g. "Give me contact information for all fireworks held in Seattle during the month of July.") or a sentence fragment (e.g. "Clothes in Ecuador").
  • execution splits and proceeds, via path 607, to block 610 and, via path 643, to block 645.
  • Block 645, when performed, invokes NLP routine 700 to analyze the query and construct and locally store its corresponding set of logical form triples.
  • Block 610, when performed, transmits, as symbolized by dashed line 615, the full-text query from web browser 420, through an Internet connection, to a remote search engine, such as engine 225 situated on server 220.
  • block 625 is performed by the search engine to retrieve a set of document records in response to the query.
  • the set is transmitted, as symbolized by dashed line 630, by the remote server back to computer system 300 and specifically to web browser 420 executing thereat.
  • block 635 is performed to receive the set of records, and then for each record: extract a URL from that record, access a web site at that URL and download therefrom an associated file containing a document corresponding to that record.
  • block 640 is performed. For each such document, this block first extracts all the text from that document, including any text situated within HTML tags associated with that document. Thereafter, to facilitate natural language processing which operates on a single sentence at a time, the text for each document is broken into a text file, through a conventional sentence breaker, in which each sentence (or question) occupies a separate line in the file. Thereafter, block 640 repeatedly invokes NLP routine 700 (which will be discussed in detail below in conjunction with FIG. 7), for each line of text in that document, to analyze each of these documents and construct and locally store a corresponding set of logical form triples for each line of text in that document.
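The per-document loop of block 640 can be sketched as follows. This is only an illustration under stated assumptions: `break_sentences` is a naive regular-expression stand-in for the "conventional sentence breaker", and `fake_nlp_routine` is a placeholder that returns lowercased words rather than real logical form triples. Both names are hypothetical.

```python
import re


def break_sentences(text):
    """Naive stand-in for a sentence breaker: split after '.', '?' or '!'."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]


def analyze_document(text, nlp_routine):
    """Invoke the NLP routine once per sentence line and keep results per line."""
    return {line: nlp_routine(line) for line in break_sentences(text)}


def fake_nlp_routine(line):
    # Placeholder only: a real routine would return logical form triples.
    return re.findall(r"[a-z]+", line.lower())


document_text = "The octopus has three hearts. Does it live in the sea? Yes!"
lines = break_sentences(document_text)
results = analyze_document(document_text, fake_nlp_routine)
```

A production sentence breaker would also need to handle abbreviations, decimal numbers and quoted speech, which this sketch ignores.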
  • Although the operations in block 645 have been discussed as being performed essentially in parallel with those in blocks 610, 635 and 640, the operations in the former block, based on actual implementation considerations, could be performed serially either before or after the operations in blocks 610, 635 and 640.
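The choice between parallel and serial execution can be illustrated with Python's standard `concurrent.futures` module. The two worker functions below are hypothetical placeholders for block 645 and for blocks 610, 635 and 640; their return values are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_query(query):
    # Placeholder for block 645: would return the query's logical form triples.
    return {("HAVE", "Dsub", "OCTOPUS")}


def search_and_analyze_documents(query):
    # Placeholder for blocks 610, 635 and 640: search, download and analyze.
    return {"doc-2": {("HAVE", "Dsub", "OCTOPUS")}}


def run_parallel(query):
    """Run both branches concurrently and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        query_future = pool.submit(analyze_query, query)
        docs_future = pool.submit(search_and_analyze_documents, query)
        return query_future.result(), docs_future.result()


def run_serial(query):
    """Run the same two branches one after the other."""
    return analyze_query(query), search_and_analyze_documents(query)
```

Because neither branch depends on the other's output, both orderings yield the same pair of results; only elapsed time differs.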
  • the logical form triples for each document can be precomputed and stored for subsequent access and use during document retrieval, in which case, these triples would simply be accessed rather than computed during document retrieval.
  • the triples may have been stored, in some manner, as properties of that stored document or as, e.g., a separate entry in either the record for that document or in the dataset containing that document.
  • block 650 compares each of the logical form triples in the query against each of the logical form triples for each of the retrieved documents to locate a match between any triple in the query and any triple in any of the documents.
  • An illustrative form of matching is defined as an identical match between two triples both in terms of the node words as well as in the relation type in these triples.
  • a match only occurs if the node words word1a and word1b are identical to each other, node words word2a and word2b are identical to each other, and relation1 and relation2 are the same. Unless all three elements of one triple identically match corresponding elements of another triple, these two triples do not match.
  • block 655 is performed to discard all retrieved documents that do not exhibit a matching triple, i.e., having no triple that matches any triple in the query. Thereafter, block 660 is performed.
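The identical-match test and the discard step of block 655 can be sketched as below. Triples are modeled as `(word1, relation, word2)` tuples; the function names and the sample documents are hypothetical and chosen only to exercise the logic.

```python
def triples_match(a, b):
    """Identical match: both node words and the relation type must agree."""
    return a[0] == b[0] and a[1] == b[1] and a[2] == b[2]


def matching_triples(query_triples, doc_triples):
    """Return the document triples that match any triple in the query."""
    return [d for d in doc_triples
            if any(triples_match(q, d) for q in query_triples)]


def discard_non_matching(query_triples, docs):
    """Keep only documents exhibiting at least one matching triple."""
    kept = {}
    for name, doc_triples in docs.items():
        matches = matching_triples(query_triples, doc_triples)
        if matches:
            kept[name] = matches
    return kept


# Invented sample data for illustration.
query = [("HAVE", "Dsub", "OCTOPUS"), ("HAVE", "Dobj", "HEART")]
docs = {
    "recipe": [("CONTAIN", "Dobj", "HEART")],
    "article": [("HAVE", "Dsub", "OCTOPUS"), ("LIVE", "Dsub", "OCTOPUS")],
}
kept = discard_non_matching(query, docs)
```

Here the recipe is dropped because sharing the node word HEART is not enough: all three elements of a triple must agree.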
  • each different type of relation that can arise in a logical form triple is assigned a corresponding weight, such as those shown in table 800 in FIG. 8A.
  • illustrative relations Dobj, Dsub, Ops and Nadj may be assigned predetermined static numeric weights of 100, 75, 10 and 10, respectively.
  • the weight reflects a relative importance ascribed to that relation in indicating a correct semantic match between a query and a document.
  • the actual numeric values of these weights are generally defined on an empirical basis. As described in detail in conjunction with FIG.
  • block 665 is performed to rank order the documents in order of descending score.
  • block 670 is performed to display the documents in rank order, typically in terms of a small predefined group of documents, typically five or ten, that exhibit the highest scores.
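Presenting ranked documents in small groups amounts to slicing the ranked list. A minimal sketch follows, assuming a group size of five (one of the typical sizes mentioned); `groups_of` and the document names are hypothetical.

```python
def groups_of(ranked_docs, group_size=5):
    """Yield successive groups of ranked documents, highest scores first."""
    for start in range(0, len(ranked_docs), group_size):
        yield ranked_docs[start:start + group_size]


# Twelve invented document names, already in descending score order.
ranked = [f"doc{i}" for i in range(1, 13)]
groups = list(groups_of(ranked, group_size=5))
```

Each request for the next group simply advances to the next slice, so the final group may hold fewer documents than the others.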
  • FIG. 7 depicts a flowchart of NLP routine 700.
  • This routine, given a single line of input text (whether it be a query, a sentence in a document, or a text fragment), constructs the corresponding logical form triples therefor.
  • block 710 is first executed to process a line of input text to yield a logical form graph, such as illustrative graph 515 shown in FIG. 5A.
  • This processing illustratively includes morphological and syntactic processing to yield a syntactic parse tree from which a logical form graph is then computed. Thereafter, as shown in FIG. 7, block 720 is performed to extract (read) a set of corresponding logical form triples from the graph. Once this occurs, block 730 is executed to generate each such logical form triple as a separate and distinct formatted text string. Finally, block 740 is executed to store, in a dataset (or database), the line of input text and, as a series of formatted text strings, the set of logical form triples for that line. Once this set has been completely stored, execution exits from block 700. Alternatively, if in lieu of logical form triples, a different representation, e.g.
  • blocks 720 and 730 would be readily modified to generate that particular form as the formatted string, with block 740 storing that form in lieu of logical form triples into the dataset.
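Blocks 730 and 740 can be sketched as a pair of helpers that format each triple as a text string and store the strings alongside the input line. The "WORD-relation-WORD" layout follows the triples quoted elsewhere in this description; the dictionary used as the dataset, the function names, and the assumption that node words contain no hyphen are all illustrative.

```python
def format_triple(word1, relation, word2):
    """Render one triple as a formatted text string."""
    return f"{word1.upper()}-{relation}-{word2.upper()}"


def parse_triple(text):
    """Inverse of format_triple; assumes node words contain no hyphen."""
    word1, relation, word2 = text.split("-")
    return word1, relation, word2


def store_line(dataset, line, triples):
    """Store the input line together with its formatted triple strings."""
    dataset[line] = [format_triple(*triple) for triple in triples]
    return dataset[line]


dataset = {}
stored = store_line(
    dataset,
    "The octopus has three hearts.",
    [("have", "Dsub", "octopus"), ("have", "Dobj", "heart")],
)
```

A different representation could be stored by swapping `format_triple` for another formatter, leaving `store_line` unchanged.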
  • FIG. 8B graphically depicts logical form triple comparison; document scoring, ranking and selection processes, in accordance with our inventive teachings, that occur within blocks 650, 660, 665 and 670, all shown in FIGS. 6A and 6B, for an illustrative query and an illustrative set of three retrieved documents.
  • a user supplied full-text query 810 to our inventive retrieval system, with the query being "How many hearts does an octopus have?".
  • three documents 820 were ultimately retrieved.
  • Document 1 is a recipe containing artichoke hearts and octopus.
  • a second document (denoted Document 2) is an article about octopi.
  • a third document (denoted Document 3) is an article about deer.
  • NLP natural language processing
  • the logical form triples for the query are compared, in seriatim, against the logical form triples for Document 1, Document 2 and Document 3, respectively, to ascertain whether any document contains any triple that matches any logical form triple in the query. Those documents that contain no such matching triples, as in the case of Document 1, are discarded and hence considered no further. Document 2 and Document 3, on the other hand, contain matching triples.
  • Document 2 contains three such triples: "HAVE-Dsub-OCTOPUS" and "HAVE-Dobj-HEART", illustratively associated with one sentence, and "HAVE-Dsub-OCTOPUS", associated illustratively with another sentence (these sentences not specifically shown).
  • Of these triples, two are identical, i.e., "HAVE-Dsub-OCTOPUS".
  • a score for a document is illustratively a numeric sum of the weights of all uniquely matching triples in that document. All duplicate matching triples for any document are ignored.
  • An illustrative ranking of the relative weightings of the different types of relations that can occur in a triple, in descending order from their largest to smallest weightings, is: first, verb-object combinations (Dobj); then verb-subject combinations (Dsub); then prepositions and operators (e.g. Ops); and finally modifiers (e.g. Nadj).
  • Such a weighting scheme is given in illustrative triple weighting table 800 shown in FIG. 8A. To simplify this figure, table 800 does not include all the different relations that can arise in a logical form triple, but rather just those pertinent for the triples shown in FIG. 8B. With this metric, the particular triples in each document that contribute to its score are indicated by a check ("✓") mark.
  • Other predefined metrics for scoring documents may be used than those we have chosen, such as, e.g., multiplying rather than adding weights in order to provide enhanced document selectivity (discrimination), or summing the weights in a different predefined fashion, such as including multiple matches of the same type and/or excluding the weights of other triples than those noted above.
  • the score may also take into account, in some fashion: the node words in the triples themselves in that document, or the frequency or semantic content of these node words in that document; the frequency or semantic content of specific node words in that document; or the frequency of specific logical forms (or paraphrases thereof) and/or of particular logical form triples as a whole in that document; as well as the length of that document.
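To make the contrast between adding and multiplying weights concrete, the sketch below computes both for one set of matching triples. The weights are the illustrative values of 100, 75, 10 and 10 given for Dobj, Dsub, Ops and Nadj; the function names and the sample triples are hypothetical, and returning 0 for a document with no matches is an assumption of this sketch.

```python
from math import prod

# Illustrative static weights quoted in the text for four relation types.
WEIGHTS = {"Dobj": 100, "Dsub": 75, "Ops": 10, "Nadj": 10}


def unique_weights(matching_triples):
    """Weights of the distinct matching triples (duplicates are ignored)."""
    return [WEIGHTS[relation] for _, relation, _ in set(matching_triples)]


def additive_score(matching_triples):
    """Sum of the weights of all uniquely matching triples."""
    return sum(unique_weights(matching_triples))


def multiplicative_score(matching_triples):
    """Product of the same weights; 0 when nothing matches (an assumption)."""
    weights = unique_weights(matching_triples)
    return prod(weights) if weights else 0


matches = [("HAVE", "Dobj", "HEART"), ("HAVE", "Dsub", "OCTOPUS")]
```

With these two matches the additive score is 175 while the multiplicative score is 7500, which shows how multiplication widens the gap between documents with several strong matches and documents with one.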
  • the score for Document 2 is 175 and is formed by combining the weights, i.e., 100 and 75, for the first two triples associated with the first sentence in the document and indicated in block 850.
  • the third triple in this document and associated with the second sentence thereof, and listed in this block, which already matches one of the other triples existing in the document, is ignored.
  • the score for Document 3 is 100 and is formed of the weight, here 100, for the sole matching triple, as listed in block 860, in this particular document.
  • Document 2 is ranked ahead of Document 3 with these documents being presented to the user in that order. In the event, which has not occurred here, that any two documents have the same score, then those documents are ranked in the same order provided by the conventional statistical search engine and are presented to the user in that order.
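The worked example can be reproduced with a short sketch. It assumes that the heart triple carries the Dobj relation, which is what the stated weights of 100 and 75 imply, and that Document 3's sole matching triple is likewise a Dobj triple; the function names and the engine ordering shown are hypothetical.

```python
# Illustrative static weights quoted in the text.
WEIGHTS = {"Dobj": 100, "Dsub": 75, "Ops": 10, "Nadj": 10}


def score(matching_triples):
    """Sum the weights of uniquely matching triples; duplicates are ignored."""
    return sum(WEIGHTS[relation] for _, relation, _ in set(matching_triples))


def rank(docs_in_engine_order):
    """Rank by descending score; sorted() is stable, so tied documents
    keep the order in which the search engine returned them."""
    scored = [(name, score(matches)) for name, matches in docs_in_engine_order]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Matching triples per document, listed in an assumed search engine order.
retrieved = [
    ("Document 3", [("HAVE", "Dobj", "HEART")]),
    ("Document 2", [("HAVE", "Dsub", "OCTOPUS"),
                    ("HAVE", "Dobj", "HEART"),
                    ("HAVE", "Dsub", "OCTOPUS")]),
]
ranking = rank(retrieved)
```

Relying on a stable sort is one simple way to honor the tie rule, since no secondary key is needed to preserve the search engine's order.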
  • FIGS. 9A-9C respectively depict three different embodiments of information retrieval systems that incorporate the teachings of our present invention.
  • One such alternate embodiment is shown in FIG. 9A wherein all the processing resides in single local computer 910, such as a PC.
  • computer 910 hosts a search engine and, through that engine, indexes input documents and searches a dataset (either locally situated thereat, such as on a CD-ROM or other storage medium, or accessible to that computer), in response to a user-supplied full-text query, to ultimately yield a set of retrieved documents that form an output document set.
  • This computer also hosts our inventive processing to: analyze both the query and each such document to produce its corresponding set of logical form triples; then compare the sets of triples and select, score and rank the documents in the fashion discussed above, and finally present the results to a local user, e.g., stationed thereat or accessible thereto.
  • Another alternate embodiment is shown in FIG. 9B, which encompasses the specific context shown in FIG. 2, wherein the retrieval system is formed of a client PC networked to a remote server.
  • client PC 920 is connected, via network connection 925, to remote computer (server) 930.
  • a user stationed at client PC 920 enters a full-text query which the PC, in turn, transmits over the network connection to the remote server.
  • the client PC also analyzes the query to produce its corresponding set of logical form triples.
  • the server hosts, e.g., a conventional statistical search engine and consequently, in response to the query, undertakes statistical retrieval to yield a set of document records.
  • the server then returns the set of records and ultimately, either on instruction of the client or autonomously based on the capabilities of the search engine or associated software, returns each document in an output document set to the client PC.
  • the client PC analyzes each of the corresponding documents, in the output document set, it receives to produce a set of logical form triples therefor.
  • the client PC then completes its processing by appropriately comparing the sets of triples and selecting, scoring and ranking the documents in the fashion discussed above, and finally presenting the results to the local user.
  • A further embodiment is shown in FIG. 9C.
  • client PC 920 accepts a full-text query from a local user and transmits that query onward, via networked connection 925, to remote computer (server) 930.
  • the server, instead of merely hosting a conventional search engine, also provides natural language processing in accordance with our invention.
  • the server, rather than the client PC, would appropriately analyze the query to produce a corresponding set of logical form triples therefor.
  • the server would also download, if necessary, each retrieved document in an output document set and then analyze each such document to produce the corresponding sets of logical form triples therefor.
  • server 930 would transmit the remaining retrieved documents in rank order, via networked connection 925, to client PC 920 for display thereat.
  • the server could transmit these documents either on a group-by-group basis, as instructed by the user in the manner set forth above, or all in seriatim for group-by-group selection thereamong and display at the client PC.
  • remote computer (server) 930 need not be implemented just by a single computer that provides all the conventional retrieval, natural language and associated processing noted above, but can be a distributed processing system as shown in FIG. 9D with the processing undertaken by this server being distributed amongst individual servers therein.
  • server 930 is formed of front-end processor 940 which distributes messages, via connections 950, to a series of servers 960 (containing server 1, server 2, ..., server n).
  • server 1 can be used to index input documents into a dataset on a mass data store for subsequent retrieval.
  • Server 2 can implement a search engine, such as a conventional statistical engine, for retrieving, in response to a user-supplied query routed to it by front-end processor 940, a set of document records from the mass data store. These records would be routed, from server 2, via front-end processor 940, to, e.g., server n for subsequent processing, such as downloading each corresponding document, in an output document set, from a corresponding web site or database. Front-end processor 940 would also route the query to server n.
  • Server n would then appropriately analyze the query and each document to produce the corresponding sets of logical form triples and then appropriately compare the sets of triples and select, score and rank the documents in the fashion discussed above and return ranked documents, via front-end processor 940, to client PC 920 for ranked display thereat.
  • the various operations used in our inventive processing could be distributed across servers 960 in any one of many other ways, whether static or dynamic, depending upon run-time and/or other conditions occurring thereat.
  • server 930 could be implemented by illustratively a well-known sysplex configuration with a shared direct access storage device (DASD) accessible by all processors therein (or other similar distributed multi-processing environment) with, e.g., the database for the conventional search engine and the lexicon used for natural language processing both stored thereon.
  • these triples could alternatively be generated while the document is being indexed by a search engine.
  • the search engine could download a complete file for that document and then either immediately thereafter or later, via a batch process, preprocess the document by analyzing that document and producing its logical form triples. To complete the preprocessing, the search engine would then store these triples, as part of an indexed record for that document, in its database.
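Index-time preprocessing can be sketched as storing the triples inside each document's indexed record so that retrieval only has to look them up. The record layout, the in-memory dictionary standing in for the search engine's database, the example URL and the placeholder generator are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedRecord:
    """Hypothetical indexed record holding a URL and precomputed triples."""
    url: str
    triples: list = field(default_factory=list)


def index_document(database, url, text, triple_generator):
    """Preprocess a document once and store its triples with its record."""
    record = IndexedRecord(url=url, triples=sorted(set(triple_generator(text))))
    database[url] = record
    return record


def triples_for(database, url):
    """At retrieval time the stored triples are read, not recomputed."""
    return database[url].triples


def placeholder_generator(text):
    # Stand-in for real natural language processing of the document text.
    return [("HAVE", "Dsub", "OCTOPUS"), ("HAVE", "Dsub", "OCTOPUS")]


database = {}
index_document(database, "http://example.com/octopus",
               "The octopus has three hearts.", placeholder_generator)
```

The same function could be called immediately after download or from a later batch job, since it depends only on the document text being available.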
  • our invention is equally applicable to use with: (a) any network accessible search engine, whether it be intranet-based or not, accessible through a dedicated network facility or otherwise; (b) a localized search engine operative with its own stored dataset, such as a CD-ROM based data retrieval application typified by an encyclopedia, almanac or other self-contained stand-alone dataset; and/or (c) any combination thereof.
  • the present invention can be used in any other suitable application as well.
  • FIGS. 10A and 10B collectively depict yet another embodiment of our present invention which generates logical form triples through document preprocessing with the resulting triples, document records and documents themselves being collectively stored, as a self-contained stand-alone dataset, on common storage media, such as one or more CD-ROMs or other transportable mass media (typified by removable hard disk, tape, or magneto-optical or large capacity magnetic or electronic storage devices), for ready distribution to end-users.
  • By collectively placing, on common media, the retrieval application itself and the accompanying dataset which is to be searched, a stand-alone data retrieval application results, hence eliminating a need for a network connection to a remote server to retrieve documents.
  • this embodiment is comprised of essentially three components: document indexing component 1005₁, duplication component 1005₂ and user component 1005₃.
  • Component 1005₁ gathers documents for indexing into a dataset, illustratively dataset 1030, that, in turn, will form the document repository for a self-contained document retrieval application, such as, e.g., an encyclopedia, almanac, specialized library (such as a decisional law reporter), journal collection or the like.
  • incoming documents to be indexed into the dataset are gathered from any number of a wide variety of sources and applied, in seriatim, to computer 1010.
  • This computer implements, through appropriate software stored within memory 1015, a document indexing engine which establishes a record within dataset 1030 for each such document and stores information into that record for the document, and also establishes an appropriate stored entry, in the dataset, containing a copy of the document itself.
  • Engine 1015 executes triple generation process 1100. This process, to be described in detail below in conjunction with FIG. 11, is separately executed for each document being indexed. In essence, this process, in essentially the same manner as discussed above for block 640 shown in FIGS.
  • indexing engine 1010 shown in FIGS. 10A and 10B, to index a document, including generating an appropriate record therefor, are all irrelevant to the present invention, we will not address them in any detail. Suffice it to say, that once the set of triples is generated through process 1100, engine 1015 stores this set onto dataset 1030 along with a copy of the document itself and the document record created therefor. Hence, dataset 1030, at the conclusion of all indexing operations, not only stores a complete copy of every document indexed therein and a record therefor, but also stores a set of logical form triples for that document.
  • dataset 1030, being viewed as a "Master Dataset", is itself then duplicated through duplication component 1005₂.
  • conventional media duplication system 1040 repetitively writes a copy of the contents of the master dataset, as supplied over line 1035, along with a copy of appropriate files for the retrieval software including a retrieval process and a user installation program, as supplied over line 1043, onto common storage media, such as one or more CD-ROMs, to collectively form the stand-alone document retrieval application.
  • a series 1050 of media replicas is produced having individual replicas 1050₁, 1050₂, ..., 1050ₙ.
  • replicas are identical and contain, as specifically shown for replica 1050 x , a copy of the document retrieval application files, as supplied over line 1043, and a copy of dataset 1030, as supplied over line 1035.
  • each replica may extend over one or more separate media, e.g. separate CD-ROMs.
  • the replicas are distributed, typically by a purchased license, throughout a user community, as symbolized by dashed line 1055.
  • CD-ROM j (also denoted as CD-ROM 1060)
  • the user can execute the document retrieval application, including our present invention, through computer system 1070 (such as a PC having an architecture substantially the same as, if not identical to, that of client PC 300 shown in FIG. 3), against the dataset stored in CD-ROM j to retrieve desired documents therefrom.
  • the user inserts the CD-ROM into PC 1070 and proceeds to execute the installation program stored on the CD-ROM in order to create and install a copy of the document retrieval application files into memory 1075, usually a predefined directory within a hard disk, of the PC, thereby establishing document retrieval application 1085 on the PC.
  • This application contains search engine 1090 and retrieval process 1200.
  • the user can then search through the dataset on CD-ROM j by providing an appropriate full-text query to the application.
  • the search engine retrieves, from the dataset, a document set including the records for those documents and the stored logical form triples for each such document.
  • the query is also applied to retrieval process 1200.
  • This process, very similar to retrieval process 600 discussed above in conjunction with FIGS. 6A and 6B, analyzes the query and constructs the logical form triples therefor. Thereafter, process 1200, shown in FIGS. 10A and 10B, compares the logical form triples for each of the retrieved documents, specifically the records therefor, in the set against the triples for the query. Based on the occurrence of matching triples therebetween and their weights, process 1200 then scores, in the manner described in detail above, each of the documents that exhibits at least one matching triple, ranks these documents in terms of descending score, and finally visually presents the user with a small group of the document records, typically 5-20 or less, that have the highest rankings.
  • the user, upon reviewing these records, can then instruct the document retrieval application to retrieve and display an entire copy of any of the associated documents that appears to be of interest.
  • the user can then request a next group of document records having the next highest rankings, and so forth until all the retrieved document records have been so reviewed.
  • Although application 1085 initially returns ranked document records in response to a query, this application could alternatively return ranked copies of the documents themselves in response to the query.
  • FIG. 11 depicts Triple Generation process 1100 that is performed by Document Indexing engine 1015 shown in FIGS. 10A and 10B.
  • process 1100 preprocesses a document to be indexed by analyzing the textual phrases in that document and, through so doing, constructing and storing a corresponding set of logical form triples, for that document, within dataset 1030.
  • block 1110 is executed upon entry into process 1100. This block first extracts all the text from that document, including any text situated within HTML tags associated with that document. Thereafter, to facilitate natural language processing which operates on a single sentence at a time, the text for each document is broken into a text file, through a conventional sentence breaker, in which each sentence (or question) occupies a separate line in the file.
  • block 1110 invokes NLP routine 1300 (which will be discussed in detail below in conjunction with FIG. 13A), separately for each line of text in that document, to analyze this document and construct and locally store a corresponding set of logical form triples for that line, and to store the set within dataset 1030. Once these operations have been completed, execution exits from block 1110 and process 1100.
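The preprocessing performed by block 1110 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the regular-expression sentence breaker stands in for the "conventional sentence breaker", only text found between HTML tags is collected, and the names `extract_text`, `break_sentences`, `generate_triples_for_document` and the `nlp_routine` callback are all hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data found between (and outside of) HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html):
    """Return the document text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def break_sentences(text):
    """Naive stand-in for a sentence breaker: split after '.', '?' or '!'."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]


def generate_triples_for_document(html, nlp_routine):
    """Call nlp_routine once per sentence line; return (line, triples) pairs."""
    return [(line, nlp_routine(line)) for line in break_sentences(extract_text(html))]
```

Here `nlp_routine` plays the role of NLP routine 1300 and is supplied by the caller, since the linguistic analysis itself is outside the scope of this sketch.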
  • A flowchart of our inventive retrieval process 1200, as used in the specific embodiment of our invention shown in FIGS. 10A and 10B, is collectively depicted in FIGS. 12A and 12B; the correct alignment of the drawing sheets for these figures is shown in FIG. 12.
  • Retrieval process 600 shown in FIGS. 6A and 6B and discussed in detail above
  • all the operations shown in FIGS. 12A and 12B are performed on a common computer system, here PC 1070 (see FIGS. 10A and 10B).
  • execution proceeds first to block 1205.
  • This block, when executed, prompts a user to enter a full-text query. Once this query is obtained, execution splits and proceeds, via path 1207, to block 1210 and, via path 1243, to block 1245.
  • Block 1245, when performed, invokes NLP routine 1350 to analyze the query and construct and locally store its corresponding set of logical form triples within memory 1075.
  • Block 1210, when performed, transmits, as symbolized by dashed line 1215, the full-text query to search engine 1090.
  • the search engine performs block 1220 to retrieve both a set of document records in response to the query and the logical form triples associated with each such record.
  • Block 1240 merely receives this information from search engine 1090 and stores it within memory 1075 for subsequent use.
  • Although the operations in block 1245 have been discussed as being performed essentially in parallel with those in blocks 1210, 1090 and 1220, the operations in block 1245, based on actual implementation considerations, could be performed serially either before or after the operations in blocks 1210, 1090 or 1220.
  • block 1250 is performed. This block compares, in the manner described in detail above, each of the logical form triples in the query against each of the logical form triples for each of the retrieved document records to locate a match between any triple in the query and any triple in any of the corresponding documents. Once block 1250 completes, block 1255 is performed to discard all retrieved records for documents that do not exhibit a matching triple, i.e., having no triple that matches any triple in the query. Thereafter, block 1260 is performed.
  • In block 1260, all remaining document records are assigned a score, as defined above, based on the relation type(s) of the matching triples, and their weights, that exist for each of the corresponding documents.
  • block 1265 is performed to rank order the records in order of descending score.
  • block 1270 is performed to display the records in rank order, in terms of a small predefined group of document records, typically five or ten, that exhibit the highest scores.
  • the user can, for example, by appropriately "clicking" his (her) mouse on a corresponding button displayed by computer system 1070, have that system display the next group of ranked document records, and so forth until the user has sufficiently examined all the ranked document records (and has accessed and examined any document of interest therein) in succession, at which point process 1200 is completed with execution then exiting therefrom.
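Blocks 1250 through 1270 can be sketched as below, purely as an illustration. The scoring formula is defined earlier in the specification and is not reproduced here, so this sketch assumes a score equal to the sum of per-relation weights over the distinct matching triples; the weight values, `RELATION_WEIGHTS`, `DEFAULT_WEIGHT`, `rank_records` and `groups` are hypothetical.

```python
from itertools import islice

# Hypothetical weights per relation type; the real values are defined elsewhere.
RELATION_WEIGHTS = {"Dsub": 100, "Dobj": 100}
DEFAULT_WEIGHT = 10


def rank_records(query_triples, records):
    """Score each record and return (score, record_id) pairs, best first.

    records maps a record id to its list of (word, relation, word) triples.
    """
    query_set = set(query_triples)
    ranked = []
    for record_id, triples in records.items():
        matches = query_set & set(triples)          # block 1250: compare triples
        if not matches:                              # block 1255: discard non-matching
            continue
        score = sum(RELATION_WEIGHTS.get(rel, DEFAULT_WEIGHT) for _, rel, _ in matches)
        ranked.append((score, record_id))            # block 1260: assign a score
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))  # block 1265: descending score
    return ranked


def groups(ranked, size=5):
    """Yield successive groups of ranked records for display (block 1270)."""
    iterator = iter(ranked)
    while True:
        group = list(islice(iterator, size))
        if not group:
            return
        yield group
```

Each call to `next()` on the generator returned by `groups` corresponds to the user requesting the next group of ranked records.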
  • FIG. 13A depicts a flowchart of NLP routine 1300 which is executed within Triple Generation process 1100 shown in FIG. 11.
  • NLP routine 1300 analyzes an incoming document to be indexed, specifically a single line of text therefor, and constructs and locally stores a corresponding set of logical form triples for that document within dataset 1030, shown in FIGS. 10A and 10B.
  • Routine 1300 operates in essentially the same fashion as does NLP routine 700 shown in FIG. 7 and discussed in detail above.
  • block 1310 is first executed to process a line of input text to yield a logical form graph, such as illustrative graph 515 shown in FIG. 5A. Thereafter, as shown in FIG.
  • block 1320 is performed to extract (read) a set of corresponding logical form triples from the graph. Once this occurs, block 1330 is executed to generate each such logical form triple as a separate and distinct formatted text string. Finally, block 1340 is executed to store, in dataset 1030, the line of input text and, as a series of formatted text strings, the set of logical form triples for that line. Once this set has been completely stored, execution exits from routine 1300. Alternatively, if in lieu of logical form triples, a different form, e.g.
  • blocks 1320 and 1330 would be readily modified to generate that particular form as the formatted string, with block 1340 storing that form in lieu of logical form triples into the dataset.
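Blocks 1320 through 1340 can be pictured with the short sketch below. It assumes the logical form graph is available as a mapping from each node to its labeled outgoing links, and it formats each triple in the "word; relation; word" notation used for triples elsewhere in this description; the actual stored string format, and the names `extract_triples`, `format_triple` and `store_line`, are assumptions made for illustration.

```python
def extract_triples(graph):
    """Read (head, relation, tail) triples from a graph of labeled links.

    graph maps a node word to a list of (relation_label, target_word) links.
    """
    return [(head, rel, tail) for head, links in graph.items() for rel, tail in links]


def format_triple(triple):
    """Render one triple as a separate, distinct formatted text string."""
    return "; ".join(triple)


def store_line(dataset, line, graph):
    """Append the input line and its formatted triples to the dataset."""
    dataset.append({
        "text": line,
        "triples": [format_triple(t) for t in extract_triples(graph)],
    })
```

Swapping `format_triple` for another formatter is one way to picture storing a different form in lieu of triples.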
  • FIG. 13B depicts a flowchart of NLP routine 1350 which is executed within Retrieval process 1200.
  • NLP routine 1350 analyzes a query supplied by a user to document retrieval application 1085 (shown in FIGS. 10A and 10B) and constructs and locally stores a corresponding set of logical form triples therefor within memory 1075.
  • the only difference in operation between routine 1350 and routine 1300, discussed in detail above in conjunction with FIG. 13A, lies in the location where the corresponding triples are stored, i.e. in dataset 1030 through execution of block 1340 in NLP routine 1300 and in memory 1075 through execution of block 1390 for NLP routine 1350.
  • Inasmuch as the operations performed by the other blocks of routine 1350, specifically blocks 1360, 1370 and 1380, are substantially the same as those performed by blocks 1310, 1320 and 1330, respectively, in routine 1300, we will dispense with discussing the former blocks in any detail.
  • the ALTA VISTA search engine as the search engine in our retrieval system.
  • This engine, which is publicly accessible on the Internet, is a conventional statistical search engine that ostensibly has over 31 million indexed web pages therein and is widely used.
  • a document that was irrelevant to the query, in a language other than English, or could not be retrieved from a corresponding URL provided by the ALTA VISTA engine (i.e., a "cobweb" link).
  • a second human evaluator examined a sub-set of these 3361 documents, specifically those documents that exhibited at least one logical form triple that matched a logical form triple in its corresponding query (431 out of the 3361 documents), and those documents previously ranked as relevant or optimal but which did not have any matching logical form triples (102 out of the 3361 documents). Any disagreements in these rankings for a document were reviewed by a third human evaluator who served as a "tie-breaker".
  • our inventive retrieval system yielded improvements, over that of the raw documents returned by the ALTA VISTA search engine, on the order of approximately 200% in overall precision (i.e., of all documents selected) from approximately 16% to approximately 47%, and approximately 100% of precision within the top five documents from approximately 26% to approximately 51%.
  • use of our inventive system increased the precision of the first document returned as being optimal by approximately 113% from approximately 17% to approximately 35%, over that for the raw documents.
  • our invention can be used to process retrieved documents obtained through substantially any type of search engine in order to improve the precision of that engine.
  • these weights can dynamically vary and, in fact, can be made adaptive.
  • a learning mechanism such as, e.g., a Bayesian or neural network, could be appropriately incorporated into our inventive process to vary the numeric weight for each different logical form triple to an optimal value based upon learned experiences.
  • a paraphrase may be either lexical or structural or can include generating abstract logical forms, as described below.
  • An example of a lexical paraphrase would be either a hypernym or a synonym.
  • a structural paraphrase is exemplified by use of either a noun appositive or a relative clause. For example, noun appositive constructions such as "the president, Bill Clinton" should be viewed as matching relative clause constructions such as "Bill Clinton, who is president".
  • our invention can be readily combined with other processing techniques which center on retrieving non-textual information, e.g. graphics, tables, video or other, to improve overall precision.
  • non-textual content in a document is frequently accompanied in that document by a linguistic (textual) description, such as, e.g., a figure legend or short explanation.
  • our inventive process can be used to analyze and process the linguistic description that often accompanies the non-textual content.
  • Documents could be retrieved using our inventive natural language processing technique first to locate a set of documents that exhibit linguistic content semantically relevant to a query and then processing this set of documents with respect to their non-textual content to locate a document(s) that has relevant textual and non-textual content.
  • document retrieval could occur first with respect to non-textual content to retrieve a set of documents, followed by processing that set of documents, through our inventive technique, with respect to their linguistic content to locate a relevant document(s).
  • FIG. 14 is a simplified functional diagram of an information retrieval system 1480 in accordance with one aspect of the present invention.
  • System 1480 includes retrieval engine 1482, search engine 1484 and statistical data store 1486. It should be noted that the entire system 1480, or part of system 1480, can be implemented in the environment illustrated in FIG. 3.
  • retrieval engine 1482 and search engine 1484 can simply be implemented as computer readable instructions stored in memory 322 which are executed by CPU 321 in order to perform the desired functions.
  • retrieval engine 1482 and search engine 1484 can be provided on any type of computer readable medium, such as those described with respect to FIG. 3.
  • retrieval engine 1482 and search engine 1484 can be provided in a distributed processing environment and carried out in separate processors.
  • statistical data store 1486 can also be stored in the memory components discussed with respect to FIG. 3, it can be stored on a memory located in wide area network 352, or it can be stored in, for example, memory 350 accessible over local area network 351. In another illustrative embodiment, store 1486 can be located in a portion of memory 322 and can be accessed by the operating system in computer 320.
  • a textual input is provided to retrieval engine 1482 through any suitable input mechanism, such as keyboard 340, mouse 342, etc.
  • Retrieval engine 1482 performs a number of functions based on the query.
  • retrieval engine 1482 formulates a Boolean query based on the textual input, and provides the Boolean query to search engine 1484.
  • Search engine 1484, in one illustrative embodiment, is a search engine provided under the commercial designation Alta Vista by Digital Equipment Corporation of Maynard, MA.
  • the Alta Vista search engine is a conventional internet search engine.
  • retrieval engine 1482 is connected to search engine 1484 by an appropriate internet connection.
  • search engine 1484 is a statistical search engine which has access to statistical data store 1486. Such a statistical search engine typically incorporates statistical processing into the search methodologies used to search data store 1486.
  • Data store 1486 may typically contain a data set of document records indexed by search engine 1484. Each such record may, for example, contain a web address at which a corresponding document can be accessed by a web browser, predefined content words which appear in that document, possibly a short summary of the document, and a description of the document as provided in its hypertext markup language (HTML) description fields.
  • statistical data store 1486 may also include data indicative of logical forms computed for the documents indexed therein.
  • the logical forms associated with an index entry correspond to the language originally used in the document indexed.
  • the logical forms are modified to include paraphrase logical forms and to suppress high frequency logical forms.
  • Statistical search engine 1484 typically calculates numeric measures for each document record retrieved from statistical data store 1486.
  • the numerical measure is based on the query provided to search engine 1484.
  • Such numeric measures may include, for example, term frequency * inverse document frequency (tf*idf).
  • search engine 1484 returns to retrieval engine 1482 either the document records identified, or the documents themselves, ranked in order of the statistical measure calculated for each document record.
  • retrieval engine 1482 subjects the returned documents or records to additional natural language processing in order to refine the ranking of the documents or records. The documents or records are then provided to the user, as an output document set, according to the refined ranking.
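As a point of reference for the tf*idf measure mentioned above, one common textbook variant multiplies the raw count of a term in a document by the natural logarithm of the ratio of corpus size to the number of documents containing the term. The description does not say which variant search engine 1484 uses, so the formula and the `tf_idf` helper below are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Return tf * idf for term in doc_tokens, given corpus (a list of token lists).

    tf is the raw count of the term in the document; idf is ln(N / df), where
    N is the number of documents and df the number of documents with the term.
    """
    tf = Counter(doc_tokens)[term]
    df = sum(1 for tokens in corpus if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)
```

A term that appears in every document receives a weight of zero under this variant, which is why such measures favor distinctive terms.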
  • FIG. 15 is a more detailed functional block diagram of search engine 1484, illustrating how statistical data store 1486 is created in accordance with one illustrative embodiment of the present invention.
  • FIG. 15 illustrates documents 1588 stored on any suitable storage device.
  • Such a storage device may be computers in a distributed computing environment, storage accessed by an operating system in computer 320, computers accessible over a wide area network (such as the internet), a library database, or any other suitable location at which documents are stored.
  • the documents 1588 are accessible by search engine
  • Document indexer 1590 accesses documents 1588 and indexes them in a known fashion, generating the records associated with each of the documents accessed.
  • Search engine 1484 also includes a logical form generator 1592 and a logical form modifier 1594.
  • Logical form generator 1592 also accesses the documents and creates logical forms corresponding to each of the documents accessed.
  • Logical form generator 1592 generates logical forms based on input text. Briefly, semantic analysis generates a logical form graph that describes the meaning of the textual input.
  • the logical form graph includes nodes and links wherein the links are labeled to indicate the relationship between a pair of nodes.
  • Logical form graphs represent a more abstract level of analysis than, for example, syntax parse trees, because the analysis normalizes many syntactic or morphological variations.
  • Logical form modifier 1594 receives the logical forms generated by logical form generator 1592 and modifies the logical forms.
  • Modifier 1594 illustratively creates a set of paraphrased logical forms based on the original logical forms and suppresses a predetermined class of logical forms (such as high frequency logical forms) which are not helpful in distinguishing among various documents.
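One way to picture the suppression of high frequency logical forms is to count, across the indexed documents, how many documents contain each logical form and to drop those forms that occur in more than some fraction of them. The threshold, and the names `high_frequency_forms` and `suppress`, are hypothetical; the description does not specify how the suppressed class is determined.

```python
from collections import Counter


def high_frequency_forms(docs_forms, max_fraction=0.5):
    """Return the logical forms found in more than max_fraction of documents.

    docs_forms maps a document id to its list of logical forms (tuples).
    """
    total = len(docs_forms)
    if total == 0:
        return set()
    doc_freq = Counter(form for forms in docs_forms.values() for form in set(forms))
    return {form for form, count in doc_freq.items() if count / total > max_fraction}


def suppress(forms, suppressed):
    """Drop suppressed forms, keeping the remaining forms in their original order."""
    return [form for form in forms if form not in suppressed]
```

Forms that survive are the ones more likely to help distinguish one document from another.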
  • FIG. 16 is a more detailed block diagram of retrieval engine 1482.
  • retrieval engine 1482 includes input logical form generator 1696, logical form modifier 1698, Boolean query generator 1600 and filter 1602. Filter 1602, in turn, includes logical form comparator 1604 and document rank generator 1606.
  • The user input query is provided to Boolean query generator 1600.
  • Boolean query generator 1600 generates a Boolean query based on the user input query in the same manner as in a conventional information retrieval system.
  • the Boolean query is provided to search engine 1484 which executes the query against statistical data store 1486.
  • Statistical data store 1486, in response, returns document records (including the modified set of logical forms) to search engine 1484 which, in turn, provides them to filter 1602 in retrieval engine 1482.
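The description says only that Boolean query generator 1600 works as in a conventional information retrieval system. A minimal sketch under that assumption is to drop common function words and join the remaining distinct terms with AND; the stop-word list and the `boolean_query` helper are hypothetical, and real engines differ in their query syntax.

```python
import re

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "how", "do", "does", "their", "what",
              "is", "of", "on", "in", "to", "me", "about"}


def boolean_query(text, stop_words=STOP_WORDS):
    """Build a conjunctive Boolean query from the content words of text."""
    terms = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in stop_words and word not in terms:
            terms.append(word)
    return " AND ".join(terms)
```

For instance, `boolean_query("How do spiders eat their victims?")` yields `"spiders AND eat AND victims"` with the stop-word list shown.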
  • the query is also provided to input logical form generator 1696.
  • Generator 1696 generates one or more logical forms based on the original words, and their relation to one another, in the query.
  • the logical forms are generated in the same fashion as described with respect to logical form generator 1592 in FIG. 15.
  • logical form modifier 1698 modifies the logical forms to illustratively include a set of paraphrased logical forms, and to suppress high frequency logical forms.
  • This modified set of logical forms is also provided to logical form comparator 1604 in filter 1602.
  • Logical form comparator 1604 compares the modified set of logical forms based on the query with the modified set of logical forms based on the documents retrieved from data store 1486. If any of the modified set of logical forms based on the query match those based on the documents, logical form comparator 1604 assigns a weight to the particular document containing the matched logical form. The weight is based on the number and type of matches associated with each document. If any document does not contain any matches, the document can either be discarded and not provided to the user, or provided to the user along with an indication that the document may be less likely to be relevant to the query.
  • the records of documents containing matches, along with the weights assigned by logical form comparator 1604, are provided to document rank generator 1606. Document rank generator 1606 ranks the documents based on the weights assigned by logical form comparator 1604 and provides a ranked output to the user as the output document set.
  • FIG. 17 is a flow diagram illustrating, in more detail, the operation of the system illustrated in FIG. 16.
  • the input query is first executed against statistical data store 1486 and the document records and the modified logical forms associated with those document records are provided to filter 1602. This is indicated by blocks 1708 and 1710.
  • Generator 1696 then generates logical forms based on the original content of the query. This is indicated by block 1712.
  • the logical forms based on the query are then modified by logical form modifier 1698. This is indicated by block 1714.
  • Filter 1602 selects a first of the document records provided by search engine 1484 in response to the query. This is indicated by block 1716.
  • Logical form comparator 1604 determines whether any of the modified query logical forms correspond to the modified document logical forms.
  • filter 1602 determines whether any additional documents need to be compared. This is indicated by blocks 1718, 1720 and 1722. If, however, any of the modified query logical forms matches any of the modified document logical forms, then the document being analyzed is assigned a weight by logical form comparator 1604. This is indicated by block 1724. Again, filter 1602 determines whether any additional documents need to be compared, as illustrated by block 1722.
  • document rank generator 1606 ranks the documents according to the weight assigned by logical form comparator 1604. The ranked output is then provided to the user. This is indicated by blocks 1726 and 1728.
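The loop of FIG. 17 through filter 1602 can be sketched as follows, including the alternative of keeping unmatched documents with an indication that they are less likely to be relevant. The weight here is assumed to be a sum of per-relation-type values over the matching logical forms, which is only one way to reflect "the number and type of matches"; `filter_and_rank` and its parameters are hypothetical.

```python
def filter_and_rank(query_forms, doc_records, type_weights, keep_unmatched=False):
    """Return (doc_id, weight, matched) tuples, matched documents first.

    doc_records maps a document id to its modified logical forms, each a
    (word, relation, word) tuple; type_weights maps a relation to a number.
    """
    query_set = set(query_forms)
    ranked = []
    for doc_id, forms in doc_records.items():        # blocks 1716 and 1722
        matches = query_set & set(forms)              # blocks 1718 and 1720
        if matches:
            weight = sum(type_weights.get(rel, 1) for _, rel, _ in matches)
            ranked.append((doc_id, weight, True))     # block 1724
        elif keep_unmatched:
            ranked.append((doc_id, 0, False))         # flagged as less likely relevant
    # Matched documents first, then descending weight, then id for stability.
    ranked.sort(key=lambda row: (not row[2], -row[1], row[0]))
    return ranked
```

With `keep_unmatched=False` the unmatched documents are simply discarded, which corresponds to the first of the two options described for documents without matches.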
  • FIG. 18 is a flow diagram illustrating the operation of logical form modifier 1594 shown in FIG. 15 and logical form modifier 1698 shown in FIG. 16. It will be understood that the present invention contemplates using modified logical forms, as discussed in greater detail below, on either the query side, or the data side, or both. For purposes of the present discussion, logical form modifiers are shown on both the query side and the data side.
  • the logical form modifier first receives the original logical form generated based on either the query or the documents being analyzed. This is indicated by block 1830.
  • the logical form modifier then generates paraphrases of the original logical forms.
  • the paraphrases can be formed in any number of ways, several of which are described below. Generation of the paraphrase logical forms is indicated by block 1832.
  • the logical form modifier then suppresses a predetermined class of logical forms (which can also be a wide variety of logical forms), a number of which are discussed below. This suppression is indicated by block 1834.
  • the paraphrased logical forms, after undergoing suppression, are then provided to filter 1602, where the documents are filtered based upon the logical forms remaining after suppression. This is indicated by block 1836.
  • FIG. 19 is a flow diagram better illustrating the generation of paraphrased logical forms, and the suppression of logical forms.
  • Semantic or lexical paraphrases
  • the original logical form is received by one of the logical form modifiers.
  • the logical form modifier then forms lexically paraphrased logical forms by first performing semantic expansion of words in the original logical form. This is indicated by block 1938.
  • the lexically paraphrased logical forms are then generated based on the semantically expanded words, and using the original structural connection in the original logical form. This is indicated by block 1940.
  • the semantic expansion is performed by examining each content word in the original logical form, and expanding the word to include synonyms, hypernyms, hyponyms, or other words having a semantic relation to the original content word.
  • logical form modifiers 1594 and 1698 may, in one embodiment, be provided with access to a reference corpus, such as a thesaurus, a dictionary, or a computational lexicon, such as the WordNet or MindNet lexicons, in order to identify synonyms, hypernyms, hyponyms, or other semantic relationships between words to identify possible lexical paraphrase relationships between the query and document.
  • the original logical form triples generated based on the query are:
      eat; Dsub; spider
      eat; Dobj; victim
  • This technique tends to retain relevant documents that are returned based on the query. Thus, this technique increases recall within this set of documents, without reducing precision.
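The lexical expansion of blocks 1938 and 1940 can be sketched as below: each content word is expanded to a set of related words, and new triples are formed from every combination while keeping the original structural relation. The small `RELATED` table stands in for a thesaurus or computational lexicon and its entries are illustrative assumptions, as are the function names.

```python
from itertools import product

# Hypothetical stand-in for a thesaurus or computational lexicon lookup.
RELATED = {
    "eat": ["consume"],
    "spider": ["arachnid"],
    "victim": ["prey"],
}


def expand_word(word, related=RELATED):
    """Return the word followed by its semantically related words."""
    return [word] + related.get(word, [])


def lexical_paraphrases(triple, related=RELATED):
    """Combine the expansions of both words, keeping the original relation."""
    head, relation, tail = triple
    return [(h, relation, t)
            for h, t in product(expand_word(head, related), expand_word(tail, related))]
```

The original triple is always the first element of the result, so the expanded set is a superset of the original logical forms.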
  • Structural paraphrases
  • After the original logical forms have been lexically expanded, they are structurally expanded to obtain additional paraphrased logical forms. Relevant documents returned by the search engine may, using more stringent techniques described in the references incorporated above by reference, be discarded even when the content words in the query occur in a single sentence in the document. This typically occurs when a syntactic or semantic paraphrase relationship exists between the query and the document sentence, but the logical forms based on the query and those based on the document do not match precisely. In order to correctly retain documents which meet these criteria, structural paraphrase rules are implemented in the logical form modifiers to generate additional logical forms based on the original logical forms.
  • the additional logical forms are intended to capture regular syntactic/semantic paraphrase relationships, normalizing differences between how the query was expressed by the user and how a relevant document expresses a similar concept.
  • the logical form modifiers augment the basic logical forms generated based upon the original input text. For example, if an original query is:
  • the original logical form triples based on the query are:
  • By implementing the structural paraphrase rules in accordance with one aspect of the present invention, the logical form modifier generates an additional logical form:
  • Such information can include:
  • Noun compounds/verb objects such as "program computers" and "computer program"
  • Attributive/predicate adjectives such as "That woman is tall" and "That tall woman"
  • Appendix A includes code which illustrates exemplary implementations of the rules described above. In each case, these rules allow for the retention of more relevant documents while still tightly constraining the matching process. Performing the structural expansion or structural paraphrasing of the original structural relation is indicated by block 1942 in FIG. 19.
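A structural paraphrase rule of the noun compound/verb object kind can be pictured as a function that maps one triple to an additional triple. The sketch below is not the Appendix A code; the `Mods` relation label used for the noun-compound reading, the rule function and `structural_paraphrases` are all assumptions made for illustration.

```python
def verb_object_to_compound(triple):
    """Map a verb-object triple to a hypothetical noun-compound triple.

    For example, ("program", "Dobj", "computer"), as in "program computers",
    yields ("program", "Mods", "computer"), as in "computer program".
    """
    head, relation, tail = triple
    if relation == "Dobj":
        return (head, "Mods", tail)
    return None


STRUCTURAL_RULES = [verb_object_to_compound]


def structural_paraphrases(triples, rules=STRUCTURAL_RULES):
    """Return the original triples plus any additional triples the rules produce."""
    expanded = list(triples)
    for triple in triples:
        for rule in rules:
            extra = rule(triple)
            if extra is not None and extra not in expanded:
                expanded.append(extra)
    return expanded
```

Because the rules only add triples, matching stays constrained to forms that a rule explicitly licenses.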
  • the paraphrase rules discussed above, and other such rules, can be obtained empirically, or by any other suitable means.
  • Although structural paraphrasing can be implemented both on the indexing side of the information retrieval system and on the query side, implementing it on the indexing side can undesirably increase the size of the index.
  • the structural paraphrasing is implemented only on the query side of the information retrieval system.
  • the structural paraphrase can either be performed prior to the semantic paraphrasing indicated by blocks 1938 and 1940, or afterward.
  • the structural paraphrase can be performed based on the additional logical forms generated during semantic expansion. This is indicated by blocks 1944 and 1946.
  • Meta structure paraphrases
  • An additional set of paraphrased logical forms which can be generated by the logical form modifier consists of abstract logical forms. For instance, even when users are encouraged to enter natural language queries into a search engine, many users still do not provide a well-formed query with multiple content words in an interesting syntactic/semantic relationship. Rather, many queries fall into a category referred to herein as a "keyword query". Such keyword queries include true keyword queries such as "dog", "gardening", "the Renaissance", and "Buffalo Bill".
  • Keyword queries can also be in the form of keywords in a stereotypical "frame" sentence that provides no useful linguistic context, such as "Tell me about dogs", "I want information on gardening", and "What do you have on dinosaurs?" Since such queries are common, the present invention includes matching techniques to accommodate these queries.
  • the query is identified as a keyword query, as indicated by block 1948 in FIG. 19, based on its structure.
  • a query is identified as a keyword query either because it consists of only one content word (or a sequence of content words treated as a complex content word, also known as a multi-word expression) or because it includes one or more content words occurring in an explicitly-identified, common query structure.
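The description identifies keyword queries from their linguistic structure. As a rough surface-level approximation only, the sketch below checks a query against the three frame sentences quoted above and otherwise tests whether, after dropping determiners, a single word or a known multi-word expression remains. The pattern list, the `MULTI_WORD` lexicon and `keyword_of` are hypothetical and do not use a parser.

```python
import re

# Frame sentences taken from the examples above; the patterns are assumptions.
FRAMES = [
    re.compile(r"^tell me about (?P<kw>.+?)[.?!]*$", re.IGNORECASE),
    re.compile(r"^i want information on (?P<kw>.+?)[.?!]*$", re.IGNORECASE),
    re.compile(r"^what do you have on (?P<kw>.+?)[.?!]*$", re.IGNORECASE),
]
DETERMINERS = {"the", "a", "an"}
MULTI_WORD = {"buffalo bill"}  # hypothetical multi-word expression lexicon


def keyword_of(query):
    """Return the keyword if query looks like a keyword query, else None."""
    text = query.strip()
    for frame in FRAMES:
        match = frame.match(text)
        if match:
            return match.group("kw")
    words = [w for w in re.findall(r"[A-Za-z]+", text) if w.lower() not in DETERMINERS]
    if len(words) == 1 or " ".join(words).lower() in MULTI_WORD:
        return " ".join(words)
    return None
```

A query such as "How do spiders eat their victims?" fits none of these cases and is therefore treated as an ordinary natural language query.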
  • a multi-word expression is
  • the Dnom (deep nominative) is "who"; or if the Dsub is syntactically unmodified, with the exception of a preceding determiner or prepositional phrase.
  • any logical form containing the verb "be" and a Dsub yields a special logical form as follows:
  • the abstract logical forms created at indexing time and at query time for keyword queries allow the information retrieval system to exploit linguistic structure on the data side in order to identify documents that are likely to be primarily about the keyword contained in the keyword query (e.g., the abstract logical forms on the data side represent the meta structure of the document which can be matched to a keyword query).
  • sentences in the document can be analyzed to determine the meta structure of the document. For example, the subjects of sentences, particularly subjects of sentences whose main verb is "be", tend to be the theme or topic of that sentence. Precision can be increased, even for keyword queries, by preferentially matching the keyword queries against documents containing sentences about that keyword. For instance, where the query is "Buffalo Bill" and a first document contains the sentence:
  • Buffalo Bill was a showman, usually acting as the part of himself in one of Buntline's melodramas.
  • the abstract logical forms generated at indexing time for the document and at query time for the keyword query allow the keyword query to be preferentially matched against the first document as opposed to the second document. This is because the first document contains the keyword query as the subject of a sentence, while the second document does not.
  • An additional example of an abstract logical form is created based on definitional sentences.
  • One example of a definitional sentence is as follows:
  • Definitional sentences of this type can be identified by examining cues that include linguistic structure and formatting structure. Most frequently, such sentences parse as a noun phrase containing a single noun or multi-word expression, followed by a comma, followed by a noun phrase in apposition thereto. This generates an abstract logical form of the form:
  • this class of logical forms is identified and suppressed in either the query, or the document records, or both. This is indicated by blocks 1954 and 1956 in FIG. 19.
  • some such logical forms can be suppressed only during the production of logical forms based on a query. For instance, a logical form of the type "give; Dobj; information" is not suppressed during document indexing, and may be useful in matching against a query such as "what databases give information on cancer?" In that instance, the user is requesting the identity of certain specific databases, and the query is quite specific.
  • a logical form of the type "give; Dobj; information" is suppressed during the processing of a query of the type "give me information on X".
  • This query is identified as a keyword query, and the identified logical form is suppressed.
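The asymmetry described above, where a logical form is kept at indexing time but suppressed for keyword queries, can be sketched as two small functions. The contents of `QUERY_ONLY_SUPPRESSED` beyond the one example given in the text, and the function names, are assumptions.

```python
# Forms suppressed only on the query side; the single entry comes from the text.
QUERY_ONLY_SUPPRESSED = {("give", "Dobj", "information")}


def suppress_for_query(forms, is_keyword_query):
    """Drop query-only suppressed forms when the query is a keyword query."""
    if not is_keyword_query:
        return list(forms)
    return [form for form in forms if form not in QUERY_ONLY_SUPPRESSED]


def suppress_for_index(forms):
    """At indexing time these forms are kept so that specific queries can match."""
    return list(forms)
```

Under this sketch, "what databases give information on cancer?" keeps its "give; Dobj; information" form because it is not a keyword query, while "give me information on X" loses it.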
  • Filter 1602 looks for matches between modified logical forms based on the query and those based on the documents, as discussed above.
  • the present invention provides a system for determining similarity between two or more textual inputs. Further, one aspect of the present invention is suitable for significantly increasing precision in an information retrieval application by identifying more relevant documents in the document set returned by the search engine than did previous techniques. Also, the present invention increases recall by reducing the number of relevant documents discarded during filtering.
  • One aspect of the present invention illustratively creates and compares logical forms based on two textual inputs, and creates paraphrased logical forms by lexically or semantically expanding the original words, by structurally expanding the original structural connections, and/or by creating abstract logical forms indicative of the meta structure of either or both of the textual inputs (e.g., a document or query, or both).
  • the present invention also illustratively suppresses certain logical forms.
  • paraphrasing and suppression need not be the same for both sets of logical forms, but could differ from one to the next.
  • hashing techniques are currently being employed to hash the index contained in statistical data store 1486 to a smaller size.
  • any suitable hashing technique can be used.
  • the present invention can be utilized equally well with a hashed representation of the index, or with a full representation of the index.
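As the text notes, any suitable hashing technique can be used. The sketch below, offered only as one possibility, reduces each formatted logical form string to a fixed-width integer derived from a SHA-1 digest and performs matching on the hashed values; the 32-bit width and the names `hash_form`, `hashed_index` and `has_match` are assumptions.

```python
import hashlib


def hash_form(form_text, bits=32):
    """Map a formatted logical form string to an integer of at most `bits` bits."""
    digest = hashlib.sha1(form_text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def hashed_index(records, bits=32):
    """Replace each record's logical form strings by their hashed values."""
    return {record_id: {hash_form(text, bits) for text in forms}
            for record_id, forms in records.items()}


def has_match(query_forms, hashed_doc_forms, bits=32):
    """Return True if any hashed query form appears in the hashed document forms."""
    return any(hash_form(text, bits) in hashed_doc_forms for text in query_forms)
```

A smaller bit width shrinks the index further at the cost of occasional false matches from hash collisions, which is the usual trade-off with a hashed representation.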
  • lex_record lex_get (Pred, 0) ;

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
EP98936899A 1997-07-22 1998-07-17 System for processing textual inputs using natural language processing techniques Ceased EP0998714A1 (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US08/898,652 US5933822A (en) 1997-07-22 1997-07-22 Apparatus and methods for an information retrieval system that employs natural language processing of search results to improve overall precision
US97979 1998-06-16
US09/097,979 US6901399B1 (en) 1997-07-22 1998-06-16 System for processing textual inputs using natural language processing techniques
PCT/US1998/014883 WO1999005621A1 (en) 1997-07-22 1998-07-17 System for processing textual inputs using natural language processing techniques
US898652 2001-07-03

Publications (1)

Publication Number Publication Date
EP0998714A1 true EP0998714A1 (en) 2000-05-10

Family

ID=26793837

Family Applications (1)

Application Number Title Priority Date Filing Date
EP98936899A Ceased EP0998714A1 (en) 1997-07-22 1998-07-17 System for processing textual inputs using natural language processing techniques

Country Status (3)

Country Link
EP (1) EP0998714A1 (zh)
CN (1) CN100524294C (zh)
WO (1) WO1999005621A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SE517496C2 (sv) 2000-06-22 2002-06-11 Hapax Information Systems Ab Method and system for information extraction
JP2004110161A (ja) 2002-09-13 2004-04-08 Fuji Xerox Co Ltd Text sentence comparison device
DE102007056140A1 (de) 2007-11-19 2009-05-20 Deutsche Telekom Ag Method and system for information search
GB201016385D0 (en) * 2010-09-29 2010-11-10 Touchtype Ltd System and method for inputting text into electronic devices
US8793199B2 (en) * 2012-02-29 2014-07-29 International Business Machines Corporation Extraction of information from clinical reports
EP2959405A4 (en) * 2013-02-19 2016-10-12 Google Inc RESEARCH BASED ON TREATMENT OF NATURAL LANGUAGE
US11409749B2 (en) * 2017-11-09 2022-08-09 Microsoft Technology Licensing, Llc Machine reading comprehension system for answering queries related to a document
US11106872B2 (en) * 2018-01-09 2021-08-31 Jyu-Fang Yu System and method for improving sentence diagram construction and analysis by enabling a user positioning sentence construction components and words on a diagramming interface
CN108829666B (zh) * 2018-05-24 2021-11-26 中山大学 Method for solving reading comprehension problems based on semantic parsing and SMT solving
RU2722587C9 (ru) * 2019-09-06 2020-09-14 Акционерное общество "Калужский научно-исследовательский институт телемеханических устройств" Method for forming and disassembling message text in application-layer binary information packets
CN111124422B (zh) * 2019-12-25 2023-03-10 成都互诚在线科技有限公司 EOS smart contract language conversion method based on abstract syntax trees
CN116663534A (zh) * 2023-08-02 2023-08-29 中国标准化研究院 Text data statistical analysis system and method based on natural language processing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL8900587A (nl) * 1989-03-10 1990-10-01 Bso Buro Voor Systeemontwikkel Method for determining the semantic relatedness of lexical components in a text.
US5321833A (en) * 1990-08-29 1994-06-14 Gte Laboratories Incorporated Adaptive ranking system for information retrieval
US5724567A (en) * 1994-04-25 1998-03-03 Apple Computer, Inc. System for directing relevance-ranked data objects to computer users
EP0953920A3 (en) * 1995-01-23 2005-06-29 BRITISH TELECOMMUNICATIONS public limited company Method and/or systems for accessing information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9905621A1 *

Also Published As

Publication number Publication date
CN1265209A (zh) 2000-08-30
CN100524294C (zh) 2009-08-05
WO1999005621A1 (en) 1999-02-04

Similar Documents

Publication Publication Date Title
US6901399B1 (en) System for processing textual inputs using natural language processing techniques
Senellart et al. Automatic discovery of similar words
EP0965089B1 (en) Information retrieval utilizing semantic representation of text
Xu et al. TREC 2003 QA at BBN: Answering Definitional Questions.
Delort et al. Enhanced web document summarization using hyperlinks
Witten Text Mining.
Strzalkowski Robust text processing in automated information retrieval
WO2006068872A2 (en) Method and system for extending keyword searching to syntactically and semantically annotated data
JP2011118689A (ja) Search method and system
EP0998714A1 (en) System for processing textual inputs using natural language processing techniques
Farhan et al. Survey of automatic query expansion for Arabic text retrieval
Strzalkowski Natural language processing in large-scale text retrieval tasks
Li et al. Supporting web query expansion efficiently using multi-granularity indexing and query processing
Cheng et al. An Experiment in Enhancing Information Access by Natural Language Processing
Strzalkowski et al. Recent developments in natural language text retrieval
Kian et al. An efficient approach for keyword selection; improving accessibility of web contents by general search engines
Matsumura et al. The Effect of Information Retrieval Method Using Dependency Relationship Between Words.
Manjula et al. Semantic search engine
Zheng et al. An improved focused crawler based on text keyword extraction
Ykhlef et al. Query paraphrasing using genetic approach for intelligent information retrieval
Haddad Automatic semantic header generator
Gure et al. Intelligence Information Retrieval System Modeling for Afaan Oromo
Bai et al. An Abstract-Level Patent Retrieval Model
Tuni et al. Afaan Oromo Hybrid Modelling: A Case based Optimized Intelligence in Information Retrieval System’s Localization
Shinkai et al. Complement keywords for query toward efficient information retrieval

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20000117

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH DE ES FR GB IE IT LI LU MC NL

17Q First examination report despatched

Effective date: 20041005

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20170612