US20120078926A1 - Efficient passage retrieval using document metadata - Google Patents

Info

Publication number
US20120078926A1
Authority
US
United States
Prior art keywords
documents
metadata
document
query
terms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/244,347
Inventor
Jennifer Chu-Carroll
David A. Ferrucci
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US13/244,347
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION; assignment of assignors interest (see document for details). Assignors: FERRUCCI, DAVID A., CHU-CARROLL, JENNIFER
Publication of US20120078926A1
Priority to US13/605,313 (US20120331003A1)
Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/02: Knowledge representation; Symbolic representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems

Definitions

  • the invention relates generally to information retrieval systems, and more particularly, the invention relates to an automated query/answer system and method implementing a passage retrieval component to conduct a search that identifies passages relevant to a given question using document metadata from a collection including text-based resources.
  • QA: question answering
  • NLP: natural language processing
  • search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.
  • Closed-domain QA deals with questions under a specific domain, for example medicine or automotive maintenance, and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies.
  • Open-domain QA deals with questions about nearly everything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
  • closed-domain QA might also refer to a situation where only limited types of questions are accepted, such as questions asking for descriptive rather than procedural information.
  • Access to information is currently dominated by two paradigms.
  • a major unsolved problem in such information query paradigms is the lack of a computer program capable of accurately answering factual questions based on information included in a collection of documents that can be either structured, unstructured, or both.
  • Such factual questions can be either broad, such as “what are the risks of vitamin K deficiency?”, or narrow, such as “when and where was Hillary Clinton's father born?”
  • a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving an input query; performing a query context analysis upon the input query to obtain searchable query terms; matching metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the data subcorpus using the searchable query terms to obtain one or more passages relevant to the input query from the identified documents, wherein one or more processor devices perform one or more of the retrieving, performing, matching, mapping, identifying and conducting.
  • the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
  • a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving, at a processor device, an input query; performing, at the processor device, a query context analysis upon the input query to obtain searchable query terms; accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with a corresponding document identification (ID); performing, by the processor device, a dictionary matching of the metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more document IDs; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the subcorpus using the searchable query terms to obtain passages relevant to the input query from the identified documents.
  • a computer program product for performing operations.
  • the computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method(s).
  • the method(s) are the same as listed above.
  • FIG. 1 shows a prior art high-level logical architecture 10 of a question answering method in which the present invention may be employed.
  • FIG. 2 is a schematic diagram depicting passage retrieval components 75 according to one embodiment.
  • FIG. 3 is a flow diagram illustrating a method 100 for performing passage retrieval operations in one embodiment.
  • FIG. 4 illustrates an exemplary hardware configuration to run method steps described in FIG. 3 in one embodiment.
  • FIG. 1 shows a QA system diagram such as described in U.S. patent application Ser. No. 12/126,642 depicting a high-level logical architecture 10 and methodology in which the present system and method may be employed in one embodiment.
  • FIG. 1 illustrates the major components that comprise a canonical question answering system 10 and their workflow.
  • the question analysis component 20 receives a natural language question 19 (e.g., “Who is the 42nd president of the United States?”) and analyzes the question to produce, minimally, the semantic type of the expected answer (in this example, “president”), and optionally other analysis results for downstream processing.
  • the search component 30 a formulates queries from the output 29 of question analysis and consults various resources such as the World Wide Web 41 or one or more knowledge resources, e.g., databases, knowledge bases 42 , to retrieve “documents” including, e.g., whole documents or document portions 44 , e.g., web-pages, database tuples, etc., having “passages” 44 that are relevant to answering the question.
  • the candidate answer generation component 30 b may then extract from the search results 48 potential (candidate) answers to the question, which are then scored and ranked by the answer selection component 50 to produce a final ranked list of answers with associated confidence scores.
  • In current question answering systems, one key component is the passage retrieval operation conducted when searching for candidate answers in a heterogeneous collection of structured, semi-structured and unstructured information resources. Passage retrieval operations adapt a search engine at their core to identify passages relevant to a given question from the collection of sources, e.g., text-based sources. Passage retrieval is also relevant to any search application where selecting passages containing, for example, 1-3 sentences is more appropriate than retrieving entire documents, either for processing by downstream components or for presentation to the end user.
  • the first approach is to adopt a document search engine to retrieve a list of relevant documents using the search engine's internal document ranking criteria, and to apply a custom post-hoc passage scoring algorithm to identify the most relevant text segments from these documents.
  • the second approach is to adopt a search engine with passage retrieval capability and to make use of the engine's internal ranking algorithm to return a set of relevant passages. In either approach, the retrieval process is performed over the entire collection, which typically contains millions of documents or more. This poses an efficiency issue for real-time question answering systems that must deliver answers to users in no more than a few seconds.
  • a typical solution for this problem is to split the search index into multiple subindices on multiple machines so that retrieval against the subindices can be performed in parallel and the results merged. While this solution addresses the efficiency issue, it poses other problems related to merging search results from multiple indices.
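The merge step of the split-index approach can be illustrated with a minimal sketch. The `Hit` shape and the function name `merge_shard_results` are hypothetical and do not come from the source. The sketch assumes that scores from different subindices are directly comparable, which may not hold in practice (for instance, if each subindex computes its own term statistics), and that is one way the merging problems mentioned above can arise.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

# Hypothetical result shape: (score, passage_id). Not taken from the source.
Hit = Tuple[float, str]


def merge_shard_results(shard_results: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-subindex hit lists into a single top-k list.

    Each input list must already be sorted by descending score.
    Assumption: scores are comparable across subindices.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(itertools.islice(merged, k))


if __name__ == "__main__":
    shard_a = [(0.9, "a1"), (0.4, "a2")]
    shard_b = [(0.7, "b1"), (0.6, "b2")]
    print(merge_shard_results([shard_a, shard_b], k=3))
```

Restricting retrieval to a small subcorpus, as the disclosure goes on to describe, avoids this merge step because only one constrained document set is searched.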
  • the present system and method for efficient passage retrieval against a corpus given a question is applicable and may be part of a Question Answering (QA) system.
  • the system and method for efficient passage retrieval against a corpus given a question may be implemented in non-QA applications, i.e., applications implemented to return a passage, for example, a 1-sentence to 3-sentence passage most relevant to a question, as opposed to an answer per se.
  • the present disclosure may extend and complement the effectiveness of a QA or non-QA system and method by improving the efficiency of passage retrieval operations based on dynamic subcorpus selection to constrain the number of relevant documents considered in the retrieval process.
  • the subcorpus selection process is based on a matching algorithm that identifies relevant documents based on the question text and metadata associated with the documents in the collection, such as document titles, user tags (“clouds”), or automatically identified document labels.
  • the passage retrieval process is then restricted to return passages only from this subcorpus, which typically contains several orders of magnitude fewer documents than the entire collection.
  • the approach to efficient passage retrieval significantly constrains the pool of documents from which passages may be retrieved based on metadata associated with documents, such as document titles and user tags (“clouds”).
  • the efficiency of passage retrieval is improved by providing the ability to dynamically select a subcorpus from which search will take place based on terms in the user question and metadata associated with documents in the corpus. More specifically, the user's input question string is analyzed to extract all matches between question terms and document metadata. Those matched documents comprise a subcorpus from which the system will extract passages for this question.
  • FIG. 2 is a schematic diagram depicting passage retrieval components 75 that may be implemented in QA and non-QA systems according to one embodiment.
  • the system components 75 conducting passage retrieval operations make use of system modules from FIG. 1 such as: the question analysis processing component 20 that performs a query context analysis upon a received input query to break down said input query into query terms, and any searchable components thereof; and, the search component 30 a that formulates queries from the output searchable components of the question analysis unit and that consults various resources such as the World Wide Web 40 or one or more knowledge resources, e.g., databases, knowledge bases 42 .
  • the question analysis processing component 20 includes a programmed matcher component 80 that functions to identify document metadata present in the question. It performs this by consulting a resource 84 containing document metadata information for all documents.
  • Document metadata may include any information that identifies the topic or domain of the document, such as the document title, manually or automatically derived category/domain classification, and crowdsourced or automatically derived tag clouds (“clouds”) which indicate general topics of the document. It is against this data resource 84 that matching of terms in the input question to the document metadata information is performed.
  • Data corpus 89 represents the entire data corpus that the QA or non-QA system is using and may include both open domain and closed domain topics.
  • a document containing George W. Bush's 2007 State of the Union address may include the following metadata:
  • one way to implement this matcher component 80 is to represent the metadata in dictionary form and to leverage a dictionary matcher to identify dictionary terms that appear in an input question.
  • any matching component that can be used to identify closed or open domain dictionary terms in text (e.g., legal terms, medical terms, or generic named entities) may be used.
  • the matching algorithm determines from the question text those terms that match entries in the dictionary.
  • an example of a dictionary matcher is the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator, whose functionality is incorporated by reference as if fully set forth herein.
  • the matched dictionary entries are used to identify a subset of documents for the passage retrieval process. That is, for the query terms that are mapped to the metadata (titles, tags, clouds) of a document in the resource 84 , that document's index (or other document identifier) is flagged, tagged, or recorded for its inclusion in a subcorpus.
  • each dictionary entry in resource 84 encodes the document ID for each document that contains metadata matching that dictionary term.
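The dictionary structure and matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ConceptMapper API and not necessarily how resource 84 is implemented: the function names (`normalize`, `build_metadata_dictionary`, `match_question`), the tokenization rule, the `max_len` limit and the sample document IDs are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, ...]  # a metadata term stored as a tuple of lowercase tokens


def normalize(text: str) -> Key:
    """Lowercase and tokenize on alphanumeric runs (an assumed, simplistic rule)."""
    return tuple(re.findall(r"[a-z0-9]+", text.lower()))


def build_metadata_dictionary(docs: Dict[str, Iterable[str]]) -> Dict[Key, Set[str]]:
    """Map each metadata term (title, tag, label) to the IDs of documents carrying it."""
    dictionary: Dict[Key, Set[str]] = defaultdict(set)
    for doc_id, metadata_values in docs.items():
        for value in metadata_values:
            key = normalize(value)
            if key:
                dictionary[key].add(doc_id)
    return dict(dictionary)


def match_question(question: str, dictionary: Dict[Key, Set[str]], max_len: int = 5) -> Set[str]:
    """Return the IDs of all documents whose metadata matches any n-gram of the question."""
    tokens = normalize(question)
    doc_ids: Set[str] = set()
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            doc_ids |= dictionary.get(tokens[start:end], set())
    return doc_ids


if __name__ == "__main__":
    # Toy metadata; the document IDs are made up for illustration.
    docs = {
        "d1": ["Frank Sinatra", "singer"],
        "d2": ["State of the Union"],
        "d3": ["Hillary Clinton"],
    }
    dictionary = build_metadata_dictionary(docs)
    print(match_question("When and where was Hillary Clinton's father born?", dictionary))
```

Storing the document IDs directly in each dictionary entry means a match yields the subcorpus members without a second lookup.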
  • the metadata and associated document information in the dictionary entries that match the terms in the input question are represented as 85 in FIG. 2 .
  • the passage retrieval component can be any standard IR (Information Retrieval) search engine 90 that supports both: retrieval of relevant short passages, instead of full documents; and runtime specification of a relevant subcorpus for retrieval.
  • one IR search engine that satisfies these requirements is the Indri engine from the Lemur Toolkit, http://www.lemurproject.org/indri/, incorporated by reference as if fully set forth herein.
  • the matched documents identified by the matcher component 80 form a constrained document set 88 within the entire corpus 89 , which has the entire index. A subcorpus 92 is built from the constrained document set 88 , and passage retrieval operations via IR search engine 90 are performed on it to select the most relevant passages.
  • a passage retrieval method 100 employed by the passage retrieval components 75 for improving the efficiency of passage retrieval is described with respect to FIG. 3 .
  • the method 100 includes at 101 , receiving at a processor device, an input query and, using a parser device or function, breaking down the query into searchable query terms.
  • the obtained searchable query terms from said input query are terms that match document metadata.
  • the semi-structured source of information is a dictionary or corpus that associates data (e.g., definitions) with a large set of vocabulary items including document metadata stored in a memory storage device.
  • the semi-structured source of information may be formed via off-line processes that extract document metadata from one or more documents of a large corpus of documents.
  • the extracted document metadata is stored as a dictionary in the memory storage device, with each document metadata stored in the dictionary having one or more associated document identifications (IDs) that represent those documents matching the metadata in that dictionary entry.
  • the programmed processor device invokes a matching component to match document metadata against the query terms.
  • a dictionary matcher may be invoked that includes the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator.
  • there is then performed mapping of the matched document metadata to corresponding one or more document IDs. Then at 120 , from the corresponding IDs, there is performed identifying the corresponding matched documents.
  • the corresponding documents indicated by the mapped document IDs are identified, e.g., flagged, tagged or recorded in the corpus in which the actual documents are electronically stored with their ID.
  • the identified corresponding matched documents form the subcorpus 92 of documents including only the identified matched metadata documents of the larger corpus of documents.
  • This step invokes corpus construction functionality to identify the subset of flagged, tagged or otherwise identified matched metadata documents obtained from the first corpus 84 ( FIG. 2 ) during the matching step, which functionality for dynamically constructing subcorpora during runtime is provided for example in the above-incorporated Indri engine from the Lemur Toolkit.
  • after step 120 , there may be further performed at 125 , extracting the identified corresponding matched documents found in step 120 as the subcorpus 92 .
  • the method performs passage retrieval operations against those identified matched metadata documents obtained from the subcorpus 92 formed at step 120 or 125 .
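Restricting passage retrieval to the subcorpus can be illustrated with a small stand-in for the IR engine. This is a sketch under stated assumptions and does not use Indri: the sentence splitter, the term-overlap score, the `window` size and all names (`split_sentences`, `retrieve_passages`) are hypothetical simplifications chosen only to show that documents outside the subcorpus are never scanned.

```python
import re
from typing import Dict, List, Set, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def retrieve_passages(
    query_terms: Set[str],
    corpus: Dict[str, str],
    subcorpus_ids: Set[str],
    window: int = 2,
    top_k: int = 3,
) -> List[Tuple[int, str, str]]:
    """Score sliding windows of sentences, looking only at documents in the subcorpus.

    Returns (score, doc_id, passage) tuples, where score is the number of
    distinct query terms present in the passage (a toy ranking function).
    """
    hits: List[Tuple[int, str, str]] = []
    for doc_id in sorted(subcorpus_ids):
        text = corpus.get(doc_id)
        if text is None:
            continue
        sentences = split_sentences(text)
        for i in range(len(sentences)):
            passage = " ".join(sentences[i:i + window])
            words = set(re.findall(r"[a-z0-9]+", passage.lower()))
            score = len(query_terms & words)
            if score:
                hits.append((score, doc_id, passage))
    hits.sort(key=lambda hit: -hit[0])  # stable sort keeps document order on ties
    return hits[:top_k]


if __name__ == "__main__":
    # Toy corpus; IDs and texts are illustrative only.
    corpus = {
        "d1": "Frank Sinatra was born in Hoboken. He was a singer.",
        "d2": "Hoboken is a city in New Jersey.",
    }
    print(retrieve_passages({"hoboken"}, corpus, subcorpus_ids={"d1"}))
```

In the usage example, document `d2` also mentions the query term but is not returned, because only the subcorpus selected by metadata matching is searched.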
  • the passage retrieval process 100 when performed in parallel with traditional passage retrieval algorithms is more effective when the information sought in the question is present in documents whose relevant metadata field contains a term/phrase in the question.
  • the dictionary can be constructed to include morphological variations for the given metadata information, such as including both the singular and plural forms of terms, as well as known synonyms.
  • redirect links between Wikipedia® titles (which, e.g., redirect requests for the document “artists” to the document titled “artist” and, for example, “Ol' Blue Eyes” to “Frank Sinatra”) are used to capture morphological variations and synonyms.
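Expanding the dictionary with redirect-style variants can be sketched as below. The function name `expand_with_variants`, the plain-string keys and the document IDs are assumptions made for illustration; the source does not specify how the variants are stored.

```python
from typing import Dict, Set


def expand_with_variants(
    dictionary: Dict[str, Set[str]],
    redirects: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Add each variant term as a key pointing at its canonical term's document IDs.

    Assumes dictionary keys are already lowercase. `redirects` maps a variant
    (plural form, nickname, synonym) to the canonical metadata term.
    """
    expanded = {term: set(ids) for term, ids in dictionary.items()}
    for variant, canonical in redirects.items():
        ids = dictionary.get(canonical.lower())
        if ids:
            expanded.setdefault(variant.lower(), set()).update(ids)
    return expanded


if __name__ == "__main__":
    # Document IDs are made up; the redirect pairs are the ones named in the text.
    dictionary = {"frank sinatra": {"d1"}, "artist": {"d2"}}
    redirects = {"Ol' Blue Eyes": "Frank Sinatra", "artists": "artist"}
    for term, ids in sorted(expand_with_variants(dictionary, redirects).items()):
        print(term, sorted(ids))
```

Doing the expansion offline, when the dictionary is built, keeps the question-time matcher a plain lookup.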
  • morphological and synonym information can be mined from publicly available resources such as WordNet® (trademark of Princeton University) available at http://wordnet.princeton.edu/. For these questions, this approach significantly reduces execution time compared with performing passage retrieval against a large unconstrained corpus 89 .
  • FIG. 1 shows a system diagram described in U.S. patent application Ser. No. 12/126,642 depicting a high-level logical architecture of a QA system 10 and methodology in which a system and method for deferred type evaluation using text with limited structure is employed in one embodiment.
  • the high level logical architecture 10 includes the Query Analysis module 20 implementing functions for receiving and analyzing a user query or question.
  • the term “user” may refer to a person or persons interacting with the system, or refers to a computer system 22 generating a query by mechanical means, and where the term “user query” refers to such a mechanically generated query and context 19 ′.
  • a candidate answer generation module 30 is provided to implement a search for candidate answers by traversing structured, semi structured and unstructured sources contained in primary sources (e.g., the Web, a data corpus 41 ) and in an Answer Source or a Knowledge Base (KB), e.g., containing collections of relations and lists extracted from primary sources. All the sources of information can be locally stored or distributed over a network, including the Internet.
  • the Candidate Answer generation module 30 of architecture 10 generates a plurality of output data structures containing candidate answers based upon the analysis of retrieved data.
  • an Evidence Gathering module 50 further interfaces with the primary sources and knowledge base for concurrently analyzing the evidence based on passages having candidate answers, and scores each of candidate answers, in one embodiment, as parallel processing operations.
  • the architecture may be employed utilizing the Common Analysis System (CAS) candidate answer structures as is described in commonly-owned, issued U.S. Pat. No. 7,139,752, the whole contents and disclosure of which is incorporated by reference as if fully set forth herein.
  • the Evidence Gathering and Scoring module 50 comprises a Candidate Answer Scoring module 40 for analyzing a retrieved passage and scoring each of candidate answers of a retrieved passage.
  • the Answer Source Knowledge Base may comprise one or more databases of structured or semi-structured sources (pre-computed or otherwise) comprising collections of relations (e.g., Typed Lists).
  • the Answer Source knowledge base may comprise a database stored in a memory storage system, e.g., a hard drive.
  • An Answer Ranking module 60 may be invoked to provide functionality for ranking candidate answers and determining a response 99 returned to a user via a user's computer display interface (not shown) or a computer system 22 , where the response may be an answer, or an elaboration of a prior answer or request for clarification in response to a question—when a high quality answer to the question is not found.
  • a machine learning implementation is further provided where the “answer ranking” module 60 includes a trained model component (not shown) produced using machine learning techniques from prior data.
  • the processing depicted in FIG. 1 may be local, on a server, or server cluster, within an enterprise, or alternately, may be distributed with or integral with or otherwise operate in conjunction with a public or privately available search engine in order to enhance the question answer functionality in the manner as described.
  • the method may be provided as a computer program product comprising instructions executable by a processing device, or as a service deploying the computer program product.
  • the architecture employs a search engine (e.g., a document retrieval system) as a part of Candidate Answer Generation module 30 which may be dedicated to searching the Internet, a publicly available database, a web-site (e.g., IMDB.com), a privately available collection of documents or, a privately available database.
  • Databases can be stored in any storage system, non-volatile memory storage systems, e.g., a hard drive or flash memory, and can be distributed over the network or not.
  • the system and method of FIG. 1 makes use of the Common Analysis System (CAS), a subsystem of the Unstructured Information Management Architecture (UIMA) that handles data exchanges between the various UIMA components, such as analysis engines and unstructured information management applications.
  • CAS supports data modeling via a type system independent of programming language, provides data access through a powerful indexing mechanism, and provides support for creating annotations on text data, such as described in (http://www.research.ibm.com/journal/sj/433/gotz.html) incorporated by reference as if set forth herein.
  • CAS allows for multiple definitions of the linkage between a document and its annotations, as is useful for the analysis of images, video, or other non-textual modalities (as taught in the herein incorporated reference U.S. Pat. No. 7,139,752).
  • UIMA may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources.
  • the architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters.
  • the UIMA system, method and computer program may be used to generate answers to input queries.
  • the method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data and for identifying and annotating a particular type of semantic content. Thus it can be used to analyze a question and to extract entities as possible answers to a question from a collection of documents.
  • modules of FIGS. 1 , 2 can be represented as functional components in GATE (General Architecture for Text Engineering) (see: http://gate.ac.uk/releases/gate-2.0alpha2-build484/doc/userguide.html).
  • GATE employs components which are reusable software chunks with well-defined interfaces that are conceptually separate from GATE itself. All component sets are user-extensible and together are called CREOLE—a Collection of REusable Objects for Language Engineering.
  • the GATE framework is a backplane into which CREOLE components plug. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system.
  • GATE components are one of three types of specialized Java Beans: 1) Resource: the top-level interface, which describes all components. What all components share in common is that they can be loaded at runtime, and that the set of components is extendable by clients. They have Features, which are represented externally to the system as “meta-data” in a format such as RDF, plain XML, or Java properties. Resources may all be Java beans in one embodiment. 2) ProcessingResource: a resource that is runnable, may be invoked remotely (via RMI), and lives in class files.
  • LanguageResource: a resource that consists of data, accessed via a Java abstraction layer. They live in relational databases; and, VisualResource: a visual Java bean, a component of GUIs, including the main GATE GUI. Like PRs, these components live in .class or .jar files.
  • a PR is a Resource that implements the Java Runnable interface.
  • in the GATE Visualisation Model, resources whose task is to display and edit other resources are modeled as Visual Resources.
  • the Corpus Model in GATE is a Java Set whose members are documents.
  • Both Corpora and Documents are types of Language Resources (LR), with all LRs having a Feature Map (a Java Map) associated with them that stores attribute/value information about the resource.
  • FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via an annotation model.
  • Documents have a DocumentContent which is a text at present (future versions may add support for audiovisual content) and one or more AnnotationSets which are Java Sets.
  • Like UIMA, GATE can be used as a basis for implementing natural language dialog systems and multimodal dialog systems having a question answering system as one of the main submodules.
  • the references, incorporated herein by reference above (U.S. Pat. Nos. 6,829,603 and 6,983,252, and 7,136,909) enable one skilled in the art to build such an implementation.
  • FIG. 4 illustrates an exemplary hardware configuration of a computing system 400 in which the present system and method may be employed.
  • the hardware configuration preferably has at least one processor or central processing unit (CPU) 411 .
  • the CPUs 411 are interconnected via a system bus 412 to a random access memory (RAM) 414 , read-only memory (ROM) 416 , input/output (I/O) adapter 418 (for connecting peripheral devices such as disk units 421 and tape drives 440 to the bus 412 ), user interface adapter 422 (for connecting a keyboard 424 , mouse 426 , speaker 428 , microphone 432 , and/or other user interface device to the bus 412 ), a communication adapter 434 for connecting the system 400 to a data processing network, the Internet, an Intranet, a local area network (LAN), etc., and a display adapter 436 for connecting the bus 412 to a display device 438 and/or printer 439 (e.g., a digital printer or the like).
  • aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a system, apparatus, or device running an instruction.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Internet Service Provider for example, AT&T, MCI, Sprint, EarthLink, MSN, GTE, etc.
  • system and method for efficient passage retrieval may be performed with data structures native to various programming languages such as Java and C++.
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A system, method and computer program product for efficiently retrieving relevant passages to questions based on a corpus of data. A processor device receives an input query and performs a query analysis to obtain searchable query terms. The processor performs: matching metadata associated with one or more documents against the query terms. The document metadata includes one or more of: a title of the documents, one or more user tags or clouds. Then the processor device performs: mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the data subcorpus using the searchable query terms to obtain one or more passages relevant to the input query from the identified documents.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • The present invention relates to and claims the benefit of the filing date of commonly-owned, co-pending U.S. Provisional Patent Application No. 61/386,019, filed Sep. 24, 2010, the entire contents and disclosure of which is incorporated by reference as if fully set forth herein.
  • BACKGROUND
  • The invention relates generally to information retrieval systems, and more particularly, the invention relates to an automated query/answer system and method implementing a passage retrieval component to conduct a search that identifies passages relevant to a given question using document metadata from a collection including text-based resources.
  • DESCRIPTION OF THE RELATED ART
  • An introduction to the current issues and approaches of question answering (QA) can be found in the web-based reference http://en.wikipedia.org/wiki/Question_answering. Generally, QA is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection) the system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines.
  • QA research attempts to deal with a wide range of question types including: fact, list, definition, How, Why, hypothetical, semantically-constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.
  • Closed-domain QA deals with questions under a specific domain, for example medicine or automotive maintenance, and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Open-domain QA deals with questions about nearly everything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
  • Alternatively, closed-domain QA might refer to a situation where only a limited type of questions are accepted, such as questions asking for descriptive rather than procedural information.
  • Access to information is currently dominated by two paradigms. First, a database query that answers questions about what is in a collection of structured records. Second, a search that delivers a collection of document links in response to a query against a collection of unstructured data, for example, text or html.
  • A major unsolved problem in such information query paradigms is the lack of a computer program capable of accurately answering factual questions based on information included in a collection of documents that can be either structured, unstructured, or both. Such factual questions can be either broad, such as “what are the risks of vitamin K deficiency?”, or narrow, such as “when and where was Hillary Clinton's father born?”
  • It is a challenge to understand the query, to find appropriate documents that might contain the answer, and to extract the correct answer to be delivered to the user. There is a need to further advance the methodologies for answering open-domain questions.
  • SUMMARY
  • In one aspect there is provided a computing infrastructure and methodology that conducts question and answering and performs automatic passage retrieval operations in a highly efficient manner.
  • In one aspect, there is provided a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving an input query; performing a query context analysis upon the input query to obtain searchable query terms; matching metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the data subcorpus using the searchable query terms to obtain one or more passages relevant to the input query from the identified documents, wherein one or more processor devices performs one or more of the retrieving, performing, matching, mapping, identifying and conducting.
  • In this aspect, the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
  • Further to this aspect, prior to matching of metadata associated with one or more documents against the query terms there is performed: extracting document metadata from one or more documents of a corpus of documents; providing the extracted document metadata as a dictionary in a storage device, each document metadata stored in the dictionary being associated with a corresponding document identification (ID), wherein the matching of metadata against the query terms comprises: performing, by the processor device, a dictionary matching.
  • In an alternate embodiment, there is provided a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving, at a processor device, an input query; performing, at the processor device, a query context analysis upon the input query to obtain searchable query terms; accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with a corresponding document identification (ID); performing, by the processor device, a dictionary matching of the metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more document IDs; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the subcorpus using the searchable query terms to obtain passages relevant to the input query from the identified documents.
  • A computer program product is provided for performing operations. The computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method(s). The method(s) are the same as listed above.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The objects, features and advantages of the invention are understood within the context of the Detailed Description, as set forth below. The Detailed Description is understood within the context of the accompanying drawings, which form a material part of this disclosure, wherein:
  • FIG. 1 shows a prior art high level logical architecture 10 of a question/answering method in which the present invention may be employed;
  • FIG. 2 is a schematic diagram depicting passage retrieval components 75 according to one embodiment;
  • FIG. 3 is a flow diagram illustrating a method 100 for performing passage retrieval operations in one embodiment; and,
  • FIG. 4 illustrates an exemplary hardware configuration to run method steps described in FIG. 3 in one embodiment.
  • DETAILED DESCRIPTION
  • FIG. 1 shows a QA system diagram such as described in U.S. patent application Ser. No. 12/126,642 depicting a high-level logical architecture 10 and methodology in which the present system and method may be employed in one embodiment.
  • FIG. 1 illustrates the major components that comprise a canonical question answering system 10 and their workflow. The question analysis component 20 receives a natural language question 19 (e.g., “Who is the 42nd president of the United States?”) and analyzes the question to produce, minimally, the semantic type of the expected answer (in this example, “president”), and optionally other analysis results for downstream processing. The search component 30 a formulates queries from the output 29 of question analysis and consults various resources such as the World Wide Web 41 or one or more knowledge resources, e.g., databases, knowledge bases 42, to retrieve “documents” including, e.g., whole documents or document portions 44, e.g., web-pages, database tuples, etc., having “passages” 44 that are relevant to answering the question. The candidate answer generation component 30 b may then extract from the search results 48 potential (candidate) answers to the question, which are then scored and ranked by the answer selection component 50 to produce a final ranked list of answers with associated confidence scores.
  • In current question and answer systems, one key component is the passage retrieval operations conducted when searching for candidate answers in a heterogeneous collection of structured, semi-structured and unstructured information resources. Passage retrieval operations adapt a search engine at its core to identify passages relevant to a given question from the collection of sources, e.g., text-based sources. Passage retrieval is also relevant to any search application where selecting passages containing, for example, 1-3 sentences is more appropriate than retrieving entire documents either for processing by downstream components, or for presentation to the end user.
  • Most existing systems performing a passage retrieval operation adopt one of two approaches. The first approach is to adopt a document search engine to retrieve a list of relevant documents using the search engine's internal document ranking criteria, and to apply a custom post-hoc passage scoring algorithm to identify the most relevant text segments from these documents. The second approach is to adopt a search engine with passage retrieval capability and to make use of the engine's internal ranking algorithm to return a set of relevant passages. In either approach, the retrieval process is performed over the entire collection, which typically contains millions of documents or more. This poses an efficiency issue for real-time question answering systems that must deliver answers to users in no more than a few seconds. A typical solution for this problem is to split the search index into multiple subindices on multiple machines so that retrieval against the subindices can be performed in parallel and their results merged. While this solution addresses the efficiency issue, it poses other problems related to merging search results from multiple indices.
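  • The merge step of the split-index solution described above can be sketched as follows. This is a minimal illustration only: the `(score, passage_id)` hit format and the function name `merge_subindex_results` are assumptions made for the example and are not taken from the disclosure. The sketch also assumes that scores from different subindices are directly comparable, which may not hold in practice if each subindex computes scores from its own collection statistics; that is one plausible reading of the merging problems mentioned above, not a statement from the source.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

# Hypothetical hit format: (score, passage_id). Higher score = more relevant.
Hit = Tuple[float, str]


def merge_subindex_results(result_lists: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-subindex hit lists into one global top-k list.

    Each input list must already be sorted by descending score.
    Assumes scores are comparable across subindices (see lead-in).
    """
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, k))


# Toy results from two subindices that were searched in parallel.
subindex_a = [(0.9, "a1"), (0.4, "a2")]
subindex_b = [(0.7, "b1"), (0.1, "b2")]
top_hits = merge_subindex_results([subindex_a, subindex_b], k=3)
```

  • Here `top_hits` interleaves the two lists by score. Restricting retrieval to a small subcorpus, as proposed in this disclosure, avoids the need for this cross-index merge.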
  • It would be highly desirable to provide a system and method that improves the efficiency of passage retrieval based on dynamic subcorpus selection to constrain the number of relevant documents considered in the retrieval process.
  • In one embodiment, the present system and method for efficient passage retrieval against a corpus given a question is applicable and may be part of a Question Answering (QA) system. Alternatively, the system and method for efficient passage retrieval against a corpus given a question may be implemented in non-QA applications, i.e., applications implemented to return a passage, for example, a 1-sentence to 3-sentence passage most relevant to a question, as opposed to an answer per se.
  • Commonly-owned, co-pending U.S. patent application Ser. No. 12/126,642, titled “SYSTEM AND METHOD FOR PROVIDING QUESTION AND ANSWERS WITH DEFERRED TYPE EVALUATION” and co-pending U.S. patent application Ser. No. 12/152411, titled “SYSTEM AND METHOD FOR PROVIDING ANSWERS TO QUESTIONS” are both incorporated by reference herein, and describe a QA (Question and Answer) system and method in which the present passage retrieval system may be incorporated.
  • In one embodiment, the present disclosure may extend and complement the effectiveness of a QA or non-QA system and method by improving the efficiency of passage retrieval operations based on dynamic subcorpus selection to constrain the number of relevant documents considered in the retrieval process.
  • In one embodiment, the subcorpus selection process is based on a matching algorithm that identifies relevant documents based on the question text and metadata associated with the documents in the collection, such as document titles, user tags (“clouds”), or automatically identified document labels. The passage retrieval process is then restricted to return passages only from this subcorpus, which typically contains several orders of magnitude fewer documents than the entire collection.
  • The approach to efficient passage retrieval significantly constrains the pool of documents from which passages may be retrieved based on metadata associated with documents, such as document titles and user tags (“clouds”). The efficiency of passage retrieval is improved by providing the ability to dynamically select a subcorpus from which search will take place based on terms in the user question and metadata associated with documents in the corpus. More specifically, the user's input question string is analyzed to extract all matches between question terms and document metadata. Those matched documents comprise a subcorpus from which the system will extract passages for this question.
  • In a non-limiting example, there is considered the following user question
    • “which modern artist was Francoise Gilot, Dr. Jonas Salk's wife, once the companion of?”
  • In the example, matching document titles against the terms in the question yields five entities: “modern”, “artist”, “Francoise Gilot”, “Jonas Salk”, and “companion”, each of which is identified as a document title in the corpus. It is understood that a term may map to multiple documents with that title. For example, “companion” may map to an article that talks about a caregiver, or an architectural feature of ships, or a character in “Doctor Who”. Using the document identifications (IDs) that correspond to each document title, the documents with the identified document IDs are selected to form a subcorpus consisting of potentially highly relevant documents for answering the given question. The passage retrieval process is then constrained to finding the most relevant passages from this document subcorpus, which may contain on the order of tens of documents, instead of from the entire collection, which may contain millions of documents or more. In this example, several relevant passages are found, such as “Francoise Gilot (born 1921) is a French born painter and is known as a companion of Picasso between 1944 and 1953” from the document titled “Francoise Gilot”, and “In 1968, they divorced, and in 1970 Salk married Francoise Gilot, the former mistress of Pablo Picasso” from the document titled “Jonas Salk”.
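  • The mapping from matched titles to a subcorpus in this example can be sketched as below. The document IDs, the `TITLE_TO_DOC_IDS` table and the `select_subcorpus` function are hypothetical and serve only to illustrate the idea; the disclosure does not prescribe a particular data structure.

```python
from typing import Dict, List, Set

# Hypothetical title dictionary: lowercased title -> IDs of documents
# carrying that title. All IDs are invented for illustration.
TITLE_TO_DOC_IDS: Dict[str, Set[int]] = {
    "modern": {101},
    "artist": {102},
    "francoise gilot": {103},
    "jonas salk": {104},
    # One title may map to several documents, e.g. a caregiver article,
    # a ship feature, and a "Doctor Who" character.
    "companion": {105, 106, 107},
}


def select_subcorpus(matched_titles: List[str]) -> Set[int]:
    """Return the union of document IDs for all matched titles."""
    doc_ids: Set[int] = set()
    for title in matched_titles:
        doc_ids |= TITLE_TO_DOC_IDS.get(title.lower(), set())
    return doc_ids


matched = ["modern", "artist", "Francoise Gilot", "Jonas Salk", "companion"]
subcorpus = select_subcorpus(matched)
```

  • With these toy values the five matched titles select seven documents, so passage retrieval would run over seven documents rather than the whole collection.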
  • FIG. 2 is a schematic diagram depicting passage retrieval components 75 that may be implemented in QA and non-QA systems according to one embodiment. In one embodiment, the system components 75 conducting passage retrieval operations make use of system modules from FIG. 1 such as: the question analysis processing component 20 that performs a query context analysis upon a received input query to break down said input query into query terms, and any searchable components thereof; and, the search component 30 a that formulates queries from the output searchable components of question analysis unit and that consults various resources such as the World Wide Web 40 or one or more knowledge resources, e.g., databases, knowledge bases 42.
  • More particularly, as shown in FIG. 2, the question analysis processing component 20 includes a programmed matcher component 80 that functions to identify document metadata present in the question. It performs this by consulting a resource 84 containing document metadata information for all documents. Document metadata may include any information that identifies the topic or domain of the document, such as the document title, manually or automatically derived category/domain classification, and crowdsourced or automatically derived tag clouds (“clouds”) which indicate general topics of the document. It is against this data resource 84 where matching of terms in the input question to the document metadata information is performed. Data corpus 89 represents the entire data corpus that the QA or non-QA system is using and may include both open domain and closed domain topics.
  • For example, a document containing George W. Bush's 2007 State of the Union address may include the following metadata:
  • Title: 2007 State of the Union Address
  • Category: Presidential Addresses, George W. Bush Speeches, . . .
  • Tags: Security, Iraq, Terrorists, Health, America, . . .
  • A sample implementation of this matcher component 80 is to represent the metadata in dictionary form and to leverage a dictionary matcher to identify dictionary terms that appear in an input question. For example, any matching component that can identify closed or open domain dictionary terms in text (e.g., legal terms, medical terms, or generic named entities) may be used. Thus, given a piece of text (an input query), the matching algorithm determines from the question text those terms that match entries in the dictionary. In one embodiment, a dictionary matcher includes the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator, whose functionality is incorporated by reference as if fully set forth herein.
  • The matched dictionary entries (question terms) are used to identify a subset of documents for the passage retrieval process. That is, for the query terms that are mapped to the metadata (titles, tags, clouds) of a document in the resource 84, that document's index (or other document identifier) is flagged, tagged, or recorded for its inclusion in a subcorpus. In one embodiment, each dictionary entry in resource 84 encodes the document ID for each document that contains metadata matching that dictionary term. The metadata and associated document information in the dictionary entries that match the terms in the input question are represented as 85 in FIG. 2.
  • The passage retrieval component can be any standard IR (Information Retrieval) search engine 90 that supports both: retrieval of relevant short passages, instead of full documents; and runtime specification of a relevant subcorpus for retrieval. One example IR search engine that satisfies these requirements is the Indri engine from the Lemur Toolkit, http://www.lemurproject.org/indri/, incorporated by reference as if fully set forth herein.
  • In further view of FIG. 2, the matched documents identified by the matcher component 80 form a constrained document set 88, indicated within the entire corpus 89 having the entire index. A subcorpus 92 is built including the constrained document set 88, on which passage retrieval operations via the IR search engine 90 are performed to select the most relevant passages.
  • A passage retrieval method 100 employed by the passage retrieval components 75 for improving the efficiency of passage retrieval is described with respect to FIG. 3. As shown in FIG. 3, the method 100 includes at 101, receiving at a processor device, an input query and, using a parser device or function, breaking down the query into searchable query terms. In one embodiment, the obtained searchable query terms from said input query are terms that match document metadata. Then, at 105, there is performed accessing a semi-structured source of information containing document metadata (such as the title of the documents, a category, or user tags or clouds). In one embodiment, the semi-structured source of information is a dictionary or corpus that associates data (e.g., definitions) with a large set of vocabulary items including document metadata stored in a memory storage device.
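  • A dictionary matcher of the kind described above can be sketched with a simple longest-match scan over the question tokens. This is a plain-Python stand-in written for illustration; it is not the ConceptMapper annotator and does not reproduce its behavior. The names `tokenize` and `match_dictionary`, the `max_len` parameter, and the document IDs are all assumptions.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_dictionary(
    question: str, dictionary: Dict[str, Set[int]], max_len: int = 4
) -> List[Tuple[str, Set[int]]]:
    """Find dictionary entries in the question, preferring longer phrases.

    Returns (matched phrase, document IDs encoded in that entry) pairs
    in the order the phrases appear in the question.
    """
    tokens = tokenize(question)
    matches: List[Tuple[str, Set[int]]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i : i + n])
            if phrase in dictionary:
                matches.append((phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Hypothetical dictionary entries with invented document IDs.
entries = {
    "modern": {101},
    "artist": {102},
    "francoise gilot": {103},
    "jonas salk": {104},
    "companion": {105, 106, 107},
}
question = ("Which modern artist was Francoise Gilot, "
            "Dr. Jonas Salk's wife, once the companion of?")
found = match_dictionary(question, entries)
```

  • Each returned pair carries the document IDs stored with the matched entry, so the IDs can be recorded directly for inclusion in the subcorpus.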
  • That is, in one embodiment, the semi-structured source of information may be formed via off-line processes that extract document metadata from one or more documents of a large corpus of documents. The extracted document metadata is stored as a dictionary in the memory storage device, with each document metadata stored in the dictionary having one or more associated document identifications (IDs) that represent those documents matching the metadata in that dictionary entry.
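  • The off-line dictionary construction described above can be sketched as follows, assuming each document's metadata is available as a simple record. The `DocumentMetadata` class, the `build_metadata_dictionary` function, the lowercasing of entries, and the document IDs (including the second record) are hypothetical choices made for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DocumentMetadata:
    """Hypothetical metadata record extracted from one document."""
    doc_id: int
    title: str
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def build_metadata_dictionary(records: List[DocumentMetadata]) -> Dict[str, Set[int]]:
    """Map each metadata string to the IDs of documents that carry it."""
    dictionary: Dict[str, Set[int]] = defaultdict(set)
    for record in records:
        for entry in [record.title, *record.categories, *record.tags]:
            dictionary[entry.strip().lower()].add(record.doc_id)
    return dict(dictionary)


records = [
    DocumentMetadata(
        doc_id=1,
        title="2007 State of the Union Address",
        categories=["Presidential Addresses", "George W. Bush Speeches"],
        tags=["Security", "Iraq", "Terrorists", "Health", "America"],
    ),
    # Invented second document that shares one tag with the first.
    DocumentMetadata(doc_id=2, title="Iraq War", tags=["Iraq"]),
]
metadata_dictionary = build_metadata_dictionary(records)
```

  • In this toy dictionary the entry "iraq" is associated with both document IDs, which shows how one dictionary entry can encode several documents.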
  • Then, at 110, the programmed processor device performs invoking a matching component to match document metadata against the query terms. As mentioned, a dictionary matcher may be invoked that includes the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator.
  • Continuing to 115, there is next performed mapping of the matched document metadata to corresponding one or more document IDs. Then at 120, from the corresponding IDs, there is performed identifying the corresponding matched documents.
  • In one embodiment, for the matched document metadata found in the dictionary, the corresponding documents indicated by the mapped document IDs are identified, e.g., flagged, tagged or recorded in the corpus in which the actual documents are electronically stored with their ID. Thus, in one embodiment, the identified corresponding matched documents form the subcorpus 92 of documents including only the identified matched metadata documents of the larger corpus of documents. This step invokes corpus construction functionality to identify the subset of flagged, tagged or otherwise identified matched metadata documents obtained from the first corpus 84 (FIG. 2) during the matching step, which functionality for dynamically constructing subcorpora during runtime is provided for example in the above-incorporated Indri engine from the Lemur Toolkit.
  • In an alternate embodiment, there may be further performed at 125, extracting the identified corresponding matched documents found in step 120 as the subcorpus 92.
  • Then, at 130, the method performs passage retrieval operations against those identified matched metadata documents obtained from the subcorpus 92 formed at step 120 or 125.
  • Finally, assuming a search engine has internal document ranking ability, then at 135, there is returned the resulting list of ranked passages at 125.
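  • The final steps of method 100, passage retrieval restricted to the subcorpus followed by ranking, can be sketched as below. This is a toy stand-in for an IR engine with passage retrieval capability; it does not use or imitate the Indri API. Sentence-level passages, the term-overlap score, and the names `split_passages` and `retrieve_passages` are assumptions, as are the sample documents.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple


def split_passages(text: str) -> List[str]:
    """Split a document into one-sentence passages (naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def retrieve_passages(
    query_terms: Iterable[str],
    corpus: Dict[int, str],
    subcorpus_ids: Set[int],
    top_k: int = 3,
) -> List[Tuple[int, int, str]]:
    """Rank passages from subcorpus documents by query-term overlap.

    Returns (score, doc_id, passage) tuples, best first. Documents
    outside subcorpus_ids are never scanned.
    """
    terms = {t.lower() for t in query_terms}
    scored: List[Tuple[int, int, str]] = []
    for doc_id in sorted(subcorpus_ids):
        text = corpus.get(doc_id)
        if text is None:
            continue
        for passage in split_passages(text):
            words = set(re.findall(r"[a-z0-9]+", passage.lower()))
            score = len(terms & words)
            if score > 0:
                scored.append((score, doc_id, passage))
    scored.sort(key=lambda item: -item[0])  # stable: ties keep scan order
    return scored[:top_k]


# Invented toy corpus; document 3 is outside the selected subcorpus.
corpus = {
    1: "Francoise Gilot is a French-born painter. She was a companion of Picasso.",
    2: "Jonas Salk developed a polio vaccine. Salk married Francoise Gilot in 1970.",
    3: "A companion is also a feature of ships.",
}
ranked = retrieve_passages(
    ["artist", "francoise", "gilot", "salk", "companion"], corpus, {1, 2}
)
```

  • Document 3 contains a query term but is never scored, because only the documents in the selected subcorpus are scanned.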
  • In one embodiment, the passage retrieval process 100, FIG. 3 when performed in parallel with traditional passage retrieval algorithms is more effective when the information sought in the question is present in documents whose relevant metadata field contains a term/phrase in the question. To increase recall, the dictionary can be constructed to include morphological variations for the given metadata information, such as including both the singular and plural forms of terms, as well as known synonyms. In one embodiment, redirect links between Wikipedia® titles (which, e.g., redirect requests for the document “artists” to the document titled “artist” and for example, “Ol' Blue Eyes” to “Frank Sinatra”) are used to capture morphological variations and synonyms. Alternatively, morphological and synonym information can be mined from publicly available resources such as WordNet® (Trademark of The CORPORATION NEW JERSEY Princeton University) available at http://wordnet.princeton.edu/. For these questions, this approach significantly reduces execution time compared with performing passage retrieval against a large unconstrained corpus 89.
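  • Expanding the dictionary with redirect-style variants, as described above, can be sketched as follows. The `add_redirects` function, the redirect table format (alias mapped to target title) and the document IDs are hypothetical; how redirect links are actually harvested is outside the scope of this sketch.

```python
from typing import Dict, Set


def add_redirects(
    dictionary: Dict[str, Set[int]], redirects: Dict[str, str]
) -> Dict[str, Set[int]]:
    """Return a copy of the dictionary with alias entries added.

    Each alias inherits the document IDs of its redirect target.
    Aliases whose target is not in the dictionary are skipped.
    """
    expanded = {term: set(ids) for term, ids in dictionary.items()}
    for alias, target in redirects.items():
        target_ids = dictionary.get(target.lower())
        if target_ids:
            expanded.setdefault(alias.lower(), set()).update(target_ids)
    return expanded


# Invented document IDs; the alias pairs follow the examples in the text.
base_dictionary = {"artist": {102}, "frank sinatra": {200}}
redirect_table = {"artists": "artist", "Ol' Blue Eyes": "Frank Sinatra"}
expanded_dictionary = add_redirects(base_dictionary, redirect_table)
```

  • After expansion, a question mentioning "artists" selects the same document as one mentioning "artist", which is the recall gain the paragraph above describes.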
  • As mentioned, FIG. 1 shows a system diagram described in U.S. patent application Ser. No. 12/126,642 depicting a high-level logical architecture of a QA system 10 and methodology in which a system and method for deferred type evaluation using text with limited structure is employed in one embodiment.
  • Generally, as shown in FIG. 1, the high level logical architecture 10 includes the Query Analysis module 20 implementing functions for receiving and analyzing a user query or question. The term “user” may refer to a person or persons interacting with the system, or refers to a computer system 22 generating a query by mechanical means, and where the term “user query” refers to such a mechanically generated query and context 19′. A candidate answer generation module 30 is provided to implement a search for candidate answers by traversing structured, semi-structured and unstructured sources contained in primary sources (e.g., the Web, a data corpus 41) and in an Answer Source or a Knowledge Base (KB), e.g., containing collections of relations and lists extracted from primary sources. All the sources of information can be locally stored or distributed over a network, including the Internet.
  • The Candidate Answer generation module 30 of architecture 10 generates a plurality of output data structures containing candidate answers based upon the analysis of retrieved data. In FIG. 1, an Evidence Gathering module 50 further interfaces with the primary sources and knowledge base for concurrently analyzing the evidence based on passages having candidate answers, and scores each of the candidate answers, in one embodiment, as parallel processing operations. In one embodiment, the architecture may be employed utilizing the Common Analysis System (CAS) candidate answer structures as is described in commonly-owned, issued U.S. Pat. No. 7,139,752, the whole contents and disclosure of which is incorporated by reference as if fully set forth herein.
  • As depicted in FIG. 1, when the Search System 30 a is employed in the context of a QA system, the Evidence Gathering and Scoring module 50 comprises a Candidate Answer Scoring module 40 for analyzing a retrieved passage and scoring each of the candidate answers of a retrieved passage. The Answer Source Knowledge Base (KB) may comprise one or more databases of structured or semi-structured sources (pre-computed or otherwise) comprising collections of relations (e.g., Typed Lists). In an example implementation, the Answer Source knowledge base may comprise a database stored in a memory storage system, e.g., a hard drive.
  • An Answer Ranking module 60 may be invoked to provide functionality for ranking candidate answers and determining a response 99 returned to a user via a user's computer display interface (not shown) or a computer system 22, where the response may be an answer, or an elaboration of a prior answer or request for clarification in response to a question, when a high quality answer to the question is not found. A machine learning implementation is further provided where the “answer ranking” module 60 includes a trained model component (not shown) produced using machine learning techniques from prior data.
  • The processing depicted in FIG. 1, may be local, on a server, or server cluster, within an enterprise, or alternately, may be distributed with or integral with or otherwise operate in conjunction with a public or privately available search engine in order to enhance the question answer functionality in the manner as described. Thus, the method may be provided as a computer program product comprising instructions executable by a processing device, or as a service deploying the computer program product. The architecture employs a search engine (e.g., a document retrieval system) as a part of Candidate Answer Generation module 30 which may be dedicated to searching the Internet, a publicly available database, a web-site (e.g., IMDB.com), a privately available collection of documents or, a privately available database. Databases can be stored in any storage system, non-volatile memory storage systems, e.g., a hard drive or flash memory, and can be distributed over the network or not.
  • In one embodiment, when employed in a QA system, the system and method of FIG. 1 makes use of the Common Analysis System (CAS), a subsystem of the Unstructured Information Management Architecture (UIMA) that handles data exchanges between the various UIMA components, such as analysis engines and unstructured information management applications. CAS supports data modeling via a type system independent of programming language, provides data access through a powerful indexing mechanism, and provides support for creating annotations on text data, such as described in (http://www.research.ibm.com/journal/sj/433/gotz.html) incorporated by reference as if set forth herein. It should be noted that the CAS allows for multiple definitions of the linkage between a document and its annotations, as is useful for the analysis of images, video, or other non-textual modalities (as taught in the herein incorporated reference U.S. Pat. No. 7,139,752).
  • In one embodiment, UIMA may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The UIMA system, method and computer program may be used to generate answers to input queries. The method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data and for identifying and annotating a particular type of semantic content. Thus it can be used to analyze a question and to extract entities as possible answers to a question from a collection of documents.
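  • The idea of coupled annotators operating on shared document data can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the UIMA API; the names `AnalysisState`, `token_annotator`, `year_annotator` and `run_pipeline` are hypothetical and introduced only for illustration. The first annotator tokenizes the text, and the second reads those tokens to annotate one particular type of semantic content (here, assumed to be four-digit years).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int


@dataclass
class AnalysisState:
    """Hypothetical shared structure passed between annotators (not the UIMA CAS API)."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def token_annotator(state):
    # Tokenize the document text and record each token's character span.
    for m in re.finditer(r"\w+", state.text):
        state.annotations.append(Annotation("Token", m.start(), m.end()))


def year_annotator(state):
    # Read the Token annotations produced upstream and mark four-digit tokens as years.
    tokens = [a for a in state.annotations if a.type == "Token"]
    for ann in tokens:
        if re.fullmatch(r"\d{4}", state.covered_text(ann)):
            state.annotations.append(Annotation("Year", ann.begin, ann.end))


def run_pipeline(text, annotators):
    # Apply each annotator in order to the same shared state.
    state = AnalysisState(text)
    for annotate in annotators:
        annotate(state)
    return state


state = run_pipeline("Einstein was born in 1879.", [token_annotator, year_annotator])
years = [state.covered_text(a) for a in state.annotations if a.type == "Year"]
```

  • Because each annotator only reads and appends annotations on the shared state, further annotators could be appended to the list without changing the existing ones.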
  • In an alternative environment, modules of FIGS. 1 and 2 can be represented as functional components in GATE (General Architecture for Text Engineering) (see: http://gate.ac.uk/releases/gate-2.0alpha2-build484/doc/userguide.html). GATE employs components which are reusable software chunks with well-defined interfaces that are conceptually separate from GATE itself. All component sets are user-extensible and together are called CREOLE, a Collection of REusable Objects for Language Engineering. The GATE framework is a backplane into which CREOLE components plug. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system. In one embodiment, only their configuration data is loaded to begin with; the actual classes are loaded when the user requests the instantiation of a resource. GATE components are specialized Java Beans. Resource is the top-level interface, which describes all components. What all components share in common is that they can be loaded at runtime, and that the set of components is extendable by clients. They have Features, which are represented externally to the system as “meta-data” in a format such as RDF, plain XML, or Java properties. Resources may all be Java beans in one embodiment. Each component is one of three types: 1) ProcessingResource: a resource that is runnable, may be invoked remotely (via RMI), and lives in class files. In order to load a PR (Processing Resource) the system knows where to find the class or jar files (which will also include the metadata); 2) LanguageResource: a resource that consists of data, accessed via a Java abstraction layer. These resources live in relational databases; and 3) VisualResource: a visual Java bean, a component of GUIs, including the main GATE GUI. Like PRs, these components live in .class or .jar files.
  • In the GATE processing model, any resource whose primary characteristics are algorithmic, such as parsers, generators and so on, is modeled as a Processing Resource. A PR is a Resource that implements the Java Runnable interface. In the GATE Visualisation Model, resources whose task is to display and edit other resources are modeled as Visual Resources. The Corpus Model in GATE is a Java Set whose members are documents. Both Corpora and Documents are types of Language Resources (LRs), with all LRs having a Feature Map (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via an annotation model. Documents have a DocumentContent, which is text at present (future versions may add support for audiovisual content), and one or more AnnotationSets, which are Java Sets.
  • Like UIMA, GATE can be used as a basis for implementing natural language dialog systems and multimodal dialog systems having a question answering system as one of the main submodules. The references, incorporated herein by reference above (U.S. Pat. Nos. 6,829,603, 6,983,252 and 7,136,909), enable one skilled in the art to build such an implementation.
  • FIG. 4 illustrates an exemplary hardware configuration of a computing system 400 in which the present system and method may be employed. The hardware configuration preferably has at least one processor or central processing unit (CPU) 411. The CPUs 411 are interconnected via a system bus 412 to a random access memory (RAM) 414, read-only memory (ROM) 416, input/output (I/O) adapter 418 (for connecting peripheral devices such as disk units 421 and tape drives 440 to the bus 412), user interface adapter 422 (for connecting a keyboard 424, mouse 426, speaker 428, microphone 432, and/or other user interface device to the bus 412), a communication adapter 434 for connecting the system 400 to a data processing network, the Internet, an Intranet, a local area network (LAN), etc., and a display adapter 436 for connecting the bus 412 to a display device 438 and/or printer 439 (e.g., a digital printer or the like).
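  • The corpus model described here (a corpus as a set of documents, a feature map on every language resource, and annotation sets that attach features to ranges of text) can be pictured with a small data-structure sketch. It is written in Python rather than Java, it is not the GATE API, and the class and method names (`Corpus`, `Document`, `Annotation`, `annotate`) as well as the sample text and feature values are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    type: str
    # Feature map stored as sorted (key, value) pairs so the annotation stays hashable.
    features: tuple = ()


@dataclass(eq=False)  # identity-based hashing, so documents can be members of a set
class Document:
    content: str
    features: dict = field(default_factory=dict)         # feature map of the resource
    annotation_sets: dict = field(default_factory=dict)  # set name -> set of Annotation

    def annotate(self, set_name, start, end, type_, **features):
        # Associate attribute/value information with a range of the document text.
        ann = Annotation(start, end, type_, tuple(sorted(features.items())))
        self.annotation_sets.setdefault(set_name, set()).add(ann)
        return ann


@dataclass(eq=False)
class Corpus:
    features: dict = field(default_factory=dict)
    documents: set = field(default_factory=set)  # the corpus is a set of documents


doc = Document("Einstein was born in Ulm.", features={"language": "en"})
start = doc.content.index("Ulm")
ann = doc.annotate("default", start, start + len("Ulm"), "Location", kind="city")

corpus = Corpus(features={"name": "sample"})
corpus.documents.add(doc)
```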
  • As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a system, apparatus, or device running an instruction.
  • A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Thus, in one embodiment, the system and method for efficient passage retrieval may be performed with data structures native to various programming languages such as Java and C++.
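  • As a minimal sketch of how such native data structures might be used, the following Python example builds a metadata dictionary that maps document titles to document IDs, matches query terms against it, forms a subcorpus from the matched documents, and searches only that subcorpus for passages. The tokenizer, the n-gram dictionary lookup, the term-overlap scoring, and the sample corpus are all simplifying assumptions made for illustration; they are not details given in this description.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Simplified stand-in for query analysis: lowercase alphanumeric terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_metadata_dictionary(corpus):
    # Map each normalized metadata string (here, a title) to the IDs of documents carrying it.
    dictionary = defaultdict(set)
    for doc_id, doc in corpus.items():
        for label in doc["metadata"]:
            dictionary[" ".join(tokenize(label))].add(doc_id)
    return dictionary


def match_metadata(query_terms, dictionary, max_len=4):
    # Dictionary matching: look up every run of up to max_len consecutive query terms.
    doc_ids = set()
    for n in range(1, max_len + 1):
        for i in range(len(query_terms) - n + 1):
            key = " ".join(query_terms[i:i + n])
            doc_ids |= dictionary.get(key, set())
    return doc_ids


def retrieve_passages(query, corpus, dictionary):
    terms = tokenize(query)
    subcorpus = match_metadata(terms, dictionary)  # matched document IDs
    term_set = set(terms)
    scored = []
    for doc_id in sorted(subcorpus):               # search only the subcorpus
        for passage in corpus[doc_id]["passages"]:
            overlap = len(term_set & set(tokenize(passage)))
            if overlap:
                scored.append((overlap, doc_id, passage))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


# Hypothetical three-document corpus used only to exercise the sketch.
corpus = {
    "d1": {"metadata": ["Albert Einstein"],
           "passages": ["Einstein was born in Ulm in 1879.",
                        "He developed the theory of relativity."]},
    "d2": {"metadata": ["Ulm"],
           "passages": ["Ulm is a city on the Danube."]},
    "d3": {"metadata": ["Isaac Newton"],
           "passages": ["Newton was born in 1643."]},
}
dictionary = build_metadata_dictionary(corpus)
results = retrieve_passages("Where was Albert Einstein born?", corpus, dictionary)
```

  • In this sketch only document d1 enters the subcorpus, so the passage in d3 is never scored even though it shares terms with the query. This is the intended effect of restricting the passage search to documents whose metadata matches the query.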
  • Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which run via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • The embodiments described above are illustrative examples and it should not be construed that the present invention is limited to these particular embodiments. Thus, various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (25)

1. A computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising:
receiving an input query;
performing a query analysis upon said input query to obtain searchable query terms;
matching metadata associated with one or more documents against said query terms;
mapping matched document metadata to corresponding one or more documents;
identifying corresponding matched documents to form a subcorpus of documents; and
conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents,
wherein one or more processor devices performs one or more said retrieving, performing, matching, mapping, identifying and conducting.
2. The computer-implemented method of claim 1, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
3. The computer-implemented method of claim 2, wherein prior to matching of metadata associated with one or more documents against said query terms:
extracting document metadata from one or more documents of a corpus of documents;
providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with one or more corresponding document identifications.
4. The computer-implemented method of claim 3, wherein said matching of metadata against said query terms comprises: performing, by said processor device, dictionary matching.
5. The computer-implemented method of claim 2, wherein said data corpus comprising document metadata information includes variations of metadata including one or more of: singular and plural forms of metadata terms, and synonyms for metadata terms.
6. The computer-implemented method of claim 2, wherein obtaining searchable query terms from said input query comprises parsing, by said processor device, said input query to obtain terms matching document metadata.
7. The computer-implemented method of claim 2, wherein said identifying corresponding matched documents to form a subcorpus of documents includes tagging or flagging each matched metadata documents in said corpus of documents.
8. The computer-implemented method of claim 2, further comprising: extracting said tagged or flagged identified corresponding matched documents to form said subcorpus of documents.
9. A computer program product for efficiently retrieving relevant passages to questions based on a corpus of data, the computer program device comprising a non-transitory storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising:
receiving an input query;
performing a query context analysis upon said input query to obtain searchable query terms;
matching metadata associated with one or more documents against said query terms;
mapping matched document metadata to corresponding one or more documents;
identifying corresponding matched documents to form a subcorpus of documents; and
conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified.
10. The computer program product of claim 9, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
11. The computer program product of claim 9, wherein prior to matching of metadata associated with one or more documents against said query terms:
extracting document metadata from one or more documents of a corpus of documents;
providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with one or more corresponding document identifications.
12. The computer program product of claim 11, wherein said matching of metadata against said query terms comprises: performing, by said processor device, dictionary matching.
13. The computer program product of claim 9, wherein said data corpus comprising document metadata information includes variations of metadata including one or more of:
singular and plural forms of metadata terms, and synonyms for metadata terms.
14. The computer program product of claim 10, wherein obtaining searchable query terms from said input query comprises parsing, by said processor device, said input query to obtain terms matching document metadata.
15. The computer program product of claim 10, wherein said identifying corresponding matched documents to form a subcorpus of documents includes tagging or flagging each matched metadata documents in said corpus of documents.
16. The computer program product of claim 10, further comprising: extracting said tagged or flagged identified corresponding matched documents to form said subcorpus of documents.
17. A computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising:
receiving an input query;
performing a query context analysis upon said input query to obtain searchable query terms;
accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with one or more corresponding document identifications (IDs);
performing a dictionary matching of said metadata associated with one or more documents against said query terms;
mapping matched document metadata to corresponding one or more document IDs;
identifying corresponding matched documents to form a subcorpus of documents; and
conducting a search in said subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents, wherein one or more processor devices perform one or more said retrieving, performing query context analysis, accessing, performing dictionary matching, mapping, identifying and conducting.
18. The computer-implemented method of claim 17, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
19. The computer-implemented method of claim 18, wherein obtaining searchable query terms from said input query comprises parsing, by said processor device, said input query to obtain terms matching document metadata.
20. The computer-implemented method of claim 17, wherein said identifying corresponding matched documents to form a subcorpus of documents includes:
tagging or flagging each matched metadata documents in said data corpus; and,
extracting said tagged or flagged identified corresponding matched documents to form said subcorpus of documents.
21. A system for efficiently retrieving relevant passages to questions based on a corpus of data comprising:
a memory storage device;
a processor device in communication with the memory device that performs a method comprising:
receiving an input query;
performing a query context analysis upon said input query to obtain searchable query terms;
matching metadata associated with one or more documents against said query terms;
mapping matched document metadata to corresponding one or more documents;
identifying corresponding matched documents to form a subcorpus of documents; and
conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
22. The system of claim 21, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
23. The system of claim 22, wherein prior to matching of metadata associated with one or more documents against said query terms:
extracting document metadata from one or more documents of a corpus of documents;
providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with a corresponding document identification, wherein said matching of metadata against said query terms comprises performing a dictionary matching.
24. A computer program product for efficiently retrieving relevant passages to questions based on a corpus of data, the computer program device comprising a storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising:
receiving, at a processor device, an input query;
performing, at said processor device, a query context analysis upon said input query to obtain searchable query terms;
accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with a corresponding document identification (ID);
performing, by said processor device, a dictionary matching of said metadata associated with one or more documents against said query terms;
mapping matched document metadata to corresponding one or more document IDs;
identifying corresponding matched documents to form a subcorpus of documents; and
conducting a search in said subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
25. The computer program product of claim 24, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
US13/244,347 2010-09-24 2011-09-24 Efficient passage retrieval using document metadata Abandoned US20120078926A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/244,347 US20120078926A1 (en) 2010-09-24 2011-09-24 Efficient passage retrieval using document metadata
US13/605,313 US20120331003A1 (en) 2010-09-24 2012-09-06 Efficient passage retrieval using document metadata

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US38601910P 2010-09-24 2010-09-24
US13/244,347 US20120078926A1 (en) 2010-09-24 2011-09-24 Efficient passage retrieval using document metadata

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/605,313 Continuation US20120331003A1 (en) 2010-09-24 2012-09-06 Efficient passage retrieval using document metadata

Publications (1)

Publication Number Publication Date
US20120078926A1 US20120078926A1 (en) 2012-03-29

Family

ID=45871676

Family Applications (4)

Application Number Title Priority Date Filing Date
US13/244,348 Active 2032-01-30 US9569724B2 (en) 2010-09-24 2011-09-24 Using ontological information in open domain type coercion
US13/244,347 Abandoned US20120078926A1 (en) 2010-09-24 2011-09-24 Efficient passage retrieval using document metadata
US13/605,339 Active 2032-03-20 US9508038B2 (en) 2010-09-24 2012-09-06 Using ontological information in open domain type coercion
US13/605,313 Abandoned US20120331003A1 (en) 2010-09-24 2012-09-06 Efficient passage retrieval using document metadata

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US13/244,348 Active 2032-01-30 US9569724B2 (en) 2010-09-24 2011-09-24 Using ontological information in open domain type coercion

Family Applications After (2)

Application Number Title Priority Date Filing Date
US13/605,339 Active 2032-03-20 US9508038B2 (en) 2010-09-24 2012-09-06 Using ontological information in open domain type coercion
US13/605,313 Abandoned US20120331003A1 (en) 2010-09-24 2012-09-06 Efficient passage retrieval using document metadata

Country Status (4)

Country Link
US (4) US9569724B2 (en)
EP (1) EP2616927A4 (en)
CN (1) CN103221915B (en)
WO (2) WO2012040676A1 (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130086101A1 (en) * 2011-09-29 2013-04-04 Sap Ag Data Search Using Context Information
US20140278364A1 (en) * 2013-03-15 2014-09-18 International Business Machines Corporation Business intelligence data models with concept identification using language-specific clues
US20150006449A1 (en) * 2013-06-27 2015-01-01 International Business Machines Corporation Enhanced Document Input Parsing
US20150112683A1 (en) * 2012-03-13 2015-04-23 Mitsubishi Electric Corporation Document search device and document search method
US9092989B2 (en) 2012-11-16 2015-07-28 International Business Machines Corporation Multi-dimensional feature merging for open domain question answering
US20160055234A1 (en) * 2014-08-19 2016-02-25 International Business Machines Corporation Retrieving Text from a Corpus of Documents in an Information Handling System
US20160078102A1 (en) * 2014-09-12 2016-03-17 Nuance Communications, Inc. Text indexing and passage retrieval
US9613133B2 (en) 2014-11-07 2017-04-04 International Business Machines Corporation Context based passage retrieval and scoring in a question answering system
US9984116B2 (en) 2015-08-28 2018-05-29 International Business Machines Corporation Automated management of natural language queries in enterprise business intelligence analytics
US10002179B2 (en) 2015-01-30 2018-06-19 International Business Machines Corporation Detection and creation of appropriate row concept during automated model generation
US10331782B2 (en) 2014-11-19 2019-06-25 Lexisnexis, A Division Of Reed Elsevier Inc. Systems and methods for automatic identification of potential material facts in documents
CN110532229A (en) * 2019-06-14 2019-12-03 平安科技(深圳)有限公司 Instrument of evidence search method, device, computer equipment and storage medium
US10540588B2 (en) 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
US10606651B2 (en) 2015-04-17 2020-03-31 Microsoft Technology Licensing, Llc Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit
US10698924B2 (en) 2014-05-22 2020-06-30 International Business Machines Corporation Generating partitioned hierarchical groups based on data sets for business intelligence data models
WO2020172155A1 (en) * 2019-02-18 2020-08-27 David Nahamoo Intelligent document system
CN113204621A (en) * 2021-05-12 2021-08-03 北京百度网讯科技有限公司 Document storage method, document retrieval method, device, equipment and storage medium
CN113448984A (en) * 2021-07-15 2021-09-28 中国银行股份有限公司 Document positioning display method and device, server and electronic equipment
CN113505147A (en) * 2021-07-27 2021-10-15 中国工商银行股份有限公司 Data processing method and device, electronic equipment and readable storage medium
WO2022204435A3 (en) * 2021-03-24 2022-11-24 Trust & Safety Laboratory Inc. Multi-platform detection and mitigation of contentious online content

Families Citing this family (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9002773B2 (en) 2010-09-24 2015-04-07 International Business Machines Corporation Decision-support application and system for problem solving using a question-answering system
US9275636B2 (en) * 2012-05-03 2016-03-01 International Business Machines Corporation Automatic accuracy estimation for audio transcriptions
US9418151B2 (en) * 2012-06-12 2016-08-16 Raytheon Company Lexical enrichment of structured and semi-structured data
US9305103B2 (en) * 2012-07-03 2016-04-05 Yahoo! Inc. Method or system for semantic categorization
KR102019975B1 (en) * 2012-08-29 2019-11-04 삼성전자주식회사 Device and contents searching method using the same
US9299024B2 (en) 2012-12-11 2016-03-29 International Business Machines Corporation Method of answering questions and scoring answers using structured knowledge mined from a corpus of data
CN111881374A (en) * 2012-12-12 2020-11-03 谷歌有限责任公司 Providing search results based on combined queries
US20140207776A1 (en) * 2013-01-22 2014-07-24 Maluuba Inc. Method and system for linking data sources for processing composite concepts
US9171267B2 (en) * 2013-03-14 2015-10-27 Xerox Corporation System for categorizing lists of words of arbitrary origin
US20140280008A1 (en) * 2013-03-15 2014-09-18 International Business Machines Corporation Axiomatic Approach for Entity Attribution in Unstructured Data
US9262938B2 (en) 2013-03-15 2016-02-16 International Business Machines Corporation Combining different type coercion components for deferred type evaluation
US10740396B2 (en) 2013-05-24 2020-08-11 Sap Se Representing enterprise data in a knowledge graph
US9158599B2 (en) * 2013-06-27 2015-10-13 Sap Se Programming framework for applications
US9740736B2 (en) 2013-09-19 2017-08-22 Maluuba Inc. Linking ontologies to expand supported language
US9411905B1 (en) * 2013-09-26 2016-08-09 Groupon, Inc. Multi-term query subsumption for document classification
CN105706075A (en) * 2013-10-30 2016-06-22 慧与发展有限责任合伙企业 Technology recommendation for software environment
CN105095182B (en) * 2014-05-22 2018-11-06 华为技术有限公司 A kind of return information recommendation method and device
US9535910B2 (en) * 2014-05-31 2017-01-03 International Business Machines Corporation Corpus generation based upon document attributes
US9720977B2 (en) 2014-06-10 2017-08-01 International Business Machines Corporation Weighting search criteria based on similarities to an ingested corpus in a question and answer (QA) system
US9922350B2 (en) * 2014-07-16 2018-03-20 Software Ag Dynamically adaptable real-time customer experience manager and/or associated method
US9619513B2 (en) * 2014-07-29 2017-04-11 International Business Machines Corporation Changed answer notification in a question and answer system
US10380687B2 (en) 2014-08-12 2019-08-13 Software Ag Trade surveillance and monitoring systems and/or methods
US11809501B2 (en) * 2014-08-28 2023-11-07 Ebay Inc. Systems, apparatuses, and methods for providing a ranking based recommendation
JP6277921B2 (en) * 2014-09-25 2018-02-14 京セラドキュメントソリューションズ株式会社 Glossary management device and glossary management program
US9449218B2 (en) 2014-10-16 2016-09-20 Software Ag Usa, Inc. Large venue surveillance and reaction systems and methods using dynamically analyzed emotional input
US9400956B2 (en) * 2014-11-05 2016-07-26 International Business Machines Corporation Answer interactions in a question-answering environment
US10176228B2 (en) * 2014-12-10 2019-01-08 International Business Machines Corporation Identification and evaluation of lexical answer type conditions in a question to generate correct answers
CN106033466A (en) * 2015-03-20 2016-10-19 华为技术有限公司 Database query method and device
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s
US20160321285A1 (en) * 2015-05-02 2016-11-03 Mohammad Faraz RASHID Method for organizing and distributing data
US10169326B2 (en) 2015-05-22 2019-01-01 International Business Machines Corporation Cognitive reminder notification mechanisms for answers to questions
US9912736B2 (en) 2015-05-22 2018-03-06 International Business Machines Corporation Cognitive reminder notification based on personal user profile and activity information
US10152534B2 (en) 2015-07-02 2018-12-11 International Business Machines Corporation Monitoring a corpus for changes to previously provided answers to questions
US10628413B2 (en) * 2015-08-03 2020-04-21 International Business Machines Corporation Mapping questions to complex database lookups using synthetic events
US10628521B2 (en) * 2015-08-03 2020-04-21 International Business Machines Corporation Scoring automatically generated language patterns for questions using synthetic events
CN105224630B (en) * 2015-09-24 2019-01-29 中国科学院自动化研究所 Integrated approach based on Ontology on Semantic Web data
US10769185B2 (en) 2015-10-16 2020-09-08 International Business Machines Corporation Answer change notifications based on changes to user profile information
EP3467678A4 (en) * 2016-05-30 2019-05-29 Sony Corporation Information processing device
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US20180232443A1 (en) * 2017-02-16 2018-08-16 Globality, Inc. Intelligent matching system with ontology-aided relation extraction
US10372824B2 (en) 2017-05-15 2019-08-06 International Business Machines Corporation Disambiguating concepts in natural language
US11074250B2 (en) * 2017-06-27 2021-07-27 OWOX Limted Technologies for implementing ontological models for natural language queries
US10831989B2 (en) 2018-12-04 2020-11-10 International Business Machines Corporation Distributing updated communications to viewers of prior versions of the communications
US10909180B2 (en) 2019-01-11 2021-02-02 International Business Machines Corporation Dynamic query processing and document retrieval
US10949613B2 (en) 2019-01-11 2021-03-16 International Business Machines Corporation Dynamic natural language processing
US11132390B2 (en) * 2019-01-15 2021-09-28 International Business Machines Corporation Efficient resolution of type-coercion queries in a question answer system using disjunctive sub-lexical answer types
US10915561B2 (en) 2019-01-28 2021-02-09 International Business Machines Corporation Implementing unstructured content utilization from structured sources in system for answering questions
KR102169143B1 (en) * 2019-04-10 2020-10-23 인천대학교 산학협력단 Apparatus for filtering url of harmful content web pages
US11403355B2 (en) 2019-08-20 2022-08-02 Ai Software, LLC Ingestion and retrieval of dynamic source documents in an automated question answering system
US11556758B2 (en) 2019-08-27 2023-01-17 International Business Machines Corporation Learning approximate translations of unfamiliar measurement units during deep question answering system training and usage
US11475339B2 (en) 2019-08-30 2022-10-18 International Business Machines Corporation Learning unfamiliar measurement units in a deep question answering system
KR102324196B1 (en) * 2019-09-18 2021-11-11 주식회사 솔트룩스 System and method for consolidating knowledge base
US11188546B2 (en) * 2019-09-24 2021-11-30 International Business Machines Corporation Pseudo real time communication system
CN112836023A (en) * 2019-11-22 2021-05-25 华为技术有限公司 Question-answering method and device based on knowledge graph
JP2024521668A (en) * 2021-05-17 2024-06-04 セールスフォース インコーポレイテッド SYSTEM AND METHOD FOR SENSE-BASED CLASSIC HIERARCHICAL SEARCH IN DEEP LEARNING - Patent application
CN113326361B (en) * 2021-05-25 2023-03-21 武汉理工大学 Knowledge question-answering method and system based on automobile industry map and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060026128A1 (en) * 2004-06-29 2006-02-02 Xerox Corporation Expanding a partially-correct list of category elements using an indexed document collection
US20080010262A1 (en) * 2006-06-12 2008-01-10 Metacarta, Inc. System and methods for providing statstically interesting geographical information based on queries to a geographic search engine
US20090094267A1 (en) * 2007-10-04 2009-04-09 Muguda Naveenkumar V System and Method for Implementing Metadata Extraction of Artifacts from Associated Collaborative Discussions on a Data Processing System
US7720836B2 (en) * 2000-11-21 2010-05-18 Aol Inc. Internet streaming media workflow architecture
US20100274790A1 (en) * 2009-04-22 2010-10-28 Palo Alto Research Center Incorporated System And Method For Implicit Tagging Of Documents Using Search Query Data
US7890521B1 (en) * 2007-02-07 2011-02-15 Google Inc. Document-based synonym generation
US8005850B2 (en) * 2004-03-15 2011-08-23 Yahoo! Inc. Search systems and methods with integration of user annotations
US8122022B1 (en) * 2007-08-10 2012-02-21 Google Inc. Abbreviation detection for common synonym generation
US8176062B2 (en) * 2008-04-28 2012-05-08 American Express Travel Related Services Company, Inc. Service provider framework
US8209321B2 (en) * 2007-08-31 2012-06-26 Microsoft Corporation Emphasizing search results according to conceptual meaning

Family Cites Families (79)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3559995A (en) 1968-04-29 1971-02-02 Psychomantic Game Co Question answering gameboard and spinner
JPS5853787B2 (en) 1979-08-30 1983-12-01 シャープ株式会社 electronic dictionary
JPS58201175A (en) 1982-05-20 1983-11-22 Kokusai Denshin Denwa Co Ltd <Kdd> Machine translation system
US4829423A (en) 1983-01-28 1989-05-09 Texas Instruments Incorporated Menu-based natural language understanding system
US5036472A (en) 1988-12-08 1991-07-30 Hallmark Cards, Inc. Computer controlled machine for vending personalized products or the like
US4921427A (en) 1989-08-21 1990-05-01 Dunn Jeffery W Educational device
US5546316A (en) 1990-10-22 1996-08-13 Hallmark Cards, Incorporated Computer controlled system for vending personalized products
US5559714A (en) 1990-10-22 1996-09-24 Hallmark Cards, Incorporated Method and apparatus for display sequencing personalized social occasion products
JP2804403B2 (en) 1991-05-16 1998-09-24 インターナショナル・ビジネス・マシーンズ・コーポレイション Question answering system
US5374894A (en) 1992-08-19 1994-12-20 Hyundai Electronics America Transition detection circuit
CA2175187A1 (en) 1993-10-28 1995-05-04 William K. Thomson Database search summary with user determined characteristics
US5550746A (en) 1994-12-05 1996-08-27 American Greetings Corporation Method and apparatus for storing and selectively retrieving product data by correlating customer selection criteria with optimum product designs based on embedded expert judgments
US5794050A (en) 1995-01-04 1998-08-11 Intelligent Text Processing, Inc. Natural language understanding system
US6061675A (en) 1995-05-31 2000-05-09 Oracle Corporation Methods and apparatus for classifying terminology utilizing a knowledge catalog
US7181438B1 (en) 1999-07-21 2007-02-20 Alberti Anemometer, Llc Database access system
US6947885B2 (en) 2000-01-18 2005-09-20 At&T Corp. Probabilistic model for natural language generation
US6829603B1 (en) 2000-02-02 2004-12-07 International Business Machines Corp. System, method and program product for interactive natural dialog
JP2001297259A (en) 2000-04-13 2001-10-26 Fujitsu Ltd Question answering system
US6981028B1 (en) * 2000-04-28 2005-12-27 Obongo, Inc. Method and system of implementing recorded data for automating internet interactions
CN1447943A (en) * 2000-06-22 2003-10-08 亚隆·梅耶 System and method for searching, finding and contacting dates on internet in instant messaging networks
US8396859B2 (en) 2000-06-26 2013-03-12 Oracle International Corporation Subject matter context search engine
JP2002041540A (en) 2000-07-28 2002-02-08 Shinichiro Okude Retrieval system with associating and inferring function and recording medium money contribution used for the same
US7092928B1 (en) 2000-07-31 2006-08-15 Quantum Leap Research, Inc. Intelligent portal engine
EP1490790A2 (en) 2001-03-13 2004-12-29 Intelligate Ltd. Dynamic natural language understanding
ATE410728T1 (en) 2001-05-04 2008-10-15 Microsoft Corp INTERFACE CONTROL
US6732090B2 (en) 2001-08-13 2004-05-04 Xerox Corporation Meta-document management system with user definable personalities
US7136909B2 (en) 2001-12-28 2006-11-14 Motorola, Inc. Multimodal communication method and apparatus with multimodal profile
JP2004139553A (en) 2002-08-19 2004-05-13 Matsushita Electric Ind Co Ltd Document retrieval system and question answering system
JP2004118740A (en) 2002-09-27 2004-04-15 Toshiba Corp Question answering system, question answering method and question answering program
US20040122660A1 (en) 2002-12-20 2004-06-24 International Business Machines Corporation Creating taxonomies and training data in multiple languages
US7139752B2 (en) 2003-05-30 2006-11-21 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US8666983B2 (en) 2003-06-13 2014-03-04 Microsoft Corporation Architecture for generating responses to search engine queries
US7454393B2 (en) 2003-08-06 2008-11-18 Microsoft Corporation Cost-benefit approach to automatically composing answers to questions by extracting information from large unstructured corpora
JP2005092271A (en) 2003-09-12 2005-04-07 Hitachi Ltd Question-answering method and question-answering device
KR100533810B1 (en) 2003-10-16 2005-12-07 한국전자통신연구원 Semi-Automatic Construction Method for Knowledge of Encyclopedia Question Answering System
JP3882048B2 (en) 2003-10-17 2007-02-14 独立行政法人情報通信研究機構 Question answering system and question answering processing method
JP3820242B2 (en) * 2003-10-24 2006-09-13 東芝ソリューション株式会社 Question answer type document search system and question answer type document search program
US7590606B1 (en) 2003-11-05 2009-09-15 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration (Nasa) Multi-user investigation organizer
JP3981734B2 (en) 2003-11-21 2007-09-26 独立行政法人情報通信研究機構 Question answering system and question answering processing method
JP3944159B2 (en) 2003-12-25 2007-07-11 株式会社東芝 Question answering system and program
US20050256700A1 (en) * 2004-05-11 2005-11-17 Moldovan Dan I Natural language question answering system and method utilizing a logic prover
US20060053000A1 (en) 2004-05-11 2006-03-09 Moldovan Dan I Natural language question answering system and method utilizing multi-modal logic
US20080077570A1 (en) 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
US20060106788A1 (en) 2004-10-29 2006-05-18 Microsoft Corporation Computer-implemented system and method for providing authoritative answers to a general information search
US20060122834A1 (en) 2004-12-03 2006-06-08 Bennett Ian M Emotion detection device & method for use in distributed systems
US20060141438A1 (en) 2004-12-23 2006-06-29 Inventec Corporation Remote instruction system and method
US7792829B2 (en) 2005-01-28 2010-09-07 Microsoft Corporation Table querying
JP2006252382A (en) 2005-03-14 2006-09-21 Fuji Xerox Co Ltd Question answering system, data retrieval method and computer program
JP4635659B2 (en) 2005-03-14 2011-02-23 富士ゼロックス株式会社 Question answering system, data retrieval method, and computer program
JP4645242B2 (en) 2005-03-14 2011-03-09 富士ゼロックス株式会社 Question answering system, data retrieval method, and computer program
JP4650072B2 (en) 2005-04-12 2011-03-16 富士ゼロックス株式会社 Question answering system, data retrieval method, and computer program
JP4654745B2 (en) * 2005-04-13 2011-03-23 富士ゼロックス株式会社 Question answering system, data retrieval method, and computer program
EP1889179A2 (en) * 2005-05-27 2008-02-20 Hakia, Inc. System and method for natural language processing and using ontological searches
JP4654776B2 (en) 2005-06-03 2011-03-23 富士ゼロックス株式会社 Question answering system, data retrieval method, and computer program
BRPI0613796A2 (en) 2005-06-07 2011-02-15 Univ Yale pharmaceutical compositions and their uses, and methods of treating cancer and other pathological conditions or conditions by the use of clevudine (lfmau) and telbivudine (ldt)
JP4654780B2 (en) 2005-06-10 2011-03-23 富士ゼロックス株式会社 Question answering system, data retrieval method, and computer program
US8756245B2 (en) 2005-07-25 2014-06-17 Iac Search & Media, Inc. Systems and methods for answering user questions
US20070073533A1 (en) 2005-09-23 2007-03-29 Fuji Xerox Co., Ltd. Systems and methods for structural indexing of natural language text
US20070078842A1 (en) 2005-09-30 2007-04-05 Zola Scot G System and method for responding to a user reference query
US7873624B2 (en) 2005-10-21 2011-01-18 Microsoft Corporation Question answering over structured content on the web
US7831597B2 (en) 2005-11-18 2010-11-09 The Boeing Company Text summarization method and apparatus using a multidimensional subspace
CN101305366B (en) 2005-11-29 2013-02-06 国际商业机器公司 Method and system for extracting and visualizing graph-structured relations from unstructured text
US8832064B2 (en) 2005-11-30 2014-09-09 At&T Intellectual Property Ii, L.P. Answer determination for natural language questioning
US7603330B2 (en) 2006-02-01 2009-10-13 Honda Motor Co., Ltd. Meta learning for question classification
JP2007219955A (en) 2006-02-17 2007-08-30 Fuji Xerox Co Ltd Question and answer system, question answering processing method and question answering program
WO2007149216A2 (en) 2006-06-21 2007-12-27 Information Extraction Systems An apparatus, system and method for developing tools to process natural language text
US7890499B1 (en) * 2006-07-28 2011-02-15 Google Inc. Presentation of search results with common subject matters
US20080071714A1 (en) * 2006-08-21 2008-03-20 Motorola, Inc. Method and apparatus for controlling autonomic computing system processes using knowledge-based reasoning mechanisms
US8145677B2 (en) 2007-03-27 2012-03-27 Faleh Jassem Al-Shameri Automated generation of metadata for mining image and text data
US7702695B2 (en) 2007-06-27 2010-04-20 Microsoft Corporation Object relational map verification system
US8229881B2 (en) * 2007-07-16 2012-07-24 Siemens Medical Solutions Usa, Inc. System and method for creating and searching medical ontologies
US20100100546A1 (en) 2008-02-08 2010-04-22 Steven Forrest Kohler Context-aware semantic virtual community for communication, information and knowledge management
US8326795B2 (en) 2008-02-26 2012-12-04 Sap Ag Enhanced process query framework
US7966316B2 (en) 2008-04-15 2011-06-21 Microsoft Corporation Question type-sensitive answer summarization
US8332394B2 (en) * 2008-05-23 2012-12-11 International Business Machines Corporation System and method for providing question and answers with deferred type evaluation
US8275803B2 (en) * 2008-05-14 2012-09-25 International Business Machines Corporation System and method for providing answers to questions
US8195672B2 (en) * 2009-01-14 2012-06-05 Xerox Corporation Searching a repository of documents using a source image as a query
US8825640B2 (en) * 2009-03-16 2014-09-02 At&T Intellectual Property I, L.P. Methods and apparatus for ranking uncertain data in a probabilistic database
US8280838B2 (en) 2009-09-17 2012-10-02 International Business Machines Corporation Evidence evaluation system and method based on question answering

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9245006B2 (en) * 2011-09-29 2016-01-26 Sap Se Data search using context information
US20130086101A1 (en) * 2011-09-29 2013-04-04 Sap Ag Data Search Using Context Information
US20150112683A1 (en) * 2012-03-13 2015-04-23 Mitsubishi Electric Corporation Document search device and document search method
US9092989B2 (en) 2012-11-16 2015-07-28 International Business Machines Corporation Multi-dimensional feature merging for open domain question answering
US9092988B2 (en) 2012-11-16 2015-07-28 International Business Machines Corporation Multi-dimensional feature merging for open domain question answering
US10002126B2 (en) 2013-03-15 2018-06-19 International Business Machines Corporation Business intelligence data models with concept identification using language-specific clues
US20140278364A1 (en) * 2013-03-15 2014-09-18 International Business Machines Corporation Business intelligence data models with concept identification using language-specific clues
US10157175B2 (en) * 2013-03-15 2018-12-18 International Business Machines Corporation Business intelligence data models with concept identification using language-specific clues
US9558187B2 (en) * 2013-06-27 2017-01-31 International Business Machines Corporation Enhanced document input parsing
US9418066B2 (en) 2013-06-27 2016-08-16 International Business Machines Corporation Enhanced document input parsing
US10437890B2 (en) 2013-06-27 2019-10-08 International Business Machines Corporation Enhanced document input parsing
US20150006449A1 (en) * 2013-06-27 2015-01-01 International Business Machines Corporation Enhanced Document Input Parsing
US10430469B2 (en) 2013-06-27 2019-10-01 International Business Machines Corporation Enhanced document input parsing
US10698924B2 (en) 2014-05-22 2020-06-30 International Business Machines Corporation Generating partitioned hierarchical groups based on data sets for business intelligence data models
US9727637B2 (en) * 2014-08-19 2017-08-08 International Business Machines Corporation Retrieving text from a corpus of documents in an information handling system
US20160055234A1 (en) * 2014-08-19 2016-02-25 International Business Machines Corporation Retrieving Text from a Corpus of Documents in an Information Handling System
US10430445B2 (en) * 2014-09-12 2019-10-01 Nuance Communications, Inc. Text indexing and passage retrieval
US20160078102A1 (en) * 2014-09-12 2016-03-17 Nuance Communications, Inc. Text indexing and passage retrieval
US9613133B2 (en) 2014-11-07 2017-04-04 International Business Machines Corporation Context based passage retrieval and scoring in a question answering system
US10331782B2 (en) 2014-11-19 2019-06-25 Lexisnexis, A Division Of Reed Elsevier Inc. Systems and methods for automatic identification of potential material facts in documents
US10019507B2 (en) 2015-01-30 2018-07-10 International Business Machines Corporation Detection and creation of appropriate row concept during automated model generation
US10002179B2 (en) 2015-01-30 2018-06-19 International Business Machines Corporation Detection and creation of appropriate row concept during automated model generation
US10891314B2 (en) 2015-01-30 2021-01-12 International Business Machines Corporation Detection and creation of appropriate row concept during automated model generation
US10606651B2 (en) 2015-04-17 2020-03-31 Microsoft Technology Licensing, Llc Free form expression accelerator with thread length-based thread assignment to clustered soft processor cores that share a functional circuit
US10540588B2 (en) 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
US9984116B2 (en) 2015-08-28 2018-05-29 International Business Machines Corporation Automated management of natural language queries in enterprise business intelligence analytics
WO2020172155A1 (en) * 2019-02-18 2020-08-27 David Nahamoo Intelligent document system
US11797583B2 (en) 2019-02-18 2023-10-24 Pryon Incorporated Intelligent document system
CN110532229A (en) * 2019-06-14 2019-12-03 平安科技(深圳)有限公司 Instrument of evidence search method, device, computer equipment and storage medium
WO2022204435A3 (en) * 2021-03-24 2022-11-24 Trust & Safety Laboratory Inc. Multi-platform detection and mitigation of contentious online content
CN113204621A (en) * 2021-05-12 2021-08-03 北京百度网讯科技有限公司 Document storage method, document retrieval method, device, equipment and storage medium
CN113448984A (en) * 2021-07-15 2021-09-28 中国银行股份有限公司 Document positioning display method and device, server and electronic equipment
CN113505147A (en) * 2021-07-27 2021-10-15 中国工商银行股份有限公司 Data processing method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
US20120078873A1 (en) 2012-03-29
EP2616927A4 (en) 2017-02-22
US9569724B2 (en) 2017-02-14
WO2012040676A1 (en) 2012-03-29
US9508038B2 (en) 2016-11-29
WO2012040677A1 (en) 2012-03-29
EP2616927A1 (en) 2013-07-24
US20120330921A1 (en) 2012-12-27
CN103221915A (en) 2013-07-24
US20120331003A1 (en) 2012-12-27
CN103221915B (en) 2017-02-08

Similar Documents

Publication Publication Date Title
US20120331003A1 (en) Efficient passage retrieval using document metadata
US11144544B2 (en) Providing answers to questions including assembling answers from multiple document segments
US11409751B2 (en) Providing answers to questions using hypothesis pruning
US10823265B2 (en) Providing answers to questions using multiple models to score candidate answers
US9830381B2 (en) Scoring candidates using structural information in semi-structured documents for question answering systems
US9703861B2 (en) System and method for providing answers to questions
US8332394B2 (en) System and method for providing question and answers with deferred type evaluation
US20110078192A1 (en) Inferring lexical answer types of questions from context
CA2812338A1 (en) Lexical answer type confidence estimation and application
JP2023507286A (en) Automatic creation of schema annotation files for converting natural language queries to structured query language
Kiran et al. An approach towards establishing reference linking in desktop reference manager
Fogarolli Wikipedia as a source of ontological knowledge: state of the art and application
Samih et al. Improving Natural Language Queries Search and Retrieval through Semantic Image Annotation Understanding
Ang et al. Ontology-centric, Service-Oriented Enterprise Campaign Management System
Aly et al. XML information retrieval from spoken word archives
Amato et al. A Semantic Search Engine in the Cloud

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHU-CARROLL, JENNIFER;FERRUCCI, DAVID A.;SIGNING DATES FROM 20111006 TO 20111007;REEL/FRAME:027375/0048

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION