WO2016048996A1 - Semantic text search - Google Patents

Semantic text search

Info

Publication number
WO2016048996A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
matching
search query
query
search
Prior art date
Application number
PCT/US2015/051403
Other languages
French (fr)
Inventor
Vladimir Zelevinsky
Yevgeniy Dashevsky
Diana Ye
Original Assignee
Oracle International Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corporation
Priority to CN201580049293.5A (CN106716408B)
Priority to EP15844520.5A (EP3198490A4)
Priority to JP2017515135A (JP6588089B2)
Publication of WO2016048996A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • One embodiment is directed generally to a computer system, and in particular to a computer system that provides searching of a text corpus.
  • One embodiment is a system for performing semantic search.
  • the system receives an electronic text corpus and separates the text corpus into a plurality of sentences.
  • the system parses and converts each sentence into a sentence tree.
  • the system receives a search query and matches the search query with one or more of the sentence trees.
  • FIG. 1 is a block diagram of a computer server/system in accordance with an embodiment of the present invention.
  • Fig. 2 is a high-level flow diagram of the functionality of a semantic text search module and other elements of Fig. 1 in accordance with one embodiment of the present invention.
  • FIG. 3 illustrates an example tree formed by parsing the sentence: "The car did not deploy front airbags during the accident" in accordance with one embodiment.
  • FIGs. 4A, 4B and 4C illustrate a screenshot showing parsed sentences and hypernym matching in accordance with embodiments of the invention.
  • FIGs. 5A and 5B illustrate screenshots of a semantic search user interface in accordance with embodiments of the invention.
  • FIG. 6 illustrates screenshots of a single term modification via a user interface in accordance with one embodiment.
  • Figs. 7A and 7B are example user interfaces that illustrate refinements and a summary of the result set in accordance with one embodiment.
  • the embodiments described herein give improved technical solutions to the problem of searching the massive volumes of electronic documents for useful information.
  • the purpose of the examples given is only to illustrate the embodiments of the invention.
  • the actual use cases for the embodiments of the invention include the searching of a text corpus of arbitrary size, possibly including millions or more electronic documents (e.g., emails, articles, books, web pages, tweets, etc.), where the sheer number of words makes any searching for information manually impractical or near impossible, while the precision/recall tradeoffs inherent in keyword search make the approach useless for the cases where high recall or high precision are required.
  • One embodiment is a system that performs semantic text search by converting each sentence of a text corpus into a tree.
  • a search query is then also converted into a tree and/or interpreted as a tree, and the search tree is matched with one or more of the text corpus trees.
  • a response of documents that correspond to the search query is generated.
  • refinements of related queries can be generated.
  • the use of tree matching provides for a semantic-based search.
  • Another embodiment of the present invention can locate the entities of interest, such as brand or product names, and perform high-precision sentiment extraction based on the other terms that modify such entities of interest.
  • embodiments find text based not only on the words, but also on the way the words act upon and modify each other.
  • Other embodiments apply additional knowledge bases by enriching the text with additional information such as synonyms.
  • Embodiments augment text with additional semantic knowledge to allow high recall during retrieval, and utilize as much underlying structure of the text as possible, to obtain high precision.
  • Fig. 1 is a block diagram of a computer server/system 10 in accordance with an embodiment of the present invention. Although shown as a single system, the functionality of system 10 can be implemented as a distributed system. Further, the functionality disclosed herein can be implemented on separate servers or devices that may be coupled together over a network. Further, one or more components of system 10 may not be included. For example, for the functionality of a server that performs semantic text search, system 10 may not include peripheral devices such as keyboard 26 and cursor control 28.
  • System 10 includes a bus 12 or other communication mechanism for communicating information, and a processor 22 coupled to bus 12 for processing information.
  • Processor 22 may be any type of general or specific purpose processor.
  • System 10 further includes a memory 14 for storing information and instructions to be executed by processor 22.
  • Memory 14 can be comprised of any combination of random access memory (“RAM”), read only memory (“ROM”), static storage such as a magnetic or optical disk, or any other type of computer readable media.
  • System 10 further includes a communication device 20, such as a network interface card, to provide access to a network. Therefore, a user may interface with system 10 directly or remotely through a network, or any other method.
  • Computer readable media may be any available media that can be accessed by processor 22 and includes both volatile and nonvolatile media, removable and non-removable media, and communication media.
  • Communication media may include computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • Processor 22 is further coupled via bus 12 to a display 24, such as a Liquid Crystal Display ("LCD").
  • a keyboard 26 and a cursor control device 28, such as a computer mouse, are further coupled to bus 12 to enable a user to interface with system 10.
  • memory 14 stores software modules that provide functionality when executed by processor 22.
  • the modules include an operating system 15 that provides operating system functionality for system 10.
  • the modules further include semantic text search module 16 for providing semantic text search, and all other functionality disclosed herein.
  • System 10 can be part of a larger system.
  • system 10 can include one or more additional functional modules 18 to include the additional functionality.
  • a database 17 is coupled to bus 12 to provide centralized storage for modules 16 and 18 and store a text corpus, trees, etc.
  • first server or servers which locate and download electronic documents from either the Internet or Intranet or any combination thereof. These documents are then stored in a database (e.g., Structured Query Language (“SQL”) or Not only SQL (“NoSQL”), or any combination thereof).
  • a second server or servers have semantic text search software that when executed by the second server processors using the documents stored in the database performs the
  • the search query, in one embodiment, is received at 210 via a Graphical User Interface ("GUI") that is displayed on a Personal Computer ("PC"), mobile phone or other mobile device.
  • Fig. 2 is a high-level flow diagram of the functionality of semantic text search module 16 and other elements of Fig. 1 in accordance with one embodiment of the present invention.
  • an electronic document(s) is any information recorded in a manner that requires a computer or other electronic device to display, interpret, and process it. This includes documents generated by software and stored on volatile and/or non-volatile storage. Some examples include articles, electronic mail, web pages, tweets, unstructured text records, or any
  • the electronic document includes some electronic parsable text.
  • a text corpus is understood as a group of one or more electronic documents. Examples of a text corpus include the entire Internet, an electronic library, or a document repository.
  • a text corpus is received.
  • the text corpus can be stored on database 17 of Fig. 1, or any remote or local volatile or non-volatile memory.
  • the text corpus is separated into sentences.
  • each sentence is parsed and converted into a tree (i.e., a "sentence tree").
  • the sentence parse can be the grammatical parse or a sentence diagram.
  • different embodiments can use various available computer-implemented natural language parsers, including, but not limited to, "The Stanford Parser: A statistical parser", "ClearNLP", and others.
  • Each tree is formed of nodes corresponding to each term in the sentence, connected by edges.
  • the edges provide the grammatical relations of the connected nodes. For example, an edge can indicate that one term is a modifier of another term connected by the edge.
  • Fig. 3 illustrates an example tree formed by parsing the sentence: "The car did not deploy front airbags during the accident” in accordance with one embodiment.
  • the nodes include the words in the sentence: “deploy”, “car”, “did”, “not”, “airbags”, “during”, “the”, “front”, “accident”, and “the”.
  • the edges include the
  • the parse tree can include nodes which include the words in the sentence and each node can include a type (i.e., the syntactic function (e.g., subject, verb, object)), and the edges can include dependencies between nodes (i.e., all or part of the structure of the parse tree).
  • the edges may optionally include a type (e.g., direct object or indirect object).
  • the sentences "Joe passed the salt shaker.” and “Joe passed the salt shaker to his father.” have “salt shaker” as a direct object in both sentences and "his father” as an indirect object in the second sentence.
  • the first parse tree of the first sentence may have an edge with type: direct object, while the second parse tree of the second sentence may also include an edge with type: indirect object.
  • trees from 206 are optionally modified. For example, trees can be split, trimmed, augmented with additional edges, edge types collapsed, etc. Article nodes can be dropped. Different embodiments can define "term" in different ways: one embodiment can interpret each word as a separate term. Other
  • one or more nodes can be expanded to include synonyms, hypernyms, hyponyms, and/or other related words or phrases. These related words or phrases can be organized inside the node as a flat list or as a more complicated structure, such as a tree.
  • edges can also be collapsed or expanded. For example, in the sentence "Joe passed the salt shaker to his father.", the indirect object edge ("his father") and the direct object edge ("salt shaker") can both be converted into a generic "object" type edge.
  • Different embodiments can expand the edge types, for example by both retaining the original edge type (direct object, indirect object) and adding the broader type of a generic "object” type on top of that.
  • a search query is received.
  • the query can consist of a single term, several terms, or be in the form of a complete sentence.
  • the query is optionally interpreted as a tree and/or is converted into a tree (i.e., a "query tree").
  • the query is interpreted as a tree even if it contains only a single term (i.e., a one node tree).
  • refinements are generated/suggested to the user to add more related terms, and the refinements lead to the creation of a tree.
  • the query can be converted into a tree (e.g., an automatic conversion into a tree by parsing), or through a mechanism that only allows a user to construct a query of a connected tree that is built through suggested refinements.
  • the tree generated in response to the query is matched with one or more sentence trees generated from the text corpus at 206.
  • the matching determines if the query tree is a subtree of any of the sentence trees. Tree matching can be performed either strictly (i.e., the match is considered positive if the exactly matching set of nodes is connected in exactly the same way by the exactly matching set of edges) or approximately (i.e., the nodes can be the same but the edges can differ, or the same set of edges can connect only a subset of the query nodes, etc.).
  • a response of corresponding matching documents is generated. Specifically, the documents that include the sentences that correspond to the matching trees are selected as matching documents.
  • refinements of related queries are generated based on the matches to build the query tree into a bigger tree.
  • users can refine their search based on entities actually acting upon each other. For example, after the user searches for "car”, refinement searches can be offered such as “drove car”, “crashed car” and "car accident".
  • the "car accident” query would only return the documents where the accident was a car accident. Therefore, it would not return the document that included "we saw a train accident from our car”.
  • Different embodiments can suggest broadening refinements (where the suggested query is a subtree of the current query) or lateral refinements (where a term or an edge is replaced with another term or edge).
  • the refinement process could be iterated for as many steps as desired.
  • the query "drove car” would return future refinements such as "John drove car”.
  • This latter query differs from regular text search, because the semantic search would not match documents that include "Peter drove the car, while John stayed home". It also differs from phrase search: a phrase search for "John drove car" would not match the text "John, my neighbor, drove my car", while semantic search in accordance with embodiments would match it correctly because it uses the meaning encoded in the sentence structure.
  • refinements can be grouped by the edge type. For example, if the current query is "drove car”, refinements can be grouped by subjects ("John drove car”, “Peter drove car”); adjectives ("drove old car”, “drove new car”); adverbs ("drove car carelessly”, “drove car carefully”); etc.
  • sentiment extraction is performed in some embodiments. For example, if the search term is a company called "Acme", all sentiment-bearing modifiers, such as "terrible", "great", etc., that are grammatically linked to the search term are retrieved. Sentences that merely contain modifiers that do not modify the target search term will not be counted (e.g., "Used Acme product on my terrible ride home" will not return negative sentiment for the term "Acme").
  • other sources are used to enrich the trees generated at 206.
  • the other sources can be external taxonomies that provide tree-like structures, such as "WordNet®" from
  • a "hyponym” is a word or phrase whose semantic field is included within that of another word, its "hypernym.”
  • a hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle, and seagull are all hyponyms of bird (their hypernym), which, in turn, is a hyponym of animal.
  • anaphora resolution is used to generate a semantic search result. For example, if the text corpus includes the text "John drove the car. He crashed it.”, embodiments can infer who crashed what, and return the second sentence to the queries "John crashed” and "crashed car”.
  • Fig. 4 illustrates a screenshot from a GUI showing parsed sentences and hypernym matching in accordance with embodiments of the invention.
  • Fig. 4a illustrates the left side of the screenshot
  • Fig. 4b illustrates the right side of the screenshot
  • Fig. 4c illustrates an enlarged unstructured text example.
  • the sentences are parsed into trees with nodes for actor 401, action 402 and object 403.
  • this query causes the retrieval of all parse trees where the Actor node matches the term "car” with the hypernym "vehicle”, and the Object node matches the term "problem" with the hypernym "difficulty”.
  • Some embodiments only use selected nodes from the hypernym/hyponym trees, skipping the others. For example, Princeton's WordNet v3.1 contains the following hypernym trees:
  • S: (n) motor vehicle, automotive vehicle (a self-propelled wheeled vehicle that does not run on rails)
  • S: (n) self-propelled vehicle (a wheeled vehicle that carries in itself a means of propulsion)
  • S: (n) wheeled vehicle (a vehicle that moves on wheels and usually has a container for transporting things or people) "the oldest known wheeled vehicles were found in Sumer and Iran and date from around 3500 BC"
  • S: (n) vehicle (a conveyance that transports people or objects)
  • S: (n) problem, job (a state of difficulty that needs to be resolved) "she and her husband are having problems"; "it is always a job to contact him"; "urban problems such as traffic congestion and smog"
  • Certain embodiments can skip the "motor vehicle”, “self-propelled vehicle”, and “wheeled vehicle” hypernyms, directly connecting the term “car” to the hypernym "vehicle”.
  • every matching actor car, truck, motorcycle, etc.
  • the word “vehicle” does not need to appear in the text of the document for it to be matched.
  • a query for "vehicle” would return all instances of "vehicle", whether they are actors, objects, or possibly other nodes of parsed trees.
  • the users are not required to explicitly specify they are interested in hypernyms.
  • the query "vehicle” would return both the documents where the verbatim text mentions "vehicle”, as well as those documents where the entities (car, truck, etc.) have the term "vehicle” as a hypernym.
  • Some embodiments allow complicated query construction, where the query specifies not only nodes but also edges.
  • the user can specify the query that retrieves all nodes that match the terms "allergy” or "allergic”, refine the resulting trees by the name of a particular medicine, and then further refine the resulting trees by the absence of a negative edge.
  • Such a query would retrieve only the records of the patients who are allergic to the particular medicine.
  • a user is allowed to modify semantic search queries on a term-by-term basis.
  • Fig. 6 illustrates screenshots of a single term modification via a user interface 600 in accordance with one embodiment.
  • the specific verbatim form is shown at 601
  • the hyponym/hypernym options are shown at 602. Unavailable refinements may be grayed out.
  • hierarchical term modification can be used for multiple hypernyms such as: device > computer > apple
  • Such hierarchies also allow a user to perform sense disambiguation.
  • breadcrumbs can be used for other enriched dimensions (e.g., geographical location, semantic role, etc.) such as:
  • the query can be modified by exposing noun phrase modifiers as dynamic dimensions such as follows:
  • Such modifiers can also be included into the query modification UI 600, as shown at 603 of Fig. 6.
  • Tag clouds typically provide a set of frequent terms, given the query. However, in embodiments that use semantics, tag clouds can provide a set of frequent subtrees, given the query.
  • Embodiments bridge the gap between semantics and text (in other words, the meaning and its representation) by shifting the focus of search from tokens to the graph representation of the text, namely terms/entities as nodes and connections between them as edges.
  • a prior art "bag of words” model does not distinguish between searches such as [theory group] and [group theory], and maps all senses of polysemic terms such as "Java” to the same token.
  • Embodiments use linguistic constructs known as "phrases".
  • a noun phrase can include a sequence of zero or more adjectives, followed by one or more nouns; other embodiments may take into account articles and prepositions.
  • a verb phrase is zero or more adverbs followed by one or more verbs.
  • Embodiments further use typed entity extraction, where trained extractors mark up the corpus text for entities such as People, Places, or Organizations.
  • Exposing such entities to the user via information-scent-rich interfaces allows the user to select not just the terms of interest but the terms that are used to convey a particular meaning.
  • One embodiment uses the extraction of actor/action/object triples (where action is a verb phrase, and actor and object are noun phrases).
  • the sentence "The car did not deploy front airbags during the accident" can thus be parsed into one such triple: {car/did not deploy/front airbags}.
  • Such parsing is lossy: it necessarily simplifies the sentence structure, including, in this example, the loss of the modifier "during the accident". However, it does extract a wealth of information from the sentence that is not accessible to the bag of words model.
  • RDF schema {subject/predicate/object} is a superset of the above. Such a schema supports the extraction of different types of entities and predicates.
  • One embodiment parses sentences or sentence fragments into graphs, with terms (subjects, objects, actions, modifiers, etc.) mapped to nodes in the graph, while predicates form (directed) edges.
  • the edges are mapped from the grammatical structure of the sentence, including edges that describe relationships between terms such as "clausal”, “direct”, “prepositional”, etc.
  • One taxonomy of such edges is disclosed in de Marneffe et al., "Stanford typed
  • Embodiments provide guided (or faceted) navigation that includes a user interface element that offers refining links (possibly with the count of records that would be obtained upon selection, or other indication of the record count, such as histogram bars) such as shown in Fig. 7.
  • the guided navigation interface in one embodiment has a two-fold purpose.
  • the refinements offer avenues for narrowing down the result set.
  • they offer the summary of the result set, which provides the semantic search aspect to the guided navigation user experience.
  • Figs. 7a and 7b are example interfaces that illustrate refinements and a summary of the result set in accordance with one embodiment.
  • selecting a particular term can cause an embodiment to display possible refining (more detailed) queries, by querying the database of parse trees and limiting the set of refining queries to those that contain the original term and one additional term, connected to the original one by a direct edge.
  • Such refinements both represent to the user the semantic content of the current result set and suggest the possible direction of further navigation.
  • the further refinements can be selected from the list, or the user can search for the terms of interest via the search box.
  • Fig. 7a shows one such embodiment, where for the original search term "color”, refinements such as "find color”, "favorite color”, and "color use” are suggested, along with the histogram bars indicating the frequency of each refinement.
  • Embodiments allow the same interaction model to be utilized at the very outset of the querying process, allowing the user to start by refining the entire collection of information via selecting a term or a set of terms from the database of parsed semantic trees, displayed in a manner identical to "tag clouds” or "word clouds”.
  • embodiments use the context to group tokens into entities and sentence structure to detect the way entities act on other entities. Further, embodiments use information outside of the corpus, such as augmenting each field of a parsed semantic tree with a more generic hierarchical parent value. In some
  • "WordNet®" can be used to locate a hypernym (a broadening synonym) for nodes such as nouns or verbs.
  • the head terms in English, almost always the rightmost term of a phrase
  • airbags for "front airbags”
  • Clinton for both "Hillary Clinton" and "Bill Clinton".
  • Fig. 7a illustrates the process where the user can start by searching for any term that expresses the search intent ("color”), and then proceed by selecting a refinement ("favorite color”) to narrow down the result set. The embodiment then offers further possible refinements in Fig. 7b.
  • embodiments provide semantic text search by parsing sentences into trees and providing a mechanism to match such trees with user queries.
  • the matching trees can be used to generate a response of corresponding matching documents, to generate refinements, or to provide the set of sentiment-bearing terms that powers sentiment extraction.
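
The refinement and sentiment steps summarized above can be illustrated with a minimal Python sketch. It is not the implementation of any embodiment: the tuple tree shape, the function names, the edge labels ("nsubj", "dobj", "amod", "nn", "cop", "prep_on", "poss"), the hand-written trees and the two-word sentiment lexicon are all assumptions made for this example. Articles are assumed to have been dropped from the trees.

```python
from collections import Counter, defaultdict

# Assumed tree shape for this sketch: (term, [(edge_type, subtree), ...]).


def edges(tree):
    """Yield every (head_term, edge_type, dependent_term) triple in a tree."""
    term, children = tree
    for edge_type, child in children:
        yield term, edge_type, child[0]
        yield from edges(child)


def refinements(term, trees):
    """Count two-term refinements of `term`, grouped by edge type.

    A refinement is the original term plus one additional term that is
    connected to it by a direct edge in some sentence tree.
    """
    grouped = defaultdict(Counter)
    for tree in trees:
        for head, edge_type, dep in edges(tree):
            if term in (head, dep):
                grouped[edge_type][(head, dep)] += 1
    return grouped


def sentiment_terms(entity, trees, lexicon):
    """Return lexicon terms that are grammatically linked to `entity`.

    Two assumed link patterns are checked: an adjective attached to the
    entity ("amod"), and a predicate adjective whose subject is the entity.
    """
    found = []
    for tree in trees:
        for head, edge_type, dep in edges(tree):
            if edge_type == "amod" and head == entity and dep in lexicon:
                found.append(dep)
            elif edge_type == "nsubj" and dep == entity and head in lexicon:
                found.append(head)
    return found


# "John drove the old car" / "Peter drove the new car"
drove_old = ("drove", [("nsubj", ("john", [])),
                       ("dobj", ("car", [("amod", ("old", []))]))])
drove_new = ("drove", [("nsubj", ("peter", [])),
                       ("dobj", ("car", [("amod", ("new", []))]))])
# "Acme is terrible" / "Used Acme product on my terrible ride home"
acme_bad = ("terrible", [("nsubj", ("acme", [])), ("cop", ("is", []))])
ride_home = ("used", [("dobj", ("product", [("nn", ("acme", []))])),
                      ("prep_on", ("ride", [("poss", ("my", [])),
                                            ("amod", ("terrible", []))]))])

by_edge = refinements("car", [drove_old, drove_new])
negative = sentiment_terms("acme", [acme_bad, ride_home], {"terrible", "great"})
```

Here `by_edge["dobj"]` counts "drove car" twice and `by_edge["amod"]` holds "old car" and "new car" once each, which mirrors grouping refinements by edge type. `negative` contains "terrible" only once, because in the second sentence the adjective modifies "ride" rather than "Acme".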

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A system for performing semantic search receives an electronic text corpus and separates the text corpus into a plurality of sentences. The system parses and converts each sentence into a sentence tree. The system receives a search query and matches the search query with one or more of the sentence trees.
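
The flow described in this abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the claimed system: the regular-expression sentence splitter, the tuple tree shape, the function names and the edge labels are hypothetical, and a lookup table of hand-written parses stands in for a real natural language parser.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed tree shape for this sketch: (term, [(edge_type, subtree), ...]).
Tree = Tuple[str, list]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_index(corpus: Dict[str, str],
                parse: Callable[[str], Tree]) -> List[Tuple[str, str, Tree]]:
    """Split every document into sentences and parse each one into a tree."""
    return [(doc_id, sentence, parse(sentence))
            for doc_id, text in corpus.items()
            for sentence in split_sentences(text)]


def fits(query: Tree, tree: Tree) -> bool:
    """True if the query matches with its root placed on this tree's root."""
    return query[0] == tree[0] and all(
        any(q_edge == edge and fits(q_sub, sub) for edge, sub in tree[1])
        for q_edge, q_sub in query[1])


def contains(query: Tree, tree: Tree) -> bool:
    """True if the query tree occurs anywhere inside the sentence tree."""
    return fits(query, tree) or any(contains(query, sub) for _, sub in tree[1])


def search(query: Tree, index: List[Tuple[str, str, Tree]]) -> List[str]:
    """Return the ids of documents that own at least one matching sentence."""
    hits: List[str] = []
    for doc_id, _, tree in index:
        if contains(query, tree) and doc_id not in hits:
            hits.append(doc_id)
    return hits


# Hand-written parses stand in for a real parser in this sketch.
PARSES: Dict[str, Tree] = {
    "We had a car accident.":
        ("had", [("nsubj", ("we", [])),
                 ("dobj", ("accident", [("nn", ("car", []))]))]),
    "We saw a train accident from our car.":
        ("saw", [("nsubj", ("we", [])),
                 ("dobj", ("accident", [("nn", ("train", []))])),
                 ("prep_from", ("car", []))]),
    "Nobody was hurt.":
        ("hurt", [("nsubjpass", ("nobody", []))]),
}

corpus = {"doc1": "We had a car accident. Nobody was hurt.",
          "doc2": "We saw a train accident from our car."}
index = build_index(corpus, PARSES.__getitem__)
car_accident = ("accident", [("nn", ("car", []))])
print(search(car_accident, index))  # prints ['doc1']
```

The one-node query `("car", [])` would return both documents, while the two-node "car accident" query returns only the document in which "car" is attached to "accident". The matcher is a strict, simplified variant: it does not require distinct query children to map to distinct sentence children.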

Description

SEMANTIC TEXT SEARCH
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority of Provisional Patent Application Serial No. 62/053,283, filed on September 22, 2014, the contents of which are hereby incorporated by reference.
FIELD
[0002] One embodiment is directed generally to a computer system, and in particular to a computer system that provides searching of a text corpus.
BACKGROUND INFORMATION
[0003] As the volume of text-based content available continues to grow exponentially, both on the Internet and other content repositories, such as behind-the-firewall data, the importance of search engines and search technology is reinforced. Virtually every user employs one or more search engines to locate relevant content on a frequent basis. With the large quantity of material available, various tools and methods for the refinement of search engine results have been created with varying degrees of success.
[0004] The most popular search engines available primarily follow the interaction model of the user entering a set of text search terms through a search engine interface, and the text search terms are then used to extract a result set from the index created or administered by the search engine. However, one of the limitations of a purely text-based search is that if a text search term is used that can have more than one definition or meaning, the result set which is retrieved will not be as focused or relevant to the topic of interest as might be desired. An additional limitation occurs when the user enters more than one search term. Many search engines interpret such a multi-term query only as a simple request to locate all documents that contain all search query terms or some logical combination or simple variation (e.g., stemming) thereof. Results of this type of search have generally been unsatisfactory for all but the most basic text document retrieval tasks.
[0005] Specifically, although meaning is communicated via words, the typical text or keyword search does not search for the meaning. The creator of the text to be searched has encoded a certain meaning inside the text. Similarly, the person initiating a search encodes desired meaning in a keyword query. The search will return the "correct" result only if both encodings coincide.
SUMMARY
[0006] One embodiment is a system for performing semantic search. The system receives an electronic text corpus and separates the text corpus into a plurality of sentences. The system parses and converts each sentence into a sentence tree. The system receives a search query and matches the search query with one or more of the sentence trees.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] Fig. 1 is a block diagram of a computer server/system in accordance with an embodiment of the present invention.
[0008] Fig. 2 is a high-level flow diagram of the functionality of a semantic text search module and other elements of Fig. 1 in accordance with one embodiment of the present invention.
[0009] Fig. 3 illustrates an example tree formed by parsing the sentence: "The car did not deploy front airbags during the accident" in accordance with one embodiment.
[0010] Figs. 4A, 4B and 4C illustrate a screenshot showing parsed sentences and hypernym matching in accordance with embodiments of the invention.
[0011] Figs. 5A and 5B illustrate screenshots of a semantic search user interface in accordance with embodiments of the invention.
[0012] Fig. 6 illustrates screenshots of a single term modification via a user interface in accordance with one embodiment.
[0013] Figs. 7A and 7B are example user interfaces that illustrate refinements and a summary of the result set in accordance with one embodiment.
DETAILED DESCRIPTION
[0014] The problem of getting a satisfactory answer to a user query by the electronic searching of a massive volume of electronic documents has existed since the early days of computers; however, it has not been fully solved. There have been many different approaches to locating a set of documents to match a user's query, including such well-known search engines as "Google" search from Google Inc. and "Bing" search from Microsoft Corp.
[0015] It is well known that keyword searching using the omnipresent search box is insufficient for supporting many common information seeking tasks. One possible searching technique to improve results is disclosed in Stewart et al., "Idea Navigation: Structured Browsing for Unstructured Text", Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1789-1792 (2008 ACM), which is herein incorporated by reference.
[0016] The embodiments described herein give improved technical solutions to the problem of searching the massive volumes of electronic documents for useful information. The purpose of the examples given is only to illustrate the embodiments of the invention. The actual use cases for the embodiments of the invention include the searching of a text corpus of arbitrary size, possibly including millions or more electronic documents (e.g., emails, articles, books, web pages, tweets, etc.), where the sheer number of words makes any searching for information manually impractical or near impossible, while the precision/recall tradeoffs inherent in keyword search make the approach useless for the cases where high recall or high precision are required.
[0017] One embodiment is a system that performs semantic text search by converting each sentence of a text corpus into a tree. A search query is then also converted into a tree and/or interpreted as a tree, and the search tree is matched with one or more of the text corpus trees. As a result of the match, a response of documents that correspond to the search query is generated. In addition, refinements of related queries can be generated. The use of tree matching provides for a semantic-based search. Another embodiment of the present invention can locate the entities of interest, such as brand or product names, and perform high-precision sentiment extraction based on the other terms that modify such entities of interest.
[0018] In general, embodiments find text based not only on the words, but also on the way the words act upon and modify each other. Other embodiments apply additional knowledge bases by enriching the text with additional information such as synonyms. Embodiments augment text with additional semantic knowledge to allow high recall during retrieval, and utilize as much underlying structure of the text as possible, to obtain high precision.
[0019] Fig. 1 is a block diagram of a computer server/system 10 in accordance with an embodiment of the present invention. Although shown as a single system, the functionality of system 10 can be implemented as a distributed system. Further, the functionality disclosed herein can be implemented on separate servers or devices that may be coupled together over a network. Further, one or more components of system 10 may not be included. For example, for the functionality of a server that performs semantic text search, system 10 may not include peripheral devices such as keyboard 26 and cursor control 28.
[0020] System 10 includes a bus 12 or other communication mechanism for communicating information, and a processor 22 coupled to bus 12 for processing information. Processor 22 may be any type of general or specific purpose processor. System 10 further includes a memory 14 for storing information and instructions to be executed by processor 22. Memory 14 can be comprised of any combination of random access memory ("RAM"), read only memory ("ROM"), static storage such as a magnetic or optical disk, or any other type of computer readable media. System 10 further includes a communication device 20, such as a network interface card, to provide access to a network. Therefore, a user may interface with system 10 directly or remotely through a network, or any other method.
[0021] Computer readable media may be any available media that can be accessed by processor 22 and includes both volatile and nonvolatile media, removable and non-removable media, and communication media. Communication media may include computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.
[0022] Processor 22 is further coupled via bus 12 to a display 24, such as a Liquid Crystal Display ("LCD"). A keyboard 26 and a cursor control device 28, such as a computer mouse, are further coupled to bus 12 to enable a user to interface with system 10.
[0023] In one embodiment, memory 14 stores software modules that provide functionality when executed by processor 22. The modules include an operating system 15 that provides operating system functionality for system 10. The modules further include semantic text search module 16 for providing semantic text search, and all other functionality disclosed herein. System 10 can be part of a larger system. Therefore, system 10 can include one or more additional functional modules 18 to include the additional functionality. A database 17 is coupled to bus 12 to provide centralized storage for modules 16 and 18 and store a text corpus, trees, etc.
[0024] In another embodiment, there is a first server or servers which locate and download electronic documents from either the Internet or Intranet or any combination thereof. These documents are then stored in a database (e.g., Structured Query Language ("SQL") or Not only SQL ("NoSQL"), or any combination thereof). A second server or servers have semantic text search software that, when executed by the second server processors using the documents stored in the database, performs the functionality shown in Fig. 2. The search query, in one embodiment, is received at 210 via a Graphical User Interface ("GUI") that is displayed on a Personal Computer ("PC"), mobile phone or other mobile device.
[0025] Fig. 2 is a high-level flow diagram of the functionality of semantic text search module 16 and other elements of Fig. 1 in accordance with one embodiment of the present invention.
[0026] In one embodiment an electronic document(s) (or "document(s)" herein) is any information recorded in a manner that requires a computer or other electronic device to display, interpret, and process it. This includes documents generated by software and stored on volatile and/or non-volatile storage. Some examples include articles, electronic mail, web pages, tweets, unstructured text records, or any combination thereof. The electronic document includes some electronic parsable text.
[0027] A text corpus is understood as a group of one or more electronic documents. Examples of a text corpus include the entire Internet, an electronic library, or a document repository.
[0028] At 202, a text corpus is received. The text corpus can be stored on database 17 of Fig. 1, or any remote or local volatile or non-volatile memory.
[0029] At 204, the text corpus is separated into sentences.
[0030] At 206, each sentence (or sentence fragment) is parsed and converted into a tree (i.e., a "sentence tree"). The sentence parse can be the grammatical parse or a sentence diagram. To obtain such a parse, different embodiments can use various available computer-implemented natural language parsers, including, but not limited to, "The Stanford Parser: A statistical parser", "ClearNLP", and others. Each tree is formed of nodes corresponding to each term in the sentence, connected by edges. The edges provide the grammatical relations of the connected nodes. For example, an edge can indicate that one term is a modifier of another term connected by the edge.
[0031] Fig. 3 illustrates an example tree formed by parsing the sentence: "The car did not deploy front airbags during the accident" in accordance with one embodiment. The nodes include the words in the sentence: "deploy", "car", "did", "not", "airbags", "during", "the", "front", "accident", and "the". The edges include the grammatical relations of the nodes. For example, "car" is the noun subject ("nsubj" edge) of "deploy", and "airbags" is the direct object ("dobj" edge) of "deploy".
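As an illustration only, the tree of Fig. 3 could be held in memory as a list of typed edges. This is a minimal sketch, not the disclosed implementation: the "nsubj" and "dobj" labels come from the text above, while the remaining labels ("aux", "neg", "amod", "prep", "pobj", "det") and the names `Edge`, `SENTENCE_TREE`, and `children` are assumptions made for the example.

```python
from collections import namedtuple

# One directed, typed edge of a sentence tree: head --relation--> dependent.
Edge = namedtuple("Edge", ["head", "relation", "dependent"])

# "The car did not deploy front airbags during the accident"
# Only nsubj and dobj are named in the text; the other labels are assumed.
# Words are used as node identifiers for brevity; keying nodes by token
# position would keep the two "the" nodes distinct.
SENTENCE_TREE = [
    Edge("deploy", "nsubj", "car"),
    Edge("deploy", "aux", "did"),
    Edge("deploy", "neg", "not"),
    Edge("deploy", "dobj", "airbags"),
    Edge("deploy", "prep", "during"),
    Edge("car", "det", "the"),
    Edge("airbags", "amod", "front"),
    Edge("during", "pobj", "accident"),
    Edge("accident", "det", "the"),
]

def children(tree, head, relation=None):
    """Return the dependents of `head`, optionally restricted to one edge type."""
    return [e.dependent for e in tree
            if e.head == head and (relation is None or e.relation == relation)]

print(children(SENTENCE_TREE, "deploy", "nsubj"))
print(children(SENTENCE_TREE, "deploy", "dobj"))
```

A sentence of ten words yields nine edges here, as expected for a tree rooted at "deploy".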
[0032] In another embodiment, the parse tree can include nodes which include the words in the sentence and each node can include a type (i.e., the syntactic function (e.g., subject, verb, object)), and the edges can include dependencies between nodes (i.e., all or part of the structure of the parse tree). For example, car [subject] is dependent on deploy, so taken with car's type [Subject], car is the subject of deploy [type ROOT]. The edges may optionally include a type (e.g., direct object or indirect object). For example, the sentences "Joe passed the salt shaker." and "Joe passed the salt shaker to his father." have "salt shaker" as a direct object in both sentences and "his father" as an indirect object in the second sentence. The first parse tree of the first sentence may have an edge with type: direct object, while the second parse tree of the second sentence may also include an edge with type: indirect object.
[0033] At 208, one or more trees from 206 are optionally modified. For example, trees can be split, trimmed, augmented with additional edges, edge types collapsed, etc. Article nodes can be dropped. Different embodiments can define "term" in different ways: one embodiment can interpret each word as a separate term. Other embodiments can perform dictionary lookup or use statistical or natural language processing techniques to identify multi-word sequences such as "United States" or "chief of staff" as single terms.
[0034] In another embodiment, one or more nodes can be expanded to include synonyms, hypernyms, hyponyms, and/or other related words or phrases. These related words or phrases can be organized inside the node as a flat list or as a more complicated structure, such as a tree. In addition, edges can also be collapsed or expanded. For example, in the sentence "Joe passed the salt shaker to his father.", the indirect object edge to "his father" and the direct object edge to "salt shaker" can both be converted into the generic "object" type edge. Different embodiments can expand the edge types, for example by both retaining the original edge type (direct object, indirect object) and adding the broader type of a generic "object" type on top of that.
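The edge collapsing and expanding described above can be sketched as follows, assuming edges are plain (head, type, dependent) tuples. The labels "dobj", "iobj", and "obj" and the function names are hypothetical choices for this example, not taken from the source.

```python
# Hypothetical mapping from specific edge types to a broader generic type.
GENERIC_TYPE = {"dobj": "obj", "iobj": "obj"}

def collapse_edges(edges):
    """Replace each specific edge type with its generic type, if it has one."""
    return [(h, GENERIC_TYPE.get(t, t), d) for h, t, d in edges]

def expand_edges(edges):
    """Keep each original edge and add a generic-typed copy on top of it."""
    out = []
    for h, t, d in edges:
        out.append((h, t, d))
        if t in GENERIC_TYPE:
            out.append((h, GENERIC_TYPE[t], d))
    return out

# "Joe passed the salt shaker to his father."
edges = [("passed", "nsubj", "Joe"),
         ("passed", "dobj", "salt shaker"),
         ("passed", "iobj", "his father")]

print(collapse_edges(edges))
print(expand_edges(edges))
```

Collapsing trades precision for recall, while expanding keeps both the specific and the generic edge so that either kind of query can match.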
[0035] At 210, a search query is received. The query can consist of a single term, several terms, or be in the form of a complete sentence.
[0036] At 212, the query is optionally interpreted as a tree and/or is converted into a tree (i.e., a "query tree"). In one embodiment, the query is interpreted as a tree even if it contains only a single term (i.e., a one node tree). In another embodiment, refinements are generated/suggested to the user to add more related terms, and the refinements lead to the creation of a tree. The query can be converted into a tree (e.g., an automatic conversion into a tree by parsing), or through a mechanism that only allows a user to construct a query of a connected tree that is built through suggested refinements.
[0037] At 214, the tree generated in response to the query (or just the query itself if the query is not converted into a tree) is matched with one or more sentence trees generated from the text corpus at 206. In one embodiment, the matching determines if the query tree is a subtree of any of the sentence trees. Tree matching can be performed either strictly (i.e., the match is considered positive if the exactly matching set of nodes is connected in exactly the same way by the exactly matching set of edges) or approximately (i.e., the nodes can be the same but the edges can differ, or the same set of edges can connect only a subset of the query nodes, etc.).
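A minimal sketch of the two matching modes, under the simplifying assumptions that a tree is a list of (head, relation, dependent) edges and that node labels are unique within a sentence. The function names and edge labels are illustrative; a single-term query with no edges would need a separate node-membership check that is not shown.

```python
def strict_match(query_edges, sentence_edges):
    """True if every typed query edge also occurs in the sentence tree."""
    return set(query_edges) <= set(sentence_edges)

def approximate_match(query_edges, sentence_edges):
    """True if every query head/dependent pair is connected, ignoring edge type."""
    untyped = {(h, d) for h, _, d in sentence_edges}
    return all((h, d) in untyped for h, _, d in query_edges)

# "John drove the old car" (assumed parse)
sentence = [("drove", "nsubj", "John"),
            ("drove", "dobj", "car"),
            ("car", "amod", "old")]

query_strict = [("drove", "dobj", "car")]
query_loose = [("drove", "nsubj", "car")]   # right nodes, wrong edge type

print(strict_match(query_strict, sentence))
print(strict_match(query_loose, sentence))
print(approximate_match(query_loose, sentence))
```

The approximate variant shown here is only one of the relaxations mentioned in the text (same nodes, differing edges).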
[0038] At 216, in response to the matching trees, a response of corresponding matching documents is generated. Specifically, the documents that include the sentences that correspond to the matching trees are selected as matching documents.
[0039] At 218, in response to the matching trees, refinements of related queries are generated based on the matches to build the query tree into a bigger tree. As the result of this, users can refine their search based on entities actually acting upon each other. For example, after the user searches for "car", refinement searches can be offered such as "drove car", "crashed car" and "car accident". The "car accident" query would only return the documents where the accident was a car accident. Therefore, it would not return the document that included "we saw a train accident from our car". Different embodiments can suggest broadening refinements (where the suggested query is a subtree of the current query) or lateral refinements (where a term or an edge is replaced with another term or edge).
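The "car accident" example can be illustrated with a small sketch of document selection, assuming an index of (document id, sentence edges) pairs and strict edge matching. The index layout, the edge labels, and the name `matching_documents` are assumptions made for this example.

```python
def matching_documents(query_edges, index):
    """Return ids of documents that have a sentence containing all query edges.

    `index` is an iterable of (doc_id, sentence_edges) pairs; ids are returned
    in first-seen order without duplicates.
    """
    hits = []
    for doc_id, edges in index:
        if set(query_edges) <= set(edges) and doc_id not in hits:
            hits.append(doc_id)
    return hits

index = [
    # "There was a car accident."
    ("doc1", [("accident", "nn", "car")]),
    # "We saw a train accident from our car."
    ("doc2", [("saw", "nsubj", "we"),
              ("saw", "dobj", "accident"),
              ("accident", "nn", "train"),
              ("saw", "prep_from", "car")]),
]

query = [("accident", "nn", "car")]   # "car accident"
print(matching_documents(query, index))
```

Both documents contain the words "car" and "accident", but only the first connects them with an edge, so only it is selected.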
[0040] The refinement process could be iterated for as many steps as desired. The query "drove car" would return further refinements such as "John drove car". This latter query is different from regular text search, as the semantic search would not match documents that include "Peter drove the car, while John stayed home", and different from phrase search, as a phrase search for "John drove car" would not match the text "John, my neighbor, drove my car", while semantic search in accordance with embodiments would match it correctly because it understands the meaning encoded in the sentence structure.
[0041] Further, refinements can be grouped by the edge type. For example, if the current query is "drove car", refinements can be grouped by subjects ("John drove car", "Peter drove car"); adjectives ("drove old car", "drove new car"); adverbs ("drove car carelessly", "drove car carefully"); etc.
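Grouping refinements by edge type might be sketched as below. The sketch assumes that a refinement adds exactly one term attached by a direct edge to a term already in the query, and that the trees passed in have already matched the current query; the edge labels and the name `grouped_refinements` are hypothetical.

```python
from collections import Counter, defaultdict

def grouped_refinements(query_nodes, matching_trees):
    """Group one-term extensions of the query by the type of the connecting edge."""
    groups = defaultdict(Counter)
    for edges in matching_trees:
        for head, relation, dependent in edges:
            # An extension adds exactly one new term attached to a query term.
            if head in query_nodes and dependent not in query_nodes:
                groups[relation][dependent] += 1
            elif dependent in query_nodes and head not in query_nodes:
                groups[relation][head] += 1
    return groups

# Sentence trees that already matched the query "drove car" (assumed parses).
trees = [
    [("drove", "nsubj", "John"), ("drove", "dobj", "car"), ("car", "amod", "old")],
    [("drove", "nsubj", "Peter"), ("drove", "dobj", "car"),
     ("drove", "advmod", "carefully")],
    [("drove", "nsubj", "John"), ("drove", "dobj", "car"), ("car", "amod", "new")],
]

groups = grouped_refinements({"drove", "car"}, trees)
for relation, terms in groups.items():
    print(relation, dict(terms))
```

The per-term counts could also drive the record counts or histogram bars shown next to each suggested refinement.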
[0042] At 220, in response to the matching trees, sentiment extraction is performed in some embodiments. For example, if the search term is a company called "Acme", all sentiment such as modifiers of "terrible", "great", etc., that are grammatically linked to the search term are retrieved. Sentences that merely contain modifiers that do not modify the target search term will not be counted (e.g., "Used Acme product on my terrible ride home" will not return negative sentiment for the term "Acme").
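A rough sketch of edge-based sentiment extraction, assuming a toy polarity lexicon and that only terms joined to the target by a single edge count. The lexicon values, the parses, and the name `sentiment_for` are assumptions made for this example.

```python
# Toy polarity lexicon, assumed for illustration only.
SENTIMENT = {"terrible": -1, "great": 1}

def sentiment_for(target, edges):
    """Sum the polarity of sentiment terms joined to `target` by a single edge."""
    total = 0
    for head, _, dependent in edges:
        if head == target:
            total += SENTIMENT.get(dependent, 0)
        elif dependent == target:
            total += SENTIMENT.get(head, 0)
    return total

# "Used Acme product on my terrible ride home": "terrible" modifies "ride".
s1 = [("used", "dobj", "product"), ("product", "nn", "Acme"),
      ("used", "prep_on", "ride"), ("ride", "amod", "terrible")]

# "Acme is great": "Acme" is the subject of the predicate adjective.
s2 = [("great", "nsubj", "Acme"), ("great", "cop", "is")]

print(sentiment_for("Acme", s1))
print(sentiment_for("Acme", s2))
```

A keyword co-occurrence approach would attribute "terrible" to "Acme" in the first sentence; requiring a connecting edge avoids that.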
[0043] In another embodiment, in addition to an ingested text corpus, other sources are used to enrich the trees generated at 206. The other sources can be external taxonomies that provide tree-like structures, such as "WordNet®" from Princeton University, which provides senses, or Wikipedia, which organizes concepts according to various taxonomies, such as a taxonomy of geographical locations, related terms, or categories. As a result, hypernyms, hyponyms, synonyms, etc., can be generated in response to query terms. In general, a "hyponym" is a word or phrase whose semantic field is included within that of another word, its "hypernym." In other words, a hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle, and seagull are all hyponyms of bird (their hypernym), which, in turn, is a hyponym of animal.
[0044] For example, if it is known that every "car" is a "vehicle" and every "crash" is an "accident", a search can be performed for "vehicle accident", and every instance of "car crash" can be located. The reverse would not be true, since there are vehicles other than cars and accidents other than crashes. Other taxonomies can be applicable as well. For example, a geographic taxonomy would allow a search for "crime in Massachusetts" to retrieve a reference to "jaywalking in Boston".
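The one-directional nature of hypernym matching can be shown with a small sketch. It assumes a toy child-to-parent taxonomy and compares terms position by position, which is a simplification of tree matching; the table entries and function names are illustrative.

```python
# Toy type-of taxonomy (child -> parent); the entries are illustrative only.
HYPERNYM = {"car": "vehicle", "truck": "vehicle", "crash": "accident",
            "Boston": "Massachusetts", "jaywalking": "crime"}

def is_a(term, concept):
    """True if `concept` is `term` itself or one of its transitive hypernyms."""
    while term is not None:
        if term == concept:
            return True
        term = HYPERNYM.get(term)
    return False

def matches(query_terms, text_terms):
    """Match position by position, letting each query term cover its hyponyms."""
    return (len(query_terms) == len(text_terms)
            and all(is_a(t, q) for q, t in zip(query_terms, text_terms)))

print(matches(["vehicle", "accident"], ["car", "crash"]))   # broader query
print(matches(["car", "crash"], ["vehicle", "accident"]))   # reverse direction
print(matches(["crime", "Massachusetts"], ["jaywalking", "Boston"]))
```

The broader query finds the narrower text, but not the other way around, which mirrors the "vehicle accident" and "car crash" example.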
[0045] Further, in one embodiment, anaphora resolution is used to generate a semantic search result. For example, if the text corpus includes the text "John drove the car. He crashed it.", embodiments can infer who crashed what, and return the second sentence to the queries "John crashed" and "crashed car".
[0046] Fig. 4 illustrates a screenshot from a GUI showing parsed sentences and hypernym matching in accordance with embodiments of the invention. Fig. 4a illustrates the left side of the screenshot, Fig. 4b illustrates the right side of the screenshot and Fig. 4c illustrates an enlarged unstructured text example. The sentences are parsed into trees with nodes for actor 401, action 402 and object 403. The parsed trees and corresponding sentences from the text corpus that match the query "actor=/vehicle/car AND object=/difficulty/problem" are shown at 410 and 411, illustrating the matching of the trees that include the hypernym hierarchies. In this embodiment, this query causes the retrieval of all parse trees where the Actor node matches the term "car" with the hypernym "vehicle", and the Object node matches the term "problem" with the hypernym "difficulty". Some embodiments only use selected nodes from the hypernym/hyponym trees, skipping the others. For example, Princeton's WordNet v3.1 contains the following hypernym trees:
S: (n) car, auto, automobile, machine, motorcar (a motor vehicle with four wheels; usually propelled by an internal combustion engine) "he needs a car to get to work"
  inherited hypernym
    S: (n) motor vehicle, automotive vehicle (a self-propelled wheeled vehicle that does not run on rails)
      S: (n) self-propelled vehicle (a wheeled vehicle that carries in itself a means of propulsion)
        S: (n) wheeled vehicle (a vehicle that moves on wheels and usually has a container for transporting things or people) "the oldest known wheeled vehicles were found in Sumer and Syria and date from around 3500 BC"
          S: (n) vehicle (a conveyance that transports people or objects)
S: (n) problem, job (a state of difficulty that needs to be resolved) "she and her husband are having problems"; "it is always a job to contact him"; "urban problems such as traffic congestion and smog"
  direct hypernym
    S: (n) difficulty (a condition or state of affairs almost beyond one's ability to deal with and requiring great effort to bear or overcome) "grappling with financial difficulties"
[0047] Certain embodiments can skip the "motor vehicle", "self-propelled vehicle", and "wheeled vehicle" hypernyms, directly connecting the term "car" to the hypernym "vehicle".
[0048] Fig. 5a illustrates a screenshot of one embodiment, where the user searched for every document that matches the query "ACTOR=/vehicle", with the leading slash "/" indicating that the user is interested in hypernyms. For this query, every matching actor (car, truck, motorcycle, etc.) is a particular type of a vehicle. The word "vehicle" does not need to appear in the text of the document for it to be matched.
[0049] Fig. 5b illustrates a subsequent step of navigation of one embodiment, where the user further refined the search by filtering the "ACTOR=/vehicle" query by only returning the documents that also match "OBJECT=/difficulty" query. As the result, only those documents that refer to any kind of vehicle (car, truck, etc.) experiencing any kind of difficulty (problem, trouble, etc.) are returned.
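The two navigation steps of Figs. 5a and 5b can be approximated by the sketch below. It assumes that parsed sentences are stored as actor/action/object records, that a leading "/" requests hypernym matching, and that successive filters are combined with AND. The query grammar handled by `parse_filter` is a guess based on the two example queries only, and the hypernym table is a toy.

```python
# Toy hypernym table (term -> set of hypernyms); illustrative only.
HYPERNYMS = {"car": {"vehicle"}, "truck": {"vehicle"},
             "problem": {"difficulty"}, "trouble": {"difficulty"}}

def parse_filter(expr):
    """Split 'ACTOR=/vehicle' into ('actor', 'vehicle', True).

    The boolean marks a hypernym query (value written with a leading slash).
    """
    field, value = expr.split("=", 1)
    by_hypernym = value.startswith("/")
    return field.strip().lower(), value.lstrip("/"), by_hypernym

def select(records, *filters):
    """Return the records that satisfy every filter expression."""
    out = []
    for rec in records:
        ok = True
        for expr in filters:
            field, value, by_hypernym = parse_filter(expr)
            term = rec[field]
            if by_hypernym:
                ok = ok and value in HYPERNYMS.get(term, set())
            else:
                ok = ok and term == value
        if ok:
            out.append(rec)
    return out

records = [
    {"actor": "car", "action": "had", "object": "problem"},
    {"actor": "truck", "action": "carried", "object": "cargo"},
    {"actor": "driver", "action": "reported", "object": "trouble"},
]

print(select(records, "ACTOR=/vehicle"))
print(select(records, "ACTOR=/vehicle", "OBJECT=/difficulty"))
```

None of the records contains the word "vehicle" or "difficulty"; they are retrieved through the hypernym table alone.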
[0050] Other embodiments of the present invention do not require the user to explicitly specify what field of a parsed text the user is interested in. In such an embodiment, a query for "vehicle" would return all instances of "vehicle", whether they are actors, objects, or possibly other nodes of parsed trees. Likewise, in some embodiments the users are not required to explicitly specify they are interested in hypernyms. In such embodiments the query "vehicle" would return both the documents where the verbatim text mentions "vehicle", as well as those documents where the entities (car, truck, etc.) have the term "vehicle" as a hypernym.
[0051] Some embodiments allow complicated query construction, where the query specifies not only nodes but also edges. In one embodiment that searches medical records, the user can specify the query that retrieves all nodes that match the terms "allergy" or "allergic", refine the resulting trees by the name of a particular medicine, and then further refine the resulting trees by the absence of a negative edge. Such a query would retrieve only the records of the patients who are allergic to the particular medicine.
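The medical-record query can be sketched as three successive filters over typed edges. The sketch assumes a "neg" edge marks negation and a "prep_to" edge links the allergy term to the medicine; these labels, the sample parses, and the name `allergic_to` are assumptions, not taken from the source.

```python
ALLERGY_TERMS = ("allergy", "allergic")

def allergic_to(trees, medicine):
    """Keep trees that link an allergy term to `medicine` and carry no negation edge."""
    hits = []
    for edges in trees:
        # Step 1: nodes that match "allergy" or "allergic".
        allergy_nodes = {n for h, _, d in edges for n in (h, d)
                         if n in ALLERGY_TERMS}
        # Step 2: refine by the name of the medicine attached to such a node.
        mentions = any(h in allergy_nodes and d == medicine for h, _, d in edges)
        # Step 3: refine by the absence of a negative edge on the allergy node.
        negated = any(h in allergy_nodes and rel == "neg" for h, rel, _ in edges)
        if mentions and not negated:
            hits.append(edges)
    return hits

# "Patient is allergic to penicillin."
t1 = [("allergic", "nsubj", "patient"), ("allergic", "prep_to", "penicillin")]
# "Patient is not allergic to penicillin."
t2 = [("allergic", "nsubj", "patient"), ("allergic", "neg", "not"),
      ("allergic", "prep_to", "penicillin")]

print(len(allergic_to([t1, t2], "penicillin")))
```

A keyword search for "allergic penicillin" would return both records; the edge-level condition keeps only the non-negated one.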
[0052] In some embodiments, a user is allowed to modify semantic search queries on a term-by-term basis. Fig. 6 illustrates screenshots of a single term modification via a user interface 600 in accordance with one embodiment. For the term "accident", the specific verbatim form is shown at 601, and the hyponym/hypernym options are shown at 602. Unavailable refinements may be grayed out.
[0053] In one embodiment, hierarchical term modification can be used for multiple hypernyms such as:
device > computer > apple
food > fruit > apple.
Such hierarchies also allow a user to perform sense disambiguation.
[0054] In one embodiment, breadcrumbs can be used for other enriched dimensions (e.g., geographical location, semantic role, etc.) such as:
U.S. > Massachusetts > Cambridge
U.K. > Cambridgeshire > Cambridge
Person > speaker > Larry Ellison
Person > investor > Larry Ellison.
[0055] In one embodiment, the query can be modified by exposing noun phrase modifiers as dynamic dimensions such as follows:
ACCIDENT
major accident (20)
serious accident (15)
first accident (12)
Such modifiers can also be included into the query modification UI 600, as shown at 603 of Fig. 6.
[0056] Tag clouds typically provide a set of frequent terms, given the query. However, in embodiments that use semantics, tag clouds can provide a set of frequent subtrees, given the query.
[0057] Embodiments bridge the gap between semantics and text (in other words, the meaning and its representation) by shifting the focus of search from tokens to the graph representation of the text, namely terms/entities as nodes and connections between them as edges. A prior art "bag of words" model does not distinguish between searches such as [theory group] and [group theory], and maps all senses of polysemic terms such as "Java" to the same token.
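The difference between the two representations can be seen in a short sketch. It assumes a crude rule that each word modifies the word to its right, which stands in for a real parse; the function names are hypothetical.

```python
from collections import Counter

def bag_of_words(query):
    """Order-free token counts, as in a bag-of-words model."""
    return Counter(query.lower().split())

def modifier_edges(query):
    """Directed (head, modifier) edges, treating each word as modifying the next."""
    words = query.lower().split()
    return {(head, modifier) for modifier, head in zip(words, words[1:])}

print(bag_of_words("theory group") == bag_of_words("group theory"))
print(modifier_edges("theory group") == modifier_edges("group theory"))
```

The token counts are identical for both queries, while the edge sets differ because the head and the modifier swap roles.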
[0058] Embodiments use linguistic constructs known as "phrases". In English, a noun phrase can include a sequence of zero or more adjectives, followed by one or more nouns; other embodiments may take into account articles and prepositions. A verb phrase is zero or more adverbs followed by one or more verbs. Embodiments further use typed entity extraction, where trained extractors mark up the corpus text for entities such as People, Places, or Organizations.
[0059] Exposing such entities to the user via information-scent-rich interfaces (related terms, lists, tag clouds, refinements) allows the user to select not just the terms of interest but the terms that are used to convey a particular meaning.
[0060] Meaning is encoded not only on the entity level, but also on the sentence level. Sentence structure expresses the way that entities act on and modify each other. Extraction of this information is essential to retaining the semantics of text.
[0061] One embodiment uses the extraction of actor/action/object triples (where action is a verb phrase, and actor and object are noun phrases). The sentence "The car did not deploy front airbags during the accident" can thus be parsed into one such triple: {car/did not deploy/front airbags}. Such parsing is lossy: it necessarily simplifies the sentence structure, including, in this example, the loss of the modifier "during the accident". However, it does extract a wealth of information from the sentence that is not accessible to the bag of words model.
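One way the triple {car/did not deploy/front airbags} might be derived from a dependency parse is sketched below. It assumes typed edges in the style of Fig. 3, keeps auxiliaries and negation with the verb and adjectival modifiers with the nouns, and discards everything else. The labels other than "nsubj" and "dobj", and the name `extract_triple`, are assumptions.

```python
def extract_triple(edges, word_order):
    """Build an (actor, action, object) triple around a verb with nsubj and dobj."""
    position = {w: i for i, w in enumerate(word_order)}

    def phrase(head, relations):
        # Head word plus its dependents of the given types, in sentence order.
        words = [head] + [d for h, r, d in edges if h == head and r in relations]
        return " ".join(sorted(words, key=position.get))

    for verb, relation, subject in edges:
        if relation != "nsubj":
            continue
        objects = [d for h, r, d in edges if h == verb and r == "dobj"]
        if objects:
            return (phrase(subject, {"amod"}),
                    phrase(verb, {"aux", "neg"}),
                    phrase(objects[0], {"amod"}))
    return None

order = ["the", "car", "did", "not", "deploy", "front", "airbags",
         "during", "the", "accident"]
edges = [("deploy", "nsubj", "car"), ("deploy", "aux", "did"),
         ("deploy", "neg", "not"), ("deploy", "dobj", "airbags"),
         ("airbags", "amod", "front"), ("deploy", "prep", "during"),
         ("during", "pobj", "accident")]

print(extract_triple(edges, order))
```

The "during the accident" branch is dropped, which is the lossiness noted in the paragraph above.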
[0062] Other embodiments do not constrain the extracted triples to the format described above. The "RDF schema" {subject/predicate/object} is a superset of the above. Such a schema supports the extraction of different types of entities and predicates.
[0063] One embodiment parses sentences or sentence fragments into graphs, with terms (subjects, objects, actions, modifiers, etc.) mapped to nodes in the graph, while predicates form (directed) edges. In one embodiment, the edges are mapped from the grammatical structure of the sentence, including edges that describe relationships between terms such as "clausal", "direct", "prepositional", etc. One taxonomy of such edges is disclosed in de Marneffe et al., "Stanford typed dependencies manual" (September 2008, Revised for the Stanford Parser v. 3.3 in December 2013).
[0064] Embodiments provide guided (or faceted) navigation that includes a user interface element that offers refining links (possibly with the count of records that would be obtained upon selection, or other indication of the record count, such as histogram bars) such as shown in Fig. 7.
[0065] The guided navigation interface in one embodiment has a two-fold purpose. The refinements, on one hand, offer avenues for narrowing down the result set. On the other hand, they offer the summary of the result set, which provides the semantic search aspect to the guided navigation user experience. Figs. 7a and 7b are example interfaces that illustrate refinements and a summary of the result set in accordance with one embodiment.
[0066] In particular, selecting a particular term can cause an embodiment to display possible refining (more detailed) queries, by querying the database of parse trees and limiting the set of refining queries to those that contain the original term and one additional term, connected to the original one by a direct edge. Such refinements both represent to the user the semantic content of the current result set and suggest the possible direction of further navigation. The further refinements can be selected from the list, or the user can search for the terms of interest via the search box. Fig. 7a shows one such embodiment, where for the original search term "color", refinements such as "find color", "favorite color", and "color use" are suggested, along with the histogram bars indicating the frequency of each refinement.
[0067] Embodiments allow the same interaction model to be utilized at the very outset of the querying process, allowing the user to start by refining the entire collection of information via selecting a term or a set of terms from the database of parsed semantic trees, displayed in a manner identical to "tag clouds" or "word clouds".
[0068] As described, embodiments use the context to group tokens into entities and sentence structure to detect the way entities act on other entities. Further, embodiments use information outside of the corpus, such as augmenting each field of a parsed semantic tree with a more generic hierarchical parent value. In some embodiments, "WordNet®" can be used to locate a hypernym (a broadening synonym) for nodes such as nouns or verbs. For other embodiments, the head terms (in English, almost always the rightmost term of a phrase) can be used, such as using "airbags" for "front airbags", or "Clinton" for both "Hillary Clinton" and "Bill Clinton".
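The head-term heuristic is simple enough to sketch directly, assuming whitespace-separated phrases and taking the rightmost word as the head. The function names are hypothetical.

```python
from collections import defaultdict

def head_term(phrase):
    """Return the rightmost word of a phrase, the usual head in English."""
    return phrase.split()[-1]

def group_by_head(phrases):
    """Group phrases under their head term, used as a generic parent value."""
    groups = defaultdict(list)
    for phrase in phrases:
        groups[head_term(phrase)].append(phrase)
    return dict(groups)

print(group_by_head(["front airbags", "Hillary Clinton", "Bill Clinton"]))
```

Unlike a taxonomy lookup, this needs no external knowledge source, at the cost of conflating distinct entities that happen to share a head.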
[0069] Fig. 7a illustrates the process where the user can start by searching for any term that expresses the search intent ("color"), and then proceed by selecting a refinement ("favorite color") to narrow down the result set. The embodiment then offers further possible refinements in Fig. 7b.
[0070] As disclosed, embodiments provide semantic text search by parsing sentences into trees and providing a mechanism to match such trees with user queries. The matching trees can generate a response of matching corresponding documents, refinements, or provide the set of sentiment-bearing terms to power sentiment extraction.
[0071] Several embodiments are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the disclosed embodiments are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.

Claims

WHAT IS CLAIMED IS:
1. A computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to perform semantic search, the semantic search comprising:
receiving an electronic text corpus;
separating the text corpus into a plurality of sentences;
parsing and converting each sentence into a sentence tree;
receiving a search query via a user interface; and
matching the search query with one or more of the sentence trees.
2. The computer-readable medium of claim 1, the semantic search further comprising:
converting the search query into a query tree, wherein the matching comprises matching the query tree with one or more of the sentence trees.
3. The computer-readable medium of claim 1, further comprising:
in response to the matching, providing documents from the text corpus that correspond to the matched sentence trees.
4. The computer-readable medium of claim 1, further comprising:
in response to the matching, providing refinements to the search query that correspond to the matched sentence trees.
5. The computer-readable medium of claim 1, further comprising:
in response to the matching, extracting sentiments for the search query that correspond to the matched sentence trees.
6. The computer-readable medium of claim 1, further comprising using a source external to the text corpus to enrich the sentence trees.
7. The computer-readable medium of claim 2, wherein the user interface displays hypernym matching for the query tree.
8. The computer-readable medium of claim 1, further comprising modifying one or more of the sentence trees, wherein each sentence tree comprises nodes and at least one edge, and the modifying comprises at least one of: expanding a node, collapsing a node, expanding an edge, or collapsing an edge.
9. A method of performing semantic search, the method comprising:
receiving an electronic text corpus;
separating the text corpus into a plurality of sentences;
parsing and converting each sentence into a sentence tree;
receiving a search query via a user interface; and
matching the search query with one or more of the sentence trees.
10. The method of claim 9, further comprising:
converting the search query into a query tree, wherein the matching comprises matching the query tree with one or more of the sentence trees.
11. The method of claim 9, further comprising:
in response to the matching, providing documents from the text corpus that correspond to the matched sentence trees.
12. The method of claim 9, further comprising:
in response to the matching, providing refinements to the search query that correspond to the matched sentence trees.
13. The method of claim 9, further comprising:
in response to the matching, extracting sentiments for the search query that correspond to the matched sentence trees.
14. The method of claim 9, further comprising using a source external to the text corpus to enrich the sentence trees.
15. The method of claim 10, wherein the user interface displays hypernym matching for the query tree.
16. The method of claim 9, further comprising modifying one or more of the sentence trees, wherein each sentence tree comprises nodes and at least one edge, and the modifying comprises at least one of: expanding a node, collapsing a node, expanding an edge, or collapsing an edge.
17. A semantic text search query system comprising:
a processor;
a storage device coupled to the processor, wherein the storage device stores a plurality of sentence trees that are formed by separating a text corpus into a plurality of sentences and parsing and converting each sentence into one of the sentence trees; wherein the processor is adapted to generate a user interface that receives a search query; and
the processor is adapted to match the search query with one or more of the sentence trees.
18. The semantic text search query system of claim 17, the processor further adapted to convert the search query into a query tree, wherein the matching comprises matching the query tree with one or more of the sentence trees.
19. The semantic text search query system of claim 17, the processor further adapted to, in response to the matching, provide documents from the text corpus that correspond to the matched sentence trees.
20. The semantic text search query system of claim 17, the processor further adapted to, in response to the matching, provide refinements to the search query that correspond to the matched sentence trees.
21. The semantic text search query system of claim 17, the processor further adapted to, in response to the matching, extract sentiments for the search query that correspond to the matched sentence trees.
22. The semantic text search query system of claim 18, the processor further adapted to, in response to the matching, display hypernym matching for the query tree on the user interface.
23. The semantic text search query system of claim 17, the processor further adapted to modify one or more of the sentence trees, wherein each sentence tree comprises nodes and at least one edge, and the modifying comprises at least one of: expanding a node, collapsing a node, expanding an edge, or collapsing an edge.
24. The computer-readable medium of claim 2, wherein the query tree comprises at least one node, further comprising:
after the matching, allowing the user to further refine the semantic search by adding an additional node to the query tree and repeating the matching.
25. The computer-readable medium of claim 1, further comprising, after receiving the search query, providing selectable refining search queries via the user interface.
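The steps recited in claims 9, 10 and 15 (separating a corpus into sentences, converting each sentence and the query into trees, and matching with hypernyms) can be illustrated with a minimal Python sketch. It is not the claimed implementation. The `Node` class, the `toy_parse` heuristic (first content word as subject, second as verb, the rest as objects), the `subj`/`obj` edge labels, and the `HYPERNYMS` table are all hypothetical stand-ins chosen for illustration; a real system would presumably use a proper natural language parser and an external lexical resource.

```python
import re
from dataclasses import dataclass, field

# Hypothetical hypernym table (assumption): maps a word to a broader term.
HYPERNYMS = {"dog": "animal", "cat": "animal"}
STOPWORDS = {"the", "a", "an"}


@dataclass
class Node:
    """A tree node; children are (edge_label, Node) pairs."""
    word: str
    children: list = field(default_factory=list)


def split_sentences(corpus):
    """Separate the corpus into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]


def toy_parse(text):
    """Stand-in parser (assumption): subject, verb, then objects.

    The verb becomes the root; the subject and objects hang off it.
    Returns None when the text has no content words.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    if not words:
        return None
    if len(words) == 1:
        return Node(words[0])
    root = Node(words[1])
    root.children.append(("subj", Node(words[0])))
    for w in words[2:]:
        root.children.append(("obj", Node(w)))
    return root


def word_matches(query_word, sentence_word):
    """Exact match, or the query word is a hypernym of the sentence word."""
    return query_word == sentence_word or HYPERNYMS.get(sentence_word) == query_word


def matches_at(query, node):
    """True if the query tree can be embedded in the sentence tree at this node."""
    if not word_matches(query.word, node.word):
        return False
    return all(
        any(q_label == n_label and matches_at(q_child, n_child)
            for n_label, n_child in node.children)
        for q_label, q_child in query.children
    )


def tree_matches(query, node):
    """True if the query tree matches at this node or anywhere below it."""
    return matches_at(query, node) or any(
        tree_matches(query, child) for _, child in node.children
    )


def semantic_search(corpus, query_text):
    """Return the sentences whose trees match the query tree."""
    query = toy_parse(query_text)
    if query is None:
        return []
    hits = []
    for sentence in split_sentences(corpus):
        tree = toy_parse(sentence)
        if tree is not None and tree_matches(query, tree):
            hits.append(sentence)
    return hits
```

For example, with the corpus `"The dog chased the ball. The cat ate fish. Birds sing."`, the query `"animal chased ball"` matches only the first sentence, because `animal` is listed as a hypernym of `dog` in the illustrative table. Matching on tree structure rather than on a bag of words is what lets the query distinguish who did what to whom.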
PCT/US2015/051403 2014-09-22 2015-09-22 Semantic text search WO2016048996A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201580049293.5A CN106716408B (en) 2014-09-22 2015-09-22 Semantic text search
EP15844520.5A EP3198490A4 (en) 2014-09-22 2015-09-22 Semantic text search
JP2017515135A JP6588089B2 (en) 2014-09-22 2015-09-22 Semantic text search

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201462053283P 2014-09-22 2014-09-22
US62/053,283 2014-09-22
US14/643,390 2015-03-10
US14/643,390 US9836529B2 (en) 2014-09-22 2015-03-10 Semantic text search

Publications (1)

Publication Number Publication Date
WO2016048996A1 (en) 2016-03-31

Family

ID=55525959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/051403 WO2016048996A1 (en) 2014-09-22 2015-09-22 Semantic text search

Country Status (5)

Country Link
US (2) US9836529B2 (en)
EP (1) EP3198490A4 (en)
JP (1) JP6588089B2 (en)
CN (1) CN106716408B (en)
WO (1) WO2016048996A1 (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10102275B2 (en) 2015-05-27 2018-10-16 International Business Machines Corporation User interface for a query answering system
US10176251B2 (en) * 2015-08-31 2019-01-08 Raytheon Company Systems and methods for identifying similarities using unstructured text analysis
US10691709B2 (en) 2015-10-28 2020-06-23 Open Text Sa Ulc System and method for subset searching and associated search operators
US10146858B2 (en) 2015-12-11 2018-12-04 International Business Machines Corporation Discrepancy handler for document ingestion into a corpus for a cognitive computing system
US10176250B2 (en) 2016-01-12 2019-01-08 International Business Machines Corporation Automated curation of documents in a corpus for a cognitive computing system
US9842161B2 (en) 2016-01-12 2017-12-12 International Business Machines Corporation Discrepancy curator for documents in a corpus of a cognitive computing system
CN107871259A (en) * 2016-09-26 2018-04-03 阿里巴巴集团控股有限公司 A kind of processing method of information recommendation, device and client
US11068658B2 (en) * 2016-12-07 2021-07-20 Disney Enterprises, Inc. Dynamic word embeddings
US10417269B2 (en) 2017-03-13 2019-09-17 Lexisnexis, A Division Of Reed Elsevier Inc. Systems and methods for verbatim-text mining
US10747815B2 (en) 2017-05-11 2020-08-18 Open Text Sa Ulc System and method for searching chains of regions and associated search operators
US11556527B2 (en) 2017-07-06 2023-01-17 Open Text Sa Ulc System and method for value based region searching and associated search operators
US10565189B2 (en) * 2018-02-26 2020-02-18 International Business Machines Corporation Augmentation of a run-time query
US10824686B2 (en) * 2018-03-05 2020-11-03 Open Text Sa Ulc System and method for searching based on text blocks and associated search operators
IL258689A (en) 2018-04-12 2018-05-31 Browarnik Abel A system and method for computerized semantic indexing and searching
WO2020003355A1 (en) * 2018-06-25 2020-01-02 株式会社フォーラムエンジニアリング Matching score calculation device
WO2020079748A1 (en) * 2018-10-16 2020-04-23 株式会社島津製作所 Case searching method and case searching system
US20220027397A1 (en) * 2018-10-16 2022-01-27 Shimadzu Corporation Case search method
WO2020079752A1 (en) 2018-10-16 2020-04-23 株式会社島津製作所 Document search method and document search system
CN110765368B (en) * 2018-12-29 2020-10-27 滴图(北京)科技有限公司 Artificial intelligence system and method for semantic retrieval
US11126793B2 (en) * 2019-10-04 2021-09-21 Omilia Natural Language Solutions Ltd. Unsupervised induction of user intents from conversational customer service corpora
US11829420B2 (en) * 2019-12-19 2023-11-28 Oracle International Corporation Summarized logical forms for controlled question answering
US12093253B2 (en) 2019-12-19 2024-09-17 Oracle International Corporation Summarized logical forms based on abstract meaning representation and discourse trees
US11599725B2 (en) 2020-01-24 2023-03-07 Oracle International Corporation Acquiring new definitions of entities
WO2021227059A1 (en) * 2020-05-15 2021-11-18 深圳市世强元件网络有限公司 Multi-way tree-based search word recommendation method and system
CN113051286A (en) * 2021-04-20 2021-06-29 中国工商银行股份有限公司 Method and device for generating SQL (structured query language) statement conversion model
US11164153B1 (en) * 2021-04-27 2021-11-02 Skyhive Technologies Inc. Generating skill data through machine learning
CN113377922B (en) * 2021-06-25 2024-04-02 北京百度网讯科技有限公司 Method, device, electronic equipment and medium for matching information
CN113377921B (en) * 2021-06-25 2023-07-21 北京百度网讯科技有限公司 Method, device, electronic equipment and medium for matching information
US20230342544A1 (en) * 2022-04-25 2023-10-26 Lemon Inc. Semantic parsing for short text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009199280A (en) * 2008-02-21 2009-09-03 Hitachi Ltd Similarity retrieval system using partial syntax tree profile
CN102298642B (en) * 2011-09-15 2012-09-05 苏州大学 Method and system for extracting text information
US8533208B2 (en) * 2009-09-28 2013-09-10 Ebay Inc. System and method for topic extraction and opinion mining
US20140019122A1 (en) * 2012-07-10 2014-01-16 Robert D. New Method for Parsing Natural Language Text
KR20140053717A (en) * 2012-10-26 2014-05-08 고려대학교 산학협력단 Sentiment-based query processing system and method

Family Cites Families (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5933822A (en) * 1997-07-22 1999-08-03 Microsoft Corporation Apparatus and methods for an information retrieval system that employs natural language processing of search results to improve overall precision
US6081774A (en) 1997-08-22 2000-06-27 Novell, Inc. Natural language information retrieval system and method
US20020010574A1 (en) * 2000-04-20 2002-01-24 Valery Tsourikov Natural language processing and query driven information retrieval
US7027974B1 (en) 2000-10-27 2006-04-11 Science Applications International Corporation Ontology-based parser for natural language processing
JP2005309666A (en) * 2004-04-20 2005-11-04 Konica Minolta Holdings Inc Information retrieval device
US8244726B1 (en) 2004-08-31 2012-08-14 Bruce Matesso Computer-aided extraction of semantics from keywords to confirm match of buyer offers to seller bids
US20070260450A1 (en) * 2006-05-05 2007-11-08 Yudong Sun Indexing parsed natural language texts for advanced search
US9069750B2 (en) * 2006-10-10 2015-06-30 Abbyy Infopoisk Llc Method and system for semantic searching of natural language texts
US7698259B2 (en) 2006-11-22 2010-04-13 Sap Ag Semantic search in a database
US7925498B1 (en) * 2006-12-29 2011-04-12 Google Inc. Identifying a synonym with N-gram agreement for a query phrase
JP2009193448A (en) * 2008-02-15 2009-08-27 Oki Electric Ind Co Ltd Dialog system, method, and program
JP5038939B2 (en) * 2008-03-03 2012-10-03 インターナショナル・ビジネス・マシーンズ・コーポレーション Information retrieval system, method and program
US8180754B1 (en) 2008-04-01 2012-05-15 Dranias Development Llc Semantic neural network for aggregating query searches
US9317589B2 (en) 2008-08-07 2016-04-19 International Business Machines Corporation Semantic search by means of word sense disambiguation using a lexicon
CA2639438A1 (en) 2008-09-08 2010-03-08 Semanti Inc. Semantically associated computer search index, and uses therefore
JP4499179B1 (en) * 2009-05-12 2010-07-07 株式会社エヌ・ティ・ティ・データ Terminal device
US8112436B2 (en) 2009-09-21 2012-02-07 Yahoo ! Inc. Semantic and text matching techniques for network search
US9208435B2 (en) 2010-05-10 2015-12-08 Oracle Otc Subsidiary Llc Dynamic creation of topical keyword taxonomies
EP2400400A1 (en) 2010-06-22 2011-12-28 Inbenta Professional Services, S.L. Semantic search engine using lexical functions and meaning-text criteria
KR101192439B1 (en) * 2010-11-22 2012-10-17 고려대학교 산학협력단 Apparatus and method for serching digital contents
JP5710317B2 (en) * 2011-03-03 2015-04-30 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Information processing apparatus, natural language analysis method, program, and recording medium
US9176949B2 (en) 2011-07-06 2015-11-03 Altamira Technologies Corporation Systems and methods for sentence comparison and sentence-based search
CN103365924B (en) * 2012-04-09 2016-04-06 北京大学 A kind of method of internet information search, device and terminal
US9336297B2 (en) * 2012-08-02 2016-05-10 Paypal, Inc. Content inversion for user searches and product recommendations systems and methods
US9244909B2 (en) 2012-12-10 2016-01-26 General Electric Company System and method for extracting ontological information from a body of text
CN103544242B (en) * 2013-09-29 2017-02-15 广东工业大学 Microblog-oriented emotion entity searching system
CN103530415A (en) * 2013-10-29 2014-01-22 谭永 Natural language search method and system compatible with keyword search
CN103729456B (en) * 2014-01-07 2016-09-28 合肥工业大学 Microblog multi-modal sentiment analysis method based on microblog group environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3198490A4 *

Also Published As

Publication number Publication date
US9836529B2 (en) 2017-12-05
US20180075132A1 (en) 2018-03-15
EP3198490A4 (en) 2018-03-14
US20160085853A1 (en) 2016-03-24
EP3198490A1 (en) 2017-08-02
CN106716408A (en) 2017-05-24
JP6588089B2 (en) 2019-10-09
US10324967B2 (en) 2019-06-18
JP2017528842A (en) 2017-09-28
CN106716408B (en) 2021-09-10

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15844520; Country of ref document: EP; Kind code of ref document: A1)
REEP Request for entry into the European phase (Ref document number: 2015844520; Country of ref document: EP)
WWE WIPO information: entry into national phase (Ref document number: 2015844520; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2017515135; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)