US20160179931A1 - System And Method For Supplementing Search Queries - Google Patents
- Publication number
- US20160179931A1, US15/057,061
- Authority
- US
- United States
- Prior art keywords
- terms
- document
- glossary
- tagged
- glossaries
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
-
- G06F17/30613—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/211—Schema design and management
- G06F16/212—Schema design and management with details for data modelling support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2272—Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24575—Query processing with adaptation to user needs using context
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3349—Reuse of stored results of previous queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G06F17/30657—
-
- G06F17/30693—
Definitions
- This application relates in general to document tagging, and in particular, to a system and method for supplementing search queries.
- The use of tagging has greatly increased with the growth of social networking and the use of desktop software.
- a user assigns a tag, such as a word or phrase, to a document, image, or Web page to “mark” the item with an identifying concept.
- users commonly post and tag photographs of themselves and friends using a tag as an identifier for the people in the picture, such as by name, nickname, or user name.
- the tag provides metadata for the marked item and once assigned, can be used for quickly determining a topic of the document or for identification and retrieval, such as through search queries based on the tag.
- tags are sparsely defined, coarse, and ambiguous.
- conventional tagging systems are unable to effectively and efficiently distinguish between homonyms, which have different meanings for the same tag, or combine synonyms, which include multiple tags for the same concept.
- a tag with the word “java” can refer to coffee, an island, or programming language.
- a user or tagging system is unable to determine the meaning of the tag.
- the assignment of tags is usually subjectively determined by a user, which can result in an assigned tag that may not accurately reflect a topic of a document or that is not understandable by other users. For instance, a user marks a picture of a baby with a tag that provides the name of the baby's mother. The user knows that the tagged person is the mother; however, another user may believe that the tag refers to the baby. In another example, a user assigns a tag “Section IV-Research” to a document regarding sickle cell anemia. The user assigns the particular tag because he is working on a paper about genetic disorders and wants to identify the section of the paper that will cite the document. However, the tag is likely to be misunderstood or confusing to another user. Therefore, conventional tags are limited in use and may only be beneficial for the creator of the tags.
- tags are refined without requiring retagging of documents.
- Glossaries of terms associated with tags provide a context for the tags assigned to a document.
- the glossaries of one or more of the tags can be used to automatically tag other documents, augment search queries, or create an index for the document.
- the terms defined in the glossary can be used themselves as tags to identify sub-topics of the main tag.
- An embodiment provides a system and method for supplementing search queries.
- a set of tagged documents is accessed and each tagged document is associated with a glossary that includes a plurality of entries.
- Each glossary entry is associated with a term, at least one alias, and a definition for the term.
- a search query having one or more search words is received.
- the search query is compared with the glossaries associated with each of the tagged documents.
- at least one entry that includes one of the search words of the search query is identified.
- a supplemental search query is generated by selecting one of the term, alias, and at least a portion of the definition from one or more of the identified entries for inclusion in the query.
- FIG. 1 is a system for generating tag glossaries and use thereof, in accordance with one embodiment.
- FIG. 2 is a flow diagram showing a method for generating tag glossaries, in accordance with one embodiment.
- FIG. 3 is a block diagram showing, by way of example, a tag glossary.
- FIG. 4 is a data flow diagram showing uses for tag glossaries.
- FIG. 5 is a flow diagram showing, by way of example, a method for automatically assigning tags to documents.
- FIG. 6 is a flow diagram showing, by way of example, a method for supplementing a search query with information from tag glossaries.
- FIG. 7 is a flow diagram showing, by way of example, a method for generating an index using tag glossaries.
- Document tagging allows users to assign tags to documents, such as books, articles, pictures, and Web pages for describing a topic or other aspects of the document and for identifying the document during a search.
- the terms “document” and “Web page” are used interchangeably with the same intended meaning, unless otherwise indicated.
- tags provide minimal amounts of information and can be coarse and ambiguous. Many tagging systems are unable to understand and resolve the ambiguity to determine a meaning of the assigned tags, which limits the support and analysis that can be provided by the system.
- Tag glossaries provide a context for the tag from which a user or tagging system can distinguish the meanings of multiple tags.
- FIG. 1 is a system for generating, updating, and utilizing tag glossaries, in accordance with one embodiment.
- One or more user devices 11 - 13 are connected to a Web server 17 via an internetwork 15 , such as the Internet.
- the user devices 11 - 13 can include a computer, laptop, or mobile device, such as a cellular telephone or personal digital assistant.
- each user device 11 - 13 is a Web-enabled device that executes a Web browser, which supports interfacing tools and information exchange with the Web server 17 .
- the Web server 17 is interconnected to a Web page database 18 and a tag repository 14 .
- the Web page database 18 stores Web pages 19 , which are provided to the user devices 11 - 13 upon request.
- the tag repository 14 stores metadata tags 20 associated with the Web pages 19 .
- the tag repository 14 also stores glossaries 16 , which are associated with the metadata tags 20 .
- the glossaries each include terms, term definitions, and aliases of the terms.
- the tags and glossaries can be stored locally on the user devices 11 - 13 , along with documents, such as books, articles, pictures, and Web pages.
- a user tags a document displayed on a Web page, which was obtained from the Web page database 18 via the Web server 17 . If the tag is not associated with a glossary, a new glossary can be generated by adding terms, definitions, and aliases associated with the tag.
- a glossary generator 21 is interconnected to the user devices 11 - 13 and the Web server 17 via the internetwork 15 , and includes a glossary selection module 22 , term selection module 23 , and compiler 24 .
- a new glossary can also be attached to a tag by inheriting the glossary entries of one or more pre-existing tags, perhaps belonging to the same user, or one or more other users.
- the glossary entries can include the terms of the glossary and the definitions and aliases associated with the terms.
- the user defining the new glossary can modify or update the inherited glossary entries by adding, deleting, and editing terms, as well as “overriding” the existing definition or alias of a term with a new definition or alias for the new glossary.
- the user can also add new terms and aliases to the glossary that are not present in the glossaries from which the new glossary inherits.
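- The inheritance and overriding of glossary entries described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `GlossaryEntry` class, the `inherit_glossary` function, the rule that later parents win over earlier ones, and the sample definitions are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryEntry:
    """One glossary entry: a term, its definition, and its aliases (hypothetical type)."""
    term: str
    definition: str = ""
    aliases: tuple = ()


def inherit_glossary(parents, overrides=(), removed=()):
    """Build a new glossary (dict keyed by term) from pre-existing glossaries.

    Assumption: when two parents define the same term, the later parent wins.
    Entries in `overrides` replace inherited entries or add new ones;
    terms listed in `removed` are deleted from the result.
    """
    glossary = {}
    for parent in parents:
        glossary.update(parent)
    for entry in overrides:
        glossary[entry.term] = entry
    for term in removed:
        glossary.pop(term, None)
    return glossary


# Illustrative pre-existing glossary attached to a "college" tag.
college = {
    "University of Washington": GlossaryEntry(
        "University of Washington", "Public university in Seattle.", ("UW",)),
    "Washington State University": GlossaryEntry(
        "Washington State University", "Public university in Pullman.", ("WSU",)),
}

# The new glossary inherits from `college`, overrides one entry, and drops another.
my_glossary = inherit_glossary(
    [college],
    overrides=[GlossaryEntry("University of Washington",
                             "My first-choice school.", ("UW", "Udub"))],
    removed=["Washington State University"],
)
```

- Because the inherited entries are copied into a new dictionary, overriding a definition in the new glossary leaves the glossary it inherits from unchanged.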
- the glossary selection module 22 optionally selects a preexisting glossary for use as a template for the new glossary.
- the preexisting glossary can be associated with the same tag as or a different tag than the newly-assigned tag.
- the term selection module 23 selects terms for inclusion in the glossary. The terms can be selected from sources, including the tag repository 14 and user devices 11 - 13 , as well as other sources. Additionally, a user can manually add terms to the glossary. Each term can be preexisting, such as from a preexisting glossary or newly generated. The terms can also be modified or removed, and new terms can be added. Additionally, the definition and aliases for each term can also be added to, edited, or replaced entirely.
- the compiler 24 compiles the selected terms and associated definitions and aliases into the new glossary, which is then associated with the newly-assigned tag. Once generated, glossaries can be used for automatic tagging of untagged documents, search query augmentation, and index generation, which are described below in detail with respect to FIGS. 4-7 .
- the user devices 11 - 13 , glossary generator 21 , and Web server 17 each include components conventionally found in general purpose programmable computing devices, such as a central processing unit, memory, input/output ports, network interfaces, and non-volatile storage, although other components are possible.
- other information sources in lieu of or in addition to the servers, and other information consumers, in lieu of or in addition to the user devices, are possible.
- the user devices 11 - 13 , glossary generator 21 , and Web server 17 can each include one or more modules for carrying out the embodiments disclosed herein.
- the modules can be implemented as a computer program or procedure written as source code in a conventional programming language and presented for execution by the central processing unit as object or byte code.
- the modules could also be implemented in hardware, either as integrated circuitry or burned into read-only memory components.
- the various implementations of the source code and object and byte codes can be held on a computer-readable storage medium, such as a floppy disk, hard drive, digital video disk (DVD), random access memory (RAM), read-only memory (ROM) and similar storage mediums.
- Other types of modules and module functions are possible, as well as other physical hardware components.
- Glossaries provide a context for content tags and assist users and tagging systems by providing a set of terms and definitions for those terms which help to establish the context for the tag, thereby resolving ambiguity in a user-appropriate fashion.
- a user who is researching colleges can tag all Websites and documents reviewed with a tag titled “college.”
- the glossary can include a term, such as a name, for each college that the user is interested in attending, as well as aliases for the college and a brief description of the college, such as location and requirements for admission.
- the user can tag each Website with a tag for a specific college, such as University of Washington, Washington State University, and Eastern Washington University.
- the terms in the glossary for each of the respective tags can include academic information, such as majors, sports teams, and dorms.
- FIG. 2 is a flow diagram showing a method for generating glossaries via inheritance, in accordance with one embodiment.
- Tag inheritance allows a tag to inherit or exclude terms and definitions from one or more glossaries.
- a new tag is assigned to an untagged document (block 31 ).
- the tag can be assigned manually by a user or automatically by a tagging system.
- the tag is typically a keyword or phrase that is used to describe the corresponding document and help facilitate document searches. Other types of tags are possible, including images and colors.
- a determination is made as to whether the tag is associated with a glossary (block 32 ). For example, the user may already have created or associated a glossary with the tag in a prior use of the same tag for a different document. If so, the glossary attached to the tag is automatically associated with the tagged document (block 33 ). In a further example, a determination can be made as to whether the tag term is included in the glossary. If the tag is included as a term or alias in the glossary, that glossary is automatically associated with the document marked with the tag.
- a search for other uses of the same tag, such as by other users, can be made (block 34 ). If other uses of the tag are located (block 35 ), those tags can be examined for attached glossaries (block 36 ). If found, those glossaries associated with the located tags can be offered to the user for building the new glossary for his personal use of the tag (block 37 ). If no other uses of the tag are identified, a new glossary can be generated (block 38 ), such as by inheriting a template of an existing glossary and adding terms, definitions, and aliases.
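- The decision flow of blocks 32 through 38 can be sketched as a single lookup. The function name `resolve_glossary`, the `(user, tag)` keyed store, the returned labels, and the user names are hypothetical and chosen only for illustration.

```python
def resolve_glossary(tag, user, glossaries):
    """Decide where the glossary for a newly assigned tag comes from.

    `glossaries` maps (user, tag) pairs to glossary dicts (term -> definition).
    Returns a (source, candidates) pair.
    """
    # Blocks 32-33: the user already has a glossary for this tag.
    own = glossaries.get((user, tag))
    if own is not None:
        return "existing", [own]
    # Blocks 34-37: look for other users' glossaries for the same tag
    # and offer them as a starting point.
    offered = [glossary for (other_user, other_tag), glossary in glossaries.items()
               if other_tag == tag and other_user != user]
    if offered:
        return "offered", offered
    # Block 38: nothing found, so start a new, empty glossary.
    return "new", [{}]


store = {
    ("alice", "java"): {"applet": "A small program run inside a browser."},
    ("bob", "college"): {"University of Washington": "Public university in Seattle."},
}
```

- With this store, a call for user "alice" and tag "java" reuses her existing glossary, while the same tag assigned by a different user is offered the glossary that "alice" created.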
- Terms, definitions, and aliases of the inherited glossaries can be added, modified, or removed to suit the user's needs.
- an automatic search is conducted for potential definitions.
- the potential definitions can be selected from dictionaries, papers, documents, and other glossaries.
- the potential definitions are then presented to the user who can select one of the definitions or choose to enter a newly-defined definition.
- the definition can be automatically selected by the tagging system.
- portions of one or more definitions can be used to generate a new definition for the term.
- a glossary can be generated by selecting one or more terms from glossaries associated with one or more tagged documents, without first selecting a glossary template, and compiling the selected terms. Modification of the terms, such as by adding, removing, replacing, or changing the terms, definitions, and aliases can occur, as requested by a user.
- a user can manually create the glossary.
- the user selects one or more terms for inclusion in the glossary.
- the user can identify aliases for one or more of the terms.
- the user can provide definitions for the terms.
- the terms, definitions, and aliases are compiled to form the glossary.
- a glossary can be generated using an existing index associated with a tagged document.
- An index includes words or terms, known as headers, and pointers for the headers. The pointers identify the location of information in a document that relates to the header.
- the terms are selected from the index and pointers associated with the index terms, such as page numbers and other locators, are removed.
- a definition is provided automatically or manually by the user for each index term. The definitions can be newly-generated, copied, or modified from another glossary.
- One or more aliases for the terms can be determined and compiled with the index terms and definitions to form the glossary.
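- Generating a glossary from an existing index can be sketched as follows. The index is assumed to be a mapping from header to a list of page-number pointers; the function name `glossary_from_index` and the sample data are illustrative only.

```python
def glossary_from_index(index, definitions=None, aliases=None):
    """Turn index headers into glossary entries, discarding the pointers.

    `index` maps each header to its pointers (for example, page numbers).
    `definitions` and `aliases` are optional mappings supplied by the user
    or by an automatic lookup; a missing definition is left blank.
    """
    definitions = definitions or {}
    aliases = aliases or {}
    return {
        header: {
            "definition": definitions.get(header, ""),
            "aliases": list(aliases.get(header, [])),
        }
        for header in index  # only the headers are kept; the pointers are dropped
    }


index = {"garbage collection": [12, 87], "generics": [45]}
glossary = glossary_from_index(
    index,
    definitions={"generics": "Parameterized types."},
    aliases={"garbage collection": ["GC"]},
)
```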
- a glossary includes information related to the tag, which can provide a context for understanding the meaning of the tag.
- FIG. 3 is a block diagram showing, by way of example, a tag glossary 40 .
- the glossary 40 can include a title box 45 , a list of terms 46 , and a list of definitions 47 that are associated with the terms.
- the title box 45 includes identification 41 of the tag 48 associated with the glossary 40 and user selectable options 42 for rescanning documents in a category of documents, displaying an editable version of the glossary, and searching for documents that are missing from the category. Other user selectable options and displays of the glossary are possible.
- the list of terms 46 includes terms 43 , such as words or phrases that are commonly used in discussion of the document to which the tag refers, and aliases 45 for the terms.
- the aliases 45 can be an alternative reference for a term, such as a nickname for, abbreviation of, or initials for the term 43 .
- the aliases can be listed as separate terms in the list with their own definitions and aliases.
- a specific number of terms 43 can be included in a glossary 40 or alternatively, an unspecified number of terms can be included.
- the list of definitions 47 includes at least one definition 44 for each term 43 in the list. Each definition 44 can include a textual description of the term 43 , such as use, purpose, location, and manufacture.
- the definitions 44 can include hyperlinks and cross-references. Other types of definitions, layouts, and formats for the glossary are possible. At a minimum, the glossary 40 should include terms 43 and definitions 44 . However, in one embodiment, the definition field can be blank.
- the glossaries help tagging systems and users distinguish between tags subjectively created by different users. For example, first and second items are both marked with the tag “java.” However, alone, the term “java” is ambiguous and can refer to a programming language, an island in Indonesia, or coffee. Without further information, the tagging system or user is unable to distinguish which meaning of the word was intended by the user.
- the glossary for java, the programming language, can include terms, such as “garbage collection,” “generics,” “classes,” and “applet.” Meanwhile, the glossary for java, the island, can include terms, such as “Jakarta,” “Mount Merapi,” and “Bengawan Solo River,” while the glossary for java, the coffee, can include the terms “green coffee,” “robusta,” “Arabica,” and “coffee house.” By reviewing the glossary, the tagging system or user can better understand the context surrounding the tag.
- FIG. 4 is a data flow diagram showing uses for tag glossaries.
- the tag glossaries can be used 51 for automatic tagging 52 , augmenting search queries 53 , generating indices 54 , automatic term view 55 , hierarchical tagging 56 , and entity extraction 57 .
- In automatic tagging 52 , the terms of an untagged document can be compared with one or more glossaries associated with tagged documents. Subsequently, the tag or tags associated with the glossary having a closest or highest similarity is assigned to the untagged document. Automatic tagging is further discussed below in detail with reference to FIG. 5 .
- Search query augmentation 53 utilizes the glossaries to supplement a search query by adding additional terms or aliases from the glossaries or qualifying the search results using the glossary terms and aliases. Augmenting search queries is further discussed below in detail with reference to FIG. 6 .
- Indices for the documents can also be generated 54 using the tag glossaries to provide users with a directory of topics located within a particular tagged document. Generating indices is further discussed below in detail with reference to FIG. 7 .
- glossaries can be used to provide an automatic view 55 of the terms and definitions in the glossary during an occurrence of one of the terms in the associated document.
- the automatic term view 55 assists users that are unfamiliar with the text or terms of the document.
- the term, definition, or alias can be displayed in a pop-up message, text box, or highlighted menu option. Other displays of the data are possible.
- the automatic term view can be applied to each occurrence of the term or alias of the term, or only for the first occurrences. In a further embodiment, the automatic term view is only provided for uncommon or frequently misunderstood terms.
- the terms in a glossary can be used as subtopics for the tag to which the glossary belongs. Each subtopic is associated with a subtag, which is in turn associated with a subglossary.
- one tag represents a category, which includes one or more other tags.
- the other tags can also represent other categories, which include further tags.
- the further tags can also represent other categories and so on.
- a category represented by a tag can be identified by adding a link in the glossary entry for that tag, to another glossary, which includes the terms of that category.
- the glossary entry for the tag can include links to two or more other terms in the same glossary, where the other terms are associated with the category represented by the first term.
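- The category structure described above, in which a glossary entry links to further tags, can be sketched as a traversal over those links. The `links` mapping and the function name `descendants` are assumptions; the patent does not specify how the links are stored.

```python
def descendants(tag, links, _seen=None):
    """Return every tag reachable from `tag` through glossary links, depth first.

    `links` maps a tag to the tags named by the entries of its glossary.
    Tags already visited are skipped so that circular links terminate.
    """
    seen = _seen if _seen is not None else {tag}
    found = []
    for child in links.get(tag, []):
        if child in seen:
            continue
        seen.add(child)
        found.append(child)
        found.extend(descendants(child, links, seen))
    return found


links = {
    "college": ["University of Washington", "Washington State University"],
    "University of Washington": ["majors", "sports teams", "dorms"],
    "majors": ["college"],  # a circular link, to show that traversal still ends
}
subtopics = descendants("college", links)
```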
- Entity extraction 57 involves using the terms of the glossary as extra “entities” to identify in documents associated with the tag that corresponds to the glossary.
- regular expressions of text are matched to identify references to people, companies, places, and dates, as well as other entities.
- the references are textually ambiguous and may have multiple expressions, which can make properly forming correct regular expressions difficult.
- entity extraction can be conducted by looking for the terms and their aliases in the glossary, as defined by the entry for the term, instead of having to resolve ambiguous regular expressions.
- for example, for the term “University of Washington,” the aliases can include “UW” and “Udub.” All references to the institution can then be identified using the three references.
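- Glossary-driven entity extraction can be sketched as a lookup of every term and alias in the text. In this sketch the `re` module is used only to find the literal, escaped strings on word boundaries, not to model the entity with a hand-written pattern. The function name `extract_entities` and the sample sentence are hypothetical.

```python
import re


def extract_entities(text, glossary):
    """Find glossary terms and aliases in `text`.

    `glossary` maps each term to a list of its aliases.
    Returns (offset, matched text, canonical term) tuples in order of appearance.
    """
    surface_to_term = {}
    for term, aliases in glossary.items():
        for surface in [term, *aliases]:
            surface_to_term[surface.lower()] = term
    if not surface_to_term:
        return []
    # Longest forms first, so a full name is preferred over a shorter overlap.
    alternatives = sorted(surface_to_term, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.group(0), surface_to_term[m.group(0).lower()])
            for m in pattern.finditer(text)]


text = "She applied to UW; Udub is what locals call the University of Washington."
hits = extract_entities(text, {"University of Washington": ["UW", "Udub"]})
```

- All three surface forms in the sample sentence resolve to the same glossary term, so each reference is attributed to the same institution.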
- FIG. 5 is a flow diagram showing, by way of example, a method for automatically assigning content tags to documents.
- a set of tagged documents is obtained (block 61 ) and at least one untagged document is also obtained (block 62 ).
- Each of the tagged documents is associated with one or more glossaries, each having terms, definitions, and aliases.
- the untagged document is compared (block 63 ) with the glossaries of the tagged documents and a similarity measure between the untagged document and each glossary is determined (block 64 ).
- the similarity measures can be calculated using cosine similarity.
- a vector is determined for each of the tagged documents by identifying the terms in the corresponding glossaries and assigning a weight to each of the terms.
- the vector for the untagged document is determined by identifying terms in the document and assigning a weight to each of the terms.
- the vectors of the untagged document and the glossaries associated with the tagged documents are compared to determine the similarity measure.
- the tag associated with the glossary having the highest similarity is assigned to the untagged document (block 65 ). The higher the similarity measure, the higher the likelihood that the tag associated with the tagged document should be assigned to the untagged document.
- One or more tags can be assigned to a single document.
- a predetermined threshold is applied to the similarity measures.
- the tag associated with the tagged document having the highest similarity measure, which satisfies the threshold, is selected and assigned to the untagged document.
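- The cosine-similarity comparison and threshold can be sketched as follows. The weighting scheme is an assumption: raw word counts are used as weights, since the text does not say how weights are assigned. The names `tokenize`, `cosine`, and `auto_tag`, the tag labels, and the sample document are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def auto_tag(document_text, glossary_terms_by_tag, threshold=0.0):
    """Return the tag whose glossary is most similar to the document.

    Returns None when no glossary overlaps the document or when the best
    score does not satisfy `threshold`.
    """
    doc_vector = Counter(tokenize(document_text))
    scores = {
        tag: cosine(doc_vector,
                    Counter(tok for term in terms for tok in tokenize(term)))
        for tag, terms in glossary_terms_by_tag.items()
    }
    best_tag = max(scores, key=scores.get, default=None)
    if best_tag is None or scores[best_tag] <= 0 or scores[best_tag] < threshold:
        return None
    return best_tag


glossaries = {
    "java (programming language)": ["garbage collection", "generics", "classes", "applet"],
    "java (island)": ["Jakarta", "Mount Merapi", "Bengawan Solo River"],
    "java (coffee)": ["green coffee", "robusta", "Arabica", "coffee house"],
}
doc = "Generics and classes interact with garbage collection in subtle ways."
```

- For the sample document, only the programming-language glossary shares any words with the text, so its tag is selected unless the threshold is set higher than the resulting score.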
- similarity can be determined by identifying common terms. For example, an untagged document and tagged document are compared. More specifically, the terms in the untagged document and the glossary terms of the tagged document are compared to identify those terms in common. The tags of the documents having the most terms in common are assigned to the untagged document.
- a predetermined threshold of terms-in-common can also be applied. For instance, in the example above, a predetermined threshold of 12 terms-in-common is set. If one of the tagged documents has 12 or more terms-in-common with the untagged document, then the tag of that tagged document is assigned. Alternatively, if the predetermined threshold is not satisfied, no tag is assigned to the untagged document. Other methods for determining similarity and threshold values are possible, such as percentages and absolute numbers. Additionally, the predetermined threshold can be based on factors, such as document size, subject matter, and type of document, as well as other factors.
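- The terms-in-common comparison can be sketched with set intersection. The default of 12 follows the example threshold above; the function name `tags_by_common_terms`, the case-insensitive matching, and the sample data are assumptions.

```python
def tags_by_common_terms(document_terms, glossary_terms_by_tag, min_common=12):
    """Return the tag or tags sharing the most terms with the document.

    Only tags whose glossaries share at least `min_common` terms with the
    document are eligible; an empty list means that no tag is assigned.
    """
    doc = {term.lower() for term in document_terms}
    counts = {
        tag: len(doc & {term.lower() for term in terms})
        for tag, terms in glossary_terms_by_tag.items()
    }
    eligible = {tag: n for tag, n in counts.items() if n >= min_common}
    if not eligible:
        return []
    top = max(eligible.values())
    return sorted(tag for tag, n in eligible.items() if n == top)


glossary_terms = {
    "java (island)": ["Jakarta", "Mount Merapi", "Bengawan Solo River"],
    "java (coffee)": ["green coffee", "robusta", "Arabica", "coffee house"],
}
document_terms = ["robusta", "arabica", "espresso", "Jakarta"]
```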
- a search for “apple” may return results for Apple products, licensed by Apple Inc., Cupertino, Calif., for apple, the fruit, or for both.
- Glossaries can provide contextual information, such as terms, definitions, and aliases to supplement the search query.
- adding terms from glossaries associated with the different “apple” tags provides a context for each of the different searches.
- a search for documents containing a specific term is generally limited to the term as indicated in the search query.
- If a search query includes the phrase “University of Washington,” documents with occurrences of other related terms, or aliases, that are commonly used to refer to the University of Washington, such as “UW,” “Udub,” or “Huskies,” may be missed.
- FIG. 6 is a flow diagram showing, by way of example, a method for supplementing a search query with information from a tag glossary.
- a query is obtained or received (block 71 ) and tagged documents are accessed (block 72 ).
- the query can be obtained from a user, a search engine, or a database, as well as another repository.
- the tagged documents can be accessed from a database, file, or other repository, and can include documents recently tagged, such as automatically or by a user. Recently tagged documents can include those that satisfy a particular time threshold, such as being tagged within a particular period of time. As well, the tagged documents can be associated with a particular user, such as the user that entered the query.
- Glossaries associated with the tagged documents are identified (block 73 ) and the query is compared to the glossaries (block 74 ). Subsequently, one or more glossary terms or aliases are selected for inclusion in the query (block 75 ). The glossary terms and aliases can be selected based on a relatedness or similarity to the search query.
- the terms or aliases for inclusion in the query can be selected by locating entries in the glossary that include one or more of the words in the query, such as a term, definition, or alias. Subsequently, the terms, definitions, and aliases of those entries can be used to expand the query. For example, if the query includes the term “UW,” one or more associated aliases, such as “University of Washington,” “Udub,” and “Huskies” may be provided for inclusion in the query. Further, all the glossaries associated with a user submitting the search query can be identified and glossary entries for which a term or alias appears in the query string are identified.
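- The entry lookup and expansion described above can be sketched as follows. Entries are assumed to be dictionaries with `term`, `aliases`, and `definition` keys, and matching is done on whole words, ignoring case; the function names and sample entries are illustrative.

```python
import re


def words(text):
    """Return the set of lowercase alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supplement_query(query, entries):
    """Return terms and aliases to add to `query`.

    An entry is selected when its term, one of its aliases, or its definition
    shares at least one word with the query. Candidates whose words are all
    present in the query already are not added again.
    """
    query_words = words(query)
    additions = []
    for entry in entries:
        fields = [entry["term"], *entry["aliases"], entry.get("definition", "")]
        if not any(query_words & words(field) for field in fields):
            continue
        for candidate in [entry["term"], *entry["aliases"]]:
            if words(candidate) <= query_words or candidate in additions:
                continue
            additions.append(candidate)
    return additions


entries = [
    {"term": "University of Washington",
     "aliases": ["UW", "Udub", "Huskies"],
     "definition": "Public research university in Seattle."},
    {"term": "robusta", "aliases": [], "definition": "A coffee species."},
]
extra = supplement_query("UW admissions", entries)
```

- How the additions are combined with the original query, for example as alternatives joined by OR, depends on the search engine and is left open here.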
- each search term can be looked up in a thesaurus to determine other similar or related terms for the search term and subsequently, the search term and similar terms can be compared with the glossaries to identify terms, definitions, or aliases for adding to the search query.
- glossary terms and aliases can be selected using term frequency-inverse document frequency (“tf-idf”), which generally assumes that the frequency or popularity of a term models the importance of that term. For example, the importance of a term increases the more times the term is identified in a document.
- a set of glossaries associated with the user is identified and a dual-weighted score is constructed for each term or alias in the glossary.
- a standard cosine similarity measure between the query and each glossary entry, including the terms and aliases, is determined, as described above.
- a tf-idf value is determined for each glossary against all documents tagged with tags from any of the glossaries to select the most selective terms in each glossary.
- a product of the cosine similarity score and the tf-idf score is used to select the N most significant glossary terms to add to the query.
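- A minimal sketch of the dual-weighted score is given below, under stated assumptions: word counts serve as vector weights, the tf-idf value uses a smoothed logarithmic inverse document frequency, and term occurrences are counted by simple substring matching. The function names and the glossary layout (term mapped to a list of aliases) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def tf_idf(term: str, docs: List[str]) -> float:
    """Occurrences of the term across docs, scaled by a smoothed inverse document frequency."""
    term = term.lower()
    counts = [doc.lower().count(term) for doc in docs]  # substring counts (assumption)
    tf = sum(counts)
    df = sum(1 for c in counts if c > 0)
    idf = math.log((1 + len(docs)) / (1 + df)) + 1.0
    return tf * idf


def top_terms(query: str, glossary: Dict[str, List[str]], docs: List[str], n: int) -> List[str]:
    """Rank glossary terms by cosine(query, entry) * tf-idf(term) and keep the best n."""
    q = Counter(query.lower().split())
    scored = []
    for term, aliases in glossary.items():
        entry = Counter(" ".join([term, *aliases]).lower().split())
        scored.append((term, cosine(q, entry) * tf_idf(term, docs)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, score in scored[:n] if score > 0]


glossary = {"University of Washington": ["UW", "Udub"], "Stanford University": ["Stanford"]}
tagged_docs = [
    "The University of Washington campus is in Seattle.",
    "Stanford University is in California.",
]
print(top_terms("UW admissions", glossary, tagged_docs, 2))
```

Multiplying the two scores means a term is kept only when it is both close to the query and selective within the tagged documents; terms scoring zero on either factor are dropped.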
- search results are identified using only the search query and the glossaries are subsequently used to refine the results by selecting one or more of the results that are most closely related to the glossaries associated with the tagged documents as recently used tags.
- a query is provided and applied to a corpus of documents.
- One or more of the documents are identified as search results based on a relationship with or a similarity to the query. The similarity can be based on satisfying a predetermined number of search terms in the query or by determining cosine similarity between the search query and the documents. Other measures of similarity are possible.
- One or more tagged documents related to the query are obtained.
- the search results are compared with the glossaries associated with the tagged documents. Those search results that are most closely related, or most similar, to the glossaries are selected for providing to a user as relevant results.
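- The refinement of search results can be sketched as follows, assuming that relatedness is measured by the number of distinct glossary terms or aliases a result mentions. That measure, the function name `rerank`, and the sample data are assumptions made for illustration; cosine similarity or another measure could be substituted.

```python
from typing import List


def rerank(results: List[str], glossary_terms: List[str], keep: int) -> List[str]:
    """Order results by how many distinct glossary terms or aliases each mentions."""

    def score(text: str) -> int:
        lowered = text.lower()
        return sum(1 for term in glossary_terms if term.lower() in lowered)

    # sorted() is stable, so results with equal scores keep their original order.
    ranked = sorted(results, key=score, reverse=True)
    return ranked[:keep]


results = [
    "Apple pie recipe with cinnamon",
    "Apple announces a new iPhone and MacBook",
    "Apple orchard tours",
]
product_glossary = ["iPhone", "MacBook", "Cupertino"]
print(rerank(results, product_glossary, 1))
```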
- Indices are collections of relevant topics and provide a topical context for a document, such as a book or research paper, and are generally located at the back of a document.
- An index includes words or terms, known as headers, and pointers for the headers. The pointers identify the location of information relating to the header in a document.
- Indices allow a user to quickly obtain an overview of topics included in a document, as well as to identify sections of the document relevant to particular topics.
- indices can be used for directing users to a particular index term, or topic, within the document. However, manually generating indices can be difficult and expensive. As well, automatically generated indices can be complex to build.
- FIG. 7 is a flow diagram showing, by way of example, a method for generating an index using a tag glossary.
- Index terms are identified (block 81 ) from a document for which the index will be generated.
- a corpus of tagged documents related to the document to be indexed is obtained (block 82 ) and each tagged document is associated with at least one tag.
- the tagged documents can be obtained from a database, file, or other repository based on a similarity or relevance measure.
- Each tag is associated with a glossary, which includes terms, term definitions, and aliases.
- the index terms are then compared with the glossaries of the tags attached to documents in the corpus (block 83 ).
- aliases are identified (block 84 ) in the glossaries for one or more of the index terms.
- occurrences of the index terms and aliases are identified (block 85 ) in the document.
- the index term and alias occurrences can be identified using a finite state toolkit (“FST”), which identifies patterns within the document using the glossaries, such as described in Karttunen, L., Pattern Matching with FST—a tutorial. PARC Technical Report 2010-1. 2010 Nov. 29.
- the index terms, aliases, and occurrences are compiled into the index (block 86 ), which is incorporated into the document.
- FST allows a user to create networks from text files and regular expressions, as well as apply the networks to input strings or files.
- Each glossary term and associated alias, if any, can form a pattern.
- the patterns are compiled into a finite state machine, which is run over the text of each document. Terms in the documents that match one or more of the patterns are selected as results. Page numbers for each occurrence of the resulting terms in the document are identified and organized to generate the index.
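- The pattern-matching step can be approximated as shown below. This sketch substitutes Python's standard `re` module for the finite state toolkit named above, so it illustrates the flow (one pattern per glossary term and its aliases, run over each page, page numbers collected) rather than FST itself. The function name `build_index` and the representation of a document as a list of page strings are assumptions.

```python
import re
from typing import Dict, List


def build_index(pages: List[str], glossary: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Map each glossary term to the 1-based page numbers where it or an alias occurs."""
    index: Dict[str, List[int]] = {}
    for term, aliases in glossary.items():
        # Longest alternatives first so longer phrases win over shorter ones.
        variants = sorted([term, *aliases], key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
            re.IGNORECASE,
        )
        hits = [number for number, text in enumerate(pages, start=1) if pattern.search(text)]
        if hits:
            index[term] = hits
    return index


pages = [
    "The University of Washington is described in this chapter.",
    "Tuition details.",
    "UW students sometimes call the campus Udub.",
]
glossary = {"University of Washington": ["UW", "Udub"], "Stanford University": ["Stanford"]}
print(build_index(pages, glossary))
```

Occurrences found through an alias are filed under the glossary term, so a reader looking up the term is pointed to every page that refers to it under any name.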
- the index can be generated for a tagged document using the assigned tags as index terms.
- regular expression machinery, such as FST, is applied to find all expressions of the index terms in the document. The terms and locations of the term expressions are then compiled into an index for the document.
- the index can be displayed with terms and a representation of the locations of that term in the associated document.
- the location representation can include page numbers, page thumbnails, and page strips.
- Page strips are representative of the pages in a document and include a visual display of term occurrences.
- the length of the page strip corresponds with the number of pages in a document.
- the term occurrences are indicated in the page strip by location in the document using graphs, colors, bars, or lines. Selecting a location on the page strip opens the document to the corresponding page.
- two or more tagged documents can be compared for similarity based on the glossaries associated with the tags assigned to the tagged documents.
- the similarity can be based on a number of common terms in the glossaries or based on a cosine similarity between the text of the glossaries associated with the tags.
- the glossaries provide for better natural language processing by providing contextual information about a tag and associated document.
- the natural language processing can include distinctions, anaphora resolution, and date resolution, as well as other types of language processing.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Library & Information Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A system and method for supplementing search queries is provided. A set of tagged documents is accessed and each tagged document is associated with a glossary that includes a plurality of entries. Each entry is associated with a term, at least one alias, and a definition for the term. A search query having one or more search words is received. The search query is compared with the glossaries associated with each of the tagged documents. In one or more of the glossaries, at least one entry that includes one of the search words of the search query is identified. A supplemental search query is generated by selecting one of the term, alias, and at least a portion of the definition from one or more of the identified entries for inclusion in the query.
Description
- This non-provisional patent application is a continuation of U.S. patent application Ser. No. 14/697,563, filed Apr. 27, 2015, pending; which is a divisional of U.S. Pat. No. 9,020,950, issued Apr. 28, 2015, the disclosures of which are incorporated by reference.
- This application relates in general to document tagging, and in particular, to a system and method for supplementing search queries.
- The act of “tagging” has greatly increased due to the increase in social networking and use of desktop software. During tagging, a user assigns a tag, such as a word or phrase, to a document, image, or Web page to “mark” the item with an identifying concept. For example, in social networking, users commonly post and tag photographs of themselves and friends using a tag as an identifier for the people in the picture, such as by name, nickname, or user name. The tag provides metadata for the marked item and once assigned, can be used for quickly determining a topic of the document or for identification and retrieval, such as through search queries based on the tag.
- However, conventional tags are sparsely defined, coarse, and ambiguous. Currently, conventional tagging systems are unable to effectively and efficiently distinguish between homonyms, which have different meanings for the same tag, or combine synonyms, which include multiple tags for the same concept. For example, a tag with the word “java” can refer to coffee, an island, or a programming language. Without further information, a user or tagging system is unable to determine the meaning of the tag.
- Further, the assignment of tags is usually subjectively determined by a user, which can result in an assigned tag that may not accurately reflect a topic of a document or that is not understandable by other users. For instance, a user marks a picture of a baby with a tag that provides the name of the baby's mother. The user knows that the tagged person is the mother; however, another user may believe that the tag refers to the baby. In another example, a user assigns a tag “Section IV-Research” to a document regarding sickle cell anemia. The user assigns the particular tag because he is working on a paper about genetic disorders and wants to identify the section of the paper that will cite the document. However, the tag is likely to be misunderstood or confusing to another user. Therefore, conventional tags are limited in use and may only be beneficial for the creator of the tags.
- Thus, a system and method for providing context for a tag, such that similar tags can be distinguished, is needed. Preferably, the tags are refined without requiring retagging of documents.
- Glossaries of terms associated with tags provide a context for the tags assigned to a document. The glossaries of one or more of the tags can be used to automatically tag other documents, augment search queries, or create an index for the document. The terms defined in the glossary can be used themselves as tags to identify sub-topics of the main tag.
- An embodiment provides a system and method for supplementing search queries. A set of tagged documents is accessed and each tagged document is associated with a glossary that includes a plurality of entries. Each glossary entry is associated with a term, at least one alias, and a definition for the term. A search query having one or more search words is received. The search query is compared with the glossaries associated with each of the tagged documents. In one or more of the glossaries, at least one entry that includes one of the search words of the search query is identified. A supplemental search query is generated by selecting one of the term, alias, and at least a portion of the definition from one or more of the identified entries for inclusion in the query.
- Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments of the invention by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.
- FIG. 1 is a system for generating tag glossaries and use thereof, in accordance with one embodiment.
- FIG. 2 is a flow diagram showing a method for generating tag glossaries, in accordance with one embodiment.
- FIG. 3 is a block diagram showing, by way of example, a tag glossary.
- FIG. 4 is a data flow diagram showing uses for tag glossaries.
- FIG. 5 is a flow diagram showing, by way of example, a method for automatically assigning tags to documents.
- FIG. 6 is a flow diagram showing, by way of example, a method for supplementing a search query with information from tag glossaries.
- FIG. 7 is a flow diagram showing, by way of example, a method for generating an index using tag glossaries.
- With the increase in computer use and social networking, document tagging has also increased. Document tagging allows users to assign tags to documents, such as books, articles, pictures, and Web pages for describing a topic or other aspects of the document and for identifying the document during a search. Hereinafter, the terms “document” and “Web page” are used interchangeably with the same intended meaning, unless otherwise indicated. However, tags provide minimal amounts of information and can be coarse and ambiguous. Many tagging systems are unable to understand and resolve the ambiguity to determine a meaning of the assigned tags, which limits the support and analysis that can be provided by the system. Tag glossaries provide a context for the tag from which a user or tagging system can distinguish the meanings of multiple tags.
- A tag represents a topic and the associated glossary provides a context for the topic. Specifically, the glossary includes terms relevant to at least one of the tags, definitions for the terms, and aliases for the terms.
FIG. 1 is a system for generating, updating, and utilizing tag glossaries, in accordance with one embodiment. One or more user devices 11-13 are connected to a Web server 17 via an internetwork 15, such as the Internet. The user devices 11-13 can include a computer, laptop, or mobile device, such as a cellular telephone or personal digital assistant. In general, each user device 11-13 is a Web-enabled device that executes a Web browser, which supports interfacing tools and information exchange with the Web server 17.
- The Web server 17 is interconnected to a Web page database 18 and a tag repository 14. The Web page database 18 stores Web pages 19, which are provided to the user devices 11-13 upon request. The tag repository 14 stores metadata tags 20 associated with the Web pages 19. The tag repository 14 also stores glossaries 16, which are associated with the metadata tags 20. Other types of data and metadata are possible. The glossaries each include terms, term definitions, and aliases of the terms. In a further embodiment, the tags and glossaries can be stored locally on the user devices 11-13, along with documents, such as books, articles, pictures, and Web pages.
- A user tags a document displayed on a Web page, which was obtained from the Web page database 18 via the Web server 17. If the tag is not associated with a glossary, a new glossary can be generated by adding terms, definitions, and aliases associated with the tag. A glossary generator 21 is interconnected to the user devices 11-13 and the Web server 17 via the internetwork 15, and includes a glossary selection module 22, term selection module 23, and compiler 24.
- A new glossary can also be attached to a tag by inheriting the glossary entries of one or more preexisting tags, perhaps belonging to the same user, or one or more other users. The glossary entries can include the terms of the glossary and the definitions and aliases associated with the terms. The user defining the new glossary can modify or update the inherited glossary entries by adding, deleting, and editing terms, as well as “overriding” the existing definition or alias of a term with a new definition or alias for the new glossary. The user can also add new terms and aliases to the glossary that are not present in the glossaries from which the new glossary inherits.
- For example, the glossary selection module 22 optionally selects a preexisting glossary for use as a template for the new glossary. The preexisting glossary can be associated with the same tag as or a different tag than the newly-assigned tag. The term selection module 23 selects terms for inclusion in the glossary. The terms can be selected from sources, including the tag repository 14 and user devices 11-13, as well as other sources. Additionally, a user can manually add terms to the glossary. Each term can be preexisting, such as from a preexisting glossary, or newly generated. The terms can also be modified or removed, and new terms can be added. Additionally, the definition and aliases for each term can also be added to, edited, or replaced entirely. The compiler 24 compiles the selected terms and associated definitions and aliases into the new glossary, which is then associated with the newly-assigned tag. Once generated, glossaries can be used for automatic tagging of untagged documents, search query augmentation, and index generation, which are described below in detail with respect to FIGS. 4-7.
- The user devices 11-13, glossary generator 21, and Web server 17 each include components conventionally found in general purpose programmable computing devices, such as a central processing unit, memory, input/output ports, network interfaces, and non-volatile storage, although other components are possible. Moreover, other information sources in lieu of or in addition to the servers, and other information consumers, in lieu of or in addition to the user devices, are possible.
- Further, the user devices 11-13, glossary generator 21, and Web server 17 can each include one or more modules for carrying out the embodiments disclosed herein. The modules can be implemented as a computer program or procedure written as source code in a conventional programming language and presented for execution by the central processing unit as object or byte code. Alternatively, the modules could also be implemented in hardware, either as integrated circuitry or burned into read-only memory components. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium, such as a floppy disk, hard drive, digital video disk (DVD), random access memory (RAM), read-only memory (ROM) and similar storage media. Other types of modules and module functions are possible, as well as other physical hardware components.
- Associating Glossaries with Tags
- Glossaries provide a context for content tags and assist users and tagging systems by providing a set of terms and definitions for those terms which help to establish the context for the tag, thereby resolving ambiguity in a user-appropriate fashion. For example, a user who is researching colleges can tag all Websites and documents reviewed with a tag titled “college.” The glossary can include a term representative of each of the colleges, such as a name, that the user is interested in attending, as well as aliases for the college and a brief description of the college, such as location and requirements for admission. Alternatively, the user can tag each Website with a tag for a specific college, such as University of Washington, Washington State University, and Eastern Washington University. The terms in the glossary for each of the respective tags can include academic information, such as majors, sports teams, and dorms. Once generated or inherited, the glossary can be maintained, updated, and utilized to provide a context for the tag.
- Additionally, the glossary terms can be used to select more precise tags for a document. For instance, returning to the above example, the user becomes interested in Stanford University and assigns a Website a tag titled “Stanford.” If the tag is a term selected from the glossary associated with the tag “college,” the glossary is automatically associated with the tag “college,” as well as the tag “Stanford.” However, if the tag “Stanford” is not selected from or included in the “college” glossary, but the “college” tag is applied independently to the same Website, an entry for Stanford University can be generated in the glossary for colleges.
FIG. 2 is a flow diagram showing a method for generating glossaries via inheritance, in accordance with one embodiment. Tag inheritance allows a tag to inherit or exclude terms and definitions from one or more glossaries. A new tag is assigned to an untagged document (block 31). The tag can be assigned manually by a user or automatically by a tagging system. The tag is typically a keyword or phrase that is used to describe the corresponding document and help facilitate document searches. Other types of tags are possible, including images and colors. A determination is made as to whether the tag is associated with a glossary (block 32). For example, the user may already have created or associated a glossary with the tag in a prior use of the same tag for a different document. If so, the glossary attached to the tag is automatically associated with the tagged document (block 33). In a further example, a determination can be made as to whether the tag term is included in the glossary. If the tag is included as a term or alias in the glossary, that glossary is automatically associated with the document marked with the tag.
- If the tag is not associated with a glossary, a search for other uses of the same tag, such as by other users, can be made (block 34). If other uses of the tag are located (block 35), those tags can be examined for attached glossaries (block 36). If found, those glossaries associated with the located tags can be offered to the user for building the new glossary for his personal use of the tag (block 37). If no other uses of the tag are identified, a new glossary can be generated (block 38), such as by inheriting a template of an existing glossary and adding terms, definitions, and aliases.
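- One possible reading of glossary inheritance, with exclusion and overriding of entries, is sketched below. The entry layout (`definition` and `aliases` keys), the function name `inherit_glossary`, and the rule that later parent glossaries win on conflicting terms are assumptions made for this sketch.

```python
from typing import Dict, List, Optional

Entry = Dict[str, object]    # hypothetical layout: {"definition": str, "aliases": [str]}
Glossary = Dict[str, Entry]  # term -> entry


def _copy(entry: Entry) -> Entry:
    """Copy an entry so edits to the new glossary never alter a parent glossary."""
    return {"definition": entry["definition"], "aliases": list(entry["aliases"])}


def inherit_glossary(
    parents: List[Glossary],
    overrides: Optional[Glossary] = None,
    excluded: Optional[List[str]] = None,
) -> Glossary:
    """Build a new glossary from parent glossaries, then exclude and override entries."""
    new: Glossary = {}
    for parent in parents:  # later parents win on conflicting terms (assumption)
        for term, entry in parent.items():
            new[term] = _copy(entry)
    for term in excluded or []:
        new.pop(term, None)
    for term, entry in (overrides or {}).items():  # user-supplied replacements and additions
        new[term] = _copy(entry)
    return new


college = {
    "University of Washington": {"definition": "inherited definition", "aliases": ["UW"]},
    "Washington State University": {"definition": "inherited definition", "aliases": []},
}
mine = inherit_glossary(
    [college],
    overrides={"University of Washington": {"definition": "my notes", "aliases": ["UW", "Udub"]}},
    excluded=["Washington State University"],
)
print(mine)
```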
- Terms, definitions, and aliases of the inherited glossaries, such as a glossary already associated with the tag or a glossary associated with a similar or related tag, can be added, modified, or removed to suit the user's needs. In one embodiment, upon adding a new term to the glossary, an automatic search is conducted for potential definitions. The potential definitions can be selected from dictionaries, papers, documents, and other glossaries. The potential definitions are then presented to the user who can select one of the definitions or choose to enter a newly-defined definition. Alternatively, the definition can be automatically selected by the tagging system. Also, portions of one or more definitions can be used to generate a new definition for the term.
- In a further embodiment, a glossary can be generated by selecting one or more terms from glossaries associated with one or more tagged documents, without first selecting a glossary template, and compiling the selected terms. Modification of the terms, such as by adding, removing, replacing, or changing the terms, definitions, and aliases can occur, as requested by a user.
- In yet a further embodiment, a user can manually create the glossary. The user selects one or more terms for inclusion in the glossary. Next, the user can identify aliases for the one or more of the terms. Finally, the user can provide definitions for the terms. The terms, definitions, and aliases are compiled to form the glossary.
- In a still further embodiment, a glossary can be generated using an existing index associated with a tagged document. An index includes words or terms, known as headers, and pointers for the headers. The pointers identify the location of information in a document that relates to the header. To generate the glossary, the terms are selected from the index and pointers associated with the index terms, such as page numbers and other locators, are removed. A definition is provided automatically or manually by the user for each index term. The definitions can be newly-generated, copied, or modified from another glossary. One or more aliases for the terms can be determined and compiled with the index terms and definitions to form the glossary.
- A glossary includes information related to the tag, which can provide a context for understanding the meaning of the tag.
FIG. 3 is a block diagram showing, by way of example, a tag glossary 40. The glossary 40 can include a title box 45, a list of terms 46, and a list of definitions 47 that are associated with the terms. The title box 45 includes identification 41 of the tag 48 associated with the glossary 40 and user-selectable options 42 for rescanning documents in a category of documents, displaying an editable version of the glossary, and searching for documents that are missing from the category. Other user-selectable options and displays of the glossary are possible.
- The list of terms 46 includes terms 43, such as words or phrases that are commonly used in discussion of the document to which the tag refers, and aliases 45 for the terms. The aliases 45 can be an alternative reference for a term, such as a nickname for, abbreviation of, or initials for the term 43. In one embodiment, the aliases can be listed as separate terms in the list with their own definitions and aliases. A specific number of terms 43 can be included in a glossary 40 or alternatively, an unspecified number of terms can be included. The list of definitions 47 includes at least one definition 44 for each term 43 in the list. Each definition 44 can include a textual description of the term 43, such as use, purpose, location, and manufacture. As well, the definitions 44 can include hyperlinks and cross-references. Other types of definitions, layouts, and formats for the glossary are possible. At a minimum, the glossary 40 should include terms 43 and definitions 44. However, in one embodiment, the definition field can be blank.
- The glossaries help tagging systems and users distinguish between tags subjectively created by different users. For example, first and second items are both marked with the tag “java.” However, alone, the term “java” is ambiguous and can refer to a programming language, an island in Indonesia, or coffee. Without further information, the tagging system or user is unable to distinguish which meaning of the word was intended by the user. The glossary for java, the programming language, can include terms, such as “garbage collection,” “generics,” “classes,” and “applet.” Meanwhile, the glossary for java, the island, can include terms, such as “Jakarta,” “Mount Merapi,” and “Bengawan Solo River,” while the glossary for java, the coffee, can include the terms “green coffee,” “robusta,” “Arabica,” and “coffee house.” By reviewing the glossary, the tagging system or user can better understand the context surrounding the tag.
- The glossary-enhanced tags can be used to tag further documents and provide additional information regarding the documents.
FIG. 4 is a data flow diagram showing uses for tag glossaries. The tag glossaries can be used 51 for automatic tagging 52, augmenting search queries 53, generating indices 54, automatic term view 55, hierarchical tagging 56, and entity extraction 57. For automatic tagging 52, the terms of an untagged document can be compared with one or more glossaries associated with tagged documents. Subsequently, the tag or tags associated with the glossary having a closest or highest similarity is assigned to the untagged document. Automatic tagging is further discussed below in detail with reference to FIG. 5. Search query augmentation 53 utilizes the glossaries to supplement a search query by adding additional terms or aliases from the glossaries or qualifying the search results using the glossary terms and aliases. Augmenting search queries is further discussed below in detail with reference to FIG. 6. Indices for the documents can also be generated 54 using the tag glossaries to provide users with a directory of topics located within a particular tagged document. Generating indices is further discussed below in detail with reference to FIG. 7.
- Additionally, glossaries can be used to provide an automatic view 55 of the terms and definitions in the glossary during an occurrence of one of the terms in the associated document. The automatic term view 55 assists users who are unfamiliar with the text or terms of the document. The term, definition, or alias can be displayed in a pop-up message, text box, or highlighted menu option. Other displays of the data are possible. The automatic term view can be applied to each occurrence of the term or alias of the term, or only to the first occurrence. In a further embodiment, the automatic term view is only provided for uncommon or frequently misunderstood terms.
- In hierarchical tagging 56, the terms in a glossary can be used as subtopics for the tag to which the glossary belongs. Each subtopic is associated with a subtag, which is in turn associated with a subglossary. Specifically, in hierarchical tagging, one tag represents a category, which includes one or more other tags. The other tags can also represent other categories, which include further tags. In turn, the further tags can also represent other categories and so on. A category represented by a tag can be identified by adding a link in the glossary entry for that tag, to another glossary, which includes the terms of that category. In a further embodiment, the glossary entry for the tag can include links to two or more other terms in the same glossary, where the other terms are associated with the category represented by the first term.
- Entity extraction 57 involves using the terms of the glossary as extra “entities” to identify in documents associated with the tag that corresponds to the glossary. During entity extraction, regular expressions of text are matched to identify references to people, companies, places, and dates, as well as other entities. Generally, the references are textually ambiguous and may have multiple expressions, which can make properly forming correct regular expressions difficult. However, by including terms in a glossary associated with the document as entities, entity extraction can be conducted by looking for the terms and their aliases in the glossary, as defined by the entry for the term, instead of having to resolve ambiguous regular expressions. Thus, given a glossary entry for a place “University of Washington,” the aliases can include “UW” and “Udub.” All references to the institution can be identified using the three references.
- Automatic Tagging
- Automatically tagging documents can be extremely difficult using conventional tagging systems since tags can be general and ambiguous and the tagging systems are unable to distinguish between the meanings of tags. Glossaries of terms associated with the tag can be used to distinguish the meanings between tags and further assign a tag to an untagged item.
FIG. 5 is a flow diagram showing, by way of example, a method for automatically assigning content tags to documents. A set of tagged documents is obtained (block 61) and at least one untagged document is also obtained (block 62). Each of the tagged documents is associated with one or more glossaries, each having terms, definitions, and aliases. The untagged document is compared (block 63) with the glossaries of the tagged documents and a similarity measure between the untagged document and each glossary is determined (block 64).
- The similarity measures can be calculated using cosine similarity. A vector is determined for each of the tagged documents by identifying the terms in the corresponding glossaries and assigning a weight to each of the terms. The vector for the untagged document is determined by identifying terms in the document and assigning a weight to each of the terms. Once determined, the vectors of the untagged document and the glossaries associated with the tagged documents are compared to determine the similarity measure. The tag associated with the glossary having the highest similarity is assigned to the untagged document (block 65). The higher the similarity measure, the higher the likelihood that the tag associated with the tagged document should be assigned to the untagged document. One or more tags can be assigned to a single document.
- In a further embodiment, a predetermined threshold is applied to the similarity measures. The tag associated with the tagged document having the highest similarity measure, which satisfies the threshold, is selected and assigned to the untagged document.
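- The cosine-based assignment with a threshold can be sketched as follows, assuming raw word counts as the term weights and a single glossary term list per tag. The function names, the tag labels, and the sample document are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional


def vectorize(text: str) -> Counter:
    """Bag-of-words vector; each word's weight is its raw count (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def assign_tag(
    document: str, tag_glossaries: Dict[str, List[str]], threshold: float = 0.0
) -> Optional[str]:
    """Return the tag whose glossary is most similar to the document, if it meets the threshold."""
    doc_vec = vectorize(document)
    scores = {
        tag: cosine(doc_vec, vectorize(" ".join(terms)))
        for tag, terms in tag_glossaries.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # A similarity of zero never earns a tag, whatever the threshold.
    if scores[best] > 0 and scores[best] >= threshold:
        return best
    return None


java_glossaries = {
    "java-language": ["garbage collection", "generics", "classes", "applet"],
    "java-island": ["Jakarta", "Mount Merapi", "Bengawan Solo River"],
}
notes = "Travel notes: Jakarta, Mount Merapi, and the Bengawan Solo River"
print(assign_tag(notes, java_glossaries))
```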
- In yet a further embodiment, similarity can be determined by identifying common terms. For example, an untagged document and tagged document are compared. More specifically, the terms in the untagged document and the glossary terms of the tagged document are compared to identify those terms in common. The tags of the documents having the most terms in common are assigned to the untagged document. A predetermined threshold of terms-in-common can also be applied. For instance, in the example above, a predetermined threshold of 12 terms-in-common is set. If one of the tagged documents has 12 or more terms-in-common with the untagged document, then the tag of that tagged document is assigned. Alternatively, if the predetermined threshold is not satisfied, no tag is assigned to the untagged document. Other methods for determining similarity and threshold values are possible, such as percentages and absolute numbers. Additionally, the predetermined threshold can be based on factors, such as document size, subject matter, and type of document, as well as other factors.
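The terms-in-common comparison can be sketched in a few lines of Python. The names `terms_in_common` and `tags_by_common_terms`, the case-insensitive matching, and the dictionary of glossaries keyed by tag are assumptions for illustration; the default of 12 mirrors the example threshold in the text.

```python
def terms_in_common(document_terms, glossary_terms):
    """Case-insensitive set of terms shared by a document and a glossary."""
    return {t.lower() for t in document_terms} & {t.lower() for t in glossary_terms}


def tags_by_common_terms(document_terms, glossaries, min_common=12):
    """Return the tag(s) whose glossary shares the most terms with the document.

    glossaries maps a tag to its glossary terms (hypothetical structure).
    An empty list means no glossary met the min_common threshold,
    so no tag is assigned.
    """
    counts = {
        tag: len(terms_in_common(document_terms, terms))
        for tag, terms in glossaries.items()
    }
    qualifying = {tag: n for tag, n in counts.items() if n >= min_common}
    if not qualifying:
        return []
    top = max(qualifying.values())
    return sorted(tag for tag, n in qualifying.items() if n == top)
```

Returning a list rather than a single tag allows for ties, consistent with the statement that more than one tag can be assigned to a document.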
- Query Augmentation
- Often, document searches are limited by the query provided. For example, a search for “apple” may return results for Apple products, licensed by Apple Inc., Cupertino, Calif., for apple, the fruit, or for both. Glossaries can provide contextual information, such as terms, definitions, and aliases, to supplement the search query. In the above example, adding terms from glossaries associated with the different “apple” tags provides a context for each of the different searches. Additionally, a search for documents containing a specific term is generally limited to the term as indicated in the search query. For example, if the search query includes the phrase “University of Washington,” documents that use only related terms, or aliases, commonly used to refer to the University of Washington, such as “UW,” “Udub,” or “Huskies,” may be missed.
- As described above, query augmentation can be used to obtain focused and directed search results.
FIG. 6 is a flow diagram showing, by way of example, a method for supplementing a search query with information from a tag glossary. A query is obtained or received (block 71) and tagged documents are accessed (block 72). The query can be obtained from a user, a search engine, a database, or another repository. The tagged documents can be accessed from a database, file, or other repository, and can include documents recently tagged, such as automatically or by a user. Recently tagged documents can include those that satisfy a particular time threshold, such as being tagged within a particular period of time. As well, the tagged documents can be associated with a particular user, such as the user that entered the query. Glossaries associated with the tagged documents are identified (block 73) and the query is compared to the glossaries (block 74). Subsequently, one or more glossary terms or aliases are selected for inclusion in the query (block 75). The glossary terms and aliases can be selected based on a relatedness or similarity to the search query.
- Specifically, the terms or aliases for inclusion in the query can be selected by locating entries in the glossary in which a term, definition, or alias includes one or more of the words in the query. Subsequently, the terms, definitions, and aliases of those entries can be used to expand the query. For example, if the query includes the term “UW,” the associated term “University of Washington” and one or more other aliases, such as “Udub” and “Huskies,” may be provided for inclusion in the query. Further, all the glossaries associated with a user submitting the search query can be identified, and glossary entries for which a term or alias appears in the query string are identified. When a term appears in the query string, its aliases can be added; when an alias appears, the corresponding term and any other aliases can be added.
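The selection in blocks 73 to 75 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the `GlossaryEntry` class, the `contains_phrase` helper, and the `expand_terms` function are hypothetical names, and matching is limited to whole terms or aliases appearing in the query string (definitions are carried on the entry but not searched here).

```python
import re
from dataclasses import dataclass, field


@dataclass
class GlossaryEntry:
    """One glossary entry: a term, its aliases, and a definition."""
    term: str
    aliases: list = field(default_factory=list)
    definition: str = ""

    def names(self):
        return [self.term, *self.aliases]


def contains_phrase(text, phrase):
    """True if phrase occurs in text as a whole word or phrase, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None


def expand_terms(query, entries):
    """Return the terms and aliases to add to the query.

    An entry is selected when its term or any alias appears in the query;
    every other name on that entry is then offered for inclusion.
    """
    additions = []
    for entry in entries:
        if any(contains_phrase(query, name) for name in entry.names()):
            for name in entry.names():
                if not contains_phrase(query, name) and name not in additions:
                    additions.append(name)
    return additions
```

For the query “UW admissions” and a single entry for “University of Washington” with aliases “UW,” “Udub,” and “Huskies,” the sketch proposes the term and the two aliases not already present. How the additions are combined with the original query (for example, as alternatives) depends on the search engine and is left open.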
In a further embodiment, all the glossaries associated with a particular document, such as a document being read by the user, are searched for glossary entries to augment a query. Also, synonyms of each search term can be identified and used to augment the query. For example, a search term is selected from the query and looked up in a thesaurus to determine other similar terms, which are added to the query.
- In yet a further embodiment, each search term can be looked up in a thesaurus to determine other similar or related terms for the search term and subsequently, the search term and similar terms can be compared with the glossaries to identify terms, definitions, or aliases for adding to the search query.
- In yet an even further embodiment, glossary terms and aliases can be selected using term frequency-inverse document frequency (“tf-idf”), which generally assumes that the frequency or popularity of a term models the importance of that term. For example, the importance of a term increases the more times the term is identified in a document, offset by how many documents in the collection contain the term. A set of glossaries associated with the user is identified and a dual-weighted score is constructed for each term or alias in the glossary. First, a standard cosine similarity measure between the query and each glossary entry, including the terms and aliases, is determined, as described above. Next, a tf-idf value is determined for each glossary against all documents tagged with tags from any of the glossaries to select the most selective terms in each glossary. The product of the cosine similarity score and the tf-idf score is used to select the N most significant glossary terms to add to the query.
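One way to read the dual-weighted score is sketched below in Python. The text does not fix the exact tf-idf formulation, so the smoothed idf used here, the summing of term counts across the tagged documents, the tuple layout of `entries`, and the names `tf_idf` and `top_terms` are all assumptions.

```python
import math
import re
from collections import Counter


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm = math.sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def tf_idf(term, documents):
    """Total occurrences of term across documents times a smoothed idf (assumed form)."""
    pattern = re.compile(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)")
    counts = [len(pattern.findall(doc.lower())) for doc in documents]
    doc_freq = sum(1 for c in counts if c)
    idf = math.log((1 + len(documents)) / (1 + doc_freq)) + 1
    return sum(counts) * idf


def top_terms(query, entries, documents, n=3):
    """Rank glossary terms by cosine(query, entry) * tf-idf(term) and keep the top n.

    entries is a list of (term, aliases, definition) tuples; documents are the
    texts of the tagged documents. Both structures are hypothetical.
    """
    query_vec = Counter(words(query))
    scored = []
    for term, aliases, definition in entries:
        entry_vec = Counter(words(" ".join([term, *aliases, definition])))
        score = cosine(query_vec, entry_vec) * tf_idf(term, documents)
        if score > 0:
            scored.append((score, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:n]]
```

Multiplying the two scores means a term is selected only if its entry is related to the query and the term actually occurs in the tagged documents; either factor being zero removes the term from consideration.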
- In still a further embodiment, search results are first identified using only the search query, and the glossaries are subsequently used to refine the results by selecting one or more of the results that are most closely related to the glossaries associated with the tagged documents, such as the glossaries of recently used tags. Specifically, a query is provided and applied to a corpus of documents. One or more of the documents are identified as search results based on a relationship with or a similarity to the query. The similarity can be based on satisfying a predetermined number of search terms in the query or on the cosine similarity between the search query and the documents. Other measures of similarity are possible. One or more tagged documents related to the query are obtained. The search results are compared with the glossaries associated with the tagged documents. Those search results that are most closely related, or most similar, to the glossaries are selected for providing to a user as relevant results.
- Index Generation
- Indices are collections of relevant topics that provide a topical context for a document, such as a book or research paper, and are generally located at the back of the document. An index includes words or terms, known as headers, and pointers for the headers. The pointers identify the location of information relating to the header in a document. Indices allow a user to quickly obtain an overview of topics included in a document, as well as to identify sections of the document relevant to particular topics. Also, indices can be used for directing users to a particular index term, or topic, within the document. However, manually generating indices can be difficult and expensive, and automatically generated indices can be complex to build.
FIG. 7 is a flow diagram showing, by way of example, a method for generating an index using a tag glossary. Index terms are identified (block 81) from a document for which the index will be generated. A corpus of tagged documents related to the document to be indexed is obtained (block 82) and each tagged document is associated with at least one tag. The tagged documents can be obtained from a database, file, or other repository based on a similarity or relevance measure. Each tag is associated with a glossary, which includes terms, term definitions, and aliases. The index terms are then compared with the glossaries of the tags attached to documents in the corpus (block 83). During the comparison (block 83), aliases are identified (block 84) in the glossaries for one or more of the index terms. Subsequently, occurrences of the index terms and aliases are identified (block 85) in the document. The index term and alias occurrences can be identified using a finite state toolkit (“FST”), which identifies patterns within the document using the glossaries, such as described in Karttunen, L., Pattern Matching with FST - a tutorial. PARC Technical Report 2010-1. 2010 Nov. 29. Finally, the index terms, aliases, and occurrences are compiled into the index (block 86), which is incorporated into the document.
- FST allows a user to create networks from text files and regular expressions, as well as apply the networks to input strings or files. Each glossary term and associated alias, if any, can form a pattern. The patterns are compiled into a finite state machine, which is run over the text of each document. Terms in the documents that match one or more of the patterns are selected as results. Page numbers for each occurrence of the resulting terms in the document are identified and organized to generate the index.
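The pattern-matching step can be illustrated with a short Python sketch. It substitutes Python's standard `re` module for the FST toolkit named above, so it is an analogy rather than the described machinery; the function names `compile_pattern` and `build_index`, the page-per-string input, and the case-insensitive matching are assumptions.

```python
import re


def compile_pattern(term, aliases):
    """Compile a term and its aliases into one whole-phrase, case-insensitive pattern."""
    # Longer names first so a long term is preferred over a shorter alias
    # that happens to be a prefix of it.
    names = sorted([term, *aliases], key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def build_index(pages, glossary):
    """Map each glossary term to the page numbers where it or an alias occurs.

    pages is a list of page texts with page 1 first; glossary maps a term to
    its list of aliases. Terms with no occurrences are left out of the index.
    """
    index = {}
    for term, aliases in glossary.items():
        pattern = compile_pattern(term, aliases)
        hits = [
            number
            for number, text in enumerate(pages, start=1)
            if pattern.search(text)
        ]
        if hits:
            index[term] = hits
    return dict(sorted(index.items()))
```

Occurrences of an alias are filed under the glossary term, so a page that mentions only “Udub” still appears under the “University of Washington” header.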
- In a further embodiment, the index can be generated for a tagged document using the assigned tags as index terms. Once identified, regular expression machinery, such as FST, is applied to find all expressions of the index terms in the document. The terms and locations of the term expressions are then compiled into an index for the document.
- Once generated, the index can be displayed with terms and a representation of the locations of each term in the associated document. For example, the location representation can include page numbers, page thumbnails, and page strips. Page strips are representative of the pages in a document and include a visual display of term occurrences. The length of the page strip corresponds with the number of pages in a document. The term occurrences are indicated in the page strip by location in the document using graphs, colors, bars, or lines. Selecting a location on the page strip opens the document to the corresponding page.
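A page strip can be approximated in plain text with one character per page. The sketch below is only a stand-in for the graphical display described above; the `page_strip` function and its marker characters are hypothetical.

```python
def page_strip(page_count, occurrence_pages, mark="|", blank="."):
    """Render a text page strip: one character per page, marked where the term occurs."""
    hits = set(occurrence_pages)
    return "".join(
        mark if page in hits else blank for page in range(1, page_count + 1)
    )
```

For a five-page document in which a term occurs on pages 1 and 3, `page_strip(5, [1, 3])` yields `|.|..`, so the strip length tracks the page count and each mark's position corresponds to a page.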
- In a further embodiment, two or more tagged documents can be compared for similarity based on the glossaries associated with the tags assigned to the tagged documents. The similarity can be based on a number of common terms in the glossaries or based on a cosine similarity between the text of the glossaries associated with the tags.
- In yet a further embodiment, the glossaries provide for better natural language processing by providing contextual information about a tag and associated document. The natural language processing can include distinctions, anaphora resolution, and date resolution, as well as other types of language processing.
- While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope of the invention.
Claims (20)
1. A system for supplementing search queries, comprising:
tagged documents each associated with a glossary comprising a plurality of entries, wherein each entry comprises a term, at least one alias, and a definition for the term;
a receipt module to receive a search query comprising one or more search words;
a comparison module to compare the search query with the glossaries associated with each of the tagged documents;
an identification module to identify in one or more of the glossaries at least one entry that includes one of the search words of the search query; and
a query module to generate a supplemental search query by selecting one of the term, alias, and at least a portion of the definition from one or more of the identified entries for inclusion in the query.
2. A system according to claim 1 , further comprising:
a query augmentation module to select one of a further term and alias for inclusion in the search query by identifying a set of glossaries related to a user of the search query, determining an N number of most significant terms from the related glossaries, and adding the N most significant related glossary terms to the search query.
3. A system according to claim 1 , wherein inclusion of the selected term, alias, or definition in the search query facilitates at least one of disambiguating one or more of the search words and broadening the search query.
4. A system according to claim 1 , wherein the tagged documents are each associated with a user that provided the search query.
5. A system according to claim 1 , further comprising:
a term comparison module to compare one or more of the search words to a thesaurus and to determine one or more related terms in the thesaurus for at least one of the search words;
a glossary comparison module to compare the related terms from the thesaurus with the glossaries associated with each tagged document and to identify in at least one glossary one or more entries associated with one of the related terms; and
a query augmentation module to select for inclusion in the search query one of the term, alias, and a portion of the definition from at least one of the entries associated with the related terms.
6. A system according to claim 1 , further comprising:
a query execution module to apply the supplemental search query to a set of documents comprising tagged and untagged documents and to provide results of the supplemental search query comprising one or more of the tagged and untagged documents.
7. A system according to claim 6 , further comprising:
a display module to display one of the tagged document results and to automatically provide a view of one of the terms in the glossary associated with the displayed document during an occurrence of that term in the document.
8. A system according to claim 6 , further comprising:
a document access module to access one of the untagged document results;
a document comparison module to compare the untagged document result to the glossaries of the tagged documents, wherein each of the glossaries is associated with a tag for the tagged document;
a similarity module to calculate a similarity measure between the untagged document result and each glossary; and
a tag assignment module to assign to the untagged document result a tag associated with the glossary having the highest similarity measure.
9. A system according to claim 1 , further comprising:
an index module to generate an index for at least one of the tagged documents, comprising:
index terms selected from the tagged document;
a corpus of tagged documents each associated with a glossary and related to the tagged document;
a term comparison module to compare the index terms with the glossaries of the tagged documents in the corpus;
an alias identification module to identify an alias for one or more of the index terms based on the comparison;
an occurrence module to determine occurrences in the tagged document for each of the index terms and aliases; and
a compiler to compile the index terms, aliases, and occurrences into the index for that tagged document.
10. A system according to claim 1 , wherein each glossary is associated with a tag and the terms in that glossary are subtopics for the tag and further wherein each subtopic is associated with a subtag that is associated with a subglossary.
11. A method for supplementing search queries, comprising:
accessing tagged documents each associated with a glossary comprising a plurality of entries, wherein each entry comprises a term, at least one alias, and a definition for the term;
receiving a search query comprising one or more search words;
comparing the search query with the glossaries associated with each of the tagged documents;
identifying in one or more of the glossaries at least one entry that includes one of the search words of the search query; and
generating a supplemental search query by selecting one of the term, alias, and at least a portion of the definition from one or more of the identified entries for inclusion in the query.
12. A method according to claim 11 , further comprising:
selecting one of a further term and alias for inclusion in the search query, comprising:
identifying a set of glossaries related to a user of the search query;
determining an N number of most significant terms from the related glossaries; and
adding the N most significant related glossary terms to the search query.
13. A method according to claim 1 , wherein inclusion of the selected term, alias, or definition in the search query facilitates at least one of disambiguating one or more of the search words and broadening the search query.
14. A method according to claim 11 , wherein the tagged documents are associated with a user that provided the search query.
15. A method according to claim 11 , further comprising:
comparing one or more of the search words to a thesaurus;
determining one or more related terms in the thesaurus for at least one of the search words;
comparing the related terms from the thesaurus with the glossaries associated with each tagged document;
identifying one or more entries in at least one glossary associated with the related terms; and
selecting for inclusion in the search query one of the term, alias, and a portion of the definition from at least one of the entries associated with the related terms.
16. A method according to claim 11 , further comprising:
applying the supplemental search query to a set of documents comprising tagged and untagged documents; and
providing results of the supplemental search query comprising one or more of the tagged and untagged documents.
17. A method according to claim 16 , further comprising:
displaying one of the tagged document results; and
automatically providing a view of one of the terms in the glossary associated with the displayed document during an occurrence of that term in the document.
18. A method according to claim 16 , further comprising:
accessing one of the untagged document results;
comparing the untagged document result to the glossaries of the tagged documents, wherein each of the glossaries is associated with a tag for the tagged document;
calculating a similarity measure between the untagged document result and each glossary; and
assigning to the untagged document result a tag associated with the glossary having the highest similarity measure.
19. A method according to claim 11 , further comprising:
generating an index for at least one of the tagged documents, comprising:
identifying index terms within the tagged document;
obtaining a corpus of tagged documents each associated with a glossary and related to the tagged document;
comparing the index terms with the glossaries of the tagged documents in the corpus;
identifying an alias for one or more of the index terms based on the comparison;
determining occurrences in the tagged document for each of the index terms and aliases; and
compiling the index terms, aliases, and occurrences into the index for that tagged document.
20. A method according to claim 11 , wherein each glossary is associated with a tag and the terms in that glossary are subtopics for the tag and further wherein each subtopic is associated with a subtag that is associated with a subglossary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/057,061 US20160179931A1 (en) | 2011-12-19 | 2016-02-29 | System And Method For Supplementing Search Queries |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/330,488 US9020950B2 (en) | 2011-12-19 | 2011-12-19 | System and method for generating, updating, and using meaningful tags |
US14/697,563 US9275062B2 (en) | 2011-12-19 | 2015-04-27 | Computer-implemented system and method for augmenting search queries using glossaries |
US15/057,061 US20160179931A1 (en) | 2011-12-19 | 2016-02-29 | System And Method For Supplementing Search Queries |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/697,563 Continuation US9275062B2 (en) | 2011-12-19 | 2015-04-27 | Computer-implemented system and method for augmenting search queries using glossaries |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160179931A1 (en) | 2016-06-23 |
Family
ID=48611249
Family Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/330,488 Expired - Fee Related US9020950B2 (en) | 2011-12-19 | 2011-12-19 | System and method for generating, updating, and using meaningful tags |
US14/697,563 Expired - Fee Related US9275062B2 (en) | 2011-12-19 | 2015-04-27 | Computer-implemented system and method for augmenting search queries using glossaries |
US15/057,061 Abandoned US20160179931A1 (en) | 2011-12-19 | 2016-02-29 | System And Method For Supplementing Search Queries |
Family Applications Before (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/330,488 Expired - Fee Related US9020950B2 (en) | 2011-12-19 | 2011-12-19 | System and method for generating, updating, and using meaningful tags |
US14/697,563 Expired - Fee Related US9275062B2 (en) | 2011-12-19 | 2015-04-27 | Computer-implemented system and method for augmenting search queries using glossaries |
Country Status (1)
Country | Link |
---|---|
US (3) | US9020950B2 (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140081969A1 (en) * | 2012-09-19 | 2014-03-20 | Deep River Ventures, Llc | Methods, Systems, and Program Products for Locating Tagged Resources in a Resource Scope |
US20140081967A1 (en) * | 2012-09-19 | 2014-03-20 | Deep River Ventures, Llc | Methods, Systems, and Program Products for Distinguishing Tags for a Resource |
US20140365486A1 (en) * | 2012-09-19 | 2014-12-11 | Cedar Point Partners, Llc | Methods, systems, and computer program products for tagging a resource |
US20140081968A1 (en) * | 2012-09-19 | 2014-03-20 | Deep River Ventures, Llc | Methods, Systems, and Program Products for Automatically Managing Tagging of a Resource |
US20140081624A1 (en) * | 2012-09-19 | 2014-03-20 | Deep River Ventures, Llc | Methods, Systems, and Program Products for Navigating Tagging Contexts |
US20140081981A1 (en) * | 2012-09-19 | 2014-03-20 | Deep River Ventures, Llc | Methods, Systems, and Program Products for Identifying a Matched Tag Set |
US20140081966A1 (en) * | 2012-09-19 | 2014-03-20 | Deep River Ventures, Llc | Methods, Systems, and Program Products for Tagging a Resource |
US10277945B2 (en) * | 2013-04-05 | 2019-04-30 | Lenovo (Singapore) Pte. Ltd. | Contextual queries for augmenting video display |
US9311300B2 (en) | 2013-09-13 | 2016-04-12 | International Business Machines Corporation | Using natural language processing (NLP) to create subject matter synonyms from definitions |
US20150127323A1 (en) * | 2013-11-04 | 2015-05-07 | Xerox Corporation | Refining inference rules with temporal event clustering |
US9613012B2 (en) * | 2013-11-25 | 2017-04-04 | Dell Products L.P. | System and method for automatically generating keywords |
US9304657B2 (en) * | 2013-12-31 | 2016-04-05 | Abbyy Development Llc | Audio tagging |
US10380204B1 (en) * | 2014-02-12 | 2019-08-13 | Pinterest, Inc. | Visual search |
US9633115B2 (en) * | 2014-04-08 | 2017-04-25 | International Business Machines Corporation | Analyzing a query and provisioning data to analytics |
US11379781B2 (en) * | 2014-06-27 | 2022-07-05 | o9 Solutions, Inc. | Unstructured data processing in plan modeling |
US10614400B2 (en) | 2014-06-27 | 2020-04-07 | o9 Solutions, Inc. | Plan modeling and user feedback |
US10268667B1 (en) * | 2014-07-23 | 2019-04-23 | Evernote Corporation | Contextual optimization of news streams associated with content entry |
US11126592B2 (en) | 2014-09-02 | 2021-09-21 | Microsoft Technology Licensing, Llc | Rapid indexing of document tags |
US9465792B2 (en) * | 2014-12-30 | 2016-10-11 | Successfactors, Inc. | Computer automated organization glossary generation systems and methods |
US9779632B2 (en) | 2014-12-30 | 2017-10-03 | Successfactors, Inc. | Computer automated learning management systems and methods |
CN106202124B (en) * | 2015-05-08 | 2019-12-31 | 广州市动景计算机科技有限公司 | Webpage classification method and device |
US10558721B2 (en) | 2016-09-06 | 2020-02-11 | International Business Machines Corporation | Search tool enhancement using dynamic tagging |
US11036938B2 (en) * | 2017-10-20 | 2021-06-15 | ConceptDrop Inc. | Machine learning system for optimizing projects |
US11567980B2 (en) * | 2018-05-07 | 2023-01-31 | Google Llc | Determining responsive content for a compound query based on a set of generated sub-queries |
CN109522275B (en) * | 2018-11-27 | 2020-11-20 | 掌阅科技股份有限公司 | Label mining method based on user production content, electronic device and storage medium |
EP3660699A1 (en) * | 2018-11-29 | 2020-06-03 | Tata Consultancy Services Limited | Method and system to extract domain concepts to create domain dictionaries and ontologies |
US11042580B2 (en) | 2018-12-30 | 2021-06-22 | Paypal, Inc. | Identifying false positives between matched words |
CN109918655B (en) * | 2019-02-27 | 2023-11-14 | 浙江数链科技有限公司 | Logistics term library generation method and device |
CN110532345A (en) * | 2019-07-15 | 2019-12-03 | 北京小米智能科技有限公司 | A kind of processing method of unlabeled data, device and storage medium |
US11475222B2 (en) | 2020-02-21 | 2022-10-18 | International Business Machines Corporation | Automatically extending a domain taxonomy to the level of granularity present in glossaries in documents |
US11823082B2 (en) | 2020-05-06 | 2023-11-21 | Kore.Ai, Inc. | Methods for orchestrating an automated conversation in one or more networks and devices thereof |
US11531708B2 (en) | 2020-06-09 | 2022-12-20 | International Business Machines Corporation | System and method for question answering with derived glossary clusters |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060028689A1 (en) * | 1996-11-12 | 2006-02-09 | Perry Burt W | Document management with embedded data |
US6708311B1 (en) * | 1999-06-17 | 2004-03-16 | International Business Machines Corporation | Method and apparatus for creating a glossary of terms |
US20040205671A1 (en) * | 2000-09-13 | 2004-10-14 | Tatsuya Sukehiro | Natural-language processing system |
FR2822261A1 (en) * | 2001-03-16 | 2002-09-20 | Thomson Multimedia Sa | Navigation procedure for multimedia documents includes software selecting documents similar to current view, using data associated with each document file |
US7403938B2 (en) * | 2001-09-24 | 2008-07-22 | Iac Search & Media, Inc. | Natural language query processing |
US7143133B2 (en) * | 2002-11-01 | 2006-11-28 | Sun Microsystems, Inc. | System and method for appending server-side glossary definitions to transient web content in a networked computing environment |
US20050283473A1 (en) * | 2004-06-17 | 2005-12-22 | Armand Rousso | Apparatus, method and system of artificial intelligence for data searching applications |
US7930629B2 (en) * | 2005-07-14 | 2011-04-19 | Microsoft Corporation | Consolidating local and remote taxonomies |
US7856597B2 (en) * | 2006-06-01 | 2010-12-21 | Sap Ag | Adding tag name to collection |
US20090019051A1 (en) * | 2007-07-11 | 2009-01-15 | Pharmaceutical Product Development, Lp | Ubiquitous document routing enforcement |
US8117242B1 (en) * | 2008-01-18 | 2012-02-14 | Boadin Technology, LLC | System, method, and computer program product for performing a search in conjunction with use of an online application |
US8156053B2 (en) * | 2008-05-09 | 2012-04-10 | Yahoo! Inc. | Automated tagging of documents |
US8200649B2 (en) * | 2008-05-13 | 2012-06-12 | Enpulz, Llc | Image search engine using context screening parameters |
US8914363B2 (en) * | 2008-05-22 | 2014-12-16 | International Business Machines Corporation | Disambiguating tags in network based multiple user tagging systems |
US20100161631A1 (en) * | 2008-12-19 | 2010-06-24 | Microsoft Corporation | Techniques to share information about tags and documents across a computer network |
US20120059838A1 (en) * | 2010-09-07 | 2012-03-08 | Microsoft Corporation | Providing entity-specific content in response to a search query |
Also Published As
Publication number | Publication date |
---|---|
US20130159306A1 (en) | 2013-06-20 |
US20150234847A1 (en) | 2015-08-20 |
US9020950B2 (en) | 2015-04-28 |
US9275062B2 (en) | 2016-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9275062B2 (en) | Computer-implemented system and method for augmenting search queries using glossaries | |
US8135669B2 (en) | Information access with usage-driven metadata feedback | |
US9600533B2 (en) | Matching and recommending relevant videos and media to individual search engine results | |
US8117185B2 (en) | Media discovery and playlist generation | |
Hider | Information resource description: creating and managing metadata | |
US8977953B1 (en) | Customizing information by combining pair of annotations from at least two different documents | |
US20050149538A1 (en) | Systems and methods for creating and publishing relational data bases | |
US8447758B1 (en) | System and method for identifying documents matching a document metaprint | |
US20140372451A1 (en) | Discovering and scoring relationships extracted from human generated lists | |
US20090070322A1 (en) | Browsing knowledge on the basis of semantic relations | |
KR20090010185A (en) | Method and system for managing single and multiple taxonomies | |
EP2601573A1 (en) | Method and system for integrating web-based systems with local document processing applications | |
US20120179709A1 (en) | Apparatus, method and program product for searching document | |
Ghobadi et al. | An ontology based semantic extraction approach for B2C eCommerce | |
Hyvönen et al. | A content creation process for the semantic web | |
Jannach et al. | Automated ontology instantiation from tabular web sources—the AllRight system | |
Krutil et al. | Web page classification based on schema. org collection | |
US8875007B2 (en) | Creating and modifying an image wiki page | |
Alemayehu et al. | Methodology for creating a community corpus using a Wikibase knowledge graph | |
WO2019142094A1 (en) | System and method for semantic text search | |
Cameron et al. | Semantics-empowered text exploration for knowledge discovery | |
Keerthana et al. | Dspaa: A data sharing platform with automated annotation | |
Manthalu | Annotating Web Search Results | |
Hlava | Implementing a Taxonomy in a Database or on a Website | |
Kolovos et al. | Folklore Collections Database Users' Manual |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |