WO2016009321A1 - System for searching, recommending, and exploring documents through conceptual associations and inverted table for storing and querying conceptual indices - Google Patents

System for searching, recommending, and exploring documents through conceptual associations and inverted table for storing and querying conceptual indices

Info

Publication number
WO2016009321A1
Authority
WO
WIPO (PCT)
Prior art keywords
concept
concepts
documents
query
document
Prior art date
Application number
PCT/IB2015/055280
Other languages
English (en)
Inventor
Luis Lastras-Montano
Michele Martino Franceschini
Livio Baldini Soares
Mark Wegman
John Thomas Richards
Original Assignee
International Business Machines Corporation
IBM United Kingdom Limited
IBM (China) Investment Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US14/330,358 (patent US10503761B2)
Priority claimed from US14/330,438 (patent US9703858B2)
Application filed by International Business Machines Corporation, IBM United Kingdom Limited, IBM (China) Investment Company Limited
Publication of WO2016009321A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Definitions

  • the present disclosure relates generally to semantic searching, and more specifically, to searching, recommending, and exploring documents through conceptual associations and an inverted table for storing and querying conceptual indices.
  • LSA latent semantic analysis
  • LSA functions by projecting a document's representation, typically in the form of a term-frequency/inverse-document-frequency vector, to a smaller space called the latent semantic space. This projecting is performed using matrix factorization techniques such as singular value decomposition (SVD), or by using statistical inference techniques such as expectation/maximization, as used in probabilistic latent semantic indexing (PLSI). Every dimension in this lower dimensional space is meant to represent an abstract concept that is generated automatically by the LSA technique.
  • SVD singular value decomposition
  • PLSI probabilistic latent semantic indexing
  • the LSA idea has given rise to numerous other variants that generally share the attributes described above.
  • a disadvantage of the LSA family of techniques is that by themselves, they are not able to take advantage of the large volumes of crowd sourced data that have become available through the popularity of websites such as Wikipedia.
  • Embodiments include a method, system, and computer program product for searching, recommending, and exploring documents through conceptual associations.
  • a method includes receiving a plurality of documents and extracting concepts from each of the documents. A degree of relation between each of the documents and concepts in a knowledge base is calculated. The method also includes, in response to receiving a query, determining one or more concepts from the query. For each of the concepts, a list of documents having a highest degree of relation to the concept is retrieved. The method also includes outputting a list that is responsive to the one or more retrieved lists.
  • Embodiments include a method, system, and computer program product for implementing an inverted table for storing and querying conceptual indices.
  • the method includes creating a conceptual inverted index (CII) based on conceptual indices (CIs).
  • the CII includes CII entries, each of which corresponds to a separate concept in a concept graph. For each CII entry, the creating includes, with respect to the concept corresponding to the CII, populating the CII entry with pointers to documents selected from the CIs having likelihoods of being related to the concept that are greater than a threshold value and the corresponding likelihoods of the documents.
  • the method also includes receiving a query that includes one of the concepts in the concept graph as a search term, searching the CII for the search term, and generating query results from the searching.
  • the query results include at least a subset of the pointers to documents.
  • Each of the CIs is associated with a corresponding one of the documents and includes a CI entry for each concept in the concept graph, and each of the CI entries specifies a value indicating a likelihood that the one of the documents is related to the concept in the concept graph.
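  • The relationship between the CIs and the CII described above can be illustrated with a minimal sketch. It assumes each CI is a plain dictionary from concept name to likelihood; the function names `build_cii` and `query_cii`, the sample documents, and the threshold value are hypothetical and not taken from the disclosure.

```python
from collections import defaultdict

def build_cii(cis, threshold):
    """Build a conceptual inverted index (CII) from per-document conceptual indices (CIs).

    cis: dict mapping a document id to {concept: likelihood}.
    Returns {concept: [(doc_id, likelihood), ...]}, highest likelihood first.
    """
    cii = defaultdict(list)
    for doc_id, ci in cis.items():
        for concept, likelihood in ci.items():
            # Keep only documents whose likelihood exceeds the threshold.
            if likelihood > threshold:
                cii[concept].append((doc_id, likelihood))
    for entries in cii.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(cii)

def query_cii(cii, concept, top_k=10):
    """Return up to top_k (doc_id, likelihood) pairs for a concept used as a search term."""
    return cii.get(concept, [])[:top_k]

# Hypothetical conceptual indices for three documents.
cis = {
    "doc1": {"Java": 0.9, "Compiler": 0.4, "Coffee": 0.05},
    "doc2": {"Java": 0.2, "Compiler": 0.8, "Coffee": 0.01},
    "doc3": {"Java": 0.6, "Compiler": 0.1, "Coffee": 0.7},
}
cii = build_cii(cis, threshold=0.3)
```

  • In this sketch the document ids play the role of the pointers to documents, and storing the likelihood next to each pointer lets the query results be ranked without consulting the CIs again.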
  • FIG. 1 depicts a high level view of a system for performing semantic searching in accordance with an embodiment
  • FIG. 2 depicts a process for automatically linking text to concepts in a knowledge base in accordance with an embodiment
  • FIG. 3 depicts a user interface for displaying search results in accordance with an embodiment
  • FIG. 4 depicts a process for new concept definition in accordance with an embodiment
  • FIG. 5 depicts a process for computing the relevance of a document to concepts not specified in the document in accordance with an embodiment
  • FIG. 6 depicts a process for storing and querying conceptual indices in an inverted table in accordance with an embodiment
  • FIG. 7 depicts an external view of a user interface of a researcher profile in accordance with an embodiment
  • FIG. 8 depicts a user interface of an external view of a researcher publication page in accordance with an embodiment
  • FIG. 9 depicts a user interface of an internal view of a portion of a researcher patent page in accordance with an embodiment
  • FIG. 10 depicts a user interface of a portion of a researcher project page in accordance with an embodiment
  • FIG. 11 depicts a user interface of a primary profile editor in accordance with an embodiment
  • FIG. 12 depicts a user interface for a portion of an editor for a researcher publication page in accordance with an embodiment
  • FIGs. 13A, 13B, and 13C depict a user interface for searching and receiving search results in accordance with an embodiment
  • FIG. 14 depicts a user interface for displaying a result set in accordance with an embodiment
  • FIG. 15 depicts a user interface for displaying search results in accordance with an embodiment
  • FIG. 16 depicts a user interface for displaying a portion of search results in accordance with an embodiment
  • FIG. 17 depicts a user interface in accordance with an embodiment
  • FIG. 18 depicts a user interface for displaying a portion of search results in accordance with an embodiment
  • FIG. 19 depicts a user interface for displaying a portion of search results in accordance with an embodiment
  • FIG. 20 depicts a high-level block diagram of a question-answer (QA) framework where embodiments of semantic searching can be implemented in accordance with an embodiment
  • FIG. 21 depicts a processing system for performing semantic searching in accordance with an embodiment.
  • Embodiments described herein can be utilized for searching, recommending, and exploring documents through conceptual associations. Embodiments include the use of semantic search technologies that combine information contained in knowledge bases and unstructured data sources to provide improved search results.
  • Embodiments described herein can also be utilized to automatically link text (e.g., from a document) to concepts in a knowledge base. For example, the sentence "The computer programmer learned how to write Java in school." contains three concepts:
  • An embodiment of the automatic text linker described herein can discover these three concepts and link them to the most relevant concepts that it can find in a knowledge base.
  • Embodiments described herein can further be utilized to automatically add new entries into an existing knowledge base based on contents of a document. This can allow the knowledge base to continue to be updated and remain relevant as new documents are added to a corpus.
  • Embodiments described herein can further be utilized to compute the relevance of a document to concepts that are not specified in the document.
  • Concepts extracted from the document can be used along with a concept graph to determine whether the document is related to other concepts.
  • Embodiments described herein can include the use of efficient data structures for storing and querying deep conceptual indices.
  • a document can be received, concepts extracted from the document, confidence levels calculated for the various concepts, and then a representation of the document can be created in a concept space that connects the document to all possible concepts (not only the ones that were found explicitly in the document).
  • Embodiments described herein are directed to how this information can be organized in a computer system so that it can be efficiently queried against and maintained.
  • Embodiments described herein can further include providing a user interface for summarizing the relevance of a document to a conceptual query.
  • the relevance of a plurality of documents to a conceptual query can be summarized, including assigning the documents to groups based on how relevant the documents are to concepts specified in the query.
  • the term "concept" refers to an abstract idea or general notion that can be specified using a collection of names or labels, and a corresponding description. Additionally, sample sentences describing the concept may be included in the description of the concept. Concepts, such as, for example, "To be or not to be", "singular value decomposition", "New York Yankees" or "iPhone 6" may be encoded in a web page (e.g., Wikipedia).
  • the term "query" refers to a request for information from a data source.
  • a query can typically be formed by specifying a concept or a set of concepts in a user interface directly, or indirectly by stating a query in natural language from which concepts are then extracted.
  • the term "conceptual query" refers to a type of query that is specified by listing one or more concepts in a concept graph.
  • the term "corpus" refers to a collection of one or more documents represented using text (e.g., unstructured text).
  • the term "concept graph" refers to a visual representation that may be used to define a space of concepts and their relationships.
  • a concept graph is an example of a knowledge base in which knowledge is represented with nodes in the graph as concepts and edges in the graph representing known relations between the concepts.
  • a concept graph can be derived from crowd-sourced data sources such as wikis, which focus on defining concepts (e.g., Wikipedia).
  • a concept graph can additionally be augmented with concepts found in new unstructured data sources.
  • Embodiments described herein rely on the ability to compute an estimate of how relevant a concept in a concept graph is to another concept in the concept graph.
  • One way to accomplish this is through the use of Markov chain techniques.
  • a Markov chain can be built by regarding each of the nodes in the concept graph as a state in the Markov chain, where the links in the concept graph are an indication of the next states that may be visited from that state. The probability of going to a state conditional upon being in another state can be made to depend on a weight that may exist for an edge in the concept graph.
  • consider two concepts A and B and a requirement to compute how relevant these concepts are to each other.
  • An initialization probability vector is initialized by setting to 1.0 the probability of being in the state related to concept A, and then the Markov chain is iterated. Computationally, this is accomplished by first multiplying the transition probability matrix of the Markov chain times the current state probability vector. After iterating, the resulting vector is linearly mixed with the initialization probability vector so as to emulate a
  • This recursion is iterated for a fixed number of steps, for example L steps.
  • the resulting vector can be regarded as a measure of how relevant concept A is to all other concepts in the graph.
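  • The iteration described above can be sketched in a few lines, under stated assumptions: the concept graph is a dictionary of weighted outgoing links, every link target is itself a key of that dictionary, and a node with no outgoing links returns its probability mass to the starting concept. The function name `relevance_from`, the mixing coefficient `restart`, the step count, and the sample graph are all hypothetical.

```python
def relevance_from(graph, source, restart=0.15, steps=20):
    """Iterate a Markov chain over a concept graph, mixing with the initial vector.

    graph: dict mapping each concept to {neighbor: edge_weight};
           every neighbor must also appear as a key of graph.
    Returns a dict mapping each concept to its relevance score for `source`.
    """
    nodes = list(graph)
    # Initialization vector: probability 1.0 on the state for the source concept.
    init = {n: 1.0 if n == source else 0.0 for n in nodes}
    current = dict(init)
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in current.items():
            out = graph[node]
            total = sum(out.values())
            if total == 0:
                # Assumption: a dead-end node sends its mass back to the source.
                nxt[source] += mass
                continue
            for neighbor, weight in out.items():
                # One step of the chain: transition probability is weight / total.
                nxt[neighbor] += mass * weight / total
        # Linearly mix the iterated vector with the initialization vector.
        current = {n: (1 - restart) * nxt[n] + restart * init[n] for n in nodes}
    return current

# Hypothetical concept graph with two disconnected regions.
graph = {
    "Java": {"Programming language": 1.0, "Compiler": 1.0},
    "Programming language": {"Java": 1.0, "Compiler": 1.0},
    "Compiler": {"Programming language": 1.0},
    "Mambo": {"Latin music": 1.0},
    "Latin music": {"Mambo": 1.0},
}
scores = relevance_from(graph, "Java")
```

  • Concepts that cannot be reached from the starting concept keep a score of zero, while reachable concepts receive scores that reflect how much probability mass flows to them over the iterated steps.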
  • u A 0 is a uniform distribution. This is an example of one technique and other normalization techniques are also possible.
  • the link type may then be used to compute an additional link weight, which can be used to modify the link's probability in the transition probability matrix (for example, by making the probability of the link proportional to the link weight), which then affects the overall path score.
  • One possibility is to make the link weight for some link types equal to zero, thereby effectively erasing the link from the graph.
  • the capability of affecting the path scores using link types and weights allows the definition of a family of techniques for measuring the relevance of a concept to another concept, instead of a single one. Embodiments described herein may utilize each or a combination of these techniques.
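  • As an illustration of how link types can shape the transition probabilities, the following sketch makes each link's probability proportional to a weight looked up from its type, and drops links whose weight is zero. The link type names, the weight values, and the function name `transition_probabilities` are hypothetical.

```python
def transition_probabilities(links, type_weights):
    """Turn typed links into per-node transition probabilities.

    links: dict mapping a node to a list of (target, link_type) pairs.
    type_weights: dict mapping a link type to a non-negative weight
                  (types not listed default to 1.0, an assumption of this sketch).
    Returns a dict mapping each node to {target: probability}.
    """
    transitions = {}
    for node, out_links in links.items():
        weights = {}
        for target, link_type in out_links:
            weight = type_weights.get(link_type, 1.0)
            if weight > 0:
                # A zero weight effectively erases the link from the graph.
                weights[target] = weights.get(target, 0.0) + weight
        total = sum(weights.values())
        # Probability of each surviving link is proportional to its weight.
        transitions[node] = {t: w / total for t, w in weights.items()} if total else {}
    return transitions

# Hypothetical typed links and weights.
links = {
    "Java": [
        ("Programming language", "is_a"),
        ("Coffee", "disambiguation"),
        ("Compiler", "see_also"),
    ],
}
type_weights = {"is_a": 3.0, "see_also": 1.0, "disambiguation": 0.0}
probs = transition_probabilities(links, type_weights)
```

  • Supplying a different `type_weights` table yields a different member of the family of relevance measures, without changing the graph itself.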
  • Additional techniques that can be utilized by embodiments for computing the relevance of one concept to another concept can include the use of not only the links between nodes, but also information in the nodes of the concept graph, such as the name or names of the concept and the description of the concept.
  • Additional mechanisms that can be utilized by embodiments for computing the relevance of a concept to another concept include other known semantic similarity or relatedness techniques. Examples of such techniques can be found, for example, in articles such as: "Using Information Content to Evaluate Semantic Similarity in a Taxonomy" by Philip Resnik (1995); "An Information-Theoretic Definition of Similarity" by Dekang Lin (1998); "Algorithmic Detection of Semantic Similarity" by Ana Gabriela et al. (2005); "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy" by J. J. Jiang et al.
  • Turning now to FIG. 1, a high level view of a system for performing semantic searching is generally shown in accordance with an embodiment.
  • the system shown in FIG. 1 can be utilized for searching, recommending, and exploring documents through conceptual associations.
  • a document 102 (e.g., from a corpus of documents) is input to a natural language processing (NLP) engine 104 and text from the document 102 is automatically linked to concepts in a knowledge base 106.
  • the NLP engine 104 can also define new concepts to be added to the knowledge base 106 based on contents of the document 102.
  • the concepts from the knowledge base (both existing and newly defined if any) that are found in the document 102 are referred to herein as the "extracted concepts" or as the "concepts extracted from the document."
  • the system can also compute the relevance of the document 102 to concepts in the knowledge base 106 that are not specified in the document 102 using the extracted concepts and the knowledge base 106.
  • a degree of association between the extracted concepts and other concepts (e.g., as indicated by the knowledge base 106) is used to determine a relevance of the document 102 to concepts not extracted from the document 102.
  • Output from block 108 can include, for each concept in the knowledge base 106, a likelihood that the document is related to the concept.
  • the output from block 108 can include the relation information for just those concepts in the knowledge base 106 that meet a threshold such as the "most related" or those with a likelihood over a specified threshold.
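  • One way to picture the computation of block 108 is sketched below. It assumes that each extracted concept carries a confidence, that concept-to-concept relatedness scores in [0, 1] are already available from the knowledge base, and that a document's score for a non-extracted concept is the best product of confidence and relatedness. That aggregation rule, the function name `document_concept_scores`, and the sample values are assumptions made for illustration, not the method of the disclosure.

```python
def document_concept_scores(extracted, relatedness, threshold=0.0):
    """Score a document against concepts, including ones it never mentions.

    extracted: {concept: confidence} for concepts found in the document.
    relatedness: {concept: {other_concept: score in [0, 1]}} from the knowledge base.
    Returns {concept: score} restricted to scores above the threshold.
    """
    scores = dict(extracted)
    for concept, confidence in extracted.items():
        for other, rel in relatedness.get(concept, {}).items():
            # Assumption: propagate confidence scaled by relatedness, keep the best.
            candidate = confidence * rel
            if candidate > scores.get(other, 0.0):
                scores[other] = candidate
    # Keep only the concepts that meet the threshold.
    return {c: s for c, s in scores.items() if s > threshold}

# Hypothetical extracted concepts and relatedness scores.
extracted = {"Java": 0.9, "School": 0.5}
relatedness = {
    "Java": {"Programming language": 0.8, "Compiler": 0.5},
    "School": {"Education": 0.9, "Programming language": 0.1},
}
scores = document_concept_scores(extracted, relatedness, threshold=0.46)
```

  • Here "Programming language" is never mentioned in the document yet receives a score through its association with "Java", while weaker associations fall below the threshold and are omitted from the output.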
  • the system can also generate a reverse index 110 that includes, for each concept in the knowledge base 106, a likelihood that the document 102 is related to the concept.
  • the document scores shown in the embodiment of the reverse index 110 in FIG. 1 indicate the likelihood that the document is related to the concept.
  • the system can also generate an explanations index 114 which can be used to summarize the relevance of the document 102 to a query.
  • a query is the main mechanism with which an agent external to our system interacts with our system.
  • a query is simply some input that is passed to the system so that relevant documents are returned and suggested, together with an explanation of why they are so.
  • Most of the queries described herein are "conceptual queries", in which the query includes one or more concepts in a knowledge base or concept graph. However, conceptual queries may also have been derived from a simple string of text.
  • Embodiments described herein can be used to automatically link text (e.g., from a document) to concepts in a knowledge base.
  • An embodiment of the automatic text linker described herein can discover these three concepts and link them to the closest (most relevant) concepts that it can find in a knowledge base.
  • the term "closest concept" refers to a concept that is "most relevant."
  • a generative model is one which describes how to randomly generate data given some hidden parameters.
  • the key mechanism for generating output relies on the notion of a concept.
  • a human externalizes a sequence of words that come to mind while thinking of a concept or a collection of concepts.
  • a conceptual generative model can be specified by considering concepts as hidden parameters and the output text as the observations of a hidden Markov model.
  • the task of deciding whether to link a portion of text to a concept in a knowledge base of concept graphs can be treated as a "differential" test; that is, some text is linked to a concept only if the probability of the text assuming that the author was "thinking" about that concept when writing the text is sufficiently higher than the probability that the author was thinking of no particular concept (i.e., a generic language model) when writing the text.
  • a higher level of confidence can be associated with linking a portion of text to a particular concept when the probability of the text assuming that the author was "thinking" about that concept is sufficiently higher than the probability that the author was thinking of other competing concepts. For example, the sentence "I wish I could drive a Maserati" will have a much higher probability pi of having been produced by someone thinking of the concept
  • a set of data sources can include: one data source that is a collection of names for the concept;
  • Another data source that is a text that describes what this concept is; and a further data source that is a series of examples to which the concept is referred.
  • Other sets of data sources may be used by embodiments and may vary based on particular applications and concepts being analyzed.
  • each of the data sources in the example set of data sources described above is associated with a bag-of-words model where a word is chosen independently and identically distributed from the bag-of-words.
  • each of the data sources is chosen with a certain probability in an independent, identically distributed fashion.
  • More complex models for generating words can also be used to provide improved linking between a portion (e.g., one or more words) of text and a concept.
  • the model for selecting words can include the following: first select a name to be uttered with some probability; and then utter the sequence of words as given by the concept name selected.
  • Additional variants can include one or more words in the sequence of words in the concept name being skipped, or morphed into a variant (for example, add a plural or convert into singular, or capitalize, or lower case, or add a typo).
  • a further variant includes the possibility of uttering the words in the sequence of words out of order.
  • a word in the data source can be selected at random using a probability distribution that gives a higher probability to words that occur earlier in the description and a lower probability to words that occur later in the description.
  • One possible model is to choose a positive integer x from a distribution that assigns a probability to x that is proportional to $(1/x)^a$, up to a maximum possible integer which is the count of words in a document, where a is a parameter that can be estimated through standard statistical parameter estimation methods. After choosing this integer, the corresponding word is uttered. The simplest model then repeats this process.
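  • A minimal sketch of this position-based word model follows. It assumes the description is tokenized by whitespace and that the exponent a is given rather than estimated; the function names, the sample description, and the parameter values are hypothetical.

```python
import random

def position_weights(num_words, a):
    """Unnormalized weights proportional to (1/x)**a for positions x = 1..num_words."""
    return [(1.0 / x) ** a for x in range(1, num_words + 1)]

def utter_words(description, a, count, rng):
    """Draw `count` words independently, favoring words early in the description."""
    words = description.split()
    weights = position_weights(len(words), a)
    # random.Random.choices normalizes the weights internally.
    return rng.choices(words, weights=weights, k=count)

# Hypothetical description text for a concept; a seeded generator keeps runs repeatable.
rng = random.Random(0)
description = "Java is a class-based object-oriented programming language"
sample = utter_words(description, a=1.0, count=5, rng=rng)
```

  • A larger value of a concentrates the probability on the first few words of the description, while a value of zero makes every position equally likely.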
  • More complex models may choose a "phrase length" which results in uttering not only the word in position x, but also subsequent words in the order in which they appear in the data source up to a certain phrase length.
  • Other models may not utter exactly the same words, but variants or synonyms that can be obtained through the use of thesauri or other data sources.
  • the possibility that a person is thinking of more than one concept when uttering language can also be considered.
  • a sentence, or collection of sentences, being uttered is regarded as having "epochs" inside of it.
  • each epoch can be generated using a single concept model; however, switching from one epoch to another can result in language being uttered from a different single concept model.
  • the model for choosing which concept to utter words from can also be treated as a Markov model, where the chances that a word is uttered from a certain concept after uttering a word coming from that concept's data sources is higher than the chances of uttering a word from another concept's data sources.
  • the model described above can be seen as a form of a hierarchical hidden Markov model, where in the top level, a concept is chosen, then in the next level, a data source within the concept is chosen, and then in the next level, a word from the data source is uttered.
  • the observables of this hierarchical hidden Markov model are the words generated and the un-observables are the various choices described above leading to the selection of the word.
  • Each of these conceptual generative language models can be associated with a technique for estimating the probability that a given sequence of words (e.g., a sentence) is uttered for the given conceptual generative language model.
  • These probability estimates can be obtained using computational techniques that will reflect the underlying conceptual generative language model; for example for simple bag-of-words, counting the frequency of the words in a given sentence and then using the probability of a word in the bag-of-words suffices to compute a probability estimate.
  • HMM hidden Markov model
  • the computational technique used in this case falls under the general class of dynamic programming methods; and it is important to note that in most instances, an exact calculation of the probability is unnecessary and this can be used to further simplify the calculations.
  • This maximum can be regarded as a maximum a posteriori (MAP) estimate of the originating concept, or a maximum likelihood (ML) estimate if the prior probability is uninformative. From the earlier description, it stands to reason that p( T
  • a basic result from the data compression literature is that the problem of assigning probabilities to strings of data is in many ways similar to the problem of finding efficient representations of the text, and thus data modeling techniques have an implication on data compression algorithms (through the use of algorithms such as arithmetic coding) and vice versa, algorithms for data compression can be used to build estimates of probabilities of strings.
  • embodiments are not limited to the ways of performing the analysis described herein. For example, one could extract X features from the text (each feature could be a subsection of the text) and then apply the analysis described above to these features individually, instead of to the entire text at once. Then, instead of having, for example, one quantity, one has X quantities. One way to aggregate these X quantities is to multiply them. One way to select these features is to choose as a feature a maximal sequence of words in T that also appears in a data source for a concept. Another possibility is to choose a maximal sequence of characters in T that also appears in a data source for a concept.
  • Turning now to FIG. 2, a process for automatically linking text to concepts in a knowledge base is generally shown in accordance with an embodiment.
  • the linking is performed using differential analysis.
  • a text string is received.
  • the text string can be a collection of words, a sentence, a paragraph, or a whole document.
  • data sources are selected based on contents of the text string.
  • Each data source can correspond to a concept in the knowledge base and can be used to build language models.
  • Each of the data sources can include one or more collections of names for a corresponding concept, a description for the corresponding concept, sentences referring to the corresponding concept and/or paragraphs referring to the corresponding concept.
  • probabilities that the text string is associated with each of the language models as well as a generic language model are calculated.
  • the generic language model is derived from a generic data source not specified to any of the concepts in the knowledge base.
  • the text string is associated (e.g., via a link) to a concept in a knowledge base based on a comparison of the probabilities as described herein.
  • the comparison can include calculating link confidence scores for each concept based on a differential analysis of the probabilities.
  • the differential analysis can include comparing the probability that the text string is output by a language model built using a data source to the probability that the text string is output by a generic language model.
  • the differential can also include comparing the probability that the text string is output by a language model built using a data source to a probability that the text string is output by a language model built using a competing data source.
  • the text string can be linked to additional concepts in the knowledge base.
  • the link can apply to just a subset of the text string, with the subset indicated in the link.
  • the words in the subset can be consecutive or non-consecutive in the text string.
  • a link can be created from the text string to one of the concepts in the knowledge base based on a link confidence score of the concept being more than a threshold value away from a prescribed threshold.
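  • The process of FIG. 2 can be sketched with the simplest language model mentioned above, a bag-of-words. The sketch assumes whitespace tokenization, additive smoothing over a shared vocabulary, and log-probabilities in place of probabilities; the function names, the smoothing constant, the threshold, and all sample texts are hypothetical.

```python
import math
from collections import Counter

def log_prob(text, source, vocab, smoothing=1.0):
    """Log-probability of text under a smoothed bag-of-words model built from source."""
    counts = Counter(source.lower().split())
    total = sum(counts.values()) + smoothing * len(vocab)
    return sum(math.log((counts[w] + smoothing) / total) for w in text.lower().split())

def link_text(text, concept_sources, generic_source, threshold):
    """Differential test: link text to the best concept only if it beats the generic model.

    Returns (concept, differential, margin_over_runner_up) or None if no link is made.
    """
    vocab = set(text.lower().split()) | set(generic_source.lower().split())
    for source in concept_sources.values():
        vocab |= set(source.lower().split())
    generic = log_prob(text, generic_source, vocab)
    # Differential score: concept model versus the generic language model.
    scores = {c: log_prob(text, s, vocab) - generic for c, s in concept_sources.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    if scores[best] <= threshold:
        return None
    # Margin over the strongest competing concept, usable as extra confidence.
    margin = scores[best] - scores[ranked[1]] if len(ranked) > 1 else scores[best]
    return best, scores[best], margin

# Hypothetical data sources for two concepts and a generic language model.
concept_sources = {
    "Maserati": "maserati is an italian luxury car maker a maserati is a sports car",
    "Java (programming language)": "java is a programming language java code runs on a virtual machine",
}
generic_source = "the of and to a in is it that was for on are as with"
link = link_text("i wish i could drive a maserati", concept_sources, generic_source, threshold=1.0)
no_link = link_text("it was the", concept_sources, generic_source, threshold=1.0)
```

  • The first element of the returned tuple is the linked concept, the second compares the concept model with the generic model, and the third compares it with the closest competing concept, mirroring the two comparisons described above.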
  • A key issue to be addressed is the ability to do conceptual searching for things that are not already present in an existing concept graph or, alternately, the ability to automatically add entries into an existing knowledge base. A process for accomplishing this will now be described in accordance with an embodiment.
  • the corpus of data and the concept graph are entirely unrelated.
  • the reference corpus of data is the external web pages of IBM researchers (http://researcher.ibm.com) which are a collection of web pages in which researchers describe their research interests, education, projects, list their papers, etc.
  • Wikipedia is used as the knowledge base. Wikipedia articles may be assigned to nodes in the concept graph, and the edges between nodes may be hyperlinks referencing a Wikipedia article from another Wikipedia article.
  • text is extracted from the reference corpus which may contain information describing aspects of the "Mambo".
  • One possible way of accomplishing this is to do a text search for the word "Mambo" in the reference corpus and then extract text containing these mentions of the word "Mambo" from the reference corpus.
  • the text search can be done using a number of techniques known in the art. When doing this text search, variants of the word "Mambo" may be searched for as well; for example, "mambo" (the lower case version of "Mambo") or "MAMBO" may also be searched.
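  • As a small illustration of this search step, the following sketch finds case-insensitive mentions of a term and returns a window of surrounding words for each one. The regular-expression approach, the window size, the function name `find_mentions`, and the sample documents are assumptions of the sketch rather than the technique of the disclosure.

```python
import re

def find_mentions(documents, term, window=5):
    """Return a snippet of up to `window` words on each side of every mention of term."""
    # IGNORECASE covers variants such as "mambo" and "MAMBO".
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    snippets = []
    for doc in documents:
        words = doc.split()
        for i, word in enumerate(words):
            if pattern.search(word):
                snippets.append(" ".join(words[max(0, i - window): i + window + 1]))
    return snippets

# Hypothetical pages from a reference corpus.
documents = [
    "The Mambo project studies memory systems.",
    "Join MAMBO for salsa nights.",
    "Nothing relevant here.",
]
mentions = find_mentions(documents, "Mambo", window=2)
```

  • The returned snippets are the text that would then be passed to a concept extraction technique.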
  • the text containing these mentions is analyzed using a technique for extracting concepts from this text.
  • the embodiments described herein provide one such technique; however, it will be understood that other techniques may be used as well.
  • the extracted concepts can be assumed to be already previously present in the concept graph. In FIG. 3, the concepts that were extracted from the text containing these mentions are shown at 306.
  • the extracted concepts can be used in at least two distinct ways, namely, for a system query and for concept graph expansion.
  • the extracted concepts may be applied as a query to a system that can perform conceptual searching or recommendations. Such a system does not have to return documents related to the concept graph or the corpus of reference (although it could also do that).
  • the key assumption though is that the system can accept as an input query the concepts extracted in the second step above, and that it is capable of returning documents conceptually associated with the input query, independently of the origin of these documents.
  • in the concept graph expansion use, a new node is added to the concept graph with the name "mambo" and is connected to other nodes in the concept graph using the extracted concepts. Additionally, the text from which the concepts were extracted, their provenance, and the confidence of the extracted concepts may also be added as additional metadata in the new concept definition.
  • the text containing the mentions of the string of text may contain references to concepts that are unrelated to the potential new concept.
  • This problem can be remedied through the following technique.
  • the concepts extracted from the text containing the mentions can undergo a process of "de-noising".
  • the term "de-noising" refers to the act of estimating the cross-relevance of the extracted concepts (and possibly removing some of them). If there is a subset of extracted concepts that have some degree of association with each other, then the confidence that those extracted concepts can be used in the new concept definition increases. Alternately, if there are extracted concepts for which other related extracted concepts cannot be found, then the confidence that those extracted concepts can be used reliably in the definition of the new concept diminishes.
  • the relation between any two concepts can be attained by exploiting a concept graph.
  • the concept graph may be utilized to determine these relations (e.g., whether they appear explicitly in the form of links in the concept graph between the two concepts, or implicitly through an exploration and weighting of the various paths that connect the concepts in the concept graph).
  • a technique is described for estimating how close a specific concept in a collection of extracted concepts is to the rest of the extracted concepts by analyzing the top M closest concepts and then merging those M scores using information combining techniques.
  • a sample of top concepts is shown generally at 308 in FIG. 3.
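  • The de-noising step above can be sketched as follows. This is a minimal illustration, not the patented method: the merging rule (multiplying likelihood ratios of the top M relatedness scores) is an assumption borrowed from the likelihood-ratio combining described later in this document, and `cross_relevance`, `toy_relatedness`, and the relatedness values are hypothetical.

```python
from math import prod

EPS = 1e-9  # keeps probabilities away from 0 and 1 so ratios stay finite


def likelihood_ratio(p):
    """Convert a probability-like score into a likelihood ratio p / (1 - p)."""
    p = min(max(p, EPS), 1.0 - EPS)
    return p / (1.0 - p)


def cross_relevance(concept, extracted, relatedness, m=2):
    """Merge the top-m relatedness scores of `concept` to the other extracted concepts."""
    scores = sorted(
        (relatedness(concept, other) for other in extracted if other != concept),
        reverse=True,
    )[:m]
    if not scores:
        return 0.0
    ratio = prod(likelihood_ratio(s) for s in scores)
    return ratio / (1.0 + ratio)  # back from a likelihood ratio to a probability


# Hypothetical relatedness table; a real system would derive this from the concept graph.
PAIRS = {
    frozenset(("Salsa music", "Latin dance")): 0.9,
    frozenset(("Salsa music", "Cuba")): 0.8,
    frozenset(("Latin dance", "Cuba")): 0.7,
}


def toy_relatedness(a, b):
    return PAIRS.get(frozenset((a, b)), 0.1)


extracted = ["Salsa music", "Latin dance", "Cuba", "LCD screen"]
for concept in extracted:
    print(concept, round(cross_relevance(concept, extracted, toy_relatedness), 3))
```

  • In this toy data the mutually related concepts reinforce each other, while the isolated concept receives a low score and could be down-weighted or removed.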
  • the reference corpus of data may contain references to the same string of text that refer to two or more senses.
  • the pages of the IBM researchers may sometimes refer to "MAMBO", a local club with interests in Latin-American music.
  • a search for "Mambo" may (after lower/upper casing normalization is applied) result in documents containing both references to the project as well as references to the club.
  • metadata that can be used to separate these two senses (for example, some pages are known to be about projects, and some pages are known to be about hobbies)
  • the extracted concepts can be separated in two groups (or more than two groups, depending on the situation). If no such metadata is available, or if the metadata is unreliable so that it needs to be aided by further analysis, embodiments of the invention may utilize clustering algorithms to attempt to separate these two intermixed senses, as will now be described.
  • the two clusters are regarded as two distinct senses of the same string of text.
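  • A minimal sketch of separating intermixed senses by clustering. The source does not name a clustering algorithm; connected components over a thresholded relatedness graph are used here purely as a stand-in, and `cluster_concepts`, the threshold, and the toy relatedness values are assumptions.

```python
def cluster_concepts(concepts, relatedness, threshold=0.5):
    """Group concepts into connected components of the 'related enough' graph."""
    clusters = []
    unvisited = list(concepts)
    while unvisited:
        seed = unvisited.pop(0)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [c for c in unvisited if relatedness(current, c) >= threshold]
            for c in linked:
                unvisited.remove(c)
            cluster.extend(linked)
            frontier.extend(linked)
        clusters.append(cluster)
    return clusters


# Hypothetical relatedness values for the two senses of "mambo".
SENSE_PAIRS = {
    frozenset(("Information retrieval", "Search engine")): 0.9,
    frozenset(("Salsa music", "Latin dance")): 0.8,
}


def sense_relatedness(a, b):
    return SENSE_PAIRS.get(frozenset((a, b)), 0.1)


mixed = ["Information retrieval", "Search engine", "Salsa music", "Latin dance"]
print(cluster_concepts(mixed, sense_relatedness))
```

  • Each resulting cluster would then be treated as one candidate sense of the string of text.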
  • the string of text is interpreted as a system query in which documents are returned from some corpus through conceptual associations.
  • the goal is to augment a concept graph with a new concept definition.
  • the contextual information can be summarized as a collection of concepts from a concept graph.
  • the concept graph may be employed along with the various techniques described in this embodiment to measure the closeness between the concepts in the contextual information and each of the groups of concepts that define a sense of the string of text. Once this closeness is established, the most likely sense can be computed.
  • this sense is selected as the intended sense and a conceptual query is performed for the sense.
  • a user is presented with the option of selecting which sense of the string of text was meant, via a dialog on a user interface. Once the sense has been disambiguated by the user, the conceptual query can be performed as indicated above.
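  • Automatic selection of the most likely sense from contextual concepts might look like the sketch below. The closeness measure (average of the best relatedness between each context concept and the sense's concepts) is an assumption, as are the names `closeness` and `most_likely_sense` and the toy values.

```python
def closeness(context, sense_concepts, relatedness):
    """Average, over context concepts, of the best match inside the sense's concept group."""
    if not context or not sense_concepts:
        return 0.0
    best = [max(relatedness(c, s) for s in sense_concepts) for c in context]
    return sum(best) / len(best)


def most_likely_sense(context, senses, relatedness):
    """Return the name of the sense whose concept group is closest to the context."""
    return max(senses, key=lambda name: closeness(context, senses[name], relatedness))


# Hypothetical relatedness values between context and sense concepts.
CONTEXT_PAIRS = {
    frozenset(("Dance", "Latin dance")): 0.9,
    frozenset(("Dance", "Salsa music")): 0.7,
}


def context_relatedness(a, b):
    return CONTEXT_PAIRS.get(frozenset((a, b)), 0.05)


senses = {
    "project": ["Information retrieval", "Search engine"],
    "club": ["Salsa music", "Latin dance"],
}
print(most_likely_sense(["Dance"], senses, context_relatedness))
```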
  • an important challenge is that of automatically extracting strings of text which may be candidates for new concepts.
  • One technique for automatic extraction can include processing the reference corpus using natural language processing technology and extracting all noun phrases that it contains. These noun phrases are natural candidates for incorporation as new concepts into the concept graph. The pool of noun phrases is then ordered according to some criterion of importance. Generally speaking, noun phrases that appear more frequently can be given higher priority, as there is evidence they are discussed with certain frequency, and also the quality of the definitions that can be derived is higher.
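  • The frequency-based ordering of candidate noun phrases can be sketched as below. The noun phrases are assumed to have been extracted already by some natural language processing step that is not shown; `rank_candidates` and `min_count` are hypothetical names.

```python
from collections import Counter


def rank_candidates(noun_phrases, min_count=2):
    """Order candidate noun phrases by frequency, dropping rare ones."""
    counts = Counter(phrase.strip().lower() for phrase in noun_phrases)
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_count]


phrases = ["Mambo", "mambo", "concept graph", "mambo", "concept graph", "club"]
print(rank_candidates(phrases))
```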
  • One challenge involves deciding when the task of automatically defining a concept is moot because the underlying concept is already present in a concept graph. This may be resolved by comparing the names of the concepts in a concept graph with the string of text and, when a sufficiently close match is found, computing the conceptual closeness between two sets of concepts: the concepts extracted in the process of defining the string of text (for a single sense, in the case where multiple senses are found for the string of text in the reference corpus) and the concepts to which the closely matching concept is linked in the concept graph. If there is a sufficiently high correlation between these two sets of concepts, then it is determined that it is not necessary to define a new concept, and instead the original concept in the concept graph may be sufficient.
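  • A minimal sketch of the "is this concept already in the graph?" check. Name similarity uses Python's `difflib.SequenceMatcher`, and the conceptual closeness between the two concept sets is approximated with Jaccard overlap; both choices, the thresholds, and the name `find_existing_concept` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def find_existing_concept(text, extracted, graph, name_threshold=0.85, overlap_threshold=0.5):
    """Return the name of an existing concept that already covers `text`, or None.

    `graph` maps a concept name to the names of the concepts it links to.
    """
    extracted = set(extracted)
    for name, neighbors in graph.items():
        name_similarity = SequenceMatcher(None, text.lower(), name.lower()).ratio()
        if name_similarity < name_threshold:
            continue  # names are not close enough to be the same concept
        union = extracted | set(neighbors)
        overlap = len(extracted & set(neighbors)) / len(union) if union else 0.0
        if overlap >= overlap_threshold:
            return name
    return None


toy_graph = {
    "Mambo": ["Latin dance", "Cuba", "Salsa music"],
    "Kitchen": ["Induction cooking"],
}
print(find_existing_concept("mambo", ["Latin dance", "Cuba", "Music"], toy_graph))
print(find_existing_concept("mambo", ["Information retrieval", "Search engine"], toy_graph))
```

  • When the function returns None, the string of text (for that sense) would be a candidate for a new concept definition.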
  • One embodiment provides the capability to determine the meaning (or meanings) of a hashtag presented in a user's social media application (e.g., TWITTER).
  • analysis of this larger amount of text has a higher probability of successfully being able to define the meaning (or meanings, in case it has multiple senses) because analysis of the tweets' text using the embodiments described herein will result in linkages into an existing concept graph with higher probability than if a single tweet was employed.
  • the concept graph could be derived using Wikipedia, Freebase, or other data sources.
  • the embodiments described herein offer great flexibility.
  • the reference corpus and the concept graph were not related.
  • the reference corpus and the concept graph may be closely related.
  • the concept graph may be derived from the articles of Wikipedia (e.g., where each node of the concept graph would have a name given by a Wikipedia article and its links to other nodes would be determined by the hyperlinks connecting articles in Wikipedia).
  • the reference corpus could be Wikipedia itself, including the full texts of each Wikipedia article.
  • there is a very large number of noun phrases in Wikipedia that are not present as Wikipedia article names; these can be prime candidates for new concepts in a concept graph, demonstrating that the embodiments are also applicable and meaningful when the concept graph and the reference corpus are closely related to each other.
  • a process for automatic new concept definition is generally shown in accordance with an embodiment.
  • a string of text is received, and at block 404, a corpus of data can be searched to locate additional text related to the string of text.
  • the searching can include searching for both the string of text and for variants of the string of text.
  • concepts can be extracted from the additional text.
  • the extracted concepts can include a subset of concepts in a concept graph and can be de-noised to estimate the cross-relevance of the extracted concepts.
  • the corpus of data and the concept graph are separate entities. Processing then continues at block 408 or block 410.
  • the extracted concepts can be used to link the string of text to the concept graph. It is determined whether the string of text should be linked to an existing concept in the concept graph.
  • the determining can include determining similarities between the string of text and a name of the existing concept, and then deciding whether the string of text should be linked to the existing concept based on the similarities.
  • the determining can also include determining a conceptual closeness between the extracted concepts and the concepts to which the existing concept is linked in the concept graph, and then deciding whether the string of text should be linked to the existing concept based on the conceptual closeness.
  • the linking can be performed if it was determined that the string of text should be linked to the existing concept in the concept graph.
  • a new concept that is associated with the string of text can be added to the concept graph if it was determined that the string of text should not be linked to the existing concept in the concept graph.
  • Adding the new concept to the concept graph can include linking the new concept to at least one of the extracted concepts from the additional text.
  • the extracted concepts can be clustered into one or more clusters with related concepts. Based on there being two or more clusters, the following can be performed independently for each cluster: determining whether a new concept should be added to the concept graph based on the similarities of the string of text to the names of existing concepts and the similarities between the concepts in the cluster and the concepts linked to an existing concept; and based on determining that a new concept should be added, adding the new concept to the concept graph.
  • An embodiment includes computing the relevance of a document to concepts not specified in the document. Techniques described herein can be used for performing conceptual document integration techniques to create a deep vector representation of a document.
  • Embodiments disclosed herein relate to the use of a concept graph to define a space of concepts and their relations.
  • a concept graph can be derived from crowd-sourced data sources such as wikis which focus on defining concepts (e.g., Wikipedia) and can additionally be augmented with concepts found in new unstructured data sources.
  • concepts that exist in a concept graph are extracted from a document and then a vector representation of the document is created in "concept space" in which every dimension is a concept of the concept graph.
  • a concept space is a collection of two entities. One of the entities is a vector space with N dimensions. The other entity is a concept graph with N concepts, giving a precise meaning to each of the dimensions of the vector space.
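  • As a small illustration of the concept space just defined, the sketch below pairs an ordered list of N concept names with an N-entry vector, so that each dimension has a named meaning. The function name `to_concept_vector` and the three-concept graph are hypothetical.

```python
def to_concept_vector(concept_names, scores):
    """Place per-concept scores into a dense vector indexed by the concept graph's concepts."""
    index = {name: i for i, name in enumerate(concept_names)}
    vector = [0.0] * len(concept_names)
    for name, score in scores.items():
        vector[index[name]] = score
    return vector


concept_names = ["Dairy", "Kitchen", "Vanilla ice cream"]  # the N concepts of a toy graph
print(to_concept_vector(concept_names, {"Kitchen": 0.8}))
```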
  • the extracting of concepts from a document can be done using "wikification" techniques which are known in the art.
  • the vector representation described herein can have a value assigned to a vector entry that represents the relevance of the concept to the document or vice versa, and thus it has a much higher level semantic significance.
  • the document may have nonzero scores for concepts that are actually not present in the document, but that are inferred to be relevant to the document using reasoning afforded to embodiments by the concept graph.
  • a confidence score can be a measurement or estimate of how certain it is that the correct concept was extracted by the concept extraction process.
  • the goal is to obtain the relation between the document D and a general space of concepts (e.g., all or a subset of a concept graph).
  • This representation can connect the document D with concepts not necessarily present in the original description of it, via exploitation of deep conceptual connections as seen in a concept graph (hence the name "deep concept vector representation").
  • This representation is referred to herein as including a posteriori inferences about the document which have now incorporated additional knowledge as related to concept.
  • One type of relation to obtain is to compute how relevant a document would be for a query comprised of one concept.
  • An embodiment includes taking a priori data about a document and then employing a concept graph to improve the conceptual understanding of that document.
  • the concepts c1, c2, c3, ..., ck may have come with confidence scores s1, s2, ..., sk from the annotation techniques associated with the concept extraction.
  • these confidence scores can be further refined by taking into account the relations between the scores as made available through computations over a concept graph.
  • An individual concept can be regarded as a document with a single extracted concept, and thus via a slight notation overload, one can denote by r(ci) the vector with N entries that describe the likelihood that the concept ci is related to any other concept in the concept space.
  • Computing the vector r(ci) can be done with a variety of techniques, including ones based on Markov chain simulations.
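  • One way a Markov chain simulation could produce a vector like r(ci) is a random walk with restart over the concept graph, sketched below. The restart probability, step count, and the use of raw visit frequencies as relatedness values are assumptions; the source does not specify the simulation, and these frequencies are not calibrated likelihoods.

```python
import random


def relatedness_vector(graph, start, steps=20000, restart=0.3, seed=0):
    """Estimate how related `start` is to every concept via a random walk with restart.

    `graph` maps each concept to a list of linked concepts (every concept must be a key).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    visits = {node: 0 for node in graph}
    current = start
    for _ in range(steps):
        neighbors = graph[current]
        if not neighbors or rng.random() < restart:
            current = start  # jump back to the concept of interest
        else:
            current = rng.choice(neighbors)
        visits[current] += 1
    return {node: count / steps for node, count in visits.items()}


walk_graph = {
    "Vanilla ice cream": ["Dairy"],
    "Dairy": ["Vanilla ice cream", "Kitchen"],
    "Kitchen": ["Dairy"],
    "LCD screen": [],
}
r = relatedness_vector(walk_graph, "Vanilla ice cream")
print(sorted(r.items(), key=lambda item: item[1], reverse=True))
```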
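  • the likelihood ratio: $\mathrm{LR}(p) = p / (1 - p)$
  • the log-likelihood ratio: $\mathrm{LLR}(p) = \log\left( p / (1 - p) \right)$
  • $\mathrm{LR}^{-1}(x)$ and $\mathrm{LLR}^{-1}(x)$ denote the corresponding inverse functions.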
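  • The likelihood ratio, the log-likelihood ratio, and their inverses can be written directly in code. This sketch assumes the standard definitions (ratio p / (1 - p) and its natural logarithm); the function names are illustrative.

```python
import math


def lr(p):
    """Likelihood ratio of a probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)


def llr(p):
    """Log-likelihood ratio of a probability p (requires 0 < p < 1)."""
    return math.log(p / (1.0 - p))


def lr_inv(x):
    """Inverse of lr: recover the probability from a likelihood ratio x >= 0."""
    return x / (1.0 + x)


def llr_inv(x):
    """Inverse of llr: the logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


print(lr(0.8), llr(0.5), lr_inv(lr(0.3)), llr_inv(llr(0.3)))
```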
  • f(.) is a function that defines how to count the contribution of a matrix entry; the result can be seen as a general statistic that can be computed by specializing the function f(.).
  • Other choices for f are possible depending on the application.
  • the row vector thus computed is an estimate for how much the general area of a concept is being mentioned, taking into account information from all other extracted concepts. A large value in an entry of the row vector implies a larger presence of the underlying concept, and vice versa.
  • the updated confidence scores w1, w2, w3, ..., wK on the concepts can be defined as a function of the corresponding elements of v:
  • This function g(.) is an application-dependent quantity that can be used to control how much concept reinforcement is promoted and how much concept verbosity is
  • This general class of functions falls within the scope of our embodiments and also accounts for reinforcement and verbosity.
  • The computation shown above in Equation (4) is performed for some weights w1 ... wk that are derived in ways as described above. The result of this computation is a vector with N entries, since each of the r(ci) is also a vector with N entries. Recall that r(ci) encodes the likelihood that a concept ci is related to each of the other concepts in a concept graph.
  • The result of Equation (4) is the a posteriori likelihood inference sought by embodiments, as it connects the document to all other concepts in a concept graph by processing the a priori data with the concept graph information.
  • An alternative embodiment can be utilized for processing the a priori information about the document and producing the a posteriori likelihoods relating a document to other concepts.
  • An advantage to this embodiment is that it can allow processing of an arbitrarily large set of concepts in a document with a finite amount of memory in the processing system.
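  • Equation (4) itself is not reproduced in this text. As a rough sketch only, and assuming it takes the form of a weighted average of the per-concept vectors r(ci) with weights w1 ... wk (the summary later in this document mentions "a weighted averaging of vectors associated with extracted concepts"), the combination might look like this. The name `combine_vectors` and the toy numbers are hypothetical.

```python
def combine_vectors(vectors, weights):
    """Weighted average of k vectors, each with N entries, giving one N-entry vector."""
    if len(vectors) != len(weights) or not vectors:
        raise ValueError("need one weight per vector and at least one vector")
    total = sum(weights)
    n = len(vectors[0])
    return [
        sum(w * v[j] for w, v in zip(weights, vectors)) / total
        for j in range(n)
    ]


r_c1 = [1.0, 0.2, 0.0]  # toy r(c1) over a 3-concept graph
r_c2 = [0.0, 0.4, 1.0]  # toy r(c2)
print(combine_vectors([r_c1, r_c2], weights=[3.0, 1.0]))
```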
  • Another advantage is that it is able to differentiate in the final a posteriori likelihood vector the contributions on the concepts explicitly present in the document as per the initial concept extraction.
  • the second key idea is that in the output likelihood vector, the K concepts c1, ..., ck receive a special treatment beyond the above. This is important so as to ensure that document queries that contain an extracted concept that is also a query are ranked appropriately.
  • $\mathrm{direct}_j = \lambda_d \, r(j)[j] + \gamma_d$
  • the compounded contribution of the "side" hits is computed by selecting the highest M among the list $[y_{1,j}, y_{2,j}, \ldots, y_{k,j}]$, computing their likelihood ratios, multiplying them, and then computing the likelihood value $\mathrm{side}_j$ from the resulting likelihood ratio.
  • $\lambda_d$, $\gamma_d$, $a_{\mathrm{side}}$, $b_{\mathrm{side}}$ can vary with i as a function of the popularity of the concept ci.
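  • The "side" hit combination described above (take the highest M values, convert to likelihood ratios, multiply, convert back) can be sketched as follows. The clamping constant and the name `side_likelihood` are assumptions, and the side parameters mentioned in the text are omitted because their role is not specified here.

```python
from math import prod

EPS = 1e-12  # avoids division by zero when a score is exactly 0 or 1


def side_likelihood(y_column, m=3):
    """Combine the m largest likelihoods in `y_column` through their likelihood ratios."""
    top = sorted(y_column, reverse=True)[:m]
    if not top:
        return 0.0
    clamped = [min(max(p, EPS), 1.0 - EPS) for p in top]
    ratio = prod(p / (1.0 - p) for p in clamped)
    return ratio / (1.0 + ratio)


print(side_likelihood([0.9, 0.8, 0.1, 0.5], m=2))
```

  • Note that a value of 0.5 has a likelihood ratio of 1 and therefore leaves the product unchanged, while values below 0.5 pull the combined likelihood down.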
  • In FIG. 5, a process for computing the relevance of a document to concepts not specified in the document is generally shown in accordance with an embodiment.
  • a concept graph is accessed.
  • a relevance of the document to concepts in the concept graph is computed.
  • a priori information about a document is received.
  • the a priori information can include concepts previously extracted from the document.
  • the a priori information can also include confidence scores corresponding to each of the concepts extracted from the document.
  • a posteriori information about the document is generated that indicates the likelihood that the document is related to each of the concepts in the concept graph.
  • the a posteriori information can be generated by combining the a priori information and the concept graph.
  • the a posteriori information can be responsive to paths connecting each of the concepts in the concept graph to each concept in a selected subset of the concepts extracted from the document.
  • the selected subset of the concepts extracted from the document can vary based on the concept in the concept graph and can include a specified number of the concepts extracted from the document that are the most related to the concept in the concept graph.
  • the a posteriori information can include a weighted averaging of vectors associated with extracted concepts. The weight for an extracted concept can be responsive to a degree to which the extracted concept is related to the other extracted concepts and/or to a frequency in which the extracted concept appears in the document.
  • a relevance of the document to concepts not extracted from the document can also be computed.
  • the document can be in a corpus of documents and the processing can further include utilizing the a posteriori information to search the corpus of documents.
  • the process can also include outputting a threshold number of concepts from the concept graph having the highest likelihood, the output based on the results of the combining.
  • Embodiments described herein further include the use of efficient data structures for storing and querying deep conceptual indices.
  • a document can be received, concepts extracted from the document, confidence scores calculated for the various concepts (with the goal of, for example, promoting concept reinforcement and creating diminishing returns by penalizing exceedingly verbose descriptions), and then a representation of the document is created in a concept space that connects the document to all possible concepts (not only the ones that were found explicitly in the document).
  • Embodiments described herein are directed to how this information can be organized in a computer system so that it can be efficiently queried against and maintained.
  • a query can typically be formed by specifying a concept or a set of concepts in a user interface directly or indirectly by stating a query in natural language from which concepts are then extracted.
  • one or more concepts can be selected as the query.
  • the computational task involved in an embodiment is that of examining the score that has been assigned to every document for the query q, and returning the documents in the order implied by that score. This can be achieved by building an inverted table that maintains, for every concept, a list of documents and their associated scores. This allows for quick retrieval of this list upon receiving the query concept.
  • the inverted table utilized by embodiments described herein is of a very different nature than inverted tables used in traditional information retrieval mechanisms, where for every keyword (or generally some text surface form) a list of documents containing that keyword is listed.
  • a document is associated with a variety of concepts that are not present in the document to start with.
  • the list of documents can be maintained in sorted form (e.g., sorted by score) and can be organized as a hash table as well, where the key is the document and the value is the score.
  • a document is associated with potentially a large number of concepts, and hence it can become important to design and use data structures that are capable of very fast insertion, deletion and updating of scores.
  • Scores can be augmented with a version number, and an additional metadata structure can be added that tracks the most recent version of a document. Incrementing the version number results in automatically and immediately invalidating all the scores of a document in the inverted table with a version number lower than the new version number.
  • the conceptual index versioning system described above can also allow a seamless experience in transitioning scores for a document to new scores. This can be accomplished as follows: the new scores for the document are uploaded with the new version number, however the version number for the document indicated in the metadata is not updated until all the scores for the new version have been added to the inverted table.
  • the inverted hash table data structure can also be augmented with a garbage collection mechanism which periodically deletes concept scores of versions that are no longer valid in the system.
  • the garbage collection mechanism can delete scores (even if current, matching the existing version) starting from the lowest scores, whenever there is existing or predicted space pressure in the table.
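  • The versioned inverted table described above can be sketched as follows: scores are stored per concept and keyed by (document, version), a separate metadata map records each document's valid version, and a garbage collection pass drops superseded entries. The class and method names are hypothetical, the table is sorted at query time for brevity, and eviction under space pressure is not shown.

```python
class VersionedConceptTable:
    """Inverted table: concept -> scores of documents, with per-document versioning."""

    def __init__(self):
        self._scores = {}   # concept -> {(doc_id, version): score}
        self._current = {}  # doc_id -> version currently considered valid

    def add_score(self, concept, doc_id, score, version):
        self._scores.setdefault(concept, {})[(doc_id, version)] = score

    def publish(self, doc_id, version):
        """Switch the document to `version`; older scores become invalid immediately."""
        self._current[doc_id] = version

    def query(self, concept):
        """Return (doc_id, score) pairs for valid versions only, best score first."""
        hits = [
            (doc_id, score)
            for (doc_id, version), score in self._scores.get(concept, {}).items()
            if self._current.get(doc_id) == version
        ]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)

    def collect_garbage(self):
        """Delete scores whose version is older than the document's valid version."""
        removed = 0
        for entries in self._scores.values():
            stale = [
                (doc_id, version)
                for (doc_id, version) in entries
                if doc_id in self._current and version < self._current[doc_id]
            ]
            for key in stale:
                del entries[key]
            removed += len(stale)
        return removed


table = VersionedConceptTable()
table.add_score("Kitchen", "D2", 0.8, version=1)
table.publish("D2", 1)
table.add_score("Kitchen", "D2", 0.6, version=2)  # uploaded, but not yet visible
print(table.query("Kitchen"))
table.publish("D2", 2)                            # all new scores are in; switch over
print(table.query("Kitchen"), table.collect_garbage())
```

  • Publishing the new version only after all of its scores have been added is what gives the seamless transition: queries see either the complete old set or the complete new set.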
  • Scores for D2 can equal { {"Vanilla ice cream": 0.1}, {"Induction cooking": 1.0}, {"Kitchen": 0.8}, {"Vanilla Planifolia": 0.01}, {"Dairy": 0.15}, {"LCD screen": 0.1} }.
  • the current version number for the scores of the documents can be indicated as {"D1": v1, "D2": v1}.
  • Embodiments can be utilized for queries that include multiple concepts. Multiple concept queries can be present when a user explicitly (or implicitly, via the use of natural language) describes two or more concepts with the hope of retrieving documents that are relevant to a reasonable fraction of the given concepts. Another multiple concept query scenario is when a user introduces a *document* as an example query, and in this case the concepts that are included in the document can be used to formulate a multiple concept query. Yet another scenario is when documents and concepts are mixed as a query.
  • the information retrieval system may not be fully calibrated and as a result the various vectors in $[s_{i,1}, s_{i,2}, s_{i,3}, \ldots, s_{i,L}]$ may not be compatible with each other.
  • a score of 0.8 for a query on "information retrieval" may imply a different relevance than a score of 0.8 for a query on "Federal Housing Administration", even for the same document.
  • $s_{i,L}$ can lead to undesirable or unexpected outcomes in the combined ranking.
  • a query in which a document-by-example is provided.
  • These queries are nothing other than conceptual queries, since the document that is being used as a query can be analyzed, extracting one or more concepts from this document and using those concepts in the query.
  • the document is first analyzed to extract a list of concepts for the document that are deemed to be highly relevant, and then the techniques previously described can be used to perform a multiple concept query.
  • Embodiments deduce the identity of the important concepts in a document by restricting the concepts to be considered to be only those initially extracted from the document. Nonetheless, the entire concept graph is used to analyze how these concepts relate/reinforce each other, using the techniques described previously herein. The results of that scoring are then used in order to select the top concepts to be used in the multiple concept query.
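  • Selecting the top concepts of a document-by-example might be sketched as below. The scoring rule (each extracted concept is scored by its summed relatedness to the other extracted concepts) is a simplified assumption standing in for the reinforcement scoring described earlier; `top_query_concepts` and the toy values are hypothetical.

```python
def top_query_concepts(extracted, relatedness, k=2):
    """Pick the k extracted concepts that are best supported by the other extracted concepts."""
    def support(concept):
        return sum(relatedness(concept, other) for other in extracted if other != concept)

    return sorted(extracted, key=support, reverse=True)[:k]


# Hypothetical relatedness values; only concepts extracted from the document are candidates.
DOC_PAIRS = {frozenset(("Induction cooking", "Kitchen")): 0.9}


def doc_relatedness(a, b):
    return DOC_PAIRS.get(frozenset((a, b)), 0.1)


doc_concepts = ["Induction cooking", "Kitchen", "LCD screen"]
print(top_query_concepts(doc_concepts, doc_relatedness))
```

  • The selected concepts would then be submitted as a multiple concept query.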
  • Because the fundamental entity in the query is a concept, and documents in the document-by-example are simply converted into a query of one or more concepts, it is also possible to create queries that combine concepts and documents simultaneously, as shown in FIGs. 13B and 13C below.
  • the query has exactly one document and one concept being specified.
  • it may be the name of a person, and an additional concept to further sharpen the query.
  • the person is associated with a document, which in turn is associated with 50 concepts.
  • In FIG. 6, a process for storing and querying conceptual indices in an inverted table is generally shown in accordance with an embodiment.
  • a query is received, and at block 604, data is accessed that links text in a document to concepts in a concept graph.
  • a measure of closeness between the query and each of the concepts is computed, and at block 608, a selected threshold number of concepts that are closest to the query are output.
  • a method can include creating a conceptual inverted index based on conceptual indices.
  • the conceptual inverted index includes conceptual inverted index entries, each of which corresponds to a separate concept in a concept graph.
  • the creating can include, with respect to the concept corresponding to the conceptual inverted index, populating the conceptual inverted index entry with pointers to documents selected from the conceptual index having likelihoods of being related to the concept that are greater than a threshold value and the corresponding likelihoods of the documents.
  • the method can also include receiving a query that includes one of the concepts in the concept graph as a search term, searching the conceptual inverted index for the search term, and generating query results from the searching.
  • the query includes a plurality of concepts as the search term.
  • the query results can include at least a subset of the pointers to documents.
  • the subset includes pointers to those documents having likelihoods greater than a second threshold of being related to the concept included in the search term.
  • the query results can include a pointer to a document that does not explicitly mention the search term.
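  • Building a conceptual inverted index from per-document conceptual indices, with the two thresholds described above, can be sketched as follows. The D2 scores reuse the example values given earlier in this document; the D1 scores, the threshold values, and the function names are hypothetical.

```python
def build_inverted_index(conceptual_indices, threshold=0.1):
    """Map each concept to (doc_id, likelihood) postings whose likelihood exceeds `threshold`."""
    inverted = {}
    for doc_id, likelihoods in conceptual_indices.items():
        for concept, likelihood in likelihoods.items():
            if likelihood > threshold:
                inverted.setdefault(concept, []).append((doc_id, likelihood))
    for postings in inverted.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return inverted


def search(inverted, concept, second_threshold=0.0):
    """Return postings for `concept` whose likelihood exceeds the second threshold."""
    return [(d, s) for d, s in inverted.get(concept, []) if s > second_threshold]


conceptual_indices = {
    "D1": {"Vanilla ice cream": 0.9, "Dairy": 0.6, "Kitchen": 0.2},  # hypothetical
    "D2": {"Vanilla ice cream": 0.1, "Induction cooking": 1.0, "Kitchen": 0.8,
           "Vanilla Planifolia": 0.01, "Dairy": 0.15, "LCD screen": 0.1},
}
index = build_inverted_index(conceptual_indices)
print(search(index, "Dairy"))
print(search(index, "Kitchen", second_threshold=0.5))
```

  • A document can appear under a concept such as "Dairy" even if it never mentions the term, because the likelihoods come from the concept-space representation rather than from keyword matches.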
  • Each of the conceptual indices can be associated with a corresponding one of the documents and includes a conceptual index entry for each concept in the concept graph, and each of the conceptual index entries specifies a value indicating a likelihood that the one of the documents is related to the concept in the concept graph.
  • a conceptual index entry can indicate that a document is related to a concept in the concept graph that is not mentioned in the document.
  • the concept not mentioned in the document can be related to the concept included in the query.
  • a document is related to a concept in the concept graph if a concept extracted from the document is connected to the concept in the concept graph via a path in the concept graph.
  • each document is associated with a valid version number and generating query results includes verifying that any pointers to documents correspond to documents that match the associated valid version numbers. Pointers to documents that do not match the associated valid version numbers can be removed from the conceptual inverted index.
  • the resulting system can be used for recommending which patent examiners to use for a given patent application. If the documents ingested are descriptions of the expertise of a body of researchers in an organization, the resulting system can be used as an "expertise locator" to help locate skills in an organization with the goal of making its processes more efficient.
  • Extraction of candidate new concepts from this reference corpus may happen by extracting noun phrases from this reference corpus. These noun phrases can then be regarded as a string of text for which a new concept definition may be considered. The process of analyzing a string of text for this goal has been described elsewhere herein.
  • the a priori information for each document can be analyzed in conjunction with a concept graph in order to obtain a representation of the document in a concept space, obtaining a posteriori information for each document. This analysis is described elsewhere herein.
  • the data produced so far is organized in a "conceptual" reverse index (or conceptual inverted index) which allows the fast retrieval of relevant documents given a conceptual query.
  • the system may produce an explanations index that can be used to help construct explanations to a user about the relevance of a document to a conceptual query.
  • the mechanisms for building and maintaining these indices are described elsewhere in this embodiment.
  • the explanations can be computed on the fly at the same time the query is made, as described in this embodiment.
  • the system exhibits a user interface in which an input can be entered.
  • the input mechanism can allow the user to specify one or more concepts from the concept graph, one or more documents (so that similar documents can be returned), or a combination of both.
  • a plain string of text may be entered, which is then passed to a mechanism that extracts concepts from it such as the text to concept graph linker described in this embodiment.
  • After a query has been input and a collection of concepts made available to the system, which is referred to herein as a conceptual query, the system proceeds to look up the concept-based reverse index to find documents that are relevant to the one or more concepts in the query. As the reverse index is organized as a table that is looked up via single concept queries, multiple lookups are made and these lookups are aggregated to form a single presentation.
  • the documents returned are presented with an explanation of why they are relevant to the query by emphasizing those concepts within the extracted concepts in the document that are most relevant to the query and/or showing related text containing those concepts. Additionally, the documents themselves may be clustered in groups of documents that relate to the query in similar ways, as described earlier in this embodiment.
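  • The aggregation of several single-concept lookups into one ranking is not specified in this passage. The sketch below uses a simple rank-based rule (summing reciprocal ranks across lookups) as a stand-in, chosen because it sidesteps the score calibration issue noted earlier; it is not presented as the method of the embodiments, and `aggregate_lookups` and the toy scores are hypothetical.

```python
def aggregate_lookups(inverted, query_concepts):
    """Combine per-concept lookups by summing reciprocal ranks of each document."""
    combined = {}
    for concept in query_concepts:
        ranked = sorted(
            inverted.get(concept, {}).items(),
            key=lambda item: item[1],
            reverse=True,
        )
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            combined[doc_id] = combined.get(doc_id, 0.0) + 1.0 / rank
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reverse index: concept -> {doc_id: score}
reverse_index = {
    "Kitchen": {"D1": 0.2, "D2": 0.8},
    "Dairy": {"D2": 0.6, "D1": 0.15, "D3": 0.05},
}
print(aggregate_lookups(reverse_index, ["Kitchen", "Dairy"]))
```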
  • the various concepts being emphasized may also be presented with a hyperlink; clicking on this hyperlink may have the effect of changing the query to include the clicked concept.
  • Embodiments can provide a user interface for summarizing the relevance of a document to a query.
  • embodiments can provide a system for searching, recommending, and exploring documents through conceptual association.
  • the system is referred to herein as the "Researcher System" and it can serve as an external and/or internal Web face for people and projects across a corporation's (or other group) research locations (or other types of locations).
  • Embodiments can support a technique for automatically extracting and using Wikipedia based concepts from the Researcher System's underlying content store to support the finding of potential experts through the novel forms of fuzzy matching described herein.
  • the Researcher System can simplify content creation and linking in order to make it easier for researchers (or other types of employees) to describe their professional skills and interests and, in so doing, to make them easier to find.
  • Embodiments also include searches and user interfaces for finding researchers that are conceptually related to a query despite the absence of string matches between the query and the underlying repository.
  • the Researcher System contains descriptions of nearly 2,000 professionals, along with their projects, tens of thousands of their papers and patents, and descriptions of their personal interests. Key aspects of the Researcher System are described herein including how its design supports rapid authoring, by individual researchers without central editorial control, of relatively coherent and richly linked content.
  • Also described herein is the use of the content in the Researcher System as a source of high value information about people's skills and interests. An example shows how embodiments can be used to efficiently index this repository and map search queries to this index to find people with related skills and interests. Unlike systems relying on explicit tag creation or fixed ontologies, embodiments include the ability to automatically create something like tags using the conceptual space provided by Wikipedia. Terms within search queries are mapped to concepts within this space and the conceptual distance from the query to all the people in the repository is computed.
  • Embodiments of the Researcher System create characterizations of skills and interests by automatic analysis of textual content. These characterizations make use of the vast collection of over five million concepts currently found within Wikipedia. Since each concept is extremely unlikely to be directly associated with any one person, embodiments make use of Wikipedia's link structure to find concepts near the concepts in a query. And, once a potentially relevant person is found, the Researcher System can facilitate navigation to other related people, helping the searcher find people who might also be of interest. To accomplish this, the use of various components described in this invention may be employed. First, an automatic linking of the text describing the researchers to a concept graph derived from Wikipedia is performed. The text of these researchers is processed to discover new concepts that may be added to the concept graph.
  • the information about the automatic linking, which is regarded as "a priori" information about the documents, in combination with the concept graph, is used to associate each of the documents (representing a person) to each concept in a concept graph, thus obtaining the representation of the document in a concept space.
  • searches are based on concepts expressed using a search tool outlined below. Recall is substantially improved by this technique since many concepts tend to lie near the concept or concepts expressed in the query and many people tend to have many associated concepts near these same concepts. This rich fuzzy matching almost always results in a set of relevant people being returned.
  • Embodiments of the Researcher System also provide for ease of content authoring and immediate publishing of external content, in order to facilitate growth in the number of pages and increased page freshness. Embodiments also provide for links between people to enhance navigation in the Researcher System. Thus, once a searcher finds a person of interest, crosslinks to related people or projects are also found. Embodiments provide ease of crosslink creation (including automatic crosslink creation where possible).
  • In FIGs. 7-10, primary views (user interfaces) of people and projects in a deployment of the Researcher System are generally shown in accordance with embodiments.
  • Embodiments support both internal views (limited to those inside the corporation) and external views (visible to all).
  • FIG. 7 depicts an external view of a user interface 700 of a researcher profile in accordance with an embodiment.
  • photo, name, job role, location, and contact information are displayed at the top with the email address and phone number being rendered in a way that makes scraping for spam purposes prohibitively expensive, if not impossible.
  • a user interface 800 illustrates a Profile Overview, Publications Page, and Patents Page associated with the researcher, which are arrayed as a collection of tabbed pages. Other tabs may be created to hold additional content, as shown in FIG. 10 (e.g., "Risk Perception").
  • the Profile can contain any HTML content, created using an editor shown in FIG. 11. In this fairly typical example, links to related information, both information hosted in a repository of the Researcher System and that residing elsewhere on the open Web, are included in the Profile description.
  • the user interface 800 provides an external view of a portion of this researcher's publication page in accordance with an embodiment.
  • Publications can be added using a mechanism such as that shown in FIG. 12 below and can be automatically sorted by year.
  • Abstracts, when available, can also be automatically pulled from external digital libraries. Links to these external documents, and links to co-authors' pages, can also be automatically inserted. This supports quick navigation to both a paper and its collaborating authors.
  • In FIG. 9, a user interface 900 of an internal view of a portion of this researcher's Patents page is shown in accordance with an embodiment.
  • Internal, in this case, means that the viewer is known (by various automatic means) to be an employee of the corporation.
  • FIG. 10 shows a user interface 1000 of a portion of one of this researcher's projects in accordance with an embodiment.
  • the user interface 1000 shown in FIG. 10 is an internal view of a project page showing the "join/edit group" button in accordance with an embodiment.
  • the upper left of the page holds an optionally displayed photomontage of all group members, allowing direct navigation to project collaborators.
  • FIG. 10 shows the internal view of this project, in particular, the view provided to an internal corporate employee who is not currently in the project. As such, a "Join/Edit Group" button is displayed making it trivially easy to join the project and have it
  • FIGs. 11 and 12 show how content (beyond the links to related content discussed above) can be authored in an embodiment of the Researcher System.
  • In FIG. 11, a user interface 1100 of an editor for a researcher's profile page is generally shown in accordance with an embodiment.
  • FIG. 11 shows the primary Profile editor that can be restricted to only being available to the particular researcher and/or to their designee.
  • buttons to preview and externalize the page appear at the top of this page and can also appear at the top of every content creation page.
  • the editor shown in FIG. 11 has a close relationship to the external view. Content layout is essentially the same as in the view with only a few additional editor controls surrounding each major section.
  • embodiments include the provision of both a "View Profile" button (for previewing) and an "Allow external viewing of your profile" checkbox that makes the content immediately visible on the open Web. This can help motivate people to make quick updates rather than waiting for some major change or batching their smaller updates because of a slow process of staging and vetting.
  • edit controls allow the researcher photo on the left, and the name, job role, location, and contact information on the right to be easily updated.
  • When a page is first created, both the researcher's photo and this top-level information can be automatically extracted from a corporate lightweight directory access protocol (LDAP) server to simplify page creation. Future editing sessions can update this information from the server as needed unless the owner explicitly changes this content.
  • Reasons for changing the job role from what is stored in the LDAP server can include, for example, the addition of a university affiliation not captured in the LDAP record.
  • Reasons for changing the location include, for example, specifying geographic information for a remote worker who is nominally associated with one of the global research laboratories but is actually physically located somewhere else.
  • the what-you-see-is-what-you-get (WYSIWYG) editor for Profile page content uses a tool such as TinyMCE with only a few modifications to limit the feature set and keep authored headings appropriately situated within the final generated page structure.
  • the tool can also be used to convert previously authored content (say from a LinkedIn profile or a CV file) to HTML.
  • the "Add/Change" button for Professional Associations allows the viewing and selection from among more than 200 previously entered Professional Associations (along with the creation of new ones).
  • Professional Interests differ from these two categories in that it is only possible to select from a fixed ontology of around 40 high-level disciplines and interest areas. Each discipline and interest area already has a page in the system that is edited by the area owner to ensure it depicts the area (e.g., Human Computer Interaction) appropriately.
  • an optional More Information section for any links that are thought by the author to be best displayed as a simple list. This section can also serve as an overflow area for any page tabs that cannot be rendered within a single row (a rare occurrence).
  • In FIG. 12, a user interface 1200 for a portion of an editor for a researcher's publications page is generally shown in accordance with an embodiment.
  • An embodiment is not fully automated because it may require explicit selection from a list of possible papers, to account for the difficulty of unambiguously identifying authors in current external paper repository implementations. However, this process takes only a few minutes, even for researchers with a large number of publications and patents. Once a paper or patent is identified, the automatic provisioning of available abstracts, keywords, paper links, and crosslinks between co-authors and co-inventors makes this modest investment worthwhile. Another embodiment can be fully automated when repositories are used that have unique author identifiers; the provision of suitable APIs by external paper repositories may eventually make this feasible.
  • Embodiments of the Researcher System utilize the semantic searching technologies described herein. This can extend a rudimentary text search capability that allows finding people by portions of their name, their nickname, and so on (along with a faceted search for finding people by disciplines and laboratories), to allowing people to be found based on the similarity between the concepts mentioned in their Researcher pages and the conceptual content of a query.
  • FIG. 13A shows a user interface 1300 of the result of typing "color blindness" into a search field.
  • FIGs. 13B and 13C as described previously depict how queries that combine concepts and documents simultaneously are created.
  • the second cluster includes a researcher with expertise in retinal cells. He too would either know a fair bit about color blindness from the perspective of retinal physiology or would likely know people who do.
  • the clusters described above are explicitly called out by the user interface. To accomplish this task, the system takes the concept space
  • clustering algorithms, for example k-means
  • the task of summarizing the most relevant concepts in a document to a query can be accomplished as follows. After a query is input, and a set of documents is returned through the use of the conceptual inverted index, the concepts extracted for these documents are also retrieved. Next, a computation is made to estimate the degree to which the concept or concepts in the query are relevant to each of the extracted concepts for all of the documents returned. This can be achieved by means of the Markov chain techniques described earlier in this invention. Once this estimate is obtained, the concepts within the documents can be ranked, and a desired number of them can be chosen for a word/concept cloud display. The displaying of these concepts can then utilize the relevancy information in order to affect their size on a display, or the color in which they are displayed. In addition, the extracted concepts not only could be displayed in a cloud, they could also be displayed in the context of the text that contains them, by rendering the text and the concepts in differentiating ways.
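As a rough illustration of the ranking and display step described above, the following Python sketch keeps the top-ranked extracted concepts and maps each relevance estimate to a font size. The function name `cloud_entries`, the point-size range, and the linear scaling are assumptions made for illustration; the relevance estimates are taken as given (for example, from the Markov chain techniques) and are not computed here.

```python
def cloud_entries(relevance, top_n=5, min_pt=10, max_pt=32):
    """Rank concepts by relevance and map each kept concept to a font size.

    relevance: dict mapping an extracted concept to its estimated relevance
    to the query (hypothetical input; any consistent scale works).
    Returns a list of (concept, font_size_in_points), most relevant first.
    """
    ranked = sorted(relevance.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not ranked:
        return []
    highest = ranked[0][1]
    lowest = ranked[-1][1]
    span = highest - lowest
    entries = []
    for concept, score in ranked:
        # Linear interpolation between min_pt and max_pt; if every kept
        # concept has the same score, all of them get the largest size.
        frac = 1.0 if span == 0 else (score - lowest) / span
        entries.append((concept, round(min_pt + frac * (max_pt - min_pt))))
    return entries


# Example call with made-up relevance estimates.
example = cloud_entries(
    {"retinal ganglion cells": 0.9, "neural encoding": 0.5, "memory systems": 0.1},
    top_n=2,
)
```

The same relevance value could drive color instead of size; only the final mapping would change.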
  • explanations index After selecting those concepts, explanations can be stored for that document only for the selected concepts. These explanations are then retrieved at query time for the documents for which we want to display a summary. As before, excerpts of text from the document can also be highlighted with the extracted concepts, adding differentiating visual elements to distinguish the ones that are the most relevant to the query.
  • t_E be the concepts extracted from the document, and suppose that the information for how relevant ti is for cj is given by some function f(ti, cj) (the function f( , ) can be based on Markov chain techniques as described elsewhere herein).
  • Many aggregation techniques are possible. For example, the minimum of these numbers may be computed. Or the average of these numbers may be computed. Or some
  • If f(., .) in this embodiment is a likelihood (a probability), then the log likelihood ratio of each number in the list may be computed and the results summed; the log likelihood ratio for a likelihood p is given by log(p / (1 - p)).
  • FIG. 15 shows a user interface 1500 with a portion of the results returned from the action of clicking "retinal ganglion cells" in the word cloud of Edward Daniels in FIG. 14 (Edward, of course, also being returned in these new search results). It is clear that these are people working on different, but potentially related, concepts including "neural encoding" and "neural networks", either of which might be a bit closer to what a searcher is actually trying to find.
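The aggregation options mentioned above (minimum, average, and a sum of log likelihood ratios) can be sketched in Python as follows. This is a minimal illustration, assuming the input is a list of relevance numbers f(ti, cj) for one extracted concept against each query concept; the function names and the `how` parameter are hypothetical and not taken from the source.

```python
import math


def log_likelihood_ratio(p):
    """log(p / (1 - p)) for a likelihood p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def aggregate(scores, how="llr_sum"):
    """Combine a non-empty list of relevance numbers into a single number.

    how="min"     -> the minimum of the numbers
    how="mean"    -> the average of the numbers
    how="llr_sum" -> the sum of log likelihood ratios (numbers must be
                     probabilities strictly between 0 and 1)
    """
    if how == "min":
        return min(scores)
    if how == "mean":
        return sum(scores) / len(scores)
    if how == "llr_sum":
        return sum(log_likelihood_ratio(p) for p in scores)
    raise ValueError(f"unknown aggregation: {how}")
```

With the log likelihood ratio, a value of p above 0.5 contributes positively and a value below 0.5 contributes negatively, so one weakly related query concept can offset a strongly related one.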
  • Embodiments of the Researcher System ingest and index the externalized content within the Researcher System, including all profiles, secondary pages, and the titles of publications and patents. When content is added or changed, it is automatically indexed within a few minutes of its appearance.
  • An important aspect of this embodiment of the conceptual indexing system is that the indexing of the corpus of data is assisted by an external data source (e.g., Wikipedia).
  • a wikification system is used that parses the content of a researcher's pages and phrases are matched against titles of Wikipedia articles.
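A very simplified picture of matching phrases against Wikipedia article titles is sketched below in Python. It is an assumption-laden stand-in for a real wikification system: the title set is a small in-memory collection supplied by the caller, matching is a greedy longest-phrase lookup on lowercased words, and no disambiguation is attempted. The function name `match_titles` and the `max_words` limit are hypothetical.

```python
import re


def match_titles(text, titles, max_words=4):
    """Return the titles found in text, scanning left to right.

    titles: iterable of article titles (hypothetical sample data).
    At each position the longest matching phrase of up to max_words
    words wins, so "color blindness" is preferred over "color".
    """
    lookup = {title.lower(): title for title in titles}
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    i = 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in lookup:
                found.append(lookup[phrase])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no title starts at this word
    return found
```

Titles containing punctuation would need the same normalization applied to the lookup keys; that detail is omitted here.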
  • an internal representation of related concepts can be computed with which the researcher may be familiar.
  • the information retrieval literature provides multiple approaches for estimating this.
  • An embodiment uses a variant of the class of techniques that compute relatedness between a concept and other concepts by computing the number of short link paths between them. For each researcher the concepts found in the first stage are augmented by using closely related concepts based on this metric.
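One way to picture a relatedness measure based on the number of short link paths is the Python sketch below, which counts the simple directed paths between two concepts up to a maximum number of links. The adjacency-list representation, the function name `count_short_paths`, and the default limit of three links are assumptions for illustration; the embodiment's actual variant is not specified here.

```python
def count_short_paths(links, source, target, max_len=3):
    """Count simple directed paths from source to target using at most
    max_len links.

    links: dict mapping a concept to the list of concepts it links to
    (a hypothetical stand-in for a Wikipedia-derived link graph).
    """
    if source == target:
        return 0

    def walk(node, remaining, visited):
        if node == target:
            return 1
        if remaining == 0:
            return 0
        total = 0
        for nxt in links.get(node, ()):
            if nxt not in visited:  # keep paths simple (no repeated nodes)
                total += walk(nxt, remaining - 1, visited | {nxt})
        return total

    return walk(source, max_len, {source})


# Made-up link graph for illustration only.
example_links = {
    "Color blindness": ["Retina", "Cone cell", "Visual perception"],
    "Retina": ["Visual perception"],
    "Cone cell": ["Retina"],
    "Visual perception": [],
}
```

A larger count suggests a closer relationship, so the concepts found for a researcher could be augmented with the neighbors that score highest under this count.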
  • weights can be computed for the individual concepts associated with each researcher.
  • the primary reason is that writing styles vary significantly from researcher to researcher.
  • a secondary reason is that the annotator in the first stage sometimes makes mistakes and thus further de-noising is needed.
  • the log-normal behavior implies that measuring expertise by simple linear accumulation of evidence is likely to disproportionally favor individuals with very lengthy descriptions.
  • Embodiments of scoring mechanisms do give credit to researchers who have more content, however they also detect the verbosity in a researcher's description in order to adjust scores appropriately.
  • a fuzzy matching approach as described herein using Wikipedia concepts can be used to increase the likelihood of finding people related to a query.
  • embodiments can be used to increase the precision of search results.
  • FIG. 18 shows user interface 1800 with a portion of the results that may be returned from searching for the concept "design". While different kinds of design-related researchers are returned (with people related to interactive design and design methodologies being near the top), the aspect of design as applied to memory systems and thin clients does not rank as highly. As such, the precision of the search can be increased.
  • the system for finding people described herein can be augmented with social networking capabilities.
  • a mechanism can be added for inputting into a system a "thought" which the system processes as a conceptual query.
  • the documents in the system are documents that are representative of a person's interests and/or skills. Therefore, the system is capable of creating a list of people that may be interested in the thought input by assessing the degree to which the thought is relevant to a person.
  • the goal of the system is to enable an interaction between the initiator of the thought and the rest of the community in the system, for example in the form of a chat (discussion) focused on the aforementioned thought.
  • the closeness between the thought and the interests or skills of a person is only one of multiple parameters that are considered in order to select people that may be potentially interested in the discussion. Additional parameters include whether a person is presently logged into the system, whether a person is presently or recently active in the system, whether a person has engaged with discussion requests or not, and the number of discussions in which a person has engaged recently.
  • recommended articles may also be given when a user is already reading an article, by analyzing the article, extracting concepts in the article and then finding articles that are close to the article being read in a conceptual sense (as opposed to a text similarity sense, which is a common way to do document similarity analysis).
  • a navigation facility for news articles may be added by displaying concepts related to articles with hyperlinks; clicking on a hyperlink results in redoing a conceptual search with the concept that was clicked, and thus displaying new content.
  • the sequence of articles and concepts clicked can then be used to create a user profile to understand what interests the user in a conceptual sense.
  • These concepts can then be used as contextual information to aid further article recommendations even in the absence of any specific query from the user.
  • the lists of articles returned by the system in the various cases described are further re-ranked to account for the date on which the article was written, so that more recent articles receive an increase in their ranking. This provides a hybrid ranking system considering conceptual relevance and recency.
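The hybrid ranking described above could take many forms; the Python sketch below shows one possibility, in which each article's conceptual relevance is multiplied by a boost that decays exponentially with the article's age. The half-life, the boost weight, the tuple layout, and the function name `rerank` are all assumptions chosen for illustration, not parameters given in the source.

```python
from datetime import date


def rerank(results, today, half_life_days=30.0, recency_weight=0.5):
    """Re-rank (article_id, relevance, published_date) tuples.

    The recency term is 1.0 for an article published today and halves
    every half_life_days; it scales relevance by up to
    (1 + recency_weight). Setting recency_weight=0 keeps pure relevance.
    """
    def score(item):
        _, relevance, published = item
        age_days = max((today - published).days, 0)
        recency = 0.5 ** (age_days / half_life_days)
        return relevance * (1.0 + recency_weight * recency)

    return sorted(results, key=score, reverse=True)


# Made-up articles: an older, slightly more relevant one and a fresh one.
example_results = [
    ("older-article", 0.80, date(2014, 7, 13)),
    ("fresh-article", 0.75, date(2015, 7, 13)),
]
example_order = rerank(example_results, today=date(2015, 7, 13))
```

A multiplicative boost keeps an irrelevant article from ranking highly just because it is new; an additive boost would behave differently in that respect.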
  • a method for summarizing the relevance of a document to a conceptual query can include receiving a conceptual query that includes one or more concepts.
  • Concepts extracted from the document are accessed, and a degree to which the conceptual query is related to each of the extracted concepts is computed (e.g. based on paths in a concept graph connecting the concepts in the conceptual query to each of the extracted concepts).
  • a summary can be created by selecting a threshold number of the concepts having a greatest degree of relation to the conceptual query, and then output (e.g., to a user interface).
  • the summary can include excerpts of text from documents highlighting extracted concepts that have a degree of relevance to the conceptual query.
  • Each of the concepts in the summary can be associated with a hyperlink and based on a user selecting the hyperlink, a new list of summaries related to the concept in the hyperlink can be output.
  • a relevance of an extracted concept to a conceptual query can be summarized by at least one of changing a font size of the extracted concept and changing a color of the extracted concept.
  • at least one concept in the summary is not a concept in the conceptual query.
  • the computing the degree of relation can include iterating a Markov chain derived from the concept graph.
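To make the idea of iterating a Markov chain over the concept graph concrete, the Python sketch below runs a random walk with restart from the query concepts and returns the resulting probability for every concept. The restart formulation, the restart probability, the treatment of concepts without outgoing links, and the function name are assumptions for illustration; the source does not specify which chain or how many iterations are used.

```python
def iterate_markov_chain(links, query_concepts, steps=50, restart=0.15):
    """Approximate how strongly each concept relates to the query.

    links: dict mapping a concept to the list of concepts it links to
    (hypothetical graph). At every step the walker follows a random
    outgoing link with probability 1 - restart and jumps back to a query
    concept otherwise. Concepts with no outgoing links send their mass
    back to the query concepts (an assumed convention).
    """
    seeds = set(query_concepts)
    nodes = set(links) | seeds
    for targets in links.values():
        nodes.update(targets)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(steps):
        nxt = {n: restart * start[n] for n in nodes}
        for node, mass in prob.items():
            targets = links.get(node, [])
            if targets:
                share = (1.0 - restart) * mass / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                for seed in seeds:
                    nxt[seed] += (1.0 - restart) * mass / len(seeds)
        prob = nxt
    return prob


# Made-up concept graph for illustration only.
example_graph = {
    "Color blindness": ["Retina", "Cone cell"],
    "Cone cell": ["Retina"],
    "Retina": ["Retinal ganglion cell"],
    "Retinal ganglion cell": [],
}
example_scores = iterate_markov_chain(example_graph, ["Color blindness"])
```

The returned probabilities could serve as the function f(ti, cj) when the query holds a single concept, and they could be precomputed at indexing time for that case.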
  • the computing can be performed at indexing time for conceptual queries comprised of a single concept, and at least a subset of the results of the computing stored in an explanations index. Creating the summaries can include retrieving the explanations index.
  • the conceptual query can include two or more concepts and creating a summary can be responsive to a degree of relation between each of the concepts in the conceptual query and each of the extracted concepts.
  • an embodiment for summarizing the relevance of documents to a conceptual query can include: receiving the conceptual query; accessing extracted concepts for each of the documents; computing a degree to which each of the documents are related to one another (e.g., as part of a clustering algorithm), the computing responsive to paths in the concept graph connecting the extracted concepts in one document to extracted concepts in another document; assigning the documents to one or more groups based on the computing, wherein a pair of documents having a first score that specifies a degree of relation are more likely to be in the same group than a pair of documents having a second score specifying a degree of relation that is lower than the first score; and outputting results of the assigning.
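The grouping step can be illustrated with a deliberately simple stand-in for a clustering algorithm: documents whose pairwise degree of relation meets a threshold are linked, and each connected set of linked documents forms a group. This is not k-means and is not claimed to be the embodiment's method; the pairwise scores are assumed to have been computed already (for example, from paths in the concept graph), and the function name and threshold are hypothetical.

```python
def group_documents(doc_ids, pair_scores, threshold=0.5):
    """Group documents by linking every pair scoring at least threshold.

    doc_ids: list of document identifiers.
    pair_scores: dict mapping (doc_a, doc_b) to a degree of relation
    (hypothetical precomputed values).
    Returns a sorted list of groups, each a list of document ids.
    """
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow parent pointers to the group representative,
        # shortening the chain along the way.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())
```

Under this scheme a pair with a higher score is at least as likely to share a group as a pair with a lower score, which is the property the embodiment describes.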
  • In FIG. 20, a high-level block diagram of a question-answer (QA) framework 2000 where embodiments described herein can be utilized is generally shown.
  • the QA framework 2000 can be implemented to generate an answer 2004 (and a confidence level associated with the answer) to a given question 2002.
  • general principles implemented by the framework 2000 to generate answers 2004 to questions 2002 include massive parallelism, the use of many experts, pervasive confidence estimation, and the integration of shallow and deep knowledge.
  • the QA framework 2000 shown in FIG. 20 is implemented by the Watson™ product from IBM.
  • the QA framework 2000 shown in FIG. 20 defines various stages of analysis in a processing pipeline. In an embodiment, each stage admits multiple implementations that can produce alternative results. At each stage, alternatives can be independently pursued as part of a massively parallel computation. Embodiments of the framework 2000 don't assume that any component perfectly understands the question 2002 and can just look up the right answer 2004 in a database. Rather, many candidate answers can be proposed by searching many different resources, on the basis of different interpretations of the question (e.g., based on a category of the question.) A commitment to any one answer is deferred while more and more evidence is gathered and analyzed for each answer and each alternative path through the system. [00192] As shown in FIG.
  • the question and topic analysis 2010 is performed and used in question decomposition 2012.
  • Hypotheses are generated by the hypothesis generation block 2014, which uses input from the question decomposition 2012, as well as data obtained via a primary search 2016 through the answer sources 2006 and candidate answer generation 2018, to generate several hypotheses.
  • Hypothesis and evidence scoring 2026 is then performed for each hypothesis using evidence sources 2008 and can include answer scoring 2020, evidence retrieval 2022 and deep evidence scoring 2024.
  • a synthesis 2028 is performed of the results of the multiple hypothesis and evidence scorings 2026.
  • Input to the synthesis 2028 can include answer scoring 2020, evidence retrieval 2022, and deep evidence scoring 2024.
  • Learned models 2030 can then be applied to the results of the synthesis 2028 to generate a final confidence merging and ranking 2032.
  • An answer 2004 (and a confidence level associated with the answer) is then output.
  • Semantic analytics play a key role in information extraction by the QA framework 2000 shown in FIG. 20.
  • Embodiments of the concept-driven analytics disclosed herein can be utilized by the QA framework 2000 to provide relevant search and recommendation results, as well as providing rich document exploration capabilities.
  • a document e.g., as question 2002
  • concept extraction e.g., analysis 2010, decomposition 2012
  • the extracted concepts may be used to determine relationships against a concept space (e.g., answer sources 2006).
  • Factors such as the number of paths between concepts, as well as the length of the paths thereof, can be used by the scoring mechanism 2020, to determine these relationships, and generate corresponding answers 2004.
  • patterns of associations or relationships for a given concept or group of concepts may be collected over time and stored, e.g., as models 2030, which are applied to subsequently derived extracted concepts.
  • the processing system 2100 has one or more central processing units (processors) 2101a, 2101b, 2101c, etc. (collectively or generically referred to as processor(s) 2101).
  • processors 2101 are coupled to system memory 2114 and various other components via a system bus 2113.
  • ROM Read only memory
  • BIOS basic input/output system
  • the system memory 2114 can include ROM 2102 and random access memory (RAM) 2110, which is read-write memory coupled to system bus 2113 for use by processors 2101.
  • FIG. 21 further depicts an input/output (I/O) adapter 2107 and a network adapter 2106 coupled to the system bus 2113.
  • I/O adapter 2107 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 2103 and/or tape storage drive 2105 or any other similar component.
  • I/O adapter 2107, hard disk 2103, and tape storage drive 2105 are collectively referred to herein as mass storage 2104.
  • Software 2120 for execution on processing system 2100 may be stored in mass storage 2104.
  • Network adapter 2106 interconnects system bus 2113 with an outside network 2116 enabling processing system 2100 to communicate with other such systems.
  • a screen (e.g., a display monitor) 2115 is connected to system bus 2113 by display adapter 2112, which may include a graphics controller to improve the performance of graphics intensive applications and a video controller.
  • adapters 2107, 2106, and 2112 may be connected to one or more I/O buses that are connected to system bus 2113 via an intermediate bus bridge (not shown).
  • Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI).
  • Additional input/output devices are shown as connected to system bus 2113 via user interface adapter 2108 and display adapter 2112.
  • a keyboard 2109, mouse 2140, and speaker 2111 can be interconnected to system bus
  • user interface adapter 2108 may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit.
  • processing system 2100 includes processing capability in the form of processors 2101, and, storage capability including system memory
  • system memory 2114 and mass storage 2104 collectively store an operating system such as the AIX® operating system from IBM Corporation to coordinate the functions of the various components shown in FIG. 21.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Selon un aspect, la recherche, la recommandation et d'exploration de documents au moyen d'associations conceptuelles comprennent un procédé permettant de recevoir une pluralité de documents et d'extraire des concepts de chacun des documents. Un degré de relation entre chacun des documents et concepts dans une base de connaissances est calculé. Le procédé consiste également, en réponse à la réception d'une demande, à déterminer un ou plusieurs concepts à partir de la demande. Pour chacun des concepts, une liste de documents ayant un degré de relation le plus élevé avec le concept est récupérée. Le procédé consiste également à générer une liste qui répond à la liste ou aux listes récupérées. Selon un aspect, le stockage et l'interrogation d'index conceptuels (CI) consistent à créer un index inversé conceptuel (CII) à partir des CI. Les CII comprennent des entrées CII, chacune d'elles correspondant à un concept dans un graphique conceptuel. La création des CII consiste à renseigner chaque entrée avec des pointeurs vers des documents sélectionnés parmi les CI dont les probabilités d'être reliés au concept sont supérieures à une valeur de seuil, ainsi que les probabilités correspondantes. Un aspect consiste également à recevoir une demande qui comprend un concept dans le graphique conceptuel, et à générer des résultats de demande à partir d'une recherche qui comprend au moins un sous-ensemble de pointeurs vers des documents. Chacun des CI est associé à un document correspondant et comprend une entrée CI pour chaque concept dans le graphique conceptuel, et chacune des entrées CI spécifie une valeur indiquant une probabilité que le document soit associé au concept dans le graphique conceptuel.
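The construction of a conceptual inverted index (CII) from per-document conceptual indices (CIs), as summarized in the abstract, can be sketched in Python as follows. The dictionary layout, the function names `build_cii` and `query_cii`, the threshold value, and the sorting of each entry by likelihood are assumptions made for illustration only.

```python
def build_cii(conceptual_indices, threshold=0.1):
    """Invert per-document CIs into a concept-keyed index.

    conceptual_indices: dict mapping a document id to a dict of
    {concept: likelihood that the document relates to the concept}
    (hypothetical layout). Only likelihoods above threshold are kept,
    and each CII entry stores (document id, likelihood) pairs.
    """
    cii = {}
    for doc_id, entries in conceptual_indices.items():
        for concept, likelihood in entries.items():
            if likelihood > threshold:
                cii.setdefault(concept, []).append((doc_id, likelihood))
    for postings in cii.values():
        # Most likely documents first, so a query can stop early.
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return cii


def query_cii(cii, concept, limit=10):
    """Return up to limit (document id, likelihood) pairs for a concept."""
    return cii.get(concept, [])[:limit]


# Made-up CIs for two documents.
example_cis = {
    "doc1": {"Color blindness": 0.9, "Design": 0.05},
    "doc2": {"Color blindness": 0.3, "Design": 0.7},
}
example_cii = build_cii(example_cis, threshold=0.1)
```

Dropping low-likelihood pairs keeps each entry short, which is the practical reason for thresholding when every CI nominally holds a value for every concept in the graph.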
PCT/IB2015/055280 2014-07-14 2015-07-13 System for searching, recommending, and exploring documents through conceptual associations and inverted table for storing and querying conceptual indices WO2016009321A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US14/330,358 2014-07-14
US14/330,358 US10503761B2 (en) 2014-07-14 2014-07-14 System for searching, recommending, and exploring documents through conceptual associations
US14/330,438 US9703858B2 (en) 2014-07-14 2014-07-14 Inverted table for storing and querying conceptual indices
US14/330,438 2014-07-14

Publications (1)

Publication Number Publication Date
WO2016009321A1 (fr) 2016-01-21

Family

ID=55077963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2015/055280 WO2016009321A1 (fr) 2014-07-14 2015-07-13 System for searching, recommending, and exploring documents through conceptual associations and inverted table for storing and querying conceptual indices

Country Status (1)

Country Link
WO (1) WO2016009321A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3330872A1 (fr) * 2016-12-01 2018-06-06 Spotify AB System and method for semantic analysis of song lyrics in a media content environment
CN110168579A (zh) * 2016-11-23 2019-08-23 启创互联公司 Systems and methods for using knowledge representation with machine learning classifiers
US20210271654A1 (en) * 2020-02-28 2021-09-02 International Business Machines Corporation Contrasting Document-Embedded Structured Data and Generating Summaries Thereof
US11354510B2 (en) 2016-12-01 2022-06-07 Spotify Ab System and method for semantic analysis of song lyrics in a media content environment
CN114730317A (zh) * 2019-08-14 2022-07-08 株式会社谜谜克思 Ideation platform device and method using diagrams

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8024329B1 (en) * 2006-06-01 2011-09-20 Monster Worldwide, Inc. Using inverted indexes for contextual personalized information retrieval
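CN101169780A (zh) * 2006-10-25 2008-04-30 华为技术有限公司 Retrieval system and method based on semantic ontology
CN101685455A (zh) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system for data retrieval
CN103744984A (zh) * 2014-01-15 2014-04-23 北京理工大学 Method for retrieving documents using semantic information
CN104182464A (zh) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Semantics-based text retrieval method
CN104239513A (zh) * 2014-09-16 2014-12-24 西安电子科技大学 Semantic retrieval method for domain-oriented data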

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110168579A (zh) * 2016-11-23 2019-08-23 启创互联公司 Systems and methods for using knowledge representation with machine learning classifiers
EP3330872A1 (fr) * 2016-12-01 2018-06-06 Spotify AB System and method for semantic analysis of song lyrics in a media content environment
US10360260B2 (en) 2016-12-01 2019-07-23 Spotify Ab System and method for semantic analysis of song lyrics in a media content environment
US11354510B2 (en) 2016-12-01 2022-06-07 Spotify Ab System and method for semantic analysis of song lyrics in a media content environment
CN114730317A (zh) * 2019-08-14 2022-07-08 株式会社谜谜克思 Ideation platform device and method using diagrams
US20210271654A1 (en) * 2020-02-28 2021-09-02 International Business Machines Corporation Contrasting Document-Embedded Structured Data and Generating Summaries Thereof
US11500840B2 (en) * 2020-02-28 2022-11-15 International Business Machines Corporation Contrasting document-embedded structured data and generating summaries thereof

Similar Documents

Publication Publication Date Title
US10956461B2 (en) System for searching, recommending, and exploring documents through conceptual associations
US10496683B2 (en) Automatically linking text to concepts in a knowledge base
US10572521B2 (en) Automatic new concept definition
US9734196B2 (en) User interface for summarizing the relevance of a document to a query
US9805139B2 (en) Computing the relevance of a document to concepts not specified in the document
US9773054B2 (en) Inverted table for storing and querying conceptual indices
Deveaud et al. Accurate and effective latent concept modeling for ad hoc information retrieval
Kardan et al. A novel approach to hybrid recommendation systems based on association rules mining for content recommendation in asynchronous discussion groups
Liao et al. Unsupervised approaches for textual semantic annotation, a survey
KR20160030996A (ko) Interactive segment extraction in computer-human interactive learning
WO2016009321A1 (fr) System for searching, recommending, and exploring documents through conceptual associations and inverted table for storing and querying conceptual indices
Zhang et al. A semantics-based method for clustering of Chinese web search results
Arnaout et al. Completeness, Recall, and Negation in Open-World Knowledge Bases: A Survey
Trani Improving the Efficiency and Effectiveness of Document Understanding in Web Search.
Wu Whisk: Web hosted information into summarized knowledge
Balog et al. Understanding Information Needs
Gassler The SnoopyConcept: Leveraging Recommendations for Knowledge Curation
Urbansky Automatic extraction and assessment of entities from the web
Grady Identifying experts and authoritative documents in social bookmarking systems
Tylenda Methods and tools for summarization of entities and facts in knowledge bases
Penev Search in personal spaces
Novo et al. Combination of web usage, content and structure information for diverse web mining applications in the tourism context and the context of users with disabilities
Wang Fuzzy Content Mining for Targeted Advertisement
Jin Extraction and Application of Social Networks from the World Wide Web

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15821421; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 15821421; Country of ref document: EP; Kind code of ref document: A1)