WO2016210203A1 - Learning entity and word embeddings for entity disambiguation - Google Patents

Learning entity and word embeddings for entity disambiguation

Info

Publication number
WO2016210203A1
WO2016210203A1 (PCT/US2016/039129; US2016039129W)
Authority
WO
WIPO (PCT)
Prior art keywords
concurrence
graphs
training
disambiguation
objective function
Prior art date
Application number
PCT/US2016/039129
Other languages
French (fr)
Inventor
Zheng Chen
Jianwen Zhang
Original Assignee
Microsoft Technology Licensing, Llc
Priority claimed from CN201510422856.2A external-priority patent/CN106294313A/en
Application filed by Microsoft Technology Licensing, Llc filed Critical Microsoft Technology Licensing, Llc
Priority to EP16739296.8A priority Critical patent/EP3314461A1/en
Priority to US15/736,223 priority patent/US20180189265A1/en
Publication of WO2016210203A1 publication Critical patent/WO2016210203A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • embeddings include a mapping or mappings of entities and words from training data to vectors of real numbers in a low dimensional space, relative to a size of the training data (e.g., continuous vector space).
  • a device for training disambiguation models in continuous vector space comprises a machine learning component deployed thereon and configured to pre-process training data to generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data, define a probabilistic model for the one or more concurrence graphs, define an objective function based on the probabilistic model and the one or more concurrence graphs, and train at least one disambiguation model based on feature vectors generated through an optimized version of the objective function.
  • a machine learning system comprising training data including free text and a plurality of document anchors, a preprocessing component configured to pre-process at least a portion of the training data to generate one or more concurrence graphs of named entities, words, and document anchors, and a training component configured to generate vector embeddings of entities and words based on the one or more concurrence graphs, wherein the training component is further configured to train at least one disambiguation model based on the vector embeddings.
  • a device for training disambiguation models in continuous vector space comprising a pre-processing component deployed thereon and configured to prepare training data for machine learning through extraction of a plurality of observations, wherein the training data comprises a corpus of text and a plurality of document anchors, generate a mapping table based on the plurality of observations of the training data, and generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data and based on the mapping table.
  • FIG. 1 is a diagram showing aspects of an illustrative operating environment and several logical components provided by the technologies described herein;
  • FIG. 2 is a flowchart showing aspects of one illustrative routine for pre-processing training data, according to one implementation presented herein;
  • FIG. 3 is a flowchart showing aspects of one illustrative routine for training embeddings of entities and words, according to one implementation presented herein;
  • FIG. 4 is a flowchart showing aspects of one illustrative routine for generating features in vector space and training a disambiguation model in vector space, according to one implementation presented herein;
  • FIG. 5 is a flowchart showing aspects of one illustrative routine for runtime prediction and identification of named entities, according to one implementation presented herein;
  • FIG. 6 is a computer architecture diagram showing an illustrative computer hardware and software architecture.
  • the following detailed description is directed to technologies for learning entity and word embeddings for entity disambiguation in a machine learning system.
  • the use of the technologies and concepts presented herein enables accurate recognition and identification of named entities in a large amount of data. Furthermore, in some examples, the described technologies may also increase efficiency of runtime identification of named entities. These technologies employ a disambiguation model trained in continuous vector space. Moreover, the use of the technologies and concepts presented herein is less computationally expensive than traditional bag-of-words-based machine learning algorithms, while also being more accurate than traditional models trained with bag-of-words-based machine learning algorithms.
  • when a user implements or requests a search of a corpus of data for information regarding a particular named entity, it is desirable for the returned results to be related to the requested named entity.
  • the request may identify the named entity explicitly, or through the context of multiple words or a phrase included in the request. For example, if a user requests a search for "Michael Jordan, AAAI Fellow," the phrase "AAAI Fellow" includes context decipherable to determine that the "Michael Jordan" being requested is not a basketball player, but a computer scientist who is also a Fellow of the ASSOCIATION FOR THE ADVANCEMENT OF ARTIFICIAL INTELLIGENCE. Thus, results related to computer science and Michael Jordan are more desirable than results related to basketball and Michael Jordan.
  • this example does not limit the forms of named entities; any named entity is applicable to this disclosure.
  • the phrases "named entity,” “entity,” and variants thereof, correspond to an entity having a rigid designator (e.g., a "name") that denotes that entity in one or more possible contexts.
  • Mount Everest is a named entity having the rigid designator or name of "Mount Everest” or “Everest.”
  • Henry Ford is a person having the name “Henry Ford.”
  • Other named entities such as a Ford Model T, the city of Sacramento, and other named entities also utilize names to refer to particular people, locations, things, and other entities.
  • program modules include routines, programs, components, data structures, circuits, and other types of software and/or hardware structures that perform particular tasks or implement particular data types.
  • the subject matter described herein may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
  • FIG. 1 illustrates an operating environment and several logical components provided by the technologies described herein.
  • FIG. 1 is a diagram showing aspects of a system 100, for training a disambiguation model 127.
  • a corpus of training data 101 may include a large amount of free text 102 and a plurality of document anchors 103.
  • the large amount of free text 102 may include a number of articles, publications, Internet websites, or other forms of text associated with one or more topics.
  • the one or more topics may include one or more named entities, or may be related to one or more named entities.
  • the large amount of free text may include a plurality of web-based articles.
  • the large amount of free text may include a plurality of articles from a web-based encyclopedia, such as WIKIPEDIA. Other sources for the free text 102 are also applicable.
  • the document anchors 103 may include metadata or information related to a particular location in a document of the free text 102, and a short description of information located near or in the particular location of the document.
  • a document anchor may refer a reader to a particular chapter in an article.
  • Document anchors may also automatically advance a viewing pane in a web browser to a location in a web article.
  • document anchors may include "data anchors" if referring to data associated with other types of data, rather than particular documents.
  • document anchors and data anchors may be used interchangeably under some circumstances.
  • Other forms of anchors, including document anchors, data anchors, glossaries, outlines, table of contents, and other suitable anchors, are also applicable to the technologies described herein.
  • the training data 101 may be accessed by a machine learning system 120.
  • the machine learning system 120 may include a computer apparatus, computing device, or a system of networked computing devices in some implementations.
  • the machine learning system 120 may include more or fewer components than those particularly illustrated. Additionally, the machine learning system 120 may also be termed a machine learning component, in some implementations.
  • a number of pseudo-labeled observations 104 may be taken from the training data 101 by a pre-processing component 121.
  • the pre-processing component 121 may be a component configured to execute in the machine learning system 120.
  • the preprocessing component 121 may also be a component not directly associated with the machine learning system 120 in some implementations.
  • the pre-processing component 121 may generate one or more mapping tables 122, a number of concurrence graphs 123, and a tokenized text sequence 124.
  • the pre-processing operations and generation of the mapping tables 122, concurrence graphs 123, and tokenized text sequence 124 are described more fully below with reference to FIG. 2.
  • a training component 125 may train embeddings of entities and words for development of training data. The training of embeddings of entities and words is described more fully with reference to FIG. 3.
  • the training component 125 may also generate a number of feature vectors 126 in continuous vector space.
  • the feature vectors 126 may be used to train the disambiguation model 127 in vector space, as well. The generation of the feature vectors 126 and training of the disambiguation model 127 are described more fully with reference to FIG. 4.
  • a run-time prediction component 128 may utilize the disambiguation model 127 to identify named entities in a corpus of data. Run-time prediction and identification of named entities is described more fully with reference to FIG. 5.
  • FIG. 2 is a flowchart showing aspects of one illustrative method 200 for pre-processing training data, according to one implementation presented herein.
  • the method 200 may begin pre-processing at block 201, and cease pre-processing at block 214. Individual components of the method 200 are described below with reference to the machine learning system 120 shown in FIG. 1.
  • the pre-processing component 121 may prepare the training data 101 for machine learning at block 202.
  • the training data 101 may include the pseudo- labeled observations 104 retrieved from the free text 102 and the document anchors 103, as described above.
  • the vocabulary $V$ is derived from the free text 102, $v_1, v_2, \ldots, v_n$, by replacing all document anchors 103 with corresponding entities.
  • the contexts of $v_i \in V$ are the words or entities surrounding it within an $L$-sized window $\{v_{i-L}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+L}\}$. Subsequently, a vocabulary of contexts $U = U_{word} \cup U_{entity}$ can be established.
  • each word or entity $v \in V$ and each context $u \in U$ is associated with a vector $\omega_v \in \mathbb{R}^d$ and $\mu_u \in \mathbb{R}^d$, respectively.
  • upon preparation of the training data 101 based on the pseudo-labeled observations 104 as described above, the pre-processing component generates the one or more mapping tables 122, at block 204.
  • the mapping table or tables 122 include tables that associate a phrase mentioning an entity with a correct candidate and one or more incorrect candidates. Therefore, the mapping table or tables 122 may be used to train the disambiguation model 127 with both positive and negative examples for any particular phrase mentioning a candidate entity.
  • the pre-processing component 121 also generates an entity-word concurrence graph from the document anchors 103 and text surrounding the document anchors 103, at block 206, an entity-entity concurrence graph from titles of articles as well as the document anchors 103, at block 208, and an entity-word concurrence graph from titles of articles and words contained in the articles, at block 210.
  • a concurrence graph may also be termed a share-topic graph.
  • a concurrence graph may be representative of a co-occurrence relationship between named entities.
  • inlinks(e) denotes the set of entities that link to e .
  • Other concurrence graphs based on entity-entity concurrence or entity-word concurrence may also be generated as explained above, in some implementations.
  • the pre-processing component 121 may generate a tokenized text sequence 124, at block 212.
  • the tokenized text sequence 124 may be a clean sequence that represents text, or portions of text, from the free text 102 as sequences of normalized tokens.
  • any suitable tokenizer may be implemented to create the sequence 124 without departing from the scope of this disclosure.
  • the method 200 may cease at block 214.
  • the training component 125 may receive the mapping table 122, concurrence graphs 123, and the tokenized text sequence 124 as input.
  • operation of the training component is described more fully with reference to FIG. 3.
  • FIG. 3 is a flowchart showing aspects of one illustrative method 300 for training embeddings of entities and words, according to one implementation presented herein. As shown, the method 300 may begin at block 301.
  • the training component 125 may initially define a probabilistic model for concurrences at block 302.
  • the probabilistic model may be based on each concurrence graph 123 based on vector representations of named entities and words, as described in detail above.
  • word and entity representations are learned to discriminate the surrounding word (or entity) within a short text sequence.
  • the connections between words and entities are created by replacing all document anchors with their referent entities.
  • a vector $\omega_v$ is trained to perform well at predicting the vector of each surrounding term $\mu$ from a sliding window.
  • a phrase may include "Michael I. Jordan is newly elected as AAAI fellow." According to this example, the vector of "Michael I. Jordan" in the corpus-vocabulary $V$ is trained to predict the vectors of "is", ..., "AAAI" and "fellow" in the context-vocabulary $U$.
  • $\mathcal{D}$ denotes the collection of word (or entity) and context pairs extracted from the phrases.
  • a corpus-context pair $(v, \mu) \in \mathcal{D}$, with $v \in V$ and $\mu \in U$, may be considered.
  • the training component may model the conditional probability $p(\mu \mid v)$ using a softmax function defined by Equation 1, below:
  • the training component 125 may also define an objective function for the concurrences, at block 304.
  • the objective function may be defined as the likelihood of generating the observed concurrences.
  • the objective function based on Equation 1, above may be defined as set forth in Equation 2, below:
  • the training component 125 may encourage a gap between concurrences that have appeared in the training data and candidate concurrences that have not appeared, at block 306.
  • the training component 125 may further optimize the objective function at block 308, and the method 300 may cease at block 310.
  • FIG. 4 is a flowchart showing aspects of one illustrative method 400 for generating feature vectors 126 in vector space and training the disambiguation model 127 in vector space, according to one implementation presented herein.
  • the method 400 begins training in vector space at block 401.
  • the training component 125 defines templates to generate features, at block 402.
  • the templates may be defined as templates for automatically generating features.
  • the first template may be based on a local context score.
  • the local context score template is a template to automatically generate features for neighboring or "neighborhood" words.
  • the second template may be based on a topical coherence score.
  • the topical coherence score template is a template to automatically generate features based on average semantic relatedness, or the assumption that unambiguous named entities may be helpful in identifying mentions of named entities in a more ambiguous context.
  • the training component 125 computes a score for each template, at block 404.
  • the score computed is based on each underlying assumption for the associated template.
  • the local context template may have a score computed based on local contexts of mentions of a named entity.
  • $C(m_i)$ denotes the candidate entity set of mention $m_i$.
  • multiple local context scores may be computed by changing the context window size $|T|$.
  • after computing scores for each template, the training component 125 generates features from the templates, based on the computed scores, at block 406.
  • Generating the features may include, for example, generating individual features for constructing one or more feature vectors based on a number of disambiguation decisions.
  • a function for the disambiguation decisions is defined by Equation 5, presented below:
  • the disambiguation model 127 may be used to more accurately predict the occurrence of a particular named entity.
  • runtime prediction of named entities is described more fully with reference to FIG. 5.
  • FIG. 5 is a flowchart showing aspects of one illustrative method 500 for runtime prediction and identification of named entities, according to one implementation presented herein.
  • Run-time prediction begins at block 501, and may be performed by run-time prediction component 128, or may be performed by another portion of the system 100.
  • run-time prediction component 128 receives a search request identifying one or more named entities, at block 502.
  • the search request may originate at a client computing device, such as through a Web browser on a computer, or from any other suitable device.
  • Example computing devices are described in detail with reference to FIG. 6.
  • the run-time prediction component 128 may identify candidate entries of web articles or other sources of information, at block 504.
  • the candidate entries are identified from a database or a server.
  • the candidate entries are identified from the Internet.
  • the run-time prediction component 128 may retrieve feature vectors 126 of words and/or named entities, at block 506.
  • the feature vectors 126 may be stored in memory, in a computer readable storage medium, or may be stored in any suitable manner.
  • the feature vectors 126 may be accessible by the run-time prediction component 128 for run-time prediction and other operations.
  • the run-time prediction component 128 may compute features based on the retrieved vectors of words and named entities contained in the request, at block 508. Feature computation may be similar to the computations described above with reference to the disambiguation model 127 and Equation 5. The words and named entities may be extracted from the request.
  • the run-time prediction component 128 applies the disambiguation model to the computed features, at block 510.
  • the run-time prediction component 128 may rank the candidate entries based on the output of the disambiguation model, at block 512.
  • the ranking may include ranking the candidate entries based on a set of probabilities that any one candidate entry is more likely to reference the named entity than other candidate entries. Other forms of ranking may also be applicable.
  • the run-time prediction component 128 may output the ranked entries at block 514.
  • the method 500 may continually iterate as new requests are received, or alternatively, may cease after outputting the ranked entries.
  • the logical operations described above with reference to FIGS. 2-5 may be implemented (1) as a sequence of computer implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system.
  • the implementation is a matter of choice dependent on the performance and other requirements of the computing system.
  • the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules may be implemented in software, in firmware, in special purpose digital logic, or any combination thereof. It should also be appreciated that more or fewer operations may be performed than shown in the figures and described herein. These operations may also be performed in a different order than those described herein.
  • FIG. 6 shows an illustrative computer architecture for a computer 600 capable of executing the software components and methods described herein for pre-processing, training, and runtime prediction in the manner presented above.
  • the computer architecture shown in FIG. 6 illustrates a conventional desktop, laptop, or server computer and may be utilized to execute any aspects of the software components presented herein described as executing in the system 100 or any components in communication therewith.
  • the computer architecture shown in FIG. 6 includes one or more processors 602, a system memory 608, including a random access memory 614 (RAM) and a read-only memory (ROM) 616, and a system bus 604 that couples the memory to the processor(s) 602.
  • the processor(s) 602 can include a central processing unit (CPU) or other suitable computer processors.
  • the computer 600 further includes a mass storage device 610 for storing an operating system 618, application programs, and other program modules, which are described in greater detail herein.
  • the mass storage device 610 is connected to the processor(s) 602 through a mass storage controller (not shown) connected to the bus 604.
  • the mass storage device 610 is an example of computer-readable media for the computer 600.
  • computer-readable media can be any available computer storage media or communication media that can be accessed by the computer 600.
  • Communication media includes computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any delivery media.
  • modulated data signal means a signal that has one or more of its characteristics changed or set in a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of communication media.
  • computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data.
  • computer storage media includes, but is not limited to, RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (DVD), High Definition DVD (HD-DVD), BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and which can be accessed by the computer 600.
  • the phrase "computer storage media,” and variations thereof, does not include waves or signals per se and/or communication media.
  • the computer 600 may operate in a networked environment using logical connections to remote computers through a network such as the network 620.
  • the computer 600 may connect to the network 620 through a network interface unit 606 connected to the bus 604.
  • the network interface unit 606 may also be utilized to connect to other types of networks and remote computer systems.
  • the computer 600 may also include an input/output controller 612 for receiving and processing input from a number of other devices, including a keyboard, mouse, or electronic stylus (not shown in FIG. 6).
  • an input/output controller may provide output to a display screen, a printer, or other type of output device (also not shown in FIG. 6).
  • a number of program modules and data files may be stored in the mass storage device 610 and RAM 614 of the computer 600, including an operating system 618 suitable for controlling the operation of a networked desktop, laptop, or server computer.
  • the mass storage device 610 and RAM 614 may also store one or more program modules or other data, such as the disambiguation model 127, the feature vectors 126, or any other data described above.
  • the mass storage device 610 and the RAM 614 may also store other types of program modules, services, and data.
  • a device for training disambiguation models in continuous vector space comprising a machine learning component deployed thereon and configured to:
  • pre-process training data to generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data
  • a machine learning system comprising:
  • training data including free text and a plurality of document anchors
  • a pre-processing component configured to pre-process at least a portion of the training data to generate one or more concurrence graphs of named entities, associated data, and data anchors;
  • a training component configured to generate vector embeddings of entities and words based on the one or more concurrence graphs, wherein the training component is further configured to train at least one disambiguation model based on the vector embeddings.
  • a system as recited in clause F further comprising a run-time prediction component configured to identify candidate entries using the at least one disambiguation model.
  • a database or server storing a plurality of entries; and a run-time prediction component configured to identify candidate entries from the plurality of entries using the at least one disambiguation model, and to rank the identified candidate entries using the at least one disambiguation model.
  • the probabilistic model is based on a softmax function or normalized exponential function
  • the objective function is a function of a number of negative examples included in the training data.
  • a device for training disambiguation models in continuous vector space comprising a pre-processing component deployed thereon and configured to:
  • prepare training data for machine learning through extraction of a plurality of observations, wherein the training data comprises a corpus of text and a plurality of document anchors;
  • generate a mapping table based on the plurality of observations of the training data
  • a device as recited in clause K further comprising a machine learning component deployed thereon and configured to:
  • a run-time prediction component configured to identify candidate entries from the plurality of entries using the at least one disambiguation model, and to rank the identified candidate entries using the at least one disambiguation model.
  • All of the methods and processes described above may be embodied in, and fully or partially automated via, software code modules executed by one or more general purpose computers or processors.
  • the code modules may be stored in any type of computer-readable storage medium or other computer storage device. Some or all of the methods may additionally or alternatively be embodied in specialized computer hardware.
  • Conditional language such as, among others, "can,” “could,” or “may,” unless specifically stated otherwise, means that certain examples include, while other examples do not include, certain features, elements and/or steps. Thus, such conditional language does not imply that certain features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether certain features, elements and/or steps are included or are to be performed in any particular example.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Technologies are described herein for learning entity and word embeddings for entity disambiguation. An example method includes pre-processing training data to generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data, defining a probabilistic model for the one or more concurrence graphs, defining an objective function based on the probabilistic model and the one or more concurrence graphs, and training at least one disambiguation model based on feature vectors generated through an optimized version of the objective function.

Description

LEARNING ENTITY AND WORD EMBEDDINGS FOR ENTITY
DISAMBIGUATION
BACKGROUND
[0001] Generally, it is a relatively easy task for a person to recognize a particular named entity that is named in a web article or another document, through identification of context or personal knowledge about the named entity. However, this task may be difficult for a machine to compute without a robust machine learning algorithm. Conventional machine learning algorithms, such as bag-of-words-based learning algorithms, suffer from drawbacks that reduce the accuracy in named entity identification. For example, conventional machine learning algorithms may ignore semantics of words, phrases, and/or names. The ignored semantics are a result of a one-hot approach implemented in most bag-of-words-based learning algorithms, where semantically related words are deemed equidistant to semantically unrelated words in some scenarios.
[0002] Furthermore, conventional machine learning algorithms for entity disambiguation may be computationally expensive, and may be generally difficult to implement in a real-world setting. As an example, in a real-world setting, entity linking for identification of named entities may be of high practical importance. Such identification can benefit human end-user systems in that information about related topics and relevant knowledge from a large base of information is more readily accessible from a user interface. Furthermore, much more enriched information may be automatically identified through the use of a computer system. However, as conventional machine learning algorithms lack the computational efficiency to accurately identify named entities across the large base of information, conventional systems may not adequately present relevant results to users, thereby presenting more generalized results that require extensive review by a user requesting information.
SUMMARY
[0003] The techniques discussed herein facilitate the learning of entity and word embeddings for entity disambiguation. As described herein, various methods and systems of learning entity and word embeddings are provided. As further described herein, various methods of run-time processing using a novel disambiguation model accurately identify named entities across a large base of information. Generally, embeddings include a mapping or mappings of entities and words from training data to vectors of real numbers in a low-dimensional space, relative to a size of the training data (e.g., continuous vector space).
[0004] According to one example, a device for training disambiguation models in continuous vector space comprises a machine learning component deployed thereon and configured to pre-process training data to generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data, define a probabilistic model for the one or more concurrence graphs, define an objective function based on the probabilistic model and the one or more concurrence graphs, and train at least one disambiguation model based on feature vectors generated through an optimized version of the objective function.
[0005] According to another example, a machine learning system, the system comprising training data including free text and a plurality of document anchors, a preprocessing component configured to pre-process at least a portion of the training data to generate one or more concurrence graphs of named entities, words, and document anchors, and a training component configured to generate vector embeddings of entities and words based on the one or more concurrence graphs, wherein the training component is further configured to train at least one disambiguation model based on the vector embeddings.
[0006] According to yet another example, a device for training disambiguation models in continuous vector space, comprising a pre-processing component deployed thereon and configured to prepare training data for machine learning through extraction of a plurality of observations, wherein the training data comprises a corpus of text and a plurality of document anchors, generate a mapping table based on the plurality of observations of the training data, and generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data and based on the mapping table.
[0007] The above-described subject matter may also be implemented in other ways, such as a computer-controlled apparatus, a computer process, a computing system, or as an article of manufacture such as a computer-readable storage medium, for example. Although the technologies presented herein are primarily disclosed in the context of entity disambiguation, the concepts and technologies disclosed herein are also applicable in other forms and implementations. Other variations and implementations may also be applicable. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.
[0008] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended that this Summary be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same reference numbers in different figures indicate similar or identical items.
[0010] FIG. 1 is a diagram showing aspects of an illustrative operating environment and several logical components provided by the technologies described herein;
[0011] FIG. 2 is a flowchart showing aspects of one illustrative routine for pre-processing training data, according to one implementation presented herein;
[0012] FIG. 3 is a flowchart showing aspects of one illustrative routine for training embeddings of entities and words, according to one implementation presented herein;
[0013] FIG. 4 is a flowchart showing aspects of one illustrative routine for generating features in vector space and training a disambiguation model in vector space, according to one implementation presented herein;
[0014] FIG. 5 is a flowchart showing aspects of one illustrative routine for runtime prediction and identification of named entities, according to one implementation presented herein; and
[0015] FIG. 6 is a computer architecture diagram showing an illustrative computer hardware and software architecture.
DETAILED DESCRIPTION
[0016] The following detailed description is directed to technologies for learning entity and word embeddings for entity disambiguation in a machine learning system. The use of the technologies and concepts presented herein enables accurate recognition and identification of named entities in a large amount of data. Furthermore, in some examples, the described technologies may also increase efficiency of runtime identification of named entities. These technologies employ a disambiguation model trained in continuous vector space. Moreover, the use of the technologies and concepts presented herein is less computationally expensive than traditional bag-of-words-based machine learning algorithms, while also being more accurate than traditional models trained with bag-of-words-based machine learning algorithms.
[0017] As an example scenario useful in understanding the technologies described herein, if a user implements or requests a search of a corpus of data for information regarding a particular named entity, it is desirable for the returned results to be related to the requested named entity. The request may identify the named entity explicitly, or through the context of multiple words or a phrase included in the request. For example, if a user requests a search for "Michael Jordan, AAAI Fellow," the phrase "AAAI Fellow" includes context decipherable to determine that the "Michael Jordan" being requested is not a basketball player, but a computer scientist who is also a Fellow of the ASSOCIATION FOR THE ADVANCEMENT OF ARTIFICIAL INTELLIGENCE. Thus, results related to computer science and Michael Jordan are more desirable than results related to basketball and Michael Jordan. This example does not limit the forms of named entities; any named entity is applicable to this disclosure.
[0018] As used herein, the phrases "named entity," "entity," and variants thereof correspond to an entity having a rigid designator (e.g., a "name") that denotes that entity in one or more possible contexts. For example, Mount Everest is a named entity having the rigid designator or name of "Mount Everest" or "Everest." Similarly, Henry Ford is a person having the name "Henry Ford." Other named entities, such as a Ford Model T or the city of Sacramento, also utilize names to refer to particular people, locations, things, and other entities. Still further, particular people, places, or things may be named entities in some contexts, including contexts where a single designator denotes a well-defined set, class, or category of objects rather than a single unique object. However, generic names such as "shopping mall" or "park" may not refer to particular entities, and therefore may not be considered names of named entities.
[0019] While the subject matter described herein is presented in the general context of program modules that execute in conjunction with the execution of an operating system and application programs on a computer system, those skilled in the art will recognize that other implementations may be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, circuits, and other types of software and/or hardware structures that perform particular tasks or implement particular data types. Moreover, those skilled in the art will appreciate that the subject matter described herein may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
[0020] In the following detailed description, references are made to the accompanying drawings that form a part hereof, and which are shown by way of illustration as specific implementations or examples. Referring now to the drawings, aspects of a computing system and methodology for learning entity and word embeddings for entity disambiguation will be described in detail.
[0021] FIG. 1 illustrates an operating environment and several logical components provided by the technologies described herein. In particular, FIG. 1 is a diagram showing aspects of a system 100, for training a disambiguation model 127. As shown in the system 100, a corpus of training data 101 may include a large amount of free text 102 and a plurality of document anchors 103.
[0022] Generally, the large amount of free text 102 may include a number of articles, publications, Internet websites, or other forms of text associated with one or more topics. The one or more topics may include one or more named entities, or may be related to one or more named entities. According to one example, the large amount of free text may include a plurality of web-based articles. According to one example, the large amount of free text may include a plurality of articles from a web-based encyclopedia, such as WIKIPEDIA. Other sources for the free text 102 are also applicable.
[0023] The document anchors 103 may include metadata or information related to a particular location in a document of the free text 102, and a short description of information located near or in the particular location of the document. For example, a document anchor may refer a reader to a particular chapter in an article. Document anchors may also automatically advance a viewing pane in a web browser to a location in a web article. Additionally, document anchors may include "data anchors" if referring to data associated with other types of data, rather than particular documents. Furthermore, document anchors and data anchors may be used interchangeably under some circumstances. Other forms of anchors, including document anchors, data anchors, glossaries, outlines, table of contents, and other suitable anchors, are also applicable to the technologies described herein.
[0024] The training data 101 may be accessed by a machine learning system 120. The machine learning system 120 may include a computer apparatus, computing device, or a system of networked computing devices in some implementations. The machine learning system 120 may include more or fewer components than those particularly illustrated. Additionally, the machine learning system 120 may also be termed a machine learning component, in some implementations.
[0025] A number of pseudo-labeled observations 104 may be taken from the training data 101 by a pre-processing component 121. The pre-processing component 121 may be a component configured to execute in the machine learning system 120. The preprocessing component 121 may also be a component not directly associated with the machine learning system 120 in some implementations.
[0026] Using the pseudo-labeled observations 104, the pre-processing component 121 may generate one or more mapping tables 122, a number of concurrence graphs 123, and a tokenized text sequence 124. The pre-processing operations and generation of the mapping tables 122, concurrence graphs 123, and tokenized text sequence 124 are described more fully below with reference to FIG. 2.
[0027] Upon pre-processing at least a portion of the training data 101 to create the mapping tables 122, concurrence graphs 123, and tokenized text sequence 124, a training component 125 may train embeddings of entities and words for development of training data. The training of embeddings of entities and words is described more fully with reference to FIG. 3.
[0028] The training component 125 may also generate a number of feature vectors 126 in continuous vector space. The feature vectors 126 may be used to train the disambiguation model 127 in vector space, as well. The generation of the feature vectors 126 and training of the disambiguation model 127 are described more fully with reference to FIG. 4.
[0029] Upon training the disambiguation model 127, a run-time prediction component 128 may utilize the disambiguation model 127 to identify named entities in a corpus of data. Run-time prediction and identification of named entities is described more fully with reference to FIG. 5.
[0030] Hereinafter, a more detailed discussion of the operation of the pre-processing component 121 is provided with reference to FIG. 2. FIG. 2 is a flowchart showing aspects of one illustrative method 200 for pre-processing training data, according to one implementation presented herein. The method 200 may begin pre-processing at block 201, and cease pre-processing at block 214. Individual components of the method 200 are described below with reference to the machine learning system 120 shown in FIG. 1.
[0031] As shown in FIG. 2, the pre-processing component 121 may prepare the training data 101 for machine learning at block 202. The training data 101 may include the pseudo-labeled observations 104 retrieved from the free text 102 and the document anchors 103, as described above.
[0032] Preparation of the training data 101 can include an assumption for a vocabulary of words and entities $V = V_{word} \cup V_{entity}$, where $V_{word}$ denotes a set of words and $V_{entity}$ denotes a set of entities. The vocabulary $V$ is derived from the free text 102, $v_1, v_2, \ldots, v_n$, by replacing all document anchors 103 with corresponding entities. The contexts of $v_i \in V$ are the words or entities surrounding it within an $L$-sized window $\{v_{i-L}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+L}\}$. Subsequently, a vocabulary of contexts $U = U_{word} \cup U_{entity}$ can be established. In this manner, the terms in $V$ are the same as those in $U$, because if term $t_i$ is the context of $t_j$, then $t_j$ is also the context of $t_i$. In this particular implementation, each word or entity $v \in V$ and each context $u \in U$ is associated with a vector $\omega_v \in \mathbb{R}^d$ and $\mu_u \in \mathbb{R}^d$, respectively.
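As an illustration (not part of the patent text), the following Python sketch shows one way the corpus-vocabulary and the (term, context) pairs described in paragraph [0032] could be extracted, assuming document anchors have already been replaced by entity identifiers such as "ENTITY/Michael_I._Jordan"; the token format and window size are illustrative assumptions.

```python
def extract_context_pairs(token_sequence, window_size=5):
    """Extract (term, context) pairs within an L-sized window.

    `token_sequence` is a list of normalized tokens in which every
    document anchor has already been replaced by its referent entity,
    so words and entities share one vocabulary V.
    """
    pairs = []
    for i, term in enumerate(token_sequence):
        lo = max(0, i - window_size)
        hi = min(len(token_sequence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((term, token_sequence[j]))
    return pairs

# Example: the sentence from the disclosure with its anchor replaced.
tokens = ["ENTITY/Michael_I._Jordan", "is", "newly", "elected",
          "as", "AAAI", "fellow"]
pairs = extract_context_pairs(tokens, window_size=5)
# The corpus-vocabulary V and the context vocabulary U contain the
# same terms, since contexts are drawn from the same token sequence.
vocabulary = sorted({term for term, _ in pairs})
```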
[0033] Upon preparation of the training data 101 based on the pseudo-labeled observations 104 as described above, the pre-processing component generates the one or more mapping tables 122, at block 204. The mapping table or tables 122 include tables that associate a phrase mentioning an entity with a correct candidate and one or more incorrect candidates. Therefore, the mapping table or tables 122 may be used to train the disambiguation model 127 with both positive and negative examples for any particular phrase mentioning a candidate entity.
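A minimal sketch of what the mention-to-candidates mapping table of paragraph [0033] might look like, assuming a plain dictionary keyed by surface mention; the candidate lists, entity identifiers, and labeling scheme are illustrative assumptions rather than structures prescribed by the patent.

```python
# Hypothetical mapping table: mention surface form -> candidate entities.
# The anchor target observed in the training data supplies the positive
# example; the remaining candidates serve as negative examples.
mapping_table = {
    "Michael Jordan": [
        "Michael_I._Jordan",   # computer scientist
        "Michael_Jordan",      # basketball player
        "Michael_B._Jordan",   # actor
    ],
}

def training_examples(mention, observed_entity, table):
    """Yield (mention, candidate, label) triples for model training."""
    for candidate in table.get(mention, []):
        yield mention, candidate, int(candidate == observed_entity)

examples = list(training_examples("Michael Jordan",
                                  "Michael_I._Jordan",
                                  mapping_table))
# [('Michael Jordan', 'Michael_I._Jordan', 1),
#  ('Michael Jordan', 'Michael_Jordan', 0),
#  ('Michael Jordan', 'Michael_B._Jordan', 0)]
```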
[0034] The pre-processing component 121 also generates an entity-word concurrence graph from the document anchors 103 and text surrounding the document anchors 103, at block 206, an entity-entity concurrence graph from titles of articles as well as the document anchors 103, at block 208, and an entity-word concurrence graph from titles of articles and words contained in the articles, at block 210. For example, a concurrence graph may also be termed a share-topic graph. A concurrence graph may be representative of a co-occurrence relationship between named entities.
[0035] As an example, the pre-processing component may construct a share-topic graph, where $G = (V, E)$ denotes the share-topic graph, node set $V$ contains all entities in the free text 102, with each node representing an entity, and $E$ is a subset of $V \times V$: the set of entity pairs $\{(e_i, e_j)\}$ satisfying the share-topic condition given by the following equation (reproduced only as an image in the source):

[Equation image: membership condition for the edge set $E$]

Additionally, inlinks($e$) denotes the set of entities that link to $e$.
[0036] Other concurrence graphs based on entity-entity concurrence or entity-word concurrence may also be generated as explained above, in some implementations. Upon generating the concurrence graphs, the pre-processing component 121 may generate a tokenized text sequence 124, at block 212. The tokenized text sequence 124 may be a clean sequence that represents text, or portions of text, from the free text 102 as sequences of normalized tokens. Generally, any suitable tokenizer may be implemented to create the sequence 124 without departing from the scope of this disclosure.
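Because the formula defining the edge set is not recoverable from the text, the sketch below assumes one plausible share-topic rule: two entities are connected when at least one entity links to both of them, computed from the inlinks sets mentioned in paragraph [0035]. Both the rule and the toy `inlinks` dictionary are assumptions for illustration only.

```python
from itertools import combinations

def build_share_topic_graph(inlinks):
    """Build an entity-entity concurrence ("share-topic") graph.

    `inlinks[e]` is the set of entities whose articles link to entity e.
    Assumed edge rule: (e_i, e_j) is an edge when the two entities have
    at least one in-linking entity in common.
    """
    nodes = set(inlinks)
    edges = {(e_i, e_j)
             for e_i, e_j in combinations(sorted(nodes), 2)
             if inlinks[e_i] & inlinks[e_j]}
    return nodes, edges

inlinks = {
    "Michael_I._Jordan": {"Machine_learning", "AAAI"},
    "Machine_learning": {"AAAI", "Statistics"},
    "Basketball": {"NBA"},
}
V, E = build_share_topic_graph(inlinks)
# E == {('Machine_learning', 'Michael_I._Jordan')}
```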
[0037] Upon completing any or all of the pre-processing sequences described above with reference to blocks 201-212, the method 200 may cease at block 214. As shown in FIG. 1, the training component 125 may receive the mapping table 122, concurrence graphs 123, and the tokenized text sequence 124 as input. Hereinafter, operation of the training component is described more fully with reference to FIG. 3.
[0038] FIG. 3 is a flowchart showing aspects of one illustrative method 300 for training embeddings of entities and words, according to one implementation presented herein. As shown, the method 300 may begin at block 301. The training component 125 may initially define a probabilistic model for concurrences at block 302.
[0039] The probabilistic model may be based on each concurrence graph 123 based on vector representations of named entities and words, as described in detail above. According to one example, word and entity representations are learned to discriminate the surrounding word (or entity) within a short text sequence. The connections between words and entities are created by replacing all document anchors with their referent entities. For example, a vector $\omega_v$ is trained to perform well at predicting the vector of each surrounding term $\mu$ from a sliding window. As an example, a phrase may include "Michael I. Jordan is newly elected as AAAI fellow." According to this example, the vector of "Michael I. Jordan" in the corpus-vocabulary $V$ is trained to predict the vectors of "is", ..., "AAAI" and "fellow" in the context-vocabulary $U$. Additionally, the collection of word (or entity) and context pairs extracted from the phrases may be denoted as $\mathcal{D}$.
[0040] As an example of a probabilistic model appropriate in this context, a corpus-context pair $(v, \mu) \in \mathcal{D}$, with $v \in V$ and $\mu \in U$, may be considered. The training component may model the conditional probability $p(\mu \mid v)$ using a softmax function defined by Equation 1, below:

$$p(\mu \mid v) = \frac{\exp(\tilde{\mu}^{\top} \omega_v)}{\sum_{\mu' \in U} \exp(\tilde{\mu}'^{\top} \omega_v)} \qquad \text{(Equation 1)}$$
[0041] Upon defining the probabilistic model, the training component 125 may also define an objective function for the concurrences, at block 304. Generally, the objective function may be defined as the likelihood of generating the observed concurrences. For example, the objective function based on Equation 1, above, may be defined as set forth in Equation 2, below:

$$\log \sigma(\tilde{\mu}^{\top} \omega_v) + \sum_{i=1}^{c} \mathbb{E}_{\mu_i \sim P_n(\mu)} \left[ \log \sigma(-\tilde{\mu}_i^{\top} \omega_v) \right] \qquad \text{(Equation 2)}$$

[0042] In Equation 2, $\sigma(x) = 1/(1 + \exp(-x))$ and $c$ is the number of negative examples to be discriminated for each positive example. Given the objective function, the training component 125 may encourage a gap between concurrences that have appeared in the training data and candidate concurrences that have not appeared, at block 306. The training component 125 may further optimize the objective function at block 308, and the method 300 may cease at block 310.
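The following NumPy sketch illustrates a stochastic update consistent with the Equation 2 objective as reconstructed above (a skip-gram-style objective with negative sampling). Vector dimensionality, learning rate, the number of negative samples $c$, and the uniform noise draw standing in for $P_n(\mu)$ are all illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective_step(W, C, v, mu, noise_ids, lr=0.025):
    """One stochastic ascent step on Equation 2 for a pair (v, mu).

    W[v]  : vector omega_v of the center word/entity
    C[mu] : vector mu~ of the observed context term
    noise_ids : c context terms drawn from the noise distribution P_n
    """
    grad_v = np.zeros_like(W[v])
    # Positive term: log sigma(mu~ . omega_v)
    g = 1.0 - sigmoid(C[mu] @ W[v])
    grad_v += g * C[mu]
    C[mu] += lr * g * W[v]
    # Negative terms: log sigma(-mu_i~ . omega_v) for mu_i ~ P_n
    for nid in noise_ids:
        g = -sigmoid(C[nid] @ W[v])
        grad_v += g * C[nid]
        C[nid] += lr * g * W[v]
    W[v] += lr * grad_v

# Toy usage with an illustrative seven-term vocabulary.
vocab = ["ENTITY/Michael_I._Jordan", "is", "newly", "elected",
         "as", "AAAI", "fellow"]
idx = {t: i for i, t in enumerate(vocab)}
d = 16
W = rng.normal(scale=0.1, size=(len(vocab), d))   # omega vectors
C = rng.normal(scale=0.1, size=(len(vocab), d))   # context vectors
negatives = rng.integers(0, len(vocab), size=5)   # c = 5 noise draws
objective_step(W, C, idx["ENTITY/Michael_I._Jordan"], idx["AAAI"], negatives)
```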
[0043] As described above, by training embeddings of entities and words in creation of a probabilistic model and an objective function, features may be generated to train the disambiguation model 127 to better identify named entities. Hereinafter, further operational details of the training component 125 are described with reference to FIG. 4.
[0044] FIG. 4 is a flowchart showing aspects of one illustrative method 400 for generating feature vectors 126 in vector space and training the disambiguation model 127 in vector space, according to one implementation presented herein. The method 400 begins training in vector space at block 401. Generally, the training component 125 defines templates to generate features, at block 402. The templates may be defined as templates for automatically generating features.
[0045] According to one implementation, at least two templates are defined. The first template may be based on a local context score. The local context score template is a template to automatically generate features for neighboring or "neighborhood" words. The second template may be based on a topical coherence score. The topical coherence score template is a template to automatically generate features based on average semantic relatedness, or the assumption that unambiguous named entities may be helpful in identifying mentions of named entities in a more ambiguous context.
[0046] Utilizing the generated templates, the training component 125 computes a score for each template, at block 404. The score computed is based on each underlying assumption for the associated template. For example, the local context template may have a score computed based on local contexts of mentions of a named entity. An example equation to compute the local context score may be implemented as Equation 3, below:

$$cs(m_i, e_i, T) = \ldots \qquad \text{(Equation 3; the right-hand side is reproduced only as an image in the source)}$$

[0047] In Equation 3, $C(m_i)$ denotes the candidate entity set of mention $m_i$. Additionally, multiple local context scores may be computed by changing the context window size $|T|$.
[0048] With regard to a topical coherence template, a document-level disambiguation context $C$ may be computed based on Equation 4, presented below:

$$C = \ldots \qquad \text{(Equation 4; the right-hand side is reproduced only as an image in the source)}$$

[0049] In Equation 4, $d$ is an analyzed document and $\mathcal{D}(d) = \{e_1, e_2, \ldots, e_m\}$ is the set of unambiguous entities identified in document $d$. After computing scores for each template, the training component 125 generates features from the templates, based on the computed scores, at block 406.
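Since the right-hand sides of Equations 3 and 4 are reproduced only as images, the sketch below uses commonly assumed forms: the local context score as the mean cosine similarity between the candidate-entity vector and the vectors of the words in the window $T$, and the topical coherence score as the cosine similarity between the candidate-entity vector and the mean vector of the unambiguous entities $\mathcal{D}(d)$. These exact formulas are assumptions, not the patent's equations.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def local_context_score(entity_vec, context_word_vecs):
    """Assumed cs(m_i, e_i, T): mean similarity to the words in window T."""
    if not context_word_vecs:
        return 0.0
    return float(np.mean([cosine(entity_vec, w) for w in context_word_vecs]))

def topical_coherence_score(entity_vec, unambiguous_entity_vecs):
    """Assumed tc(m_i, e_i): similarity to the document-level context C,
    taken here as the mean vector over the unambiguous entities D(d)."""
    if not unambiguous_entity_vecs:
        return 0.0
    doc_context = np.mean(unambiguous_entity_vecs, axis=0)
    return cosine(entity_vec, doc_context)
```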
[0050] Generating the features may include, for example, generating individual features for constructing one or more feature vectors based on a number of disambiguation decisions. A function for the disambiguation decisions is defined by Equation 5, presented below:

$$\arg\max_{e_i \in C(m_i)} \frac{1}{1 + \exp(-\beta^{\top} F_i)}, \quad \forall m_i \in M \qquad \text{(Equation 5)}$$

[0051] In Equation 5, $F_i$ denotes the feature vector (its full composition is reproduced only as an image in the source), while the basic features are local context scores $cs(m_i, e_i)$ and topical coherence scores $tc(m_i, e_i)$. Furthermore, additional features can also be combined utilizing Equation 5. Generally, the training component is configured to optimize the parameters $\beta$ such that the correct entity has a higher score than irrelevant entities. During optimization of the parameters $\beta$, the training component 125 defines the disambiguation model 127 and trains the disambiguation model 127 based on the feature vectors 126, at block 408. The method 400 ceases at block 410.
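A short sketch of the Equation 5 decision rule as reconstructed above: each candidate in $C(m_i)$ is scored with a logistic function of its feature vector, and the highest-scoring candidate is selected. The two-feature layout (local context score, topical coherence score, plus a bias) and the parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disambiguate(mention, candidates, feature_fn, beta):
    """Rank candidates by 1 / (1 + exp(-beta . F_i)); first item is argmax."""
    scored = [(float(sigmoid(beta @ feature_fn(mention, e))), e)
              for e in candidates]
    scored.sort(reverse=True)
    return scored

# Illustrative features: [local context score, topical coherence, bias].
def feature_fn(mention, entity):
    cs = 0.8 if entity == "Michael_I._Jordan" else 0.2   # stand-in values
    tc = 0.7 if entity == "Michael_I._Jordan" else 0.3
    return np.array([cs, tc, 1.0])

beta = np.array([2.0, 1.5, -1.0])   # hypothetical learned parameters
ranking = disambiguate("Michael Jordan",
                       ["Michael_I._Jordan", "Michael_Jordan"],
                       feature_fn, beta)
# ranking[0][1] == 'Michael_I._Jordan'
```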
[0052] As described above, the disambiguation model 127 may be used to more accurately predict the occurrence of a particular named entity. Hereinafter, runtime prediction of named entities is described more fully with reference to FIG. 5.
[0053] FIG. 5 is a flowchart showing aspects of one illustrative method 500 for runtime prediction and identification of named entities, according to one implementation presented herein. Run-time prediction begins at block 501, and may be performed by the run-time prediction component 128, or may be performed by another portion of the system 100.
[0054] Initially, run-time prediction component 128 receives a search request identifying one or more named entities, at block 502. The search request may originate at a client computing device, such as through a Web browser on a computer, or from any other suitable device. Example computing devices are described in detail with reference to FIG. 6.
[0055] Upon receipt of the search request, the run-time prediction component 128 may identify candidate entries of web articles or other sources of information, at block 504. According to one implementation, the candidate entries are identified from a database or a server. According to another implementation, the candidate entries are identified from the Internet.
[0056] Thereafter, the run-time prediction component 128 may retrieve feature vectors 126 of words and/or named entities, at block 506. For example, the feature vectors 126 may be stored in memory, in a computer-readable storage medium, or in any other suitable manner. The feature vectors 126 may be accessible by the run-time prediction component 128 for run-time prediction and other operations.
[0057] Upon retrieval, the run-time prediction component 128 may compute features based on the retrieved vectors of words and named entities contained in the request, at block 508. Feature computation may be similar to the computations described above with reference to the disambiguation model 127 and Equation 5. The words and named entities may be extracted from the request.
[0058] Thereafter, the run-time prediction component 128 applies the disambiguation model to the computed features, at block 510. Upon application of the disambiguation model, the run-time prediction component 128 may rank the candidate entries based on the output of the disambiguation model, at block 512. The ranking may include ranking the candidate entries based on a set of probabilities that any one candidate entry is more likely to reference the named entity than other candidate entries. Other forms of ranking may also be applicable. Upon ranking, the run-time prediction component 128 may output the ranked entries at block 514. The method 500 may continually iterate as new requests are received, or alternatively, may cease after outputting the ranked entries.
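The run-time flow of blocks 502-514 might be wired together as in the sketch below; find_candidates, load_feature_vectors, compute_features, and model_score are hypothetical stand-ins for the retrieval, storage, and model layers described above:

```python
def rank_candidates(request_text, find_candidates, load_feature_vectors,
                    compute_features, model_score):
    """Illustrative run-time loop: identify candidates, build features from
    stored embeddings, score them with the disambiguation model, and rank."""
    candidates = find_candidates(request_text)                      # block 504
    vectors = load_feature_vectors(request_text)                    # block 506
    scored = []
    for entry in candidates:
        features = compute_features(request_text, entry, vectors)   # block 508
        scored.append((entry, model_score(features)))               # block 510
    scored.sort(key=lambda pair: pair[1], reverse=True)             # block 512
    return scored                                                   # block 514
```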
[0059] It should be appreciated that the logical operations described above with reference to FIGS. 2-5 may be implemented (1) as a sequence of computer-implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as states, operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules may be implemented in software, in firmware, in special-purpose digital logic, or any combination thereof. It should also be appreciated that more or fewer operations may be performed than shown in the figures and described herein. These operations may also be performed in a different order than described herein.
[0060] FIG. 6 shows an illustrative computer architecture for a computer 600 capable of executing the software components and methods described herein for pre-processing, training, and runtime prediction in the manner presented above. The computer architecture shown in FIG. 6 illustrates a conventional desktop, laptop, or server computer and may be utilized to execute any aspects of the software components presented herein described as executing in the system 100 or any components in communication therewith.
[0061] The computer architecture shown in FIG. 6 includes one or more processors 602, a system memory 608 including a random access memory (RAM) 614 and a read-only memory (ROM) 616, and a system bus 604 that couples the memory to the processor(s) 602. The processor(s) 602 can include a central processing unit (CPU) or other suitable computer processors. A basic input/output system containing the basic routines that help to transfer information between elements within the computer 600, such as during startup, is stored in the ROM 616. The computer 600 further includes a mass storage device 610 for storing an operating system 618, application programs, and other program modules, which are described in greater detail herein.
[0062] The mass storage device 610 is connected to the processor(s) 602 through a mass storage controller (not shown) connected to the bus 604. The mass storage device 610 is an example of computer-readable media for the computer 600. Although the description of computer-readable media contained herein refers to a mass storage device 610, such as a hard disk, a compact disk read-only memory (CD-ROM) drive, or solid state memory (e.g., a flash drive), it should be appreciated by those skilled in the art that computer-readable media can be any available computer storage media or communication media that can be accessed by the computer 600.
[0063] Communication media includes computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics changed or set in a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of communication media.
[0064] By way of example, and not limitation, computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer storage media includes, but is not limited to, RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (DVD), High Definition DVD (HD-DVD), BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and which can be accessed by the computer 600. As used herein, the phrase "computer storage media," and variations thereof, does not include waves or signals per se and/or communication media.
[0065] According to various implementations, the computer 600 may operate in a networked environment using logical connections to remote computers through a network such as the network 620. The computer 600 may connect to the network 620 through a network interface unit 606 connected to the bus 604. The network interface unit 606 may also be utilized to connect to other types of networks and remote computer systems. The computer 600 may also include an input/output controller 612 for receiving and processing input from a number of other devices, including a keyboard, mouse, or electronic stylus (not shown in FIG. 6). Similarly, an input/output controller may provide output to a display screen, a printer, or other type of output device (also not shown in FIG. 6).
[0066] As mentioned briefly above, a number of program modules and data files may be stored in the mass storage device 610 and RAM 614 of the computer 600, including an operating system 618 suitable for controlling the operation of a networked desktop, laptop, or server computer. The mass storage device 610 and RAM 614 may also store one or more program modules or other data, such as the disambiguation model 127, the feature vectors 126, or any other data described above. The mass storage device 610 and the RAM 614 may also store other types of program modules, services, and data.

EXAMPLE CLAUSES
A. A device for training disambiguation models in continuous vector space, comprising a machine learning component deployed thereon and configured to:
pre-process training data to generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data;
define a probabilistic model for the one or more concurrence graphs;
define an objective function based on the probabilistic model and the one or more concurrence graphs; and
train at least one disambiguation model based on feature vectors generated through an optimized version of the objective function.
B. A device as recited in clause A, wherein the probabilistic model is based on a softmax function or normalized exponential function.
C. A device as recited in either of clauses A and B, wherein the softmax function includes a conditional probability of a vector of named entities concurring with a vector of words.
D. A device as recited in any of clauses A-C, wherein the objective function is a function of a number of negative examples included in the pre-processed training data.
E. A device as recited in any of clauses A-D, wherein the optimized version of the objective function is optimized to encourage a gap between concurrences defined in the concurrence graphs.
F. A machine learning system, the system comprising:
training data including free text and a plurality of document anchors;
a pre-processing component configured to pre-process at least a portion of the training data to generate one or more concurrence graphs of named entities, associated data, and data anchors; and
a training component configured to generate vector embeddings of entities and words based on the one or more concurrence graphs, wherein the training component is further configured to train at least one disambiguation model based on the vector embeddings.
G. A system as recited in clause F, further comprising a run-time prediction component configured to identify candidate entries using the at least one disambiguation model.
H. A system as recited in either of clauses F and G, further comprising:
a database or server storing a plurality of entries; and
a run-time prediction component configured to identify candidate entries from the plurality of entries using the at least one disambiguation model, and to rank the identified candidate entries using the at least one disambiguation model.
I. A system as recited in any of clauses F-H, wherein the training component is further configured to:
define a probabilistic model for the one or more concurrence graphs; and
define an objective function based on the probabilistic model and the one or more concurrence graphs, wherein the vector embeddings are created based on the probabilistic model and an optimized version of the objective function.
J. A system as recited in any of clauses F-I, wherein:
the probabilistic model is based on a softmax function or normalized exponential function; and
the objective function is a function of a number of negative examples included in the training data.
K. A device for training disambiguation models in continuous vector space, comprising a pre-processing component deployed thereon and configured to:
prepare training data for machine learning through extraction of a plurality of observations, wherein the training data comprises a corpus of text and a plurality of document anchors;
generate a mapping table based on the plurality of observations of the training data; and
generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data and based on the mapping table.
L. A device as recited in clause K, further comprising a machine learning component deployed thereon and configured to:
define a probabilistic model for the one or more concurrence graphs;
define an objective function based on the probabilistic model and the one or more concurrence graphs; and
train at least one disambiguation model based on feature vectors generated through an optimized version of the objective function.
M. A device as recited in either of clauses K and L, wherein the probabilistic model is based on a softmax function or normalized exponential function.
N. A device as recited in any of clauses K-M, wherein the softmax function includes a conditional probability of a vector of named entities concurring with a vector of words.
O. A device as recited in any of clauses K-N, wherein the objective function is a function of a number of negative examples included in the pre-processed training data.
P. A device as recited in any of clauses K-O, wherein the optimized version of the objective function is optimized to encourage a gap between concurrences defined in the concurrence graphs.
Q. A device as recited in any of clauses K-P, wherein the pre-processing component is further configured to generate a clean tokenized text sequence from the plurality of observations.
R. A device as recited in any of clauses K-Q, further comprising a run-time prediction component configured to identify candidate entries using the at least one disambiguation model.
S. A device as recited in any of clauses K-R, wherein the device is in operative communication with a database or server storing a plurality of entries, the device further comprising:
a run-time prediction component configured to identify candidate entries from the plurality of entries using the at least one disambiguation model, and to rank the identified candidate entries using the at least one disambiguation model.
T. A device as recited in any of clauses K-S, wherein the run-time prediction component is further configured to:
receive a search request identifying a desired named entity;
identify the candidate entries based on the search request;
retrieve vectors of words and named entities related to the search request;
compute features based on the vectors of words and named entities;
apply the at least one disambiguation model to the computed features; and
rank the candidate entries based on the application of the at least one disambiguation model.
CONCLUSION
[0067] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and steps are disclosed as example forms of implementing the claims.
[0068] All of the methods and processes described above may be embodied in, and fully or partially automated via, software code modules executed by one or more general purpose computers or processors. The code modules may be stored in any type of computer-readable storage medium or other computer storage device. Some or all of the methods may additionally or alternatively be embodied in specialized computer hardware.
[0069] Conditional language such as, among others, "can," "could," or "may," unless specifically stated otherwise, means that certain examples include, while other examples do not include, certain features, elements and/or steps. Thus, such conditional language does not imply that certain features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether certain features, elements and/or steps are included or are to be performed in any particular example.
[0070] Conjunctive language such as the phrases "and/or" and "at least one of X, Y or Z," unless specifically stated otherwise, means that an item, term, etc. may be either X, Y, or Z, or a combination thereof.
[0071] Any routine descriptions, elements or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or elements in the routine. Alternate implementations are included within the scope of the examples described herein in which elements or functions may be deleted, or executed out of order from that shown or discussed, including substantially synchronously or in reverse order, depending on the functionality involved as would be understood by those skilled in the art.
[0072] It should be emphasized that many variations and modifications may be made to the above-described examples, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

Claims

1. A device for training disambiguation models in continuous vector space, comprising a machine learning component deployed thereon and configured to:
pre-process training data to generate one or more concurrence graphs of named entities, words, and document anchors extracted from the training data;
define a probabilistic model for the one or more concurrence graphs;
define an objective function based on the probabilistic model and the one or more concurrence graphs; and
train at least one disambiguation model based on feature vectors generated through an optimized version of the objective function.
2. The device of claim 1, wherein the probabilistic model is based on a softmax function or normalized exponential function.
3. The device of claim 2, wherein the softmax function includes a conditional probability of a vector of named entities concurring with a vector of words.
4. The device of claim 1, wherein the objective function is a function of a number of negative examples included in the pre-processed training data.
5. The device of claim 1, wherein the optimized version of the objective function is optimized to encourage a gap between concurrences defined in the concurrence graphs.
6. A machine learning system, the system comprising:
training data including free text and a plurality of document anchors;
a pre-processing component configured to pre-process at least a portion of the training data to generate one or more concurrence graphs of named entities, associated data, and data anchors; and
a training component configured to generate vector embeddings of entities and words based on the one or more concurrence graphs, wherein the training component is further configured to train at least one disambiguation model based on the vector embeddings.
7. The machine learning system of claim 6, further comprising a run-time prediction component configured to identify candidate entries using the at least one disambiguation model.
8. The machine learning system of claim 6, further comprising:
a database or server storing a plurality of entries; and
a run-time prediction component configured to identify candidate entries from the plurality of entries using the at least one disambiguation model, and to rank the identified candidate entries using the at least one disambiguation model.
9. The machine learning system of claim 6, wherein the training component is further configured to:
define a probabilistic model for the one or more concurrence graphs; and
define an objective function based on the probabilistic model and the one or more concurrence graphs, wherein the vector embeddings are created based on the probabilistic model and an optimized version of the objective function.
10. The machine learning system of claim 9, wherein:
the probabilistic model is based on a softmax function or normalized exponential function; and
the objective function is a function of a number of negative examples included in the training data.



Country of ref document: EP