US20160110446A1 - Method for disambiguated features in unstructured text - Google Patents

Method for disambiguated features in unstructured text

Info

Publication number
US20160110446A1
Authority
US
United States
Prior art keywords
feature
server
features
cluster
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/979,703
Inventor
Scott Lightner
Franz Weckesser
Sanjay BODDHU
Rakesh DAVE
Robert FLAGG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qbase LLC
Original Assignee
Qbase LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qbase LLC filed Critical Qbase LLC
Priority to US14/979,703 priority Critical patent/US20160110446A1/en
Assigned to Qbase, LLC reassignment Qbase, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BODDHU, SANJAY, FLAGG, ROBERT, DAVE, RAKESH, LIGHTNER, SCOTT, WECKESSER, FRANZ
Publication of US20160110446A1 publication Critical patent/US20160110446A1/en
Abandoned legal-status Critical Current


Classifications

    • G06F17/30684
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • G06F17/30705
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Definitions

  • the present disclosure generally relates to data management; and, more specifically, to data management systems and methods that extract and store material from source items received over a network.
  • Searching for information about entities (i.e., people, locations, or organizations) may be performed over sources such as a network.
  • the method may include multiple modules, such as one or more feature extraction modules, one or more disambiguation modules, one or more scoring modules, and one or more linking modules.
  • Disambiguating features will be supported in part by extracting topics from the ambient document of the feature, employing a multi-component extension of Latent Dirichlet Allocation (MC-LDA) topic models.
  • each component is modeled around each secondary feature stored in the existing knowledge base or extracted on the incoming document.
  • the linking or disambiguation process is modeled as topic inference from the MC-LDA, which provides automated weight estimation during MC-LDA training and readily applies those weights during inference.
  • the exemplary method may improve the accuracy of entity disambiguation beyond what may be achieved when document linkage is ignored. Taking account of document linkage may allow better disambiguation by considering document and entity relationships implied by links.
  • a method comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the disambiguation module of the in-
  • a non-transitory computer readable medium having stored thereon computer executable instructions comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is
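The claimed sequence (candidate search, topic-ID association, disambiguation, secondary-feature clustering, and knowledge-base matching) can be sketched in a few lines. Everything below is a hypothetical illustration: the function name `disambiguate`, the `topic_ids_for` map, the `related` callback, and the knowledge-base layout are assumptions, not the patent's implementation.

```python
# Minimal sketch of the claimed disambiguation pipeline.
# All names and data structures are hypothetical illustrations.

def disambiguate(document, knowledge_base, topic_ids_for, related):
    """Link extracted features to knowledge-base clusters via topic IDs."""
    extracted = document["features"]                     # step: feature extraction
    # Primary features: extracted features that match a candidate record.
    primaries = [f for f in extracted if f in knowledge_base["candidates"]]
    clusters = []
    for p in primaries:
        p_topics = topic_ids_for[p]                      # machine-generated topic IDs
        # Secondary features share related topic IDs with the primary feature.
        secondaries = {f for f in extracted
                       if f != p and topic_ids_for[f] & related(p_topics)}
        clusters.append({"primary": p, "secondary": secondaries})
    results = []
    for c in clusters:                                   # match against stored clusters
        match = knowledge_base["clusters"].get(c["primary"])
        if match and match["secondary"] & c["secondary"]:
            results.append(match["feature_id"])          # return existing feature ID
        else:                                            # no match: new unique ID
            new_id = max(knowledge_base["ids"], default=0) + 1
            knowledge_base["clusters"][c["primary"]] = {
                "feature_id": new_id, "secondary": c["secondary"]}
            knowledge_base["ids"].append(new_id)
            results.append(new_id)
    return results
```

A second pass over the same document then resolves to the stored cluster instead of creating a new one, mirroring the update-or-assign behavior of the claims.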
  • FIG. 1 is a flowchart of a method for disambiguating features in unstructured text, according to an exemplary embodiment.
  • FIG. 2 is a flowchart of the steps performed by a disambiguation module employed in the method for disambiguating features, according to an exemplary embodiment.
  • FIG. 3 is a flowchart of the steps performed by a link on-the-fly module employed in the method for disambiguating features, according to an exemplary embodiment.
  • FIG. 4 is an illustrative diagram of a system employed for implementing the method for disambiguating features, according to an exemplary embodiment.
  • FIG. 5 shows a graphical representation of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic model, according to an exemplary embodiment.
  • FIG. 6 illustrates an embodiment of the Gibbs sampling equations for multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
  • FIG. 7 illustrates an embodiment of the implementation of a stochastic variational inference algorithm for training and inference in multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
  • FIG. 8 is a table illustrating a sample topic for a multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
  • Document refers to a discrete electronic representation of information having a start and end.
  • Multi-Document refers to a document with its tokens, different types of named entities, and key phrases organized into separate “bag-of-surface-forms” components.
  • Database refers to any system including any combination of clusters and modules suitable for storing one or more collections and suitable to process one or more queries.
  • Corpus refers to a collection of one or more documents.
  • “Live corpus” or “document stream” refers to a corpus that is constantly fed as new documents are uploaded into a network.
  • Feature refers to any information which is at least partially derived from a document.
  • Feature attribute refers to metadata associated with a feature; for example, location of a feature in a document, confidence score, among others.
  • Cluster refers to a collection of features.
  • Entity knowledge base refers to a knowledge base containing features/entities.
  • Link on-the-fly module refers to any linking module that performs data linkage as data is requested from the system rather than as data is added to the system.
  • Memory refers to any hardware component suitable for storing information and retrieving said information at a sufficiently high speed.
  • Module refers to a computer software component suitable for carrying out one or more defined tasks.
  • “Sentiment” refers to subjective assessments associated with a document, part of a document, or feature.
  • Topic refers to a set of thematic information which is at least partially derived from a corpus.
  • Topic ID refers to an identifier that refers to a specific instance of a topic.
  • Topic Collection refers to a specific set of topics derived from the corpus, with each topic having a unique identifier (“unique ID”).
  • Topic Classification refers to the assignment of specific topic identifiers as features of a document.
  • Query refers to a request to retrieve information from one or more suitable databases.
  • the present disclosure describes a method for disambiguating features in an unstructured text.
  • While the exemplary embodiments discuss practices for disambiguating features according to this disclosure, the systems and methods described herein can be configured for any suitable use within the scope of this disclosure.
  • An aspect of the present disclosure includes a method that may allow increased accuracy in feature and entity disambiguation and, therefore, increased accuracy in text analytics.
  • the disclosed method for disambiguating features may be applied to an initial corpus of data to perform document ingestion and feature extraction, which may allow topic classification and other text analytics on each document included in the initial corpus.
  • Each feature may be identified and recorded with attributes such as name, type, positional information in the document, and confidence score.
  • FIG. 1 is a flowchart of a method 100 illustrating a plurality of steps for disambiguating features in unstructured text.
  • method 100 for disambiguating features may initiate when a new document input, step 102, is made to an existing knowledge base. Subsequently, a feature extraction step 104 may be performed on the document.
  • a feature may be related to different feature attributes, such as a topic identifier (“topic ID”), a document identifier (“document ID”), feature type, feature name, confidence score, and feature position, among others.
  • the document input in step 102 may be fed from a massive or live corpus (such as the internet or a network-connected corpus) that may, in turn, receive new documents every second.
  • one or more feature recognition and extraction algorithms may be employed during feature extraction step 104 to analyze the unstructured text of document input step 102 .
  • a score may be assigned to each extracted feature. The score may indicate the level of certainty of the feature being correctly extracted with the correct attributes.
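As a toy illustration of extraction with per-feature attributes and confidence scores, the sketch below finds capitalized spans and attaches name, type, position, and a certainty score. The regex and the length-based scoring heuristic are invented for illustration, not taken from the disclosure.

```python
import re

def extract_features(text):
    """Toy feature extractor: find capitalized multi-word spans and score them.
    The scoring heuristic (longer span -> higher confidence) is illustrative only."""
    pattern = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)*\b")
    features = []
    for m in pattern.finditer(text):
        words = m.group().split()
        features.append({
            "name": m.group(),
            "type": "entity",                # a real extractor would classify types
            "position": m.start(),           # positional information in the document
            "confidence": min(1.0, 0.5 + 0.2 * len(words)),  # toy certainty score
        })
    return features
```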
  • one or more primary features may be identified from a document input in step 102 .
  • Each primary feature may have been associated with a set of feature attributes and one or more secondary features.
  • Each secondary feature may be associated with a set of feature attributes.
  • one or more secondary features may have one or more tertiary features, each one having its own set of feature attributes.
  • the relative weight or relevance of each of the features within document input at step 102 may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.
  • the features extracted from the document input at step 102 and all their related information may be loaded into an in-memory database (MemDB), during inclusion of features in MemDB, step 106 , as part of a feature disambiguation request step 108 .
  • the MemDB forms part of a disambiguation computer server environment having one or more processors executing the steps discussed in connection with FIGS. 1-8 .
  • the MemDB is a computer module that may include one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation sub module.
  • One search controller may be selectively associated with one or more search nodes.
  • Each search node may be capable of independently performing a fuzzy key search through a collection of compressed data and returning a set of scored results to its associated search controller.
  • Feature disambiguation step 108 may be performed by a disambiguation sub module within the MemDB.
  • The feature disambiguation process 108 may include machine-generated topic IDs, which may be employed to classify features, documents, or corpora. The relatedness of individual features and specific topic IDs may be determined using disambiguating algorithms. In some documents, the same feature may be related to one or more topic IDs, depending on the context of the different occurrences of the feature within the document.
  • the set of features (like topics, proximity terms and entities, key phrases, events and facts) extracted from one document may be compared with sets of features from other documents, using disambiguating algorithms to define with a certain level of accuracy if two or more features across different documents are a single feature or if they are distinct features.
  • co-occurrence of two or more features across the collection of documents in the database may be analyzed to improve the accuracy of feature disambiguation process 108 .
  • global scoring algorithms may be used to determine the probability of features being the same.
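A minimal sketch of such a global score, assuming a simple two-part model (surface-form similarity plus co-occurring-feature overlap) with invented weights:

```python
from difflib import SequenceMatcher

def same_feature_score(a, b, w_name=0.5, w_context=0.5):
    """Toy global score that two feature mentions refer to the same thing:
    a weighted mix of surface-form similarity and co-occurring-feature overlap.
    The weights and the 0..1 scale are illustrative assumptions."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    ctx_a, ctx_b = set(a["cooccurring"]), set(b["cooccurring"])
    # Jaccard overlap of the features co-occurring with each mention.
    context_sim = len(ctx_a & ctx_b) / len(ctx_a | ctx_b) if ctx_a | ctx_b else 0.0
    return w_name * name_sim + w_context * context_sim
```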
  • a knowledge base may be generated within the MemDB. This knowledge base may be used to temporarily store clusters of relevant disambiguated primary features and their related secondary features.
  • the new disambiguated set of features may be compared with the existing knowledge base in order to determine the relationship between features and determine if there is a match between the new features and already extracted features.
  • the knowledge base may be updated, and a feature ID of the matching features may be returned to the user and/or requesting application or process. Further, based on the frequency of matches, a prominence measure may be attached to the feature ID, capturing its popularity index in the given corpus.
  • a unique feature ID is assigned to the disambiguated entity or feature, and the unique feature ID is associated with the cluster of defining features and stored within the knowledge base of the MemDB.
  • the feature ID of the disambiguated feature may be returned to the source through the system interface.
  • the feature ID of the disambiguated feature may include secondary features, clusters of features, relevant feature attributes, or other requested data. The disambiguation sub-module employed for feature disambiguation step 108 is described in more detail in FIG. 2 below.
  • FIG. 2 is a flowchart of a process 200 , performed by a disambiguation sub-module on unstructured text, for feature disambiguation step 108 of method 100 ( FIG. 1 ), according to an embodiment.
  • Disambiguation process 200 may begin after inclusion of features in MemDB in step 106 of FIG. 1 .
  • the extracted features provided in step 202 may be used to perform a candidate search in step 204 , in which a search for the extracted features may be performed through all candidate records, including co-occurring features.
  • candidates may be primary features with a set of associated secondary features that may be used in feature disambiguation process 108 .
  • the disambiguation results may be improved by the co-occurrence of topic IDs and relatedness among topic IDs.
  • the relatedness of topic IDs, even across different topic models, can be discovered from a large corpus where the topic IDs have been assigned.
  • Related topic IDs can be used during records linkage step 206 to provide linkage to documents that may not contain the exact topic ID but do contain one or more related topic IDs. This approach may improve the recall of relevant features to be included in the records linkage step 206 and improve disambiguation results in certain cases.
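The recall-boosting expansion described above can be sketched as follows; the relatedness table and the 0.7 threshold are illustrative assumptions:

```python
def expand_topic_ids(query_topics, relatedness, threshold=0.7):
    """Toy recall-boosting expansion: add topic IDs whose relatedness to a
    queried topic ID meets a threshold, so records linkage also reaches
    documents carrying only related (not exact) topic IDs.
    The relatedness table and threshold are illustrative assumptions."""
    expanded = set(query_topics)
    for t in query_topics:
        for other, score in relatedness.get(t, {}).items():
            if score >= threshold:
                expanded.add(other)
    return expanded
```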
  • Cluster comparison step 208 may include the assignment of relative matching scores to clusters of disambiguated features; different thresholds of acceptance may be defined for different applications. The defined levels of accuracy may determine which scores may be considered a positive match and which scores may be considered a negative match, step 210 .
  • Each new cluster may be given a unique ID and may be temporarily stored in a knowledge base. Each new cluster may include a new disambiguated primary feature and its set of secondary features. If a new cluster matches a cluster that is already stored in the knowledge base, the system updates the knowledge base in step 212 and returns a matched feature ID to the user and/or requesting application or process in step 214 .
  • Update of knowledge base 212 may involve associating additional secondary features with one primary feature, or adding feature attributes that were not previously associated with primary or secondary features.
  • Otherwise, in step 216 , the system assigns a unique ID to the primary feature of the cluster and updates the knowledge base in step 212 . Afterwards, the system performs the return of matched ID process 214 . Records linkage step 206 is explained in further detail in FIG. 3 .
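Steps 208 through 216 can be sketched as a single compare-and-resolve routine; the overlap-based matching score and the 0.6 acceptance threshold are illustrative assumptions, not values from the disclosure:

```python
def resolve_cluster(new_cluster, knowledge_base, next_id, accept=0.6):
    """Toy sketch of steps 208-216: score the new cluster against stored
    clusters; above the acceptance threshold, update the match and return its
    ID, otherwise assign a fresh unique ID. Threshold value is illustrative."""
    def overlap(a, b):                       # relative matching score (Jaccard)
        return len(a & b) / max(len(a | b), 1)
    best_id, best_score = None, 0.0
    for cid, stored in knowledge_base.items():
        score = overlap(new_cluster, stored)
        if score > best_score:
            best_id, best_score = cid, score
    if best_score >= accept:                 # positive match: update and return
        knowledge_base[best_id] |= new_cluster
        return best_id
    knowledge_base[next_id] = set(new_cluster)   # negative match: new unique ID
    return next_id
```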
  • FIG. 3 is a flowchart of a process 300 performed by a link on-the-fly (“link OTF”) sub module employed in method 100 for disambiguating features, according to an embodiment.
  • Link OTF process 300 may be capable of constantly evaluating, scoring, linking, and clustering a feed of information.
  • Link OTF sub module may perform records linkage 206 using multiple algorithms.
  • Candidate search results of step 204 may be constantly fed into link OTF module 300 .
  • the input of data may be followed by a match scoring algorithm application, step 302 , where one or more match scoring algorithms may be applied simultaneously in multiple search nodes of the MemDB while performing fuzzy key searches for evaluating and scoring the relevant results, taking into account multiple feature attributes, such as string edit distances, phonetics, and sentiments, among others.
  • Linking algorithm application 304 may be employed to compare all candidate records, identified during match scoring algorithm application step 302 , to each other.
  • Linking algorithm application 304 may include the use of one or more analytical linking algorithms capable of filtering and evaluating the scored results of the fuzzy key searches performed inside the multiple search nodes of the MemDB.
  • co-occurrence of two or more features across the collection of identified candidate records in the MemDB may be analyzed to improve the accuracy of the process.
  • Different weighted models and confidence scores associated with different feature attributes may be taken into account for linking algorithm application 304 .
  • the linked results may be arranged in clusters of related features and returned, as part of return of linked records clusters in step 306 .
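The pairwise comparison and clustering of linked results (steps 304 and 306) can be sketched with a union-find grouping; the score callback and the 0.75 threshold are illustrative assumptions:

```python
def link_records(candidates, score, threshold=0.75):
    """Toy sketch of linking algorithm application 304 and cluster return 306:
    compare all candidate records pairwise and group those whose match score
    clears a threshold into clusters (connected components via union-find).
    The threshold and the score callback are illustrative assumptions."""
    parent = {c: c for c in candidates}

    def find(x):                       # root of x's cluster, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if score(a, b) >= threshold:
                parent[find(a)] = find(b)   # merge the two records' clusters
    groups = {}
    for c in candidates:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())
```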
  • FIG. 4 is an illustrative diagram of an embodiment of a system 400 for disambiguating features in unstructured text, as discussed above in connection with FIG. 1 .
  • the system 400 hosts an in-memory database and comprises one or more nodes.
  • the system 400 includes one or more processors executing computer instructions for a plurality of special-purpose computer modules 401 , 402 , 411 , 412 , and 414 (discussed below) to disambiguate features within one or more documents.
  • the document input modules 401 , 402 receive documents from internet-based sources and/or a live corpus of documents. A large number of new documents may be uploaded by the second into the document input module 402 through a network connection 404 . Therefore, the source may constantly gain new knowledge, updated from user workstations 406 , where such new knowledge is not pre-linked in a static way. Thus, the number of documents to be evaluated may increase without bound.
  • MemDB 408 may facilitate a faster disambiguation process performed on-the-fly, which may in turn facilitate the reception of the latest information contributing to MemDB 408 .
  • Various methods for linking the features may be employed. These methods may essentially use a weighted model to determine which entity types are most important and carry more weight and, based on confidence scores, how confidently the extraction and disambiguation of the correct features has been performed, so that the correct feature goes into the resulting cluster of features. As shown in FIG. 4 , as more system nodes work in parallel, the process may become more efficient.
  • a new document arrives into the system 400 via the document input module 401 , 402 through a network connection 404 .
  • feature extraction is performed via the extraction module 411 and, then, feature disambiguation may be performed on the new document via the feature disambiguation sub-module 414 of the MemDB 408 .
  • the extracted new features 410 may be included in the MemDB to pass through link OTF sub-module 412 ; where the features may be compared and linked, and a feature ID of disambiguated feature 110 may be returned to the user as a result from a query.
  • the resulting feature cluster defining the disambiguated feature may optionally be returned.
  • MemDB computer 408 can be a database storing data in records controlled by a database management system (DBMS) (not shown) configured to store data records in a device's main memory, as opposed to conventional databases and DBMS modules that store data in “disk” memory.
  • Conventional disk storage requires processors (CPUs) to execute read and write commands to a device's hard disk, thus requiring CPUs to execute instructions to locate (i.e., seek) and retrieve the memory location for the data, before performing some type of operation with the data at that memory location.
  • In-memory database systems access data that is placed into main memory, and then addressed accordingly, thereby mitigating the number of instructions performed by the CPUs and eliminating the seek time associated with CPUs seeking data on hard disk.
  • In-memory databases may be implemented in a distributed computing architecture, which may be a computing system comprising one or more nodes configured to aggregate the nodes' respective resources (e.g., memory, disks, processors).
  • a computing system hosting an in-memory database may distribute and store data records of the database among one or more nodes.
  • these nodes are formed into “clusters” of nodes.
  • these clusters of nodes store portions, or “collections,” of database information.
  • Various embodiments provide a computer executed feature disambiguation technique that employs an evolving and efficiently linkable feature knowledge base that is configured to store secondary features, such as co-occurring topics, key phrases, proximity terms, events, facts and trending popularity index.
  • the disclosed embodiments may be performed via a wide variety of linking algorithms that can vary from simple conceptual distance measure to sophisticated graph clustering approaches based on the dimensions of the involved secondary features that aid in resolving a given extracted feature to a stored feature in the knowledge base.
  • embodiments can introduce an approach that evolves the existing feature knowledge base with a capability that not only updates the secondary features of an existing feature entry, but also expands it by discovering new features that can be appended to the knowledge base.
  • Embodiments of the disambiguation approach can employ a topic modeling approach to provide an automated weighted (across all the secondary features) linking process (disambiguation) that is modeled as topic inference.
  • embodiments extend conventional LDA topic modeling to build a novel topic modeling approach, referred to as multi-component LDA (MC-LDA), that can support any number of components (secondary features) treated as conditionally independent.
  • Embodiments of the modeling approach also can automatically learn the weights of components during training and employ them for inference (linking) in connection with disambiguation.
  • the introduced MC-LDA approach for disambiguation can scale for any additional number of secondary features that could be introduced to increase disambiguation accuracy.
  • FIG. 5 shows a graphical representation of an embodiment of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic computer modeling approach employed by system 400 of FIG. 4 above.
  • each component block represents the modeling of one secondary feature across the knowledge base, for instance as executed via the MemDB 408 of FIG. 4 , initialized with the parameters set forth in FIG. 5 .
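Under the description above, the generative process of such a multi-component model can be written out as follows. The notation (K shared topics, C conditionally independent components with per-component vocabularies) is an assumed reconstruction, since FIG. 5 itself is not reproduced here:

```latex
% Sketch of an MC-LDA generative process (notation assumed, not from FIG. 5):
% C components (token, entity, key-phrase, ... "bags"); K topics shared by all.
\begin{align*}
&\theta_d \sim \mathrm{Dirichlet}(\alpha)
  && \text{per-document topic mixture, shared across components}\\
&\phi_k^{(c)} \sim \mathrm{Dirichlet}\big(\beta^{(c)}\big)
  && \text{per-topic surface-form distribution for component } c\\
&z_{d,i}^{(c)} \sim \mathrm{Multinomial}(\theta_d)
  && \text{topic for token } i \text{ of component } c \text{ in document } d\\
&w_{d,i}^{(c)} \sim \mathrm{Multinomial}\big(\phi_{z_{d,i}^{(c)}}^{(c)}\big)
  && \text{observed surface form}
\end{align*}
```

Sharing θ_d across components is what lets evidence from one component (e.g., co-occurring entities) inform the topic assignments, and hence the disambiguation, of another.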
  • FIG. 6 illustrates an embodiment of the Gibbs sampling equations for MC-LDA topic model employed in FIG. 5 above.
  • An embodiment of this sampling approach aids the system 400 of FIG. 4 in training the individual component (secondary feature) weights in an automated fashion and in an efficient manner.
  • FIG. 7 illustrates an embodiment of the computer executed implementation of a stochastic variational inference algorithm for training and inference in MC-LDA topic model of FIGS. 5-6 , for instance as executed via the MemDB 408 of the system 400 of FIG. 4 that is initialized with the parameters set forth in FIG. 7 .
  • An embodiment of this inference method applies readily to model the linking/disambiguation process as topic inference, by taking all the secondary features (extracted from the document of interest) as an input and providing weighted topics as the output. These weighted topics can then be used to compute a similarity score against the stored feature knowledge base entries.
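That final matching step can be sketched as a cosine similarity between the inferred weighted-topic vector and each stored entry's topic vector; the sparse-dict representation and entry names are illustrative assumptions:

```python
import math

def similarity_to_entries(doc_topics, kb_entries):
    """Toy sketch of the final step: cosine similarity between a document's
    inferred weighted-topic vector and each stored knowledge-base entry's
    topic vector. Sparse dicts map topic ID -> weight; names are illustrative."""
    def cosine(u, v):
        keys = set(u) | set(v)
        dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return {name: cosine(doc_topics, vec) for name, vec in kb_entries.items()}
```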
  • FIG. 8 is a table illustrating a sample topic for a MC-LDA topic model.
  • FIG. 8 displays the top scoring surface forms for each component of the model, for instance as executed via the MemDB 408 of the system 400 of FIG. 4 , according to an embodiment.
  • Example #1 is an application of method 100 for disambiguating features in unstructured text, where the feature of interest (primary feature) is John Doe, a football player, and the user wants to monitor the news referencing John Doe.
  • a document input 102 mentioning John Doe may be uploaded into the network.
  • Features of document input 102 may be extracted and included in MemDB 408 to be disambiguated and linked to a cluster of secondary features associated with the primary feature (John Doe), and compared to existing clusters of similar features.
  • Method 100 may output different feature IDs and their associated clusters that include all secondary features related to John Doe; for example, John Doe, engineer; John Doe, teacher; and John Doe, football player.
  • Example #2 is an application of method 100 for disambiguating features in an unstructured document, where the primary feature may be an image.
  • method 100 may include a feature extraction 104 , where a feature may be a general attribute, such as edges and shapes, among others; or a specific attribute, such as a tank, a person, and a clock, among others.
  • a new image may be input, where the image may have secondary features such as a specific shape (e.g., the shape of a square, a person, or a car); the secondary features may be extracted and included in the MemDB 408 , where a match may be found among all other images that have similar secondary features.
  • features may include only images; i.e., text may not be included as a feature.
  • Example #3 is an application of method 100 for disambiguating features in unstructured text, where the primary feature may be an event.
  • method 100 may allow a user to receive results associated with an event, such as an earthquake, a fire, or an epidemic outbreak, among others.
  • Method 100 may perform a feature extraction 104 and feature disambiguation 108 of the features to find the event's associated features and provide feature IDs of disambiguated features 110 .
  • Example #4 is an embodiment of method 100 , where prediction of one or more events that might occur may be made.
  • a user may indicate features and events of interest prior to operation; therefore, links between different features associated with the events of interest may be established in advance.
  • method 100 may predict that the event of interest might occur, based on an increased number of occurrences of the associated features.
  • an alert may be sent to the user. For example, a user working for the health department of Thailand may choose to receive an alert for an epidemic outbreak of dengue.
  • method 100 may disambiguate all related comments from the social networks and, taking into account the number of users 406 posting related information, may predict and alert the health department worker that an epidemic outbreak of dengue may be occurring. Therefore, the health department worker may have additional evidence, and further actions may be taken in the affected community to keep the epidemic from spreading.
  • Example #5 is an application of method 100 for disambiguating features in unstructured text, where primary features may be geographic place names.
  • method 100 may be employed to disambiguate the name of a city, where different scoring weights may be associated with secondary features in the disambiguation sub-module.
  • method 100 may be employed to disambiguate Paris, Tex. from Paris, France.
  • Example #6 is an application of method 100 for disambiguating features in unstructured text, where primary features may be sentiments associated with a person, event, or company, among others; where sentiments may be positive or negative comments about a person, event, or company, among others, that may be fed from any suitable source, including social networks.
  • method 100 may be employed by a company to gauge the acceptance it is having among the public.
  • Example #7 is an embodiment of method 100 , where method 100 may include human validation in order to increase the confidence score of a feature.
  • link OTF process 300 FIG. 4
  • the user may indicate whether a disambiguated feature has been correctly disambiguated, and may indicate whether two different clusters should be merged into one, meaning that what method 100 (taking into account all feature and topic co-occurrence information) identifies as two different primary features may, to the user's knowledge, be the same. The confidence score associated with the merged cluster may therefore be higher, and thus the probability of the feature being correctly disambiguated may be higher.
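The human-validation step of Example #7 can be sketched as a cluster merge that raises the confidence score. The data layout and the score-update rule (a capped additive boost) are illustrative assumptions, not the disclosure's actual scoring model.

```python
# Hypothetical sketch of Example #7: when a user asserts that two clusters
# describe the same primary feature, merge them and raise the confidence
# score of the merged cluster (capped at 1.0).

def merge_validated_clusters(cluster_a, cluster_b, boost=0.1):
    """Merge two clusters a user has validated as the same primary feature."""
    return {
        "primary": cluster_a["primary"],
        "secondary": sorted(set(cluster_a["secondary"]) | set(cluster_b["secondary"])),
        # Human confirmation raises confidence above either original score.
        "confidence": min(1.0, max(cluster_a["confidence"], cluster_b["confidence"]) + boost),
    }

a = {"primary": "J. Smith", "secondary": ["physicist", "MIT"], "confidence": 0.72}
b = {"primary": "John Smith", "secondary": ["MIT", "quantum optics"], "confidence": 0.65}
merged = merge_validated_clusters(a, b)
```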
  • Example #8 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300 .
  • the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.85 within a period of 1000 ms.
  • Example #9 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300 .
  • the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.80 within a period not exceeding 300 ms.
  • the algorithm used in this example provides an answer in a shorter period of time than the algorithm used in example #8 but generally returns a lower confidence score.
  • Example #10 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300 .
  • the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.90 within a period generally exceeding 3000 ms.
  • the algorithm used in this example provides an answer with a confidence score generally greater than that returned by the algorithm used in example #8, but generally requires a significantly longer period of time.
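Examples #8 through #10 describe linking-algorithm configurations that trade confidence against latency. A caller could select among them with a helper like the one below; the algorithm names are hypothetical, while the confidence and latency figures mirror the three examples above.

```python
# Hypothetical selector over the confidence/latency trade-offs of
# examples #8-#10: pick the highest-confidence linking algorithm whose
# typical latency fits the caller's budget.

LINKING_ALGORITHMS = [
    # (name, minimum confidence score, typical latency in ms)
    ("fast",     0.80,  300),   # example #9
    ("balanced", 0.85, 1000),   # example #8
    ("thorough", 0.90, 3000),   # example #10
]

def choose_linking_algorithm(latency_budget_ms):
    """Return the most confident algorithm whose latency fits the budget,
    or None when no configured algorithm is fast enough."""
    fitting = [a for a in LINKING_ALGORITHMS if a[2] <= latency_budget_ms]
    if not fitting:
        return None
    return max(fitting, key=lambda a: a[1])[0]
```

For an interactive query one might call `choose_linking_algorithm(500)` and get the fast variant; a batch job with a generous budget would get the thorough one.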
  • Example #11 is an example of method 100 for disambiguating features in unstructured text to perform e-discovery on a large corpus of documents from a plurality of sources. Given a large corpus of documents from a plurality of sources, applying method 100 to disambiguate all features in those documents enables the discovery of all features in the corpus. The collection of discovered features can be further utilized to discover all documents related to a feature, as well as to discover related features.
  • process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by one of skill in the art, the steps in the foregoing embodiments may be performed in any order. Words such as “then,” “next,” etc. are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of the methods.
  • process flow diagrams may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged.
  • a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
  • Embodiments may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof.
  • a code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
  • a code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents.
  • Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
  • When implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium.
  • the steps of a method or algorithm disclosed here may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium.
  • a non-transitory computer-readable or processor-readable media includes both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another.
  • a non-transitory processor-readable storage media may be any available media that may be accessed by a computer.
  • non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer or processor.
  • Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.
  • the various components of the technology can be located at distant portions of a distributed network and/or the Internet, or within a dedicated secure, unsecured and/or encrypted system.
  • the components of the system can be combined into one or more devices or co-located on a particular node of a distributed network, such as a telecommunications network.
  • the components of the system can be arranged at any location within a distributed network without affecting the operation of the system.
  • the components could be embedded in a dedicated machine.
  • the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed element(s) that is capable of supplying and/or communicating data to and from the connected elements.
  • module as used herein can refer to any known or later developed hardware, software, firmware, or combination thereof that is capable of performing the functionality associated with that element.
  • determine, calculate and compute, and variations thereof, as used herein are used interchangeably and include any type of methodology, process, mathematical operation or technique.

Abstract

A method for disambiguating features in unstructured text is provided. The disclosed method may not require pre-existing links to be present. The method for disambiguating features in unstructured text may use co-occurring features derived from both the source document and a large document corpus. The disclosed method may include multiple modules, including a linking module for linking the derived features from the source document to the co-occurring features of an existing knowledge base. The disclosed method for disambiguating features may allow identifying unique entities from a knowledge base that includes entities with a unique set of co-occurring features, which in turn may allow for increased precision in knowledge discovery and search results, employing advanced analytical methods over a massive corpus, employing a combination of entities, co-occurring entities, topic IDs, and other derived features.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 14/557,794, entitled “Method For Disambiguating Features In Unstructured Text,” filed on Dec. 2, 2014, which claims the benefit of U.S. Provisional Application No. 61/910,739, entitled “Method For Disambiguating Features In Unstructured Text,” filed on Dec. 2, 2013, each of which is hereby incorporated by reference in its entirety.
  • This application is related to U.S. application Ser. No. 14/558,300, entitled “Event Detection Through Text Analysis Using Trained Event Template Models,” filed Dec. 2, 2014; and U.S. application Ser. No. 14/558,254, entitled “Design And Implementation Of Clustered In-Memory Database,” filed Dec. 2, 2014; each of which is hereby incorporated by reference in its entirety.
  • TECHNICAL FIELD
  • The present disclosure generally relates to data management; and, more specifically, to data management systems and methods that extract and store material from source items received over a network.
  • BACKGROUND
  • Searching for information about entities (i.e., people, locations, organizations) in large document collections, including sources such as a network, may often be ambiguous, which may lead to imprecise text processing functions, imprecise association of features during knowledge extraction, and, thus, imprecise data analysis.
  • State of the art systems use linkage-based clustering and ranking in several algorithms, such as PageRank and the hyperlink-induced topic search (HITS) algorithm. The basic idea behind these and related approaches is that pre-existing links typically exist between related pages or concepts. A limitation of clustering-based techniques is that the contextual information needed to disambiguate entities is sometimes not present, leading to incorrectly disambiguated results. Similarly, documents about different entities in the same or superficially similar contexts may be incorrectly clustered together.
  • Other systems attempt to disambiguate entities by reference to one or more external dictionaries (or knowledgebase) of entities. In such systems, an entity's context is compared to possible matching entities in the dictionary and the closest match is returned. A limitation associated with current dictionary-based techniques stems from the fact that entities may increase in number at any moment and, therefore, no dictionary may include a representation of all of the world's entities. Thus, if a document's context is matched to an entity in the dictionary, then the technique has identified only the most similar entity in the dictionary, and not necessarily the correct entity, which may be outside the dictionary.
  • Most methods use only entities and key phrases in the disambiguation process. Therefore, there is still a need for accurate entity disambiguation techniques that allow precise data analysis.
  • SUMMARY
  • Some embodiments describe a method for disambiguating features. The method may include multiple modules, such as one or more feature extraction modules, one or more disambiguation modules, one or more scoring modules, and one or more linking modules.
  • Disambiguating features may be supported in part by extracting topics from the feature's ambient document, employing a multi-component extension of Latent Dirichlet Allocation (MC-LDA) topic models. Here, each component is modeled around a secondary feature stored in the existing knowledge base or extracted from the incoming document. Further, the linking or disambiguation process is modeled as topic inference under the MC-LDA, which provides automated estimation of weights during MC-LDA training and applies them readily during inference.
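For orientation, a multi-component, conditionally-independent LDA of the kind referenced here is commonly written with shared per-document topic proportions and separate per-component topic-word distributions. The notation below is a hedged sketch consistent with standard LDA conventions, not a transcription of the disclosure's FIG. 5 or FIG. 6:

```latex
% Assumed generative process for a multi-component, conditionally-independent
% LDA (MC-LDA); notation is illustrative, not taken from the figures.
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{per-document topic proportions, shared by all components} \\
\phi^{(c)}_k &\sim \mathrm{Dirichlet}(\beta_c)
  && \text{per-topic distribution for component } c \text{ (tokens, entities, key phrases, \ldots)} \\
z^{(c)}_{d,i} &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assignment for item } i \text{ of component } c \text{ in document } d \\
w^{(c)}_{d,i} &\sim \mathrm{Multinomial}\bigl(\phi^{(c)}_{z^{(c)}_{d,i}}\bigr)
  && \text{observed surface form}
\end{align*}
```

Under this reading, the components are conditionally independent given the shared topic assignment variables, which is what lets inference over one component (e.g., entities) inform topics for another (e.g., key phrases).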
  • The exemplary method may improve the accuracy of entity disambiguation beyond what may be achieved when document linking is not considered. Taking account of document linkage may allow better disambiguation by considering document and entity relationships implied by links.
  • In one embodiment, a method comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the disambiguation module of the in-memory database server computer, an existing unique identifier (“unique ID”) corresponding to each matching primary feature in the knowledgebase cluster and updating the knowledgebase cluster to include the new cluster; and when there is no match, creating, by the node, a new knowledgebase cluster and assigning a new unique ID to the primary feature of the new knowledgebase cluster; and transmitting, by the node, one of the existing unique ID and the new unique ID for the primary feature.
  • In another embodiment, a non-transitory computer readable medium having stored thereon computer executable instructions comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the node, an existing unique identifier (“unique ID”) corresponding to each matching primary feature in the knowledgebase cluster and updating the knowledgebase cluster to include the new cluster; and when there is no match, creating a new knowledgebase cluster and assigning a new unique ID to the primary feature of the new knowledgebase cluster; and transmitting, by the node, one of the existing unique ID and the new unique ID for the primary feature.
  • Additional features and advantages of an embodiment will be set forth in the description which follows, and in part will be apparent from the description. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the exemplary embodiments in the written description and claims hereof as well as the appended drawings.
  • It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present disclosure can be better understood by referring to the following figures. The accompanying drawings constitute a part of this specification and illustrate an embodiment of the invention and, together with the specification, explain the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosure. In the figures, reference numerals designate corresponding parts throughout the different views.
  • FIG. 1 is a flowchart of a method for disambiguating features in unstructured text, according to an exemplary embodiment.
  • FIG. 2 is a flowchart of the steps performed by a disambiguation module employed in the method for disambiguating features, according to an exemplary embodiment.
  • FIG. 3 is a flowchart of the steps performed by a link on-the-fly module employed in the method for disambiguating features, according to an exemplary embodiment.
  • FIG. 4 is an illustrative diagram of a system employed for implementing the method for disambiguating features, according to an exemplary embodiment.
  • FIG. 5 shows a graphical representation of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic model, according to an exemplary embodiment.
  • FIG. 6 illustrates an embodiment of the Gibbs sampling equations for multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
  • FIG. 7 illustrates an embodiment of the implementation of a stochastic variational inference algorithm for training and inference in multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
  • FIG. 8 is a table illustrating a sample topic for a multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
  • DEFINITIONS
  • As used herein, the following terms have the following definitions:
  • “Document” refers to a discrete electronic representation of information having a start and end.
  • “Multi-Document” refers to a document with its tokens, different types of named entities, and key phrases organized into separate “bag-of-surface-forms” components.
  • “Database” refers to any system including any combination of clusters and modules suitable for storing one or more collections and suitable to process one or more queries.
  • “Corpus” refers to a collection of one or more documents.
  • “Live corpus”, or “Document Stream”, refers to a corpus that is constantly fed as new documents are uploaded into a network.
  • “Feature” refers to any information which is at least partially derived from a document.
  • “Feature attribute” refers to metadata associated with a feature; for example, location of a feature in a document, confidence score, among others.
  • “Cluster” refers to a collection of features.
  • “Entity knowledge base” refers to a knowledge base containing features/entities.
  • “Link on-the-fly module” refers to any linking module that performs data linkage as data is requested from the system rather than as data is added to the system.
  • “Memory” refers to any hardware component suitable for storing information and retrieving said information at a sufficiently high speed.
  • “Module” refers to a computer software component suitable for carrying out one or more defined tasks.
  • “Sentiment” refers to subjective assessments associated with a document, part of a document, or feature.
  • “Topic” refers to a set of thematic information which is at least partially derived from a corpus.
  • “Topic Identifier”, or “topic ID”, refers to an identifier that refers to a specific instance of a topic.
  • “Topic Collection” refers to a specific set of topics derived from the corpus, with each topic having a unique identifier (“unique ID”).
  • “Topic Classification” refers to the assignation of specific topic identifiers as features of a document.
  • “Query” refers to a request to retrieve information from one or more suitable databases.
  • DETAILED DESCRIPTION
  • Reference will now be made in detail to the preferred embodiments, examples of which are illustrated in the accompanying drawings. The embodiments described herein are intended to be exemplary. One skilled in the art recognizes that numerous alternative components and embodiments may be substituted for the particular examples described herein and still fall within the scope of the invention.
  • The present disclosure describes a method for disambiguating features in unstructured text. Although the exemplary embodiments discuss practices for disambiguating features according to this disclosure, it is intended that the systems and methods described herein can be configured for any suitable use within the scope of this disclosure.
  • Existing knowledge bases include non-ambiguous features and their related features, which may lead to low-confidence text analytics. An aspect of the present disclosure includes a method that may allow increased accuracy in feature and entity disambiguation and, therefore, increased accuracy in text analytics.
  • According to an embodiment, the disclosed method for disambiguating features may be employed in an initial corpus of data to perform a document ingestion and a feature extraction that may allow a topic classification and other text analytics on each document included in the initial corpus. Each feature may be identified and recorded as name, type, positional information in the document, and confidence score, among others.
  • FIG. 1 is a flowchart of a method 100 illustrating a plurality of steps for disambiguating features in unstructured text. According to an embodiment, method 100 for disambiguating features may initiate as a new document input, step 102, is made in an existing knowledge base. Subsequently, a feature extraction step 104 may be performed on the document. According to an embodiment, a feature may be related to different feature attributes, such as a topic identifier (“topic ID”), a document identifier (“document ID”), feature type, feature name, confidence score, and feature position, among others.
  • According to various embodiments, the document input in step 102 may be fed from a massive corpus or live corpus (such as the Internet or a network-connected corpus) that, in turn, may be fed every second.
  • According to different embodiments, one or more feature recognition and extraction algorithms may be employed during feature extraction step 104 to analyze the unstructured text of document input step 102. A score may be assigned to each extracted feature. The score may indicate the level of certainty of the feature being correctly extracted with the correct attributes.
  • Additionally, during feature extraction step 104, one or more primary features may be identified from the document input in step 102. Each primary feature may be associated with a set of feature attributes and one or more secondary features. Each secondary feature may be associated with a set of feature attributes. In some embodiments, one or more secondary features may have one or more tertiary features, each having its own set of feature attributes.
  • Taking into account the feature attributes, the relative weight or relevance of each feature within the document input at step 102 may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.
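A weighted scoring model of the kind described above can be sketched as a weighted sum over normalized feature attributes. The attribute names and weights below are illustrative assumptions; the disclosure does not specify them.

```python
# Hypothetical weighted scoring model: each feature attribute is normalized
# to [0, 1] and combined with an assumed weight to give a relevance score.

ATTRIBUTE_WEIGHTS = {
    "extraction_confidence": 0.5,   # score from the extraction algorithm
    "frequency": 0.3,               # normalized occurrences in the document
    "position": 0.2,                # earlier mentions weighted higher
}

def feature_relevance(attributes, weights=ATTRIBUTE_WEIGHTS):
    """Weighted sum of normalized feature attributes; missing attributes count as 0."""
    return sum(weights[name] * attributes.get(name, 0.0) for name in weights)

score = feature_relevance(
    {"extraction_confidence": 0.9, "frequency": 0.4, "position": 1.0}
)
# 0.5*0.9 + 0.3*0.4 + 0.2*1.0 = 0.77
```

The same form can score the strength of an association between two features by replacing the attributes with association-level quantities (e.g., co-occurrence counts).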
  • Following feature extraction step 104, the features extracted from the document input at step 102 and all their related information may be loaded into an in-memory database (MemDB) during inclusion of features in MemDB, step 106, as part of a feature disambiguation request, step 108.
  • In an embodiment, the MemDB forms part of a disambiguation computer server environment having one or more processors executing the steps discussed in connection with FIGS. 1-8. In one embodiment, the MemDB is a computer module that may include one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation sub module. One search controller may be selectively associated with one or more search nodes. Each search node may be capable of independently performing a fuzzy key search through a collection of compressed data and returning a set of scored results to its associated search controller.
  • Feature disambiguation step 108 may be performed by a disambiguation sub module within the MemDB. The feature disambiguation process 108 may include machine-generated topic IDs, which may be employed to classify features, documents, or corpora. The relatedness of individual features and specific topic IDs may be determined using disambiguating algorithms. In some documents, the same feature may be related to one or more topic IDs, depending on the context of the different occurrences of the feature within the document.
  • The set of features (like topics, proximity terms and entities, key phrases, events and facts) extracted from one document may be compared with sets of features from other documents, using disambiguating algorithms to define with a certain level of accuracy if two or more features across different documents are a single feature or if they are distinct features. In some examples, co-occurrence of two or more features across the collection of documents in the database may be analyzed to improve the accuracy of feature disambiguation process 108. In some embodiments, global scoring algorithms may be used to determine the probability of features being the same.
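The cross-document comparison above can be illustrated with a simple co-occurrence score. Jaccard similarity over the sets of co-occurring features is used here as an illustrative stand-in for the disclosure's disambiguating and global scoring algorithms; the mention contexts are invented.

```python
# Illustrative co-occurrence scoring: two mentions are likely the same
# feature when the sets of features co-occurring with them overlap heavily.

def cooccurrence_similarity(context_a, context_b):
    """Jaccard similarity of the co-occurring feature sets of two mentions."""
    a, b = set(context_a), set(context_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

mention_1 = {"Texas", "Lamar County", "Highway 82"}
mention_2 = {"Texas", "Lamar County", "cotton"}
mention_3 = {"France", "Seine", "Eiffel Tower"}

same = cooccurrence_similarity(mention_1, mention_2)   # strong overlap
diff = cooccurrence_similarity(mention_1, mention_3)   # no overlap
```

A threshold on such a score is one way to decide, with a stated level of accuracy, whether two mentions across documents are a single feature or distinct features.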
  • In some embodiments, as part of the feature disambiguation process 108, a knowledge base may be generated within the MemDB. This knowledge base may be used to temporarily store clusters of relevant disambiguated primary features and their related secondary features. When new documents are loaded into the MemDB, the new disambiguated set of features may be compared with the existing knowledge base in order to determine the relationship between features and determine if there is a match between the new features and already extracted features.
  • If the compared features match, the knowledge base may be updated and a feature ID of the matching features may be returned to the user and/or requesting application or process. Further, based on the frequency of matches, a prominence measure, which captures the feature's popularity index in the given corpus, may be attached to the feature ID. If the compared features do not match any of the already extracted features, a unique feature ID is assigned to the disambiguated entity or feature, and the unique feature ID is associated with the cluster of defining features and stored within the knowledge base of the MemDB. Subsequently, in step 110, the feature ID of the disambiguated feature may be returned to the source through the system interface. In some embodiments, the feature ID of the disambiguated feature may include secondary features, a cluster of features, relevant feature attributes, or other requested data. The disambiguation sub module employed for feature disambiguation step 108 is described in more detail in FIG. 2 below.
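The feature-ID bookkeeping just described can be sketched as follows. The dictionary layout, the use of a UUID as the unique feature ID, and the match counter standing in for the prominence measure are all illustrative assumptions.

```python
import uuid

# Hypothetical sketch: a matching cluster returns the existing feature ID
# and bumps a match counter (a simple stand-in for the prominence measure);
# an unmatched cluster is stored under a newly assigned unique feature ID.

knowledge_base = {}   # cluster key -> {"feature_id": ..., "matches": ...}

def resolve_feature(cluster_key):
    """Return (feature_id, match_count) for a disambiguated cluster."""
    entry = knowledge_base.get(cluster_key)
    if entry is None:
        entry = {"feature_id": str(uuid.uuid4()), "matches": 0}
        knowledge_base[cluster_key] = entry
    else:
        entry["matches"] += 1     # frequency of matches drives prominence
    return entry["feature_id"], entry["matches"]

id_1, p_1 = resolve_feature(("Paris", "France"))    # new entry, new unique ID
id_2, p_2 = resolve_feature(("Paris", "France"))    # matched again
id_3, p_3 = resolve_feature(("Paris", "Texas"))     # distinct entity, new ID
```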
  • Disambiguation Sub Module
  • FIG. 2 is a flowchart of a process 200 performed by a disambiguation sub module employed in unstructured texts for feature disambiguation step 108 of the method 100 (FIG. 1), according to an embodiment. Disambiguation process 200 may begin after inclusion of features in MemDB in step 106 of FIG. 1. The extracted features provided in step 202 may be used to perform a candidate search in step 204, in which a search for the extracted features may be performed through all candidate records, including co-occurring features.
  • According to various embodiments, candidates may be primary features with a set of associated secondary features that may be used in feature disambiguation process 108.
  • The disambiguation results may be improved by the co-occurrence of topic IDs and relatedness among topic IDs. The relatedness of topic IDs, even across different topic models can be discovered from a large corpus where the topic IDs have been assigned. Related topic IDs can be used during records linkage step 206 to provide linkage to documents that may not contain the exact topic ID but do contain one or more related topic IDs. This approach may improve the recall of relevant features to be included in the records linkage step 206 and improve disambiguation results in certain cases.
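The recall-expansion idea above can be sketched with a relatedness map over topic IDs. The topic IDs, the map, and its contents are invented for illustration; in the disclosure the relatedness would be discovered from a large corpus.

```python
# Hypothetical sketch: before searching for linkage candidates, expand a
# query's topic IDs with topic IDs known (from corpus statistics) to be
# related, so documents lacking the exact topic ID can still be linked.

RELATED_TOPIC_IDS = {
    "T17": {"T42", "T108"},   # e.g., an "earthquakes" topic related to
    "T42": {"T17"},           # "seismology" and "disaster relief" topics
}

def expand_topic_ids(topic_ids, related=RELATED_TOPIC_IDS):
    """Return the query topic IDs plus all topic IDs related to them."""
    expanded = set(topic_ids)
    for t in topic_ids:
        expanded |= related.get(t, set())
    return expanded

def matches(doc_topic_ids, query_topic_ids):
    """A document matches if it shares any topic ID with the expanded query."""
    return bool(set(doc_topic_ids) & expand_topic_ids(query_topic_ids))

hit = matches({"T108"}, {"T17"})   # no exact topic ID overlap, still linked
```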
  • Once sets of potentially related documents have been identified and sets of relevant primary and secondary features within these documents have been extracted, feature attributes, the relationships between features of the same document (meaningful context), the relative weight of the features and other variables may be used during records linkage process 206, to disambiguate primary and secondary features across documents. Then, each of the records may be linked to other records to determine clusters of disambiguated primary features and their related secondary features. The algorithms used for records linkage 206 may be capable of overcoming spelling errors or transliterations and other challenges of mining unstructured data sets.
  • Cluster comparison step 208 may include the assignment of relative matching scores to clusters of disambiguated features; different thresholds of acceptance may be defined for different applications. The defined levels of accuracy may determine which scores may be considered a positive match search and which scores may be considered a negative match search, step 210. Each new cluster may be given a unique ID and may be temporarily stored in a knowledge base. Each new cluster may include a new disambiguated primary feature and its set of secondary features. If a new cluster matches a cluster that is already stored in the knowledge base, the system updates the knowledge base in step 212, and a matched feature ID may be returned to the user and/or requesting application or process in step 214. Update of knowledge base 212 may imply the association of additional secondary features with one primary feature, or the addition of feature attributes that were not previously associated with primary or secondary features.
  • If the cluster being evaluated is assigned a score below the threshold of positive match search 210, the system performs a unique ID assignment, step 216, for the primary feature of the cluster and updates the knowledge base in step 212. Afterwards, the system performs the return of matched ID process 214. Records linkage step 206 is explained in further detail in FIG. 3.
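The positive/negative match decision in steps 208 through 216 can be sketched as below. Jaccard overlap of secondary features is an assumed stand-in for the patent's scoring, and the `F1`/`F2` identifier scheme is invented for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap score between two secondary-feature sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def resolve_cluster(new_cluster, knowledge_base, threshold=0.5):
    """Return the matched feature ID, or assign a new one below the threshold."""
    best_id, best_score = None, 0.0
    for fid, stored in knowledge_base.items():
        score = jaccard(new_cluster["secondary"], stored["secondary"])
        if score > best_score:
            best_id, best_score = fid, score
    if best_score >= threshold:                # positive match: update KB (step 212)
        knowledge_base[best_id]["secondary"] |= new_cluster["secondary"]
        return best_id
    new_id = f"F{len(knowledge_base) + 1}"     # negative match: new unique ID (step 216)
    knowledge_base[new_id] = new_cluster
    return new_id
```

Either branch ends by returning a feature ID, mirroring the return of matched ID process 214.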
  • Link On-the-Fly Sub Module
  • FIG. 3 is a flowchart of a process 300 performed by a link on-the-fly (“link OTF”) sub module employed in method 100 for disambiguating features, according to an embodiment. Link OTF process 300 may be capable of constantly evaluating, scoring, linking, and clustering a feed of information. Link OTF sub module may perform records linkage 206 using multiple algorithms. Candidate search results of step 204 may be constantly fed into link OTF module 300. The input of data may be followed by a match scoring algorithm application, step 302, where one or more match scoring algorithms may be applied simultaneously in multiple search nodes of the MemDB while performing fuzzy key searches for evaluating and scoring the relevant results, taking into account multiple feature attributes, such as string edit distances, phonetics, and sentiments, among others.
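A minimal sketch of the multi-attribute match scoring in step 302 might combine per-attribute similarities under a weighted model. Here `difflib`'s ratio stands in for a string-similarity attribute, and the attribute names and weights are assumptions for illustration, not values from the patent.

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Rough string-similarity attribute in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(candidate: dict, query: dict, weights: dict) -> float:
    """Weighted combination of per-attribute similarity scores."""
    scores = {
        "name": string_similarity(candidate["name"], query["name"]),
        "topic": 1.0 if candidate["topic"] == query["topic"] else 0.0,
    }
    return sum(weights[k] * scores[k] for k in scores) / sum(weights.values())
```

In a deployment, further attributes (phonetic codes, sentiment, etc.) would each contribute one weighted term.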
  • Afterwards, a linking algorithm application step 304 may be added to compare all candidate records, identified during match scoring algorithm application step 302, to each other. Linking algorithm application 304 may include the use of one or more analytical linking algorithms capable of filtering and evaluating the scored results of the fuzzy key searches performed inside the multiple search nodes of the MemDB. In some examples, co-occurrence of two or more features across the collection of identified candidate records in the MemDB may be analyzed to improve the accuracy of the process. Different weighted models and confidence scores associated with different feature attributes may be taken into account for linking algorithm application 304.
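The co-occurrence analysis mentioned above can be sketched as a pairwise count over the candidate records. The input layout (one feature list per record) is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(records):
    """Count, for every unordered feature pair, the records containing both."""
    counts = Counter()
    for features in records:
        for pair in combinations(sorted(set(features)), 2):
            counts[pair] += 1
    return counts
```

High pair counts are the kind of evidence a linking algorithm could weight when deciding that two mentions refer to the same primary feature.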
  • After the linking algorithm application step 304, the linked results may be arranged in clusters of related features and returned, as part of return of linked records clusters in step 306.
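One way to arrange pairwise-linked records into the clusters returned in step 306 is a union-find pass over the links. This is an illustrative sketch, not the patent's specific clustering algorithm.

```python
def cluster_links(n_records, links):
    """Group record indices 0..n_records-1 into clusters of linked records."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Records linked only transitively (0–1 and 1–2, but never 0–2 directly) still land in one cluster, which matches the intent of clustering related features.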
  • FIG. 4 is an illustrative diagram of an embodiment of a system 400 for disambiguating features in unstructured text, as discussed above in connection with FIG. 1. The system 400 hosts an in-memory database and comprises one or more nodes.
  • According to an embodiment, the system 400 includes one or more processors executing computer instructions for a plurality of special-purpose computer modules 401, 402, 411, 412, and 414 (discussed below) to disambiguate features within one or more documents. As shown in FIG. 4, the document input modules 401, 402 receive documents from internet based sources and/or a live corpus of documents. A large number of new documents may be uploaded by the second into the document input module 402 through a network connection 404. Therefore, the source may be constantly gaining new knowledge, updated by user workstations 406, where such new knowledge is not pre-linked in a static way. Thus, the number of documents to be evaluated may increase indefinitely.
  • This evaluation may be achieved via the MemDB computer 408. MemDB 408 may facilitate a faster disambiguation process and may facilitate disambiguation on-the-fly, which may allow the latest information to contribute to MemDB 408. Various methods for linking the features may be employed, which may essentially use a weighted model to determine which entity types are most important and carry more weight, and, based on confidence scores, to determine how confidently the extraction and disambiguation of the correct features has been performed and whether the correct feature may go into the resulting cluster of features. As shown in FIG. 4, as more system nodes work in parallel, the process may become more efficient.
  • According to various embodiments, when a new document arrives into the system 400 via the document input module 401, 402 through a network connection 404, feature extraction is performed via the extraction module 411 and, then, feature disambiguation may be performed on the new document via the feature disambiguation sub-module 414 of the MemDB 408. In one embodiment, after feature disambiguation of the new document is performed, the extracted new features 410 may be included in the MemDB to pass through link OTF sub-module 412; where the features may be compared and linked, and a feature ID of disambiguated feature 110 may be returned to the user as a result from a query. In addition to the feature ID, the resulting feature cluster defining the disambiguated feature may optionally be returned.
  • MemDB computer 408 can be a database storing data in records controlled by a database management system (DBMS) (not shown) configured to store data records in a device's main memory, as opposed to conventional databases and DBMS modules that store data in “disk” memory. Conventional disk storage requires processors (CPUs) to execute read and write commands to a device's hard disk, thus requiring CPUs to execute instructions to locate (i.e., seek) and retrieve the memory location for the data, before performing some type of operation with the data at that memory location. In-memory database systems access data that is placed into main memory, and then addressed accordingly, thereby reducing the number of instructions performed by the CPUs and eliminating the seek time associated with CPUs seeking data on hard disk.
  • In-memory databases may be implemented in a distributed computing architecture, which may be a computing system comprising one or more nodes configured to aggregate the nodes' respective resources (e.g., memory, disks, processors). As disclosed herein, embodiments of a computing system hosting an in-memory database may distribute and store data records of the database among one or more nodes. In some embodiments, these nodes are formed into “clusters” of nodes. In some embodiments, these clusters of nodes store portions, or “collections,” of database information.
  • Various embodiments provide a computer executed feature disambiguation technique that employs an evolving and efficiently linkable feature knowledge base that is configured to store secondary features, such as co-occurring topics, key phrases, proximity terms, events, facts and trending popularity index. The disclosed embodiments may be performed via a wide variety of linking algorithms that can vary from simple conceptual distance measures to sophisticated graph clustering approaches based on the dimensions of the involved secondary features that aid in resolving a given extracted feature to a stored feature in the knowledge base. Additionally, embodiments can introduce an approach that evolves the existing feature knowledge base with a capability that not only updates the secondary features of an existing feature entry, but also expands it by discovering new features that can be appended to the knowledge base.
  • Embodiments of the disambiguation approach can employ a topic modeling approach to provide an automated weighted (across all the secondary features) linking process (disambiguation) that is modeled as topic inference. To support the automated weighted linking process, embodiments extend conventional LDA topic modeling to build a novel topic modeling approach, referred to as multi-component LDA (MC-LDA), that can support any number of components (secondary features) treated as conditionally independent. Embodiments of the modeling approach also can automatically learn the weights of components during training and employ them for inference (linking) in connection with disambiguation. The introduced MC-LDA approach for disambiguation can scale to any number of additional secondary features that may be introduced to increase disambiguation accuracy.
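The linking-as-topic-inference idea can be sketched as follows: a document and each knowledge-base entry are represented as weighted topic distributions, and linking selects the entry with the highest similarity. The vectors, entry names, and the use of cosine similarity below are invented for illustration; the actual MC-LDA inference of FIGS. 5-7 is not reproduced here.

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse topic-weight vectors."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def link_by_topics(doc_topics: dict, knowledge_base: dict) -> str:
    """Return the knowledge-base feature ID whose topic weights best match."""
    return max(knowledge_base, key=lambda fid: cosine(doc_topics, knowledge_base[fid]))
```

In the embodiments described above, the inferred per-component weights would shape these distributions before the similarity score is computed.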
  • FIG. 5 shows a graphical representation of an embodiment of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic computer modeling approach employed by system 400 of FIG. 4 above. In the illustrated embodiment, each component block represents modeling each secondary feature across the knowledge base, for instance as executed via the MemDB 408 of FIG. 4 that is initialized with the parameters set forth in FIG. 5.
  • FIG. 6 illustrates an embodiment of the Gibbs sampling equations for MC-LDA topic model employed in FIG. 5 above. An embodiment of this sampling approach aids the system 400 of FIG. 4 in training the individual component (secondary feature) weights in an automated fashion and in an efficient manner.
  • FIG. 7 illustrates an embodiment of the computer executed implementation of a stochastic variational inference algorithm for training and inference in MC-LDA topic model of FIGS. 5-6, for instance as executed via the MemDB 408 of the system 400 of FIG. 4 that is initialized with the parameters set forth in FIG. 7. An embodiment of this inference method applies readily to model the linking/disambiguation process as topic inference, by taking all the secondary features (extracted from the document of interest) as an input and providing weighted topics as the output. These weighted topics can then be used to compute a similarity score against the stored feature knowledge base entries.
  • FIG. 8 is a table illustrating a sample topic for a MC-LDA topic model. FIG. 8 displays the top scoring surface forms for each component of the model, for instance as executed via the MemDB 408 of the system 400 of FIG. 4, according to an embodiment.
  • Example #1 is an application of method 100 for disambiguating features in unstructured text, where the feature of interest (primary feature) is John Doe, a football player, and the user wants to monitor the news referencing John Doe. According to one embodiment, a document input 102 mentioning John Doe may be uploaded into the network. Features of document input 102 may be extracted and included in MemDB 408 to be disambiguated, linked to a cluster of secondary features associated with the primary feature (John Doe), and compared to existing clusters of similar features. Method 100 may output different feature IDs and their associated clusters that include all secondary features related to John Doe; for example, John Doe, engineer; John Doe, teacher; and John Doe, football player. Other primary features with similar secondary features may be considered, for example nicknames or short names. Then “JD,” a football player from the same team as John Doe, with the same age and career, may be considered the same primary feature. Therefore, all documents related to John Doe, football player, may be accessed easily.
  • Example #2 is an application of method 100 for disambiguating features in an unstructured document, where the primary feature may be an image. According to one embodiment, method 100 may include a feature extraction 104, where a feature may be a general attribute, such as edges and shapes, among others; or a specific attribute, such as a tank, a person, and a clock, among others. For example, a new image may be input, where the image may have secondary features such as a specific shape (e.g., the shape of a square, a person, or a car); the secondary features may be extracted and included in the MemDB 408, where a match may be found among all other images that have similar secondary features. According to the present embodiment, features may include only images; i.e., text may not be included as a feature.
  • Example #3 is an application of method 100 for disambiguating features in unstructured text, where the primary feature may be an event. According to one embodiment, when a query is made, method 100 may allow a user to receive results associated with an event, such as an earthquake, a fire, or an epidemic outbreak, among others. Method 100 may perform a feature extraction 104 and feature disambiguation 108 of the features to find the event's associated features and provide feature IDs of disambiguated features 110.
  • Example #4 is an embodiment of method 100, where a prediction of one or more events that might occur may be made. According to one embodiment, a user may indicate features and events of interest prior to operation, and, therefore, links between different features associated with the events of interest may be established in advance. As the associated features appear in the network in a high number of occurrences, method 100 may predict that the event of interest might occur, based on the increased number of occurrences of the associated features. When the imminent event is detected, an alert may be sent to the user. For example, a user working for the health department of Thailand may choose to receive an alert for an epidemic outbreak of dengue. As other users 406 from, for example, social networks upload comments mentioning symptoms of dengue or hospital admissions, method 100 may disambiguate all related comments from the social networks, and, taking into account the number of users 406 posting related information, may predict and alert the health department worker that an epidemic outbreak of dengue may be occurring. Therefore, the health department worker may have additional evidence, and further actions may be taken in the affected community to keep the epidemic from spreading.
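A hypothetical sketch of the prediction step in Example #4: alert when the latest period's mention count spikes well above the recent baseline. The spike factor and minimum-mention threshold are invented values, not parameters from the patent.

```python
def should_alert(period_counts, factor=3.0, min_mentions=10):
    """True if the newest count is both large and a multiple of the baseline."""
    *baseline, latest = period_counts
    average = sum(baseline) / len(baseline) if baseline else 0.0
    return latest >= min_mentions and latest >= factor * max(average, 1.0)
```

A steady stream of mentions stays silent; only a sharp jump over the baseline triggers the alert the example describes.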
  • Example #5 is an application of method 100 for disambiguating features in unstructured text, where primary features may be geographic place names. According to an embodiment, method 100 may be employed to disambiguate the name of a city, where different scoring weights may be associated with secondary features in the disambiguation sub-module. For example, method 100 may be employed to disambiguate Paris, Tex. from Paris, France.
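The weighted secondary-feature scoring of Example #5 can be sketched as below. The candidate entries, context terms, and weights are invented for illustration; the patent does not specify this scoring function.

```python
def score_place(context_terms: set, place: dict) -> float:
    """Sum the weights of context terms that overlap the place's secondary features."""
    return sum(place["weights"].get(t, 1.0) for t in context_terms & place["context"])

def disambiguate_place(context_terms: set, candidates: list) -> str:
    """Pick the candidate place whose secondary features best match the context."""
    return max(candidates, key=lambda p: score_place(context_terms, p))["name"]
```

Context terms drawn from the surrounding document steer the score toward one candidate, so "rodeo" and "texas" resolve the mention to Paris, Texas rather than Paris, France.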
  • Example #6 is an application of method 100 for disambiguating features in unstructured text, where primary features may be sentiments associated with a person, event, or company, among others; where sentiments may be positive or negative comments about a person, event, or company, among others, that may be fed from any suitable source, including social networks. According to one embodiment, method 100 may be employed by a company to gauge its acceptance among the public.
  • Example #7 is an embodiment of method 100, where method 100 may include human validation to increase the confidence score of a feature. According to one embodiment, link OTF process 300 (FIG. 3) may be assisted by a user, who may indicate whether a disambiguated feature has been correctly disambiguated and whether two different clusters should be one; that is, the user may have knowledge that what method 100 (taking into account all feature and topic co-occurrence information) is indicating as two different primary features is in fact the same feature. The confidence score associated with that cluster may therefore be higher, and, thus, the probability that the feature is correctly disambiguated may be higher.
  • Example #8 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300. In this example, the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.85 within a period of 1000 ms.
  • Example #9 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300. In this example, the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.80 within a period not exceeding 300 ms. The algorithm used in this example provides an answer in a smaller period of time compared to the algorithm used in example #8 but generally returns a lower confidence score.
  • Example #10 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300. In this example, the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.90 within a period generally exceeding 3000 ms. The algorithm used in this example provides an answer with a confidence score generally greater than that returned by the algorithm used in example #8, but generally requires a significantly longer period of time.
  • Example #11 is an application of method 100 for disambiguating features in unstructured text to perform e-discovery on a large corpus of documents from a plurality of sources. Given such a corpus, applying method 100 to disambiguate all features in those documents enables the discovery of all features in the corpus. The collection of discovered features can be further utilized to discover all documents related to a feature and to discover related features.
  • The foregoing method descriptions and the process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by one of skill in the art, the steps in the foregoing embodiments may be performed in any order. Words such as “then,” “next,” etc. are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of the methods. Although process flow diagrams may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
  • The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed here may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
  • Embodiments implemented in computer software may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof. A code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
  • The actual software code or specialized control hardware used to implement these systems and methods is not limiting of the invention. Thus, the operation and behavior of the systems and methods were described without reference to the specific software code, it being understood that software and control hardware can be designed to implement the systems and methods based on the description here.
  • When implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium. The steps of a method or algorithm disclosed here may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium. A non-transitory computer-readable or processor-readable medium includes both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another. A non-transitory processor-readable storage medium may be any available medium that may be accessed by a computer. By way of example, and not limitation, such non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer or processor. Disk and disc, as used here, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.
  • It is to be appreciated that the various components of the technology can be located at distant portions of a distributed network and/or the Internet, or within a dedicated secure, unsecured and/or encrypted system. Thus, it should be appreciated that the components of the system can be combined into one or more devices or co-located on a particular node of a distributed network, such as a telecommunications network. As will be appreciated from the description, and for reasons of computational efficiency, the components of the system can be arranged at any location within a distributed network without affecting the operation of the system. Moreover, the components could be embedded in a dedicated machine.
  • Furthermore, it should be appreciated that the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed element(s) that is capable of supplying and/or communicating data to and from the connected elements. The term module as used herein can refer to any known or later developed hardware, software, firmware, or combination thereof that is capable of performing the functionality associated with that element. The terms determine, calculate and compute, and variations thereof, as used herein are used interchangeably and include any type of methodology, process, mathematical operation or technique.
  • The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined here may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown here but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed here.
  • The embodiments described above are intended to be exemplary. One skilled in the art recognizes that numerous alternative components and embodiments may be substituted for the particular examples described herein and still fall within the scope of the invention.

Claims (20)

What is claimed is:
1. A method comprising:
in response to receiving, by a server, a search query from a client:
searching, by the server, a set of records comprising a co-occurring feature, wherein the server comprises a main memory hosting a database, wherein the database stores a first cluster, wherein the first cluster comprises a disambiguated primary feature with a unique identifier and a set of secondary features, wherein the first cluster comprises a first score;
identifying, by the server, a record in the set of records, wherein the record matches an extracted feature such that the extracted feature is a primary feature;
associating, by the server, the extracted feature with a topic identifier;
disambiguating, by the server, the primary feature based on a relatedness of the topic identifier;
identifying, by the server, the set of secondary features based on the relatedness;
disambiguating, by the server, the primary feature from the set of secondary features based on the relatedness;
accessing, by the server, the database;
linking, by the server, in real-time, during the accessing, the primary feature to the set of secondary features;
forming, by the server, a second cluster based on the linking, wherein the second cluster comprises a second score;
comparing, by the server, the first score against the second score;
determining, by the server, whether the first score matches the second score;
identifying, by the server, the unique identifier related to the primary feature in the first cluster based on the first score matching the second score;
amending, by the server, based on the identifying the unique identifier, the first cluster such that the first cluster includes the second cluster; and
sending, by the server, the unique identifier to the client.
2. The method of claim 1, further comprising:
comparing, by the server, each member of the set of records which matches the extracted feature against a data item;
assigning, by the server, a third score to the extracted feature based on the comparing of each member.
3. The method of claim 2, further comprising:
associating, by the server, the extracted feature with a feature attribute.
4. The method of claim 3, wherein the feature attribute is weighted.
5. The method of claim 3, further comprising:
determining, by the server, a relatedness of the extracted feature based on the feature attribute.
6. The method of claim 1, wherein the primary feature is associated with a feature attribute.
7. The method of claim 1, wherein the extracted feature is associated with a lower-ordinal feature in accordance with a cluster hierarchy.
8. The method of claim 1, wherein the searching is in a fuzzy manner.
9. The method of claim 1, further comprising:
comparing, by the server, a first feature against a second feature, wherein the first feature comprises the extracted feature, wherein the first feature is provided via a first data source, wherein the second feature is provided via a second data source;
determining, by the server, if the first feature co-occurs in the second data source based on the comparing of the first feature against the second feature;
linking, by the server, at least one of the first data source or the second data source.
10. The method of claim 1, further comprising:
determining, by the server, a co-occurrence of the extracted feature in a plurality of data sources;
improving, by the server, a rate of accuracy of the disambiguating based on the determining of the co-occurrence of the extracted feature.
11. A method comprising:
in response to receiving, by a server, a search query from a client:
searching, by the server, based on the receiving, a set of records comprising a co-occurring feature, wherein the server comprises a main memory hosting a database, wherein the database stores a first cluster, wherein the first cluster comprises a disambiguated primary feature with a first unique identifier and a set of secondary features, wherein the first cluster comprises a first score;
identifying, by the server, a record in the set of records, wherein the record matches an extracted feature such that the extracted feature is a first primary feature;
associating, by the server, the extracted feature with a topic identifier;
disambiguating, by the server, the first primary feature based on a relatedness of the topic identifier;
identifying, by the server, the set of secondary features based on the relatedness;
disambiguating, by the server, the first primary feature from the set of secondary features based on the relatedness;
accessing, by the server, the database;
linking, by the server, in real-time, during the accessing, the first primary feature to the set of secondary features;
forming, by the server, a second cluster based on the linking, wherein the second cluster comprises a second score;
comparing, by the server, the first score against the second score;
determining, by the server, whether the first score matches the second score;
generating, by the server, a third cluster based on the first score not matching the second score, wherein the third cluster comprises a second primary feature;
assigning, by the server, a second unique identifier to the second primary feature;
sending, by the server, the second unique identifier to the client.
12. The method of claim 11, further comprising:
comparing, by the server, each member of the set of records which matches the extracted feature against a data item;
assigning, by the server, a third score to the extracted feature based on the comparing of each member.
13. The method of claim 12, further comprising:
associating, by the server, the extracted feature with a feature attribute.
14. The method of claim 13, wherein the feature attribute is weighted.
15. The method of claim 13, further comprising:
determining, by the server, a relatedness of the extracted feature based on the feature attribute.
16. The method of claim 11, wherein at least one of the first primary feature or the second primary feature is associated with a feature attribute.
17. The method of claim 11, wherein the extracted feature is associated with a lower-ordinal feature in accordance with a cluster hierarchy.
18. The method of claim 11, wherein the searching is in a fuzzy manner.
19. The method of claim 11, further comprising:
comparing, by the server, a first feature against a second feature, wherein the first feature comprises the extracted feature, wherein the first feature is provided via a first data source, wherein the second feature is provided via a second data source;
determining, by the server, if the first feature co-occurs in the second data source based on the comparing of the first feature against the second feature;
linking, by the server, at least one of the first data source or the second data source.
20. The method of claim 11, further comprising:
determining, by the server, a co-occurrence of the extracted feature in a plurality of data sources;
improving, by the server, a rate of accuracy of the disambiguating based on the determining of the co-occurrence of the extracted feature.
US14/979,703 2013-12-02 2015-12-28 Method for disambiguated features in unstructured text Abandoned US20160110446A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/979,703 US20160110446A1 (en) 2013-12-02 2015-12-28 Method for disambiguated features in unstructured text

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201361910739P 2013-12-02 2013-12-02
US14/557,794 US9239875B2 (en) 2013-12-02 2014-12-02 Method for disambiguated features in unstructured text
US14/979,703 US20160110446A1 (en) 2013-12-02 2015-12-28 Method for disambiguated features in unstructured text

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/557,794 Continuation US9239875B2 (en) 2013-12-02 2014-12-02 Method for disambiguated features in unstructured text

Publications (1)

Publication Number Publication Date
US20160110446A1 true US20160110446A1 (en) 2016-04-21

Family

ID=53265533

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/557,794 Active US9239875B2 (en) 2013-12-02 2014-12-02 Method for disambiguated features in unstructured text
US14/979,703 Abandoned US20160110446A1 (en) 2013-12-02 2015-12-28 Method for disambiguated features in unstructured text

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/557,794 Active US9239875B2 (en) 2013-12-02 2014-12-02 Method for disambiguated features in unstructured text

Country Status (7)

Country Link
US (2) US9239875B2 (en)
EP (1) EP3077919A4 (en)
JP (1) JP6284643B2 (en)
KR (1) KR20160124742A (en)
CN (1) CN106164890A (en)
CA (1) CA2932399A1 (en)
WO (1) WO2015084724A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991171A (en) * 2017-03-25 2017-07-28 贺州学院 Topic discovery method based on an intelligent campus information service platform
WO2018005203A1 (en) * 2016-06-28 2018-01-04 Microsoft Technology Licensing, Llc Leveraging information available in a corpus for data parsing and predicting
US10200397B2 (en) 2016-06-28 2019-02-05 Microsoft Technology Licensing, Llc Robust matching for identity screening

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9201744B2 (en) 2013-12-02 2015-12-01 Qbase, LLC Fault tolerant architecture for distributed computing systems
US9424294B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Method for facet searching and search suggestions
US9025892B1 (en) 2013-12-02 2015-05-05 Qbase, LLC Data record compression with progressive and/or selective decomposition
US9659108B2 (en) 2013-12-02 2017-05-23 Qbase, LLC Pluggable architecture for embedding analytics in clustered in-memory databases
US9547701B2 (en) 2013-12-02 2017-01-17 Qbase, LLC Method of discovering and exploring feature knowledge
US9348573B2 (en) * 2013-12-02 2016-05-24 Qbase, LLC Installation and fault handling in a distributed system utilizing supervisor and dependency manager nodes
US9424524B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Extracting facts from unstructured text
EP3077927A4 (en) 2013-12-02 2017-07-12 Qbase LLC Design and implementation of clustered in-memory database
US9355152B2 (en) 2013-12-02 2016-05-31 Qbase, LLC Non-exclusionary search within in-memory databases
US10572935B1 (en) * 2014-07-16 2020-02-25 Intuit, Inc. Disambiguation of entities based on financial interactions
US10176457B2 (en) * 2015-02-05 2019-01-08 Sap Se System and method automatically learning and optimizing sequence order
US11157920B2 (en) * 2015-11-10 2021-10-26 International Business Machines Corporation Techniques for instance-specific feature-based cross-document sentiment aggregation
US10810408B2 (en) 2018-01-26 2020-10-20 Viavi Solutions Inc. Reduced false positive identification for spectroscopic classification
US11656174B2 (en) 2018-01-26 2023-05-23 Viavi Solutions Inc. Outlier detection for spectroscopic classification
US11009452B2 (en) 2018-01-26 2021-05-18 Viavi Solutions Inc. Reduced false positive identification for spectroscopic quantification
CN109344256A (en) * 2018-10-12 2019-02-15 中国科学院重庆绿色智能技术研究院 Press release topic classification and checking method
KR102037453B1 (en) 2018-11-29 2019-10-29 부산대학교 산학협력단 Apparatus and Method for Numeral Classifier Disambiguation using Word Embedding based on Subword Information
CN110110046B (en) * 2019-04-30 2021-10-01 北京搜狗科技发展有限公司 Method and device for recommending entities with same name
US11636355B2 (en) * 2019-05-30 2023-04-25 Baidu Usa Llc Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process
CN110942765B (en) * 2019-11-11 2022-05-27 珠海格力电器股份有限公司 Method, device, server and storage medium for constructing corpus

Family Cites Families (98)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2343097A (en) 1996-03-21 1997-10-10 Mpath Interactive, Inc. Network match maker for selecting clients based on attributes of servers and communication links
US6178529B1 (en) 1997-11-03 2001-01-23 Microsoft Corporation Method and system for resource monitoring of disparate resources in a server cluster
US6353926B1 (en) 1998-07-15 2002-03-05 Microsoft Corporation Software update notification
US6266781B1 (en) 1998-07-20 2001-07-24 Academia Sinica Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network
US6338092B1 (en) 1998-09-24 2002-01-08 International Business Machines Corporation Method, system and computer program for replicating data in a distributed computed environment
US6959300B1 (en) 1998-12-10 2005-10-25 At&T Corp. Data compression method and apparatus
US7099898B1 (en) 1999-08-12 2006-08-29 International Business Machines Corporation Data access system
US6738759B1 (en) 2000-07-07 2004-05-18 Infoglide Corporation, Inc. System and method for performing similarity searching using pointer optimization
US8692695B2 (en) 2000-10-03 2014-04-08 Realtime Data, Llc Methods for encoding and decoding data
US6832373B2 (en) 2000-11-17 2004-12-14 Bitfone Corporation System and method for updating and distributing information
US6691109B2 (en) 2001-03-22 2004-02-10 Turbo Worx, Inc. Method and apparatus for high-performance sequence comparison
GB2374687A (en) 2001-04-19 2002-10-23 Ibm Managing configuration changes in a data processing system
US7082478B2 (en) * 2001-05-02 2006-07-25 Microsoft Corporation Logical semantic compression
US6961723B2 (en) 2001-05-04 2005-11-01 Sun Microsystems, Inc. System and method for determining relevancy of query responses in a distributed network search mechanism
US20030028869A1 (en) 2001-08-02 2003-02-06 Drake Daniel R. Method and computer program product for integrating non-redistributable software applications in a customer driven installable package
JP2003150442A (en) * 2001-11-19 2003-05-23 Fujitsu Ltd Memory development program and data processor
US6954456B2 (en) 2001-12-14 2005-10-11 At & T Corp. Method for content-aware redirection and content renaming
US6829606B2 (en) 2002-02-14 2004-12-07 Infoglide Software Corporation Similarity search engine for use with relational databases
US7421478B1 (en) 2002-03-07 2008-09-02 Cisco Technology, Inc. Method and apparatus for exchanging heartbeat messages and configuration information between nodes operating in a master-slave configuration
US8015143B2 (en) 2002-05-22 2011-09-06 Estes Timothy W Knowledge discovery agent system and method
US7570262B2 (en) 2002-08-08 2009-08-04 Reuters Limited Method and system for displaying time-series data and correlated events derived from text mining
US7249312B2 (en) * 2002-09-11 2007-07-24 Intelligent Results Attribute scoring for unstructured content
US7058846B1 (en) 2002-10-17 2006-06-06 Veritas Operating Corporation Cluster failover for storage management services
US20040205064A1 (en) 2003-04-11 2004-10-14 Nianjun Zhou Adaptive search employing entropy based quantitative information measurement
US7543174B1 (en) 2003-09-24 2009-06-02 Symantec Operating Corporation Providing high availability for an application by rapidly provisioning a node and failing over to the node
US9009153B2 (en) 2004-03-31 2015-04-14 Google Inc. Systems and methods for identifying a named entity
US7818615B2 (en) 2004-09-16 2010-10-19 Invensys Systems, Inc. Runtime failure management of redundantly deployed hosts of a supervisory process control data acquisition facility
US7403945B2 (en) 2004-11-01 2008-07-22 Sybase, Inc. Distributed database system providing data and space management methodology
US20060179026A1 (en) 2005-02-04 2006-08-10 Bechtel Michael E Knowledge discovery tool extraction and integration
US20060294071A1 (en) 2005-06-28 2006-12-28 Microsoft Corporation Facet extraction and user feedback for ranking improvement and personalization
US7630977B2 (en) 2005-06-29 2009-12-08 Xerox Corporation Categorization including dependencies between different category systems
US8386463B2 (en) 2005-07-14 2013-02-26 International Business Machines Corporation Method and apparatus for dynamically associating different query execution strategies with selective portions of a database table
US7681075B2 (en) 2006-05-02 2010-03-16 Open Invention Network Llc Method and system for providing high availability to distributed computer applications
US7447940B2 (en) 2005-11-15 2008-11-04 Bea Systems, Inc. System and method for providing singleton services in a cluster
US8341622B1 (en) 2005-12-15 2012-12-25 Crimson Corporation Systems and methods for efficiently using network bandwidth to deploy dependencies of a software package
US7899871B1 (en) 2006-01-23 2011-03-01 Clearwell Systems, Inc. Methods and systems for e-mail topic classification
US7519613B2 (en) 2006-02-28 2009-04-14 International Business Machines Corporation Method and system for generating threads of documents
US8726267B2 (en) 2006-03-24 2014-05-13 Red Hat, Inc. Sharing software certification and process metadata
US8190742B2 (en) 2006-04-25 2012-05-29 Hewlett-Packard Development Company, L.P. Distributed differential store with non-distributed objects and compression-enhancing data-object routing
US20070282959A1 (en) 2006-06-02 2007-12-06 Stern Donald S Message push with pull of information to a communications computing device
US8615800B2 (en) 2006-07-10 2013-12-24 Websense, Inc. System and method for analyzing web content
US7624118B2 (en) 2006-07-26 2009-11-24 Microsoft Corporation Data processing over very large databases
US8122026B1 (en) 2006-10-20 2012-02-21 Google Inc. Finding and disambiguating references to entities on web pages
US7853611B2 (en) * 2007-02-26 2010-12-14 International Business Machines Corporation System and method for deriving a hierarchical event based database having action triggers based on inferred probabilities
WO2009005744A1 (en) 2007-06-29 2009-01-08 Allvoices, Inc. Processing a content item with regard to an event and a location
US20090043792A1 (en) 2007-08-07 2009-02-12 Eric Lawrence Barsness Partial Compression of a Database Table Based on Historical Information
US9342551B2 (en) 2007-08-14 2016-05-17 John Nicholas and Kristin Gross Trust User based document verifier and method
GB2453174B (en) 2007-09-28 2011-12-07 Advanced Risc Mach Ltd Techniques for generating a trace stream for a data processing apparatus
KR100898339B1 (en) 2007-10-05 2009-05-20 한국전자통신연구원 Autonomous fault processing system in home network environments and operation method thereof
US8396838B2 (en) 2007-10-17 2013-03-12 Commvault Systems, Inc. Legal compliance, electronic discovery and electronic document handling of online and offline copies of data
US8375073B1 (en) 2007-11-12 2013-02-12 Google Inc. Identification and ranking of news stories of interest
US8294763B2 (en) 2007-12-14 2012-10-23 Sri International Method for building and extracting entity networks from video
US8326847B2 (en) * 2008-03-22 2012-12-04 International Business Machines Corporation Graph search system and method for querying loosely integrated data
US20100077001A1 (en) 2008-03-27 2010-03-25 Claude Vogel Search system and method for serendipitous discoveries with faceted full-text classification
US8712926B2 (en) 2008-05-23 2014-04-29 International Business Machines Corporation Using rule induction to identify emerging trends in unstructured text streams
US8358308B2 (en) 2008-06-27 2013-01-22 Microsoft Corporation Using visual techniques to manipulate data
CA2686796C (en) 2008-12-03 2017-05-16 Trend Micro Incorporated Method and system for real time classification of events in computer integrity system
US8874576B2 (en) 2009-02-27 2014-10-28 Microsoft Corporation Reporting including filling data gaps and handling uncategorized data
GB0904113D0 (en) * 2009-03-10 2009-04-22 Intrasonics Ltd Video and audio bookmarking
US20100235311A1 (en) * 2009-03-13 2010-09-16 Microsoft Corporation Question and answer search
US8213725B2 (en) 2009-03-20 2012-07-03 Eastman Kodak Company Semantic event detection using cross-domain knowledge
US8161048B2 (en) * 2009-04-24 2012-04-17 At&T Intellectual Property I, L.P. Database analysis using clusters
US8055933B2 (en) 2009-07-21 2011-11-08 International Business Machines Corporation Dynamic updating of failover policies for increased application availability
EP2488960A4 (en) * 2009-10-15 2016-08-03 Hewlett Packard Entpr Dev Lp Heterogeneous data source management
US8645372B2 (en) 2009-10-30 2014-02-04 Evri, Inc. Keyword-based search engine results using enhanced query strategies
US20110125764A1 (en) 2009-11-26 2011-05-26 International Business Machines Corporation Method and system for improved query expansion in faceted search
US8583647B2 (en) 2010-01-29 2013-11-12 Panasonic Corporation Data processing device for automatically classifying a plurality of images into predetermined categories
US9710556B2 (en) * 2010-03-01 2017-07-18 Vcvc Iii Llc Content recommendation based on collections of entities
US8595234B2 (en) 2010-05-17 2013-11-26 Wal-Mart Stores, Inc. Processing data feeds
US8429256B2 (en) 2010-05-28 2013-04-23 Red Hat, Inc. Systems and methods for generating cached representations of host package inventories in remote package repositories
US8345998B2 (en) 2010-08-10 2013-01-01 Xerox Corporation Compression scheme selection based on image data type and user selections
US8321443B2 (en) 2010-09-07 2012-11-27 International Business Machines Corporation Proxying open database connectivity (ODBC) calls
US20120102121A1 (en) * 2010-10-25 2012-04-26 Yahoo! Inc. System and method for providing topic cluster based updates
US8423522B2 (en) 2011-01-04 2013-04-16 International Business Machines Corporation Query-aware compression of join results
US20120246154A1 (en) 2011-03-23 2012-09-27 International Business Machines Corporation Aggregating search results based on associating data instances with knowledge base entities
KR20120134916A (en) 2011-06-03 2012-12-12 삼성전자주식회사 Storage device and data processing device for storage device
US20120310934A1 (en) 2011-06-03 2012-12-06 Thomas Peh Historic View on Column Tables Using a History Table
US9104979B2 (en) 2011-06-16 2015-08-11 Microsoft Technology Licensing, Llc Entity recognition using probabilities for out-of-collection data
WO2013003770A2 (en) 2011-06-30 2013-01-03 Openwave Mobility Inc. Database compression system and method
US9032387B1 (en) 2011-10-04 2015-05-12 Amazon Technologies, Inc. Software distribution framework
US9026480B2 (en) 2011-12-21 2015-05-05 Telenav, Inc. Navigation system with point of interest classification mechanism and method of operation thereof
US9037579B2 (en) 2011-12-27 2015-05-19 Business Objects Software Ltd. Generating dynamic hierarchical facets from business intelligence artifacts
US9251250B2 (en) * 2012-03-28 2016-02-02 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for processing text with variations in vocabulary usage
US10908792B2 (en) 2012-04-04 2021-02-02 Recorded Future, Inc. Interactive event-based information system
US9483513B2 (en) * 2012-04-30 2016-11-01 Sap Se Storing large objects on disk and not in main memory of an in-memory database system
US10162766B2 (en) * 2012-04-30 2018-12-25 Sap Se Deleting records in a multi-level storage architecture without record locks
US20130290232A1 (en) * 2012-04-30 2013-10-31 Mikalai Tsytsarau Identifying news events that cause a shift in sentiment
US8948789B2 (en) 2012-05-08 2015-02-03 Qualcomm Incorporated Inferring a context from crowd-sourced activity data
US9703833B2 (en) 2012-11-30 2017-07-11 Sap Se Unification of search and analytics
US9542652B2 (en) 2013-02-28 2017-01-10 Microsoft Technology Licensing, Llc Posterior probability pursuit for entity disambiguation
US9104710B2 (en) * 2013-03-15 2015-08-11 Src, Inc. Method for cross-domain feature correlation
US8977600B2 (en) 2013-05-24 2015-03-10 Software AG USA Inc. System and method for continuous analytics run against a combination of static and real-time data
CN103365974A (en) * 2013-06-28 2013-10-23 百度在线网络技术(北京)有限公司 Semantic disambiguation method and system based on related words topic
US9734221B2 (en) 2013-09-12 2017-08-15 Sap Se In memory database warehouse
US9201744B2 (en) 2013-12-02 2015-12-01 Qbase, LLC Fault tolerant architecture for distributed computing systems
US9025892B1 (en) 2013-12-02 2015-05-05 Qbase, LLC Data record compression with progressive and/or selective decomposition
US9223875B2 (en) 2013-12-02 2015-12-29 Qbase, LLC Real-time distributed in memory search architecture
US9424294B2 (en) 2013-12-02 2016-08-23 Qbase, LLC Method for facet searching and search suggestions

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018005203A1 (en) * 2016-06-28 2018-01-04 Microsoft Technology Licensing, Llc Leveraging information available in a corpus for data parsing and predicting
US10200397B2 (en) 2016-06-28 2019-02-05 Microsoft Technology Licensing, Llc Robust matching for identity screening
US10311092B2 (en) 2016-06-28 2019-06-04 Microsoft Technology Licensing, Llc Leveraging corporal data for data parsing and predicting
CN106991171A (en) * 2017-03-25 2017-07-28 贺州学院 Topic discovery method based on an intelligent campus information service platform

Also Published As

Publication number Publication date
US20150154286A1 (en) 2015-06-04
JP6284643B2 (en) 2018-02-28
CA2932399A1 (en) 2015-06-11
JP2016541069A (en) 2016-12-28
KR20160124742A (en) 2016-10-28
EP3077919A1 (en) 2016-10-12
WO2015084724A1 (en) 2015-06-11
CN106164890A (en) 2016-11-23
EP3077919A4 (en) 2017-05-10
US9239875B2 (en) 2016-01-19

Similar Documents

Publication Publication Date Title
US9239875B2 (en) Method for disambiguated features in unstructured text
US9201931B2 (en) Method for obtaining search suggestions from fuzzy score matching and population frequencies
US10725836B2 (en) Intent-based organisation of APIs
US9613166B2 (en) Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching
US9424524B2 (en) Extracting facts from unstructured text
US9626623B2 (en) Method of automated discovery of new topics
US9619571B2 (en) Method for searching related entities through entity co-occurrence
US10810215B2 (en) Supporting evidence retrieval for complex answers
WO2015084757A1 (en) Systems and methods for processing data stored in a database
US9507834B2 (en) Search suggestions using fuzzy-score matching and entity co-occurrence
US20170124090A1 (en) Method of discovering and exploring feature knowledge
JP6145562B2 (en) Information structuring system and information structuring method
US20160085760A1 (en) Method for in-loop human validation of disambiguated features
CN113656574A (en) Method, computing device and storage medium for search result ranking
Li Connecting Text with Knowledge

Legal Events

Date Code Title Description
AS Assignment

Owner name: QBASE, LLC, VIRGINIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIGHTNER, SCOTT;WECKESSER, FRANZ;BODDHU, SANJAY;AND OTHERS;SIGNING DATES FROM 20141201 TO 20141202;REEL/FRAME:037363/0166

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION