US20160110446A1 - Method for disambiguated features in unstructured text - Google Patents
- Publication number
- US20160110446A1 (application US14/979,703)
- Authority
- US
- United States
- Prior art keywords
- feature
- server
- features
- cluster
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G06F17/30684—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G06F17/30705—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Definitions
- the present disclosure generally relates to data management; and, more specifically, to data management systems and methods that extract and store material from source items received over a network.
- Searching for information about entities i.e., people, locations, organizations
- sources such as a network
- the method may include multiple modules, such as one or more feature extraction modules, one or more disambiguation modules, one or more scoring modules, and one or more linking modules.
- Disambiguating features will be supported in part by extracting topics from the ambient document of the feature, employing a multi-component extension of Latent Dirichlet Allocation (MC-LDA) topic models.
- each component is modeled around each secondary feature stored in the existing knowledge base or extracted on the incoming document.
- the linking or disambiguation process is modeled as topic inference from the MC-LDA, which provides automated weight estimation during the MC-LDA training and applies them readily during inference.
- the exemplary method may improve the accuracy of entity disambiguation beyond what may be achieved without considering document linking. Taking document linkage into account may allow better disambiguation by considering the document and entity relationships implied by links.
- a method comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the disambiguation module of the in-
- a non-transitory computer readable medium having stored thereon computer executable instructions comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is
- FIG. 1 is a flowchart of a method for disambiguating features in unstructured text, according to an exemplary embodiment.
- FIG. 2 is a flowchart of the steps performed by a disambiguation module employed in the method for disambiguating features, according to an exemplary embodiment.
- FIG. 3 is a flowchart of the steps performed by a link on-the-fly module employed in the method for disambiguating features, according to an exemplary embodiment.
- FIG. 4 is an illustrative diagram of a system employed for implementing the method for disambiguating features, according to an exemplary embodiment.
- FIG. 5 shows a graphical representation of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic model, according to an exemplary embodiment.
- FIG. 6 illustrates an embodiment of the Gibbs sampling equations for multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
- FIG. 7 illustrates an embodiment of the implementation of a stochastic variational inference algorithm for training and inference in multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
- FIG. 8 is a table illustrating a sample topic for a multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
- Document refers to a discrete electronic representation of information having a start and end.
- Multi-Document refers to a document with its tokens, different types of named entities, and key phrases organized into separate “bag-of-surface-forms” components.
- Database refers to any system including any combination of clusters and modules suitable for storing one or more collections and suitable to process one or more queries.
- Corpus refers to a collection of one or more documents.
- Live corpus or “Document Stream”, refers to a corpus that is constantly fed as new documents are uploaded into a network.
- Feature refers to any information which is at least partially derived from a document.
- Feature attribute refers to metadata associated with a feature; for example, location of a feature in a document, confidence score, among others.
- Cluster refers to a collection of features.
- Entity knowledge base refers to a base containing features/entities.
- Link on-the-fly module refers to any linking module that performs data linkage as data is requested from the system rather than as data is added to the system.
- Memory refers to any hardware component suitable for storing information and retrieving said information at a sufficiently high speed.
- Module refers to a computer software component suitable for carrying out one or more defined tasks.
- “Sentiment” refers to subjective assessments associated with a document, part of a document, or feature.
- Topic refers to a set of thematic information which is at least partially derived from a corpus.
- Topic ID refers to an identifier that refers to a specific instance of a topic.
- Topic Collection refers to a specific set of topics derived from the corpus, with each topic having a unique identifier (“unique ID”).
- Topic Classification refers to the assignation of specific topic identifiers as features of a document.
- Query refers to a request to retrieve information from one or more suitable databases.
- the present disclosure describes a method for disambiguating features in an unstructured text.
- Although the exemplary embodiments discuss practices for disambiguating features according to this disclosure, it is intended that the systems and methods described herein can be configured for any suitable use within the scope of this disclosure.
- An aspect of the present disclosure includes a method that may allow increased accuracy in feature and entity disambiguation and, therefore, increased accuracy in text analytics.
- the disclosed method for disambiguating features may be employed in an initial corpus of data to perform a document ingestion and a feature extraction that may allow a topic classification and other text analytics on each document included in the initial corpus.
- Each feature may be identified and recorded as name, type, positional information in the document, and confidence score, among others.
- FIG. 1 is a flowchart of a method 100 illustrating a plurality of steps for disambiguating features in unstructured text.
- method 100 for disambiguating features may initiate as a new document input, step 102 , is made in an existing knowledge base. Subsequently, a feature extraction step 104 may be performed on the document.
- a feature may be related to different feature attributes, such as a topic identifier (“topic ID”), a document identifier (“document ID”), feature type, feature name, confidence score, and feature position, among others.
- document input in step 102 may be fed from a massive corpus or live corpus (such as the internet or network connected corpus) that in turn, may be fed every second.
- one or more feature recognition and extraction algorithms may be employed during feature extraction step 104 to analyze the unstructured text of document input step 102 .
- a score may be assigned to each extracted feature. The score may indicate the level of certainty of the feature being correctly extracted with the correct attributes.
- one or more primary features may be identified from a document input in step 102 .
- Each primary feature may have been associated with a set of feature attributes and one or more secondary features.
- Each secondary feature may be associated with a set of feature attributes.
- one or more secondary features may have one or more tertiary features, each one having its own set of feature attributes.
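The feature hierarchy described above (primary features with secondary and tertiary features, each carrying its own attributes) can be pictured as a nested record. The following is a minimal sketch only; the class name `Feature`, its field names, and the helper `depth` are illustrative assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Hypothetical record for one extracted feature and its attributes."""
    name: str                     # feature name, e.g. "John Doe"
    feature_type: str             # e.g. "person", "key_phrase"
    document_id: str              # document the feature was extracted from
    position: int                 # offset of the feature in the document
    confidence: float             # extraction confidence score
    topic_ids: List[str] = field(default_factory=list)
    children: List["Feature"] = field(default_factory=list)  # secondary/tertiary features


def depth(feature: Feature) -> int:
    """Number of levels in the hierarchy rooted at this feature."""
    return 1 + max((depth(child) for child in feature.children), default=0)


primary = Feature("John Doe", "person", "doc-1", 42, 0.93, ["topic-17"])
secondary = Feature("quarterback", "key_phrase", "doc-1", 58, 0.81)
tertiary = Feature("touchdown", "proximity_term", "doc-1", 77, 0.64)
secondary.children.append(tertiary)
primary.children.append(secondary)
```

A primary feature then owns its secondary features, and each secondary feature may own tertiary features, so `depth(primary)` is 3 for the sample above.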
- the relative weight or relevance of each of the features within document input at step 102 may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.
- the features extracted from document input at step 102 and all their related information may be loaded into an in-memory database (MemDB), during inclusion of features in MemDB, step 106 , as part of a feature disambiguation request step 108 .
- the MemDB forms part of a disambiguation computer server environment having one or more processors executing the steps discussed in connection with FIGS. 1-8 .
- the MemDB is a computer module that may include one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation sub module.
- One search controller may be selectively associated with one or more search nodes.
- Each search node may be capable of independently performing a fuzzy key search through a collection of compressed data and returning a set of scored results to its associated search controller.
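As a rough sketch of the controller and node arrangement described above, each node can score its own records against a key and the controller can merge the scored results. The function names `node_search` and `controller_search`, the `min_score` and `top_k` parameters, and the use of Python's `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions made for illustration; the disclosure does not specify the fuzzy key search algorithm, and compressed storage is not modeled here.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def node_search(key: str, records: Dict[str, str],
                min_score: float = 0.6) -> List[Tuple[str, float]]:
    """One search node: fuzzy-match a key against its own record collection."""
    results = []
    for record_id, name in records.items():
        score = SequenceMatcher(None, key.lower(), name.lower()).ratio()
        if score >= min_score:
            results.append((record_id, score))
    return results


def controller_search(key: str, nodes: List[Dict[str, str]],
                      top_k: int = 5) -> List[Tuple[str, float]]:
    """Search controller: query every node and merge the scored results."""
    merged: List[Tuple[str, float]] = []
    for records in nodes:  # in a real system each node would run independently
        merged.extend(node_search(key, records))
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:top_k]


nodes = [{"r1": "John Doe", "r2": "Jane Roe"}, {"r3": "Jon Doe"}]
results = controller_search("John Doe", nodes)
```

Here the exact record ranks first and the near-miss spelling "Jon Doe" ranks above unrelated names.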
- Feature disambiguation step 108 may be performed by a disambiguation sub module within the MemDB.
- Feature disambiguation process 108 may include machine-generated topic IDs, which may be employed to classify features, documents, or corpora. The relatedness of individual features and specific topic IDs may be determined using disambiguating algorithms. In some documents, the same feature may be related to one or more topic IDs, depending on the context of the different occurrences of the feature within the document.
- the set of features (like topics, proximity terms and entities, key phrases, events and facts) extracted from one document may be compared with sets of features from other documents, using disambiguating algorithms to define with a certain level of accuracy if two or more features across different documents are a single feature or if they are distinct features.
- co-occurrence of two or more features across the collection of documents in the database may be analyzed to improve the accuracy of feature disambiguation process 108 .
- global scoring algorithms may be used to determine the probability of features being the same.
- a knowledge base may be generated within the MemDB. This knowledge base may be used to temporarily store clusters of relevant disambiguated primary features and their related secondary features.
- the new disambiguated set of features may be compared with the existing knowledge base in order to determine the relationship between features and determine if there is a match between the new features and already extracted features.
- the knowledge base may be updated and a feature ID of the matching features may be returned to the user and/or requesting application or process. Further, based on the frequency of matches, a prominence measure that captures the feature's popularity index in the given corpus could be attached to the feature ID.
- a unique feature ID is assigned to the disambiguated entity or feature, and the unique feature ID is associated with the cluster of defining features and stored within the knowledge base of the MemDB.
- the feature ID of disambiguated feature may be returned to the source through the system interface.
- the feature ID of disambiguated feature may include secondary features, cluster of features, relevant feature attributes or other requested data. Disambiguation sub module employed for feature disambiguation step 108 is described in more detail in FIG. 2 below.
- FIG. 2 is a flowchart of a process 200 performed by a disambiguation sub module employed in unstructured texts for feature disambiguation step 108 of the method 100 ( FIG. 1 ), according to an embodiment.
- Disambiguation process 200 may begin after inclusion of features in MemDB in step 106 of FIG. 1 .
- the extracted features provided in step 202 may be used to perform a candidate search in step 204 , in which a search for the extracted features may be performed through all candidate records, including co-occurring features.
- candidates may be primary features with a set of associated secondary features that may be used in feature disambiguation process 108 .
- the disambiguation results may be improved by the co-occurrence of topic IDs and relatedness among topic IDs.
- the relatedness of topic IDs, even across different topic models can be discovered from a large corpus where the topic IDs have been assigned.
- Related topic IDs can be used during records linkage step 206 to provide linkage to documents that may not contain the exact topic ID but do contain one or more related topic IDs. This approach may improve the recall of relevant features to be included in the records linkage step 206 and improve disambiguation results in certain cases.
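One simple way to picture how related topic IDs might be discovered from a corpus and then used to widen linkage is to count how often topic IDs are assigned to the same document. This is a sketch under stated assumptions: the co-occurrence count, the `min_cooccur` threshold, and the function names `related_topics` and `expand` are hypothetical and are not the relatedness measure of the disclosure.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set


def related_topics(doc_topics: List[Iterable[str]],
                   min_cooccur: int = 2) -> Dict[str, Set[str]]:
    """Treat two topic IDs as related if they co-occur in enough documents."""
    pair_counts: Counter = Counter()
    for topics in doc_topics:
        for a, b in combinations(sorted(set(topics)), 2):
            pair_counts[(a, b)] += 1
    related: Dict[str, Set[str]] = {}
    for (a, b), count in pair_counts.items():
        if count >= min_cooccur:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def expand(topic_ids: Set[str], related: Dict[str, Set[str]]) -> Set[str]:
    """Add related topic IDs so that records lacking the exact ID can still link."""
    expanded = set(topic_ids)
    for topic_id in topic_ids:
        expanded |= related.get(topic_id, set())
    return expanded


corpus = [["t1", "t2"], ["t1", "t2", "t3"], ["t3", "t4"]]
relatedness = related_topics(corpus)
query_topics = expand({"t1"}, relatedness)
```

Expanding the query topics trades some precision for recall, which matches the stated goal of linking documents that contain a related but not identical topic ID.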
- Cluster comparison step 208 may include the assignment of relative matching scores to clusters of disambiguated features. Different thresholds of acceptance may be defined for different applications. The defined levels of accuracy may determine which scores are considered a positive match and which scores are considered a negative match in match search step 210 .
- Each new cluster may be given a unique ID and may be temporarily stored in a knowledge base. Each new cluster may include a new disambiguated primary feature and its set of secondary features. If a new cluster matches a cluster that is already stored in the knowledge base, the system updates the knowledge base in step 212 , and a matched feature ID may be returned to the user and/or requesting application or process in step 214 .
- Update of knowledge base 212 may imply the association of additional secondary features with one primary feature, or the addition of feature attributes that were not previously associated with primary or secondary features.
- When no match is found, the system performs a unique ID assignment, step 216 , to the primary feature of the cluster and updates knowledge base 212 . Afterwards, the system performs a return of matched ID process 214 . Records linkage step 206 is further explained in detail in FIG. 3 .
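The match, update, and unique ID assignment flow of steps 208 through 216 can be sketched as follows. Jaccard overlap is used here purely as a stand-in matching score, and the `threshold` value, the function names, and the use of `uuid` for new IDs are assumptions for illustration; the disclosure leaves the scoring method and acceptance thresholds open.

```python
import uuid
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap score between two clusters of secondary features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def resolve_cluster(new_cluster: Set[str],
                    knowledge_base: Dict[str, Set[str]],
                    threshold: float = 0.5) -> Tuple[str, bool]:
    """Return (feature_id, matched) and update the knowledge base in place."""
    best_id, best_score = None, 0.0
    for feature_id, stored in knowledge_base.items():  # cluster comparison
        score = jaccard(new_cluster, stored)
        if score > best_score:
            best_id, best_score = feature_id, score
    if best_id is not None and best_score >= threshold:  # positive match
        knowledge_base[best_id] |= new_cluster           # add new secondary features
        return best_id, True
    new_id = uuid.uuid4().hex                            # unique ID assignment
    knowledge_base[new_id] = set(new_cluster)
    return new_id, False


kb = {"f1": {"football", "quarterback", "nfl"}}
matched_id, matched = resolve_cluster({"football", "nfl", "touchdown"}, kb)
new_id, was_known = resolve_cluster({"engineer", "bridge"}, kb)
```

Raising `threshold` makes the system more reluctant to merge clusters, which corresponds to the different acceptance thresholds mentioned for different applications.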
- FIG. 3 is a flowchart of a process 300 performed by a link on-the-fly (“link OTF”) sub module employed in method 100 for disambiguating features, according to an embodiment.
- Link OTF process 300 may be capable of constantly evaluating, scoring, linking, and clustering a feed of information.
- Link OTF sub module may perform records linkage 206 using multiple algorithms.
- Candidate search results of step 204 may be constantly fed into link OTF module 300 .
- the input of data may be followed by a match scoring algorithm application, step 302 , where one or more match scoring algorithms may be applied simultaneously in multiple search nodes of the MemDB while performing fuzzy key searches for evaluating and scoring the relevant results, taking into account multiple feature attributes, such as string edit distances, phonetics, and sentiments, among others.
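String edit distance is one of the attributes named above. A minimal sketch of an edit-distance-based match score follows; the normalization into a 0 to 1 score and the function names are assumptions, and phonetic and sentiment attributes are not modeled.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def match_score(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for no overlap."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))
```

In a fuller scoring model this score would be only one weighted input alongside the other feature attributes.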
- Linking algorithm application 304 may be added to compare all candidate records, identified during match scoring algorithm application step 302 , to each other.
- Linking algorithm application 304 may include the use of one or more analytical linking algorithms capable of filtering and evaluating the scored results of the fuzzy key searches performed inside the multiple search nodes of the MemDB.
- co-occurrence of two or more features across the collection of identified candidate records in the MemDB may be analyzed to improve the accuracy of the process.
- Different weighted models and confidence scores associated with different feature attributes may be taken into account for linking algorithm application 304 .
- the linked results may be arranged in clusters of related features and returned, as part of return of linked records clusters in step 306 .
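Arranging linked results into clusters can be illustrated with a union-find pass over pairwise scores: any two records whose score clears a threshold end up in the same cluster. This is only one possible linking approach, chosen here for brevity; the function name `link_records` and the threshold are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def link_records(scored_pairs: Iterable[Tuple[str, str, float]],
                 threshold: float) -> List[List[str]]:
    """Group records into clusters by merging pairs scoring at or above threshold."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)  # also registers unseen records
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[str, set] = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(members) for members in clusters.values())


pairs = [("a", "b", 0.9), ("b", "c", 0.85), ("c", "d", 0.2)]
clusters = link_records(pairs, threshold=0.8)
```

Records that never clear the threshold against any other record remain as single-member clusters.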
- FIG. 4 is an illustrative diagram of an embodiment of a system 400 for disambiguating features in unstructured text, as discussed above in connection with FIG. 1 .
- the system 400 hosts an in-memory database and comprises one or more nodes.
- the system 400 includes one or more processors executing computer instructions for a plurality of special-purpose computer modules 401 , 402 , 411 , 412 , and 414 (discussed below) to disambiguate features within one or more documents.
- the document input modules 401 , 402 receive documents from internet based sources and/or a live corpus of documents. A large number of new documents may be uploaded by the second into the document input module 402 through a network connection 404 . Therefore, the source may be constantly getting new knowledge, updated by user workstations 406 , where such new knowledge is not pre-linked in a static way. Thus, the number of documents to be evaluated may grow without bound.
- MemDB 408 may facilitate a faster disambiguation process and may facilitate disambiguation on-the-fly, which in turn may facilitate reception of the latest information that is going to contribute to MemDB 408 .
- Various methods for linking the features may be employed. These may essentially use a weighted model to determine which entity types are most important and carry more weight and, based on confidence scores, to determine how confidently the correct features have been extracted and disambiguated, so that the correct feature may go into the resulting cluster of features. As shown in FIG. 4 , as more system nodes work in parallel, the process may become more efficient.
- a new document arrives into the system 400 via the document input module 401 , 402 through a network connection 404 .
- feature extraction is performed via the extraction module 411 and, then, feature disambiguation may be performed on the new document via the feature disambiguation sub-module 414 of the MemDB 408 .
- the extracted new features 410 may be included in the MemDB to pass through link OTF sub-module 412 ; where the features may be compared and linked, and a feature ID of disambiguated feature 110 may be returned to the user as a result from a query.
- the resulting feature cluster defining the disambiguated feature may optionally be returned.
- MemDB computer 408 can be a database storing data in records controlled by a database management system (DBMS) (not shown) configured to store data records in a device's main memory, as opposed to conventional databases and DBMS modules that store data in “disk” memory.
- Conventional disk storage requires processors (CPUs) to execute read and write commands to a device's hard disk, thus requiring CPUs to execute instructions to locate (i.e., seek) and retrieve the memory location for the data, before performing some type of operation with the data at that memory location.
- In-memory database systems access data that is placed into main memory, and then addressed accordingly, thereby reducing the number of instructions performed by the CPUs and eliminating the seek time associated with CPUs seeking data on hard disk.
- In-memory databases may be implemented in a distributed computing architecture, which may be a computing system comprising one or more nodes configured to aggregate the nodes' respective resources (e.g., memory, disks, processors).
- a computing system hosting an in-memory database may distribute and store data records of the database among one or more nodes.
- these nodes are formed into “clusters” of nodes.
- these clusters of nodes store portions, or “collections,” of database information.
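A common way to distribute records among nodes is to hash each record key to a node index. The sketch below assumes hash partitioning, which the disclosure does not specify; `node_for` and `partition` are hypothetical names.

```python
import hashlib
from typing import Dict, Iterable, List


def node_for(record_id: str, num_nodes: int) -> int:
    """Map a record ID to a node index with a hash that is stable across runs."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(record_ids: Iterable[str], num_nodes: int) -> Dict[int, List[str]]:
    """Assign every record to exactly one node's collection."""
    collections: Dict[int, List[str]] = {n: [] for n in range(num_nodes)}
    for record_id in record_ids:
        collections[node_for(record_id, num_nodes)].append(record_id)
    return collections


records = ["r1", "r2", "r3", "r4"]
collections = partition(records, num_nodes=3)
```

A cryptographic hash is used instead of Python's built-in `hash` because the built-in string hash is randomized per process, which would move records between nodes on every restart.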
- Various embodiments provide a computer executed feature disambiguation technique that employs an evolving and efficiently linkable feature knowledge base that is configured to store secondary features, such as co-occurring topics, key phrases, proximity terms, events, facts and trending popularity index.
- the disclosed embodiments may be performed via a wide variety of linking algorithms that can vary from simple conceptual distance measure to sophisticated graph clustering approaches based on the dimensions of the involved secondary features that aid in resolving a given extracted feature to a stored feature in the knowledge base.
- embodiments can introduce an approach that evolves the existing feature knowledge base through a capability that not only updates the secondary features of an existing feature entry, but also expands the knowledge base by discovering new features that can be appended to it.
- Embodiments of the disambiguation approach can employ a topic modeling approach to provide an automated weighted (across all the secondary features) linking process (disambiguation) that is modeled as topic inference.
- embodiments extend the conventional LDA topic modeling to build a novel topic modeling approach referred to as a multi-component LDA (MC-LDA) that can support any number of components (secondary features) as conditionally-independent.
- Embodiments of the modeling approach also can automatically learn the weights of components during training and employ them for inference (linking) in connection with disambiguation.
- the introduced MC-LDA approach for disambiguation can scale for any additional number of secondary features that could be introduced to increase disambiguation accuracy.
- FIG. 5 shows a graphical representation of an embodiment of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic computer modeling approach employed by system 400 of FIG. 4 above.
- each component block represents modeling each secondary feature across the knowledge base, for instance as executed via the MemDB 408 of FIG. 4 that is initialized with the parameters set forth in FIG. 5 .
- FIG. 6 illustrates an embodiment of the Gibbs sampling equations for MC-LDA topic model employed in FIG. 5 above.
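The equations of FIG. 6 are not reproduced in this text. For orientation only, the well-known collapsed Gibbs sampling update for standard single-component LDA with symmetric priors is shown below; it is not the MC-LDA update of FIG. 6. Here $n_{d,k}^{-i}$ is the number of tokens in document $d$ assigned to topic $k$, $n_{k,w_i}^{-i}$ is the number of times word $w_i$ is assigned to topic $k$, $n_{k}^{-i}$ is the total number of tokens assigned to topic $k$ (all counts excluding token $i$), $V$ is the vocabulary size, and $\alpha$, $\beta$ are the Dirichlet hyperparameters.

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\left( n_{d,k}^{-i} + \alpha \right)
\frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```

A multi-component model would presumably keep separate word-topic counts for each component (each secondary feature type) while sharing the document-topic counts, but that reading is an assumption and should be checked against FIG. 6.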
- An embodiment of this sampling approach aids the system 400 of FIG. 4 in training the individual component (secondary feature) weights in an automated fashion and in an efficient manner.
- FIG. 7 illustrates an embodiment of the computer executed implementation of a stochastic variational inference algorithm for training and inference in MC-LDA topic model of FIGS. 5-6 , for instance as executed via the MemDB 408 of the system 400 of FIG. 4 that is initialized with the parameters set forth in FIG. 7 .
- An embodiment of this inference method applies readily to model the linking/disambiguation process as topic inference, by taking all the secondary features (extracted from the document of interest) as an input and providing weighted topics as the output. These weighted topics can then be used to compute a similarity score against the stored feature knowledge base entries.
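The last step above, scoring inferred weighted topics against stored knowledge base entries, can be sketched with cosine similarity over sparse topic-weight vectors. Cosine similarity is an assumed choice, since the disclosure does not name the similarity measure, and the entry names and weights below are made-up illustrative values.

```python
import math
from typing import Dict, Optional, Tuple

TopicWeights = Dict[str, float]  # topic ID -> weight


def cosine(u: TopicWeights, v: TopicWeights) -> float:
    """Cosine similarity between two sparse topic-weight vectors."""
    dot = sum(weight * v.get(topic, 0.0) for topic, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_entry(doc_topics: TopicWeights,
               knowledge_base: Dict[str, TopicWeights]
               ) -> Optional[Tuple[str, float]]:
    """Return the knowledge base entry most similar to the inferred topics."""
    if not knowledge_base:
        return None
    return max(((entry_id, cosine(doc_topics, weights))
                for entry_id, weights in knowledge_base.items()),
               key=lambda item: item[1])


kb_topics = {
    "john_doe_football": {"sports": 0.8, "health": 0.2},
    "john_doe_engineer": {"construction": 0.9, "sports": 0.1},
}
inferred = {"sports": 0.7, "health": 0.3}
winner = best_entry(inferred, kb_topics)
```

The winning entry's score could then be compared with an acceptance threshold to decide between linking to the existing entry and creating a new one.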
- FIG. 8 is a table illustrating a sample topic for a MC-LDA topic model.
- FIG. 8 displays the top scoring surface forms for each component of the model, for instance as executed via the MemDB 408 of the system 400 of FIG. 4 , according to an embodiment.
- Example #1 is an application of method 100 for disambiguating features in unstructured text, where the feature of interest (primary feature) is John Doe, a football player, and the user wants to monitor the news referencing John Doe.
- a document input 102 mentioning John Doe may be uploaded into the network.
- Features of document input 102 may be extracted and included into MemDB 408 to be disambiguated and linked to a cluster of secondary features associated with the primary feature (John Doe), and compared to existing clusters of similar features.
- Method 100 may output different feature IDs and the feature IDs' associated clusters that include all related secondary features to John Doe; for example, John Doe, engineer; John Doe, teacher; and John Doe, football player.
- Example #2 is an application of method 100 for disambiguating features in unstructured document, where the primary feature may be an image.
- method 100 may include a feature extraction 104 , where a feature may be a general attribute, such as edges and shapes, among others; or a specific attribute, such as a tank, a person, and a clock, among others.
- a new image may be input, where the image may have secondary features such as a specific shape (e.g., the shape of a square, a person, or a car); the secondary features may be extracted and included in the MemDB 408 where a match may be found among all other images that have similar secondary features.
- features may only include images, i.e. text may not be included as a feature.
- Example #3 is an application of method 100 for disambiguating features in unstructured text, where the primary feature may be an event.
- method 100 may allow a user to receive results associated to an event, such as an earthquake, a fire, or an epidemic outbreak, among others.
- Method 100 may perform a feature extraction 104 and feature disambiguation 108 of the features to find the event's associated features and provide feature IDs of disambiguated features 110 .
- Example #4 is an embodiment of method 100 , where prediction of one or more events that might occur may be made.
- a user may previously indicate features and events of interest prior to operation, and, therefore, links between different features associated to the events of interest may be previously established.
- method 100 may predict that the event of interest might occur, based on an increased number of occurrences of the associated features.
- an alert may be sent to the user. For example, a user working for the health department from Thailand may choose to receive an alert for an epidemic outbreak of dengue.
- method 100 may disambiguate all related comments from the social networks and, taking into account the number of users 406 including related information, may predict and alert the health department worker that an epidemic outbreak of dengue may be occurring. The health department worker may therefore have additional evidence, and further actions may be taken in the affected community to keep the epidemic from spreading.
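The prediction in Example #4 rests on an increased number of occurrences of features associated with an event of interest. A very simple sketch of such a trigger compares the current count with a historical baseline; the `factor` and `min_count` parameters and the averaging rule are assumptions, not the disclosed prediction method.

```python
from typing import Sequence


def should_alert(history: Sequence[int], current: int,
                 factor: float = 2.0, min_count: int = 10) -> bool:
    """Alert when the current count of associated features clearly exceeds the baseline."""
    if not history:
        return False  # no baseline yet, so no prediction is attempted
    baseline = sum(history) / len(history)
    return current >= min_count and current > factor * baseline


# Hypothetical daily counts of disambiguated mentions linked to "dengue outbreak".
past_days = [4, 5, 6]
alert_today = should_alert(past_days, current=25)
```

The `min_count` floor avoids alerts on tiny absolute numbers, such as a jump from one mention to three.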
- Example #5 is an application of method 100 for disambiguating features in unstructured text, where primary features may be geographic place names.
- method 100 may be employed to disambiguate the name of a city, where different scoring weights may be associated with secondary features in the disambiguation sub-module.
- method 100 may be employed to disambiguate Paris, Tex. from Paris, France.
- Example #6 is an application of method 100 for disambiguating features in unstructured text, where primary features may be sentiments associated with a person, event, or company, among others; where sentiments may be positive or negative comments about a person, event, or company, among others, that may be fed from any suitable source, including social networks.
- method 100 may be employed by a company to gauge its acceptance among the public.
- Example #7 is an embodiment of method 100 , where method 100 may include human validation in order to increase the confidence score of a feature.
- the user may indicate whether a feature has been correctly disambiguated and whether two different clusters should be one; that is, the user may know that what method 100 (taking into account all feature and topic co-occurrence information) indicates as two different primary features is in fact the same feature. The confidence score associated with that cluster may therefore be higher, and thus the probability that the feature is correctly disambiguated may be higher.
- Example #8 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300 .
- the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.85 within a period of 1000 ms.
- Example #9 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300 .
- the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.80 within a period not exceeding 300 ms.
- the algorithm used in this example provides an answer in a smaller period of time compared to the algorithm used in example #8 but generally returns a lower confidence score.
- Example #10 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300 .
- the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.90 within a period generally exceeding 3000 ms.
- the algorithm used in this example provides an answer with a confidence score generally greater than that returned by the algorithm used in example #8, but generally requires a significantly longer period of time.
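- The trade-off between confidence and response time in Examples #8 through #10 can be sketched as a simple selection rule. The confidence and timing figures come from those examples; the profile names and the selection function are hypothetical, and the 3000 ms figure is only a nominal lower bound because Example #10 states that the period generally exceeds it.

```python
from typing import NamedTuple, Optional

class LinkingProfile(NamedTuple):
    name: str
    min_confidence: float   # confidence the algorithm is configured to exceed
    typical_ms: int         # indicative response time taken from the examples

# Figures from Examples #8-#10; the names are hypothetical labels.
PROFILES = [
    LinkingProfile("fast", 0.80, 300),        # Example #9
    LinkingProfile("balanced", 0.85, 1000),   # Example #8
    LinkingProfile("thorough", 0.90, 3000),   # Example #10 (nominal lower bound)
]

def pick_profile(required_confidence: float,
                 budget_ms: int) -> Optional[LinkingProfile]:
    """Return the quickest profile meeting both constraints, else None."""
    ok = [p for p in PROFILES
          if p.min_confidence >= required_confidence
          and p.typical_ms <= budget_ms]
    return min(ok, key=lambda p: p.typical_ms) if ok else None
```

- A caller needing at least 0.85 confidence within 2000 ms would receive the "balanced" profile, while a request for 0.90 confidence within 1000 ms has no matching profile and returns None.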
- Example #11 is an example of method 100 for disambiguating features in unstructured text to perform e-discovery on a large corpus of documents from a plurality of sources. Given a large corpus of documents from a plurality of sources, applying method 100 to disambiguate all features in those documents enables the discovery of all features in the corpus. The collection of discovered features can be further utilized to discover all documents related to a feature and to discover related features.
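- One way to picture the e-discovery use in Example #11 is an inverted index from disambiguated feature IDs to documents. The sketch below assumes that disambiguation has already produced a set of feature IDs per document; the function names and sample IDs are hypothetical.

```python
from collections import defaultdict

def build_feature_index(doc_features):
    """Map each disambiguated feature ID to the set of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, feature_ids in doc_features.items():
        for fid in feature_ids:
            index[fid].add(doc_id)
    return index

def related_features(index, doc_features, feature_id):
    """Count how often other features co-occur with feature_id in a document."""
    counts = defaultdict(int)
    for doc_id in index.get(feature_id, ()):
        for fid in doc_features[doc_id]:
            if fid != feature_id:
                counts[fid] += 1
    return dict(counts)

# Hypothetical output of disambiguation: document ID -> feature IDs.
docs = {"d1": {"E1", "E2"}, "d2": {"E1", "E3"}, "d3": {"E2"}}
index = build_feature_index(docs)
```

- Looking up a feature ID in the index returns every document related to that feature, and the co-occurrence counts give a simple ranking of related features.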
- process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by one of skill in the art the steps in the foregoing embodiments may be performed in any order. Words such as “then,” “next,” etc. are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of the methods.
- Although process flow diagrams may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged.
- a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
- Embodiments implemented in computer software may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof.
- a code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
- a code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents.
- Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
- When implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium.
- the steps of a method or algorithm disclosed here may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium.
- Non-transitory computer-readable or processor-readable media include both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another.
- A non-transitory processor-readable storage medium may be any available medium that may be accessed by a computer.
- non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer or processor.
- Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
- the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.
- the various components of the technology can be located at distant portions of a distributed network and/or the Internet, or within a dedicated secure, unsecured and/or encrypted system.
- the components of the system can be combined into one or more devices or co-located on a particular node of a distributed network, such as a telecommunications network.
- the components of the system can be arranged at any location within a distributed network without affecting the operation of the system.
- the components could be embedded in a dedicated machine.
- the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed element(s) that is capable of supplying and/or communicating data to and from the connected elements.
- module as used herein can refer to any known or later developed hardware, software, firmware, or combination thereof that is capable of performing the functionality associated with that element.
- determine, calculate and compute, and variations thereof, as used herein are used interchangeably and include any type of methodology, process, mathematical operation or technique.
Abstract
A method for disambiguating features in unstructured text is provided. The disclosed method may not require pre-existing links to be present. The method for disambiguating features in unstructured text may use co-occurring features derived from both the source document and a large document corpus. The disclosed method may include multiple modules, including a linking module for linking the derived features from the source document to the co-occurring features of an existing knowledge base. The disclosed method for disambiguating features may allow identifying unique entities from a knowledge base that includes entities with a unique set of co-occurring features, which in turn may allow for increased precision in knowledge discovery and search results, employing advanced analytical methods over a massive corpus, employing a combination of entities, co-occurring entities, topic IDs, and other derived features.
Description
- This application is a continuation of U.S. patent application Ser. No. 14/557,794, entitled “Method For Disambiguating Features In Unstructured Text,” filed on Dec. 2, 2014, which claims the benefit of U.S. Provisional Application No. 61/910,739, entitled “Method For Disambiguating Features In Unstructured Text,” filed on Dec. 2, 2013, all of which are hereby incorporated by reference in their entirety.
- This application is related to U.S. application Ser. No. 14/558,300, entitled “Event Detection Through Text Analysis Using Trained Event Template Models,” filed Dec. 2, 2014; and U.S. application Ser. No. 14/558,254, entitled “Design And Implementation Of Clustered In-Memory Database,” filed Dec. 2, 2014; each of which is hereby incorporated by reference in its entirety.
- The present disclosure generally relates to data management; and, more specifically, to data management systems and methods that extract and store material from source items received over a network.
- Searching for information about entities (i.e., people, locations, organizations) in a large document collection, including sources such as a network, may often be ambiguous, which may lead to imprecise text processing functions, imprecise association of features during knowledge extraction, and, thus, imprecise data analysis.
- State of the art systems use linkage based clustering and ranking in several algorithms, such as PageRank and the hyperlink-induced topic search (HITS) algorithm. The basic idea behind this and related approaches is that pre-existing links typically exist between related pages or concepts. A limitation of clustering-based techniques is that sometimes contextual information needed to disambiguate entities is not present in the context, leading to incorrectly disambiguated results. Similarly, documents about different entities in the same or superficially similar contexts may be incorrectly clustered together.
- Other systems attempt to disambiguate entities by reference to one or more external dictionaries (or knowledgebase) of entities. In such systems, an entity's context is compared to possible matching entities in the dictionary and the closest match is returned. A limitation associated with current dictionary-based techniques stems from the fact that entities may increase in number at any moment and, therefore, no dictionary may include a representation of all of the world's entities. Thus, if a document's context is matched to an entity in the dictionary, then the technique has identified only the most similar entity in the dictionary, and not necessarily the correct entity, which may be outside the dictionary.
- Most methods just use entities and key phrases in the disambiguation process. Therefore, there is still a need for accurate entity disambiguation techniques that allow a precise data analysis.
- Some embodiments describe a method for disambiguating features. The method may include multiple modules, such as one or more feature extraction modules, one or more disambiguation modules, one or more scoring modules, and one or more linking modules.
- Disambiguating features will be supported in part by extracting topics from the ambient document of the feature, employing a multi-component extension of Latent Dirichlet Allocation (MC-LDA) topic models. Here, each component is modeled around each secondary feature stored in the existing knowledge base or extracted from the incoming document. Further, the linking or disambiguation process is modeled as topic inference from the MC-LDA, which provides automated weight estimation during MC-LDA training and readily applies the estimated weights during inference.
- The exemplary method may improve the accuracy of entity disambiguation beyond what may be achieved without considering document linking. Taking account of document linkage may allow better disambiguation by considering document and entity relationships implied by links.
- In one embodiment, a method comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the disambiguation module of the in-memory database server computer, an existing unique identifier (“unique ID”) corresponding to each matching primary feature in the knowledgebase cluster and updating the knowledgebase cluster to include the new cluster; and when there is no match, creating, by the node, a new knowledgebase cluster and assigning a new unique ID to the primary feature of the new knowledgebase cluster; and transmitting, by the node, one of the existing unique ID and the new unique ID for the primary feature.
- In another embodiment, a non-transitory computer readable medium having stored thereon computer executable instructions comprises searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature; associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”); disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs; identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs; disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs; linking, by the node, each primary feature to the associated set of secondary features to form a new cluster; determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the node, an existing unique identifier (“unique ID”) corresponding to each matching primary feature in the knowledgebase cluster and updating the knowledgebase cluster to include the new cluster; and when there is no match, creating a new knowledgebase cluster and assigning a new unique ID to the primary feature of the new knowledgebase cluster; and transmitting, by the node, one of the existing unique ID and the new unique ID for the primary feature.
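- The final steps recited in the embodiments above (matching a new cluster against the knowledgebase, then either reusing an existing unique ID or assigning a new one) can be sketched as follows. This is a minimal illustration under stated assumptions: the Jaccard similarity, the exact-name comparison, the 0.5 threshold, and the class and method names are all hypothetical stand-ins and are not taken from the disclosure.

```python
import itertools

class KnowledgeBase:
    """Toy in-memory store of clusters keyed by unique ID (illustrative only)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold   # assumed acceptance threshold
        self.clusters = {}           # unique_id -> (primary, set of secondary features)
        self._ids = itertools.count(1)

    @staticmethod
    def _similarity(a, b):
        """Jaccard overlap of two secondary-feature sets (a stand-in metric)."""
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def resolve(self, primary, secondary):
        """Return the unique ID for a new cluster, updating or creating as needed."""
        best_id, best_score = None, 0.0
        for uid, (name, feats) in self.clusters.items():
            if name != primary:      # exact name match is a simplification
                continue
            s = self._similarity(feats, secondary)
            if s > best_score:
                best_id, best_score = uid, s
        if best_id is not None and best_score >= self.threshold:
            # Match: fold the new secondary features into the existing cluster.
            self.clusters[best_id][1].update(secondary)
            return best_id
        # No match: create a new cluster and assign a new unique ID.
        new_id = f"F{next(self._ids)}"
        self.clusters[new_id] = (primary, set(secondary))
        return new_id
```

- Calling resolve twice with overlapping secondary features for the same primary name returns the same ID and enlarges the stored cluster, while a call with disjoint secondary features yields a new ID.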
- Additional features and advantages of an embodiment will be set forth in the description which follows, and in part will be apparent from the description. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the exemplary embodiments in the written description and claims hereof as well as the appended drawings.
- It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
- The present disclosure can be better understood by referring to the following figures. The accompanying drawings constitute a part of this specification and illustrate an embodiment of the invention and, together with the specification, explain the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosure. In the figures, reference numerals designate corresponding parts throughout the different views.
- FIG. 1 is a flowchart of a method for disambiguating features in unstructured text, according to an exemplary embodiment.
- FIG. 2 is a flowchart of the steps performed by a disambiguation module employed in the method for disambiguating features, according to an exemplary embodiment.
- FIG. 3 is a flowchart of the steps performed by a link on-the-fly module employed in the method for disambiguating features, according to an exemplary embodiment.
- FIG. 4 is an illustrative diagram of a system employed for implementing the method for disambiguating features, according to an exemplary embodiment.
- FIG. 5 shows a graphical representation of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic model, according to an exemplary embodiment.
- FIG. 6 illustrates an embodiment of the Gibbs sampling equations for multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
- FIG. 7 illustrates an embodiment of the implementation of a stochastic variational inference algorithm for training and inference in multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
- FIG. 8 is a table illustrating a sample topic for a multi-component, conditionally-independent latent Dirichlet allocation topic model, according to an exemplary embodiment.
- As used herein, the following terms have the following definitions:
- “Document” refers to a discrete electronic representation of information having a start and end.
- “Multi-Document” refers to a document with its tokens, different types of named entities, and key phrases organized into separate “bag-of-surface-forms” components.
- “Database” refers to any system including any combination of clusters and modules suitable for storing one or more collections and suitable to process one or more queries.
- “Corpus” refers to a collection of one or more documents.
- “Live corpus”, or “Document Stream”, refers to a corpus that is constantly fed as new documents are uploaded into a network.
- “Feature” refers to any information which is at least partially derived from a document.
- “Feature attribute” refers to metadata associated with a feature; for example, location of a feature in a document, confidence score, among others.
- “Cluster” refers to a collection of features.
- “Entity knowledge base” refers to a base containing features/entities.
- “Link on-the-fly module” refers to any linking module that performs data linkage as data is requested from the system rather than as data is added to the system.
- “Memory” refers to any hardware component suitable for storing information and retrieving said information at a sufficiently high speed.
- “Module” refers to a computer software component suitable for carrying out one or more defined tasks.
- “Sentiment” refers to subjective assessments associated with a document, part of a document, or feature.
- “Topic” refers to a set of thematic information which is at least partially derived from a corpus.
- “Topic Identifier”, or “topic ID”, refers to an identifier that refers to a specific instance of a topic.
- “Topic Collection” refers to a specific set of topics derived from the corpus, with each topic having a unique identifier (“unique ID”).
- “Topic Classification” refers to the assignation of specific topic identifiers as features of a document.
- “Query” refers to a request to retrieve information from one or more suitable databases.
- Reference will now be made in detail to the preferred embodiments, examples of which are illustrated in the accompanying drawings. The embodiments described above are intended to be exemplary. One skilled in the art recognizes that numerous alternative components and embodiments may be substituted for the particular examples described herein and still fall within the scope of the invention.
- The present disclosure describes a method for disambiguating features in unstructured text. Although the exemplary embodiments discuss practices for disambiguating features according to this disclosure, it is intended that the systems and methods described herein can be configured for any suitable use within the scope of this disclosure.
- Existing knowledge bases include non-ambiguous features and their related features, which may lead to low confidence text analytics. An aspect of the present disclosure includes a method that may allow increased accuracy in feature and entity disambiguation and, therefore, increased accuracy in text analytics.
- According to an embodiment, the disclosed method for disambiguating features may be employed in an initial corpus of data to perform a document ingestion and a feature extraction that may allow a topic classification and other text analytics on each document included in the initial corpus. Each feature may be identified and recorded as name, type, positional information in the document, and confidence score, among others.
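- The feature record described above (name, type, positional information, and confidence score, among others) might be represented as in the following sketch. The field names, the character-offset representation of position, and the validation rules are assumptions made for illustration; the document ID and topic ID fields anticipate the feature attributes listed for FIG. 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Feature:
    name: str
    feature_type: str          # e.g. "person", "location", "organization"
    doc_id: str                # identifier of the source document
    span: Tuple[int, int]      # assumed: character offsets within the document
    confidence: float          # extraction certainty, assumed to lie in [0, 1]
    topic_ids: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Basic sanity checks on the assumed value ranges.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.span[0] > self.span[1]:
            raise ValueError("span start must not exceed span end")
```

- For instance, Feature("Paris", "location", "doc-1", (10, 15), 0.92) describes one extracted mention; topic IDs can be appended later once topic classification has run.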
- FIG. 1 is a flowchart of a method 100 illustrating a plurality of steps for disambiguating features in unstructured text. According to an embodiment, method 100 for disambiguating features may initiate as a new document input, step 102, is made in an existing knowledge base. Subsequently, a feature extraction step 104 may be performed on the document. According to an embodiment, a feature may be related to different feature attributes, such as a topic identifier (“topic ID”), a document identifier (“document ID”), feature type, feature name, confidence score, and feature position, among others.
- According to various embodiments, document input in step 102 may be fed from a massive corpus or live corpus (such as the internet or a network-connected corpus) that, in turn, may be fed every second.
- According to different embodiments, one or more feature recognition and extraction algorithms may be employed during feature extraction step 104 to analyze the unstructured text of document input step 102. A score may be assigned to each extracted feature. The score may indicate the level of certainty of the feature being correctly extracted with the correct attributes.
- Additionally, during feature extraction step 104, one or more primary features may be identified from a document input in step 102. Each primary feature may be associated with a set of feature attributes and one or more secondary features. Each secondary feature may be associated with a set of feature attributes. In some embodiments, one or more secondary features may have one or more tertiary features, each one having its own set of feature attributes.
- Taking into account the feature attributes, the relative weight or relevance of each of the features within document input at step 102 may be determined. Additionally, the relevance of the association between features may be determined using a weighted scoring model.
- Following feature extraction step 104, the features extracted from document input at step 102 and all their related information may be loaded into an in-memory database (MemDB), during inclusion of features in MemDB, step 106, as part of a feature disambiguation request step 108.
- In an embodiment, the MemDB forms part of a disambiguation computer server environment having one or more processors executing the steps discussed in connection with FIGS. 1-8. In one embodiment, the MemDB is a computer module that may include one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation sub module. One search controller may be selectively associated with one or more search nodes. Each search node may be capable of independently performing a fuzzy key search through a collection of compressed data and returning a set of scored results to its associated search controller.
- Feature disambiguation step 108 may be performed by a disambiguation sub module within the MemDB. Feature disambiguation process 108 may include machine generated topic IDs, which may be employed to classify features, documents, or corpora. The relatedness of individual features and specific topic IDs may be determined using disambiguating algorithms. In some documents, the same feature may be related to one or more topic IDs, depending on the context of the different occurrences of the feature within the document.
- The set of features (like topics, proximity terms and entities, key phrases, events and facts) extracted from one document may be compared with sets of features from other documents, using disambiguating algorithms to define with a certain level of accuracy if two or more features across different documents are a single feature or if they are distinct features. In some examples, co-occurrence of two or more features across the collection of documents in the database may be analyzed to improve the accuracy of feature disambiguation process 108. In some embodiments, global scoring algorithms may be used to determine the probability of features being the same.
- In some embodiments, as part of the feature disambiguation process 108, a knowledge base may be generated within the MemDB. This knowledge base may be used to temporarily store clusters of relevant disambiguated primary features and their related secondary features. When new documents are loaded into the MemDB, the new disambiguated set of features may be compared with the existing knowledge base in order to determine the relationship between features and determine if there is a match between the new features and already extracted features.
- If the features compared match, the knowledge base may be updated and a feature ID of the matching features may be returned to the user and/or requesting application or process. Further, based on the frequency of matches, a prominence measure could be attached to the feature ID, which captures its popularity index in the given corpus. If the features compared do not match with any of the already extracted features, a unique feature ID is assigned to the disambiguated entity or feature, and the unique feature ID is associated with the cluster of defining features and stored within the knowledge base of the MemDB. Subsequently, in step 110, the feature ID of disambiguated feature may be returned to the source through the system interface. In some embodiments, the feature ID of disambiguated feature may include secondary features, cluster of features, relevant feature attributes or other requested data. Disambiguation sub module employed for feature disambiguation step 108 is described in more detail in FIG. 2 below.
Disambiguation Sub Module
- FIG. 2 is a flowchart of a process 200 performed by a disambiguation sub module employed in unstructured texts for feature disambiguation step 108 of the method 100 (FIG. 1), according to an embodiment. Disambiguation process 200 may begin after inclusion of features in MemDB in step 106 of FIG. 1. The extracted features provided in step 202 may be used to perform a candidate search in step 204, in which a search for the extracted features may be performed through all candidate records, including co-occurring features.
- According to various embodiments, candidates may be primary features with a set of associated secondary features that may be used in feature disambiguation process 108.
- The disambiguation results may be improved by the co-occurrence of topic IDs and relatedness among topic IDs. The relatedness of topic IDs, even across different topic models, can be discovered from a large corpus where the topic IDs have been assigned. Related topic IDs can be used during records linkage step 206 to provide linkage to documents that may not contain the exact topic ID but do contain one or more related topic IDs. This approach may improve the recall of relevant features to be included in the records linkage step 206 and improve disambiguation results in certain cases.
- Once sets of potentially related documents have been identified and sets of relevant primary and secondary features within these documents have been extracted, feature attributes, the relationships between features of the same document (meaningful context), the relative weight of the features and other variables may be used during records linkage process 206 to disambiguate primary and secondary features across documents. Then, each of the records may be linked to other records to determine clusters of disambiguated primary features and their related secondary features. The algorithms used for records linkage 206 may be capable of overcoming spelling errors or transliterations and other challenges of mining unstructured data sets.
- Cluster comparison step 208 may include the assignment of relative matching scores to clusters of disambiguated features; different thresholds of acceptance may be defined for different applications. The defined levels of accuracy may determine which scores may be considered a positive match search and which scores may be considered a negative match search, step 210. Each new cluster may be given a unique ID and may be temporarily stored in a knowledge base. Each new cluster may include a new disambiguated primary feature and its set of secondary features. If a new cluster matches a cluster that is already stored in the knowledge base, the system updates the knowledge base in step 212, and return of a matched feature ID to the user and/or requesting application or process may be performed in step 214. Update of knowledge base 212 may imply the association of additional secondary features to one primary feature, or the addition of feature attributes that were not previously associated with primary or secondary features.
- If the cluster being evaluated is assigned a score below the threshold of positive match search 210, the system performs a unique ID assignment, step 216, to the primary feature of the cluster and updates knowledge base 212. Afterwards, the system performs a return of matched ID process 214. Records linkage step 206 is further explained in detail in FIG. 3.
Link On-the-Fly Sub Module
-
FIG. 3 is a flowchart of a process 300 performed by a link on-the-fly (“link OTF”) sub module employed in method 100 for disambiguating features, according to an embodiment. Link OTF process 300 may be capable of constantly evaluating, scoring, linking, and clustering a feed of information. The link OTF sub module may perform records linkage 206 using multiple algorithms. Candidate search results of step 204 may be constantly fed into link OTF module 300. The input of data may be followed by a match scoring algorithm application, step 302, where one or more match scoring algorithms may be applied simultaneously in multiple search nodes of the MemDB while performing fuzzy key searches for evaluating and scoring the relevant results, taking into account multiple feature attributes, such as string edit distances, phonetics, and sentiments, among others.
- Afterwards, a linking algorithm application step 304 may be added to compare all candidate records, identified during match scoring algorithm application step 302, to each other. Linking algorithm application 304 may include the use of one or more analytical linking algorithms capable of filtering and evaluating the scored results of the fuzzy key searches performed inside the multiple search nodes of the MemDB. In some examples, co-occurrence of two or more features across the collection of identified candidate records in the MemDB may be analyzed to improve the accuracy of the process. Different weighted models and confidence scores associated with different feature attributes may be taken into account for linking algorithm application 304.
- After the linking algorithm application step 304, the linked results may be arranged in clusters of related features and returned, as part of return of linked records clusters in step 306.
-
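By way of a non-limiting illustration, the three stages of link OTF process 300 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosed embodiments: the record layout, the function names, the thresholds `score_min` and `link_min`, the use of Python's `difflib` ratio as the match score, and the use of Jaccard overlap with union-find as the linking algorithm are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def match_score(query: str, candidate: str) -> float:
    """Fuzzy string similarity in [0, 1]; a stand-in for a match scoring algorithm."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Overlap of two secondary-feature sets, used here as a co-occurrence signal."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def link_on_the_fly(query, records, score_min=0.6, link_min=0.3):
    """Score candidates (cf. step 302), link them pairwise (cf. step 304),
    and return clusters of linked record IDs (cf. step 306)."""
    # Scoring: keep only records whose name fuzzily matches the query.
    candidates = {
        rid: rec for rid, rec in records.items()
        if match_score(query, rec["name"]) >= score_min
    }
    # Linking: union-find over candidates whose secondary features overlap.
    parent = {rid: rid for rid in candidates}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(candidates, 2):
        if jaccard(candidates[a]["secondary"], candidates[b]["secondary"]) >= link_min:
            parent[find(a)] = find(b)
    # Clustering: group record IDs by their union-find root.
    clusters = {}
    for rid in candidates:
        clusters.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical candidate records for illustration only.
records = {
    "r1": {"name": "John Doe", "secondary": {"football", "quarterback", "Dallas"}},
    "r2": {"name": "Jon Doe", "secondary": {"football", "Dallas", "touchdown"}},
    "r3": {"name": "John Doe", "secondary": {"bridge", "engineer", "concrete"}},
    "r4": {"name": "Jane Smith", "secondary": {"football", "Dallas"}},
}
clusters = link_on_the_fly("John Doe", records)
```

In this toy input, the two football-related records end up in one cluster, the engineer record forms its own cluster, and the record with a dissimilar name is dropped at the scoring stage.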
FIG. 4 is an illustrative diagram of an embodiment of a system 400 for disambiguating features in unstructured text, as discussed above in connection with FIG. 1. The system 400 hosts an in-memory database and comprises one or more nodes.
- According to an embodiment, the system 400 includes one or more processors executing computer instructions for a plurality of special-purpose computer modules FIG. 4, the document input modules document input module 402 through a network connection 404. Therefore, the source may be constantly getting new knowledge, updated by user workstations 406, where such new knowledge is not pre-linked in a static way. Thus, the number of documents to be evaluated may grow without bound.
- This evaluation may be achieved via the MemDB computer 408. MemDB 408 may facilitate a faster disambiguation process and on-the-fly disambiguation, which may facilitate reception of the latest information that is going to contribute to MemDB 408. Various methods for linking the features may be employed, which may essentially use a weighted model for determining which entity types are most important and carry more weight and, based on confidence scores, determine how confidently the extraction and disambiguation of the correct features has been performed, so that the correct feature may go into the resulting cluster of features. As shown in FIG. 4, as more system nodes work in parallel, the process may become more efficient.
- According to various embodiments, when a new document arrives into the system 400 via the document input module network connection 404, feature extraction is performed via the extraction module 411 and, then, feature disambiguation may be performed on the new document via the feature disambiguation sub-module 414 of the MemDB 408. In one embodiment, after feature disambiguation of the new document is performed, the extracted new features 410 may be included in the MemDB to pass through link OTF sub-module 412, where the features may be compared and linked, and a feature ID of disambiguated feature 110 may be returned to the user as a result from a query. In addition to the feature ID, the resulting feature cluster defining the disambiguated feature may optionally be returned.
-
MemDB computer 408 can be a database storing data in records controlled by a database management system (DBMS) (not shown) configured to store data records in a device's main memory, as opposed to conventional databases and DBMS modules that store data in “disk” memory. Conventional disk storage requires processors (CPUs) to execute read and write commands to a device's hard disk, thus requiring CPUs to execute instructions to locate (i.e., seek) and retrieve the memory location for the data, before performing some type of operation with the data at that memory location. In-memory database systems access data that is placed into main memory, and then addressed accordingly, thereby reducing the number of instructions performed by the CPUs and eliminating the seek time associated with CPUs seeking data on hard disk.
- In-memory databases may be implemented in a distributed computing architecture, which may be a computing system comprising one or more nodes configured to aggregate the nodes' respective resources (e.g., memory, disks, processors). As disclosed herein, embodiments of a computing system hosting an in-memory database may distribute and store data records of the database among one or more nodes. In some embodiments, these nodes are formed into “clusters” of nodes. In some embodiments, these clusters of nodes store portions, or “collections,” of database information.
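As one possible illustration of distributing data records among nodes, the sketch below assigns each record to a node by hashing its identifier. The description does not state how records are assigned to nodes, so hash partitioning, the function names `node_for` and `distribute`, and the use of plain dictionaries as per-node in-memory collections are assumptions made only for this example.

```python
import hashlib


def node_for(record_id: str, num_nodes: int) -> int:
    """Map a record ID to a node index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribute(records: dict, num_nodes: int) -> list:
    """Split records into per-node in-memory collections (plain dicts here)."""
    nodes = [{} for _ in range(num_nodes)]
    for record_id, record in records.items():
        nodes[node_for(record_id, num_nodes)][record_id] = record
    return nodes


# Hypothetical records for illustration only.
records = {f"rec-{i}": {"value": i} for i in range(10)}
nodes = distribute(records, num_nodes=3)
```

Because the hash is stable, a later lookup for a given record ID can be routed to the same node without consulting a central index.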
- Various embodiments provide a computer executed feature disambiguation technique that employs an evolving and efficiently linkable feature knowledge base that is configured to store secondary features, such as co-occurring topics, key phrases, proximity terms, events, facts and trending popularity index. The disclosed embodiments may be performed via a wide variety of linking algorithms that can vary from a simple conceptual distance measure to sophisticated graph clustering approaches based on the dimensions of the involved secondary features that aid in resolving a given extracted feature to a stored feature in the knowledge base. Additionally, embodiments can introduce an approach that evolves the existing feature knowledge base through a capability that not only updates the secondary features of an existing feature entry, but also expands the knowledge base by discovering new features that can be appended to it.
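The resolve-or-append behavior of such an evolving knowledge base might, for illustration, be sketched as follows. This is a simplified sketch rather than the disclosed implementation: the `KnowledgeBase` class, its `resolve` method, the exact-name comparison, the `F1`-style identifiers, and the use of Jaccard overlap as the "simple conceptual distance measure" are hypothetical.

```python
import itertools


class KnowledgeBase:
    """Toy knowledge base: feature ID -> (primary name, set of secondary features)."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold  # acceptance threshold for a positive match
        self.entries = {}
        self._ids = itertools.count(1)

    @staticmethod
    def similarity(a: set, b: set) -> float:
        """Jaccard overlap, standing in for a simple conceptual distance measure."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def resolve(self, primary: str, secondary: set) -> str:
        """Return the ID of the best-matching stored entry, or append a new entry."""
        best_id, best_score = None, 0.0
        for fid, (name, feats) in self.entries.items():
            if name != primary:  # simplification: exact name match only
                continue
            score = self.similarity(secondary, feats)
            if score > best_score:
                best_id, best_score = fid, score
        if best_id is not None and best_score >= self.threshold:
            # Positive match: enrich the stored entry with newly seen secondary features.
            self.entries[best_id][1].update(secondary)
            return best_id
        # Negative match: assign a new unique ID and append the entry.
        new_id = f"F{next(self._ids)}"
        self.entries[new_id] = (primary, set(secondary))
        return new_id


kb = KnowledgeBase(threshold=0.4)
a = kb.resolve("John Doe", {"football", "quarterback", "Dallas"})
b = kb.resolve("John Doe", {"engineer", "bridge"})
c = kb.resolve("John Doe", {"football", "Dallas", "touchdown"})
```

Here the third mention resolves to the same ID as the first and adds "touchdown" to that entry, while the engineer mention receives its own ID.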
- Embodiments of the disambiguation approach can employ a topic modeling approach to provide an automated weighted (across all the secondary features) linking process (disambiguation) that is modeled as topic inference. To support the automated weighted linking process, embodiments extend the conventional LDA topic modeling to build a novel topic modeling approach referred to as a multi-component LDA (MC-LDA) that can support any number of components (secondary features) as conditionally-independent. Embodiments of the modeling approach also can automatically learn the weights of components during training and employ them for inference (linking) in connection with disambiguation. The introduced MC-LDA approach for disambiguation can scale for any additional number of secondary features that could be introduced to increase disambiguation accuracy.
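The model equations themselves are set forth in the figures and are not reproduced in this text. Purely to illustrate what treating components as conditionally independent with per-component weights can look like at inference time, the sketch below combines weighted per-component log-likelihoods into a posterior over topics. It is a simplified naive-Bayes-style stand-in and not the MC-LDA model; the topic tables, the weights, the smoothing constant `eps`, and the function name `topic_posterior` are all invented for the example.

```python
import math


def topic_posterior(doc, topics, prior, weights, eps=1e-6):
    """Posterior over topics for a document split into components.

    doc:     {component: [observed tokens]}
    topics:  {topic: {component: {token: probability}}}
    prior:   {topic: prior probability}
    weights: {component: weight applied to that component's log-likelihood}
    """
    log_scores = {}
    for topic, comps in topics.items():
        score = math.log(prior[topic])
        for comp, tokens in doc.items():
            dist = comps.get(comp, {})
            # Components are treated as independent given the topic,
            # so their weighted log-likelihoods simply add up.
            score += weights.get(comp, 1.0) * sum(
                math.log(dist.get(tok, eps)) for tok in tokens
            )
        log_scores[topic] = score
    # Normalize with the log-sum-exp trick for numerical stability.
    peak = max(log_scores.values())
    total = sum(math.exp(s - peak) for s in log_scores.values())
    return {t: math.exp(s - peak) / total for t, s in log_scores.items()}


# Hypothetical two-topic, two-component model for illustration only.
topics = {
    "sports": {
        "key_phrases": {"touchdown": 0.6, "contract": 0.4},
        "proximity_terms": {"stadium": 0.7, "site": 0.3},
    },
    "construction": {
        "key_phrases": {"touchdown": 0.05, "contract": 0.95},
        "proximity_terms": {"stadium": 0.2, "site": 0.8},
    },
}
prior = {"sports": 0.5, "construction": 0.5}
weights = {"key_phrases": 1.0, "proximity_terms": 0.5}
doc = {"key_phrases": ["touchdown"], "proximity_terms": ["stadium"]}
posterior = topic_posterior(doc, topics, prior, weights)
```

Adding a further secondary feature in this sketch only requires adding another component key to the document, the topic tables, and the weights, which mirrors the scaling property described above.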
-
FIG. 5 shows a graphical representation of an embodiment of a multi-component, conditionally-independent Latent Dirichlet Allocation (MC-LDA) topic computer modeling approach employed by system 400 of FIG. 4 above. In the illustrated embodiment, each component block represents modeling each secondary feature across the knowledge base, for instance as executed via the MemDB 408 of FIG. 4 that is initialized with the parameters set forth in FIG. 5.
-
FIG. 6 illustrates an embodiment of the Gibbs sampling equations for the MC-LDA topic model employed in FIG. 5 above. An embodiment of this sampling approach aids the system 400 of FIG. 4 in training the individual component (secondary feature) weights in an automated fashion and in an efficient manner.
-
FIG. 7 illustrates an embodiment of the computer executed implementation of a stochastic variational inference algorithm for training and inference in the MC-LDA topic model of FIGS. 5-6, for instance as executed via the MemDB 408 of the system 400 of FIG. 4 that is initialized with the parameters set forth in FIG. 7. An embodiment of this inference method applies readily to model the linking/disambiguation process as topic inference, by taking all the secondary features (extracted from the document of interest) as an input and providing weighted topics as the output. These weighted topics can then be used to compute a similarity score against the stored feature knowledge base entries.
-
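The final step described above, computing a similarity score between the weighted topics of a document and stored knowledge base entries, might be illustrated as follows. The description does not name a specific similarity measure, so cosine similarity, the function names, and the topic weights shown are assumptions made for this sketch.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse topic-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_entries(doc_topics: dict, kb_topics: dict) -> list:
    """Return (score, feature ID) pairs, best match first."""
    scored = [(cosine(doc_topics, topics), fid) for fid, topics in kb_topics.items()]
    return sorted(scored, reverse=True)


# Hypothetical weighted topics for a document and two stored entries.
doc_topics = {"sports": 0.9, "business": 0.1}
kb_topics = {
    "F1": {"sports": 0.8, "business": 0.2},
    "F2": {"construction": 0.7, "business": 0.3},
}
ranking = rank_entries(doc_topics, kb_topics)
```

The top-ranked score could then be compared against an acceptance threshold to decide between a positive and a negative match.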
FIG. 8 is a table illustrating a sample topic for an MC-LDA topic model. FIG. 8 displays the top scoring surface forms for each component of the model, for instance as executed via the MemDB 408 of the system 400 of FIG. 4, according to an embodiment.
-
Example #1 is an application of method 100 for disambiguating features in unstructured text, where the feature of interest (primary feature) is John Doe, a football player, and the user wants to monitor the news referencing John Doe. According to one embodiment, a document input 102 mentioning John Doe may be uploaded into the network. Features of document input 102 may be extracted and included into MemDB 408 to be disambiguated and linked to a cluster of secondary features associated with the primary feature (John Doe), and compared to existing clusters of similar features. Method 100 may output different feature IDs and the feature IDs' associated clusters that include all secondary features related to John Doe; for example, John Doe, engineer; John Doe, teacher; and John Doe, football player. Other primary features with similar secondary features may be considered, for example nicknames or short names. Then “JD,” football player, from the same team as John Doe, football player, with the same age and career, may be considered the same primary feature. Therefore, all documents related to John Doe, football player, may be accessed easily.
-
Example #2 is an application of method 100 for disambiguating features in an unstructured document, where the primary feature may be an image. According to one embodiment, method 100 may include a feature extraction 104, where a feature may be a general attribute, such as edges and shapes, among others; or a specific attribute, such as a tank, a person, and a clock, among others. For example, a new image may be input, where the image may have secondary features such as a specific shape (e.g., the shape of a square, a person, or a car); the secondary features may be extracted and included in the MemDB 408, where a match may be found among all other images that have similar secondary features. According to the present embodiment, features may only include images, i.e., text may not be included as a feature.
-
Example #3 is an application of method 100 for disambiguating features in unstructured text, where the primary feature may be an event. According to one embodiment, when a query is made, method 100 may allow a user to receive results associated with an event, such as an earthquake, a fire, or an epidemic outbreak, among others. Method 100 may perform a feature extraction 104 and feature disambiguation 108 of the features to find the event's associated features and provide feature IDs of disambiguated features 110.
-
Example #4 is an embodiment of method 100, where prediction of one or more events that might occur may be made. According to one embodiment, a user may indicate features and events of interest prior to operation, and, therefore, links between different features associated with the events of interest may be previously established. As the associated features appear in the network in a high number of occurrences, method 100 may predict that the event of interest might occur, based on the increased number of occurrences of the associated features. When the imminent event is detected, an alert may be sent to the user. For example, a user working for the health department of Thailand may choose to receive an alert for an epidemic outbreak of dengue. As other users 406 from, for example, social networks upload comments including symptoms of dengue or admissions to a hospital, method 100 may disambiguate all related comments from the social networks and, taking into account the number of users 406 including related information, may predict and alert the health department worker that an epidemic outbreak of dengue may be occurring. Therefore, the health department worker may have additional evidence, and further actions may be taken in the affected community to keep the epidemic from spreading.
-
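The occurrence-based alerting of Example #4 might, for illustration, be sketched as a sliding-window count of distinct users whose disambiguated comments relate to the event of interest. The window length, the threshold, the `EventMonitor` class, and the assumption that mentions arrive in non-decreasing timestamp order are hypothetical and are not taken from the description.

```python
from collections import deque


class EventMonitor:
    """Fire an alert when enough distinct users mention an event within a time window."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._mentions = deque()  # (timestamp, user_id), oldest first

    def observe(self, timestamp: float, user_id: str) -> bool:
        """Record one disambiguated mention; return True when an alert should fire."""
        self._mentions.append((timestamp, user_id))
        # Drop mentions that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self._mentions and self._mentions[0][0] <= cutoff:
            self._mentions.popleft()
        distinct_users = {user for _, user in self._mentions}
        return len(distinct_users) >= self.threshold


monitor = EventMonitor(window_seconds=3600, threshold=3)
alerts = [
    monitor.observe(0, "u1"),
    monitor.observe(600, "u1"),    # same user again: still one distinct user
    monitor.observe(1200, "u2"),
    monitor.observe(1800, "u3"),   # third distinct user inside the window
    monitor.observe(9000, "u4"),   # earlier mentions have expired
]
```

Counting distinct users rather than raw comments is one design choice among several; it keeps a single prolific user from triggering the alert alone.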
Example #5 is an application of method 100 for disambiguating features in unstructured text, where primary features may be geographic place names. According to an embodiment, method 100 may be employed to disambiguate the name of a city, where different scoring weights may be associated with secondary features in the disambiguation sub-module. For example, method 100 may be employed to disambiguate Paris, Texas from Paris, France.
- Example #6 is an application of method 100 for disambiguating features in unstructured text, where primary features may be sentiments associated with a person, event, or company, among others; where sentiments may be positive or negative comments about a person, event, or company, among others, that may be fed from any suitable source, including social networks. According to one embodiment, method 100 may be employed for a company to gauge the acceptance that it has among the public.
- Example #7 is an embodiment of method 100, where method 100 may include human validation in order to increase the confidence score of a feature. According to one embodiment, link OTF process 300 (FIG. 3) may be assisted by a user, where the user may indicate whether a disambiguated feature has been correctly disambiguated and whether two different clusters should be one, meaning that what method 100 (taking into account all feature and topic co-occurrence information) indicates as two different primary features, the user knows may be the same. Therefore, the confidence score associated with that cluster may be higher and, thus, the probability that the feature is correctly disambiguated may be higher.
- Example #8 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300. In this example, the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.85 within a period of 1000 ms.
- Example #9 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300. In this example, the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.80 within a period not exceeding 300 ms. The algorithm used in this example provides an answer in a smaller period of time compared to the algorithm used in Example #8 but generally returns a lower confidence score.
-
Example #10 is an embodiment of method 100 using disambiguation process 200 and link OTF process 300. In this example, the linking algorithm used in linking algorithm application 304 is configured to provide a confidence score above 0.90 within a period generally exceeding 3000 ms. The algorithm used in this example provides an answer with a confidence score generally greater than that returned by the algorithm used in Example #8, but generally requires a significantly longer period of time.
-
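The trade-off between confidence and response time in Examples #8 through #10 suggests that a requesting application could select a linking algorithm according to its latency budget. The sketch below illustrates one way of doing so. The profile names and the selection function are hypothetical, and the 3000 ms figure is treated as a nominal latency even though Example #10 describes the period as generally exceeding 3000 ms.

```python
from typing import Optional

# (name, confidence floor, nominal latency in ms). The names are hypothetical;
# the numbers mirror Examples #9, #8, and #10 respectively.
PROFILES = [
    ("fast", 0.80, 300),
    ("balanced", 0.85, 1000),
    ("thorough", 0.90, 3000),
]


def pick_algorithm(budget_ms: int, min_confidence: float = 0.0) -> Optional[str]:
    """Return the highest-confidence profile that fits the budget, or None."""
    eligible = [
        (conf, name) for name, conf, latency in PROFILES
        if latency <= budget_ms and conf >= min_confidence
    ]
    return max(eligible)[1] if eligible else None


choice_interactive = pick_algorithm(500)
choice_batch = pick_algorithm(5000)
choice_none = pick_algorithm(1500, min_confidence=0.90)
```

When no profile satisfies both the budget and the confidence floor, the function returns `None` so that the caller can relax one of the two constraints.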
Example #11 is an example of method 100 for disambiguating features in unstructured text to perform e-discovery on a large corpus of documents from a plurality of sources. Given a large corpus of documents from a plurality of sources, applying method 100 to disambiguate all features in those documents enables the discovery of all features in the corpus. The collection of discovered features can be further utilized to discover all documents related to a feature and to discover related features.
- The foregoing method descriptions and the process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by one of skill in the art, the steps in the foregoing embodiments may be performed in any order. Words such as “then,” “next,” etc. are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of the methods. Although process flow diagrams may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
- The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed here may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
- Embodiments implemented in computer software may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof. A code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.
- The actual software code or specialized control hardware used to implement these systems and methods is not limiting of the invention. Thus, the operation and behavior of the systems and methods were described without reference to the specific software code, it being understood that software and control hardware can be designed to implement the systems and methods based on the description here.
- When implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium. The steps of a method or algorithm disclosed here may be embodied in a processor-executable software module which may reside on a computer-readable or processor-readable storage medium. Non-transitory computer-readable or processor-readable media include both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another. Non-transitory processor-readable storage media may be any available media that may be accessed by a computer. By way of example, and not limitation, such non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer or processor. Disk and disc, as used here, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.
- It is to be appreciated that the various components of the technology can be located at distant portions of a distributed network and/or the Internet, or within a dedicated secure, unsecured and/or encrypted system. Thus, it should be appreciated that the components of the system can be combined into one or more devices or co-located on a particular node of a distributed network, such as a telecommunications network. As will be appreciated from the description, and for reasons of computational efficiency, the components of the system can be arranged at any location within a distributed network without affecting the operation of the system. Moreover, the components could be embedded in a dedicated machine.
- Furthermore, it should be appreciated that the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed element(s) that is capable of supplying and/or communicating data to and from the connected elements. The term module as used herein can refer to any known or later developed hardware, software, firmware, or combination thereof that is capable of performing the functionality associated with that element. The terms determine, calculate and compute, and variations thereof, as used herein are used interchangeably and include any type of methodology, process, mathematical operation or technique.
- The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined here may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown here but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed here.
- The embodiments described above are intended to be exemplary. One skilled in the art recognizes that numerous alternative components and embodiments may be substituted for the particular examples described herein and still fall within the scope of the invention.
Claims (20)
1. A method comprising:
in response to receiving, by a server, a search query from a client:
searching, by the server, a set of records comprising a co-occurring feature, wherein the server comprises a main memory hosting a database, wherein the database stores a first cluster, wherein the first cluster comprises a disambiguated primary feature with a unique identifier and a set of secondary features, wherein the first cluster comprises a first score;
identifying, by the server, a record in the set of records, wherein the record matches an extracted feature such that the extracted feature is a primary feature;
associating, by the server, the extracted feature with a topic identifier;
disambiguating, by the server, the primary feature based on a relatedness of the topic identifier;
identifying, by the server, the set of secondary features based on the relatedness;
disambiguating, by the server, the primary feature from the set of secondary features based on the relatedness;
accessing, by the server, the database;
linking, by the server, in real-time, during the accessing, the primary feature to the set of secondary features;
forming, by the server, a second cluster based on the linking, wherein the second cluster comprises a second score;
comparing, by the server, the first score against the second score;
determining, by the server, whether the first score matches the second score;
identifying, by the server, the unique identifier related to the primary feature in the first cluster based on the first score matching the second score;
amending, by the server, based on the identifying the unique identifier, the first cluster such that the first cluster includes the second cluster; and
sending, by the server, the unique identifier to the client.
2. The method of claim 1, further comprising:
comparing, by the server, each member of the set of records which matches the extracted feature against a data item;
assigning, by the server, a third score to the extracted feature based on the comparing of the each of the member.
3. The method of claim 2, further comprising:
associating, by the server, the extracted feature with a feature attribute.
4. The method of claim 3, wherein the feature attribute is weighted.
5. The method of claim 2, further comprising:
determining, by the server, a relatedness of the extracted feature based on the feature attribute.
6. The method of claim 1, wherein the primary feature is associated with a feature attribute.
7. The method of claim 1, wherein the extracted feature is associated with a lower-ordinal feature in accordance with a cluster hierarchy.
8. The method of claim 1, wherein the searching is in a fuzzy manner.
9. The method of claim 1, further comprising:
comparing, by the server, a first feature against a second feature, wherein the first feature comprises the extracted feature, wherein the first feature is provided via a first data source, wherein the second feature is provided via a second data source;
determining, by the server, if the first feature co-occurs in the second data source based on the comparing of the first feature against the second feature;
linking, by the server, at least one of the first data source or the second data source.
10. The method of claim 1, further comprising:
determining, by the server, a co-occurrence of the extracted feature in a plurality of data sources;
improving, by the server, a rate of accuracy of the disambiguating based on the determining of the co-occurrence of the extracted feature.
11. A method comprising:
in response to receiving, by a server, a search query from a client:
searching, by the server, based on the receiving, a set of records comprising a co-occurring feature, wherein the server comprises a main memory hosting a database, wherein the database stores a first cluster, wherein the first cluster comprises a disambiguated primary feature with a first unique identifier and a set of secondary features, wherein the first cluster comprises a first score;
identifying, by the server, a record in the set of records, wherein the record matches an extracted feature such that the extracted feature is a first primary feature;
associating, by the server, the extracted feature with a topic identifier;
disambiguating, by the server, the first primary feature based on a relatedness of the topic identifier;
identifying, by the server, the set of secondary features based on the relatedness;
disambiguating, by the server, the first primary feature from the set of secondary features based on the relatedness;
accessing, by the server, the database;
linking, by the server, in real-time, during the accessing, the first primary feature to the set of secondary features;
forming, by the server, a second cluster based on the linking, wherein the second cluster comprises a second score;
comparing, by the server, the first score against the second score;
determining, by the server, whether the first score matches the second score;
generating, by the server, a third cluster based on the first score not matching the second score, wherein the third cluster comprises a second primary feature;
assigning, by the server, a second unique identifier to the second primary feature;
sending, by the server, the second unique identifier to the client.
12. The method of claim 1, further comprising:
comparing, by the server, each member of the set of records which matches the extracted feature against a data item;
assigning, by the server, a third score to the extracted feature based on the comparing of the each of the member.
13. The method of claim 2, further comprising:
associating, by the server, the extracted feature with a feature attribute.
14. The method of claim 3, wherein the feature attribute is weighted.
15. The method of claim 2, further comprising:
determining, by the server, a relatedness of the extracted feature based on the feature attribute.
16. The method of claim 1, wherein at least one of the first primary feature or the second primary feature is associated with a feature attribute.
17. The method of claim 1, wherein the extracted feature is associated with a lower-ordinal feature in accordance with a cluster hierarchy.
18. The method of claim 1, wherein the searching is in a fuzzy manner.
19. The method of claim 1, further comprising:
comparing, by the server, a first feature against a second feature, wherein the first feature comprises the extracted feature, wherein the first feature is provided via a first data source, wherein the second feature is provided via a second data source;
determining, by the server, if the first feature co-occurs in the second data source based on the comparing of the first feature against the second feature;
linking, by the server, at least one of the first data source or the second data source.
20. The method of claim 1, further comprising:
determining, by the server, a co-occurrence of the extracted feature in a plurality of data sources;
improving, by the server, a rate of accuracy of the disambiguating based on the determining of the co-occurrence of the extracted feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/979,703 US20160110446A1 (en) | 2013-12-02 | 2015-12-28 | Method for disambiguated features in unstructured text |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361910739P | 2013-12-02 | 2013-12-02 | |
US14/557,794 US9239875B2 (en) | 2013-12-02 | 2014-12-02 | Method for disambiguated features in unstructured text |
US14/979,703 US20160110446A1 (en) | 2013-12-02 | 2015-12-28 | Method for disambiguated features in unstructured text |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/557,794 Continuation US9239875B2 (en) | 2013-12-02 | 2014-12-02 | Method for disambiguated features in unstructured text |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160110446A1 (en) | 2016-04-21 |
Family
ID=53265533
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/557,794 Active US9239875B2 (en) | 2013-12-02 | 2014-12-02 | Method for disambiguated features in unstructured text |
US14/979,703 Abandoned US20160110446A1 (en) | 2013-12-02 | 2015-12-28 | Method for disambiguated features in unstructured text |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/557,794 Active US9239875B2 (en) | 2013-12-02 | 2014-12-02 | Method for disambiguated features in unstructured text |
Country Status (7)
Country | Link |
---|---|
US (2) | US9239875B2 (en) |
EP (1) | EP3077919A4 (en) |
JP (1) | JP6284643B2 (en) |
KR (1) | KR20160124742A (en) |
CN (1) | CN106164890A (en) |
CA (1) | CA2932399A1 (en) |
WO (1) | WO2015084724A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106991171A (en) * | 2017-03-25 | 2017-07-28 | 贺州学院 | Topic based on Intelligent campus information service platform finds method |
WO2018005203A1 (en) * | 2016-06-28 | 2018-01-04 | Microsoft Technology Licensing, Llc | Leveraging information available in a corpus for data parsing and predicting |
US10200397B2 (en) | 2016-06-28 | 2019-02-05 | Microsoft Technology Licensing, Llc | Robust matching for identity screening |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9201744B2 (en) | 2013-12-02 | 2015-12-01 | Qbase, LLC | Fault tolerant architecture for distributed computing systems |
US9424294B2 (en) | 2013-12-02 | 2016-08-23 | Qbase, LLC | Method for facet searching and search suggestions |
US9025892B1 (en) | 2013-12-02 | 2015-05-05 | Qbase, LLC | Data record compression with progressive and/or selective decomposition |
US9659108B2 (en) | 2013-12-02 | 2017-05-23 | Qbase, LLC | Pluggable architecture for embedding analytics in clustered in-memory databases |
US9547701B2 (en) | 2013-12-02 | 2017-01-17 | Qbase, LLC | Method of discovering and exploring feature knowledge |
US9348573B2 (en) * | 2013-12-02 | 2016-05-24 | Qbase, LLC | Installation and fault handling in a distributed system utilizing supervisor and dependency manager nodes |
US9424524B2 (en) | 2013-12-02 | 2016-08-23 | Qbase, LLC | Extracting facts from unstructured text |
EP3077927A4 (en) | 2013-12-02 | 2017-07-12 | Qbase LLC | Design and implementation of clustered in-memory database |
US9355152B2 (en) | 2013-12-02 | 2016-05-31 | Qbase, LLC | Non-exclusionary search within in-memory databases |
US10572935B1 (en) * | 2014-07-16 | 2020-02-25 | Intuit, Inc. | Disambiguation of entities based on financial interactions |
US10176457B2 (en) * | 2015-02-05 | 2019-01-08 | Sap Se | System and method automatically learning and optimizing sequence order |
US11157920B2 (en) * | 2015-11-10 | 2021-10-26 | International Business Machines Corporation | Techniques for instance-specific feature-based cross-document sentiment aggregation |
US10810408B2 (en) | 2018-01-26 | 2020-10-20 | Viavi Solutions Inc. | Reduced false positive identification for spectroscopic classification |
US11656174B2 (en) | 2018-01-26 | 2023-05-23 | Viavi Solutions Inc. | Outlier detection for spectroscopic classification |
US11009452B2 (en) | 2018-01-26 | 2021-05-18 | Viavi Solutions Inc. | Reduced false positive identification for spectroscopic quantification |
CN109344256A (en) * | 2018-10-12 | 2019-02-15 | 中国科学院重庆绿色智能技术研究院 | A kind of Press release subject classification and checking method |
KR102037453B1 (en) | 2018-11-29 | 2019-10-29 | 부산대학교 산학협력단 | Apparatus and Method for Numeral Classifier Disambiguation using Word Embedding based on Subword Information |
CN110110046B (en) * | 2019-04-30 | 2021-10-01 | 北京搜狗科技发展有限公司 | Method and device for recommending entities with same name |
US11636355B2 (en) * | 2019-05-30 | 2023-04-25 | Baidu Usa Llc | Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process |
CN110942765B (en) * | 2019-11-11 | 2022-05-27 | 珠海格力电器股份有限公司 | Method, device, server and storage medium for constructing corpus |
Family Cites Families (98)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2343097A (en) | 1996-03-21 | 1997-10-10 | Mpath Interactive, Inc. | Network match maker for selecting clients based on attributes of servers and communication links |
US6178529B1 (en) | 1997-11-03 | 2001-01-23 | Microsoft Corporation | Method and system for resource monitoring of disparate resources in a server cluster |
US6353926B1 (en) | 1998-07-15 | 2002-03-05 | Microsoft Corporation | Software update notification |
US6266781B1 (en) | 1998-07-20 | 2001-07-24 | Academia Sinica | Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network |
US6338092B1 (en) | 1998-09-24 | 2002-01-08 | International Business Machines Corporation | Method, system and computer program for replicating data in a distributed computed environment |
US6959300B1 (en) | 1998-12-10 | 2005-10-25 | At&T Corp. | Data compression method and apparatus |
US7099898B1 (en) | 1999-08-12 | 2006-08-29 | International Business Machines Corporation | Data access system |
US6738759B1 (en) | 2000-07-07 | 2004-05-18 | Infoglide Corporation, Inc. | System and method for performing similarity searching using pointer optimization |
US8692695B2 (en) | 2000-10-03 | 2014-04-08 | Realtime Data, Llc | Methods for encoding and decoding data |
US6832373B2 (en) | 2000-11-17 | 2004-12-14 | Bitfone Corporation | System and method for updating and distributing information |
US6691109B2 (en) | 2001-03-22 | 2004-02-10 | Turbo Worx, Inc. | Method and apparatus for high-performance sequence comparison |
GB2374687A (en) | 2001-04-19 | 2002-10-23 | Ibm | Managing configuration changes in a data processing system |
US7082478B2 (en) * | 2001-05-02 | 2006-07-25 | Microsoft Corporation | Logical semantic compression |
US6961723B2 (en) | 2001-05-04 | 2005-11-01 | Sun Microsystems, Inc. | System and method for determining relevancy of query responses in a distributed network search mechanism |
US20030028869A1 (en) | 2001-08-02 | 2003-02-06 | Drake Daniel R. | Method and computer program product for integrating non-redistributable software applications in a customer driven installable package |
JP2003150442A (en) * | 2001-11-19 | 2003-05-23 | Fujitsu Ltd | Memory development program and data processor |
US6954456B2 (en) | 2001-12-14 | 2005-10-11 | At & T Corp. | Method for content-aware redirection and content renaming |
US6829606B2 (en) | 2002-02-14 | 2004-12-07 | Infoglide Software Corporation | Similarity search engine for use with relational databases |
US7421478B1 (en) | 2002-03-07 | 2008-09-02 | Cisco Technology, Inc. | Method and apparatus for exchanging heartbeat messages and configuration information between nodes operating in a master-slave configuration |
US8015143B2 (en) | 2002-05-22 | 2011-09-06 | Estes Timothy W | Knowledge discovery agent system and method |
US7570262B2 (en) | 2002-08-08 | 2009-08-04 | Reuters Limited | Method and system for displaying time-series data and correlated events derived from text mining |
US7249312B2 (en) * | 2002-09-11 | 2007-07-24 | Intelligent Results | Attribute scoring for unstructured content |
US7058846B1 (en) | 2002-10-17 | 2006-06-06 | Veritas Operating Corporation | Cluster failover for storage management services |
US20040205064A1 (en) | 2003-04-11 | 2004-10-14 | Nianjun Zhou | Adaptive search employing entropy based quantitative information measurement |
US7543174B1 (en) | 2003-09-24 | 2009-06-02 | Symantec Operating Corporation | Providing high availability for an application by rapidly provisioning a node and failing over to the node |
US9009153B2 (en) | 2004-03-31 | 2015-04-14 | Google Inc. | Systems and methods for identifying a named entity |
US7818615B2 (en) | 2004-09-16 | 2010-10-19 | Invensys Systems, Inc. | Runtime failure management of redundantly deployed hosts of a supervisory process control data acquisition facility |
US7403945B2 (en) | 2004-11-01 | 2008-07-22 | Sybase, Inc. | Distributed database system providing data and space management methodology |
US20060179026A1 (en) | 2005-02-04 | 2006-08-10 | Bechtel Michael E | Knowledge discovery tool extraction and integration |
US20060294071A1 (en) | 2005-06-28 | 2006-12-28 | Microsoft Corporation | Facet extraction and user feedback for ranking improvement and personalization |
US7630977B2 (en) | 2005-06-29 | 2009-12-08 | Xerox Corporation | Categorization including dependencies between different category systems |
US8386463B2 (en) | 2005-07-14 | 2013-02-26 | International Business Machines Corporation | Method and apparatus for dynamically associating different query execution strategies with selective portions of a database table |
US7681075B2 (en) | 2006-05-02 | 2010-03-16 | Open Invention Network Llc | Method and system for providing high availability to distributed computer applications |
US7447940B2 (en) | 2005-11-15 | 2008-11-04 | Bea Systems, Inc. | System and method for providing singleton services in a cluster |
US8341622B1 (en) | 2005-12-15 | 2012-12-25 | Crimson Corporation | Systems and methods for efficiently using network bandwidth to deploy dependencies of a software package |
US7899871B1 (en) | 2006-01-23 | 2011-03-01 | Clearwell Systems, Inc. | Methods and systems for e-mail topic classification |
US7519613B2 (en) | 2006-02-28 | 2009-04-14 | International Business Machines Corporation | Method and system for generating threads of documents |
US8726267B2 (en) | 2006-03-24 | 2014-05-13 | Red Hat, Inc. | Sharing software certification and process metadata |
US8190742B2 (en) | 2006-04-25 | 2012-05-29 | Hewlett-Packard Development Company, L.P. | Distributed differential store with non-distributed objects and compression-enhancing data-object routing |
US20070282959A1 (en) | 2006-06-02 | 2007-12-06 | Stern Donald S | Message push with pull of information to a communications computing device |
US8615800B2 (en) | 2006-07-10 | 2013-12-24 | Websense, Inc. | System and method for analyzing web content |
US7624118B2 (en) | 2006-07-26 | 2009-11-24 | Microsoft Corporation | Data processing over very large databases |
US8122026B1 (en) | 2006-10-20 | 2012-02-21 | Google Inc. | Finding and disambiguating references to entities on web pages |
US7853611B2 (en) * | 2007-02-26 | 2010-12-14 | International Business Machines Corporation | System and method for deriving a hierarchical event based database having action triggers based on inferred probabilities |
WO2009005744A1 (en) | 2007-06-29 | 2009-01-08 | Allvoices, Inc. | Processing a content item with regard to an event and a location |
US20090043792A1 (en) | 2007-08-07 | 2009-02-12 | Eric Lawrence Barsness | Partial Compression of a Database Table Based on Historical Information |
US9342551B2 (en) | 2007-08-14 | 2016-05-17 | John Nicholas and Kristin Gross Trust | User based document verifier and method |
GB2453174B (en) | 2007-09-28 | 2011-12-07 | Advanced Risc Mach Ltd | Techniques for generating a trace stream for a data processing apparatus |
KR100898339B1 (en) | 2007-10-05 | 2009-05-20 | 한국전자통신연구원 | Autonomous fault processing system in home network environments and operation method thereof |
US8396838B2 (en) | 2007-10-17 | 2013-03-12 | Commvault Systems, Inc. | Legal compliance, electronic discovery and electronic document handling of online and offline copies of data |
US8375073B1 (en) | 2007-11-12 | 2013-02-12 | Google Inc. | Identification and ranking of news stories of interest |
US8294763B2 (en) | 2007-12-14 | 2012-10-23 | Sri International | Method for building and extracting entity networks from video |
US8326847B2 (en) * | 2008-03-22 | 2012-12-04 | International Business Machines Corporation | Graph search system and method for querying loosely integrated data |
US20100077001A1 (en) | 2008-03-27 | 2010-03-25 | Claude Vogel | Search system and method for serendipitous discoveries with faceted full-text classification |
US8712926B2 (en) | 2008-05-23 | 2014-04-29 | International Business Machines Corporation | Using rule induction to identify emerging trends in unstructured text streams |
US8358308B2 (en) | 2008-06-27 | 2013-01-22 | Microsoft Corporation | Using visual techniques to manipulate data |
CA2686796C (en) | 2008-12-03 | 2017-05-16 | Trend Micro Incorporated | Method and system for real time classification of events in computer integrity system |
US8874576B2 (en) | 2009-02-27 | 2014-10-28 | Microsoft Corporation | Reporting including filling data gaps and handling uncategorized data |
GB0904113D0 (en) * | 2009-03-10 | 2009-04-22 | Intrasonics Ltd | Video and audio bookmarking |
US20100235311A1 (en) * | 2009-03-13 | 2010-09-16 | Microsoft Corporation | Question and answer search |
US8213725B2 (en) | 2009-03-20 | 2012-07-03 | Eastman Kodak Company | Semantic event detection using cross-domain knowledge |
US8161048B2 (en) * | 2009-04-24 | 2012-04-17 | At&T Intellectual Property I, L.P. | Database analysis using clusters |
US8055933B2 (en) | 2009-07-21 | 2011-11-08 | International Business Machines Corporation | Dynamic updating of failover policies for increased application availability |
EP2488960A4 (en) * | 2009-10-15 | 2016-08-03 | Hewlett Packard Entpr Dev Lp | Heterogeneous data source management |
US8645372B2 (en) | 2009-10-30 | 2014-02-04 | Evri, Inc. | Keyword-based search engine results using enhanced query strategies |
US20110125764A1 (en) | 2009-11-26 | 2011-05-26 | International Business Machines Corporation | Method and system for improved query expansion in faceted search |
US8583647B2 (en) | 2010-01-29 | 2013-11-12 | Panasonic Corporation | Data processing device for automatically classifying a plurality of images into predetermined categories |
US9710556B2 (en) * | 2010-03-01 | 2017-07-18 | Vcvc Iii Llc | Content recommendation based on collections of entities |
US8595234B2 (en) | 2010-05-17 | 2013-11-26 | Wal-Mart Stores, Inc. | Processing data feeds |
US8429256B2 (en) | 2010-05-28 | 2013-04-23 | Red Hat, Inc. | Systems and methods for generating cached representations of host package inventories in remote package repositories |
US8345998B2 (en) | 2010-08-10 | 2013-01-01 | Xerox Corporation | Compression scheme selection based on image data type and user selections |
US8321443B2 (en) | 2010-09-07 | 2012-11-27 | International Business Machines Corporation | Proxying open database connectivity (ODBC) calls |
US20120102121A1 (en) * | 2010-10-25 | 2012-04-26 | Yahoo! Inc. | System and method for providing topic cluster based updates |
US8423522B2 (en) | 2011-01-04 | 2013-04-16 | International Business Machines Corporation | Query-aware compression of join results |
US20120246154A1 (en) | 2011-03-23 | 2012-09-27 | International Business Machines Corporation | Aggregating search results based on associating data instances with knowledge base entities |
KR20120134916A (en) | 2011-06-03 | 2012-12-12 | 삼성전자주식회사 | Storage device and data processing device for storage device |
US20120310934A1 (en) | 2011-06-03 | 2012-12-06 | Thomas Peh | Historic View on Column Tables Using a History Table |
US9104979B2 (en) | 2011-06-16 | 2015-08-11 | Microsoft Technology Licensing, Llc | Entity recognition using probabilities for out-of-collection data |
WO2013003770A2 (en) | 2011-06-30 | 2013-01-03 | Openwave Mobility Inc. | Database compression system and method |
US9032387B1 (en) | 2011-10-04 | 2015-05-12 | Amazon Technologies, Inc. | Software distribution framework |
US9026480B2 (en) | 2011-12-21 | 2015-05-05 | Telenav, Inc. | Navigation system with point of interest classification mechanism and method of operation thereof |
US9037579B2 (en) | 2011-12-27 | 2015-05-19 | Business Objects Software Ltd. | Generating dynamic hierarchical facets from business intelligence artifacts |
US9251250B2 (en) * | 2012-03-28 | 2016-02-02 | Mitsubishi Electric Research Laboratories, Inc. | Method and apparatus for processing text with variations in vocabulary usage |
US10908792B2 (en) | 2012-04-04 | 2021-02-02 | Recorded Future, Inc. | Interactive event-based information system |
US9483513B2 (en) * | 2012-04-30 | 2016-11-01 | Sap Se | Storing large objects on disk and not in main memory of an in-memory database system |
US10162766B2 (en) * | 2012-04-30 | 2018-12-25 | Sap Se | Deleting records in a multi-level storage architecture without record locks |
US20130290232A1 (en) * | 2012-04-30 | 2013-10-31 | Mikalai Tsytsarau | Identifying news events that cause a shift in sentiment |
US8948789B2 (en) | 2012-05-08 | 2015-02-03 | Qualcomm Incorporated | Inferring a context from crowd-sourced activity data |
US9703833B2 (en) | 2012-11-30 | 2017-07-11 | Sap Se | Unification of search and analytics |
US9542652B2 (en) | 2013-02-28 | 2017-01-10 | Microsoft Technology Licensing, Llc | Posterior probability pursuit for entity disambiguation |
US9104710B2 (en) * | 2013-03-15 | 2015-08-11 | Src, Inc. | Method for cross-domain feature correlation |
US8977600B2 (en) | 2013-05-24 | 2015-03-10 | Software AG USA Inc. | System and method for continuous analytics run against a combination of static and real-time data |
CN103365974A (en) * | 2013-06-28 | 2013-10-23 | 百度在线网络技术(北京)有限公司 | Semantic disambiguation method and system based on related words topic |
US9734221B2 (en) | 2013-09-12 | 2017-08-15 | Sap Se | In memory database warehouse |
US9201744B2 (en) | 2013-12-02 | 2015-12-01 | Qbase, LLC | Fault tolerant architecture for distributed computing systems |
US9025892B1 (en) | 2013-12-02 | 2015-05-05 | Qbase, LLC | Data record compression with progressive and/or selective decomposition |
US9223875B2 (en) | 2013-12-02 | 2015-12-29 | Qbase, LLC | Real-time distributed in memory search architecture |
US9424294B2 (en) | 2013-12-02 | 2016-08-23 | Qbase, LLC | Method for facet searching and search suggestions |
2014
- 2014-12-01 WO PCT/US2014/067918 (WO2015084724A1): Application Filing
- 2014-12-01 EP EP14868541.5A (EP3077919A4): Withdrawn
- 2014-12-01 KR KR1020167017515A (KR20160124742A): Application Discontinuation
- 2014-12-01 CN CN201480072968.3A (CN106164890A): Pending
- 2014-12-01 JP JP2016536850A (JP6284643B2): Active
- 2014-12-01 CA CA2932399A (CA2932399A1): Abandoned
- 2014-12-02 US US14/557,794 (US9239875B2): Active
2015
- 2015-12-28 US US14/979,703 (US20160110446A1): Abandoned
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018005203A1 (en) * | 2016-06-28 | 2018-01-04 | Microsoft Technology Licensing, Llc | Leveraging information available in a corpus for data parsing and predicting |
US10200397B2 (en) | 2016-06-28 | 2019-02-05 | Microsoft Technology Licensing, Llc | Robust matching for identity screening |
US10311092B2 (en) | 2016-06-28 | 2019-06-04 | Microsoft Technology Licensing, Llc | Leveraging corporal data for data parsing and predicting |
CN106991171A (en) * | 2017-03-25 | 2017-07-28 | 贺州学院 | Topic based on Intelligent campus information service platform finds method |
Also Published As
Publication number | Publication date |
---|---|
US20150154286A1 (en) | 2015-06-04 |
JP6284643B2 (en) | 2018-02-28 |
CA2932399A1 (en) | 2015-06-11 |
JP2016541069A (en) | 2016-12-28 |
KR20160124742A (en) | 2016-10-28 |
EP3077919A1 (en) | 2016-10-12 |
WO2015084724A1 (en) | 2015-06-11 |
CN106164890A (en) | 2016-11-23 |
EP3077919A4 (en) | 2017-05-10 |
US9239875B2 (en) | 2016-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9239875B2 (en) | | Method for disambiguated features in unstructured text |
US9201931B2 (en) | | Method for obtaining search suggestions from fuzzy score matching and population frequencies |
US10725836B2 (en) | | Intent-based organisation of APIs |
US9613166B2 (en) | | Search suggestions of related entities based on co-occurrence and/or fuzzy-score matching |
US9424524B2 (en) | | Extracting facts from unstructured text |
US9626623B2 (en) | | Method of automated discovery of new topics |
US9619571B2 (en) | | Method for searching related entities through entity co-occurrence |
US10810215B2 (en) | | Supporting evidence retrieval for complex answers |
WO2015084757A1 (en) | | Systems and methods for processing data stored in a database |
US9507834B2 (en) | | Search suggestions using fuzzy-score matching and entity co-occurrence |
US20170124090A1 (en) | | Method of discovering and exploring feature knowledge |
JP6145562B2 (en) | | Information structuring system and information structuring method |
US20160085760A1 (en) | | Method for in-loop human validation of disambiguated features |
CN113656574A (en) | | Method, computing device and storage medium for search result ranking |
Li | | Connecting Text with Knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: QBASE, LLC, VIRGINIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: LIGHTNER, SCOTT; WECKESSER, FRANZ; BODDHU, SANJAY; AND OTHERS; SIGNING DATES FROM 20141201 TO 20141202; REEL/FRAME: 037363/0166 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |