WO2021178750A1 - Recursive evidence extraction method

Info

Publication number: WO2021178750A1
Authority: WIPO (PCT)
Application number: PCT/US2021/021014
Other languages: French (fr)
Inventors: Jean-Marie LAIGLE, Carlos COLLANTES, Eric Hare
Original assignee: Belmont Technology Inc.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology



Abstract

A method for extracting information from documents includes accepting a document as input to a programmed computer. In the computer, text is processed to identify named entities in the text. In the computer, the identified named entities are disambiguated. In the computer, presence of concepts in the document is established using an artificial neural network trained using established relationships between predetermined named entities and predetermined concepts. In the computer, parts of the document where established concepts occur are classified. In the computer, the disambiguated entities are characterized according to attributes associated with the disambiguated entities. In the computer, the characterized entities are used to update a database comprising published information corresponding to the characterized entities and the established present concepts.

Description

RECURSIVE EVIDENCE EXTRACTION METHOD
Background
[0001] This disclosure relates to the field of document processing to extract information and to update information databases with the information extracted from the processed documents. More specifically, the disclosure relates to methods for extracting information from documents such that entities in the documents can be characterized with respect to abstract concepts present in a “knowledge graph”, and new knowledge may be obtained from the processed documents with minimal operator or user supervision.
[0002] A knowledge base is a set of truths or facts that may be considered unbiased. Automated construction of a knowledge base requires an information processing system that accepts as input a large number of documents and identifies and characterizes relationships between entities recited in the documents. The relationships are computationally enhanced with additional processing of documents through repetition or mention in multiple documents. Depending on the type of documents processed, an automated knowledge base construction system can construe certain words or phrases in any input document with their correct meaning even where the text is ambiguous and lacks explicit context to resolve the ambiguity. Automated knowledge base construction is a challenging task which has been investigated through multiple approaches. Examples of automated knowledge base construction systems include Knowledge Vault (Dong et al., 2014), NELL (Mitchell et al., 2015), and the DeepDive system (Zhang, 2015). The DeepDive system powers a geoscience knowledge base construction system named GeoDeepDive.
Summary
[0003] One aspect of the present disclosure is a method for extracting information from documents. A method, according to this aspect of the disclosure, includes accepting a document as input to a programmed computer. In the computer, text is processed to identify named entities in the text. In the computer, the identified named entities are disambiguated. In the computer, presence of concepts in the document is established using a neural network trained using established relationships between predetermined named entities and predetermined concepts. In the computer, parts of the document where established concepts occur are classified. In the computer, the disambiguated entities are characterized according to attributes associated with the disambiguated entities. In the computer, the characterized entities are used to update a database comprising published information corresponding to the characterized entities and the established present concepts.
[0004] A computer program according to another aspect of this disclosure is stored in a non-transitory computer readable medium. The program comprises logic operable to cause a computer to perform actions including accepting a document as input to the computer. In the computer, text is processed to identify named entities in the text. In the computer, the identified named entities are disambiguated. In the computer, presence of concepts in the document is established using an artificial neural network trained using established relationships between predetermined named entities and predetermined concepts. In the computer, parts of the document where established concepts occur are classified. In the computer, the disambiguated entities are characterized according to attributes associated with the disambiguated entities. In the computer, the characterized entities are used to update a database comprising published information corresponding to the characterized entities and the established present concepts.
[0005] In some embodiments, the identifying named entities comprises, in the computer, generating noun groupings from the input document and searching the database for presence of the noun groupings.
[0006] In some embodiments, the disambiguating comprises, in the computer, using a trained neural network to determine contextual word embeddings.
[0007] In some embodiments, the characterizing the disambiguated entities comprises, in the computer, determining attributes of each disambiguated entity based on information in the database.
[0008] Other aspects and possible advantages will be apparent from the description and claims that follow.
Brief Description of the Drawings
[0009] FIG. 1 shows a flow chart of an example embodiment of a process according to the present disclosure.
[0010] FIG. 2 shows an example computer system that may be used in accordance with the present disclosure.
Detailed Description
[0011] A method according to the present disclosure accepts documents as input. The documents may comprise text, and metadata such as tables, graphs and illustrative figures. The method may also accept as input: named “entities”, that is, terms relevant to a particular technical or business field of interest, i.e., a domain; and relationships between the entities in a particular context. An end result of the method may comprise a “knowledge graph”. A knowledge graph, as that term is used in this disclosure, consists of a set of nodes and relationships between the nodes. The nodes comprise processed-document nodes, detected named entity nodes, and concept and paragraph classification nodes. Relationships are connections between the nodes. Relationships can be of different types (e.g., logical operators, LINKED TO, RELATED TO, IS IN, HAS...) and can be directional between certain types of nodes. Computed attributes can be stored in the relationships between nodes (primarily between the document and the other entity nodes). In order for the disclosed method to produce and deliver that knowledge graph, the method uses a Ground Truth Knowledge Graph that comprises a network of a priori or predetermined related and relevant concepts for a given industry, technical field or other subject of interest, and that encapsulates objective knowledge from, e.g., subject matter experts for that industry, technical field or other subject of interest. This knowledge can be, for example, identifiable countries, states, counties, fields, schools, businesses, etc., and how these entities are related to each other (e.g., by geographic location, relevant industry, number of employees in a business organization, etc.). A minimal sketch of this node/relationship structure appears below.
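The following is a minimal sketch of the structure just described, using the networkx library: typed, directional relationships between document, named entity and concept nodes, with a computed attribute stored on the relationship. The node names, relationship types and the relevance values are illustrative assumptions, not data from the disclosure.

```python
# A minimal sketch of the knowledge graph structure described above.
# Node names, relationship types and attribute values are assumptions.
import networkx as nx

kg = nx.DiGraph()  # directed, since relationships can be directional

# Nodes: a processed document, a detected named entity, and a concept node.
kg.add_node("doc-001", kind="document")
kg.add_node("Coral field", kind="named_entity")
kg.add_node("Thermal Maturity", kind="concept")

# Typed relationships; computed attributes (here, a relevance score) are
# stored on the relationship, primarily between document and entity nodes.
kg.add_edge("doc-001", "Coral field", type="LINKED_TO", relevance=0.97)
kg.add_edge("doc-001", "Thermal Maturity", type="RELATED_TO", relevance=0.84)
kg.add_edge("Coral field", "Rovuma Basin", type="IS_IN")  # adds the basin node

print(kg.edges(data=True))
```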
[0012] The disclosed method shares similarities at a high level with the GeoDeepDive system set forth in the Background section herein, but the presently disclosed method is specifically oriented toward precision update of a “Ground Truth Knowledge Graph”, while having modularity to develop custom applications, for example, for content classification. In addition, the core algorithms for entity extraction and disambiguation are different from those in systems known in the art prior to the present disclosure. In methods known in the art prior to the present disclosure, such as Knowledge Vault, the type of disambiguation used is based on triplets of facts. See, Dong X., Gabrilovich E., Heitz G., Horn W., Lao N., Murphy K., Strohmann T., Sun S., Zhang W. [2014] Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In this disclosure, context, rather than triplets of facts, is used to disambiguate entities. As used herein, a “knowledge graph” is a generic term for any graph made of nodes and edges/relationships, where the nodes, and the relationships between elements of the graph, are linked to knowledge such as technical knowledge in the subject matter of interest. A “ground truth” knowledge graph may be an initial knowledge graph for a particular technical field, industry or subject of interest used to extract data from a document to be processed according to the present method. The initial “ground truth” knowledge graph may be expanded as more documents are processed according to the presently disclosed method. The relevant part of the knowledge graph is that the nodes are the set of “nouns” (entities) that are to be identified in a document, and the relationships between these nodes are used in the present method to determine relevancy for these “nouns” (entities) detected in the paragraphs of the document.
[0013] An example embodiment of a method according to the present disclosure will now be explained with reference to the flow chart in FIG. 1. At 101, digitization of documents used as input to the method is performed. Digitization of a document may comprise extracting the document structure, and meta information such as document title and names of the authors. Digitization may comprise separating the document into sections, identifying the publication in which the document was disclosed, the publication date (e.g., year) and other identifying information. Document file format is not a limitation on the scope of the present disclosure; the present example embodiment has been developed to accept input from MICROSOFT POWERPOINT documents, MICROSOFT WORD documents, portable document files, email messages and plain text files. MICROSOFT POWERPOINT and MICROSOFT WORD are trademarks of Microsoft Corporation, Redmond, WA.
[0014] At 102, the method may comprise performing tokenization on the digitized documents. Tokenization may comprise identifying document structural elements such as tables and figures; identifying paragraph indentation; and identifying line breaks in the text. The shortest “paragraphs” (e.g., blocks of separated text) may then be filtered in order to ensure that bits of text such as portable document file (PDF) headers or footnotes are excluded from further analysis. The user can set a minimum filter threshold in order to exclude text that is very short in length, e.g., below a predetermined word count threshold, such as short figure captions, from the subsequent analysis. Tokenization as that term is used with reference to 102 in FIG. 1 comprises maintaining the original structure of the paragraphs from the input (or source) documents as closely as practical. At 102 in FIG. 1, the tokens may be structural paragraphs in the document.
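The paragraph filtering at 102 lends itself to a short illustration. Below is a minimal sketch of tokenization into structural paragraphs with a user-set minimum word-count filter; the blank-line paragraph convention, the function name and the threshold value are illustrative assumptions.

```python
# A minimal sketch of tokenization into structural paragraphs (102 in
# FIG. 1) with a user-set word-count filter. Names and values are assumed.
def tokenize_paragraphs(text: str, min_words: int = 8) -> list[str]:
    """Split text into paragraph tokens and drop very short blocks such as
    PDF headers, footnotes and short figure captions."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if len(p.split()) >= min_words]

doc = ("Fig. 3: Porosity map.\n\n"
       "The Coral field lies in a deepwater sedimentary basin offshore "
       "Mozambique and produces from Miocene turbidite sandstone reservoirs.")
print(tokenize_paragraphs(doc))  # the short figure caption is filtered out
```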
[0015] At 103, any input documents not already in PDF format may be converted to PDF format to facilitate further processing and to facilitate display of the original document in a user interface for other processes. PDF is an abbreviation for portable document format, which is an open standard document computer file format.
[0016] At 104, Named Entity Recognition may be performed. Named Entity Recognition is a well-known task in Natural Language Processing, which comprises identifying named entities (features) in the text and classifying the identified features into predefined categories. For example, in geoscience documents, it is important to be able to identify recitation in the processed documents of particular features such as sedimentary basins and geologic provinces, petroleum producing fields, lease or unit blocks, wells, geological formations, or geological ages. In geoscience documents, the foregoing features constitute a framework to anchor information provided in other documents processed according to the present method.
[0017] Named Entity Recognition comprises separating the paragraphs into tokens and comparing these tokens to a list of candidate entities that may originate from the ground truth knowledge graph. The candidates can be single tokens or multipart tokens.
[0018] It has been observed that relatively simple approaches to named entity recognition, for example, approaches based on pattern matching, have proven successful where a document text is carefully written, but may be inadequate otherwise. Scientific publications are examples of usually carefully written documents; such simple approaches have proven less successful when processing typical business documents such as informal meeting reports, presentations, emails and the like. The named entities detected in the document being processed are derived from the initial and updated ground truth knowledge graph; that is to say, tokens in the paragraph are compared to a list of candidates created from nodes in the knowledge graph. The process used to match identified entities to the knowledge graph may be implemented simply. The list of candidate entities, however, generally requires initializing the knowledge base with named entities generated by subject matter experts, i.e., to determine the validity of identifying a particular entity as related to a particular business context, technical context or subject of interest. The disclosed method, therefore, may use language modeling network algorithms (see, e.g., Devlin et al., 2018) to compute contextual word embeddings (an embedding being a vector of n dimensions that describes numerically each word/token in a paragraph), which are transmitted to a neural network for classification into various categories. Words and tokens in a paragraph may be converted (see, e.g., Devlin et al., 2018) to a string of numbers (a vector), and the vector for the same word or token may be different (may contain different numerical values) according to the context in which the words and tokens appear in any specific paragraph in any document; these are the contextual word embeddings (numerical representations of words/tokens within a context, i.e., a paragraph). In this context, tokenization means separating paragraphs into smaller elements (words in some cases, word phrases or groupings in other cases).
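The token-to-candidate comparison in paragraph [0017] can be sketched as an n-gram lookup against candidates generated from knowledge graph nodes. The candidate list, the entity-type labels and the greedy longest-match strategy below are illustrative assumptions, not the disclosed algorithm.

```python
# A sketch of comparing paragraph tokens against single- and multipart
# candidate entities originating from the ground truth knowledge graph.
import re

CANDIDATES = {
    ("september",): "oil_field?",   # ambiguous until disambiguation (105)
    ("niger", "delta"): "basin",
    ("agbada",): "formation",
}

def match_entities(paragraph: str, candidates=CANDIDATES, max_len: int = 3):
    tokens = re.findall(r"[a-z0-9]+", paragraph.lower())
    hits, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):  # prefer longer, multipart matches
            gram = tuple(tokens[i:i + n])
            if gram in candidates:
                hits.append((" ".join(gram), candidates[gram]))
                i += n - 1
                break
        i += 1
    return hits

print(match_entities("The September field produces from the Agbada Formation, Niger Delta."))
# [('september', 'oil_field?'), ('agbada', 'formation'), ('niger delta', 'basin')]
```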
[0019] The language modeling network algorithms (see, e.g., Devlin et al., 2018) convert paragraphs in the processed document to corresponding strings of numbers by reading the paragraph text forward and backward. The language modeling network output allows for the detection of very minute details in a document, to determine which of several different entities is represented by a given occurrence of a word. For example, in geoscience, it may be determined whether the word is used to identify a petroleum producing field, a geologic basin, a well, a country, a state, a county, a formation, or a geological concept, among other entities. See, e.g., Devlin et al., 2018. Such identification can be performed because the contextual vectors, represented by the aforementioned strings of numbers, encode precise meaning about the entity and the context surrounding the entity in the paragraph.
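As a sketch of the contextual word embeddings just described, the snippet below computes per-token vectors with a BERT-style model (Devlin et al., 2018) via the Hugging Face transformers library. The disclosure does not name an implementation, so the model choice and pooling are assumptions.

```python
# A sketch of contextual word embeddings: one vector of n dimensions per
# token, dependent on the surrounding paragraph. The model name
# ("bert-base-uncased") is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_embeddings(paragraph: str) -> torch.Tensor:
    inputs = tok(paragraph, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)  # reads the whole paragraph bidirectionally
    return out.last_hidden_state[0]  # shape: (num_tokens, 768)

a = contextual_embeddings("The September field produces oil offshore Egypt.")
b = contextual_embeddings("Drilling resumed in September after the storm.")
# The vectors for "september" differ between a and b, encoding the
# month-versus-field distinction that disambiguation later exploits.
```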
[0020] At 109, New Knowledge Discovery and New Facts Extraction may be performed.
This process is accomplished by running any token or set of tokens not available in the existing ground truth knowledge graph through the machine-learning-trained neural network explained with reference to 104 in FIG. 1; those entities with a classifier type probability higher than a certain threshold are considered candidates for new knowledge, because the corresponding token is used in the appropriate way in the given context of a paragraph. The foregoing process elements are oriented toward automatically building new relationships between entities and existing abstract concepts while limiting the rate of false positives. One consequence is that the disclosed method enables training models to discover new knowledge, as explained herein with reference to 109 in FIG. 1. New knowledge may comprise a set of entities and concepts that do not, at the time a given document is processed, exist in the Ground Truth Knowledge Graph, but that can be inferred from the current document’s paragraphs and context. At this point there may be two or more detected entities, some of which have the same name but are of different types.
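A minimal sketch of this thresholding step follows. It assumes a trained type classifier (such as the one sketched below under Named Entity Disambiguation) and precomputed embeddings for tokens absent from the graph; the threshold value and names are illustrative.

```python
# A sketch of the new-knowledge filter: tokens not present in the ground
# truth knowledge graph are scored by a trained type classifier, and only
# those whose top class probability exceeds a threshold become candidates
# for new knowledge. Threshold and tensor shapes are assumptions.
import torch

def new_knowledge_candidates(unknown_tokens, embeddings, classifier,
                             threshold=0.9):
    """unknown_tokens: list of str; embeddings: (len(tokens), emb_dim)."""
    with torch.no_grad():
        probs = torch.softmax(classifier(embeddings), dim=-1)
    top_p, top_type = probs.max(dim=-1)
    return [(tok, int(t))
            for tok, p, t in zip(unknown_tokens, top_p, top_type)
            if p >= threshold]
```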
At 105, Named Entity Disambiguation is performed. Named Entity Disambiguation comprises establishing a link, correlation or relationship between entities successfully detected in the input document (at 104 in FIG. 1) and entities in the Ground Truth Knowledge Graph. Named Entity Disambiguation is performed with a set of trained neural networks used to compute the probability of these entities being of a given type. The entities with the highest probability above a certain threshold are considered valid from the perspective of the subject of interest of the particular document, that is to say, as having a high probability of being of a given entity type. The disclosed method is able to calculate, from the Ground Truth Knowledge Graph, the expected connections between detected and disambiguated entities and the document.
[0021] For example, the word “September” may mean the name of an oil field in Egypt, the ninth month of the year in the Gregorian calendar, the name of an engineer, the name of a company, or the name of a town, among other meanings attributed to the word. Disambiguation provides the probability of the word September having one meaning or another given the context of the paragraph in which the word is found.
[0022] Disambiguation is performed by a neural network that is trained with contextual word embeddings, which may be derived from the paragraphs that have been tokenized, together with labeled examples from these contextual word embeddings. That is, the neural network may be trained with the word embedding vectors and a label associated with each word embedding.
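A minimal training sketch for such a disambiguation network follows. The architecture, dimensions and label set are illustrative assumptions, and random tensors stand in for real contextual embeddings and expert labels.

```python
# A sketch of training a disambiguation network on (contextual embedding,
# entity-type label) pairs. All dimensions and labels are assumptions.
import torch
import torch.nn as nn

EMB_DIM, N_TYPES = 768, 4  # e.g., {oil_field, month, company, irrelevant}

net = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, N_TYPES))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, EMB_DIM)           # stand-in contextual embeddings
y = torch.randint(0, N_TYPES, (256,))   # stand-in labels, one per embedding

for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()

# At inference, softmax over the logits gives the probability of a token
# ("September", say) being each entity type in its paragraph context;
# types above a chosen threshold are treated as valid, per the disclosure.
probs = torch.softmax(net(X[:1]), dim=-1)
```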
[0023] Relevance is the objective: to determine whether a detected entity (and its type, e.g., field, formation, basin, etc.) is relevant given the context. The probability of an entity being relevant is the output of the neural network model (disambiguation). Effectively, the probability is a direct measure of the likelihood that a given entity is relevant in a particular paragraph in a document.
[0024] Such an approach not only allows disambiguating efficiently, it also makes it possible to suggest missing entities in the Ground Truth database. The accuracy of this method (percentage of correctly disambiguated entities) has been shown to be higher than 98%.
[0025] At 106, Concept identification and content classification are performed. When concepts are identified in a document, the disclosed method uses a trained neural network to classify the content of the part of the document where the concept is recited, and to classify the concept accordingly. The process to train this neural network may comprise access to a database constructed by processing a large number (e.g., several thousand) of documents that have been roughly categorized for the classes used in content classification. Analysis of concept usage, such as frequency of occurrence within a document and within paragraphs, adjacency to other concepts within a paragraph, and positional value (e.g., “porosity degradation”, wherein “degradation” modifies “porosity”), produces a large data set of labeled examples used to train a neural network for paragraph content classification. Occurrences of concepts, for example, geoscience concepts present in the Ground Truth Knowledge Base, are systematically extracted from processed documents. For some of the extracted concepts it is useful to evaluate whether extracted keywords represent one or more particular concepts present in the Ground Truth Knowledge Base. Typical examples are words with multiple meanings both within and outside a Ground Truth Knowledge Base, such as ‘migration’, ‘source’, ‘basement’ or ‘fault’. Examples are provided where paragraph content is correctly classified into categories relevant for petroleum systems, such as ‘Charge’, ‘Thermal Maturity’, ‘Geochemistry’, ‘Fluids’, ‘Fluid flow’, ‘Pore Pressure’, ‘Drilling Operations’, ‘Mineralogy’, ‘Sedimentology’, ‘Structural Geology’, ‘Geodynamics’, etc. Such categories can be easily customized, and the classifiers retrained, because the language modeling network embeddings are stored during document processing.
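The point about stored embeddings can be illustrated briefly: because paragraph embeddings are persisted during document processing, adopting a new category scheme only requires retraining a lightweight classifier, not re-running the language model. The snippet below sketches this with scikit-learn; the embeddings and labels are random stand-ins.

```python
# A sketch of retraining a content classifier over cached language-model
# embeddings. Data here are random stand-ins; in practice they come from
# the stored document-processing output and curated category labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

cached = np.random.randn(500, 768)  # persisted paragraph embeddings
labels = np.random.choice(["Charge", "Geochemistry", "Pore Pressure"], 500)

clf = LogisticRegression(max_iter=1000).fit(cached, labels)
# Customizing the categories means relabeling and refitting clf only;
# the expensive embedding computation is never repeated.
```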
[0026] At 108, Recursive Named Entities Characterization may be performed as follows. Each named entity which is identified and disambiguated as explained above may then be systematically investigated according to the schema of the Ground Truth Knowledge Base. Systematic investigation may be performed using a recursive process which is parameterized by both the Knowledge Base ontology and the methods of the corresponding object model. For example, when a geological formation is identified in a document, the recursive extraction process tries to determine whether the formation is mentioned in the document as a petroleum source rock, a subsurface reservoir, or as a seal formation above a reservoir. Depending on the results, the recursive extraction process then tries to further determine mentions in the document of the depositional environment, lithology, key metrics, etc. To keep the approach generic and flexible, and in order to limit the amount of labeled data, the recursive process pragmatically leverages part-of-speech tagging and a vocabulary corpus to discover possible attributes (a sketch of this attribute discovery follows paragraph [0027] below). However, we recognize that it is also possible to train neural networks, or to fine-tune the BERT process (see Devlin et al., 2018), to establish these relations, the only limitation being the availability of a training data set. The discovered characteristics may be registered into the Knowledge Graph along with their source, and they can be leveraged to summarize key facts from a document or a data set. All facts extracted from a given document are registered and made available for further processing.
[0027] At 107, the Ground Truth Knowledge Graph is updated. This element of the process is the practical update of the current iteration of the knowledge graph. Updating the knowledge graph comprises adding the document being processed as a new node and establishing connections with new and preexisting graph nodes of types other than document. At this point the Knowledge Graph may have disambiguated entities with the same name and type.
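The sketch promised in paragraph [0026] follows: one level of the attribute discovery at 108, driven by part-of-speech tags and a small vocabulary corpus. The spaCy pipeline and the vocabularies are illustrative assumptions, not the disclosed object model.

```python
# A sketch of one level of the recursive characterization (108 in FIG. 1):
# part-of-speech tags plus a small vocabulary corpus suggest how a
# formation is mentioned. Vocabularies and pipeline are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

ROLE_VOCAB = {"source": "source_rock", "reservoir": "reservoir", "seal": "seal"}
LITHOLOGY_VOCAB = {"sandstone", "shale", "carbonate", "salt"}

def characterize(sentence: str) -> dict:
    attrs = {}
    for token in nlp(sentence):
        if token.pos_ in ("NOUN", "ADJ"):
            if token.lemma_ in ROLE_VOCAB:
                attrs["role"] = ROLE_VOCAB[token.lemma_]
            elif token.lemma_ in LITHOLOGY_VOCAB:
                attrs["lithology"] = token.lemma_
    return attrs

print(characterize("The Agbada Formation is a sandstone reservoir."))
# e.g., {'lithology': 'sandstone', 'role': 'reservoir'}; a discovered role
# would then trigger the next recursion step (depositional environment,
# key metrics, ...), per the disclosure.
```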
[0028] The nodes in the Ground Truth Knowledge Graph to be linked to the document nodes are the set of entities detected in the document. The entities, however, may not have unique names; therefore, extra information about these entities can be used to determine their relevance for that document. For example, there are several oil fields in the world named Coral, so in order to select the correct field for a given document, the method may use other entities having a pre-established relationship to the Coral entities appearing in the document, for example, sedimentary basin, country, geologic formation, oil company operator, rock mineral composition (lithology), among others. The method may enforce the condition that at least two related entities need to be identified in the same document in order for the specific entity to be identified as relevant, for example, two petroleum producing fields recited as belonging to the same sedimentary basin, one field and its associated sedimentary basin, one field and its country, or one field and its oil company operator. To be able to connect the identified entities in an input document to entities in the Ground Truth Knowledge Graph, in the disclosed method, the entities in the processed document and those already in the knowledge graph need to have a predetermined relationship to each other. This explains the need for an a priori knowledge graph so that more complex relationships between entities can be established. In the present example embodiment, each entity identified in a document may comprise a network of related entities, and each of these related entities may have its own network of further related entities. By finding the commonalities in each network, the present method calculates consistency. This means, for example, that several entities may become relevant for a given document because they have an a priori relationship together (information consistency), or that none becomes relevant because the document is sparse in details (e.g., geodetic location, etc.).
[0029] As a result of the entity disambiguation process described above, the rate of false positives, that is, false association of a word or phrase in a document with an entity, has been shown to be below 2%.
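The Coral example and the two-related-entities condition can be sketched directly on a graph. The graph contents below are illustrative assumptions; the rule shown is that an ambiguous entity is linked to the document only if at least one entity with a pre-established relationship to it is detected in the same document, so that at least two related entities co-occur.

```python
# A sketch of the relevance/consistency rule during the knowledge graph
# update (107 in FIG. 1). Graph contents are illustrative assumptions.
import networkx as nx

kg = nx.Graph()
kg.add_edge("Coral (field, Mozambique)", "Rovuma Basin")
kg.add_edge("Coral (field, Mozambique)", "Mozambique")
kg.add_edge("Coral (field, Gabon)", "Gabon")

def link_relevant_entities(kg, doc_id, detected):
    kg.add_node(doc_id, kind="document")
    for entity in sorted(detected):
        related_in_doc = [n for n in kg.neighbors(entity) if n in detected]
        if related_in_doc:  # the entity plus a related entity co-occur
            kg.add_edge(doc_id, entity, type="LINKED_TO")

detected = {"Coral (field, Mozambique)", "Coral (field, Gabon)", "Rovuma Basin"}
link_relevant_entities(kg, "doc-042", detected)
print(sorted(kg.neighbors("doc-042")))
# ['Coral (field, Mozambique)', 'Rovuma Basin']; the Gabon Coral is dropped
# because no entity related to it appears in the document.
```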
[0030] Knowledge graphs are important for building better search engines, question answering systems, recommendation engines, and feed algorithms for the cross analysis of multiple data sets. These algorithms map user queries to Knowledge Graph entities. A graph analysis is performed to determine the paths linking the concepts and to leverage the information discovered along these paths to construct answers for the users.
[0031] The updated Ground Truth Knowledge Graph may be used for specific purposes, including but not limited to: retrieving documents related to an entity designated by the user, such as a geographic location, geologic structure, geologic process or a type of hydrocarbon reservoir (examples may include documents related to the country Angola, and/or documents related to braided fluvial channels and/or secondary migration of reservoir fluids); and suggesting knowledge hidden among the documents. For example, when selecting documents related to the country Angola, the corpus of documents may show that 90% of them are linked to clastic hydrocarbon reservoirs but 10% are linked to carbonate reservoirs, meaning that historically oil in Angola has been produced primarily from sandstone reservoirs, but that possible new opportunities are opening in carbonate reservoirs, or that carbonates were identified but not considered relevant in the past and were becoming a relevant opportunity at the time the database was updated.
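A retrieval query of the kind just described can be sketched on the same graph structure as the earlier examples: collect documents linked to a user-designated entity, then tally a linked attribute across those documents to surface hidden patterns such as the 90/10 clastic/carbonate split. The node kinds and the breakdown logic are illustrative assumptions.

```python
# A sketch of querying the updated knowledge graph. Node kinds
# ("document", "reservoir_class") are illustrative assumptions.
from collections import Counter
import networkx as nx

def documents_about(kg: nx.Graph, entity: str) -> list[str]:
    return [n for n in kg.neighbors(entity)
            if kg.nodes[n].get("kind") == "document"]

def reservoir_breakdown(kg: nx.Graph, docs: list[str]) -> Counter:
    counts = Counter()
    for d in docs:
        for n in kg.neighbors(d):
            if kg.nodes[n].get("kind") == "reservoir_class":
                counts[n] += 1
    return counts  # e.g., Counter({'clastic': 9, 'carbonate': 1})
```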
[0032] FIG. 2 shows an example computing system 200 that may be used in accordance with some embodiments. The computing system 200 may be an individual computer system 201A or an arrangement of distributed computer systems. The individual computer system 201A may include one or more analysis modules 202 that may be configured to perform various tasks according to some embodiments, such as the tasks explained with reference to FIG. 1. To perform these various tasks, the analysis module 202 may operate independently or in coordination with one or more processors 204, which may be connected to one or more storage media 206. A display device 205, such as a graphic user interface of any known type, may be in signal communication with the processor 204 to enable user entry of commands and/or data and to display results of execution of a set of instructions according to the present disclosure.
[0033] The processor(s) 204 may also be connected to a network interface 208 to allow the individual computer system 201A to communicate over a data network 210 (e.g., a local network or the Internet) with one or more additional individual computer systems and/or computing systems, such as 201B, 201C, and/or 201D (note that computer systems 201B, 201C and/or 201D may or may not share the same architecture as computer system 201A, and may be located in different physical locations; for example, computer systems 201A and 201B may be at a well drilling location, while in communication with one or more computer systems such as 201C and/or 201D that may be located in one or more data centers on shore, aboard ships, and/or located in varying countries on different continents).
[0034] A processor may include, without limitation, a microprocessor, microcontroller, processor module or subsystem, programmable integrated circuit, programmable gate array, or another control or computing device.
[0035] The storage media 206 may be implemented as one or more computer-readable or machine-readable storage media. Note that while in the example embodiment of FIG. 2 the storage media 206 are shown as being disposed within the individual computer system 201A, in some embodiments, the storage media 206 may be distributed within and/or across multiple internal and/or external enclosures of the individual computing system 201A and/or additional computing systems, e.g., 201B, 201C, 201D. Storage media 206 may include, without limitation, one or more different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices. Note that computer instructions to cause any individual computer system or a computing system to perform the tasks described above may be provided on one computer-readable or machine-readable storage medium, or may be provided on multiple computer-readable or machine-readable storage media distributed in a multiple component computing system having one or more nodes. Such computer-readable or machine-readable storage medium or media may be considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
[0036] It should be appreciated that computing system 200 is only one example of a computing system, and that any other embodiment of a computing system may have more or fewer components than shown, may combine additional components not shown in the example embodiment of FIG. 2, and/or the computing system 200 may have a different configuration or arrangement of the components shown in FIG. 2. The various components shown in FIG. 2 may be implemented in hardware, software, or a combination of both hardware and software, including one or more signal processing and/or application specific integrated circuits.
[0037] Further, the acts of the processing methods described above may be implemented by running one or more functional modules in information processing apparatus such as general purpose processors or application specific chips, such as ASICs, FPGAs, PLDs, or other appropriate devices. These modules, combinations of these modules, and/or their combination with general hardware are all included within the scope of the present disclosure.
[0038] The disclosed method provides a comprehensive, flexible, scalable, end-to-end framework which allows processing large document databases, extracting, e.g., geological knowledge automatically, and updating knowledge graphs with a very low rate of false positives thanks to a thorough disambiguation process. While this disclosure discusses mostly the identification of geological content, the method can be easily parameterized to process any type of technical content. A simplified skeleton of this end-to-end flow is sketched below.
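The skeleton below is one possible, deliberately simplified rendering of that end-to-end flow. Every component is a toy stub (keyword matching instead of trained models, a dictionary instead of a graph database), included only to show how the described steps compose; none of the function names or heuristics is taken from the disclosed system.

```python
def identify_named_entities(text):
    # Placeholder NER: treat capitalized tokens as candidate entities.
    return {tok.strip(".,") for tok in text.split() if tok[0].isupper()}

def disambiguate_entities(candidates, graph):
    # Placeholder disambiguation: keep only candidates known to the graph.
    return {c for c in candidates if c in graph}

def establish_concepts(text):
    # Placeholder concept detection: keyword match instead of embeddings.
    return {c for c in ("reservoir", "basin") if c in text.lower()}

def update_graph(graph, entities, concepts):
    # Attach the established concepts to each characterized entity.
    for e in entities:
        graph.setdefault(e, set()).update(concepts)
    return graph

kg = {"Angola": set()}
doc = "New reservoir studies in Angola target the offshore basin."
entities = disambiguate_entities(identify_named_entities(doc), kg)
print(update_graph(kg, entities, establish_concepts(doc)))
# -> {'Angola': {'reservoir', 'basin'}} (set order may vary)
```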
References Cited in this Disclosure
Mikolov T., Chen K., Corrado G., Dean J. [2013] Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781
Devlin J., Chang M., Lee K., Toutanova K. [2018] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
Dong X., Gabrilovich E., Heitz G., Horn W., Lao N., Murphy K., Strohmann T., Sun S., Zhang W. [2014] Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. https://research.google/pubs/pub45634
Mitchell T., Cohen W., Hruschka E., Talukdar P., Betteridge J., Carlson A., Dalvi B., Gardner M., Kisiel B., Krishnamurthy J., Lao N., Mazaitis K., Mohamed T., Nakashole N., Platanios E., Ritter A., Samadi M., Settles B., Wang R., Wijaya D., Gupta A., Chen X., Saparov A., Greaves M., Welling J. [2015] Never-Ending Learning. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence
Zhang C. [2015] DeepDive: A Data Management System for Automatic Knowledge Base Construction. PhD thesis, University of Wisconsin-Madison
[0039] In light of the principles and example embodiments described and illustrated herein, it will be recognized that the example embodiments can be modified in arrangement and detail without departing from such principles. The foregoing discussion has focused on specific embodiments, but other configurations are also contemplated. In particular, even though expressions such as "in an embodiment" or the like are used herein, these phrases are meant to reference embodiment possibilities generally, and are not intended to limit the disclosure to particular embodiment configurations. As used herein, these terms may reference the same or different embodiments that are combinable into other embodiments. As a rule, any embodiment referenced herein is freely combinable with any one or more of the other embodiments referenced herein, and any number of features of different embodiments are combinable with one another, unless indicated otherwise. Although only a few examples have been described in detail above, those skilled in the art will readily appreciate that many modifications are possible within the scope of the described examples. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the following claims.

Claims

What is claimed is:
1. A method for extracting information from documents, comprising:
accepting a document as input to a programmed computer;
in the computer, processing text to identify named entities in the text;
in the computer, disambiguating the identified named entities;
in the computer, establishing presence of concepts in the document and characterizing paragraphs by evaluating contextual word embeddings from the concepts in a trained neural network;
in the computer, classifying parts of the document where established concepts occur;
in the computer, characterizing the disambiguated entities according to attributes associated with the disambiguated entities; and
in the computer, using the characterized entities to update a database comprising published information corresponding to the characterized entities and the established present concepts.
2. The method of claim 1 wherein the identifying named entities comprises, in the computer, generating noun groupings from the input document and searching the database for presence of the noun groupings.
3. The method of claim 1 wherein the disambiguating comprises, in the computer, using a trained neural network to classify contextual word embeddings produced by a language processing network.
4. The method of claim 1 wherein the characterizing the disambiguated entities comprises, in the computer, determining attributes of each disambiguated entity based on information in the database.
5. A computer program stored in a non-transitory computer readable medium, the program comprising logic operable to cause a computer to perform actions, comprising:
accepting a document as input to the computer;
in the computer, processing text to identify named entities in the text;
in the computer, disambiguating the identified named entities;
in the computer, establishing presence of concepts in the document using an artificial neural network trained using established relationships between predetermined named entities and predetermined concepts;
in the computer, classifying parts of the document where established concepts occur;
in the computer, characterizing the disambiguated entities according to attributes associated with the disambiguated entities; and
in the computer, using the characterized entities to update a database comprising published information corresponding to the characterized entities and the established present concepts.
6. The computer program of claim 5 wherein the identifying named entities comprises, in the computer, generating noun groupings from the input document and searching the database for presence of the noun groupings.
7. The computer program of claim 5 wherein the disambiguating comprises, in the computer, using a trained neural network to determine contextual word embeddings.
8. The computer program of claim 5 wherein the characterizing the disambiguated entities comprises, in the computer, determining attributes of each disambiguated entity based on information in the database.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202062985591P 2020-03-05 2020-03-05
US62/985,591 2020-03-05

Publications (1)

Publication Number Publication Date
WO2021178750A1 (en) 2021-09-10

Family ID=77613809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/021014 WO2021178750A1 (en) 2020-03-05 2021-03-05 Recursive evidence extraction method

Country Status (1)

Country Link
WO (1) WO2021178750A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080275694A1 (en) * 2007-05-04 2008-11-06 Expert System S.P.A. Method and system for automatically extracting relations between concepts included in text
US20130246322A1 (en) * 2012-03-15 2013-09-19 Cept Systems Gmbh Methods, Apparatus and Products for Semantic Processing of Text
WO2020005986A1 (en) * 2018-06-25 2020-01-02 Diffeo, Inc. Systems and method for investigating relationships among entities

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609346A (en) * 2021-10-08 2021-11-05 企查查科技有限公司 Natural person name disambiguation method, device and medium based on enterprise incidence relation


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 21765138
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 21765138
Country of ref document: EP
Kind code of ref document: A1