CN115295165A - Knowledge graph system for medical science and decision-making auxiliary method thereof

Info

Publication number: CN115295165A
Application number: CN202210865767.5A
Authority: CN
Inventors: 何璇; 袁文轩; 郭子健; 李雨芮; 李培宁; 刘云霞
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2022-11-04

Abstract

The invention relates to a knowledge graph system for medicine and a decision assistance method thereof. The knowledge graph system comprises: a data acquisition layer configured to crawl, by means of crawlers, incremental indexes that do not exist in the index database and to crawl the incremental content corresponding to the incremental indexes; a natural language processing layer configured to perform entity extraction and relation extraction on the incremental content by using a natural language processing tool to obtain triple information of the incremental content; a persistence layer configured to construct a knowledge graph from the triple information of the incremental content and to store the knowledge graph; and a medical application layer configured to provide services to a user based on the knowledge graph, wherein the medical application layer comprises a decision assistance module configured to provide a decision assistance scheme for the user. With this technical scheme, the latest literature information can be monitored in real time and the knowledge graph can be updated automatically.

Description

Knowledge graph system for medical science and decision-making auxiliary method thereof

Technical Field

The invention relates to the field of knowledge graphs, and in particular to a knowledge graph system for medicine and a decision assistance method thereof.

Background

With the development of life science and technology in recent years, the medical field has produced a series of remarkable achievements, and the volume of literature has reached an unprecedented scale. For medical workers, the medical literature is an important means of improving their own skills and exchanging experience. In today's data-intensive scientific era, the body of medical literature grows every day and brings a great deal of important new information. As a result, the structure of knowledge points in the medical field is increasingly complex, which makes it difficult for medical workers to grasp the latest research progress accurately and in time. More importantly, as the knowledge structure and knowledge context of the medical field grow more complicated, concepts such as genes, proteins, drugs, examinations, diseases, symptoms, and treatments exhibit complex interactions under different conditions, and scientific research is increasingly becoming a data-driven knowledge discovery activity.

To explain the complex contextual relationships among concepts, the knowledge structure can be expressed as a network topology in order to study the associations between pieces of knowledge. Meanwhile, with the development of natural language processing technology, the ways of acquiring knowledge have gradually multiplied. Knowledge bases such as the Unified Medical Language System (UMLS) and the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) have been developed over decades and have constructed hierarchical relationships between medical concepts.

At present, most medical knowledge bases, including SNOMED CT, ICD-10, and the drug database DrugBank, are compiled manually by experts through a lengthy process. However, as medical knowledge is continuously updated, manually compiled knowledge bases suffer from slow updates and a lack of flexibility.

Disclosure of Invention

(I) technical problem to be solved

In view of the above disadvantages and shortcomings of the prior art, the present invention provides a knowledge graph system for medicine and a decision assistance method thereof, which solve the technical problem in the prior art that the knowledge base is updated slowly because the database must be updated manually.

(II) technical scheme

In order to achieve the purpose, the invention adopts the main technical scheme that:

In a first aspect, an embodiment of the present invention provides a knowledge graph system for medicine, including: a data acquisition layer configured to crawl, by means of crawlers, an incremental index that does not exist in the index database and to crawl the incremental content corresponding to the incremental index; a natural language processing layer configured to perform entity extraction and relation extraction on the incremental content by using a natural language processing tool to obtain triple information of the incremental content; a persistence layer configured to construct a knowledge graph from the triple information of the incremental content and to store the knowledge graph; and a medical application layer configured to provide services to a user based on the knowledge graph, wherein the medical application layer comprises a decision assistance module configured to provide a decision assistance scheme for the user.

In one possible embodiment, the data acquisition layer includes: a crawler scheduler configured to crawl indexes related to preset medical keywords, deduplicate all the crawled indexes against the index database to obtain an incremental index, store the incremental index into the index database, and update the state of the incremental index, wherein the states of the incremental index include not crawled, being crawled, crawl completed, and crawl timed out; and a plurality of crawlers configured to request the incremental index from the crawler scheduler and to crawl the incremental content based on the incremental index.

In one possible embodiment, the data acquisition layer further comprises a data integrator configured to collect the incremental content crawled by the current crawler and to send a first message, indicating that the crawl is completed, to the crawler scheduler; the crawler scheduler is further configured to query, based on the first message, the state of the incremental index corresponding to the incremental content crawled by the current crawler, and to feed back a second message, indicating crawl failure, to the data integrator if that state is already crawl completed; and the data integrator is further configured to discard the incremental content crawled by the current crawler based on the second message.

In one possible embodiment, the data acquisition layer further comprises: a data cleaning module configured to perform format cleaning on the incremental content and to verify the validity of the incremental content, sending the incremental content to the natural language processing layer if it is valid and deleting it if it is not.

In one possible embodiment, the crawler scheduler is further configured to determine the crawl time of the incremental content, to mark the incremental index as crawl timed out if the crawl time exceeds a preset time, and to add the incremental index back to the not-crawled queue.

In one possible embodiment, the persistence layer comprises: a hierarchical archiving module configured to store the knowledge graph into a graph database and to store the incremental content and the characteristic information associated with the knowledge graph into a relational database, wherein the characteristic information comprises a node list and a relation list.

In one possible embodiment, the crawler scheduler is further configured to crawl indexes related to the preset medical keywords from the PubMed Central (PMC) database.

In a second aspect, an embodiment of the present invention provides a decision assistance method for a knowledge graph system for medicine, where the knowledge graph system is any one of the optional knowledge graph systems for medicine in the first aspect, and the decision assistance method includes: step S1, adding all entities contained in the query request into an initial root node queue, and initializing the weights of all nodes in the initial root node queue, wherein the initial root node queue is the smallest data body corresponding to all the entities, queried from the graph database through the persistence layer; step S2, updating the weights of all unmarked first adjacent nodes of each node in the initial root node queue to obtain an intermediate root node queue; step S3, updating the intermediate root node queue; step S4, traversing all nodes appearing in steps S1 to S3 and updating their weights, wherein the weight of each of these nodes is determined according to the weight of the current node and the number of relations connected to the current node; step S5, repeating steps S1 to S4, determining all paths between any two of the entities contained in the query request, computing the weight sum of each path, and outputting a subgraph consisting of the N paths with the highest weight sums, wherein N is a positive integer.

In one possible embodiment, step S2 comprises: step S21, acquiring all first adjacent nodes of a first node, where the first node is any one node in the initial root node queue, and performing the following steps for each of the first adjacent nodes: step S211, checking whether the current first adjacent node has been traversed, and if it has not been traversed, marking it, otherwise skipping step S211; step S212, updating the weight of the current first adjacent node: if the weight of the current first adjacent node is 0, calculating a first quotient value of the weight of the first node corresponding to the current first adjacent node divided by the weight attenuation rate, and taking the first quotient value as the weight of the current first adjacent node; otherwise, calculating the product of the weight of the current first adjacent node and the first quotient value, and taking the product as the weight of the current first adjacent node; step S213, taking the current first adjacent node as the starting node and repeating step S21, the loop stopping once the quotient of the weight of the current first adjacent node divided by the weight attenuation rate is less than or equal to 1; and step S22, clearing the marking information of all the nodes in the initial root node queue, and marking all the nodes in the initial root node queue as not traversed.

In one possible embodiment, step S3 comprises: step S31, removing a second node from the intermediate root node queue, where the second node is any one node in the intermediate root node queue; step S32, obtaining all second adjacent nodes of the second node, and executing the following steps for each of the second adjacent nodes: step S321, if the weight of the current second adjacent node is greater than or equal to a second quotient value, calculated as the weight of the second node divided by the weight attenuation rate, updating the weight of the current second adjacent node to the second quotient value and adding the current second adjacent node to the intermediate root node queue; step S322, if the weight of the current second adjacent node is smaller than the second quotient value and the current second adjacent node is not marked, adding the current second adjacent node to the intermediate root node queue; step S323, marking the current second adjacent node as traversed.

In a third aspect, an embodiment of the present application provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program performs the method according to the second aspect or any optional implementation manner of the second aspect.

In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the method of the second aspect or any of the alternative implementations of the second aspect.

In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the second aspect or any possible implementation of the second aspect.

(III) advantageous effects

The invention has the beneficial effects that:

The knowledge graph system for medicine and the decision assistance method thereof can monitor the latest literature information in real time and update the knowledge graph automatically, so that a user can keep up with the latest scientific research information at any time. This solves the prior-art problem that the knowledge base is updated slowly because the database must be updated manually.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 shows a schematic diagram of a knowledge graph system for medicine according to an embodiment of the present application;

Fig. 2 shows a flowchart of the construction and application of a medical knowledge graph according to an embodiment of the present application;

Fig. 3 shows a flowchart of a decision assistance method of a knowledge graph system for medicine according to an embodiment of the present application.

Detailed Description

For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.

In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

With the continuous development of automatic knowledge extraction technology, the triples in a knowledge graph can be extracted automatically by using related technologies such as machine learning, data mining, and artificial intelligence. For example, Medical Subject Headings (MeSH), the Unified Medical Language System (UMLS), SemRep, MetaMap, and SemMedDB all use automatic knowledge extraction, thereby ensuring that the knowledge base is updated in a timely manner.

For example, application No. 202110321813.0 discloses a knowledge graph based personalized knowledge service recommendation system. However, that method is suited to small-scale fields and helps the user learn existing knowledge on the basis of what the user already knows. Its knowledge graph must be updated manually by the user, so the process of reading a very large number of papers and extracting information from them cannot be avoided.

For another example, patent application No. 202111258604.2 discloses an assisted disease inference system based on a knowledge graph and an adaptive mechanism, which constructs a disease inference model from a knowledge graph whose data structure is the triple <symptom, site of occurrence, disease>, together with a TransE translation model and a naive Bayes classifier. However, both its knowledge graph and its inference means are relatively simple: it can only infer that a certain disease is present on the basis that a certain symptom occurs at a certain site. Such inference mostly addresses common-sense questions, whereas the present application aims to find relationships that may have been overlooked, to serve as scientific research directions and as diagnostic assistance.

Based on this, the embodiments of the present application provide a scheme that automatically constructs medical knowledge graph data from medical documents (such as papers) and builds a scientific research assistance platform on top of it; the scheme can be updated and expanded in real time based on incremental medical data. The display of search results can also be optimized, so that researchers can obtain a knowledge graph through a simple search and learn the knowledge points expressed in the medical literature, and the relationships between them, without reading the papers in full.

Referring to Fig. 1, Fig. 1 is a schematic diagram of a knowledge graph system for medicine according to an embodiment of the present application. The knowledge graph system shown in Fig. 1 includes a data acquisition layer, a natural language processing layer, a persistence layer, and a medical application layer. The data acquisition layer is configured to crawl, by means of crawlers, incremental indexes that do not exist in the index database and to crawl the incremental content corresponding to the incremental indexes; the natural language processing layer is configured to perform entity extraction and relation extraction on the incremental content by using a natural language processing tool to obtain triple information of the incremental content; the persistence layer is configured to construct a knowledge graph from the triple information of the incremental content and to store the knowledge graph; the medical application layer is configured to provide services to the user based on the knowledge graph.

It should be understood that the specific structure of the data acquisition layer, the specific structure of the natural language processing layer, the specific structure of the persistence layer, and the specific structure of the medical application layer may all be set according to actual needs, and the embodiments of the present application are not limited thereto.

Optionally, the data acquisition layer comprises a crawler scheduler, a plurality of crawlers, a data integrator, and a data cleansing module.

The crawler scheduler can crawl, through a preset interface, medical literature indexes related to preset medical keywords; deduplicate all the crawled medical literature indexes against the indexes already in the index database to obtain the incremental indexes; insert the incremental indexes into the index database; and mark the inserted incremental indexes as not crawled. An incremental index is an index obtained by the crawler that does not yet exist in the index database.
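The deduplication step can be sketched as follows. This is a minimal illustration rather than the scheduler's actual implementation: the index database is modeled as a plain dictionary, and the names `IndexState` and `add_incremental_indexes` are hypothetical.

```python
from enum import Enum


class IndexState(Enum):
    # The four index states named in the description.
    NOT_CRAWLED = "not crawled"
    CRAWLING = "being crawled"
    COMPLETED = "crawl completed"
    TIMED_OUT = "crawl timed out"


def add_incremental_indexes(crawled_indexes, index_db):
    """Insert indexes missing from index_db and return them in crawl order.

    index_db is a hypothetical in-memory stand-in for the index database:
    a dict mapping an index identifier to its IndexState.
    """
    incremental = []
    for index_id in crawled_indexes:
        if index_id in index_db:
            continue  # already known, so not an incremental index
        index_db[index_id] = IndexState.NOT_CRAWLED
        incremental.append(index_id)
    return incremental
```

Because each new index is written to the dictionary as soon as it is seen, an index that appears twice in one crawl batch is also reported only once.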

It should be understood that the specific interface of the preset interface may be set according to actual requirements, and the embodiment of the present application is not limited thereto.

For example, where the preset interface is a PubMed Central (PMC) interface, the crawler scheduler may crawl papers in the PMC database through the PMC interface and, in accordance with the requirements of the PMC interface, may crawl hundreds of thousands of indexes of papers containing the keywords each time. A paper index may contain the name of the paper, among other things.

It should also be understood that the specific type of medical document, the specific medical keywords, and the like may also be set according to actual needs, and the embodiments of the present application are not limited thereto.

For example, the medical literature may be at least one of paper, journal, and web content.

For another example, the medical keywords may be added manually through user input, and the manually added keywords also belong to the topic of the current knowledge graph. The topic can be set according to actual requirements.

Each of the crawlers can request an incremental index from the crawler scheduler through an interface of the crawler scheduler. The crawler scheduler can select an incremental index from the to-be-crawled queue, return it to the crawler, mark it as being crawled, and record the crawl start time. The crawler can then use the incremental index to crawl the freely accessible information related to the newly added document (i.e., the incremental content) and send it to the data integrator. The number of crawlers can be set according to actual needs, and the embodiments of the present application are not limited thereto.

It should be noted that one crawling task may be performed simultaneously by multiple crawlers (e.g., each of the crawlers may crawl the same paper according to the same index), or different crawling tasks may be performed.

The data integrator can receive the incremental content uploaded by a crawler through a data upload interface and, once the content is obtained, send a crawl-completed message to the crawler scheduler. The crawler scheduler can then mark the corresponding incremental index as crawl completed and return a crawl-success message. However, if the incremental index reported by the data integrator is already in the crawl-completed state (for example, when several crawlers crawl the same document at the same time and another crawler finished before the current one, so that the state of the shared incremental index has already changed to crawl completed), the crawler scheduler may return a crawl-failure message to the data integrator, which then discards the data.

In addition, the crawler scheduler can execute a timed task that traverses all documents currently being crawled; if the crawl of an item of incremental content has still not completed within the preset time, the scheduler marks the corresponding incremental index as crawl timed out and adds it back to the not-crawled queue for re-crawling.
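The scheduler behavior described above (handing out indexes, rejecting duplicate completions, and re-queuing timed-out crawls) can be sketched as a small state machine. This is an illustrative sketch only: the class and method names are hypothetical, the states are plain strings, and the current time is passed in explicitly so that the example stays deterministic.

```python
from collections import deque

NOT_CRAWLED = "not crawled"
CRAWLING = "being crawled"
COMPLETED = "crawl completed"
TIMED_OUT = "crawl timed out"


class CrawlerScheduler:
    def __init__(self, incremental_indexes, timeout_seconds):
        self.queue = deque(incremental_indexes)  # the not-crawled queue
        self.state = {i: NOT_CRAWLED for i in incremental_indexes}
        self.started_at = {}  # index -> crawl start time
        self.timeout_seconds = timeout_seconds

    def request_index(self, now):
        """Hand the next queued index to a crawler, or None if none is left."""
        while self.queue:
            index_id = self.queue.popleft()
            if self.state[index_id] == COMPLETED:
                continue  # finished late, after it had been re-queued
            self.state[index_id] = CRAWLING
            self.started_at[index_id] = now
            return index_id
        return None

    def report_completed(self, index_id):
        """Return True on success, False if the index was already completed."""
        if self.state.get(index_id) == COMPLETED:
            return False  # duplicate: the data integrator discards the content
        self.state[index_id] = COMPLETED
        self.started_at.pop(index_id, None)
        return True

    def requeue_timeouts(self, now):
        """Timed task: mark overdue crawls as timed out and queue them again."""
        overdue = [i for i, start in self.started_at.items()
                   if now - start > self.timeout_seconds]
        for index_id in overdue:
            self.state[index_id] = TIMED_OUT
            del self.started_at[index_id]
            self.queue.append(index_id)
        return overdue
```

Returning `False` from `report_completed` plays the role of the crawl-failure message: the caller is expected to drop the duplicate content rather than pass it on.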

It should be noted here that, according to the performance requirement of the crawler, the data integrator may exist independently as a computing node, may share the computing node with the crawler to save network resources, and may share the computing node with a data processing node in the following text to save computing resources.

And the data cleaning module can perform format cleaning on the crawled incremental content, perform validity verification on the incremental content, send the incremental content to the natural language processing layer if the incremental content has validity, and otherwise, delete the incremental content.

For example, whether the incremental content is valid can be determined by determining whether the document format and source, etc., of the incremental content is missing. If the incremental content is determined to be a paper, and the paper lacks a summary and the like, the paper is determined not to be valid; if the incremental content is determined to be a paper, and the literature format of the paper is complete and the source of the paper is valid, the paper can be determined to be valid.
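A validity check of this kind might look like the sketch below. The required field names (`title`, `abstract`, `source`) and the whitespace normalization used as "format cleaning" are assumptions made for illustration; the description does not fix a concrete document schema.

```python
REQUIRED_FIELDS = ("title", "abstract", "source")  # hypothetical field names


def clean_and_validate(document):
    """Return a cleaned copy of the document if it is valid, otherwise None."""
    # Format cleaning (assumed here to mean collapsing runs of whitespace).
    cleaned = {
        key: " ".join(value.split()) if isinstance(value, str) else value
        for key, value in document.items()
    }
    # Validity verification: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not cleaned.get(field):
            return None
    return cleaned
```

A `None` result corresponds to deleting the incremental content; any other result would be forwarded to the natural language processing layer.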

Optionally, the natural language processing layer may include a word segmentation module, a named entity module, a part-of-speech tagging module, and a relationship extraction module.

The word segmentation module can segment the incremental content (e.g., an article) into words according to part of speech; for example, a subject or a predicate may be separated out by part of speech. The named entity module can extract useful terms according to the surrounding words; for example, it determines whether a word is functional or meaningful in the current sentence and, if not, may delete the word. The part-of-speech tagging module performs part-of-speech tagging; for example, it can determine which entity is the active entity that initiates a relationship and which entity is the passive entity. The relationship extraction module can integrate the preceding data (e.g., the entities and their parts of speech) so that the relationships between entities can be determined from the entities and their parts of speech.

It should be noted here that all modules in the natural language processing layer can be implemented by SemRep. SemRep is a semantic analysis and data mining system based on natural language processing technology. Drawing on the Semantic Network, the Metathesaurus, and the SPECIALIST Lexicon in UMLS, it extracts the term relationships that a paper describes in natural language and finally yields triples of the form entity A - relation type - entity B, where entity A may be the active entity and entity B the passive entity.
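To make the triple structure concrete, the sketch below models a triple as a named tuple and derives a node list, a relation list, and an outgoing-edge map from a batch of triples. The names `Triple` and `build_graph` are hypothetical, the sample triples in the usage are made up, and this does not reproduce SemRep's actual output format.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    entity_a: str  # active entity
    relation: str  # relation type
    entity_b: str  # passive entity


def build_graph(triples):
    """Build a node list, a relation list, and an outgoing-edge map."""
    nodes, relations = [], []
    edges = defaultdict(list)
    for triple in triples:
        for entity in (triple.entity_a, triple.entity_b):
            if entity not in nodes:
                nodes.append(entity)
        if triple.relation not in relations:
            relations.append(triple.relation)
        # Edges run from the active entity to the passive entity.
        edges[triple.entity_a].append((triple.relation, triple.entity_b))
    return nodes, relations, dict(edges)
```

For instance, `build_graph([Triple("aspirin", "TREATS", "headache")])` yields the nodes `aspirin` and `headache` joined by one `TREATS` edge.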

The persistence layer can include a relationship/entity storage module, a hierarchical archiving module, a distributed storage module, and a multi-level cache module.

The relationship/entity storage module can store the triple information of the incremental content in graph form, and the hierarchical archiving module can perform hierarchical archiving. For example, the present application may be provided with a graph database and a relational database. The graph database stores data by nodes and relations, that is, in the form of the knowledge graph, so a later query for a path from a first entity to a second entity, which is a relatively slow query operation, can be carried out through the graph database. The relational database mainly serves queries over all the maintained data. For example, the document information, a node list composed of the nodes in the knowledge graph, a relation list composed of the relations in the knowledge graph, and relation information representing the specific relation between two entities in the knowledge graph may be written into the relational database, whose indexes are faster. Therefore, when a query request asks for other information about an entity by entity name, the request can be answered from the relational database.

Therefore, by setting the graph database and the relational database, the query requests of high speed and low speed can be separated, and the query can be realized without a bottom layer.
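The separation of slow path queries from fast lookups by name can be illustrated with two in-memory stand-ins: an adjacency map in place of the graph database and a dictionary in place of the relational database. The function names and the query dictionary layout are hypothetical, and a real deployment would issue graph and SQL queries instead.

```python
from collections import deque


def find_path(edges, start, goal):
    """Breadth-first search for a shortest path (the slow, graph-style query)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in edges.get(path[-1], ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path between the two entities


def answer(query, edges, entity_table):
    """Route a query to the store suited to it."""
    if query["type"] == "path":
        return find_path(edges, query["from"], query["to"])
    # Lookup of an entity's other information by name (the fast query).
    return entity_table.get(query["name"])
```

Keeping the two query types behind one routing function means callers need not know which store answers a given request.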

It should be understood that, the specific database of the graph database and the specific database of the relational database, etc. may be set according to actual requirements, and the embodiment of the present application is not limited thereto.

For example, the graph database may be a Neo4j graph database.

As another example, the relational database may be a MySQL database.

It should be noted here that the distributed storage module and the multi-level cache module are database-related technologies whose functions and requirements may be set according to actual needs, and the embodiments of the present application are not limited thereto.

The medical application layer may include a relationship/entity query module, a knowledge base index module, a decision assistance module configured to provide the user with a decision assistance scheme (e.g., a medical decision assistance scheme), and a document recommendation module.

It should be understood that the specific modules and functions of the relationship/entity query module, of the knowledge base index module, and of the decision assistance module may all be set according to actual needs, and the embodiments of the present application are not limited thereto.

Therefore, by means of the knowledge graph system, the embodiment of the application can monitor the latest literature information in real time and automatically update the knowledge graph, so that a user can master the latest scientific research information at any time.

It should be understood that the above-described knowledge graph system for medicine is only exemplary, and those skilled in the art can make various changes, modifications, or variations based on the above description within the protection scope of the present application.

In order to facilitate understanding of the embodiments of the present application, the following description is given by way of specific examples.

Specifically, referring to Fig. 2, Fig. 2 shows a flowchart of the construction and application of a medical knowledge graph according to an embodiment of the present application. As shown in Fig. 2, papers may be crawled from the PMC database using distributed crawlers, the information in the papers may be extracted based on UMLS to obtain entities and relationships, and a knowledge graph may be constructed from those entities and relationships; the knowledge graph and its related information may then be stored in a graph database, a relational database, and a cache database, respectively. Functions such as decision assistance, paper recommendation, and knowledge relationship query can subsequently be realized on the basis of these three databases.

To facilitate understanding of specific implementation procedures of the decision assistance, the following description is made by way of specific embodiments.

It should be noted here that, in order to facilitate understanding of the related schemes of subsequent decision assistance, the following explains the related concepts:

"entity list": an array of strings that represents the entities carried in the query request (e.g., cold or tremble, etc.), that is, the list of entities that need to be queried;

"initial weight IW": a numerical value that represents the weight assigned to an entity at its starting point;

"weight attenuation rate WD": a numerical value that represents the rate at which the weight decays with each step.

Referring to Fig. 3, Fig. 3 is a flowchart of a decision assistance method of a knowledge graph system for medicine according to an embodiment of the present application. In particular, the knowledge graph system may be the knowledge graph system shown in Fig. 1, and the decision assistance method comprises:

Step S310, adding all entities contained in the query request into the initial root node queue, and initializing the weights of all nodes in the initial root node queue. The initial root node queue is the smallest data body corresponding to all the entities, queried from the graph database through the persistence layer.

Specifically, the initial root node queue may be determined based on a query request input by a user: all nodes in the entity list corresponding to the query request may be added to the initial root node queue, and the weights of all nodes in the initial root node queue may be initialized to the initial weight IW.
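The initialization in step S310 can be sketched as below. The function name is hypothetical, and two details are assumptions rather than statements from the description: entities missing from the graph are ignored, and repeated entities are queued only once.

```python
from collections import deque


def init_root_queue(entity_list, known_nodes, initial_weight):
    """Return (root_queue, weights) for the queried entities found in the graph.

    known_nodes stands in for the set of nodes stored in the graph database.
    """
    root_queue = deque()
    weights = {}
    for entity in entity_list:
        # Assumption: skip entities absent from the graph and skip duplicates.
        if entity in known_nodes and entity not in weights:
            root_queue.append(entity)
            weights[entity] = initial_weight  # every root starts at IW
    return root_queue, weights
```

For example, with an entity list of `["cold", "tremble"]` and IW = 1000, both nodes enter the queue with weight 1000.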

Step S320, updating the weights of all unmarked first adjacent nodes of each node in the initial root node queue to obtain an intermediate root node queue.

It should be understood that the specific process of updating the weights of all unmarked first adjacent nodes of each node in the initial root node queue to obtain the intermediate root node queue may be set according to actual requirements, and the embodiment of the present application is not limited thereto.

Specifically, all first adjacent nodes of a first node may be obtained, where the first node is any one node in the initial root node queue, and the following steps are performed for each of the first adjacent nodes:

checking whether the current first adjacent node has been traversed; if it has not been traversed, marking the current first adjacent node; otherwise, skipping this step;

updating the weight of the current first adjacent node: if the original weight of the current first adjacent node is 0, calculating a first quotient value of the weight of the first node corresponding to the current first adjacent node and the weight attenuation rate WD (namely, weight of the first node / WD), and taking the first quotient value as the weight of the current first adjacent node; otherwise, calculating a product value of the original weight of the current first adjacent node and the first quotient value (namely, original weight of the current first adjacent node × (weight of the first node / WD)), and taking the product value as the weight of the current first adjacent node;

taking the current first adjacent node as the starting node and repeating the above steps, and stopping the loop once the quotient of the weight of the current first adjacent node and the weight attenuation rate WD (namely, weight of the current first adjacent node / WD) is less than or equal to 1.

The marking information of all nodes in the initial root node queue may then be cleared, and all nodes in the initial root node queue may be marked as not traversed.
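The weight propagation of step S320 can be sketched as follows. This is one possible reading of the steps above rather than the definitive procedure: it assumes that an already marked adjacent node is skipped entirely, that WD is greater than 1 so that weights decay, and it replaces the repeated "starting node" recursion with an explicit queue. The function name propagate_weights and the adjacency-map representation are hypothetical.

```python
from collections import deque


def propagate_weights(graph, roots, initial_weight, decay):
    """Spread weights outwards from the root nodes.

    graph maps a node to an iterable of its adjacent nodes.
    """
    weights = {node: float(initial_weight) for node in roots}
    marked = set()
    frontier = deque(roots)
    while frontier:
        current = frontier.popleft()
        quotient = weights[current] / decay
        if quotient <= 1:
            continue  # stopping rule: weight / WD <= 1, do not expand further
        for neighbour in graph.get(current, ()):
            if neighbour in marked:
                continue  # assumption: marked nodes are not updated again
            marked.add(neighbour)
            original = weights.get(neighbour, 0.0)
            if original == 0:
                weights[neighbour] = quotient
            else:
                weights[neighbour] = original * quotient
            frontier.append(neighbour)
    marked.clear()  # marks are cleared once propagation has finished
    return weights


chain = {"cold": ["fever", "cough"], "fever": ["infection"]}
chain_weights = propagate_weights(chain, ["cold"], initial_weight=8, decay=2)
```

With IW = 8 and WD = 2, the weight halves at each step away from "cold", and expansion stops at "infection" because its weight divided by WD equals 1.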

Step S330, updating the intermediate root node queue.

It should be understood that the specific process of updating the intermediate root node queue may be set according to actual needs, and the embodiment of the present application is not limited thereto.

Specifically, a second node is removed (popped) from the intermediate root node queue, where the second node is any one node in the intermediate root node queue;

all second adjacent nodes of the second node are acquired, and the following steps are performed for each of the second adjacent nodes:

if the weight of the current second adjacent node is determined to be greater than or equal to a second quotient value calculated from the weight of the second node and the weight attenuation rate WD (namely, weight of the current second adjacent node ≥ weight of the second node / WD), updating the weight of the current second adjacent node to the second quotient value (namely, weight of the second node / WD), and adding the current second adjacent node to the intermediate root node queue;

if the weight of the current second adjacent node is determined to be smaller than the second quotient value (namely, weight of the current second adjacent node < weight of the second node / WD), and the current second adjacent node is determined to be unmarked, adding the current second adjacent node to the intermediate root node queue;

marking the current second adjacent node as traversed.
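For illustration, a single pass of step S330 (popping one node and handling its adjacent nodes) might be sketched as below. The function name update_queue_once and the data structures are hypothetical. The text above does not state how many times this pass is repeated, so the sketch deliberately handles one popped node only; a caller that loops over the queue would need to supply its own stopping rule.

```python
from collections import deque


def update_queue_once(queue, graph, weights, marked, decay):
    """Pop one node and apply the three rules to each adjacent node."""
    node = queue.popleft()
    quotient = weights[node] / decay
    for neighbour in graph.get(node, ()):
        if weights.get(neighbour, 0.0) >= quotient:
            # Rule 1: cap the weight at the quotient and re-queue the node.
            weights[neighbour] = quotient
            queue.append(neighbour)
        elif neighbour not in marked:
            # Rule 2: smaller weight and not yet marked, so queue it.
            queue.append(neighbour)
        # Rule 3: the adjacent node is now marked as traversed.
        marked.add(neighbour)
    return node


demo_queue = deque(["cold"])
demo_graph = {"cold": ["fever", "cough", "rash"]}
demo_weights = {"cold": 8.0, "fever": 6.0, "cough": 1.0, "rash": 2.0}
demo_marked = {"rash"}
update_queue_once(demo_queue, demo_graph, demo_weights, demo_marked, decay=2)
```

In this example the quotient is 4: "fever" is capped at 4 and queued, "cough" is queued unchanged, and "rash" is neither changed nor queued because it was already marked.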

Step S340, traversing all nodes appearing in steps S310 to S330, and updating the weights of all nodes appearing in steps S310 to S330.

Specifically, all nodes appearing in steps S310 to S330 are traversed, and the weight of each node is updated to the original weight of the node divided by the number of relationships connected to the node. As a result, a node connected to many relationships receives a smaller weight and a node connected to few relationships receives a larger weight, so that the weight of common-sense relationships is reduced and nodes or relationships that usually go unnoticed can be picked out for doctors. Here, the number of relationships connected to a node refers to the number of relationships connected to that node in the knowledge graph; for example, the number may be 2 or 500.
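The adjustment of step S340 can be illustrated with the following sketch. The function name normalise_by_relation_count is hypothetical, and leaving the weight unchanged for a node with no recorded relationships is an assumption made here to avoid division by zero.

```python
def normalise_by_relation_count(weights, relation_count):
    """Divide each node weight by the number of relationships on the node."""
    normalised = {}
    for node, weight in weights.items():
        count = relation_count.get(node, 0)
        # Assumption: a node without recorded relationships keeps its weight.
        normalised[node] = weight / count if count > 0 else weight
    return normalised


adjusted = normalise_by_relation_count(
    {"fever": 4.0, "rare_gene": 4.0},
    {"fever": 500, "rare_gene": 2},
)
```

Using the counts 500 and 2 mentioned above, two nodes that start with the same weight end up far apart: the commonly connected "fever" node is reduced much more strongly than the sparsely connected "rare_gene" node.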

Step S350, repeating steps S310 to S340, determining all paths between any two entities among the entities contained in the query request, calculating the weight sum of each path, and outputting a subgraph composed of the top N paths ranked by weight sum. Here, N is a positive integer, and a path between two entities refers to the corresponding path between the two entities in the knowledge graph.

It should be understood that the specific value of N may be set according to actual requirements, that is, a subgraph composed of the top N paths ranked by weight sum may be returned as needed, and the embodiment of the present application is not limited thereto.

For example, where the user is a researcher, N may be set to a larger value because the researcher wants to obtain more information; where the user is a doctor, N may be set to a smaller value because the doctor may only want to know what condition the patient has and therefore needs less information.
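The path ranking of step S350 might be sketched as follows. Several details are assumptions made for illustration because the text does not specify them: paths are simple paths (no repeated node), their length is bounded by a hypothetical max_hops parameter, the weight sum of a path is the sum of the weights of its nodes, and paths are ranked in descending order of weight sum. The function names simple_paths and top_paths are likewise hypothetical.

```python
from itertools import combinations


def simple_paths(graph, start, goal, max_hops):
    """Yield every path from start to goal that repeats no node."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop limit reached, do not extend this path
        for neighbour in graph.get(node, ()):
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))


def top_paths(graph, weights, entities, n, max_hops=4):
    """Return the n highest scoring (weight sum, path) pairs."""
    scored = []
    for first, second in combinations(entities, 2):
        for path in simple_paths(graph, first, second, max_hops):
            total = sum(weights.get(node, 0.0) for node in path)
            scored.append((total, path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]


symptom_graph = {
    "cold": ["fever", "cough"],
    "fever": ["cold", "tremble"],
    "cough": ["cold", "tremble"],
    "tremble": ["fever", "cough"],
}
symptom_weights = {"cold": 8.0, "tremble": 8.0, "fever": 0.5, "cough": 3.0}
best = top_paths(symptom_graph, symptom_weights, ["cold", "tremble"], n=1)
```

A researcher might call top_paths with a larger n to receive a broader subgraph, whereas a doctor might keep n small, in line with the example above.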

Therefore, by means of this scheme, users such as doctors and researchers can input any entity nouns, such as genes, proteins, medicines, diseases, and clinical symptoms, and obtain the network of relationships among them.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.

It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third and the like are for convenience only and do not denote any order. These words are to be understood as part of the name of the component.

Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment", "some embodiments", "examples", "specific examples", "some examples", and the like mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, the various embodiments or examples described in this specification, and the features thereof, can be combined by one skilled in the art provided that they do not contradict one another.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims

1. A knowledge graph system for medicine, comprising:

a data acquisition layer configured to crawl, by a crawler, an incremental index that does not exist in an index database and crawl incremental content corresponding to the incremental index;

a natural language processing layer configured to perform entity extraction and relation extraction on the incremental content by using a natural language processing tool to obtain triplet information of the incremental content;

a persistence layer configured to construct a knowledge graph using the triplet information of the incremental content and store the knowledge graph;

a medical application layer configured to provide a service to a user based on the knowledge-graph; wherein the medical application layer comprises a decision assistance module configured to provide a decision assistance scheme for the user.

2. The knowledge graph system of claim 1, wherein the data acquisition layer comprises:

a crawler scheduler configured to crawl indexes related to preset medical keywords, remove duplicates from all the crawled indexes based on the index database to obtain the incremental indexes, store the incremental indexes into the index database, and update the states of the incremental indexes; wherein the states of an incremental index include not crawled, crawling in progress, crawling completed, and crawling timeout;

a plurality of crawlers configured to request the incremental index from the crawler scheduler and crawl the incremental content based on the incremental index.

3. The knowledge graph system of claim 2, wherein the data acquisition layer further comprises a data consolidator configured to collect incremental content crawled by a current crawler and send a first message indicating crawl completion to the crawler scheduler;

the crawler scheduler is further configured to query, based on the first message, the state of the incremental index corresponding to the incremental content crawled by the current crawler, and to feed back a second message indicating crawl failure to the data consolidator if the state is determined to be crawling completed;

the data consolidator is further configured to discard the incremental content crawled by the current crawler based on the second message.

4. The knowledge graph system of claim 2, wherein the data acquisition layer further comprises:

a data cleaning module configured to perform format cleaning on the incremental content, verify the validity of the incremental content, send the incremental content to the natural language processing layer if the incremental content is valid, and delete the incremental content if the incremental content is not valid.

5. The knowledge graph system of claim 2, wherein the crawler scheduler is further configured to determine a crawl time for the incremental content, mark the incremental index as crawling timeout if the crawl time exceeds a preset time, and add the incremental index to a not-crawled queue.

6. The knowledge graph system of claim 1, wherein the persistence layer comprises:

a hierarchical archiving module configured to store the knowledge-graph into a graph database and store the incremental content and feature information associated with the knowledge-graph into a relational database; wherein the characteristic information includes a node list and a relationship list.

7. The knowledge graph system of claim 2, wherein the crawler scheduler is further configured to crawl an index related to a preset medical keyword from a PubMed Central (PMC) database based on the medical keyword.

8. A decision assistance method for a knowledge graph system for medicine, the system being as claimed in any one of claims 1 to 7, the method comprising:

step S1, adding all entities contained in a query request into an initial root node queue, and initializing the weights of all nodes in the initial root node queue; wherein the initial root node queue is a minimum data body corresponding to all entities queried from a graph database through a persistence layer;

step S2, updating the weights of all unmarked first adjacent nodes of each node in the initial root node queue to obtain an intermediate root node queue;

step S3, updating the intermediate root node queue;

step S4, traversing all nodes appearing from the step S1 to the step S3, and updating the weights of all nodes appearing from the step S1 to the step S3; wherein, the weight of each node in all nodes appearing in the steps S1 to S3 is determined according to the weight of the current node and the number of relations connected with the current node;

step S5, repeating the steps S1 to S4, determining all paths between any two entities among the entities contained in the query request, calculating the weight sum of each path, and outputting a subgraph formed by the top N paths ranked by weight sum; wherein N is a positive integer.

9. The decision-assistance method according to claim 8, wherein the step S2 comprises:

step S21, acquiring all first adjacent nodes of a first node, where the first node is any one node in the initial root node queue, and executing the following steps for each first adjacent node in all the first adjacent nodes:

step S211, checking whether the current first adjacent node has been traversed; if the current first adjacent node has not been traversed, marking the current first adjacent node; otherwise, skipping the step S211;

step S212, updating the weight of the current first adjacent node; if the weight of the current first adjacent node is 0, calculating a first quotient value of the weight of the first node corresponding to the current first adjacent node and a weight attenuation rate, and taking the first quotient value as the weight of the current first adjacent node; otherwise, calculating a product value of the weight of the current first adjacent node and the first quotient value, and taking the product value as the weight of the current first adjacent node;

step S213, taking the current first adjacent node as a starting node, and repeating the step S21, and stopping the loop once a quotient calculated from the weight of the current first adjacent node and the weight attenuation rate is less than or equal to 1;

step S22, clearing the marking information of all the nodes in the initial root node queue, and marking all the nodes in the initial root node queue as not traversed.

10. The decision-assistance method according to claim 8, wherein the step S3 comprises:

step S31, removing a second node from the intermediate root node queue; the second node is any one node in the intermediate root node queue;

step S32, acquiring all second adjacent nodes of the second node, and executing the following steps for each second adjacent node in all the second adjacent nodes:

step S321, if it is determined that the weight of the current second adjacent node is greater than or equal to a second quotient value calculated from the weight of the second node and the weight attenuation rate, updating the weight of the current second adjacent node to the second quotient value, and adding the current second adjacent node to the intermediate root node queue;

step S322, if it is determined that the weight of the current second adjacent node is smaller than the second quotient value and it is determined that the current second adjacent node is not marked, adding the current second adjacent node to the intermediate root node queue;

step S323, marking the current second adjacent node as traversed.