CN112347204A - Method and device for constructing drug research and development knowledge base - Google Patents


Info

Publication number
CN112347204A
Authority
CN
China
Prior art keywords
entity
medical
entities
drug
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110025086.3A
Other languages
Chinese (zh)
Other versions
CN112347204B (en)
Inventor
丁红霞
伍星
吴忠毅
苑敬
王雨福
李靖
李琪
廖宛玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingwei Jingwei Information Technology Beijing Co ltd
Original Assignee
Jingwei Jingwei Information Technology Beijing Co ltd
Application filed by Jingwei Jingwei Information Technology Beijing Co ltd
Priority to CN202110025086.3A
Publication of CN112347204A
Application granted
Publication of CN112347204B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing

Abstract

The invention discloses a method and a device for constructing a drug research and development knowledge base. The method comprises the following steps: establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes, the medical entities comprise core entities and general entities, the core entity is a drug, and a general entity is an entity related to a drug; determining entity relationships, which comprise relationships between entities of the same type and relationships between entities of different types; and establishing a knowledge graph corresponding to the medical entity library according to the entity relationships, taking the core entities as key nodes and the general entities as common nodes. The invention can increase the degree of automation in building the drug research and development knowledge base, and makes it easier for users to understand the relationships among drug data and to quickly interpret or describe complex drug knowledge.

Description

Method and device for constructing drug research and development knowledge base
Technical Field
The invention relates to the technical field of information processing, in particular to a method and a device for constructing a drug research and development knowledge base.
Background
A knowledge base, also called an intelligent database or artificial intelligence database, is a knowledge-based database with intelligent capabilities; no information processing system can do without the support of data and a knowledge base. At present, domain knowledge bases are usually constructed manually, for example by experts in the field. Manual construction, however, requires a great deal of time and effort, is inefficient, and is difficult to maintain. This is particularly true in drug research and development: the field is closely tied to human health, and its knowledge system is very complicated. That system includes not only explicit knowledge of diseases, drugs, examination methods, examination equipment, treatment methods, and the like, but also implicit knowledge such as diagnostic experience, causes of disease, and related complications, and these pieces of knowledge are interrelated. Therefore, how to establish a drug research and development knowledge base efficiently and comprehensively is a problem the industry urgently needs to solve.
Disclosure of Invention
The invention provides a method and a device for constructing a drug research and development knowledge base, which address the low efficiency, long construction time, and difficult maintenance of manually constructed knowledge bases in the prior art, and which automate the construction of the drug research and development knowledge base.
Therefore, the invention provides the following technical scheme:
a method of drug development knowledge base construction, the method comprising:
establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entity is an entity related to a drug;
determining entity relationships; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;
and establishing a knowledge graph corresponding to the medical entity library according to the entity relationship by taking the core entity as a key node and the general entity as a common node.
Optionally, the establishing a medical entity library comprises:
extracting medical entities from the medical related structured data and establishing a medical entity library;
collecting medically relevant corpora;
and extracting medical entities from the corpora, and supplementing the extracted medical entities into the medical entity library.
Optionally, collecting the medically relevant corpora comprises:
collecting medically relevant corpora from any one or more of the following data sources: medically relevant documents, patents, news, web pages.
Optionally, the determining the entity relationship includes any one or more of the following ways:
extracting entity relationships from the medically relevant corpora by using a rule-based method;
extracting entity relationships from the medically relevant corpora by using a method based on a deep learning model.
Optionally, the general entities comprise: targets, indications, and companies;
the relationships among entities of the same type comprise: drug entity relationships and company entity relationships; the drug entity relationships include: synergy and antagonism; the company entity relationships include: parent company and subsidiary; and collaboration, transfer, and assignment in the business domain;
the relationships among entities of different types comprise: the relationship between a drug and an indication, the relationship between a drug and a target, and the relationship between a drug and a company.
A drug development knowledge base construction apparatus, the apparatus comprising:
the entity library establishing module is used for establishing a medical entity library, and the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entity is an entity related to a drug;
the entity relationship determining module is used for determining the entity relationship; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;
and the knowledge graph generating module is used for establishing a knowledge graph corresponding to the medical entity library according to the entity relationship by taking the core entity as a key node and the general entity as a common node.
Optionally, the entity library establishing module includes:
the medical entity library establishing unit is used for extracting medical entities from the medical related structured data and establishing a medical entity library;
the data acquisition unit is used for acquiring medical related corpora;
an entity extraction unit, configured to extract a medical entity from the corpus;
and the maintenance unit is used for supplementing the medical entities extracted by the entity extraction unit into the medical entity library.
Optionally, the data acquisition unit is specifically configured to acquire the medically-related corpora from any one or more of the following data sources: medically relevant documents, patents, news, web pages.
Optionally, the entity relationship determining module includes:
a first determining unit, configured to extract entity relationships from the medically relevant corpora by using a rule-based method; and/or
a second determining unit, configured to extract entity relationships from the medically relevant corpora by using a method based on a deep learning model.
Optionally, the general entities comprise: targets, indications, and companies;
the relationships among entities of the same type comprise: drug entity relationships and company entity relationships; the drug entity relationships include: synergy and antagonism; the company entity relationships include: parent company and subsidiary; and collaboration, transfer, and assignment in the business domain;
the relationships among entities of different types comprise: the relationship between a drug and an indication, the relationship between a drug and a target, and the relationship between a drug and a company.
According to the method and the device for constructing the drug research and development knowledge base, a medical entity library is established, the entity relationships are determined, and a knowledge graph corresponding to the medical entity library is established according to the entity relationships, with the core entities as key nodes and the general entities as common nodes. With the scheme of the embodiments of the invention, complex drug data spanning multiple fields is presented as a knowledge graph that shows the logical relationships among the data and the entities, so that a user can readily understand the relationships among drug data and quickly interpret or describe complex drug knowledge. Moreover, once the knowledge base is constructed, drug data is easier to process, and exploring and reasoning about the application directions of drug data becomes easier.
Drawings
FIG. 1 is a flow chart of a method for building a knowledge base of drug development according to an embodiment of the invention;
FIG. 2 is a diagram illustrating a serial relationship between a core entity and a generic entity according to an embodiment of the present invention;
FIG. 3 is an example of a rule-based method for extracting entity relationships from the medically-related corpus in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a drug development knowledge base construction apparatus according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of an entity library establishing module according to an embodiment of the present invention.
Detailed Description
Drug data in the pharmaceutical field not only has the specialized characteristics of the industry but, because of drug research and development, also spans multiple fields. In view of these characteristics, the embodiments of the invention provide a method and a device for constructing a drug research and development knowledge base.
As shown in fig. 1, the method for constructing a knowledge base of drug development in the embodiment of the present invention includes the following steps:
step 101, establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entities are drug related entities.
The general entities may include, but are not limited to: targets, indications, companies, etc.
An entity attribute is descriptive information that defines an entity, such as:
The attributes of a drug include: name, dosage form, route of administration, drug type, NME, compound, registration classification, etc.
The attributes of an indication include: disease name, etiology, anatomical location, disease classification, etc.
The attributes of a company include: region, parent company, subsidiary company, business domain, enterprise category, etc.
The attributes of a target include: the target classification; for non-signaling-pathway targets, the secondary attributes protein, nucleic acid, lipid, and carbohydrate; and for protein and nucleic acid targets, tertiary attributes such as gene and sequence.
The entity data mainly comes from unstructured, semi-structured, and structured data such as published literature, official website data, and industry databases.
For example, the medical entity library may be specifically established as follows:
(1) Establishing the medical entity library: medical entities are extracted from medical-related structured data, and the medical entity library is built from them.
In practical applications, a relational database (e.g., MySQL) may be used to store entity attributes that are not used for association, such as the picture URL of an entity, while a graph database (e.g., Neo4j) is used to store the main fields and relationships of entities for knowledge reasoning (graph reasoning, context reasoning, equivalence reasoning, inconsistency detection, knowledge discovery).
The flow of extracting medical entities from structured data to a graph database is as follows:
designing a database table structure according to the medical entity and the attribute thereof;
creating mapping view in the existing professional database, and extracting data into the database table;
generating an RDF file by adopting a D2RQ tool;
the RDF file is imported into the graph database.
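The four-step flow above can be sketched as follows. This is a minimal illustration only: it uses an in-memory SQLite database in place of the professional database, and a hand-written N-Triples serializer as a stand-in for the D2RQ tool named in the text. The table, view, and column names and the `http://example.org/kb/` namespace are hypothetical.

```python
import sqlite3

BASE = "http://example.org/kb/"  # hypothetical namespace, not from the patent


def rows_to_ntriples(conn):
    """Serialize each row of the mapping view as N-Triples lines.

    Stand-in for the D2RQ export step; literal escaping is omitted for brevity.
    """
    triples = []
    query = "SELECT drug_id, name, dosage_form FROM drug_view ORDER BY drug_id"
    for drug_id, name, dosage_form in conn.execute(query):
        subject = f"<{BASE}drug/{drug_id}>"
        triples.append(f'{subject} <{BASE}name> "{name}" .')
        triples.append(f'{subject} <{BASE}dosageForm> "{dosage_form}" .')
    return triples


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- existing "professional database" table (toy data)
    CREATE TABLE source_drugs (
        id INTEGER PRIMARY KEY, drug_name TEXT, form TEXT, picture_url TEXT
    );
    INSERT INTO source_drugs VALUES (1, 'DrugA', 'tablet', 'http://example.org/a.png');
    -- mapping view that exposes only the fields destined for the graph database;
    -- picture_url stays in the relational store
    CREATE VIEW drug_view AS
        SELECT id AS drug_id, drug_name AS name, form AS dosage_form
        FROM source_drugs;
    """
)
ntriples = rows_to_ntriples(conn)
print("\n".join(ntriples))
```

The resulting lines would then be written to a file and loaded with whatever RDF import facility the chosen graph database offers.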
(2) Extracting medical entities from multiple data sources to supplement and maintain the medical entity library: medically relevant corpora are collected, for example from any one or more of the following data sources: medically relevant documents, patents, news, and web pages (in practical applications the data sources are, of course, not limited to these); medical entities are then extracted from the corpora, and the extracted medical entities are supplemented into the medical entity library (namely, the graph database).
In addition, in order to ensure the accuracy of the medical entities supplemented into the medical entity library, the frequency of occurrence of each extracted medical entity may be further counted, and when the frequency of occurrence of the medical entity is higher than a set threshold (for example, 10 times), the medical entity is supplemented into the medical entity library.
Further, in order to ensure the quality of the medical entity library, the medical entity to be supplemented can be submitted to an expert for review, and the reviewed medical entity is incorporated into the medical entity library.
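The frequency filter and the expert review described in the two paragraphs above might look like the sketch below. The function names and the representation of expert approval as a set are assumptions made for illustration; the threshold of 10 is the example value from the text.

```python
from collections import Counter


def select_candidates(mentions, known_entities, threshold=10):
    """Return new entities whose mention count is higher than the threshold."""
    counts = Counter(m for m in mentions if m not in known_entities)
    return sorted(entity for entity, n in counts.items() if n > threshold)


def apply_review(candidates, approved_by_expert, library):
    """Add only the candidates an expert approved (hypothetical review step)."""
    accepted = [c for c in candidates if c in approved_by_expert]
    library.update(accepted)
    return accepted


library = {"aspirin"}
mentions = ["drugx"] * 11 + ["drugy"] * 10 + ["aspirin"] * 20
candidates = select_candidates(mentions, library)
accepted = apply_review(candidates, approved_by_expert={"drugx"}, library=library)
print(candidates, accepted, sorted(library))
```

Here "drugy" is mentioned exactly 10 times and is therefore not selected, because the text requires the frequency to be higher than the threshold.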
It should be noted that, in practical applications, the establishment of the medical entity library may also adopt other manners, for example, collecting various medical related data sources, performing word segmentation on a text sequence therein, and searching a domain knowledge base according to word units obtained by word segmentation to obtain medical entities in the text sequence, such as drugs, indications, and the like; as another example, a classifier may also be utilized to determine medical entities in the text sequence. Of course, there may be other ways, and the embodiment of the present invention is not limited thereto.
It should be noted that, in practical applications, for different types of medical entities, different methods and data sources may be used according to their characteristics to obtain corresponding medical entities, supplement the medical entities to the medical entity library, and update and maintain the medical entity library. Indications and targets are exemplified below.
For the indication entity:
Method one: compared with information such as clinical registrations, news, and company annual reports, the approved indications of marketed drugs are an important source of standardized terminology. To this end, indication entities may be obtained as follows:
1) constructing a first information table of raw indications by extracting indication entities from data sources such as drug labels (package inserts) published by the US FDA, the EU EMA, the Japanese PMDA, and the Chinese NMPA (National Medical Products Administration);
2) performing synonym merging on the first information table to obtain a second information table, that is, merging equivalent strings within the set of extracted indication terms itself; this can be done by character processing (for example, removing characters and spaces), upper/lower case conversion, normalization of English singular and plural forms and of tense, and the like;
3) for each indication string in the merged second information table, counting the frequency of the associated drug NDA/BLA registration numbers;
4) mapping the indication information in the second information table to the synonym relations of an internationally used standard library (MeSH);
Synonym relation mapping means matching the indication information against MeSH entries and their synonyms. This step maps the extracted non-standard entity set onto the standard set. Because MeSH can in turn be mapped to other standard libraries (such as ICD) through defined relations, the application range of the knowledge base can be expanded.
5) For an indication string that cannot be mapped, if its frequency of occurrence across the different data sources is higher than a set threshold (such as 10 times), the string is supplemented into the medical entity library; indication strings whose frequency is below the threshold are kept as spare entity entries.
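Steps 1) to 5) of method one can be sketched as below. This is an illustration under stated assumptions: the first information table is modeled as (raw string, registration number) pairs, the MeSH synonym table is a one-entry toy dictionary, the number of distinct registration numbers is used as the frequency, and singular/plural and tense normalization are left out because they would need a proper lemmatizer.

```python
import re
from collections import defaultdict


def normalize(term):
    """Character processing and case conversion (step 2); no lemmatization."""
    t = term.lower()
    t = re.sub(r"[-_/]", " ", t)          # treat separators as spaces
    return re.sub(r"\s+", " ", t).strip()  # collapse repeated spaces


def build_second_table(first_table):
    """Merge synonymous strings and collect registration numbers (steps 2-3)."""
    merged = defaultdict(set)
    for raw, registration_no in first_table:
        merged[normalize(raw)].add(registration_no)
    return merged


def triage(second_table, mesh_synonyms, threshold=10):
    """Map to MeSH where possible; otherwise add or keep as spare (steps 4-5)."""
    mapped, added, spare = {}, [], []
    for term, reg_nos in sorted(second_table.items()):
        if term in mesh_synonyms:
            mapped[term] = mesh_synonyms[term]
        elif len(reg_nos) > threshold:
            added.append(term)
        else:
            spare.append(term)
    return mapped, added, spare


# toy data; registration numbers are made up for illustration
first_table = [
    ("Type-2 Diabetes", "NDA001"),
    ("type 2  diabetes", "NDA002"),
    ("Rare Condition X", "BLA003"),
    ("rare condition x", "BLA004"),
    ("Obscure syndrome", "NDA005"),
]
mesh_synonyms = {"type 2 diabetes": "Diabetes Mellitus, Type 2"}  # toy excerpt

second_table = build_second_table(first_table)
mapped, added, spare = triage(second_table, mesh_synonyms, threshold=1)
print(mapped, added, spare)
```

The threshold is lowered to 1 in the usage lines only so that the toy table exercises all three outcomes.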
Method two: public clinical trial registries are the main channel for obtaining information on new drugs in clinical development, for example the US registry ClinicalTrials.gov, the Chinese clinical trial registration and information disclosure platform, the WHO International Clinical Trials Registry Platform, and the EU Clinical Trials Register. In addition, company annual reports, investment and financing news, and patent information are important channels through which companies disclose their research projects. To this end, indication entities may also be obtained as follows:
1) annotating medical entities in information from multiple clinical sources, for example annotating drugs, companies, indications, and other dimensions, and keeping any newly appearing indication entry as a spare entity entry;
2) performing periodic frequency analysis on the spare entity entries (for example, running word frequency statistics on the spare entries once a month), and applying approximate mapping to entries whose frequency is lower than a set threshold (for example, 10 times); a common approach to approximate mapping is to map a narrower term to its broader term, such as mapping "chronic liver failure" to "liver failure", while "chronic liver failure" is retained as backup data; if the frequency of occurrence is greater than or equal to the set threshold, the entry is enabled and mapped into the MeSH hierarchy. For example, "chronic liver failure" is converted from a spare entity entry into a formal indication entry when its frequency of occurrence across the normalized data sources reaches the threshold;
3) after a spare entity entry has been enabled, adjusting the mapping level of the entry.
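The periodic review of spare entries in method two could be sketched like this. The function name and the `broader_term` lookup table are hypothetical; in practice the broader term would come from a term hierarchy such as MeSH.

```python
def review_spare_entries(spare_counts, broader_term, threshold=10):
    """Promote frequent spare entries; map infrequent ones to a broader term.

    spare_counts: {entry: frequency observed in the current review period}
    broader_term: {narrower entry: broader entry} (hypothetical lookup table)
    """
    promoted, approximated = [], {}
    for entry, count in sorted(spare_counts.items()):
        if count >= threshold:
            promoted.append(entry)  # enabled as a formal indication entry
        else:
            approximated[entry] = broader_term.get(entry)  # still a spare entry
    return promoted, approximated


broader_term = {"chronic liver failure": "liver failure"}

# early review: the entry is rare, so it is approximated to its broader term
early = review_spare_entries({"chronic liver failure": 4}, broader_term)
# later review: the entry has reached the threshold and is promoted
later = review_spare_entries({"chronic liver failure": 12}, broader_term)
print(early, later)
```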
For the target entity:
Targets are classified by type into non-signaling-pathway targets, signaling-pathway targets, and others. A non-signaling-pathway target is identified by a specific molecular entity, and the drug interacts directly with the target entity to produce a pharmacological effect. Non-signaling-pathway target molecules take the form of monomeric or complex molecules such as proteins, nucleic acids, polysaccharides, and lipids. Signaling-pathway targets are physiological and biochemical phenomena, such as apoptosis, inflammatory response, and aging. In such cases, the physical target of the corresponding drug is unknown, or the drug mediates the related physiological phenomenon after binding to a physical target. The "other" type covers data whose target is ambiguous.
In embodiments of the invention, targets may be classified as follows: a target that can be linked directly to published protein databases, gene databases, and the like is a non-signaling-pathway target; data that cannot be classified as either a non-signaling-pathway target or an other target is classified as a "signaling-pathway target"; targets that fit neither of the two preceding cases are classified as "other targets".
The physical target molecules are classified into proteins, nucleic acids, lipids, and saccharides. Most physical targets are proteins, such as single-molecule proteins, multi-subunit proteins, or multi-protein complexes. Protein targets can be maintained and verified automatically, while the small number of nucleic acid, lipid, and saccharide targets can be maintained manually.
For example, the maintenance of a single-molecule protein target can be updated in several ways:
1) public source identifiers, such as the ID numbers from GeneCards, UniProt, and ChEMBL;
2) function description, obtained from the Function field of the public database UniProt;
3) sequence information, obtained as the canonical sequence from UniProt and stored in FASTA format;
4) alias information, obtained by aggregating and de-duplicating the aliases from UniProt and GeneCards and removing naming information such as enzyme (EC) numbers;
5) hierarchical relationship, based on the MeSH hierarchy as a tree structure, with the HGNC gene families, the ChEMBL hierarchy, and the like used as auxiliary hierarchies. Protein targets not contained in MeSH are classified together with proteins of similar function.
As another example, to keep multi-protein complex targets up to date, only the UniProt number and GeneCards number of each subunit component need to be maintained.
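Items 3) and 4) of the single-molecule protein maintenance list (FASTA storage and alias merging) can be illustrated with the sketch below. It assumes the alias lists and the sequence have already been downloaded; no database API is called. The regular expression for recognizing EC numbers, the function names, and the placeholder sequence are all assumptions made for illustration.

```python
import re

# matches strings such as "EC 2.7.10.1" or "EC 3.4.-.-" (assumed EC number shape)
EC_PATTERN = re.compile(r"^EC\s*\d+(\.(\d+|-)){0,3}$", re.IGNORECASE)


def merge_aliases(*alias_lists):
    """Combine alias lists, dropping case-insensitive duplicates and EC numbers."""
    seen, merged = set(), []
    for aliases in alias_lists:
        for alias in aliases:
            cleaned = alias.strip()
            key = cleaned.lower()
            if not key or key in seen or EC_PATTERN.match(cleaned):
                continue
            seen.add(key)
            merged.append(cleaned)
    return merged


def to_fasta(identifier, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{identifier}\n" + "\n".join(lines) + "\n"


uniprot_aliases = ["EGFR", "ERBB1", "EC 2.7.10.1"]  # illustrative input
genecards_aliases = ["egfr", "HER1"]                 # illustrative input
aliases = merge_aliases(uniprot_aliases, genecards_aliases)
record = to_fasta("P00533", "ABCDEFGHIJ", width=4)   # placeholder sequence
print(aliases)
print(record)
```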
It should be noted that, because a large amount of data in the pharmaceutical field is primarily in English, the scheme of the invention can extract medical entities not only from Chinese medical data to build a Chinese drug research and development knowledge base, but also from data sources in other languages. Combined with a Chinese localization verification process, internationalized and localized knowledge is formed through cross-comparison of the original text (in another language, such as English) against the translation (Chinese) and of the translation against the original text, together with word frequency comparison and expert comparison, and is stored in a localization branch entity of the knowledge base. For example, English and Chinese terms are associated through UMLS (Unified Medical Language System), and content not covered by UMLS, such as the latest ICD-11, is maintained in addition; Chinese terms are verified against the Chinese clinical trial registration and information disclosure platform, Chinese drug package inserts, keywords of published literature, and the like.
Step 102, determining entity relationships; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types.
The relationships between entities of the same type include but are not limited to: drug entity relationship, company entity relationship. Wherein: the drug entity relationship includes: synergistic effects (such as additive, potentiating, sensitizing, etc.), antagonistic effects (such as subtractive, resistant, etc.); the corporate entity relationship includes: parent, subsidiary; collaboration, transfer, and assignment of business areas.
The relationships between the different types of entities include, but are not limited to: drug-to-indication relationships (e.g., therapeutic, prophylactic, diagnostic, inductive, etc.), drug-to-target relationships (e.g., sensitization, degradation, stabilization, inhibition, stimulation, agonism, increased degradation, partial agonism, increased, induced, inverse agonism, activation, binding, antagonism, donor, catalysis, regulation, clearance, block, etc.), drug-to-company relationships (e.g., original, in, out, assigned).
In the embodiment of the present invention, the core entity is connected in series with each general entity, and the general entity has an association relationship with the core entity, as shown in an example in fig. 2. In this way, drug information is greatly enriched in a relational manner from different dimensions by general entities. For example, the mechanism of action of a drug is described by the relationship between the drug and the target entity; describing the drug use through the entity relationship of the drug and the indication; the medicine development history, business information and the like are described through the relation between medicines and company entities.
In this way, relationships between entities can be inferred, and the relationship between any two general entities can be reached by hopping through a core entity. For example, Company A's research and development investment in a disease area has to be inferred from Company A's drug research and development activity.
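The two-hop inference through a core entity can be sketched as follows. The triple format and the relation labels `developed_by` and `treats` are hypothetical names chosen for this illustration, as are the toy entities.

```python
from collections import defaultdict


def indications_by_company(triples, company):
    """Link a company to indications by hopping through its drugs.

    triples: (head, relation, tail) with hypothetical relation labels
    "developed_by" (drug -> company) and "treats" (drug -> indication).
    """
    drugs_of = defaultdict(set)
    indications_of = defaultdict(set)
    for head, relation, tail in triples:
        if relation == "developed_by":
            drugs_of[tail].add(head)
        elif relation == "treats":
            indications_of[head].add(tail)

    linked = defaultdict(set)
    for drug in drugs_of[company]:
        for indication in indications_of[drug]:
            linked[indication].add(drug)
    return {indication: sorted(drugs) for indication, drugs in linked.items()}


triples = [
    ("DrugA", "developed_by", "CompanyA"),
    ("DrugB", "developed_by", "CompanyA"),
    ("DrugC", "developed_by", "CompanyB"),
    ("DrugA", "treats", "liver failure"),
    ("DrugB", "treats", "liver failure"),
    ("DrugC", "treats", "asthma"),
]
portfolio = indications_by_company(triples, "CompanyA")
print(portfolio)
```

The number of drugs per indication gives a rough view of where the company concentrates its development effort; no company-to-indication edge needs to be stored.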
In the embodiment of the present invention, determining the entity relationship may adopt any one or more of the following manners:
1) Extracting entity relationships from the medically relevant corpus using a rule-based method
For example, a rule mapping template is made for a guide (or textbook) type text with a fixed format, for example, as shown in fig. 3, the following rules are set:
extracting the disease name from paragraph "(one) applicable object"; extracting symptoms from paragraph "(two) diagnosis basis", where the entities extracted from this paragraph stand in a "phenomenological expression" relation to the disease; and extracting drug names from paragraph "(seven) drug selection and timing of use", where the entities extracted from this paragraph stand in a "treated" relation to the disease.
The entities in each paragraph are extracted using a professional vocabulary and the AC automaton (Aho-Corasick) algorithm, and the relations between the entities are filled in according to the preset rules.
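A rule mapping template of this kind, combined with a small Aho-Corasick matcher, might look like the sketch below. The section keys, the vocabularies, the toy guideline text, and the relation labels `has_symptom` and `treated_with` are hypothetical stand-ins for the template shown in FIG. 3.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-keyword matching."""

    def __init__(self, words):
        self.goto, self.fail, self.out = [{}], [0], [[]]
        for word in words:
            self._add(word)
        self._build_failure_links()

    def _add(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Return every vocabulary word found in text, in order of match end."""
        state, found = 0, []
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found.extend(self.out[state])
        return found


# (section key, vocabulary type, relation label); all names are hypothetical
RULES = [
    ("diagnosis basis", "symptom", "has_symptom"),
    ("drug selection and timing of use", "drug", "treated_with"),
]


def extract_relations(sections, vocab):
    """Fill (disease, relation, entity) triples from fixed-format sections."""
    diseases = AhoCorasick(vocab["disease"]).find(sections["applicable object"])
    triples = []
    for section, entity_type, relation in RULES:
        matcher = AhoCorasick(vocab[entity_type])
        for entity in matcher.find(sections.get(section, "")):
            triples.extend((disease, relation, entity) for disease in diseases)
    return triples


sections = {  # toy guideline text
    "applicable object": "patients with pneumonia",
    "diagnosis basis": "fever and cough are typical",
    "drug selection and timing of use": "amoxicillin is given early",
}
vocab = {"disease": ["pneumonia"], "symptom": ["fever", "cough"], "drug": ["amoxicillin"]}
relations = extract_relations(sections, vocab)
print(relations)
```

Because the section determines the relation, no sentence-level parsing is needed; the automaton only has to locate vocabulary terms in a single pass over each paragraph.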
2) Extracting entity relationships from the medically relevant corpus using a deep learning model-based method
For example, entity co-occurrences are counted in text data such as news, literature, and patents; entity pairs are screened by their number of co-occurrences (for example, 5 times) to obtain a co-occurrence entity set; the context paragraphs in which the entities co-occur are extracted as the prediction corpus for the deep learning model; and the relationship between the entities is predicted using a deep-learning-based Natural Language Processing (NLP) model.
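The co-occurrence screening that prepares the prediction corpus can be sketched as follows. The sketch assumes that the example value of 5 is a minimum count and that entity mentions are found by simple case-insensitive substring matching; the deep learning model itself is outside the scope of the sketch and is not implemented.

```python
from collections import defaultdict
from itertools import combinations


def cooccurring_pairs(paragraphs, entities, min_count=5):
    """Collect context paragraphs for entity pairs that co-occur often enough."""
    contexts = defaultdict(list)
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        present = sorted(e for e in entities if e.lower() in lowered)
        for pair in combinations(present, 2):
            contexts[pair].append(paragraph)
    return {pair: ctx for pair, ctx in contexts.items() if len(ctx) >= min_count}


paragraphs = [  # toy corpus
    "DrugA inhibits TargetB in vitro.",
    "TargetB is bound by DrugA.",
    "DrugA was approved.",
]
entities = ["DrugA", "TargetB", "CompanyC"]
candidates = cooccurring_pairs(paragraphs, entities, min_count=2)
print(candidates)
```

Each retained pair, together with its context paragraphs, would then be passed to a trained relation classification model.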
Step 103, establishing a knowledge graph corresponding to the medical entity library according to the entity relationships, taking the core entities as key nodes and the general entities as common nodes.
The knowledge graph is a graph organization form and associates various entities through semantic association. The knowledge graph combines structured data and unstructured data through data extraction and fusion, embodies the ideas of data management and semantic connection, and is beneficial to utilization and migration of large-scale data.
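As a rough illustration of step 103, the sketch below assembles nodes and edges from an entity library and a list of relation triples, marking drugs as key nodes and all other entities as common nodes. The dictionary layout, the `role` field, and the relation labels are assumptions made for illustration; a real deployment would write the result to the graph database instead.

```python
def build_knowledge_graph(entities, relations):
    """Build node and edge collections for the knowledge graph.

    entities:  {entity name: entity type}, e.g. "drug", "target", "company"
    relations: (head, relation, tail) triples from the relation-determination step
    """
    nodes = {
        name: {"type": etype, "role": "key" if etype == "drug" else "common"}
        for name, etype in entities.items()
    }
    # keep only edges whose endpoints both exist in the entity library
    edges = [(h, r, t) for h, r, t in relations if h in nodes and t in nodes]
    return nodes, edges


entities = {"DrugA": "drug", "TargetB": "target", "CompanyA": "company"}
relations = [
    ("DrugA", "inhibits", "TargetB"),
    ("DrugA", "developed_by", "CompanyA"),
    ("DrugA", "treats", "UnknownDisease"),  # dropped: not in the entity library
]
nodes, edges = build_knowledge_graph(entities, relations)
print(nodes)
print(edges)
```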
According to the method for constructing the drug research and development knowledge base, a medical entity library is established, the entity relationships are determined, and a knowledge graph corresponding to the medical entity library is established according to the entity relationships, with the core entities as key nodes and the general entities as common nodes. With the scheme of the embodiments of the invention, complex drug data spanning multiple fields is presented as a knowledge graph that shows the logical relationships among the data and the entities, so that a user can readily understand the relationships among drug data and quickly interpret or describe complex drug knowledge. Moreover, once the knowledge base is constructed, drug data is easier to process, and exploring and reasoning about the application directions of drug data becomes easier.
Correspondingly, an embodiment of the invention also provides a drug research and development knowledge base construction device; FIG. 4 shows a structural block diagram of the device.
In this embodiment, the apparatus includes the following modules:
an entity library establishing module 401, configured to establish a medical entity library, where the medical entity library includes medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entities are drug-related entities, such as targets, indications, companies, and the like;
an entity relationship determining module 402, configured to determine an entity relationship; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;
a knowledge graph generating module 403, configured to establish a knowledge graph corresponding to the medical entity library according to the entity relationship, where the core entity is a key node, and the general entity is a common node.
In the embodiment of the invention, each entity data mainly comes from unstructured data, semi-structured data and structured data such as open documents, official website data and intra-industry databases.
As shown in fig. 5, a specific implementation manner of the entity library establishing module 401 may include the following units:
a medical entity library establishing unit 51, configured to extract medical entities from the medically-related structured data, and establish a medical entity library 50;
a data acquisition unit 52 for acquiring medically related corpora; for example, medically relevant corpora may be collected from any one or more of the following data sources: medical related documents, patents, news, web pages and the like, and other data sources can be provided, which are not limited;
an entity extracting unit 53, configured to extract a medical entity from the corpus;
a maintenance unit 54 for supplementing the medical entities extracted by the entity extraction unit 53 into the medical entity repository 50.
It should be noted that, in practical applications, a relational database (such as MySQL) may be used to store entity attributes that are not used for association, for example, the picture URL of an entity, while a graph database (such as Neo4j) may be used to store the main fields and relationships of entities for knowledge reasoning (graph reasoning, context reasoning, equivalence reasoning, inconsistency detection, knowledge discovery).
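The storage split described in this note can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for the relational database (MySQL in the text), and a plain in-memory adjacency structure stands in for the graph database (Neo4j in the text). The table name, column names, entity IDs, and URL are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Relational side: attributes that are not used for association.
# sqlite3 is a stand-in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_attr (entity_id TEXT PRIMARY KEY, picture_url TEXT)"
)
conn.execute(
    "INSERT INTO entity_attr VALUES (?, ?)",
    ("drug:001", "https://example.com/drug001.png"),
)
conn.commit()

# Graph side: main fields and relationships.
# An in-memory structure is a stand-in for a graph database here.
graph_nodes = {
    "drug:001": {"name": "DrugA", "type": "drug"},
    "target:001": {"name": "TargetX", "type": "target"},
}
graph_edges = defaultdict(list)
graph_edges["drug:001"].append(("acts_on", "target:001"))


def load_entity(entity_id):
    """Join the graph fields with the relational attributes of one entity."""
    row = conn.execute(
        "SELECT picture_url FROM entity_attr WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
    return {
        **graph_nodes[entity_id],
        "picture_url": row[0] if row else None,
        "relations": list(graph_edges.get(entity_id, [])),
    }
```

Keeping display-only attributes out of the graph store keeps graph traversals small; the shared entity ID is what ties the two stores together in this sketch.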
Further, in another embodiment of the entity library establishing module, a statistical unit (not shown) may further be included. The statistical unit counts the occurrence frequency of each medical entity extracted by the entity extracting unit 53 and, when the occurrence frequency of a medical entity is higher than a set threshold (for example, 10 times), triggers the maintenance unit 54 to supplement the medical entity library with that medical entity.
It should be noted that, in practical applications, for different types of medical entities, different methods and data sources may be used according to the characteristics of the medical entities to obtain the corresponding medical entities, supplement them to the medical entity library, and update and maintain the medical entity library; the embodiment of the present invention does not limit this.
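The frequency check performed by the statistical unit can be sketched as below. The function name `supplement_library`, the use of a Python set as the entity library, and the sample names are assumptions made for illustration; the threshold of 10 follows the example in the text, and "higher than" is read as a strict comparison.

```python
from collections import Counter


def supplement_library(library, extracted_mentions, threshold=10):
    """Add entities whose mention count is strictly above the threshold.

    library: set of entity names already in the medical entity library.
    extracted_mentions: iterable of entity names, one item per occurrence.
    Returns the set of names that were newly added.
    """
    counts = Counter(extracted_mentions)
    added = {
        name
        for name, count in counts.items()
        if count > threshold and name not in library
    }
    library.update(added)
    return added


# Hypothetical sample data, for illustration only.
library = {"DrugA"}
mentions = ["DrugB"] * 11 + ["DrugC"] * 10 + ["DrugA"] * 20
added = supplement_library(library, mentions)
```

Here `DrugB` (11 mentions) is added, `DrugC` (exactly 10) is not, and `DrugA` is skipped because it is already in the library.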
In the embodiment of the present invention, the relationships between entities of the same type include: drug entity relationship, company entity relationship; the drug entity relationship includes: synergy, antagonism, etc.; the corporate entity relationship includes: parent, subsidiary; collaboration, transfer, and assignment of business areas, etc. The relationships among the different types of entities comprise: the relationship between the drug and the indication, the relationship between the drug and the target, the relationship between the drug and the company, etc.
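The distinction between same-type and different-type relationships can be expressed as a small lookup, sketched below. The dictionary layout, the function name `classify_relation`, and the lowercase labels are hypothetical; the label sets list only relationships named in the text and are deliberately incomplete, since the text leaves them open-ended ("etc.").

```python
# Relationship labels named in the text for entities of the same type.
SAME_TYPE_RELATIONS = {
    "drug": {"synergy", "antagonism"},
    "company": {"parent", "subsidiary", "collaboration", "transfer"},
}

# Entity type pairs named in the text for entities of different types.
DIFFERENT_TYPE_PAIRS = {
    frozenset({"drug", "indication"}),
    frozenset({"drug", "target"}),
    frozenset({"drug", "company"}),
}


def classify_relation(head_type, tail_type, label=None):
    """Return "same_type", "different_type", or "unsupported"."""
    if head_type == tail_type:
        allowed = SAME_TYPE_RELATIONS.get(head_type)
        if allowed is None:
            return "unsupported"
        if label is not None and label not in allowed:
            return "unsupported"
        return "same_type"
    if frozenset({head_type, tail_type}) in DIFFERENT_TYPE_PAIRS:
        return "different_type"
    return "unsupported"
```

For example, `classify_relation("drug", "drug", "synergy")` yields `"same_type"`, while `classify_relation("company", "drug")` yields `"different_type"` regardless of argument order.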
Entity relationships may be determined in any one or more of the following ways. For example, the entity relationship determining module 402 may include a first determining unit and/or a second determining unit, wherein:
the first determining unit is used for extracting entity relationships from the medically related corpora by adopting a rule-based method;
the second determining unit is used for extracting entity relationships from the medically related corpora by adopting a deep learning model-based method.
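A rule-based extraction of the kind performed by the first unit can be sketched with regular expressions. The patterns, relation labels, and sample sentence below are hypothetical and are not the rules used by the embodiment; the deep learning model-based method of the second unit is not sketched here.

```python
import re

# Hypothetical hand-written rules: (compiled pattern, relation label).
RULES = [
    (
        re.compile(r"(?P<head>\w+) (?:inhibits|targets|binds to) (?P<tail>\w+)"),
        "drug_target",
    ),
    (
        re.compile(
            r"(?P<head>\w+) is (?:indicated for|used to treat) "
            r"(?P<tail>[\w ]+?)(?:\.|,|$)"
        ),
        "drug_indication",
    ),
]


def extract_relations(text, known_entities):
    """Return (head, label, tail) triples whose endpoints are known entities."""
    triples = []
    for pattern, label in RULES:
        for match in pattern.finditer(text):
            head = match.group("head")
            tail = match.group("tail").strip()
            # Filter against the entity library to reduce false positives.
            if head in known_entities and tail in known_entities:
                triples.append((head, label, tail))
    return triples


# Hypothetical sample data, for illustration only.
known = {"DrugA", "TargetX", "lung cancer"}
text = "DrugA inhibits TargetX. DrugA is indicated for lung cancer."
triples = extract_relations(text, known)
```

Rules of this kind tend to be precise but narrow, which is one reason the text allows them to be combined with a model-based extractor.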
It should be noted that, for the above embodiments of the apparatus of the present invention, since the functional implementation of each module and unit is similar to that in the corresponding method, the description of each embodiment of the drug development knowledge base construction apparatus is relatively brief, and for relevant points reference may be made to the description of the corresponding parts of the method embodiments.
The drug research and development knowledge base construction device provided by the embodiment of the invention establishes a medical entity library, determines entity relationships, and, with the core entity as a key node and the general entity as a common node, establishes a knowledge graph corresponding to the medical entity library according to the entity relationships. With the scheme of the embodiment of the invention, complex drug data spanning multiple drug-related fields is presented in the form of a knowledge graph that shows the logical relationships between related data and entities, so that a user can readily understand the relationships among the drug data and quickly read or describe complex drug knowledge; moreover, once the knowledge base is constructed, the drug data is easier to process, and its application directions are easier to explore and reason about.
According to the scheme of the embodiment of the invention, by establishing diverse associations between drug entities and other medical entities and by organizing and managing the knowledge in the knowledge base in the form of a knowledge graph, the repetitive labor of domain experts in maintaining and managing the knowledge base is greatly reduced, and the convenience and flexibility with which domain users acquire domain knowledge are improved.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. Furthermore, the above-described system embodiments are merely illustrative, wherein modules and units illustrated as separate components may or may not be physically separate, i.e., may be located on one network element, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those skilled in the art will appreciate that all or part of the steps in the above method embodiments may be implemented by a program to instruct relevant hardware to perform the steps, and the program may be stored in a computer-readable storage medium, referred to herein as a storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.
Correspondingly, the embodiment of the invention also provides a device for the drug development knowledge base construction method, and the device is an electronic device, such as a mobile terminal, a computer, a tablet device, a personal digital assistant, and the like. The electronic device may include one or more processors and a memory, wherein the memory is used for storing computer-executable instructions and the processor is used for executing the computer-executable instructions to implement the method of the previous embodiments.
The present invention has been described in detail with reference to the embodiments, and the description of the embodiments is provided to facilitate the understanding of the method and apparatus of the present invention, and is intended to be a part of the embodiments of the present invention rather than the whole embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without any creative effort shall fall within the protection scope of the present invention, and the content of the present description shall not be construed as limiting the present invention. Therefore, any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for constructing a drug development knowledge base, the method comprising:
establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug, and the general entity is an entity related to the drug;
determining entity relationships; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;
and establishing a knowledge graph corresponding to the medical entity library according to the entity relationship by taking the core entity as a key node and the general entity as a common node.
2. The method of claim 1, wherein the establishing a medical entity library comprises:
extracting medical entities from the medical related structured data and establishing a medical entity library;
collecting medical related linguistic data;
and extracting medical entities from the corpus, and supplementing the extracted medical entities into the medical entity library.
3. The method of claim 2, wherein the collecting medically relevant corpora comprises:
collecting medically relevant corpora from any one or more of the following data sources: medically relevant documents, patents, news, web pages.
4. The method of claim 2, wherein determining entity relationships comprises any one or more of:
extracting entity relationships from the medical related corpora by adopting a rule-based method;
and extracting entity relations from the medical related linguistic data by adopting a deep learning model-based method.
5. The method of claim 1, wherein the generic entities comprise: target, indications, company;
the relationships among the entities of the same type comprise: drug entity relationship, company entity relationship; the drug entity relationship includes: synergy and antagonism; the corporate entity relationship includes: parent, subsidiary; collaboration, transfer and admission in the business domain;
the relationships among the different types of entities comprise: the relationship between the drug and the indication, the relationship between the drug and the target, and the relationship between the drug and the company.
6. A drug development knowledge base construction apparatus, the apparatus comprising:
the entity library establishing module is used for establishing a medical entity library, and the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entity is an entity related to a drug;
the entity relationship determining module is used for determining the entity relationship; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;
and the knowledge graph generating module is used for establishing a knowledge graph corresponding to the medical entity library according to the entity relationship by taking the core entity as a key node and the general entity as a common node.
7. The apparatus of claim 6, wherein the entity library establishing module comprises:
the medical entity library establishing unit is used for extracting medical entities from the medical related structured data and establishing a medical entity library;
the data acquisition unit is used for acquiring medical related corpora;
an entity extraction unit, configured to extract a medical entity from the corpus;
and the maintenance unit is used for supplementing the medical entities extracted by the entity extraction unit into the medical entity library.
8. The apparatus of claim 7,
the data acquisition unit is specifically configured to acquire medically related corpora from any one or more of the following data sources: medically relevant documents, patents, news, web pages.
9. The apparatus of claim 7, wherein the entity relationship determination module comprises:
a first determining unit, configured to extract an entity relationship from the medically-related corpus by using a rule-based method; and/or
the second determining unit is used for extracting the entity relation from the medical related linguistic data by adopting a deep learning model-based method.
10. The apparatus of claim 6, wherein the generic entity comprises: target, indications, company;
the relationships among the entities of the same type comprise: drug entity relationship, company entity relationship; the drug entity relationship includes: synergy and antagonism; the corporate entity relationship includes: parent, subsidiary; collaboration, transfer and admission in the business domain;
the relationships among the different types of entities comprise: the relationship between the drug and the indication, the relationship between the drug and the target, and the relationship between the drug and the company.
CN202110025086.3A 2021-01-08 2021-01-08 Method and device for constructing drug research and development knowledge base Active CN112347204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110025086.3A CN112347204B (en) 2021-01-08 2021-01-08 Method and device for constructing drug research and development knowledge base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110025086.3A CN112347204B (en) 2021-01-08 2021-01-08 Method and device for constructing drug research and development knowledge base

Publications (2)

Publication Number Publication Date
CN112347204A true CN112347204A (en) 2021-02-09
CN112347204B CN112347204B (en) 2021-05-14

Family

ID=74427922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110025086.3A Active CN112347204B (en) 2021-01-08 2021-01-08 Method and device for constructing drug research and development knowledge base

Country Status (1)

Country Link
CN (1) CN112347204B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108492887A (en) * 2018-04-13 2018-09-04 合肥工业大学 medical knowledge map construction method and device
CN108804419A (en) * 2018-05-22 2018-11-13 湖南大学 Medicine is sold accurate recommended technology under a kind of line of knowledge based collection of illustrative plates
CN109145119A (en) * 2018-07-02 2019-01-04 北京妙医佳信息技术有限公司 The knowledge mapping construction device and construction method of health management arts
CN109299285A (en) * 2018-09-11 2019-02-01 中国医学科学院医学信息研究所 A kind of pharmacogenomics knowledge mapping construction method and system
CN109543047A (en) * 2018-11-21 2019-03-29 焦点科技股份有限公司 A kind of knowledge mapping construction method based on medical field website
US20200218779A1 (en) * 2019-01-03 2020-07-09 International Business Machines Corporation Cognitive analysis of criteria when ingesting data to build a knowledge graph
CN111061859A (en) * 2019-12-02 2020-04-24 深圳追一科技有限公司 Data processing method and device based on knowledge graph and computer equipment
CN111191014A (en) * 2019-12-26 2020-05-22 上海科技发展有限公司 Medicine relocation method, system, terminal and medium
CN112015905A (en) * 2020-08-05 2020-12-01 河北工程大学 Method for constructing fatigue marker disease knowledge graph

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268605A (en) * 2021-05-26 2021-08-17 深圳晶泰科技有限公司 Construction method and system of crystal form knowledge graph of small molecule drug
CN113268605B (en) * 2021-05-26 2024-01-02 深圳晶泰科技有限公司 Construction method and system of small molecular medicine crystal form knowledge graph
CN113761929A (en) * 2021-09-15 2021-12-07 慧算医疗科技(上海)有限公司 Method, device, equipment and medium for standardizing drug named entities in medical literature
CN113836931A (en) * 2021-11-24 2021-12-24 慧算医疗科技(上海)有限公司 Method, system and terminal for building cancer medication knowledge base based on domain ontology
CN113836931B (en) * 2021-11-24 2022-03-08 慧算医疗科技(上海)有限公司 Method, system and terminal for building cancer medication knowledge base based on domain ontology
CN114201618A (en) * 2022-02-17 2022-03-18 药渡经纬信息科技(北京)有限公司 Drug development literature visualization interpretation method and system

Also Published As

Publication number Publication date
CN112347204B (en) 2021-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant