CN112347204A

CN112347204A - Method and device for constructing drug research and development knowledge base

Info

Publication number: CN112347204A
Application number: CN202110025086.3A
Authority: CN
Inventors: 丁红霞; 伍星; 吴忠毅; 苑敬; 王雨福; 李靖; 李琪; 廖宛玲
Original assignee: Jingwei Jingwei Information Technology Beijing Co ltd
Current assignee: Jingwei Jingwei Information Technology Beijing Co ltd
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2021-02-09
Anticipated expiration: 2041-01-08
Also published as: CN112347204B

Abstract

The invention discloses a method and a device for constructing a drug research and development knowledge base, wherein the method comprises the following steps: establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes; the medical entities comprise core entities and general entities; the core entity is a drug; a general entity is an entity related to a drug; determining entity relationships, which comprise relationships between entities of the same type and relationships between entities of different types; and establishing a knowledge graph corresponding to the medical entity library according to the entity relationships, with the core entity as a key node and the general entities as common nodes. The invention increases the degree of automation in building a drug research and development knowledge base and makes it easier for users to understand the relationships among drug data and to quickly interpret or describe complex drug knowledge.

Description

Method and device for constructing drug research and development knowledge base

Technical Field

The invention relates to the technical field of information processing, in particular to a method and a device for constructing a drug research and development knowledge base.

Background

A knowledge base, also called an intelligent database or artificial intelligence database, is a knowledge-based database with a degree of intelligence; virtually every information processing system depends on the support of data and a knowledge base. At present, domain knowledge bases are usually constructed manually, for example by experts in the field. Manual construction takes a great deal of time and effort, is inefficient, and is difficult to maintain. This is particularly true in drug research and development: the field is closely related to human beings, and its knowledge system is highly complex. That system includes not only explicit knowledge such as diseases, drugs, examination methods, examination equipment, and treatment methods, but also implicit knowledge such as diagnostic experience, causes of disease, and related complications, all of which are interrelated. How to build a drug research and development knowledge base efficiently and comprehensively is therefore a problem that the industry urgently needs to solve.

Disclosure of Invention

The invention provides a method and a device for constructing a drug research and development knowledge base, which address the low efficiency, long construction time, and difficult maintenance of manually built knowledge bases in the prior art, and which automate the construction of the drug research and development knowledge base.

Therefore, the invention provides the following technical scheme:

a method of drug development knowledge base construction, the method comprising:

establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entity is an entity related to a drug;

determining entity relationships; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;

and establishing a knowledge graph corresponding to the medical entity library according to the entity relationship by taking the core entity as a key node and the general entity as a common node.

Optionally, the establishing a medical entity library comprises:

extracting medical entities from medically related structured data and establishing a medical entity library;

collecting medically related corpora;

and extracting medical entities from the corpora, and supplementing the extracted medical entities into the medical entity library.

Optionally, the collecting medically related corpora includes:

collecting medically relevant corpora from any one or more of the following data sources: medically relevant documents, patents, news, web pages.

Optionally, the determining the entity relationship includes any one or more of the following ways:

extracting entity relationships from the medically related corpora by adopting a rule-based method;

and extracting entity relationships from the medically related corpora by adopting a deep learning model-based method.

Optionally, the general entity comprises: target, indication, company;

the relationships between entities of the same type comprise: drug entity relationships and company entity relationships; the drug entity relationships include: synergy and antagonism; the company entity relationships include: parent company and subsidiary; and collaboration, transfer, and assignment in the business domain;

the relationships among the different types of entities comprise: the relationship between the drug and the indication, the relationship between the drug and the target, and the relationship between the drug and the company.

A drug development knowledge base construction apparatus, the apparatus comprising:

the entity library establishing module is used for establishing a medical entity library, and the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entity is an entity related to a drug;

the entity relationship determining module is used for determining the entity relationship; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;

and the knowledge graph generating module is used for establishing a knowledge graph corresponding to the medical entity library according to the entity relationship by taking the core entity as a key node and the general entity as a common node.

Optionally, the entity library establishing module includes:

the medical entity library establishing unit is used for extracting medical entities from the medical related structured data and establishing a medical entity library;

the data acquisition unit is used for acquiring medical related corpora;

an entity extraction unit, configured to extract a medical entity from the corpus;

and the maintenance unit is used for supplementing the medical entities extracted by the entity extraction unit into the medical entity library.

Optionally, the data acquisition unit is specifically configured to acquire the medically-related corpora from any one or more of the following data sources: medically relevant documents, patents, news, web pages.

Optionally, the entity relationship determining module includes:

a first determining unit, configured to extract an entity relationship from the medically-related corpus by using a rule-based method; and/or

a second determining unit, configured to extract entity relationships from the medically related corpora by using a deep learning model-based method.

Optionally, the general entity comprises: target, indication, company.

According to the method and the device for constructing a drug research and development knowledge base provided by the embodiments of the invention, a medical entity library is established, entity relationships are determined, and a knowledge graph corresponding to the medical entity library is built from those relationships, with the core entity as a key node and the general entities as common nodes. Complex drug data spanning multiple fields is thus presented as a knowledge graph that shows the logical relationships among the data and the entities, so that a user can readily understand how the drug data are related and quickly interpret or describe complex drug knowledge. Moreover, once the knowledge base has been constructed, drug data are easier to process, and exploring and reasoning about application directions for the drug data become easier.

Drawings

FIG. 1 is a flow chart of a method for building a knowledge base of drug development according to an embodiment of the invention;

FIG. 2 is a diagram illustrating how a core entity is connected in series with general entities according to an embodiment of the present invention;

FIG. 3 is an example of a rule-based method for extracting entity relationships from the medically-related corpus in accordance with an embodiment of the present invention;

FIG. 4 is a block diagram of a drug development knowledge base construction apparatus according to an embodiment of the present invention;

FIG. 5 is a structural block diagram of an entity library establishing module according to an embodiment of the present invention.

Detailed Description

Drug data not only has the specialized characteristics of the pharmaceutical industry but, through drug research and development, also spans multiple fields. In view of these characteristics, the embodiments of the invention provide a method and a device for constructing a drug research and development knowledge base.

As shown in fig. 1, the method for constructing a knowledge base of drug development in the embodiment of the present invention includes the following steps:

step 101, establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entities are drug related entities.

The general entities may include, but are not limited to: targets, indications, companies, etc.;

the entity attribute is descriptive information that defines an entity, such as:

The attributes of a drug include: name, dosage form, route of administration, drug type, NME, compound, registration classification, etc.

Attributes of the indications include: disease name, etiology, anatomical location, disease classification, etc.

The attributes of a company include: region, parent company, subsidiary company, business domain, enterprise category, etc.

The attributes of a target include: the target classification; for non-signal-pathway targets, a secondary attribute (protein, nucleic acid, lipid, or carbohydrate); and, for protein and nucleic acid targets, tertiary attributes such as genes and sequences.

The entity data mainly comes from unstructured, semi-structured, and structured data such as public literature, official website data, and industry databases.

For example, the medical entity library may be specifically established as follows:

(1) Establishing the medical entity library: medical entities are extracted from medically related structured data to establish the medical entity library.

In practical applications, a relational database (e.g., MySQL) may be used to store entity attributes that are not used for association, such as the picture URL of an entity, while a graph database (e.g., Neo4j) is used to store the main fields and relationships of entities for knowledge reasoning (graph reasoning, context reasoning, equivalence reasoning, inconsistency detection, and knowledge discovery).

The flow of extracting medical entities from structured data to a graph database is as follows:

designing a database table structure according to the medical entity and the attribute thereof;

creating a mapping view in the existing professional database and extracting the data into the database table;

generating an RDF file by adopting a D2RQ tool;

the RDF file is imported into the graph database.
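The row-to-RDF step in the flow above can be illustrated with a minimal sketch. It does not use D2RQ itself; it is a hand-written stand-in that turns rows from a relational table into N-Triples lines. The namespace `http://example.org/kb/`, the table and column names, and the sample rows are all assumptions made for illustration.

```python
from urllib.parse import quote

BASE = "http://example.org/kb/"  # hypothetical namespace, not from the source


def row_to_triples(table, key_column, row):
    """Convert one relational row into N-Triples lines with literal objects."""
    subject = f"<{BASE}{table}/{quote(str(row[key_column]))}>"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key becomes the subject IRI; NULLs produce no triple
        predicate = f"<{BASE}vocab/{table}_{column}>"
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


# Illustrative rows, as they might come out of the mapping view.
drug_rows = [
    {"drug_id": 1, "name": "DrugA", "dosage_form": "tablet", "route": "oral"},
    {"drug_id": 2, "name": "DrugB", "dosage_form": None, "route": "injection"},
]

lines = [t for row in drug_rows for t in row_to_triples("drug", "drug_id", row)]
print("\n".join(lines))
```

Each non-key column becomes one triple whose subject is derived from the primary key, which is the general shape a relational-to-RDF mapping tool produces; a real mapping would also turn foreign keys into links between entities rather than literals.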

(2) Extracting medical entities from various data sources to supplement and maintain the medical entity library: medically related corpora are collected, for example from any one or more of the following data sources: medically related documents, patents, news, and web pages (practical applications are of course not limited to these sources); medical entities are then extracted from the corpora and supplemented into the medical entity library (that is, the graph database).

In addition, in order to ensure the accuracy of the medical entities supplemented into the medical entity library, the frequency of occurrence of each extracted medical entity may be further counted, and when the frequency of occurrence of the medical entity is higher than a set threshold (for example, 10 times), the medical entity is supplemented into the medical entity library.

Further, in order to ensure the quality of the medical entity library, the medical entity to be supplemented can be submitted to an expert for review, and the reviewed medical entity is incorporated into the medical entity library.
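The frequency check and optional expert review described above might be sketched as follows. The function name `select_candidates` and the `approve` callback (a stand-in for expert review) are hypothetical; the threshold of 10 follows the example in the text.

```python
from collections import Counter


def select_candidates(mentions, threshold=10, approve=None):
    """Return entities mentioned more than `threshold` times.

    `approve` is an optional callable standing in for expert review:
    only entities for which it returns True are kept.
    """
    counts = Counter(mentions)
    frequent = [entity for entity, n in counts.items() if n > threshold]
    if approve is not None:
        frequent = [entity for entity in frequent if approve(entity)]
    return sorted(frequent)


# Illustrative mentions extracted from a corpus.
mentions = ["DrugA"] * 11 + ["DrugB"] * 10 + ["DrugC"] * 25
print(select_candidates(mentions))
print(select_candidates(mentions, approve=lambda e: e != "DrugC"))
```

The comparison is strict (`>`), matching "higher than a set threshold"; an entity seen exactly 10 times is held back.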

It should be noted that, in practical applications, the establishment of the medical entity library may also adopt other manners, for example, collecting various medical related data sources, performing word segmentation on a text sequence therein, and searching a domain knowledge base according to word units obtained by word segmentation to obtain medical entities in the text sequence, such as drugs, indications, and the like; as another example, a classifier may also be utilized to determine medical entities in the text sequence. Of course, there may be other ways, and the embodiment of the present invention is not limited thereto.

It should be noted that, in practical applications, for different types of medical entities, different methods and data sources may be used according to their characteristics to obtain corresponding medical entities, supplement the medical entities to the medical entity library, and update and maintain the medical entity library. Indications and targets are exemplified below.

For the indication entity:

Method one: compared with information such as clinical registrations, news, and company annual reports, the approved indications of marketed drugs are an important source of standardized terminology. To this end, the indication entity may be obtained as follows:

1) constructing an original first information table of indications by extracting indication entities from data sources such as the drug labels (package inserts) of the US FDA, the EU EMA, the Japanese PMDA, and the Chinese NMPA (National Medical Products Administration);

2) performing synonym merging on the original first information table to obtain a second information table, that is, merging within the set of extracted indication terms itself; this can be done through character processing (removing characters, handling spaces, and the like), upper- and lower-case conversion, English singular/plural normalization, tense normalization, and so on;

3) counting, for each indication string in the merged second information table, the frequency of drug NDA/BLA registration numbers;

4) mapping the indication information in the second information table to the synonym relationships of an internationally used standard vocabulary (MeSH);

the synonym relation mapping refers to matching the indication information against MeSH entries and their synonyms. This step maps the extracted non-standard entity set onto a standard set. MeSH can in turn be mapped to other standard vocabularies (such as ICD) through defined relationships, which broadens the range of application of the knowledge base.

5) for an indication string that cannot be mapped, supplementing it into the medical entity library if its frequency of occurrence across different data sources exceeds a set threshold (for example, 10 times); indication strings whose frequency is below the threshold are kept as spare entity entries.
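Steps 1) to 5) can be sketched in a few lines. This is only an illustration under stated assumptions: `MESH_SYNONYMS` is a two-entry toy dictionary standing in for MeSH, the frequency of a term is taken to be its number of distinct registration numbers, and the normalization covers only punctuation, whitespace, and case (singular/plural and tense handling are omitted).

```python
import re
from collections import defaultdict

# Toy stand-in for MeSH: normalized synonym -> preferred heading.
MESH_SYNONYMS = {
    "hypertension": "Hypertension",
    "high blood pressure": "Hypertension",
}


def normalize(term):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    term = re.sub(r"[^\w\s-]", " ", term.lower())
    return re.sub(r"\s+", " ", term).strip()


def process_indications(first_table, threshold=10):
    """first_table: iterable of (raw_indication, registration_number)."""
    # Step 2: merge spelling variants into the second information table.
    second_table = defaultdict(set)
    for raw, reg_no in first_table:
        second_table[normalize(raw)].add(reg_no)

    mapped, added, spare = {}, [], []
    for term, reg_nos in second_table.items():
        if term in MESH_SYNONYMS:            # step 4: synonym mapping
            mapped[term] = MESH_SYNONYMS[term]
        elif len(reg_nos) > threshold:       # step 5: frequent but unmapped
            added.append(term)
        else:                                # step 5: keep as spare entry
            spare.append(term)
    return mapped, added, spare


first_table = [
    ("Hypertension.", "NDA-1"),
    (" HIGH  blood pressure", "NDA-2"),
    ("Rare syndrome X", "NDA-3"),
]
mapped, added, spare = process_indications(first_table)
print(mapped, added, spare)
```

With the default threshold the unmapped term lands in `spare`; lowering the threshold would move it into `added`, that is, into the entity library.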

Method two: public clinical trial registration platforms are the main channels for obtaining information on new drugs in clinical development, such as the US platform ClinicalTrials.gov, the Chinese clinical trial registration and information disclosure platform, the WHO International Clinical Trials Registry Platform, and the EU Clinical Trials Register. In addition, company annual reports, investment and financing transaction news, and patent information are important channels through which companies disclose their research projects. To this end, the indication entity may also be obtained as follows:

1) performing medical entity calibration on information from multiple clinical sources, for example multidimensional calibration of drugs, companies, indications, and the like, and treating any newly appearing indication entry as a spare entity entry;

2) performing periodic frequency analysis on the spare entity entries (for example, word frequency statistics on the spare entries once a month). Entries whose frequency is below a set threshold (for example, 10 times) are approximately mapped; a common approach is to approximate a narrower term to its broader term, for example mapping "Chronic liver failure" to "Liver failure" while keeping "chronic liver failure" as backup data. If the frequency of occurrence is greater than or equal to the set threshold, the entry is enabled and mapped into the MeSH hierarchy; for example, "chronic liver failure" is converted from a spare entity entry into a formal indication entry once its frequency of occurrence across the normalized data sources reaches the threshold;

3) after a spare entity entry is enabled, adjusting the mapping level of the entry.
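The periodic review of spare entries can be sketched as below. The function name `review_spare_entries` and the `broader_term` lookup (narrower term to broader term) are hypothetical; in practice the broader term would come from the MeSH hierarchy.

```python
def review_spare_entries(counts, broader_term, threshold=10):
    """Split spare entries into promoted terms and approximate mappings.

    counts:       spare term -> frequency observed in the review period
    broader_term: spare term -> broader term used for approximate mapping
    """
    promoted, approximated = [], {}
    for term, n in counts.items():
        if n >= threshold:
            promoted.append(term)            # becomes a formal indication entry
        elif term in broader_term:
            approximated[term] = broader_term[term]  # stays spare for now
    return promoted, approximated


counts = {"chronic liver failure": 4, "indication y": 12}
broader_term = {"chronic liver failure": "liver failure"}
print(review_spare_entries(counts, broader_term))
```

Running the same function again in a later period, once "chronic liver failure" has reached the threshold, would move it from the approximate mapping into the promoted list.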

For the target entity:

Targets are classified by type into non-signal-pathway targets, signal-pathway targets, and others. A non-signal-pathway target is identified by a specific molecular entity, and the drug interacts directly with the target entity to produce a pharmacological effect. Non-signal-pathway target molecules take the form of monomeric or complex molecules such as proteins, nucleic acids, polysaccharides, and lipids. Signal-pathway targets are physiological and biochemical phenomena, such as apoptosis, inflammatory response, and aging; in such cases the physical target of the drug is unknown, or the drug mediates the relevant physiological phenomenon after binding to a physical target. The "other" type covers data whose target is ambiguous.

In embodiments of the invention, the targets may be classified as follows: a target that can be linked directly to entries in published protein or gene databases and the like is a non-signal-pathway target; data that cannot be classified as a non-signal-pathway target or as another target is classified as a "signal-pathway target"; and targets that fit neither of the above two cases are classified as "other targets".

Physical target molecules are classified into proteins, nucleic acids, lipids, and saccharides. Most physical targets are proteins, such as single-molecule proteins, multi-subunit proteins, or multi-protein complexes. Protein targets can be maintained and verified automatically, while the small number of nucleic acid, lipid, and saccharide targets can be maintained manually.

For example, a single-molecule protein target can be maintained and updated in the following respects:

1) public-source identifiers, such as the ID numbers in GeneCards, UniProt, and ChEMBL;

2) function description, obtained from the Function field of the public database UniProt;

3) sequence information, namely the canonical sequence from UniProt, stored in FASTA format;

4) alias information, obtained by merging and de-duplicating the aliases in UniProt and GeneCards and removing naming information such as EC enzyme numbers;

5) hierarchical relationships, which use the MeSH hierarchy as the tree structure, with the HGNC gene families, the ChEMBL hierarchy, and the like as supplementary hierarchies. Protein targets not contained in MeSH are classified alongside proteins with similar functions.

As another example, to keep composite protein targets up to date, only the UniProt number and GeneCards number of each subunit component need be maintained.
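Items 3) and 4) lend themselves to a short sketch: merging aliases from two sources while dropping EC numbers, and writing a sequence in FASTA format. The function names, the EC-number pattern, and the sample values are assumptions for illustration; no database is queried, and the sequence shown is a placeholder rather than real data.

```python
import re

# Matches strings such as "EC 2.7.10.1" or "EC 3.4.-.-".
EC_PATTERN = re.compile(r"^EC \d+(\.(\d+|-)){3}$")


def merge_aliases(*alias_lists):
    """Merge alias lists, de-duplicating case-insensitively and dropping EC numbers."""
    seen, merged = set(), []
    for aliases in alias_lists:
        for alias in aliases:
            alias = alias.strip()
            key = alias.lower()
            if not key or key in seen or EC_PATTERN.match(alias):
                continue
            seen.add(key)
            merged.append(alias)
    return merged


def to_fasta(identifier, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{identifier}"] + chunks)


source_a = ["EGFR", "ErbB1", "EC 2.7.10.1"]   # aliases from one source
source_b = ["erbb1", "HER1"]                   # aliases from another source
print(merge_aliases(source_a, source_b))
print(to_fasta("TARGET_X", "A" * 70))          # placeholder sequence
```

The first spelling of each alias is kept, so the order of the source lists decides which capitalization survives.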

It should be noted that a large amount of data in the medical field is primarily in English. The scheme of the invention can therefore not only extract medical entities from Chinese medically related data to build a Chinese drug research and development knowledge base, but can also extract medical entities from data sources in other languages. Combined with a Chinese localization verification process (cross-comparison from the original text in another language, such as English, to the Chinese translation and from the Chinese translation back to the original, word frequency comparison, and expert comparison), this yields internationalized and localized knowledge, which is stored in a localization branch entity of the knowledge base. For example, English and Chinese terms are associated through UMLS (Unified Medical Language System), and content not covered by UMLS, such as the latest ICD-11, is additionally maintained; Chinese terms are verified against the Chinese clinical trial registration and information disclosure platform, Chinese drug package inserts, keywords of published literature, and the like.

Step 102, determining entity relationships; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types.

The relationships between entities of the same type include but are not limited to: drug entity relationship, company entity relationship. Wherein: the drug entity relationship includes: synergistic effects (such as additive, potentiating, sensitizing, etc.), antagonistic effects (such as subtractive, resistant, etc.); the corporate entity relationship includes: parent, subsidiary; collaboration, transfer, and assignment of business areas.

The relationships between the different types of entities include, but are not limited to: drug-to-indication relationships (e.g., therapeutic, prophylactic, diagnostic, inductive, etc.), drug-to-target relationships (e.g., sensitization, degradation, stabilization, inhibition, stimulation, agonism, increased degradation, partial agonism, increased, induced, inverse agonism, activation, binding, antagonism, donor, catalysis, regulation, clearance, block, etc.), drug-to-company relationships (e.g., original, in, out, assigned).

In the embodiment of the present invention, the core entity is connected in series with each general entity, and each general entity has an association relationship with the core entity, as shown in the example in FIG. 2. In this way, the general entities greatly enrich the drug information with relationships along different dimensions. For example, the mechanism of action of a drug is described by the relationship between the drug and target entities; the use of a drug is described by the relationship between the drug and indication entities; and the development history, business information, and the like of a drug are described by the relationship between the drug and company entities.

Relationship reasoning between entities can therefore be carried out: the relationship between any two general entities can be reached by hopping through the core entity. For example, the research and development investment of company A in a disease field is inferred from company A's drug research and development activity.
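The hop through the core entity can be sketched with a small in-memory structure. The class `DrugGraph`, the relation names, and the sample data are hypothetical; a graph database query would play this role in practice.

```python
from collections import defaultdict


class DrugGraph:
    """Drugs are key nodes; every edge links a drug to one general entity."""

    def __init__(self):
        self.edges = defaultdict(set)  # drug -> {(relation, entity_type, entity)}

    def add(self, drug, relation, entity_type, entity):
        self.edges[drug].add((relation, entity_type, entity))

    def drugs_linked_to(self, entity_type, entity):
        """Drugs directly connected to the given general entity."""
        return {
            drug
            for drug, links in self.edges.items()
            if any(t == entity_type and e == entity for _, t, e in links)
        }

    def hop(self, from_type, from_entity, to_type):
        """General entities of `to_type` reachable from another general entity
        through a drug, together with the drugs that connect them."""
        result = defaultdict(set)
        for drug in self.drugs_linked_to(from_type, from_entity):
            for _, entity_type, entity in self.edges[drug]:
                if entity_type == to_type:
                    result[entity].add(drug)
        return dict(result)


g = DrugGraph()
g.add("DrugA", "originator", "company", "Company A")
g.add("DrugA", "treats", "indication", "Hypertension")
g.add("DrugB", "originator", "company", "Company A")
g.add("DrugB", "treats", "indication", "Asthma")
g.add("DrugC", "originator", "company", "Company B")
g.add("DrugC", "treats", "indication", "Asthma")

# Which disease areas is Company A active in, and through which drugs?
print(g.hop("company", "Company A", "indication"))
```

The result groups indications by the drugs that connect them to the company, which is the company-to-disease-field inference described in the text.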

In the embodiment of the present invention, determining the entity relationship may adopt any one or more of the following manners:

1) Extracting entity relationships from the medically related corpora using a rule-based method

For example, a rule mapping template is made for guideline (or textbook) type text with a fixed format; as shown in FIG. 3, the following rules are set:

the disease name is extracted from the paragraph "(1) Applicable object"; symptoms are extracted from the paragraph "(2) Diagnosis basis", and the entities extracted from this paragraph stand in a "manifestation" relation to the disease; drug names are extracted from the paragraph "(7) Drug selection and timing of use", and the entities extracted from this paragraph stand in a "treatment" relation to the disease.

The entities in each paragraph are extracted using a professional vocabulary and the AC automaton (Aho-Corasick) algorithm, and the relationships between the entities are filled in according to the preset rules.
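A minimal sketch of such a rule template follows. For brevity it uses plain substring matching where the text calls for an AC automaton, which only matters for speed on large vocabularies. The section headings, relation names, vocabulary, and sample text are placeholders.

```python
# Rule template: section heading -> (entity type, relation to the disease).
RULES = {
    "Applicable object": ("disease", None),
    "Diagnosis basis": ("symptom", "manifestation_of"),
    "Drug selection and timing of use": ("drug", "treats"),
}

# Placeholder professional vocabulary, one list per entity type.
VOCAB = {
    "disease": ["DiseaseX"],
    "symptom": ["SymptomA", "SymptomB"],
    "drug": ["DrugA"],
}


def extract_triples(sections):
    """sections: heading -> paragraph text of one guideline document."""
    found = {}
    for heading, (entity_type, _) in RULES.items():
        text = sections.get(heading, "").lower()
        found[entity_type] = [t for t in VOCAB[entity_type] if t.lower() in text]

    triples = []
    for disease in found["disease"]:
        for entity_type, relation in RULES.values():
            if relation is not None:
                triples += [(e, relation, disease) for e in found[entity_type]]
    return triples


sections = {
    "Applicable object": "Patients with a first diagnosis of DiseaseX.",
    "Diagnosis basis": "Typical findings are SymptomA and SymptomB.",
    "Drug selection and timing of use": "DrugA is started within 24 hours.",
}
print(extract_triples(sections))
```

Because the relation is determined by the section an entity is found in, no sentence-level parsing is needed; this is what makes the approach suitable for fixed-format texts only.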

2) Extracting entity relationships from the medically relevant corpus using a deep learning model-based method

For example, the number of times entities co-occur is counted in text data such as news, literature, and patents; a set of co-occurring entities is obtained by screening against a co-occurrence threshold (for example, 5 times); the context paragraphs in which the entities co-occur are extracted as the prediction corpus for a deep learning model; and the relationships between the entities are predicted using a deep-learning-based natural language processing (NLP) model.
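The screening stage that precedes the model can be sketched as follows; the relation-prediction model itself is out of scope here. The function name, the use of the paragraph as the co-occurrence window, and the reading of the threshold as "at least 5" are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(paragraphs, entities, min_count=5):
    """Return entity pairs that co-occur in at least `min_count` paragraphs,
    each with the paragraphs that would form the prediction corpus."""
    counts = Counter()
    contexts = {}
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        present = sorted(e for e in entities if e.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
            contexts.setdefault(pair, []).append(paragraph)
    return {pair: contexts[pair] for pair, n in counts.items() if n >= min_count}


paragraphs = ["DrugA inhibits TargetX in vitro."] * 5 + [
    "DrugA was discussed by CompanyB."
]
entities = {"DrugA", "TargetX", "CompanyB"}
screened = cooccurring_pairs(paragraphs, entities)
print(list(screened))
```

Only the pair that clears the threshold is kept; its context paragraphs are what a downstream relation classifier would receive as input.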

Step 103, establishing a knowledge graph corresponding to the medical entity library according to the entity relationships, with the core entity as a key node and the general entities as common nodes.

A knowledge graph is a graph-based form of organization that links entities through semantic associations. Through data extraction and fusion, the knowledge graph combines structured and unstructured data, embodies the ideas of data governance and semantic connection, and facilitates the use and migration of large-scale data.
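Step 103 can be pictured with a small sketch that labels drug entities as key nodes and all other entities as common nodes, then attaches the determined relationships as edges. The dictionary layout, the function name, and the sample entities are assumptions; in a deployment the result would be written to a graph database rather than held in memory.

```python
def build_knowledge_graph(entity_library, relations, core_type="drug"):
    """entity_library: name -> {"type": ..., "attributes": {...}}
    relations: iterable of (subject, relation, object) triples."""
    nodes = {}
    for name, info in entity_library.items():
        nodes[name] = {
            "type": info["type"],
            "role": "key" if info["type"] == core_type else "common",
            "attributes": info.get("attributes", {}),
        }
    # Keep only relationships whose two endpoints exist in the library.
    edges = [(s, r, o) for s, r, o in relations if s in nodes and o in nodes]
    return {"nodes": nodes, "edges": edges}


entity_library = {
    "DrugA": {"type": "drug", "attributes": {"dosage_form": "tablet"}},
    "TargetX": {"type": "target"},
    "Company A": {"type": "company", "attributes": {"region": "EU"}},
}
relations = [
    ("DrugA", "inhibits", "TargetX"),
    ("DrugA", "originator", "Company A"),
    ("DrugA", "treats", "UnknownIndication"),  # dropped: not in the library
]
graph = build_knowledge_graph(entity_library, relations)
print(graph["nodes"]["DrugA"]["role"], len(graph["edges"]))
```

Dropping edges with unknown endpoints keeps the graph consistent with the entity library; such relationships could instead be queued as candidates for supplementing the library.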

According to the method for constructing a drug research and development knowledge base provided by the embodiment of the invention, a medical entity library is established, entity relationships are determined, and a knowledge graph corresponding to the medical entity library is built from those relationships, with the core entity as a key node and the general entities as common nodes. Complex drug data spanning multiple fields is thus presented as a knowledge graph that shows the logical relationships among the data and the entities, so that a user can readily understand how the drug data are related and quickly interpret or describe complex drug knowledge. Moreover, once the knowledge base has been constructed, drug data are easier to process, and exploring and reasoning about application directions for the drug data become easier.

Correspondingly, an embodiment of the invention also provides a drug research and development knowledge base construction device; FIG. 4 is a structural block diagram of the device.

In this embodiment, the apparatus includes the following modules:

an entity library establishing module 401, configured to establish a medical entity library, where the medical entity library includes medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug; the general entities are drug-related entities, such as targets, indications, companies, and the like;

an entity relationship determining module 402, configured to determine an entity relationship; the entity relationship comprises: relationships between entities of the same type and relationships between entities of different types;

a knowledge graph generating module 403, configured to establish a knowledge graph corresponding to the medical entity library according to the entity relationship, where the core entity is a key node, and the general entity is a common node.

In the embodiment of the invention, the entity data mainly comes from unstructured, semi-structured, and structured data such as public literature, official website data, and industry databases.

As shown in fig. 5, a specific implementation manner of the entity library establishing module 401 may include the following units:

a medical entity library establishing unit 51, configured to extract medical entities from the medically-related structured data, and establish a medical entity library 50;

a data acquisition unit 52 for acquiring medically related corpora; for example, medically relevant corpora may be collected from any one or more of the following data sources: medical related documents, patents, news, web pages and the like, and other data sources can be provided, which are not limited;

an entity extracting unit 53, configured to extract a medical entity from the corpus;

a maintenance unit 54 for supplementing the medical entities extracted by the entity extraction unit 53 into the medical entity repository 50.

It should be noted that, in practical applications, a relational database (such as MySQL) may be used to store entity attributes that are not used for association, for example the picture URL of an entity, while a graph database (e.g., Neo4j) is used to store the main fields and relationships of entities for knowledge reasoning (graph reasoning, context reasoning, equivalence reasoning, inconsistency detection, and knowledge discovery).

Further, in another embodiment of the entity library establishing module, a statistical unit (not shown) may be further included, for counting the occurrence frequency of each medical entity extracted by the entity extracting unit 53, and when the occurrence frequency of the medical entity is higher than a set threshold (for example, 10 times), triggering the maintaining unit 54 to supplement the medical entity library with the medical entity.

It should be noted that, in practical applications, for different types of medical entities, different methods and data sources may be used according to characteristics of the medical entities to obtain corresponding medical entities, and the corresponding medical entities are supplemented to the medical entity library, and the medical entity library is updated and maintained, which is not limited to the embodiment of the present invention.

In the embodiment of the present invention, the relationships between entities of the same type include: drug entity relationship, company entity relationship; the drug entity relationship includes: synergy, antagonism, etc.; the corporate entity relationship includes: parent, subsidiary; collaboration, transfer, and assignment of business areas, etc. The relationships among the different types of entities comprise: the relationship between the drug and the indication, the relationship between the drug and the target, the relationship between the drug and the company, etc.

Determining entity relationships may be performed in any one or more of the following ways, for example, the entity relationship determination module 402 may include: a first determination unit, and/or a second determination unit. Wherein:

the first determination unit is used for extracting entity relationships from the medical related corpora by a rule-based method;

the second determination unit is used for extracting entity relationships from the medical related corpora by a method based on a deep learning model.
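A rule-based extraction of the kind performed by the first unit can be sketched as follows. The description does not specify the rules, so the three patterns below, the relation labels, and the function name extract_relations are purely illustrative assumptions; a deep-learning-based extractor is not sketched here.

```python
import re

# Hypothetical hand-written rules: each pairs a sentence pattern with the
# relation label it signals. Real rules would be far more extensive.
RULES = [
    (
        re.compile(r"(?P<head>[A-Z][\w-]*) (?:targets|inhibits) (?P<tail>[A-Z][\w-]*)"),
        "drug-target",
    ),
    (
        re.compile(r"(?P<head>[A-Z][\w-]*) is indicated for (?P<tail>[\w -]+?)(?:[.,;]|$)"),
        "drug-indication",
    ),
    (
        re.compile(r"(?P<head>[A-Z][\w-]*) is developed by (?P<tail>[A-Z][\w-]*)"),
        "drug-company",
    ),
]


def extract_relations(sentence):
    """Return (head, relation, tail) triples matched by any rule."""
    triples = []
    for pattern, relation in RULES:
        for match in pattern.finditer(sentence):
            triples.append(
                (match.group("head"), relation, match.group("tail").strip())
            )
    return triples
```

For instance, with placeholder entity names, extract_relations("DrugA inhibits TargetX.") yields the single triple ("DrugA", "drug-target", "TargetX").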

It should be noted that, for the above apparatus embodiments of the present invention, since the functions of each module and unit are implemented similarly to the corresponding method, the description of each embodiment of the drug research and development knowledge base construction apparatus is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.

The drug research and development knowledge base construction device provided by the embodiment of the invention establishes the medical entity library, determines the entity relationships, and establishes the knowledge graph corresponding to the medical entity library according to the entity relationships, taking the core entity as the key node and the general entity as the common node. According to the scheme of the embodiment of the invention, complex drug data spanning multiple drug-related fields is presented in the form of a knowledge graph that shows the logical relationships between the related data and entities, so that a user can conveniently understand the relationships among the drug data and quickly interpret or describe complex drug knowledge; moreover, once the knowledge base is constructed, the drug data is easier to process, and its application directions are easier to explore and reason about.
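The graph construction summarized above can be sketched as follows. This is a minimal in-memory illustration, not the embodiment's implementation: the class KnowledgeGraph, its methods, and the "key"/"common" role labels are hypothetical names chosen to mirror the key node and common node wording of the text.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Hypothetical in-memory graph: drugs are key nodes, others common nodes."""

    nodes: dict = field(default_factory=dict)  # name -> {"type", "role"}
    edges: list = field(default_factory=list)  # (head, relation, tail)

    def add_entity(self, name, entity_type):
        # The core entity (a drug) becomes a key node; any general entity
        # (target, indication, company, ...) becomes a common node.
        role = "key" if entity_type == "drug" else "common"
        self.nodes[name] = {"type": entity_type, "role": role}

    def add_relation(self, head, relation, tail):
        # Both endpoints must already exist in the entity library.
        for name in (head, tail):
            if name not in self.nodes:
                raise KeyError(f"unknown entity: {name}")
        self.edges.append((head, relation, tail))

    def neighbors(self, name):
        """Return (relation, other entity) pairs touching the given node."""
        result = []
        for head, relation, tail in self.edges:
            if head == name:
                result.append((relation, tail))
            elif tail == name:
                result.append((relation, head))
        return result
```

Querying neighbors of a drug node then returns its targets, indications, and companies in one pass, which is the kind of lookup the knowledge graph form is meant to make convenient.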

According to the scheme of the embodiment of the invention, by establishing diverse associations between drug entities and other medical entities, and by organizing and managing the knowledge in the knowledge base in the form of a knowledge graph, the repetitive labor of domain experts in maintaining and managing the knowledge base is greatly reduced, and the convenience and flexibility with which domain users acquire domain knowledge are improved.

It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. Furthermore, the above-described system embodiments are merely illustrative, wherein modules and units illustrated as separate components may or may not be physically separate, i.e., they may be located on one network element or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Those skilled in the art will appreciate that all or part of the steps in the above method embodiments may be implemented by a program to instruct relevant hardware to perform the steps, and the program may be stored in a computer-readable storage medium, referred to herein as a storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.

Correspondingly, the embodiment of the invention also provides a device for implementing the drug research and development knowledge base construction method. The device is an electronic device, such as a mobile terminal, a computer, a tablet device, a personal digital assistant, or the like. The electronic device may include one or more processors and a memory, wherein the memory is used for storing computer-executable instructions and the processor is used for executing the computer-executable instructions to implement the method of the previous embodiments.

The present invention has been described in detail with reference to the embodiments, and the description of the embodiments is provided to facilitate the understanding of the method and apparatus of the present invention, and is intended to be a part of the embodiments of the present invention rather than the whole embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without any creative effort shall fall within the protection scope of the present invention, and the content of the present description shall not be construed as limiting the present invention. Therefore, any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for constructing a drug development knowledge base, the method comprising:

establishing a medical entity library, wherein the medical entity library comprises medical entities and entity attributes; the medical entity comprises: core entities and general entities; the core entity is a drug, and the general entity is an entity related to the drug;

2. The method of claim 1, wherein the establishing a medical entity library comprises:

collecting medical related linguistic data;

3. The method of claim 2, wherein the collecting medically relevant corpora comprises:

4. The method of claim 2, wherein determining entity relationships comprises any one or more of:

5. The method of claim 1, wherein the general entities comprise: target, indications, company;

6. A drug development knowledge base construction apparatus, the apparatus comprising:

7. The apparatus of claim 6, wherein the entity library establishing module comprises:

the data acquisition unit is used for acquiring medical related corpora;

8. The apparatus of claim 7,

the data acquisition unit is specifically configured to acquire medically related corpora from any one or more of the following data sources: medically relevant documents, patents, news, web pages.

9. The apparatus of claim 7, wherein the entity relationship determination module comprises:

10. The apparatus of claim 6, wherein the general entity comprises: target, indications, company;