WO2022021958A1 - Method and apparatus for constructing a drug knowledge graph - Google Patents

Method and apparatus for constructing a drug knowledge graph

Info

Publication number
WO2022021958A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
entities
usage
precondition
drug
Prior art date
Application number
PCT/CN2021/088889
Other languages
English (en)
French (fr)
Inventor
杨帅
谢佩
文豪
韩磊
张亚
Original Assignee
北京京东拓先科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东拓先科技有限公司
Priority to US18/016,896 (US20230352192A1)
Priority to EP21849263.5A (EP4191439A4)
Publication of WO2022021958A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • The present disclosure relates to the field of computer technology, specifically to knowledge graph technology, and in particular to a method and apparatus for constructing a drug knowledge graph, an electronic device, and a computer-readable medium.
  • Knowledge graphs are a branch of knowledge engineering in artificial intelligence and already have relatively mature applications in general domains. Drug text, however, cannot be parsed by traditional syntactic analysis methods, and no mature, publicly available drug knowledge graph exists yet.
  • The embodiments of the present disclosure propose a method and apparatus for constructing a drug knowledge graph, an electronic device, and a computer-readable medium.
  • The embodiments of the present disclosure provide a method for constructing a drug knowledge graph.
  • The method includes: identifying entities in drug text; replacing the key medical entities among the entities with character strings that conform to preset rules to obtain replacement text; restoring each character string in the word segmentation result determined from the replacement text to the key medical entity it replaced; forming an entity linear relationship between the entities based on the entities; and generating a drug knowledge graph according to the parsing result obtained by syntactically parsing the entity linear relationship.
  • The above method further includes: establishing a mapping relationship table between the character strings and the key medical entities replaced by the character strings.
  • The above method further includes: identifying the non-key medical entities in the word segmentation result. Forming the entity linear relationship between the entities based on the entities then includes: sorting the key medical entities and the non-key medical entities according to the order of each entity in the drug text to obtain the entity linear relationship corresponding to the drug text.
  • The above-mentioned key medical entities include: disease name and drug name. The above-mentioned non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
  • The above entities include precondition entities and usage and dosage entities (usage entities for short), and the precondition entities include the key medical entities. Generating the drug knowledge graph according to the parsing result obtained by syntactically parsing the entity linear relationship includes: obtaining a precondition merge result based on the precondition entities identified in the entity linear relationship; obtaining a usage merge result based on the usage entities identified in the entity linear relationship; combining the precondition merge result and the usage merge result, based on the positional relationship between each precondition entity and each usage entity in the entity linear relationship, to obtain a root node set whose set elements each contain at least one element of the precondition merge result and at least one element of the usage merge result; merging all distinct set elements in the root node set; and taking the merge result with the highest merge probability among the merge results of the root node set as the parsing result and adding the parsing result to the knowledge graph.
  • Obtaining the precondition merge result based on the precondition entities identified in the entity linear relationship includes: identifying the precondition entities in the entity linear relationship; forming a precondition entity set whose set elements are each identified precondition entity and each combination of precondition entities; and merging all distinct set elements in the precondition entity set to obtain the precondition merge result.
  • Obtaining the usage merge result based on the usage entities identified in the entity linear relationship includes: identifying the usage entities in the entity linear relationship; forming a usage entity set whose set elements are each identified usage entity and each combination of usage entities; and merging all distinct set elements in the usage entity set to obtain the usage merge result.
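  • The set-forming step described above can be sketched as follows. This is a minimal illustration under one reading of the text, namely that "each entity and each combination of entities" means every non-empty subset taken in text order. The function name `entity_set` and the sample entities are hypothetical, and the subsequent merge step is not shown because the text does not specify it.

```python
from itertools import combinations

def entity_set(entities):
    """Return each entity and every combination of entities as set elements.

    Assumption: a "combination" is any subset of two or more entities,
    kept in the order in which the entities appear in the drug text.
    """
    elements = []
    for size in range(1, len(entities) + 1):
        elements.extend(combinations(entities, size))
    return elements

# Hypothetical precondition entities taken from one drug text
precondition_set = entity_set(["adults", "whooping cough"])
```

  • The same helper could be applied to the usage entities; the number of set elements grows exponentially with the number of entities, so a practical system would probably need to prune combinations.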
  • The method further includes: identifying attributes of the entities in the drug text; and adding the attributes of the entities to the drug knowledge graph.
  • The above method further includes performing at least one of the following formatting operations on the drug text: normalizing different punctuation marks that represent the same meaning in the drug text; converting Chinese numerals in the drug text into Arabic numerals.
  • The embodiments of the present disclosure provide an apparatus for constructing a drug knowledge graph, the apparatus comprising: an identification unit configured to identify entities in drug text; a replacement unit configured to replace the key medical entities among the entities with character strings that conform to preset rules to obtain replacement text; a restoring unit configured to restore each character string in the word segmentation result determined from the replacement text to the key medical entity it replaced; a forming unit configured to form an entity linear relationship between the entities based on the entities; and a parsing unit configured to generate a drug knowledge graph according to the parsing result obtained by syntactically parsing the entity linear relationship.
  • The above-mentioned apparatus further includes: a mapping unit configured to establish a mapping relationship table between the character strings and the key medical entities replaced by the character strings.
  • The above-mentioned apparatus further includes: a distinguishing unit configured to identify the non-key medical entities in the word segmentation result. The above-mentioned forming unit is further configured to sort the key medical entities and the non-key medical entities according to the order of each entity in the drug text to obtain the entity linear relationship corresponding to the drug text.
  • The above-mentioned key medical entities include: disease name and drug name. The above-mentioned non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
  • The above entities include precondition entities and usage entities, and the precondition entities include the key medical entities. The above-mentioned parsing unit includes: a precondition obtaining module configured to obtain a precondition merge result based on the precondition entities identified in the entity linear relationship; a usage obtaining module configured to obtain a usage merge result based on the usage entities identified in the entity linear relationship; a combination module configured to combine the precondition merge result and the usage merge result, based on the positional relationship between each precondition entity and each usage entity in the entity linear relationship, to obtain a root node set whose set elements each contain at least one element of the precondition merge result and at least one element of the usage merge result; a merge module configured to merge all distinct set elements in the root node set; a parsing module configured to take the merge result with the highest merge probability among the merge results of the root node set as the parsing result; and an adding module configured to add the parsing result to the knowledge graph.
  • The above-mentioned precondition obtaining module includes: a precondition identification sub-module configured to identify the precondition entities in the entity linear relationship; a precondition combination sub-module configured to form a precondition entity set whose set elements are each identified precondition entity and each combination of precondition entities; and a precondition merge sub-module configured to merge all distinct set elements in the precondition entity set to obtain the precondition merge result.
  • The above-mentioned usage obtaining module includes: a usage identification sub-module configured to identify the usage entities in the entity linear relationship; a usage combination sub-module configured to form a usage entity set whose set elements are each identified usage entity and each combination of usage entities; and a usage merge sub-module configured to merge all distinct set elements in the usage entity set to obtain the usage merge result.
  • The above-mentioned apparatus further includes: a distinguishing unit configured to identify attributes of the entities in the drug text; and an adding unit configured to add the attributes of the entities to the drug knowledge graph.
  • The above-mentioned apparatus further includes a formatting unit and/or a conversion unit. The formatting unit is configured to normalize different punctuation marks that represent the same meaning in the drug text; the conversion unit is configured to convert Chinese numerals in the drug text into Arabic numerals.
  • The embodiments of the present disclosure provide an electronic device including: one or more processors; and a storage device on which one or more programs are stored. When the one or more programs are executed by the one or more processors, the one or more processors implement the method described in any implementation of the first aspect.
  • An embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored; when executed by a processor, the program implements the method described in any implementation of the first aspect.
  • The method and apparatus for constructing a drug knowledge graph provided by the embodiments of the present disclosure first identify the entities in drug text; second, replace the key medical entities among the entities with character strings that conform to preset rules to obtain replacement text; third, restore each character string in the word segmentation result determined from the replacement text to the key medical entity it replaced; then form an entity linear relationship between the entities based on the entities; and finally generate a drug knowledge graph according to the parsing result obtained by syntactically parsing the entity linear relationship.
  • Replacing the key medical entities with character strings that conform to preset rules ensures accurate word segmentation of the drug text. Obtaining the drug knowledge graph by syntactic parsing on the basis of the entity linear relationship makes it convenient to convert natural-language drug usage and dosage into a data structure that computers can recognize, which benefits knowledge graph mining in the medical field and ensures the accuracy and interpretability of the drug knowledge graph.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure may be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for constructing a drug knowledge graph according to the present disclosure;
  • FIG. 3 is a flowchart of another embodiment of a method for constructing a drug knowledge graph according to the present disclosure;
  • FIG. 4 is a flowchart of an embodiment of a method for generating a drug knowledge graph according to the present disclosure;
  • FIG. 5 is a flowchart of an embodiment of a method for obtaining a usage merge result according to the present disclosure;
  • FIG. 6 is a flowchart of a third embodiment of a method for constructing a drug knowledge graph according to the present disclosure;
  • FIG. 7 is a schematic structural diagram of an embodiment of an apparatus for constructing a drug knowledge graph according to the present disclosure;
  • FIG. 8 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
  • FIG. 1 illustrates an exemplary system architecture 100 to which the method for constructing a drug knowledge graph of the present disclosure may be applied.
  • The system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
  • The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105.
  • The network 104 may include various connection types, typically wireless communication links and the like.
  • The terminal devices 101, 102, and 103 interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications, such as instant messaging tools and email clients, may be installed on the terminal devices 101, 102, and 103.
  • The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be user devices with communication and control functions that can communicate with the server 105.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the above-mentioned user devices and implemented either as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
  • The server 105 may be a server that provides various services, for example a knowledge graph server that supports the knowledge graph system on the terminal devices 101, 102, and 103.
  • The knowledge graph server can analyze and process the relevant information of each target image in the network, and feed the processing results (such as the generated knowledge graph) back to the terminal devices.
  • The server may be hardware or software.
  • When the server is hardware, it can be implemented as a distributed server cluster composed of multiple servers or as a single server.
  • When the server is software, it can be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
  • The method for constructing the drug knowledge graph is generally executed by the server 105.
  • The terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
  • The drug knowledge graph is generated from drug prescriptions.
  • With it, the usage and dosage appropriate to a patient can be matched according to the patient's personal situation, and candidate items can be presented for the pharmacist to screen, which helps reduce the pharmacist's workload and saves review time.
  • FIG. 2 shows a process 200 of an embodiment of a method for constructing a drug knowledge graph according to the present disclosure. The method includes the following steps:
  • Step 201: identify the entities in the drug text.
  • The execution subject on which the method runs (such as a server or a terminal device) may acquire the drug text in real time or by reading it from memory.
  • The drug text covers the drugs used to treat different diseases in different populations, together with the usage and dosage of those drugs.
  • The drug text can be the drug instructions on the medicine box or a prescription for the drug.
  • For example, a drug instruction reads: "Oral. Adults take two tablets once a day; children take one tablet once a day."
  • The entities in drug text can include: population, disease name, drug name, dose, frequency, course of treatment, route of administration, timing of administration, conjunctions, and so on.
  • Entities may be classified into key medical entities and non-key medical entities. Key medical entities are terms specific to medicine, such as drug names and disease names; non-key medical entities are nouns that are not specific to medicine but are common in drug text, such as dose, course of treatment, and route of administration.
  • The key medical entities include: drug name and disease name. The non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
  • The key and non-key medical entities provided by this optional implementation cover the types of entities found in drug instructions, which ensures that the division into entity types is comprehensive.
  • Identifying the entities in the drug text may mean identifying only the key medical entities, or identifying both the key and the non-key medical entities.
  • Key medical entities can be identified by dictionary enumeration, that is, by checking one by one whether each candidate belongs to the key medical entities. Disease names, for example, are identified against a maintained dictionary table of diseases: the table is traversed for the drug text, and if a word A in the table occurs in the drug text, that occurrence of A is marked as a disease name.
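  • The dictionary enumeration described above can be sketched as follows. This is only an illustration: the dictionary contents, the name `find_disease_names`, and the choice to report character offsets are assumptions, and a real system would load a maintained dictionary table instead of a hard-coded list.

```python
# Hypothetical disease dictionary; stands in for the maintained dictionary table.
DISEASE_DICT = ["whooping cough", "influenza", "asthma"]

def find_disease_names(text, dictionary=DISEASE_DICT):
    """Return (start, end, word) for every dictionary word that occurs in text."""
    hits = []
    for word in dictionary:              # traverse the dictionary table
        start = text.find(word)
        while start != -1:               # record every occurrence, not only the first
            hits.append((start, start + len(word), word))
            start = text.find(word, start + 1)
    return sorted(hits)

hits = find_disease_names("Treat whooping cough: once a day, one tablet at a time")
```

  • Each hit marks one span of the drug text as a disease name; the same loop could be run with a drug-name dictionary to cover the other type of key medical entity.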
  • The entities can also be classified into precondition entities and usage and dosage entities. Precondition entities represent the drug, the disease, or the subject who uses the drug or suffers from the disease, for example population, disease name, drug name, and the conjunctions between these three. The precondition entities include the key medical entities.
  • Usage and dosage entities are entities related to how the drug is used, such as frequency, course of treatment, route of administration, timing of administration, and the conjunctions between these four.
  • Different drug texts contain different entities: some include both precondition entities and usage and dosage entities, while others include only usage and dosage entities.
  • Identifying the entities in the drug text may therefore mean identifying the precondition entities, or identifying both the precondition entities and the usage and dosage entities.
  • Step 202: replace the key medical entities among the entities with character strings that conform to preset rules to obtain replacement text.
  • Because key medical entities are specialized terms with a certain complexity, they may affect the subsequent generation of the drug knowledge graph, so the identified key medical entities need to be replaced.
  • The character string conforming to the preset rules may be a user-defined string that follows certain rules, and different strings may be defined for different types of key medical entities so that the types can be distinguished.
  • For example, for the drug text "Treat whooping cough: once a day, one tablet at a time", "whooping cough" is identified as a key medical entity and the string "indication_0" is defined to replace it, giving the replacement text "indication_0: once a day, one tablet at a time".
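  • A minimal sketch of this replacement step is given below. The placeholder scheme `<prefix>_<n>` follows the "indication_0" example in the text; the function name `replace_key_entities`, its parameters, and the decision to return the placeholder-to-entity mapping alongside the text are assumptions made for illustration.

```python
def replace_key_entities(text, key_entities, prefix="indication"):
    """Replace each key medical entity with a placeholder such as indication_0.

    Returns the replacement text and a dict from placeholder to original entity.
    """
    mapping = {}
    for i, entity in enumerate(key_entities):
        placeholder = f"{prefix}_{i}"
        if entity in text:
            text = text.replace(entity, placeholder)
            mapping[placeholder] = entity
    return text, mapping

replaced, mapping = replace_key_entities(
    "Treat whooping cough: once a day, one tablet at a time",
    ["whooping cough"],
)
```

  • A different prefix (for example a hypothetical "drug") could be passed for drug names so that the placeholder itself indicates the entity type.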
  • The character string conforming to the preset rules can also be assigned automatically by a character template, which can also record the correspondence between each character string and the key medical entity it replaces. After the character strings have replaced the key medical entities among the entities, this correspondence can be obtained from the template in order to restore the key medical entities.
  • Step 203: restore each character string in the word segmentation result determined from the replacement text to the key medical entity it replaced.
  • The replacement text can be segmented with the Chinese word segmentation function of a Chinese word segmentation tool to obtain the word segmentation result.
  • Chinese word segmentation splits a sequence of Chinese characters into individual words.
  • Available Chinese word segmentation tools include jieba, SnowNLP (Simplified Chinese Text Processing), and so on.
  • Step 204: form an entity linear relationship between the entities based on the entities.
  • The entities identified in the drug text can be sorted according to their positions in the drug text to form the entity linear relationship between the entities.
  • For example, for the drug text "Oral, once a day for adults, two tablets at a time; once a day for children, one tablet at a time", entity recognition yields the following entity linear relationship: <administration route: oral> <population: adults> <frequency: once a day> <dose: two tablets at a time> <population: children> <frequency: once a day> <dose: one tablet at a time>.
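  • Forming the entity linear relationship amounts to sorting the recognized entities by their position in the text, which can be sketched as follows. The `Entity` record, its field names, and the function `linear_relationship` are hypothetical; the entity labels are taken from the example above.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    start: int   # character offset of the entity in the drug text
    kind: str    # entity type, e.g. "dose"
    value: str   # entity text

def linear_relationship(entities):
    """Order the entities by their position in the drug text."""
    return [(e.kind, e.value) for e in sorted(entities, key=lambda e: e.start)]

text = "Oral, once a day for adults, two tablets at a time"
# Entities listed deliberately out of order, as a recognizer might return them.
entities = [
    Entity(text.index("two tablets"), "dose", "two tablets at a time"),
    Entity(text.index("adults"), "population", "adults"),
    Entity(text.index("Oral"), "administration route", "oral"),
    Entity(text.index("once a day"), "frequency", "once a day"),
]
relationship = linear_relationship(entities)
```

  • In this English rendering the frequency precedes the population, whereas the example above lists the population first; the order simply follows the wording of whichever text is being processed.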
  • The entities identified in the drug text can also be formatted first (for example, removing spaces, converting full-width characters to half-width, converting Chinese numerals into Arabic numerals, and normalizing symbols), and the formatted results are then sorted according to the position of each entity in the drug text to form the entity linear relationship between the entities.
  • For example, for the drug text "Oral, once a day for adults, two tablets at a time; once a day for children, one tablet at a time", after Chinese numerals are converted into Arabic numerals, the entity linear relationship is: <administration route: oral> <population: adults> <frequency: 1 time a day> <dose: 2 tablets at a time> <population: children> <frequency: 1 time a day> <dose: 1 tablet at a time>.
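  • The formatting operations named above can be sketched as follows. This is a naive illustration: the names `chinese_to_int` and `normalize_drug_text` are hypothetical, only numerals below 100 are handled, and every numeral character is converted even where it is part of an ordinary word, which a real system would have to guard against. The Chinese sample string is an assumed rendering of the example and is not taken from the source.

```python
import re
import unicodedata

DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

# Either a tens form such as 十, 十二, 二十, 二十五, or a plain run of digits.
NUMERAL = re.compile(r"[一二两三四五六七八九]?十[一二三四五六七八九]?|[零一二两三四五六七八九]+")

def chinese_to_int(numeral):
    """Convert a Chinese numeral below 100 (e.g. 二十五) to an int."""
    if "十" in numeral:
        tens, _, ones = numeral.partition("十")
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
    value = 0
    for ch in numeral:
        value = value * 10 + DIGITS[ch]
    return value

def normalize_drug_text(text):
    text = unicodedata.normalize("NFKC", text)      # full-width -> half-width
    text = re.sub(r"\s+", "", text)                 # remove spaces
    text = text.replace("。", ".").replace("、", ",")  # punctuation NFKC leaves alone
    return NUMERAL.sub(lambda m: str(chinese_to_int(m.group())), text)

formatted = normalize_drug_text("口服，成人一日一次，一次两片；")
```

  • NFKC normalization is used here as a convenient way to fold full-width punctuation into ASCII; whether the original method normalizes punctuation this way is not stated.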
  • Step 205: generate a drug knowledge graph according to the parsing result obtained by syntactically parsing the entity linear relationship.
  • Syntactic parsing of the entity linear relationship means extracting, with a preset syntactic analysis algorithm and according to the type of each entity and the grammatical rules between entities, the entity relationship that best expresses the drug text.
  • The parsing process turns the entity linear relationship into a tree-like entity relationship; it is also the process of converting the natural language of the drug text into a data structure that a computer can recognize.
  • The preset syntactic analysis algorithm may include rule-based syntactic analysis and/or statistics-based syntactic analysis. For a given input text, rule-based syntactic analysis maps the input to a parsing result through rules written in advance in a knowledge base, while statistics-based syntactic analysis enumerates all possible parsing results and selects the one with the highest probability.
  • The preset syntactic analysis algorithm can also use rule-based and statistics-based syntactic analysis together, making full use of both the rigor of hand-written rules and the flexibility of statistical analysis.
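  • One possible way to combine the two kinds of analysis is to let the rules discard invalid candidates and then pick the most probable of the remaining ones. The sketch below assumes exactly that; the candidate structure, the rule, the function names, and the probabilities are all invented for illustration and do not come from the source.

```python
def choose_parse(candidates, rules):
    """Pick the most probable candidate parse that satisfies every rule.

    candidates: list of (parse, probability) pairs
    rules: list of predicates taking a parse and returning True or False
    """
    valid = [(parse, prob) for parse, prob in candidates
             if all(rule(parse) for rule in rules)]
    if not valid:
        return None
    return max(valid, key=lambda item: item[1])[0]

def every_population_has_usage(parse):
    """Hypothetical rule: each population must have at least one usage entity."""
    return all(len(usages) > 0 for usages in parse.values())

# Two hypothetical candidate parses with made-up probabilities.
candidates = [
    ({"adults": ["1 time a day", "2 tablets at a time"],
      "children": ["1 time a day", "1 tablet at a time"]}, 0.6),
    ({"adults": ["1 time a day", "2 tablets at a time",
                 "1 time a day", "1 tablet at a time"],
      "children": []}, 0.9),
]
best = choose_parse(candidates, [every_population_has_usage])
```

  • Here the second candidate has the higher score but violates the rule, so the first candidate is selected; this illustrates how rules can constrain a purely statistical choice.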
  • For example, for the drug text "Oral, once a day for adults, two tablets at a time; once a day for children, one tablet at a time", the entity linear relationship between the entities is formed as follows: <administration route: oral> <population: adults> <frequency: 1 time a day> <dose: 2 tablets at a time> <population: children> <frequency: 1 time a day> <dose: 1 tablet at a time>.
  • The generated drug knowledge graph retains the relationships between the entities in the entity linear relationship, which ensures the accuracy and interpretability of the drug knowledge graph mining results.
  • The method for constructing a drug knowledge graph first identifies the entities in the drug text; second, replaces the key medical entities among the entities with character strings that conform to preset rules to obtain replacement text; third, restores each character string in the word segmentation result determined from the replacement text to the key medical entity it replaced; then forms an entity linear relationship between the entities based on the entities; and finally generates a drug knowledge graph according to the parsing result obtained by syntactically parsing the entity linear relationship.
  • Replacing the key medical entities with character strings that conform to preset rules ensures accurate word segmentation of the drug text. Obtaining the drug knowledge graph by syntactic parsing on the basis of the entity linear relationship makes it convenient to convert natural-language drug usage and dosage into a data structure that computers can recognize, which benefits knowledge graph mining in the medical field and ensures the accuracy and interpretability of the drug knowledge graph.
  • The entities of the drug text include: key medical entities and non-key medical entities.
  • FIG. 3 shows a process 300 of another embodiment of the method for constructing a drug knowledge graph of the present disclosure.
  • The method for constructing the drug knowledge graph includes the following steps:
  • Step 301: identify the key medical entities in the drug text.
  • The key medical entities can be identified by dictionary enumeration.
  • Enumeration here means checking one by one whether each candidate belongs to the key medical entities; disease names, for example, are identified against a maintained dictionary table of diseases.
  • The dictionary table of diseases is traversed for the drug text, and if a word A in the table occurs in the drug text, that occurrence of A is marked as a disease name.
  • Step 302: replace the key medical entities with character strings that conform to preset rules to obtain replacement text.
  • The character strings that conform to the preset rules are user-defined strings, and different strings can be defined for different types of key medical entities.
  • For example, for the drug text "Treatment of whooping cough: once a day, one tablet at a time", "whooping cough" is recognized as a key medical entity and the defined string "indication_0" is used to replace it, giving the replacement text "indication_0: once a day, one tablet at a time".
  • Step 303: establish a mapping relationship table between the character strings and the key medical entities replaced by the character strings.
  • A character string conforming to the preset rules can be generated for each key medical entity, and the correspondence between each character string and its key medical entity can be recorded in real time in the mapping relationship table.
  • Establishing this mapping between the character strings and the key medical entities they replace facilitates the subsequent restoration of the key medical entities.
  • For example, for the drug text "Treatment of whooping cough: once a day, one tablet at a time", the key medical entity is identified as "whooping cough" and the defined string "indication_0" replaces it, so the replacement text is "indication_0: once a day, one tablet at a time" and the established mapping relationship is: indication_0 <-> whooping cough.
  • Step 304: using the mapping relationship table, restore each character string in the word segmentation result determined from the replacement text to the key medical entity it replaced.
  • Because the mapping relationship table records the mapping between the character strings that conform to the preset rules and the key medical entities, once such a character string is found in the word segmentation result, the table can be searched to obtain the key medical entity corresponding to it.
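  • The lookup-based restoration can be sketched as follows. The token list stands in for the output of a word segmentation tool and is made up for illustration, as are the function name `restore_tokens` and the use of a plain dict as the mapping relationship table.

```python
def restore_tokens(tokens, mapping_table):
    """Replace every placeholder token with the key medical entity it stands for.

    Tokens that are not placeholders are returned unchanged.
    """
    return [mapping_table.get(token, token) for token in tokens]

# Mapping relationship table built during the replacement step
mapping_table = {"indication_0": "whooping cough"}

# Hypothetical word segmentation result of the replacement text
tokens = ["indication_0", ":", "once a day", ",", "one tablet at a time"]
restored = restore_tokens(tokens, mapping_table)
```

  • Because the placeholder contains no characters that a segmenter would be likely to split apart, it tends to survive segmentation as a single token, which is what makes this simple lookup sufficient in the sketch.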
  • Step 305: identify the non-key medical entities in the word segmentation result.
  • A text template matching tool, such as regular expressions, can be used to write template matching rules that identify the entities other than the key medical entities (drug name and disease name), that is, the non-key medical entities.
  • An entity recognition model can also be used to identify the non-key medical entities.
  • Entity recognition models include the CRF (Conditional Random Fields) model and the BERT (Bidirectional Encoder Representations from Transformers) model.
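  • The template matching approach mentioned above can be sketched with regular expressions as follows. The patterns are hypothetical, written for the English rendering of the example and covering only a few phrasings; actual rules would target the Chinese drug text and be considerably broader.

```python
import re

# Hypothetical template rules, one per non-key medical entity type.
PATTERNS = {
    "administration route": re.compile(r"\b(?:oral|topical)\b", re.IGNORECASE),
    "population": re.compile(r"\b(?:adults|children)\b"),
    "frequency": re.compile(r"(?:once|twice|\d+ times?) a day"),
    "dose": re.compile(r"(?:one|two|three|\d+) tablets? at a time"),
}

def find_non_key_entities(text):
    """Return (start, kind, matched text) for every template match, in text order."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), kind, match.group()))
    return sorted(found)

found = find_non_key_entities(
    "Oral, once a day for adults, two tablets at a time; "
    "once a day for children, one tablet at a time"
)
kinds = [kind for _, kind, _ in found]
```

  • Returning the start offset with each match means the result is already ordered by position in the text, which is the ordering the next step relies on.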
  • In step 306, the key medical entities and the non-key medical entities are sorted according to the order of the entities in the drug text to obtain the entity linear relationship corresponding to the drug text.
  • the drug text will be expressed as an abstract sentence composed of multiple entities in a linear relationship, that is, the entity linear relationship.
  • the positional relationship between the entities in the entity linear relationship maintains the positional relationship of each entity in the drug text.
  • the entity linear relationship can ensure that the information content of the drug text is not lost.
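  • A minimal sketch of this ordering step, assuming each recognized entity carries its character offset in the drug text. The `Entity` record and the `locate` helper are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int   # character offset of the entity in the drug text
    label: str   # entity type, e.g. "disease" or "dose"
    text: str

def entity_linear_relationship(key_entities, non_key_entities):
    """Sort all entities by their position in the original drug text."""
    return sorted(key_entities + non_key_entities, key=lambda e: e.start)

drug_text = "treatment of whooping cough: once a day, one tablet at a time"

def locate(label, phrase):
    # Hypothetical helper: find the phrase offset directly in the text.
    return Entity(drug_text.index(phrase), label, phrase)

key = [locate("disease", "whooping cough")]
non_key = [locate("dose", "one tablet at a time"), locate("frequency", "once a day")]
linear = entity_linear_relationship(key, non_key)
labels = [e.label for e in linear]
```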
  • Step 307 Generate a drug knowledge graph according to the parsing result obtained by performing syntax parsing on the entity linear relationship.
  • The operations and features of step 307 correspond to those of the foregoing step 205, and the descriptions of the operations and features of step 205 also apply to step 307.
  • In this embodiment, a mapping relationship table is used to link each character string to the key medical entity it replaces, which ensures the accuracy of the word segmentation. Further, the identified key medical entities and non-key medical entities are sorted according to the order of each entity in the drug text. The resulting entity linear relationship retains the positional relationship between the entities in the original drug text and provides the basis for subsequently reorganizing them into tree-like entity relationships.
  • On this basis, a drug knowledge graph with tree-like entity relationships can be generated.
  • the entities include: a precondition entity and a usage entity
  • the precondition entities include: a medical key entity.
  • Step 401 based on the precondition entities obtained by identifying the linear relationship of the entities, obtain a precondition merging result.
  • Entities are classified into precondition entities and usage entities. Precondition entities represent drugs, diseases, or the subjects who use a drug or suffer from a disease, for example populations, disease names, drug names, and the conjunctions between these three.
  • The identified precondition entities can be merged (for example, two at a time or three at a time) to obtain precondition merge sub-items. Multiple precondition merge sub-items are combined to form the precondition merge result, which is a set whose elements are the individual precondition merge sub-items.
  • A precondition merge sub-item can also be a single precondition entity.
  • Obtaining the precondition merge result based on the precondition entities obtained by identifying the entity linear relationship includes: identifying the precondition entities in the entity linear relationship; using each identified precondition entity and each combination of precondition entities as a set element to form a precondition entity set; and merging all different set elements in the precondition entity set to obtain the precondition merge result.
  • Merging all different set elements in the precondition entity set can be implemented by combining a rule analysis method with a statistical analysis method. The rule analysis method requires the merging to comply with preset rules. For example, one preset rule for merging the precondition entity set is: the precondition entity set can only include any one or more of three kinds of set elements, namely disease set elements, population set elements, and drug set elements. A disease set element includes a single disease name or multiple disease names connected by connectives.
  • A population set element includes: a single population or multiple populations connected by connectives.
  • A drug set element includes: a single drug name or multiple drug names connected by connectives.
  • The statistical analysis method can be an exhaustive method, which tries all the possibilities.
  • In the exhaustive method, an outer loop wraps an inner loop, and both loops end when a certain exit condition is satisfied. All mergeable different set elements in the precondition entity set can be merged by the statistical analysis method.
  • In this embodiment, each identified precondition entity and each combination of precondition entities is used as a set element to form a precondition entity set, and all different set elements in the precondition entity set are merged to obtain the precondition merge result. This provides a reliable sentence parsing approach for organizing the precondition entities into tree-like entity relationships and ensures the reliability of the generated drug graph.
  • Step 402 based on the usage entity obtained by identifying the linear relationship of the entities, obtain a combined result of usage.
  • Usage entities are entities related to drug use, such as frequency, course of treatment, route of administration, timing of administration, and the conjunctions between these four.
  • The identified usage entities can be merged (for example, two at a time or three at a time) to obtain usage merge sub-items. Multiple usage merge sub-items are combined to form the usage merge result, which is a set whose elements are the individual usage merge sub-items.
  • A usage merge sub-item may also be a single usage entity.
  • Step 403: based on the positional relationship between each precondition entity and each usage entity in the entity linear relationship, combine the precondition merge result and the usage merge result to obtain a root node set whose set elements are formed from at least one element of the precondition merge result and at least one element of the usage merge result.
  • Each set element of the root node set is composed of at least one precondition merge sub-item from the precondition merge result and at least one usage merge sub-item from the usage merge result.
  • Each root node set may be expressed as a combination of 0-1 precondition merge sub-items and one or more usage merge sub-items.
  • For example, an entity linear relationship includes: "Adult (PP1) once a day (MM1), two tablets at a time (MM2); pediatric (PP2) once a day (MM3), one tablet at a time (MM4)". Its precondition entities are PP1 and PP2, and its usage entities are MM1 (once a day), MM2 (two tablets at a time), MM3 (once a day), and MM4 (one tablet at a time).
  • The usage merge result is: {[MM1 MM2][MM3 MM4]; [MM1,MM2][MM3 MM4]; [MM1][MM2 MM3][MM4]; [MM1][MM2 MM3][MM4]; [MM1 MM2][MM3 MM4]; [MM1 MM2][MM3 MM4]}.
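  • One way to picture how the positional relationship pairs preconditions with usages is the sketch below. It assumes, purely for illustration, that a precondition entity governs the usage entities that follow it until the next precondition entity appears; the function name and the dictionary layout are hypothetical and the text does not specify them.

```python
def build_root_nodes(linear):
    """Group each precondition with the usage entities that follow it.

    `linear` is a list of (kind, name) pairs in text order, where kind is
    "PP" for a precondition entity and "MM" for a usage entity.
    """
    nodes = []
    current = None
    for kind, name in linear:
        if kind == "PP":
            current = {"precondition": name, "usages": []}
            nodes.append(current)
        else:
            if current is None:
                # Usage with no preceding precondition: a root node with
                # zero precondition sub-items.
                current = {"precondition": None, "usages": []}
                nodes.append(current)
            current["usages"].append(name)
    return nodes

linear = [("PP", "PP1"), ("MM", "MM1"), ("MM", "MM2"),
          ("PP", "PP2"), ("MM", "MM3"), ("MM", "MM4")]
nodes = build_root_nodes(linear)
```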
  • Step 404 merge all different set elements in the root node set.
  • merging all different set elements in the root node set refers to merging all mergeable different set elements in the root node set, for example, different set elements that do not meet the preset rules are not included in the merging.
  • Step 405 taking the merge result with the highest merge probability among the merge results of the root node set as the parsing result, and adding the parsing result to the knowledge graph.
  • N combination results are obtained by merging different set elements in the root node set.
  • The N merge results are classified by type, the frequency of occurrence of each type is calculated, and the frequencies are normalized to obtain the proportion of each type of merge result. From the proportions of all types of merge results for a given root node set, the probability of each type of merge result is calculated, and the merge result with the highest probability is the final parsing result.
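  • The frequency-normalization step can be sketched as below. The list of merge results is a hypothetical input that mirrors the shape of the earlier example (one grouping appearing four times out of six), and the function name is an assumption.

```python
from collections import Counter

def most_probable_merge(merge_results):
    """Count each distinct merge result, normalize to proportions, pick the max."""
    counts = Counter(merge_results)
    total = sum(counts.values())
    probabilities = {result: count / total for result, count in counts.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities

# Each merge result is a tuple of groups; each group is a tuple of entity names.
a = (("MM1", "MM2"), ("MM3", "MM4"))
b = (("MM1",), ("MM2", "MM3"), ("MM4",))
best, probabilities = most_probable_merge([a, a, b, b, a, a])
```

  • Using hashable tuples for the groupings lets identical merge results collapse into one type without any custom comparison logic.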
  • The identified precondition entities are merged to obtain the precondition merge result; $M_{PP}$ types of merge results can appear in this process.
  • The identified usage entities are merged to obtain the usage merge result; $N_{MM}$ ($N_{MM} > 1$) types of merge results can appear in this process.
  • When the root node set is formed, $P_S$ ($P_S > 1$) kinds of set elements can appear in this process.
  • When all mergeable set elements are merged, a total of $M_{PP} \times N_{MM} \times P_S$ merge results can appear, and the merge result with the highest probability is selected as the parsing result.
  • In the method provided by this embodiment, the precondition entities in the entity linear relationship are identified and merged to obtain the precondition merge result; the usage entities in the entity linear relationship are identified and merged to obtain the usage merge result; the precondition merge result and the usage merge result are combined, based on the positional relationship between each precondition entity and each usage entity in the entity linear relationship, to obtain the root node set; all different set elements in the root node set are merged; and the merge result with the highest merge probability is taken as the parsing result and added to the knowledge graph. Synthesizing the precondition merge result and the usage merge result ensures that the entities in the drug text are covered comprehensively and improves the accuracy of graph generation, which can improve knowledge graphs in the medical field and is conducive to knowledge mining.
  • FIG. 5 shows a flow 500 of an embodiment of a method for obtaining a combined usage result.
  • the method for obtaining the usage merge result includes the following steps:
  • Step 501 identifying the usage entity in the entity linear relationship.
  • the usage and dosage entities are entities related to the use of medicines, such as dosage, frequency, course of treatment, administration route, administration timing, and conjunctions between the aforementioned five.
  • A usage phrase can include one or more usage entities. For example, the usage phrases in the entity linear relationship include: "1 time a day", "1 tablet at a time, once a day", "1 time a day, or 2 tablets at a time once a day".
  • In step 502, each of the identified usage entities and each combination of usage entities is used as a set element to form a usage entity set.
  • the set elements in the usage entity set include: usage entities or a combination of usage entities.
  • For example, the usage entities in the entity linear relationship include: "1 time a day", "1 tablet at a time", "1 time a day", "2 tablets at a time".
  • Combinations between usage entities may include: <1 time a day>, <1 time a day, 1 tablet at a time>, or <1 time a day, 2 tablets at a time>.
  • Step 503 Combine all different set elements in the usage entity set to obtain a usage combination result.
  • Merging all the different set elements in the usage entity set can be implemented by combining a rule analysis method with a statistical analysis method. The rule analysis method requires the merging to comply with preset rules. For example, one preset rule for merging the usage entity set is: the usage entity set can only contain dose set elements, frequency set elements, course-of-treatment set elements, route-of-administration set elements, and timing-of-administration set elements.
  • dose set elements include: a single dose or multiple doses connected by a connective.
  • Frequency set elements include: a single frequency or multiple frequencies connected by connectives.
  • Course set elements include: a single course or multiple courses linked by connectives.
  • Route-of-administration set elements include: a single route of administration or multiple routes of administration linked by a connective.
  • Timing-of-administration set elements include: a single timing of administration or multiple timings of administration linked by a connective.
  • Another preset rule for merging the usage entity set is that set elements of the same type cannot be merged.
  • For example, a usage entity set MM includes: "once a day (MM1), two tablets at a time (MM2); once a day (MM3), one tablet at a time (MM4)", where MM1 to MM4 are set elements of the usage entity set MM.
  • MM1 and MM3 are both frequency set elements, so they cannot be merged with each other; MM2 and MM4 are both dose set elements, so they cannot be merged with each other either.
  • the statistical analysis method can be an exhaustive method. Exhaustive is to try all the possibilities.
  • In the exhaustive method, an outer loop wraps an inner loop, and both loops end when a certain exit condition is satisfied.
  • A specific example of the statistical analysis method is as follows: for all mergeable set elements in the usage entity set MM, take each set element in turn as the starting item and perform the following merge operation: merge the starting item with the other set elements it can be merged with, then continue merging the set elements that have not yet been merged until all mergeable set elements in the usage entity set MM are merged, and then return to re-determine the starting item.
  • The usage merge result of the above usage entity set MM is: {[MM1 MM2][MM3 MM4]; [MM1,MM2][MM3 MM4]; [MM1][MM2 MM3][MM4]; [MM1][MM2 MM3][MM4]; [MM1 MM2][MM3 MM4]; [MM1 MM2][MM3 MM4]}.
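  • The exhaustive merge under the "same type cannot be merged" rule can be sketched as follows. This is one possible reading, not the method as claimed: it assumes merges only join adjacent set elements, and it enumerates every grouping once rather than reproducing the repeated entries in the result above. The function name and the type labels are hypothetical.

```python
def enumerate_merges(elements, types):
    """Enumerate every split of `elements` into contiguous groups such that
    no group contains two elements of the same type."""
    results = []

    def extend(start, groups):
        if start == len(elements):
            results.append(tuple(groups))
            return
        seen = set()
        for end in range(start, len(elements)):
            element_type = types[elements[end]]
            if element_type in seen:
                break  # adding this element would merge two same-type elements
            seen.add(element_type)
            extend(end + 1, groups + [tuple(elements[start:end + 1])])

    extend(0, [])
    return results

types = {"MM1": "frequency", "MM2": "dose", "MM3": "frequency", "MM4": "dose"}
merges = enumerate_merges(["MM1", "MM2", "MM3", "MM4"], types)
```

  • Both groupings that appear in the usage merge result, [MM1 MM2][MM3 MM4] and [MM1][MM2 MM3][MM4], are among the enumerated candidates, while any group of three is rejected because it would contain two elements of the same type.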
  • In this embodiment, each identified usage entity and each combination of usage entities is used as a set element to form the usage entity set, and all the different set elements in the usage entity set are merged to obtain the usage merge result. This provides a reliable sentence parsing approach for organizing the usage entities into tree-like entity relationships and ensures the reliability of the generated drug graph.
  • Referring to FIG. 6, it shows a process 600 of another embodiment of the method for constructing a drug knowledge graph of the present disclosure.
  • the construction method of the drug knowledge graph includes the following steps:
  • Step 601 formatting the medicine text.
  • the formatting process may include at least one of the following: 1) normalizing different punctuation marks representing the same meaning in the drug text; 2) converting Chinese numbers in the drug text into Arabic numbers.
  • Normalization unifies punctuation marks that represent the same meaning.
  • Because punctuation marks expressing the same meaning may not be written uniformly in the drug text, normalization ensures that they take a single form. For example, after normalization, the variant forms of the range symbol "-" all become "-".
  • The punctuation marks include but are not limited to: periods, commas, quotation marks, brackets, etc., and punctuation marks in all of their forms of expression fall within the protection scope of the present application.
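  • A minimal sketch of the two formatting operations. The punctuation table and the numeral converter are assumptions made for the example: the table lists only a few full-width variants, and the converter handles only Chinese numerals below 100 without checking whether a character is actually being used as a number.

```python
import re

# Hypothetical normalization table: variant marks mapped to one ASCII form.
PUNCT_MAP = str.maketrans({"，": ",", "：": ":", "；": ";",
                           "～": "-", "〜": "-", "－": "-"})
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
# Either a well-formed "tens" numeral (e.g. 二十五, 十五, 二十) or a run of digits.
NUMERAL = re.compile(r"[一二两三四五六七八九]?十[一二三四五六七八九]?|[零一二两三四五六七八九]+")

def cn_to_int(numeral):
    """Convert a Chinese numeral below 100 to an int."""
    if "十" in numeral:
        tens, _, ones = numeral.partition("十")
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
    return int("".join(str(DIGITS[ch]) for ch in numeral))

def format_drug_text(text):
    """Normalize punctuation, then rewrite Chinese numerals as Arabic numerals."""
    text = text.translate(PUNCT_MAP)
    return NUMERAL.sub(lambda m: str(cn_to_int(m.group(0))), text)

formatted = format_drug_text("一日三次，一次二片")
ranged = format_drug_text("每日十五毫克～二十毫克")
```

  • A production version would need to avoid converting numeral characters that are part of ordinary words, which this sketch does not attempt.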
  • Step 602 identifying entities in the drug text.
  • Step 603: replace the key medical entities among the entities with character strings that conform to the preset rules to obtain the replacement text.
  • In step 604, the character strings in the word segmentation result determined based on the replacement text are restored to the key medical entities replaced by the character strings.
  • A mapping relationship table between each character string and the key medical entity it replaces may be established. When the key medical entities are restored, each character string is restored to the key medical entity it replaced based on the mapping relationship table.
  • Step 605 based on the entities, form an entity linear relationship between the entities.
  • The method may further include: identifying non-key medical entities in the word segmentation result. In that case, forming the entity linear relationship between the entities based on the entities includes: sorting the key medical entities and the non-key medical entities according to the order of each entity in the drug text to obtain the entity linear relationship corresponding to the drug text.
  • the key pharmaceutical entities include: disease name and drug name; and the non-critical pharmaceutical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
  • Step 606 Generate a drug knowledge graph according to the parsing result obtained by performing the syntactic parsing of the entity linear relationship.
  • the entities include: a precondition entity, a usage entity, and the precondition entity includes: a medical key entity.
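  • The above step 606 includes: obtaining a precondition merge result based on the precondition entities obtained by identifying the entity linear relationship; obtaining a usage merge result based on the usage entities obtained by identifying the entity linear relationship; based on the positional relationship between each precondition entity and each usage entity in the entity linear relationship, combining the precondition merge result and the usage merge result to obtain a root node set whose set elements are formed from at least one element of the precondition merge result and at least one element of the usage merge result; merging all different set elements in the root node set; and taking the merge result with the highest merge probability among the merge results of the root node set as the parsing result, and adding the parsing result to the knowledge graph.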
  • Step 607 identifying attributes of entities in the drug text.
  • each entity in the drug text has its corresponding attribute.
  • For example, the original drug text is "Children aged 2-5", and the recognized attribute result is: {"crowdType": "child", "crowdAgeFrom": 2, "crowdAgeTo": 5, "crowdAgeUnit": "Age"}.
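  • Attribute recognition for a population entity could look like the sketch below, which produces the same keys as the example above. The regular expression, the function name, and the restriction to English "aged N-M" phrasing are assumptions for illustration only.

```python
import re

# Hypothetical pattern for an age-range population phrase.
AGE_RANGE = re.compile(
    r"(?P<who>children|adults?)\s+aged\s+(?P<low>\d+)\s*-\s*(?P<high>\d+)",
    re.IGNORECASE,
)

def crowd_attributes(text):
    """Return the population attributes found in the text, or None."""
    match = AGE_RANGE.search(text)
    if match is None:
        return None
    crowd_type = "child" if match.group("who").lower() == "children" else "adult"
    return {
        "crowdType": crowd_type,
        "crowdAgeFrom": int(match.group("low")),
        "crowdAgeTo": int(match.group("high")),
        "crowdAgeUnit": "Age",
    }

attributes = crowd_attributes("Children aged 2-5")
```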
  • Step 608 adding the attributes of the entity to the drug knowledge graph.
  • Table 1 is an entity in a drug knowledge graph, the description of the entity, and the attribute of the entity.
  • adding each entity attribute to the drug knowledge graph can ensure the comprehensiveness of entity information in the drug knowledge graph.
  • In the construction method of the drug knowledge graph provided by this embodiment, the drug text is formatted before the entities in the drug text are identified, which ensures the efficiency of entity recognition. The attributes of the entities in the drug text are identified and added to the drug knowledge graph, which enriches the content of the graph and ensures the comprehensiveness of its information.
  • the present disclosure provides an embodiment of a device for constructing a drug knowledge graph.
  • the device embodiment corresponds to the method embodiment shown in FIG. 2 .
  • an embodiment of the present disclosure provides an apparatus 700 for constructing a drug knowledge graph.
  • the apparatus 700 includes an identification unit 701, a replacement unit 702, a restoration unit 703, a shaping unit 704 and a parsing unit 705.
  • the identification unit 701 may be configured to identify entities in the drug text.
  • the replacement unit 702 may be configured to replace the key medical entities among the entities with character strings conforming to preset rules to obtain the replacement text.
  • the restoration unit 703 may be configured to restore the character strings in the word segmentation result determined based on the replacement text to the key medical entities replaced by the character strings.
  • the shaping unit 704 may be configured to form, based on the entities, an entity linear relationship between the entities.
  • the parsing unit 705 may be configured to generate a drug knowledge graph according to a parsing result obtained by syntactically parsing the entity linear relationship.
  • For the specific processing of the identification unit 701, the replacement unit 702, the restoration unit 703, the shaping unit 704, and the parsing unit 705, and the technical effects they bring, refer to the corresponding steps of the embodiment shown in FIG. 2, respectively.
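  • The division into five units can be pictured as a simple pipeline object. The class below is a hypothetical sketch: the field names echo units 701 to 705, but the callables plugged in here are toy stand-ins, not the processing described in the embodiments.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KnowledgeGraphBuilder:
    identify: Callable  # role of the identification unit 701
    replace: Callable   # role of the replacement unit 702
    restore: Callable   # role of the restoration unit 703
    shape: Callable     # role of the shaping unit 704
    parse: Callable     # role of the parsing unit 705

    def build(self, drug_text):
        entities = self.identify(drug_text)
        replaced, mapping = self.replace(drug_text, entities)
        tokens = self.restore(replaced, mapping)
        linear = self.shape(tokens)
        return self.parse(linear)

# Toy stand-ins so the pipeline can be run end to end.
builder = KnowledgeGraphBuilder(
    identify=lambda text: ["whooping cough"],
    replace=lambda text, ents: (text.replace(ents[0], "indication_0"),
                                {"indication_0": ents[0]}),
    restore=lambda replaced, mapping: [mapping.get(t, t) for t in replaced.split()],
    shape=lambda tokens: tokens,
    parse=lambda linear: {"root": linear},
)
graph = builder.build("whooping cough once daily")
```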
  • the above-mentioned apparatus 700 further includes: a mapping unit (not shown in the figure).
  • the mapping unit may be configured to establish a mapping relationship table between the strings and the medical key entities replaced by the strings.
  • the above-mentioned apparatus 700 further includes: a resolution unit (not shown in the figure).
  • the resolution unit can be configured to identify the non-key medical entities in the word segmentation result.
  • the above-mentioned shaping unit 704 is further configured to sort the key medical entities and non-key medical entities according to the order of each entity in the drug text to obtain the entity linear relationship corresponding to the drug text.
  • the above-mentioned key medical entities include: disease name, drug name; the above-mentioned non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
  • the above-mentioned entities include: a precondition entity and a usage entity, and the precondition entity includes: a key medical entity.
  • the above-mentioned parsing unit 705 includes: a precondition obtaining module (not shown in the figure), a usage obtaining module (not shown in the figure), a combination module (not shown in the figure), a merge module (not shown in the figure), a parsing module (not shown in the figure), and an adding module (not shown in the figure).
  • the precondition obtaining module may be configured to obtain a precondition merge result based on the precondition entities obtained by identifying the entity linear relationship.
  • the usage obtaining module may be configured to obtain the usage merge result based on the usage entities obtained by identifying the entity linear relationship.
  • the combination module may be configured to combine the precondition merge result and the usage merge result, based on the positional relationship between each precondition entity and each usage entity in the entity linear relationship, to obtain a root node set whose set elements are formed from at least one element of the precondition merge result and at least one element of the usage merge result.
  • the merge module can be configured to merge all the different collection elements in the root node collection.
  • the parsing module may be configured to use the merge result with the highest merge probability among the merge results of the root node set as the parsing result.
  • the adding module can be configured to add the parsing results to the knowledge graph.
  • the above-mentioned precondition obtaining module includes: a precondition identification sub-module (not shown in the figure), a precondition combination sub-module (not shown in the figure), and a precondition merge sub-module (not shown in the figure).
  • the precondition identification sub-module may be configured to identify the precondition entities in the entity linear relationship.
  • the precondition combination sub-module may be configured to use each of the identified precondition entities and each combination of precondition entities as a set element to form a precondition entity set.
  • the precondition merge sub-module can be configured to merge all different set elements in the precondition entity set to obtain the precondition merge result.
  • the above-mentioned usage obtaining module includes: a usage identification sub-module (not shown in the figure), a usage combination sub-module (not shown in the figure), and a usage merge sub-module (not shown in the figure).
  • the usage identification sub-module may be configured to identify usage entities in a linear relationship of entities.
  • the usage combination sub-module may be configured to use each of the identified usage entities and the combination between the various usage entities as a set element to form a usage entity set.
  • the usage merge submodule can be configured to merge all the different collection elements in the usage entity collection to get the usage merge result.
  • the above-mentioned apparatus further includes: a distinguishing unit (not shown in the figure) and an adding unit (not shown in the figure).
  • the distinguishing unit may be configured to identify attributes of entities in the drug text.
  • the adding unit may be configured to add attributes of the entity to the drug knowledge graph.
  • the above-mentioned apparatus further comprises: a formatting unit (not shown in the figure) and/or a conversion unit (not shown in the figure).
  • the formatting unit may be configured to normalize different punctuation marks representing the same meaning in the drug text.
  • the conversion unit may be configured to convert the Chinese numerals in the medicine text into Arabic numerals.
  • Referring to FIG. 8, a schematic structural diagram of an electronic device 800 suitable for implementing embodiments of the present disclosure is shown.
  • an electronic device 800 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 801, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800.
  • the processing device 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
  • An input/output (I/O) interface 805 is also connected to bus 804 .
  • the following devices can be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, etc.; output devices 807 including, for example, a liquid crystal display (LCD), speakers, vibrators, etc.; storage devices 808 including, for example, magnetic tapes, hard disks, etc.; and communication devices 809.
  • The communication device 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 8 shows an electronic device 800 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 8 can represent one device, and can also represent multiple devices as required.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods illustrated in the flowcharts.
  • the computer program may be downloaded and installed from the network via the communication device 809, or from the storage device 808, or from the ROM 802.
  • When the computer program is executed by the processing device 801, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium of the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal in baseband or propagated as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, RF (Radio Frequency, radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned server; or may exist alone without being assembled into the server.
  • the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the server, the server is caused to: identify entities in the drug text; replace the key medical entities among the entities with character strings that conform to preset rules to obtain the replacement text; restore the character strings in the word segmentation result determined based on the replacement text to the key medical entities replaced by the character strings; form, based on the entities, the entity linear relationship between the entities; and generate a drug knowledge graph according to the parsing result obtained by syntactically parsing the entity linear relationship.
  • Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in software or hardware.
  • the described unit can also be provided in the processor, for example, it can be described as: a processor including an identification unit, a replacement unit, a restoration unit, a shaping unit and a parsing unit.
  • the names of these units do not constitute a limitation of the unit itself in some cases, for example, the identification unit may also be described as a unit "configured to identify an entity in the drug text".

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Toxicology (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

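The present application discloses a method and an apparatus for constructing a drug knowledge graph. One specific implementation of the method includes: identifying entities in a drug text; replacing key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text; restoring the character strings in a word segmentation result determined from the replaced text to the key medical entities they replaced; forming, based on the entities, an entity linear relationship among the entities; and generating a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship. This implementation improves the accuracy of the medical knowledge graph.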

Description

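Method and Apparatus for Constructing a Drug Knowledge Graph
This patent application claims priority to Chinese patent application No. 202010750770.3, filed on July 30, 2020 by Beijing Jingdong Tuoxian Technology Co., Ltd. (北京京东拓先科技有限公司) and entitled "Method and Apparatus for Constructing a Drug Knowledge Graph", the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, specifically to the field of knowledge graph technology, and in particular to a method and an apparatus for constructing a drug knowledge graph, an electronic device, and a computer-readable medium.
Background
Knowledge graphs are a branch of knowledge engineering in artificial intelligence and have relatively mature applications in general domains. The usage-and-dosage section of a drug package insert, although built on general grammar rules, has its own abbreviations and grammatical structures that conventional syntactic analysis methods cannot parse, and no mature drug knowledge graph has yet been made public.
Summary
Embodiments of the present disclosure propose a method and an apparatus for constructing a drug knowledge graph, an electronic device, and a computer-readable medium.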
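In a first aspect, an embodiment of the present disclosure provides a method for constructing a drug knowledge graph, the method including: identifying entities in a drug text; replacing key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text; restoring the character strings in a word segmentation result determined from the replaced text to the key medical entities they replaced; forming, based on the entities, an entity linear relationship among the entities; and generating a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
In some embodiments, the method further includes: establishing a mapping table between the character strings and the key medical entities they replaced.
In some embodiments, the method further includes: identifying non-key medical entities in the word segmentation result; and forming, based on the entities, the entity linear relationship among the entities includes: sorting the key medical entities and the non-key medical entities in the order in which the entities appear in the drug text, to obtain the entity linear relationship corresponding to the drug text.
In some embodiments, the key medical entities include: disease names and drug names; the non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
In some embodiments, the entities include precondition entities and usage-and-dosage entities, and the precondition entities include the key medical entities; generating the drug knowledge graph according to the parsing result obtained by performing syntactic parsing on the entity linear relationship includes: obtaining a precondition merge result based on the precondition entities identified in the entity linear relationship; obtaining a usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship; combining the precondition merge result with the usage-and-dosage merge result based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, to obtain a root node set whose set elements each consist of at least one element of the precondition merge result and at least one element of the usage-and-dosage merge result; merging all different set elements in the root node set; and taking, among the merge results of the root node set, the merge result with the highest merge probability as the parsing result, and adding the parsing result to the knowledge graph.
In some embodiments, obtaining the precondition merge result based on the precondition entities identified in the entity linear relationship includes: identifying the precondition entities in the entity linear relationship, and forming a precondition entity set whose set elements are the individual identified precondition entities and the combinations of those precondition entities; and merging all different set elements in the precondition entity set to obtain the precondition merge result.
In some embodiments, obtaining the usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship includes: identifying the usage-and-dosage entities in the entity linear relationship, and forming a usage-and-dosage entity set whose set elements are the individual identified usage-and-dosage entities and the combinations of those usage-and-dosage entities; and merging all different set elements in the usage-and-dosage entity set to obtain the usage-and-dosage merge result.
In some embodiments, the method further includes: identifying attributes of the entities in the drug text; and adding the attributes of the entities to the drug knowledge graph.
In some embodiments, the method further includes performing at least one of the following formatting operations on the drug text: normalizing different punctuation marks that represent the same meaning in the drug text; and converting Chinese numerals in the drug text into Arabic numerals.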
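In a second aspect, an embodiment of the present disclosure provides an apparatus for constructing a drug knowledge graph, the apparatus including: an identification unit configured to identify entities in a drug text; a replacement unit configured to replace key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text; a restoration unit configured to restore the character strings in a word segmentation result determined from the replaced text to the key medical entities they replaced; a shaping unit configured to form, based on the entities, an entity linear relationship among the entities; and a parsing unit configured to generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
In some embodiments, the apparatus further includes: a mapping unit configured to establish a mapping table between the character strings and the key medical entities they replaced.
In some embodiments, the apparatus further includes: a discrimination unit configured to identify non-key medical entities in the word segmentation result; and the shaping unit is further configured to sort the key medical entities and the non-key medical entities in the order in which the entities appear in the drug text, to obtain the entity linear relationship corresponding to the drug text.
In some embodiments, the key medical entities include: disease names and drug names; the non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
In some embodiments, the entities include precondition entities and usage-and-dosage entities, and the precondition entities include the key medical entities; the parsing unit includes: a precondition obtaining module configured to obtain a precondition merge result based on the precondition entities identified in the entity linear relationship; a usage-and-dosage obtaining module configured to obtain a usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship; a combination module configured to combine the precondition merge result with the usage-and-dosage merge result based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, to obtain a root node set whose set elements each consist of at least one element of the precondition merge result and at least one element of the usage-and-dosage merge result; a merging module configured to merge all different set elements in the root node set; a parsing module configured to take, among the merge results of the root node set, the merge result with the highest merge probability as the parsing result; and an adding module configured to add the parsing result to the knowledge graph.
In some embodiments, the precondition obtaining module includes: a precondition identification sub-module configured to identify the precondition entities in the entity linear relationship; a precondition combination sub-module configured to form a precondition entity set whose set elements are the individual identified precondition entities and the combinations of those precondition entities; and a precondition merging sub-module configured to merge all different set elements in the precondition entity set to obtain the precondition merge result.
In some embodiments, the usage-and-dosage obtaining module includes: a usage-and-dosage identification sub-module configured to identify the usage-and-dosage entities in the entity linear relationship; a usage-and-dosage combination sub-module configured to form a usage-and-dosage entity set whose set elements are the individual identified usage-and-dosage entities and the combinations of those usage-and-dosage entities; and a usage-and-dosage merging sub-module configured to merge all different set elements in the usage-and-dosage entity set to obtain the usage-and-dosage merge result.
In some embodiments, the apparatus further includes: a differentiation unit configured to identify attributes of the entities in the drug text; and an adding unit configured to add the attributes of the entities to the drug knowledge graph.
In some embodiments, the apparatus further includes a formatting unit and/or a conversion unit, the formatting unit being configured to normalize different punctuation marks that represent the same meaning in the drug text, and the conversion unit being configured to convert Chinese numerals in the drug text into Arabic numerals.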
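In a third aspect, an embodiment of the present disclosure provides an electronic device, the electronic device including: one or more processors; and a storage apparatus on which one or more programs are stored; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method described in any implementation of the first aspect.
The method and apparatus for constructing a drug knowledge graph provided by the embodiments of the present disclosure first identify entities in a drug text; next, replace key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text; then restore the character strings in a word segmentation result determined from the replaced text to the key medical entities they replaced; then form, based on the entities, an entity linear relationship among the entities; and finally generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship. Replacing the key medical entities with rule-conforming character strings before the drug text is segmented ensures the accuracy of word segmentation of the medical text. The drug knowledge graph obtained by syntactic parsing on the basis of the entity linear relationship makes it easier to convert natural-language usage and dosage into a data structure that a computer can recognize, supports knowledge graph mining in the medical field, and ensures the accuracy and interpretability of the drug knowledge graph.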
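Brief Description of the Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which an embodiment of the present disclosure may be applied;
Fig. 2 is a flowchart of one embodiment of the method for constructing a drug knowledge graph according to the present disclosure;
Fig. 3 is a flowchart of another embodiment of the method for constructing a drug knowledge graph according to the present disclosure;
Fig. 4 is a flowchart of one embodiment of the method for generating a drug knowledge graph according to the present disclosure;
Fig. 5 is a flowchart of one embodiment of the method for obtaining a usage-and-dosage merge result according to the present disclosure;
Fig. 6 is a flowchart of a third embodiment of the method for constructing a drug knowledge graph according to the present disclosure;
Fig. 7 is a schematic structural diagram of one embodiment of the apparatus for constructing a drug knowledge graph according to the present disclosure;
Fig. 8 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.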
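Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that the embodiments of the present disclosure and the features of the embodiments may be combined with one another where no conflict arises. The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which the method for constructing a drug knowledge graph of the present disclosure may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, typically wireless communication links and the like.
The terminal devices 101, 102, and 103 interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications, such as instant messaging tools and mail clients, may be installed on the terminal devices 101, 102, and 103.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be user devices with communication and control functions, and these user devices can communicate with the server 105. When they are software, they may be installed in such user devices; the terminal devices 101, 102, and 103 may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server that provides various services, for example a knowledge graph server that supports the knowledge graph system on the terminal devices 101, 102, and 103. The knowledge graph server may analyze and process information related to target images in the network and feed the processing result (such as the generated knowledge graph) back to the terminal devices.
It should be noted that the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the method for constructing a drug knowledge graph provided by the embodiments of the present disclosure is generally executed by the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as the implementation requires.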
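In physical hospitals and Internet hospitals, doctors issue a large number of drug prescriptions every day, and a pharmacist must review whether the usage and dosage of the drugs in these prescriptions are reasonable. Reviewing so many prescriptions places a heavy workload and burden on pharmacists. If drug prescriptions are turned into a drug knowledge graph, then when the knowledge graph is applied, the usage and dosage that correspond to a patient can be matched according to the patient's individual circumstances, and candidate options can be offered for the pharmacist to screen, which reduces the pharmacist's workload and saves review time.
Fig. 2 shows a flow 200 of one embodiment of the method for constructing a drug knowledge graph according to the present disclosure. The method includes the following steps:
Step 201: identify entities in a drug text.
In this embodiment, the execution body on which the method runs (such as a server or a terminal device) may obtain the drug text in real time or by reading it from memory.
A drug text covers drugs used to treat different diseases in different populations, together with the usage and dosage of each drug. The drug text may be the package insert in a drug box or a drug prescription. For example, one package insert reads: "口服,成人一日一次,一次两片;小儿一日一次,一次一片" (oral administration; adults: once a day, two tablets each time; children: once a day, one tablet each time).
Because a drug text covers drugs used to treat different diseases in different populations, together with the usage and dosage of each drug, the entities in a drug text may include: population, disease name, drug name, dose, frequency, course of treatment, route of administration, timing of administration, conjunctions, and so on.
Given that drug and disease names are domain-specific, and to make entities easier to identify, in some optional implementations of this embodiment the entities may be classified into key medical entities and non-key medical entities. Key medical entities are terms specific to medicine, for example drug names and disease names. Non-key medical entities are terms not specific to medicine; they are general terms in the drug text, for example dose, course of treatment, and route of administration.
In some optional implementations of this embodiment, the key medical entities include: drug names and disease names; the non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration. The key and non-key medical entities provided by this optional implementation cover the entity types found in drug package inserts, which ensures that the classification of entity types is comprehensive.
In this embodiment, identifying the entities in the drug text may mean identifying the key medical entities, or identifying both the key and the non-key medical entities. Key medical entities can be identified by dictionary enumeration. Enumeration here means checking, one entity at a time, whether it is a key medical entity. Dictionary enumeration of disease names relies on a maintained dictionary of diseases: the dictionary is traversed for the drug text, and if a word A in the dictionary occurs in the drug text, that occurrence of word A is labeled as a disease name.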
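The dictionary enumeration described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `tag_terms`, the dictionary contents, and the longest-term-first overlap rule are all assumptions added for the example.

```python
def tag_terms(text, dictionary, label):
    """Return sorted (start, end, label, term) spans for dictionary terms found in text."""
    spans = []
    # Longer terms first, so a long name wins over a shorter name nested inside it.
    for term in sorted(dictionary, key=len, reverse=True):
        start = text.find(term)
        while start != -1:
            end = start + len(term)
            # Keep the match only if it does not overlap an earlier span.
            if all(end <= s or start >= e for s, e, _, _ in spans):
                spans.append((start, end, label, term))
            start = text.find(term, start + 1)
    return sorted(spans)


DISEASES = {"百日咳", "哮喘"}  # hypothetical dictionary entries
spans = tag_terms("治疗百日咳:一日一次,一次一片", DISEASES, "disease")
print(spans)
```

A production dictionary would likely be large enough to justify a trie or an Aho-Corasick automaton instead of repeated `str.find` calls; the linear scan is kept here only for readability.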
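To make syntactic parsing of the entities easier, the entities may optionally also be classified into precondition entities and usage-and-dosage entities. A precondition entity denotes a drug, a disease, or the subject who uses the drug or has the disease, for example a population, a disease name, a drug name, and the conjunctions between these three; precondition entities include the key medical entities. A usage-and-dosage entity is an entity related to how the drug is used, for example frequency, course of treatment, route of administration, timing of administration, and the conjunctions between these four.
Different drug texts contain different entities: some drug texts include both precondition entities and usage-and-dosage entities, while others include only usage-and-dosage entities.
In this embodiment, identifying the entities in the drug text may mean identifying the precondition entities, or identifying both the precondition entities and the usage-and-dosage entities.
Step 202: replace the key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text.
In this optional implementation, because drug names are domain-specific terms of some complexity, they may affect the subsequent generation of the drug knowledge graph, so the identified key medical entities need to be replaced.
During replacement, once a key medical entity has been identified, a character string that conforms to the preset rule is generated for it, and the correspondence between that character string and the key medical entity is recorded in real time.
In this embodiment, the generated character string may be a user-defined string that follows a certain rule, and different strings may be defined to distinguish different kinds of key medical entities. For example, for the drug text "治疗百日咳:一日一次,一次一片" (treatment of whooping cough: once a day, one tablet each time), "百日咳" (whooping cough) is identified as a key medical entity and the string "indication_0" is defined to replace it, which gives the replaced text "indication_0:一日一次,一次一片".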
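The replacement step, together with the recording of the correspondence, might look like the sketch below. The function name `replace_key_entities`, the span format, and the `prefix_index` naming scheme are assumptions chosen to match the "indication_0" example; unlike the example in the text, this sketch leaves the surrounding word "治疗" in place.

```python
def replace_key_entities(text, spans, prefix="indication"):
    """Replace (start, end, term) spans with placeholders; return (text, mapping)."""
    mapping, pieces, cursor = {}, [], 0
    for i, (start, end, term) in enumerate(sorted(spans)):
        placeholder = f"{prefix}_{i}"
        mapping[placeholder] = term          # placeholder -> original entity
        pieces += [text[cursor:start], placeholder]
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


replaced, table = replace_key_entities("治疗百日咳:一日一次,一次一片", [(2, 5, "百日咳")])
print(replaced)
print(table)
```

Using a different prefix per entity kind (for instance a separate prefix for drug names) would keep the kinds distinguishable after segmentation, as the paragraph above suggests.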
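Optionally, the rule-conforming character strings may instead be assigned automatically by a character template. The template may also record the correspondence between the character strings and the key medical entities, so that after the strings have replaced the key medical entities, the correspondence can be read back from the template and the key medical entities restored.
Step 203: restore the character strings in the word segmentation result determined from the replaced text to the key medical entities they replaced.
In this embodiment, the replaced text is first segmented into words; the rule-conforming character strings are then identified in the segmentation result and restored to the key medical entities.
Specifically, the segmentation result may be obtained with the Chinese word segmentation function of a Chinese word segmentation tool. Chinese word segmentation splits a sequence of Chinese characters into individual words. Available Chinese word segmentation tools include jieba, SnowNLP (Simplified Chinese Text Processing), and others.
Step 204: form, based on the entities, an entity linear relationship among the entities.
In this embodiment, the entities identified in the drug text may be sorted by their positions in the drug text to form the entity linear relationship. For example, for the drug text "口服,成人一日一次,一次两片;小儿一日一次,一次一片", entity identification yields the entity linear relationship: {route of administration: 口服}{population: 成人}{frequency: 一日一次}{dose: 一次二片}{population: 小儿}{frequency: 一日一次}{dose: 一次一片}.
Optionally, the entities identified in the drug text may also be formatted (for example, by removing spaces, converting between full-width and half-width characters, converting Chinese numerals into Arabic numerals, and normalizing symbols), and the formatted results are then sorted by the positions of the entities in the drug text to form the entity linear relationship. For example, for the same drug text, after Chinese numerals are converted into Arabic numerals, the entity linear relationship is: {route of administration: 口服}{population: 成人}{frequency: 1日1次}{dose: 1次2片}{population: 小儿}{frequency: 1日1次}{dose: 1次1片}.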
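Forming the entity linear relationship amounts to ordering the identified entities by where they occur in the text. The sketch below assumes each entity is carried as an `(offset, type, text)` tuple; that representation and the English type labels are illustrative choices, not part of the disclosure.

```python
def linear_relation(entities):
    """Order (offset, type, text) entities by offset and return (type, text) pairs."""
    return [(etype, text) for _, etype, text in sorted(entities)]


# Offsets refer to "口服,成人1日1次,1次2片;小儿1日1次,1次1片".
found = [
    (10, "dose", "1次2片"), (0, "route", "口服"), (5, "frequency", "1日1次"),
    (3, "population", "成人"), (22, "dose", "1次1片"), (15, "population", "小儿"),
    (17, "frequency", "1日1次"),
]
relation = linear_relation(found)
for etype, text in relation:
    print(etype, text)
```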
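Step 205: generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
In this embodiment, syntactic parsing of the entity linear relationship means extracting, from the entity linear relationship, the entity relationship that best expresses the drug text, according to the type of each entity and the grammar rules between entities, in combination with a preset syntactic analysis algorithm. The parsing process turns the entity linear relationship into a tree-shaped entity relationship; it is also the process of converting the natural language of the drug text into a data structure that a computer can recognize.
The preset syntactic analysis algorithm may include rule-based syntactic analysis and/or statistics-based syntactic analysis. For a given input text, rule-based syntactic analysis maps the input text to a parsing result through knowledge-base rules written in advance; statistics-based syntactic analysis enumerates all possible parsing results and picks the one with the highest probability.
In this embodiment, because the entities in a drug text concern drugs, and drugs have their own abbreviations and grammar (for example, ways of stating usage and dosage), the preset syntactic analysis algorithm may further combine rule-based and statistics-based syntactic analysis, which makes full use of both the rigor of hand-written rules and the flexibility of statistical analysis.
For example, one drug text reads: "口服,成人一日一次,一次两片;小儿一日一次,一次一片".
After Chinese numerals are converted into Arabic numerals, the entity linear relationship is: {route of administration: 口服}{population: 成人}{frequency: 1日1次}{dose: 1次2片}{population: 小儿}{frequency: 1日1次}{dose: 1次1片}.
The tree-shaped relationship obtained after syntactic parsing is:
{route of administration: 口服}
{population: 成人}{frequency: 1日1次}{dose: 1次2片}
{population: 小儿}{frequency: 1日1次}{dose: 1次1片}.
Because it is produced by syntactic parsing of the entity linear relationship, the generated drug knowledge graph preserves the relationships among the entities in that linear relationship, which helps ensure the accuracy and interpretability of the results mined from the drug knowledge graph.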
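To make the target data structure concrete, the sketch below turns the example linear relationship into a small tree using a single hand-written rule: each population entity opens a new branch, and anything that appears before the first population is shared by all branches. This rule is an assumption made only for the example; the disclosure describes a combined rule-based and statistical parser, which this sketch does not reproduce.

```python
def to_tree(relation):
    """Group a linear relation into shared entities plus one branch per population."""
    tree = {"shared": [], "branches": []}
    branch = None
    for etype, text in relation:
        if etype == "population":
            branch = {"precondition": text, "usage": []}
            tree["branches"].append(branch)
        elif branch is None:
            tree["shared"].append((etype, text))   # applies to every branch
        else:
            branch["usage"].append((etype, text))
    return tree


relation = [
    ("route", "口服"), ("population", "成人"), ("frequency", "1日1次"),
    ("dose", "1次2片"), ("population", "小儿"), ("frequency", "1日1次"),
    ("dose", "1次1片"),
]
tree = to_tree(relation)
print(tree)
```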
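The method for constructing a drug knowledge graph provided by the embodiments of the present disclosure first identifies entities in a drug text; next, replaces key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text; then restores the character strings in a word segmentation result determined from the replaced text to the key medical entities they replaced; then forms, based on the entities, an entity linear relationship among the entities; and finally generates a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship. Replacing the key medical entities with rule-conforming character strings before the drug text is segmented ensures the accuracy of word segmentation of the medical text. The drug knowledge graph obtained by syntactic parsing on the basis of the entity linear relationship makes it easier to convert natural-language usage and dosage into a data structure that a computer can recognize, supports knowledge graph mining in the medical field, and ensures the accuracy and interpretability of the drug knowledge graph.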
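In another embodiment of the present application, the entities of the drug text include key medical entities and non-key medical entities. With further reference to Fig. 3, a flow 300 of another embodiment of the method for constructing a drug knowledge graph of the present disclosure is shown. The method includes the following steps:
Step 301: identify the key medical entities in the drug text.
In this embodiment, the key medical entities may be identified by dictionary enumeration. Enumeration here means checking, one entity at a time, whether it is a key medical entity. Dictionary enumeration of disease names relies on a maintained dictionary of diseases: the dictionary is traversed for the drug text, and if a word A in the dictionary occurs in the drug text, that occurrence of word A is labeled as a disease name.
Step 302: replace the key medical entities with character strings that conform to a preset rule, to obtain a replaced text.
In this embodiment, because medical terms are domain-specific and of some complexity, they affect subsequent entity processing, so the identified key medical entities need to be replaced.
The rule-conforming character strings are user-defined, and different strings may be defined to distinguish different kinds of key medical entities. For example, for the drug text "治疗百日咳:一日一次,一次一片", "百日咳" (whooping cough) is identified as a key medical entity and the string "indication_0" is defined to replace it, which gives the replaced text "indication_0:一日一次,一次一片".
Step 303: establish a mapping table between the character strings and the key medical entities they replaced.
During replacement, once a key medical entity has been identified, a character string that conforms to the preset rule may be generated for it, and the correspondence between that character string and the key medical entity is recorded in real time in the mapping table.
In this embodiment, establishing the mapping between the rule-conforming character strings and the key medical entities they replaced makes the later restoration of the key medical entities easier.
For example, for the drug text "治疗百日咳:一日一次,一次一片", the key medical entity "百日咳" is identified and the string "indication_0" is defined to replace it, which gives the replaced text "indication_0:一日一次,一次一片". The mapping established is: indication_0 <-> 百日咳.
Step 304: use the mapping table to restore the character strings in the word segmentation result determined from the replaced text to the key medical entities they replaced.
In this embodiment, because the mapping table records the mapping between the rule-conforming character strings and the key medical entities, once a rule-conforming character string has been found, the corresponding key medical entity can be obtained by looking it up in the mapping table.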
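Restoration through the mapping table reduces to a lookup per token, as in the sketch below. The token list is a hand-written stand-in for the output of a word segmentation tool, and the sketch assumes the tool keeps each placeholder as a single token, which is not guaranteed for every tokenizer or placeholder format.

```python
def restore_tokens(tokens, mapping):
    """Swap each placeholder token back to the key entity it replaced."""
    return [mapping.get(token, token) for token in tokens]


mapping = {"indication_0": "百日咳"}
# Assumed tokenizer output; a real segmenter may split the text differently.
tokens = ["治疗", "indication_0", ":", "一日", "一次"]
restored = restore_tokens(tokens, mapping)
print(restored)
```

If a tokenizer does split the placeholder, one option is to register the placeholders as user-dictionary words beforehand, where the tool supports that.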
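Step 305: identify the non-key medical entities in the word segmentation result.
In this embodiment, a text template matching tool such as regular expressions may be used to write template matching rules that identify the entities other than the key medical entities (drug names and disease names), that is, the non-key medical entities. An entity recognition model may also be used to identify the non-key medical entities. Entity recognition models include the CRF (Conditional Random Fields) model and the BERT (Bidirectional Encoder Representations from Transformers) model.
Because these entities are expressed in only a limited number of ways in the usage-and-dosage section of a drug package insert, template matching can identify them precisely.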
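Template matching with regular expressions can be sketched as below. The four patterns and their English labels are illustrative assumptions that cover only the phrasings used in this document's examples; a real template set would be considerably larger.

```python
import re

# Hypothetical templates, one per non-key entity type.
PATTERNS = [
    ("frequency", re.compile(r"\d+日\d+次")),
    ("dose", re.compile(r"\d+次\d+(?:\.\d+|/\d+)?片")),
    ("population", re.compile(r"成人|小儿|儿童")),
    ("route", re.compile(r"口服|外用")),
]


def find_non_key_entities(text):
    """Return sorted (offset, type, text) tuples for every template match."""
    found = []
    for etype, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), etype, match.group()))
    return sorted(found)


entities = find_non_key_entities("口服,成人1日1次,1次2片")
print(entities)
```

The offsets returned here are the same kind of position information that step 306 needs in order to sort entities into the entity linear relationship.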
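Step 306: sort the key medical entities and the non-key medical entities in the order in which the entities appear in the drug text, to obtain the entity linear relationship corresponding to the drug text.
In this embodiment, the drug text is expressed as an abstract sentence composed of several kinds of entities in a linear relationship, that is, the entity linear relationship. The positional relationships among the entities in the entity linear relationship are the same as in the drug text, so the entity linear relationship ensures that no information in the drug text is lost.
Step 307: generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
The operations and features of step 307 correspond to those of step 205 above, and the description of the operations and features of step 205 applies equally to step 307.
In the method for constructing a drug knowledge graph provided by this embodiment, before the drug text is segmented, a mapping table is used to link the character strings to the key medical entities they replaced, which ensures the accuracy of word segmentation. Furthermore, sorting the identified key and non-key medical entities in the order in which they appear in the drug text yields an entity linear relationship that preserves the positional relationships among the entities of the original drug text, which provides the basis for later reorganizing them into a tree-shaped entity relationship.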
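For entity linear relationships that include different kinds of entities, a drug knowledge graph with a tree-shaped entity relationship can be generated. In some optional implementations of this embodiment, the entities include precondition entities and usage-and-dosage entities, and the precondition entities include the key medical entities. With further reference to Fig. 4, a flow 400 of one embodiment of the method for generating a drug knowledge graph is shown. The method includes the following steps:
Step 401: obtain a precondition merge result based on the precondition entities identified in the entity linear relationship.
In this optional implementation, the entities are classified into precondition entities and usage-and-dosage entities. A precondition entity denotes a drug, a disease, or the subject who uses the drug or has the disease, for example a population, a disease name, a drug name, and the conjunctions between these three.
In this optional implementation, after all precondition entities in the entity linear relationship have been identified, the different identified precondition entities may be merged (for example, merged in pairs, or one merged with three) to obtain precondition merge sub-items. Several precondition merge sub-items together form the precondition merge result; the precondition merge result is a set of precondition merge sub-items, and each sub-item is one element of it.
Depending on the number of precondition entities identified, a precondition merge sub-item may of course also be a single precondition entity.
In some optional implementations of this embodiment, obtaining the precondition merge result based on the precondition entities identified in the entity linear relationship includes: identifying the precondition entities in the entity linear relationship, and forming a precondition entity set whose set elements are the individual identified precondition entities and the combinations of those precondition entities; and merging all different set elements in the precondition entity set to obtain the precondition merge result.
In this optional implementation, merging all different set elements in the precondition entity set may be implemented by combining a rule analysis method with a statistical analysis method. The rule analysis method follows preset rules. For example, one preset rule for merging the precondition entity set is that the set may contain only one or more of three kinds of set elements: disease set elements, population set elements, and drug set elements. A disease set element is a single disease name or several disease names joined by conjunctions. A population set element is a single population or several populations joined by conjunctions. A drug set element is a single drug name or several drug names joined by conjunctions.
The statistical analysis method may be exhaustive enumeration, which tries every possibility; in software this is an outer loop nested around an inner loop, both of which end when an exit condition is met. The statistical analysis method merges all mergeable different set elements in the precondition entity set.
In the method for obtaining the precondition merge result provided by this optional implementation, after the precondition entities have been identified, a precondition entity set is formed whose set elements are the individual identified precondition entities and the combinations of those precondition entities, and all different set elements in the set are merged to obtain the precondition merge result. This provides a reliable sentence parsing approach for organizing the precondition entities into a tree-shaped entity relationship and ensures the reliability of the generated drug graph.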
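Step 402: obtain a usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship.
In this embodiment, a usage-and-dosage entity is an entity related to how the drug is used, for example frequency, course of treatment, route of administration, timing of administration, and the conjunctions between these four.
In this optional implementation, after all usage-and-dosage entities in the entity linear relationship have been identified, the different identified usage-and-dosage entities may be merged (for example, merged in pairs, or one merged with three) to obtain usage-and-dosage merge sub-items. Several usage-and-dosage merge sub-items together form the usage-and-dosage merge result; the usage-and-dosage merge result is a set of usage-and-dosage merge sub-items, and each sub-item is one element of it.
Depending on the number of different usage-and-dosage entities identified, a usage-and-dosage merge sub-item may of course also be a single usage-and-dosage entity.
Step 403: combine the precondition merge result with the usage-and-dosage merge result based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, to obtain a root node set whose set elements each consist of at least one element of the precondition merge result and at least one element of the usage-and-dosage merge result.
In this optional implementation, each set element of the root node set consists of at least one precondition merge sub-item from the precondition merge result and at least one usage-and-dosage merge sub-item.
Optionally, each set element of the root node set may be expressed as a combination of zero or one precondition merge sub-item and one or more usage-and-dosage merge sub-items.
For example, an entity linear relationship includes: "成人(PP1)一日一次(MM1),一次两片(MM2);小儿(PP2)一日一次(MM3),一次一片(MM4)" (adults (PP1) once a day (MM1), two tablets each time (MM2); children (PP2) once a day (MM3), one tablet each time (MM4)). Its precondition merge result is: {[PP1][PP2]}.
Its usage-and-dosage merge result is: {[MM1 MM2][MM3 MM4];[MM1,MM2][MM3 MM4];[MM1][MM2 MM3][MM4];[MM1][MM2 MM3][MM4];[MM1 MM2][MM3 MM4];[MM1 MM2][MM3 MM4]}.
Combining the precondition merge result with the usage-and-dosage merge result, based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, gives the root node set: {[PP1][MM1 MM2][PP2][MM3 MM4];[PP1][MM1 MM2][PP2][MM3 MM4];[PP1][MM1][PP2][MM2 MM3][MM4];[PP1][MM1][PP2][MM2 MM3][MM4];[PP1][MM1 MM2][PP2][MM3 MM4];[PP1][MM1 MM2][PP2][MM3 MM4]}, where [PP1][MM1 MM2][PP2][MM3 MM4] is one set element.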
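Step 404: merge all different set elements in the root node set.
In this optional implementation, the different set elements in the root node set are merged (for example, merged in pairs, or one merged with three) to obtain merge results. It should be noted that merging all different set elements in the root node set means merging all mergeable different set elements; for example, different set elements that do not satisfy the preset rules are not merged.
Step 405: take, among the merge results of the root node set, the merge result with the highest merge probability as the parsing result, and add the parsing result to the knowledge graph.
In this optional implementation, merging the different set elements in the root node set yields N combination results. The N combination results are grouped into classes, the number of occurrences of each class is counted, and normalizing these counts gives the proportion of each class of combination result. From the proportions of all classes of combination results for a given root node set, the probability of each class is computed, and the combination result with the highest probability is the final parsing result.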
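The selection by normalized frequency can be sketched as follows, using the six root node set elements listed in the example for step 403 as input. Representing each combination result as a string and treating the normalized count directly as the probability are simplifying assumptions of this sketch.

```python
from collections import Counter


def most_probable(results):
    """Return (result, probability) for the most frequent combination result."""
    counts = Counter(results)
    total = sum(counts.values())
    probabilities = {result: count / total for result, count in counts.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]


A = "[PP1][MM1 MM2][PP2][MM3 MM4]"
B = "[PP1][MM1][PP2][MM2 MM3][MM4]"
best, probability = most_probable([A, A, B, B, A, A])
print(best, round(probability, 3))
```

With these inputs, the grouping that pairs each population with its own frequency and dose occurs four times out of six and is selected.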
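Specifically, in this optional implementation, the different identified precondition entities are first merged to obtain the precondition merge result; this step may yield $M_{PP}$ merge results ($M_{PP} \ge 1$). Next, the different identified usage-and-dosage entities are merged to obtain the usage-and-dosage merge result; this step may yield $N_{MM}$ merge results ($N_{MM} > 1$). After these two steps, the original entity linear relationship contains only two kinds of elements: those of the precondition merge result and those of the usage-and-dosage merge result. These two kinds of elements are combined into set elements of the root node set; this step may yield $P_S$ set elements ($P_S > 1$). Finally, all mergeable set elements are merged, which can yield $M_{PP} \times N_{MM} \times P_S$ merge results in total, and the group of merge results with the highest probability is selected as the parsing result.
In the method for generating a drug knowledge graph provided by this optional implementation, when the entities include precondition entities and usage-and-dosage entities, the precondition entities in the entity linear relationship are identified and merged to obtain the precondition merge result; the usage-and-dosage entities in the entity linear relationship are identified and merged to obtain the usage-and-dosage merge result; the two merge results are combined, based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, to obtain the root node set; all different set elements in the root node set are merged; and the merge result with the highest merge probability among the merge results of the root node set is taken as the parsing result and added to the knowledge graph. Combining the precondition merge result with the usage-and-dosage merge result ensures that the entities in the drug text are covered comprehensively, and taking the merge result with the highest merge probability as the parsing result ensures the accuracy of the generated drug knowledge graph. This helps complete the knowledge graph of the medical field and supports knowledge mining.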
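In some optional implementations of this embodiment, with further reference to Fig. 5, a flow 500 of one embodiment of the method for obtaining a usage-and-dosage merge result is shown. The method for obtaining a usage-and-dosage merge result includes the following steps:
Step 501: identify the usage-and-dosage entities in the entity linear relationship.
In this embodiment, a usage-and-dosage entity is an entity related to how the drug is used, for example dose, frequency, course of treatment, route of administration, timing of administration, and the conjunctions between these five. A usage-and-dosage sub-phrase may include one or more usage-and-dosage entities. For example, the usage-and-dosage entities in an entity linear relationship include: "1日1次" (once a day), "1次1片" (one tablet each time), and "1日1次或1次2片" (once a day, or two tablets each time).
Step 502: form a usage-and-dosage entity set whose set elements are the individual identified usage-and-dosage entities and the combinations of those usage-and-dosage entities.
In this optional implementation, the set elements of the usage-and-dosage entity set are usage-and-dosage entities or combinations of usage-and-dosage entities.
For example, the usage-and-dosage entities in an entity linear relationship include: "1日1次", "1次1片", "1日1次", and "1次2片".
The combinations of usage-and-dosage entities may include: <1日1次>, or <1日1次, 1次1片>, or <1日1次, 1次2片>.
Step 503: merge all different set elements in the usage-and-dosage entity set to obtain the usage-and-dosage merge result.
In this optional implementation, merging all different set elements in the usage-and-dosage entity set may be implemented by combining a rule analysis method with a statistical analysis method. The rule analysis method follows preset rules. For example, one preset rule for merging the usage-and-dosage entity set is that the set may contain only dose set elements, frequency set elements, course-of-treatment set elements, route-of-administration set elements, and timing-of-administration set elements. A dose set element is a single dose or several doses joined by conjunctions. A frequency set element is a single frequency or several frequencies joined by conjunctions. A course-of-treatment set element is a single course of treatment or several courses joined by conjunctions. A route-of-administration set element is a single route or several routes joined by conjunctions. A timing-of-administration set element is a single timing or several timings joined by conjunctions.
Optionally, another preset rule for merging the usage-and-dosage entity set is that set elements of the same type cannot be merged.
For example, a usage-and-dosage entity set MM includes: "一日一次(MM1),一次两片(MM2);一日一次(MM3),一次一片(MM4)", where MM1 to MM4 are all set elements of MM. MM1 and MM3 are frequency set elements and cannot be merged with each other; MM2 and MM4 are dose set elements and cannot be merged with each other.
The statistical analysis method may be exhaustive enumeration, which tries every possibility; in software this is an outer loop nested around an inner loop, both of which end when an exit condition is met.
A specific example of the statistical analysis method is as follows. For all mergeable set elements in the usage-and-dosage entity set MM, each set element is taken in turn as the starting item and the following merge operation is performed: all set elements that can be merged with, and are adjacent to, the starting item are merged; the set elements not yet merged are then merged in turn, until all mergeable set elements in MM have been merged; the procedure then returns to choose a new starting item.
The usage-and-dosage merge result of the above set MM is: {[MM1 MM2][MM3 MM4];[MM1,MM2][MM3 MM4];[MM1][MM2 MM3][MM4];[MM1][MM2 MM3][MM4];[MM1 MM2][MM3 MM4];[MM1 MM2][MM3 MM4]}.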
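The enumeration is described only loosely, so the sketch below is one possible reading of it rather than the method of the disclosure: every element is taken as a starting item, it is paired with its left and then its right neighbour whenever the two have different types, and the elements on either side of the pair are grouped greedily without repeating a type inside a group. All function names and the `(name, type)` representation are assumptions. Under this reading the sketch reproduces the six groupings listed above for the set MM, in the same order.

```python
def greedy_groups(seq):
    """Group left to right, opening a new group whenever a type would repeat."""
    groups = []
    for name, etype in seq:
        if groups and etype not in {t for _, t in groups[-1]}:
            groups[-1].append((name, etype))
        else:
            groups.append([(name, etype)])
    return groups


def render(groups):
    """Format groups as "[MM1 MM2][MM3 MM4]"."""
    return "".join("[" + " ".join(name for name, _ in g) + "]" for g in groups)


def enumerate_merges(seq):
    """Try every starting item with each adjacent element of a different type."""
    results = []
    for s in range(len(seq)):
        for n in (s - 1, s + 1):
            if 0 <= n < len(seq) and seq[s][1] != seq[n][1]:
                lo, hi = min(s, n), max(s, n)
                groups = (greedy_groups(seq[:lo]) + [[seq[lo], seq[hi]]]
                          + greedy_groups(seq[hi + 1:]))
                results.append(render(groups))
    return results


MM = [("MM1", "frequency"), ("MM2", "dose"), ("MM3", "frequency"), ("MM4", "dose")]
merges = enumerate_merges(MM)
for merged in merges:
    print(merged)
```

The repeated entries are intentional: how often each grouping recurs is what the later probability step counts.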
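In the method for obtaining the usage-and-dosage merge result provided by this optional implementation, after the usage-and-dosage entities have been identified, a usage-and-dosage entity set is formed whose set elements are the individual identified usage-and-dosage entities and the combinations of those usage-and-dosage entities, and all different set elements in the set are merged to obtain the usage-and-dosage merge result. This provides a reliable sentence parsing approach for organizing the usage-and-dosage entities into a tree-shaped entity relationship and ensures the reliability of the generated drug graph.
With further reference to Fig. 6, a flow 600 of a further embodiment of the method for constructing a drug knowledge graph of the present disclosure is shown. The method includes the following steps:
Step 601: format the drug text.
The formatting may include at least one of the following: 1) normalizing different punctuation marks that represent the same meaning in the drug text; 2) converting Chinese numerals in the drug text into Arabic numerals.
In this embodiment, normalization unifies punctuation marks that represent the same meaning. When drug texts come from many different sources, punctuation marks that express the same meaning may be written inconsistently; normalization makes such marks uniform. For example, after normalization, "~", "-", and "—" all become "-".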
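Punctuation normalization of this kind can be sketched with a translation table. The set of variants mapped to "-" below extends the three marks named in the text with a few common look-alikes; that extension is an assumption of the example.

```python
# Variants of the range mark, all mapped to the ASCII hyphen.
RANGE_MARKS = str.maketrans({"~": "-", "~": "-", "—": "-", "–": "-"})


def normalize_punctuation(text):
    """Replace every known range-mark variant with "-"."""
    return text.translate(RANGE_MARKS)


normalized = normalize_punctuation("2~5岁,一次1—2片")
print(normalized)
```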
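In this embodiment, punctuation marks include, but are not limited to, full stops, commas, enumeration commas, quotation marks, and brackets; all punctuation marks used to express a range fall within the scope of protection of the present application.
In this embodiment, Chinese has some habitual expressions that make the subsequent generation of the knowledge graph harder, so all Chinese numerals need to be converted into Arabic numerals. For example, the drug text "两日一次,一次五分之四片或一片半。" (once every two days, four fifths of a tablet or one and a half tablets each time) becomes "2日1次,1次4/5片或1.5片。" after conversion.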
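A minimal conversion covering the example above could look like the sketch below. It handles only single-digit numerals, the "X分之Y" fraction form, the "X片半" half-tablet form, and digits that stand directly before 日, 次, or 片; these restrictions are assumptions made to keep the example short and to avoid rewriting ordinary words that happen to contain 一.

```python
import re

DIGITS = {"一": 1, "二": 2, "两": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
D = "[" + "".join(DIGITS) + "]"  # character class of the supported numerals


def to_arabic(text):
    """Convert the numeral patterns described above into Arabic numerals."""
    # "五分之四" (four fifths) -> "4/5": the denominator comes first in Chinese.
    text = re.sub(f"({D})分之({D})",
                  lambda m: f"{DIGITS[m.group(2)]}/{DIGITS[m.group(1)]}", text)
    # "一片半" (one and a half tablets) -> "1.5片".
    text = re.sub(f"({D})片半", lambda m: f"{DIGITS[m.group(1)]}.5片", text)
    # Remaining single numerals directly before a unit character.
    return re.sub(f"({D})(?=[日次片])", lambda m: str(DIGITS[m.group(1)]), text)


converted = to_arabic("两日一次,一次五分之四片或一片半。")
print(converted)
```

Multi-character numerals such as 十二 or 二十 would need a proper numeral parser and are deliberately out of scope here.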
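Step 602: identify entities in the drug text.
Step 603: replace the key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text.
Step 604: restore the character strings in the word segmentation result determined from the replaced text to the key medical entities they replaced.
In some optional implementations of this embodiment, after the replaced text is obtained, a mapping table between the character strings and the key medical entities they replaced may be established, and when the key medical entities are restored, the character strings are restored to the key medical entities based on the mapping table.
Step 605: form, based on the entities, an entity linear relationship among the entities.
Some optional implementations of this embodiment may further include: identifying non-key medical entities in the word segmentation result; in that case, forming, based on the entities, the entity linear relationship among the entities includes: sorting the key medical entities and the non-key medical entities in the order in which the entities appear in the drug text, to obtain the entity linear relationship corresponding to the drug text.
Further, in some optional implementations of this embodiment, the key medical entities include: disease names and drug names; the non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
Step 606: generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
In some optional implementations of this embodiment, the entities include precondition entities and usage-and-dosage entities, and the precondition entities include the key medical entities. Step 606 then includes: obtaining a precondition merge result based on the precondition entities identified in the entity linear relationship; obtaining a usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship; combining the precondition merge result with the usage-and-dosage merge result based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, to obtain a root node set whose set elements each consist of at least one element of the precondition merge result and at least one element of the usage-and-dosage merge result; merging all different set elements in the root node set; and taking, among the merge results of the root node set, the merge result with the highest merge probability as the parsing result, and adding the parsing result to the knowledge graph.
It should be understood that the operations and features of steps 602 to 606 correspond respectively to those of steps 201 to 205 above; the description of the operations and features of steps 201 to 205 therefore applies equally to steps 602 to 606 and is not repeated here.
Step 607: identify attributes of the entities in the drug text.
In this embodiment, each entity in the drug text has corresponding attributes. For example, if the original drug text is "2-5岁的儿童" (children aged 2 to 5), the identified attributes are: {"crowdType": "儿童", "crowdAgeFrom": 2, "crowdAgeTo": 5, "crowdAgeUnit": "岁"}.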
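Attribute extraction for the population example can be sketched with one regular expression. The attribute keys are taken from the example above; the pattern itself, including the list of population words and age units, is an assumption and handles only the "from-to unit 的 population" phrasing.

```python
import re

# Hypothetical template for phrases such as "2-5岁的儿童".
AGE_RANGE = re.compile(r"(\d+)-(\d+)(岁|月)的?(儿童|成人|小儿)")


def population_attributes(text):
    """Return the population attributes found in text, or None if no match."""
    match = AGE_RANGE.search(text)
    if match is None:
        return None
    return {
        "crowdType": match.group(4),
        "crowdAgeFrom": int(match.group(1)),
        "crowdAgeTo": int(match.group(2)),
        "crowdAgeUnit": match.group(3),
    }


attributes = population_attributes("2-5岁的儿童")
print(attributes)
```

The pattern expects the range mark to be "-" already, which is one reason to run punctuation normalization before attribute extraction.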
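Step 608: add the attributes of the entities to the drug knowledge graph.
Specifically, Table 1 shows the entities in a drug knowledge graph, together with the description and the attributes of each entity.
Table 1
Figure PCTCN2021088889-appb-000001
In this embodiment, adding the attributes of each entity to the drug knowledge graph ensures that the entity information in the drug knowledge graph is comprehensive.
In the method for constructing a drug knowledge graph provided by this embodiment, the drug text is formatted before any of its entities are identified, which makes entity identification more efficient. The attributes of the entities in the drug text are identified and added to the drug knowledge graph, which enriches the content of the drug knowledge graph and ensures that its information is comprehensive.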
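With further reference to Fig. 7, as an implementation of the methods shown in the figures above, the present disclosure provides one embodiment of an apparatus for constructing a drug knowledge graph. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may be applied in various electronic devices.
As shown in Fig. 7, an embodiment of the present disclosure provides an apparatus 700 for constructing a drug knowledge graph. The apparatus 700 includes an identification unit 701, a replacement unit 702, a restoration unit 703, a shaping unit 704, and a parsing unit 705. The identification unit 701 may be configured to identify entities in a drug text. The replacement unit 702 may be configured to replace key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text. The restoration unit 703 may be configured to restore the character strings in a word segmentation result determined from the replaced text to the key medical entities they replaced. The shaping unit 704 may be configured to form, based on the entities, an entity linear relationship among the entities. The parsing unit 705 may be configured to generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
In this embodiment, for the specific processing of the identification unit 701, the replacement unit 702, the restoration unit 703, the shaping unit 704, and the parsing unit 705 of the apparatus 700, and for the technical effects they bring, reference may be made to steps 201, 202, 203, 204, and 205, respectively, of the embodiment corresponding to Fig. 2.
In some embodiments, the apparatus 700 further includes a mapping unit (not shown in the figure). The mapping unit may be configured to establish a mapping table between the character strings and the key medical entities they replaced.
In some embodiments, the apparatus 700 further includes a discrimination unit (not shown in the figure). The discrimination unit may be configured to identify non-key medical entities in the word segmentation result; the shaping unit 704 is further configured to sort the key medical entities and the non-key medical entities in the order in which the entities appear in the drug text, to obtain the entity linear relationship corresponding to the drug text.
In some embodiments, the key medical entities include: disease names and drug names; the non-key medical entities include: population, dose, frequency, course of treatment, route of administration, and timing of administration.
In some embodiments, the entities include precondition entities and usage-and-dosage entities, and the precondition entities include the key medical entities; the parsing unit 705 includes a precondition obtaining module, a usage-and-dosage obtaining module, a combination module, a merging module, a parsing module, and an adding module (none shown in the figure). The precondition obtaining module may be configured to obtain a precondition merge result based on the precondition entities identified in the entity linear relationship. The usage-and-dosage obtaining module may be configured to obtain a usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship. The combination module may be configured to combine the precondition merge result with the usage-and-dosage merge result based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, to obtain a root node set whose set elements each consist of at least one element of the precondition merge result and at least one element of the usage-and-dosage merge result. The merging module may be configured to merge all different set elements in the root node set. The parsing module may be configured to take, among the merge results of the root node set, the merge result with the highest merge probability as the parsing result. The adding module may be configured to add the parsing result to the knowledge graph.
In some embodiments, the precondition obtaining module includes a precondition identification sub-module, a precondition combination sub-module, and a precondition merging sub-module (none shown in the figure). The precondition identification sub-module may be configured to identify the precondition entities in the entity linear relationship. The precondition combination sub-module may be configured to form a precondition entity set whose set elements are the individual identified precondition entities and the combinations of those precondition entities. The precondition merging sub-module may be configured to merge all different set elements in the precondition entity set to obtain the precondition merge result.
In some embodiments, the usage-and-dosage obtaining module includes a usage-and-dosage identification sub-module, a usage-and-dosage combination sub-module, and a usage-and-dosage merging sub-module (none shown in the figure). The usage-and-dosage identification sub-module may be configured to identify the usage-and-dosage entities in the entity linear relationship. The usage-and-dosage combination sub-module may be configured to form a usage-and-dosage entity set whose set elements are the individual identified usage-and-dosage entities and the combinations of those usage-and-dosage entities. The usage-and-dosage merging sub-module may be configured to merge all different set elements in the usage-and-dosage entity set to obtain the usage-and-dosage merge result.
In some embodiments, the apparatus further includes a differentiation unit and an adding unit (neither shown in the figure). The differentiation unit may be configured to identify attributes of the entities in the drug text. The adding unit may be configured to add the attributes of the entities to the drug knowledge graph.
In some embodiments, the apparatus further includes a formatting unit (not shown in the figure) and/or a conversion unit (not shown in the figure). The formatting unit may be configured to normalize different punctuation marks that represent the same meaning in the drug text. The conversion unit may be configured to convert Chinese numerals in the drug text into Arabic numerals.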
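Reference is now made to Fig. 8, which shows a schematic structural diagram of an electronic device 800 suitable for implementing embodiments of the present disclosure.
As shown in Fig. 8, the electronic device 800 may include a processing apparatus (such as a central processing unit or a graphics processor) 801, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage apparatus 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following apparatuses may be connected to the I/O interface 805: an input apparatus 806 including, for example, a touch screen, a touch pad, a keyboard, and a mouse; an output apparatus 807 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage apparatus 808 including, for example, a magnetic tape and a hard disk; and a communication apparatus 809. The communication apparatus 809 may allow the electronic device 800 to communicate with other devices, wirelessly or by wire, to exchange data. Although Fig. 8 shows the electronic device 800 with various apparatuses, it should be understood that not all of the illustrated apparatuses need to be implemented or present; more or fewer apparatuses may be implemented or present instead. Each block shown in Fig. 8 may represent one apparatus or, as needed, several apparatuses.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication apparatus 809, or installed from the storage apparatus 808, or installed from the ROM 802. When the computer program is executed by the processing apparatus 801, the functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium of the embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of these. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of these. In the embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus, or device. In the embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of these. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to: electric wire, optical cable, RF (Radio Frequency), and so on, or any suitable combination of these.
The computer-readable medium may be included in the server described above, or it may exist separately without being assembled into that server. The computer-readable medium carries one or more programs which, when executed by the server, cause the server to: identify entities in a drug text; replace key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text; restore the character strings in a word segmentation result determined from the replaced text to the key medical entities they replaced; form, based on the entities, an entity linear relationship among the entities; and generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.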
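Computer program code for performing the operations of the embodiments of the present disclosure may be written in one or more programming languages or a combination of them, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described, for example, as: a processor including an identification unit, a replacement unit, a restoration unit, a shaping unit, and a parsing unit. The names of these units do not, in some cases, limit the units themselves; for example, the identification unit may also be described as a unit "configured to identify the entities in the drug text".
The above description covers only preferred embodiments of the present disclosure and explains the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with technical features of similar function disclosed in (but not limited to) the embodiments of the present disclosure.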

Claims (12)

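  1. A method for constructing a drug knowledge graph, the method comprising:
    identifying entities in a drug text;
    replacing key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text;
    restoring the character strings in a word segmentation result determined from the replaced text to the key medical entities replaced by the character strings;
    forming, based on the entities, an entity linear relationship among the entities;
    generating a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
  2. The method according to claim 1, further comprising:
    establishing a mapping table between the character strings and the key medical entities replaced by the character strings.
  3. The method according to claim 2, further comprising: identifying non-key medical entities in the word segmentation result; and
    the forming, based on the entities, an entity linear relationship among the entities comprises:
    sorting the key medical entities and the non-key medical entities in the order in which the entities appear in the drug text, to obtain the entity linear relationship corresponding to the drug text.
  4. The method according to claim 3, wherein the key medical entities comprise: disease names and drug names; and the non-key medical entities comprise: population, dose, frequency, course of treatment, route of administration, and timing of administration.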
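  5. The method according to claim 1, wherein the entities comprise precondition entities and usage-and-dosage entities, and the precondition entities comprise the key medical entities;
    the generating a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship comprises:
    obtaining a precondition merge result based on the precondition entities identified in the entity linear relationship;
    obtaining a usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship;
    combining the precondition merge result with the usage-and-dosage merge result based on the positional relationship between each precondition entity and each usage-and-dosage entity in the entity linear relationship, to obtain a root node set whose set elements each consist of at least one element of the precondition merge result and at least one element of the usage-and-dosage merge result;
    merging all different set elements in the root node set;
    taking, among the merge results of the root node set, the merge result with the highest merge probability as the parsing result, and adding the parsing result to the knowledge graph.
  6. The method according to claim 5, wherein the obtaining a precondition merge result based on the precondition entities identified in the entity linear relationship comprises:
    identifying the precondition entities in the entity linear relationship, and forming a precondition entity set whose set elements are the individual identified precondition entities and the combinations of those precondition entities;
    merging all different set elements in the precondition entity set to obtain the precondition merge result.
  7. The method according to claim 5, wherein the obtaining a usage-and-dosage merge result based on the usage-and-dosage entities identified in the entity linear relationship comprises:
    identifying the usage-and-dosage entities in the entity linear relationship, and forming a usage-and-dosage entity set whose set elements are the individual identified usage-and-dosage entities and the combinations of those usage-and-dosage entities;
    merging all different set elements in the usage-and-dosage entity set to obtain the usage-and-dosage merge result.
  8. The method according to any one of claims 1 to 7, further comprising:
    identifying attributes of the entities in the drug text;
    adding the attributes of the entities to the drug knowledge graph.
  9. The method according to any one of claims 1 to 7, further comprising:
    performing at least one of the following formatting operations on the drug text:
    normalizing different punctuation marks that represent the same meaning in the drug text;
    converting Chinese numerals in the drug text into Arabic numerals.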
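  10. An apparatus for constructing a drug knowledge graph, the apparatus comprising:
    an identification unit configured to identify entities in a drug text;
    a replacement unit configured to replace key medical entities among the entities with character strings that conform to a preset rule, to obtain a replaced text;
    a restoration unit configured to restore the character strings in a word segmentation result determined from the replaced text to the key medical entities replaced by the character strings;
    a shaping unit configured to form, based on the entities, an entity linear relationship among the entities;
    a parsing unit configured to generate a drug knowledge graph according to a parsing result obtained by performing syntactic parsing on the entity linear relationship.
  11. An electronic device, comprising:
    one or more processors;
    a storage apparatus on which one or more programs are stored;
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 9.
  12. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 9.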
PCT/CN2021/088889 2020-07-30 2021-04-22 Method and apparatus for constructing drug knowledge graph WO2022021958A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US18/016,896 US20230352192A1 (en) 2020-07-30 2021-04-22 Method and apparatus for constructing drug knowledge graph
EP21849263.5A EP4191439A4 (en) 2020-07-30 2021-04-22 METHOD AND DEVICE FOR CONSTRUCTING A DRUG KNOWLEDGE GRAPH

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010750770.3 2020-07-30
CN202010750770.3A CN112307216B (zh) 2020-07-30 2020-07-30 Method and apparatus for constructing drug knowledge graph

Publications (1)

Publication Number Publication Date
WO2022021958A1 true WO2022021958A1 (zh) 2022-02-03

Family

ID=74483181

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088889 WO2022021958A1 (zh) 2020-07-30 2021-04-22 Method and apparatus for constructing drug knowledge graph

Country Status (4)

Country Link
US (1) US20230352192A1 (zh)
EP (1) EP4191439A4 (zh)
CN (1) CN112307216B (zh)
WO (1) WO2022021958A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780083A (zh) * 2022-06-17 2022-07-22 之江实验室 Method and apparatus for visual construction of knowledge graph system
CN115376705A (zh) * 2022-10-24 2022-11-22 北京京东拓先科技有限公司 Method and apparatus for parsing drug package inserts

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307216B (zh) * 2020-07-30 2024-06-18 北京京东拓先科技有限公司 Method and apparatus for constructing drug knowledge graph
CN117057348A (zh) * 2022-05-05 2023-11-14 北京京东拓先科技有限公司 Text processing method and apparatus, and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
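CN108073569A (zh) * 2017-06-21 2018-05-25 北京华宇元典信息服务有限公司 Legal cognition method, apparatus and medium based on multi-level, multi-dimensional semantic understanding
CN110377755A (zh) * 2019-07-03 2019-10-25 江苏省人民医院(南京医科大学第一附属医院) Method for constructing a rational drug use knowledge graph based on drug package inserts
CN110390021A (zh) * 2019-06-13 2019-10-29 平安科技(深圳)有限公司 Drug knowledge graph construction method and apparatus, computer device, and storage medium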
US10706104B1 (en) * 2019-07-25 2020-07-07 Babylon Partners Limited System and method for generating a graphical model
CN112307216A (zh) * 2020-07-30 2021-02-02 北京沃东天骏信息技术有限公司 Method and apparatus for constructing drug knowledge graph

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8433715B1 (en) * 2009-12-16 2013-04-30 Board Of Regents, The University Of Texas System Method and system for text understanding in an ontology driven platform
CN104933024B (zh) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address word segmentation and labeling method
US10496749B2 (en) * 2015-06-12 2019-12-03 Satyanarayana Krishnamurthy Unified semantics-focused language processing and zero base knowledge building system
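CN109192321A (zh) * 2018-09-26 2019-01-11 北京理工大学 Method for constructing drug knowledge graph and computing storage device
CN109284396A (zh) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge graph construction method, apparatus, server, and storage medium
CN109509556A (zh) * 2018-11-09 2019-03-22 天津开心生活科技有限公司 Knowledge graph generation method and apparatus, electronic device, and computer-readable medium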

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
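CN108073569A (zh) * 2017-06-21 2018-05-25 北京华宇元典信息服务有限公司 Legal cognition method, apparatus and medium based on multi-level, multi-dimensional semantic understanding
CN110390021A (zh) * 2019-06-13 2019-10-29 平安科技(深圳)有限公司 Drug knowledge graph construction method and apparatus, computer device, and storage medium
CN110377755A (zh) * 2019-07-03 2019-10-25 江苏省人民医院(南京医科大学第一附属医院) Method for constructing a rational drug use knowledge graph based on drug package inserts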
US10706104B1 (en) * 2019-07-25 2020-07-07 Babylon Partners Limited System and method for generating a graphical model
CN112307216A (zh) * 2020-07-30 2021-02-02 北京沃东天骏信息技术有限公司 Method and apparatus for constructing drug knowledge graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4191439A4

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780083A (zh) * 2022-06-17 2022-07-22 之江实验室 Method and apparatus for visual construction of knowledge graph system
US11907390B2 (en) 2022-06-17 2024-02-20 Zhejiang Lab Method and apparatus for visual construction of knowledge graph system
CN115376705A (zh) * 2022-10-24 2022-11-22 北京京东拓先科技有限公司 Method and apparatus for parsing drug package inserts

Also Published As

Publication number Publication date
EP4191439A1 (en) 2023-06-07
CN112307216A (zh) 2021-02-02
EP4191439A4 (en) 2023-11-15
US20230352192A1 (en) 2023-11-02
CN112307216B (zh) 2024-06-18

Similar Documents

Publication Publication Date Title
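WO2022021958A1 (zh) Method and apparatus for constructing drug knowledge graph
CN109299472B (zh) Text data processing method and apparatus, electronic device, and computer-readable medium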
Doan et al. Natural language processing in biomedicine: a unified system architecture overview
Xu et al. An end-to-end system to identify temporal relation in discharge summaries: 2012 i2b2 challenge
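CN105574103A (zh) Method and system for automatically constructing medical term mapping relationships based on word segmentation coding
CN113094477B (zh) Data structuring method and apparatus, computer device, and storage medium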
Jouffroy et al. Hybrid deep learning for medication-related information extraction from clinical texts in French: MedExt algorithm development study
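CN111986759A (zh) Electronic medical record parsing method and system, computer device, and readable storage medium
CN110609910A (zh) Medical knowledge graph construction method and apparatus, storage medium, and electronic device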
Alfattni et al. Extracting drug names and associated attributes from discharge summaries: text mining study
Vidal et al. Semantic data integration techniques for transforming big biomedical data into actionable knowledge
Yan et al. Chemical name extraction based on automatic training data generation and rich feature set
Doan et al. Using natural language processing to extract health-related causality from Twitter messages
Khnaisser et al. Using an ontology to derive a sharable and interoperable relational data model for heterogeneous healthcare data and various applications
CN112582073A (zh) Medical information acquisition method and apparatus, electronic device, and medium
US10832809B2 (en) Case management model processing
Lonsdale et al. Assessing clinical trial eligibility with logic expression queries
Xie et al. Traditional Chinese medicine prescription mining based on abstract text
Wang et al. Opioid2FHIR: A system for extracting FHIR-compatible opioid prescriptions from clinical text
Tarbell et al. Towards understanding the generalization of medical text-to-sql models and datasets
Li et al. Mapping client messages to a unified data model with mixture feature embedding convolutional neural network
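CN111400759A (zh) Visit schedule generation method and apparatus, storage medium, and electronic device
CN115376705B (zh) Method and apparatus for parsing drug package inserts
CN111046666A (zh) Event recognition method and apparatus, computer-readable storage medium, and electronic device
WO2023213166A1 (zh) Text processing method and apparatus, and computer-readable storage medium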

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21849263

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2021849263

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2021849263

Country of ref document: EP

Effective date: 20230228

NENP Non-entry into the national phase

Ref country code: DE