CN115293161A

CN115293161A - Reasonable medicine taking system and method based on natural language processing and medicine knowledge graph

Info

Publication number: CN115293161A
Application number: CN202210999490.5A
Authority: CN
Inventors: 唐珂轲; 黄毅宁; 陈美莲; 林少泽; 韦宜均; 梁锐; 钟冬赐
Original assignee: Guangzhou Zhongkang Zixun Co ltd
Current assignee: Guangzhou Zhongkang Zixun Co ltd
Priority date: 2022-08-19
Filing date: 2022-08-19
Publication date: 2022-11-04

Abstract

The invention discloses a reasonable medication system and a method based on natural language processing and a medicine knowledge graph.

Description

Reasonable medicine taking system and method based on natural language processing and medicine knowledge graph

Technical Field

The invention relates to the technical field of natural language processing and knowledge graphs, and in particular to a rational medication system and method based on natural language processing and a medicine knowledge graph.

Background

In the era of big data, data is a cornerstone of innovative business models and leading-edge technological development. With the spread of the big data concept in recent years, many enterprises are paying increasing attention to internal data governance: the walls between the data of formerly separate sub-modules are being broken down, and through further integration an enterprise's data resources can be utilized to the greatest extent.

In the field of medicines, unstructured medicine specification (package insert) data can be processed rapidly and effectively with artificial intelligence technology to form a medicine knowledge graph. On the basis of the knowledge graph, applications in several directions can be derived, such as the rational medication function of an internet hospital, a medicine knowledge service platform for pharmacists in DTP pharmacies, and intelligent pharmaceutical services for patients.

In the internet hospital scenario, doctors can issue prescriptions online. On the one hand, risks such as repeated medication, drug interactions and mismatches between prescribed drugs and their indications need to be avoided; on the other hand, the compliance of each prescription needs to be checked to prevent indiscriminate prescribing. A rational medication warning function plays an important role here, reminding doctors and pharmacists of risk information when a prescription is issued.

In the DTP pharmacy scenario, pharmacists need to improve the level of service they provide to patients, but because they lack clinical knowledge and experience they need more time to keep up with drug knowledge. The large business volume of DTP pharmacies puts pharmacists under heavy work pressure, and many have to study drug knowledge in their private time; with limited background knowledge, keeping up is a considerable challenge. In this scenario a medicine knowledge service platform is very important: with an up-to-date medicine knowledge graph and a convenient search interface, pharmacists can look up drug knowledge easily, which saves them a great deal of time.

In the patient scenario, a lack of background knowledge leads to a poor understanding of information such as how to take a drug and its contraindications. The knowledge a user can obtain from a search engine is limited, and consulting a doctor online is slow, so an intelligent pharmaceutical service based on human-machine dialogue can help users learn about a drug quickly. Deploying the pharmaceutical service to terminals such as WeChat mini programs is also an important way to broaden user traffic entry points.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a rational medication system and method based on natural language processing and a medicine knowledge graph.

In order to achieve the purpose, the invention adopts the following technical scheme:

a rational medication system based on natural language processing and a medicine knowledge graph comprises a prescription automatic auditing module, a medication recommending module, a prompt information backtracking module, a medicine information query module and a data updating and maintaining module;

the prescription automatic auditing module: used, after a doctor issues a prescription in a diagnosis and treatment scenario, for automatically auditing the rationality of the prescription against the medicine knowledge graph and returning an audit result, based on the input basic information of the patient, the diagnosis and treatment information, and the information on the specific medicines in the prescription. The rationality audit covers: whether the medicines in the prescription match the symptoms and disease diagnoses in the diagnosis and treatment information; whether the medicines in the prescription interact with medicines the patient is already taking; whether the patient's population type belongs to a contraindicated population for any medicine in the prescription; whether the patient has a history of allergy to any medicine in the prescription; whether the prescription contains repeated medication; and whether the usage and dosage of each medicine in the prescription are consistent with those recorded for that medicine in the medicine knowledge graph. The basic information of the patient comprises age, sex, population type, allergy history and recent medication; the diagnosis and treatment information comprises symptoms and disease diagnoses; the information on the specific medicines in the prescription comprises the generic name, manufacturer, specification, approval number, and usage and dosage of each medicine;

a medication recommending module: used for generating a recommended medication list from the medicine knowledge graph according to the input basic information and the disease and symptom information of the patient, in a diagnosis and treatment scenario or a medication consultation scenario;

the information backtracking module: used for providing a query interface for the sources of the prescription rationality audit results and the recommended medication list, the sources comprising the original text of the medicine specification and an interaction database; for the original text of the specification, the prescription rationality audit results and the recommended medication list are derived from the indications, interactions, contraindications, and usage and dosage sections of the medicine specification; the interaction database explicitly records the interactions between components, and the medication recommending module or the prescription automatic auditing module looks up interactions in this database using the components of each medicine recorded in the medicine knowledge graph, so that when an interaction result is traced back, its source is the interaction detail record retrieved from the database;

a medicine information query module: used for providing a query page and a programmatic interface for querying medicine information, wherein the medicine information comprises the original text of the specification, the medicine knowledge graph, adverse reactions, indications, contraindications, and usage and dosage;

a data updating and maintaining module: used for processing the text data of uploaded medicine specifications with a natural language processing model and updating the medicine knowledge graph according to the processed text data.
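The audit checks listed for the prescription automatic auditing module can be sketched as a single pass over the prescription. This is a minimal illustration, not the patented implementation: the `DrugNode` fields, the `INTERACTIONS` table, the dose limit and all sample values are hypothetical stand-ins for whatever the medicine knowledge graph and interaction database actually hold.

```python
from dataclasses import dataclass, field


@dataclass
class DrugNode:
    """Hypothetical slice of the knowledge graph for one medicine."""
    name: str
    ingredients: set
    indications: set
    contraindicated_groups: set = field(default_factory=set)
    max_daily_dose_mg: float = float("inf")


# Hypothetical interaction database keyed by unordered ingredient pairs.
INTERACTIONS = {frozenset({"warfarin", "aspirin"}): "increased bleeding risk"}


def audit(prescription, patient, graph):
    """Return (drug, issue) warnings for a list of (drug_name, daily_dose_mg)."""
    warnings = []
    seen = {}  # ingredient -> first prescribed drug containing it
    for name, dose in prescription:
        drug = graph[name]
        if not drug.indications & set(patient["diagnoses"]):
            warnings.append((name, "indication mismatch"))
        if patient["group"] in drug.contraindicated_groups:
            warnings.append((name, "contraindicated population"))
        if drug.ingredients & set(patient["allergies"]):
            warnings.append((name, "allergy history"))
        if dose > drug.max_daily_dose_mg:
            warnings.append((name, "dose exceeds recorded limit"))
        for ing in drug.ingredients:
            if ing in seen:
                warnings.append((name, f"repeated medication with {seen[ing]}"))
            seen[ing] = name
        # Interactions are looked up per ingredient pair, as in the text.
        for other in patient["current_drugs"]:
            for a in drug.ingredients:
                for b in graph[other].ingredients:
                    note = INTERACTIONS.get(frozenset({a, b}))
                    if note:
                        warnings.append((name, f"interacts with {other}: {note}"))
    return warnings
```

Returning every warning rather than stopping at the first keeps each result traceable to the check that produced it, which is what the information backtracking module needs.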

Furthermore, the medication recommending module first screens out applicable medicines from the medicine knowledge graph according to the input disease and symptom information of the patient, and then filters the screened medicines according to the patient's population type, allergy history and recent medication, removing medicines that are unsuitable for the patient's population type, that may cause an allergic reaction in the patient, or that duplicate or interact with medicines the patient is already taking; the remaining medicines are sorted according to a sorting rule set by the user to generate the recommended medication list.
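The screen, filter and sort sequence described above might look like the following sketch. The dictionary keys, the `interacts` callback and the `sort_key` parameter are assumptions made for illustration; the source does not specify how the knowledge graph is queried.

```python
def recommend(diagnoses, patient, catalog, interacts, sort_key):
    """Screen by indication, filter by patient profile, then sort.

    catalog: list of dicts with name, ingredients, indications and
             contraindicated_groups (hypothetical knowledge-graph view).
    interacts(a, b): True if ingredients a and b interact (hypothetical).
    sort_key: user-defined ordering rule applied to the surviving drugs.
    """
    current = set(patient["current_ingredients"])
    kept = []
    for drug in catalog:
        ingredients = set(drug["ingredients"])
        if not set(drug["indications"]) & set(diagnoses):
            continue  # not applicable to the disease or symptoms
        if patient["group"] in drug["contraindicated_groups"]:
            continue  # unsuitable for the patient's population type
        if ingredients & set(patient["allergies"]):
            continue  # may cause an allergic reaction
        if ingredients & current:
            continue  # repeated medication
        if any(interacts(a, b) for a in ingredients for b in current):
            continue  # interacts with a drug already being taken
        kept.append(drug)
    return [d["name"] for d in sorted(kept, key=sort_key)]
```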

Further, queries for the original text of the specification, the medicine knowledge graph, and the usage and dosage return specific information according to the generic name, manufacturer, specification and approval number of the medicine; queries for adverse reactions, indications and contraindications support both forward and reverse queries, wherein a forward query returns information for a medicine identified by its generic name, manufacturer, specification and approval number, and a reverse query returns the list of medicines associated with a specific adverse reaction, indication or contraindication.
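Forward and reverse queries can both be served from one pair of maps that are kept in sync on insert. The sketch below assumes a medicine is identified by a tuple of (generic name, manufacturer, specification, approval number); the class name `DrugIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class DrugIndex:
    """Two-way index over adverse reactions, indications and contraindications."""

    def __init__(self):
        self._forward = {}                # drug key -> {field: set of values}
        self._reverse = defaultdict(set)  # (field, value) -> set of drug keys

    def add(self, key, field, values):
        self._forward.setdefault(key, {}).setdefault(field, set()).update(values)
        for value in values:
            self._reverse[(field, value)].add(key)

    def forward(self, key, field):
        """Forward query: medicine -> values of one field."""
        return sorted(self._forward.get(key, {}).get(field, set()))

    def reverse(self, field, value):
        """Reverse query: one adverse reaction, indication or contraindication -> medicines."""
        return sorted(self._reverse.get((field, value), set()))
```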

Furthermore, the data updating and maintaining module supports uploading of medicine specifications in various data formats, including PDF, IMG and text; for the text data, the data updating and maintaining module directly utilizes the natural language processing model to process the uploaded specification text data and updates the medicine knowledge graph according to the uploaded specification text data; for PDF and IMG type data, the data updating and maintaining module extracts content and converts the content into text by using an image processing technology, and then the extracted text data is processed by using a natural language processing model.

Furthermore, the data updating and maintaining module provides a data quality control function: a user can compare the automatic labeling and graph construction results for a text with the original text of the specification to ensure data quality.

Furthermore, the data updating and maintaining module provides a drug catalog docking function: a user can upload drug data comprising the generic name, manufacturer, specification and approval number fields of each drug, and the data updating and maintaining module matches the uploaded drug data to the existing drug catalog through an automatic alignment algorithm of the natural language processing model; for medicines that do not exist in the original drug catalog, the data updating and maintaining module automatically structures them from the uploaded medicine specifications to form a medicine knowledge graph; the integration of the drug data is then complete and the data is put into use.

The invention also provides a method for constructing the system, which comprises the following specific processes:

1. designing a knowledge graph of the medicine:

designing a medicine knowledge graph structure containing the corresponding fields based on the business requirements of the rational medication system, the fields comprising components, indications, adverse reactions, contraindications, interactions, and usage and dosage;

2. designing a label system:

defining a label system according to the design of a medicine knowledge graph;

3. data annotation:

after the label system is designed, sample text data from medicine specifications is annotated with an annotation tool; the annotation work is divided into tasks by the fields defined in the medicine knowledge graph structure; each field involves three annotation tasks: named entity annotation, entity relation annotation and term alignment annotation; named entity annotation uses the annotation tool to mark the extent of specific words and select their entity type according to the entity labels, and covers both continuous and discontinuous entities; entity relation annotation specifies the type of the directed relation between annotated named entities; term alignment annotation determines, from the named entity and its entity type label, which standardized term the entity is aligned to;
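One way to hold the three kinds of annotation in a single record is sketched below. The class and field names are hypothetical, and the sample sentence is invented for illustration; a discontinuous entity is simply an entity with more than one character span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str               # entity type from the label system
    spans: tuple             # ((start, end), ...) character offsets, end exclusive
    standard_term: str = ""  # filled in by the term alignment task

    def text(self, source):
        return "".join(source[s:e] for s, e in self.spans)

    @property
    def discontinuous(self):
        return len(self.spans) > 1


@dataclass(frozen=True)
class Relation:
    head: str   # id of the head entity
    tail: str   # id of the tail entity
    label: str  # directed relation type


source = "Avoid use in hepatic or renal impairment."
e1 = Entity("e1", "Disease", ((13, 20), (29, 40)), "hepatic impairment")
e2 = Entity("e2", "Disease", ((24, 40),), "renal impairment")
```

Here `e1` covers "hepatic" and " impairment" as two spans, so the shared head word is annotated once for both conditions.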

4. constructing a natural language processing model:

the natural language processing model automatically analyzes text data obtained from a medicine specification uploaded by a user, extracts entity and entity relation information in the text data, and further aligns to a standard term to construct a medicine knowledge graph; the construction process of the natural language processing model comprises the following steps:

4.1, construct information extraction Module

The model of the information extraction module covers the following tasks: drug text classification, chapter structure analysis, BERT encoding, and joint entity and relation extraction;

the medicine text classification task is used for classifying the text data of the input medicine specification according to the Chinese patent medicines and the chemical medicines;

the BERT encoding task is used for encoding with a BERT model pre-trained on massive unsupervised corpora or adaptively pre-trained on the medical domain;

the chapter structure analysis task is used for segmenting the input medicine specification text into blocks by medical semantic role; a medicine specification text covers multiple field roles such as indications, contraindications, and usage and dosage, and the chapter structure analysis task divides the text into segments by field and identifies the medical semantic role to which each segment belongs;

the medical entity recognition task is used for extracting the medicine related entities mentioned in the input medicine specification text;

the medical relation extraction task is used for identifying and judging specific relations existing between entity pairs in the input medicine specification text;

the model that integrates these tasks comprises a BERT encoder module, a head entity marking module, a relation-specific tail entity marking module, a chapter extraction module, a specification type classification module and a loss function calculation module; all tasks share one BERT encoding layer, and the model is optimized by backpropagation under multi-task joint learning; the data processing of the model is as follows:

suppose there is a text sequence $X = (x_1, x_2, x_3, \ldots, x_n)$, where $x_t$ denotes the character or word at the t-th position and $n$ denotes the length of the text sequence;

S4.1.1, the BERT encoder module prepends the start token [CLS] to the text sequence X and then performs BERT encoding:

$$H = \mathrm{BERT}([\mathrm{CLS}] + X) \tag{1}$$

where $H$ denotes the hidden states of the text sequence X after encoding by the function BERT(·), $H \in \mathbb{R}^{n \times d}$, $n$ denotes the number of characters or words in X, and $d$ denotes the dimension of the encoded vectors;

S4.1.2, the head entity marking module calculates the boundary probabilities of the head entity [formulas not reproduced in this text]. Two probabilities are calculated for the i-th character or word: the probability that it is the start of a head entity, and the probability that it is the end of a head entity. $W_s \in \mathbb{R}^{d \times 1}$, $W_e \in \mathbb{R}^{d \times 1}$, $b_s \in \mathbb{R}$ and $b_e \in \mathbb{R}$ are the parameters to be learned, $\mathbb{R}$ denotes the set of real numbers, and σ(·) denotes the activation function;

the likelihood function of the head entity can then be calculated [formula not reproduced in this text], where s denotes the head entity, I(·) denotes the indicator function, and the tag marking whether the i-th character or word is a start position or an end position takes a value in {0, 1};
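The formulas of step S4.1.2 do not survive in this text. One plausible reading, assuming the common cascade binary tagging formulation in which each position receives independent start and end probabilities, is sketched below. The symbols $p_i^{\mathrm{start\_s}}$, $p_i^{\mathrm{end\_s}}$ and $y_i^{t}$ are names chosen here for illustration and are not taken from the source; $W_s$, $W_e$, $b_s$, $b_e$, $h_i$, $\sigma$ and $I$ are as defined in the surrounding text.

```latex
\begin{align}
p_i^{\mathrm{start\_s}} &= \sigma\!\left(W_s^{\top} h_i + b_s\right), \\
p_i^{\mathrm{end\_s}}   &= \sigma\!\left(W_e^{\top} h_i + b_e\right), \\
p_\theta(s \mid X) &= \prod_{t \in \{\mathrm{start\_s},\, \mathrm{end\_s}\}} \; \prod_{i=1}^{n}
  \left(p_i^{t}\right)^{I\{y_i^{t} = 1\}} \left(1 - p_i^{t}\right)^{I\{y_i^{t} = 0\}}
\end{align}
```

Under this reading, a position is tagged as a head entity boundary when its probability exceeds a chosen threshold, and each start is paired with the nearest following end.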

S4.1.3, the relation-specific tail entity marking module calculates the tail entity boundary probabilities [formulas not reproduced in this text]. Two probabilities are calculated for the i-th character or word: the probability that it is the start of a tail entity, and the probability that it is the end of a tail entity. $v^k \in \mathbb{R}^d$ denotes the k-th head entity vector, $d$ denotes the encoding dimension, and $h_i \in H$ denotes the hidden vector of the character or word; if an entity consists of several characters or words, their vectors are averaged. The weight and bias terms of this module are the parameters to be learned by the model, $m$ denotes the number of relation types, $\mathbb{R}$ denotes the set of real numbers, and σ(·) denotes the activation function;

the likelihood function of the relations and tail entities can then be calculated [formula not reproduced in this text], where o denotes the tail entity and I(·) denotes the indicator function;
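The tail entity formulas are likewise missing from this text. Assuming the same cascade tagging scheme, with one start tagger and one end tagger per relation type $r \in \{1, \ldots, m\}$ and the head entity vector $v^k$ added to each hidden vector, a sketch consistent with the definitions above is given below. The parameter names $U_s^{r}$, $U_e^{r}$, $c_s^{r}$, $c_e^{r}$ and the probability symbols are placeholders introduced here, since the source symbols were lost.

```latex
\begin{align}
p_i^{\mathrm{start\_o}} &= \sigma\!\left({U_s^{r}}^{\top} (h_i + v^{k}) + c_s^{r}\right), \\
p_i^{\mathrm{end\_o}}   &= \sigma\!\left({U_e^{r}}^{\top} (h_i + v^{k}) + c_e^{r}\right), \\
p_\theta(o \mid s, r, X) &= \prod_{t \in \{\mathrm{start\_o},\, \mathrm{end\_o}\}} \; \prod_{i=1}^{n}
  \left(p_i^{t}\right)^{I\{y_i^{t} = 1\}} \left(1 - p_i^{t}\right)^{I\{y_i^{t} = 0\}}
\end{align}
```

Here $U_s^{r}, U_e^{r} \in \mathbb{R}^{d \times 1}$ and $c_s^{r}, c_e^{r} \in \mathbb{R}$; a relation for which no tail span is tagged simply contributes no triple for that head entity.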

S4.1.3, assuming the output tag sequence is $y = (y_1, y_2, y_3, \ldots, y_n)$, the chapter extraction module calculates the total score of the chapter analysis sequence [formula not reproduced in this text], where $A \in \mathbb{R}^{n \times n}$ is the transition matrix, and $W_{crf} \in \mathbb{R}^{d \times n}$ and $b_{crf} \in \mathbb{R}^{n}$ are the parameters to be learned;

the probability that the text sequence corresponds to the target sequence is then calculated [formula not reproduced in this text], where $Y_x$ denotes the set of possible target sequences;

in the prediction phase, the optimal sequence is solved with the Viterbi algorithm [formula not reproduced in this text];
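Viterbi decoding over per-position label scores and a label transition matrix can be sketched in a few lines of plain Python. This is a generic illustration of the algorithm named above: `emissions` stands for the per-position label scores produced by the encoder and projection layer, `transitions` for the transition matrix, and the sample numbers are invented.

```python
def viterbi(emissions, transitions):
    """Return (best_path, best_score) for a linear-chain model.

    emissions[i][k]: score of label k at position i.
    transitions[j][k]: score of moving from label j to label k.
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])
    back = []  # back[i-1][k] = best previous label when position i has label k
    for i in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(num_labels):
            best_j = max(range(num_labels), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[i][k])
            pointers.append(best_j)
        score = new_score
        back.append(pointers)
    last = max(range(num_labels), key=lambda k: score[k])
    best_score = score[last]
    path = [last]
    for pointers in reversed(back):  # follow back-pointers to the first position
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path, best_score


path, best = viterbi([[3, 0], [0, 1], [0, 3]], [[1, -1], [-1, 1]])
```

The dynamic program keeps only the best score per label at each position, so decoding costs time proportional to the sequence length times the square of the number of labels.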

S4.1.4, the specification type classification module calculates the text category probability:

$$p_c = W_c \times h_{cls} + b_c \tag{11}$$

where $W_c \in \mathbb{R}^{d}$ and $b_c \in \mathbb{R}$ are the parameters to be learned, and $h_{cls}$ denotes the hidden vector of [CLS];

S4.1.5, the loss function calculation module calculates the loss function as follows:

$$L = -(l_{s+o} + l_{crf} + l_c) \tag{12}$$

where the component terms $l_{s+o}$, $l_{crf}$ and $l_c$ are defined by formulas not reproduced in this text; in those formulas $M$ denotes the total number of samples, $n$ denotes the text length, and $\lambda$ and $\gamma$ are regularization parameters;

4.2 construction of term standardization Module

The term standardization module is based on a two-stage term standardization model: candidate terms are first recalled from the standard terms, and the candidates are then re-ranked by a fine-ranking model; the construction process of the term standardization module is as follows:

S4.2.1, collecting Chinese and English term corpora and open term libraries, and constructing a medical term library after curation, wherein the fields of the medical term library comprise the unified code CUI, English standard terms, English synonyms, Chinese standard terms and Chinese synonyms;

S4.2.2, building indexes for the Chinese and English terms with an indexing tool or a custom index; when a Chinese term is queried, it is translated into English, and the English and Chinese forms are used together as input;

S4.2.3, building a keyword search engine that combines several recall channels into a single recall score $s_{recall}$; suppose $s_1$ and $s_2$ are two terms:

$$s_{recall} = \alpha_1 \times s_{bm25} + \alpha_2 \times s_{Jaccard} + \alpha_3 \times s_{MED} + \alpha_4 \times s_{DICE} \tag{16}$$

where $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ denote the weights of the respective channel scores, and $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$;

The BM25 score $s_{bm25}$ is calculated by a formula not reproduced in this text, where $\omega_i$ denotes the i-th segmented word of the query term $s_1$; $f_i$ is the frequency with which $\omega_i$ occurs in the term $s_2$; $k_1$ and $b$ are adjustment factors; len(·) is a function that calculates sentence length; avgsl is the average length of all documents in the index; $N$ denotes the number of all documents in the index, and $N(\omega_i)$ is the number of documents containing $\omega_i$;

The Jaccard coefficient $s_{Jaccard}$ is calculated by a formula not reproduced in this text, where A and B denote the sets for $s_1$ and $s_2$ respectively, and |·| denotes the number of elements in a set;

The edit distance (MED) similarity score $s_{MED}$ is calculated by a formula not reproduced in this text, where len(·) is a function that calculates sentence length and $d(s_1, s_2)$ denotes the edit distance between $s_1$ and $s_2$;

The DICE similarity score $s_{DICE}$ is calculated by a formula not reproduced in this text, where A and B denote the word segmentation sets of $s_1$ and $s_2$ respectively, and |·| denotes the number of elements in a set;
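Because the four channel formulas are missing from this text, the sketch below uses the textbook definitions of each measure and combines them as in formula (16). Several choices are assumptions rather than the patent's own: the BM25 variant with a non-negative IDF, the normalization of edit distance by the longer string, whitespace tokenization, and the default weights.

```python
import math


def bm25(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Okapi BM25 of one document against a query; corpus is a list of token lists."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for w in query_tokens:
        f = doc_tokens.count(w)
        n_w = sum(1 for d in corpus if w in d)
        idf = math.log((len(corpus) - n_w + 0.5) / (n_w + 0.5) + 1.0)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avg_len))
    return score


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a, b):
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def edit_distance(s1, s2):
    """Levenshtein distance with a rolling one-row table."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]


def med_similarity(s1, s2):
    longest = max(len(s1), len(s2))
    return 1.0 - edit_distance(s1, s2) / longest if longest else 1.0


def recall_score(s1, s2, corpus, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of the four channel scores; the weights must sum to 1."""
    t1, t2 = s1.split(), s2.split()
    a1, a2, a3, a4 = weights
    return (a1 * bm25(t1, t2, corpus) + a2 * jaccard(t1, t2)
            + a3 * med_similarity(s1, s2) + a4 * dice(t1, t2))
```

Note that BM25 is unbounded while the other three scores lie in [0, 1], so in practice the BM25 channel may need rescaling before the weighted sum is meaningful.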

S4.2.4, training the fine-ranking model:

S4.2.4.1, constructing positive samples: the Chinese and English forms of all terms under the same concept CUI code in the medical term library (term_db) are divided into two groups, the first being the Chinese term set and the second the English term set; within each group, the terms are combined pairwise into term training pairs, forming the positive sample set $set^{+}$;

S4.2.4.2, constructing negative samples: for each element Q of each group under a concept CUI code, similarity to the preferred terms in the medical term library term_db is calculated with formula (16); the Top 100 preferred terms are paired with Q to form a set of term pairs, and the elements of the positive sample set $set^{+}$ are then removed to obtain the negative sample set $set^{-}$;

S4.2.4.3, constructing training samples: $set^{+}$ and $set^{-}$ are each randomly shuffled; negative samples are drawn from $set^{-}$ at a positive-to-negative ratio of 1:10 and merged with $set^{+}$ to obtain the training sample set, in which each sample is labeled with a value in {0, 1}, 0 denoting a negative sample and 1 a positive sample;

S4.2.4.4, scoring with a BERT model: each term pair in the sample set forms the input sequence "[cls] s1 [seq] s2" with label y ∈ {0, 1}, constituting one example of a sequence classification task; the loss function is the cross-entropy loss;

S4.2.4.5, dividing the samples into a training set and a test set and training; the model is evaluated by F1 score, and the model with the highest F1 score on the test set is saved, yielding the Chinese and English fine-ranking models $std\_model_{zh}$ and $std\_model_{en}$;
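Steps S4.2.4.1 to S4.2.4.3 amount to pairing terms within a concept, mining hard negatives through the recall stage, and sampling at a fixed ratio. A minimal sketch for one language group follows; `retrieve_top` is a hypothetical callback standing in for the Top 100 retrieval by formula (16), and the function name and seed parameter are assumptions.

```python
import random
from itertools import combinations


def build_training_set(concepts, retrieve_top, ratio=10, seed=0):
    """Build labeled term pairs for one language group.

    concepts: {cui: [terms sharing that concept]}.
    retrieve_top(q): candidate preferred terms recalled for q (hypothetical).
    Returns a shuffled list of (term_a, term_b, label) with label in {0, 1}.
    """
    positives = set()
    for terms in concepts.values():
        positives.update(frozenset(p) for p in combinations(sorted(set(terms)), 2))
    negatives = set()
    for terms in concepts.values():
        for q in terms:
            for cand in retrieve_top(q):
                pair = frozenset((q, cand))
                # Skip self-pairs and anything already in the positive set.
                if len(pair) == 2 and pair not in positives:
                    negatives.add(pair)
    rng = random.Random(seed)
    pos = sorted(tuple(sorted(p)) for p in positives)
    neg = sorted(tuple(sorted(p)) for p in negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    neg = neg[: ratio * len(pos)]  # at most `ratio` negatives per positive
    samples = [(a, b, 1) for a, b in pos] + [(a, b, 0) for a, b in neg]
    rng.shuffle(samples)
    return samples
```

Drawing negatives from the recall stage rather than at random gives the fine-ranking model the near-miss pairs it will actually have to separate at prediction time.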

S4.2.5, two-stage term standardization prediction:

S4.2.5.1, the Chinese term Q to be standardized is input and translated into $Q^{EN}$;

S4.2.5.2, based on formula (16), Q is retrieved against the Chinese preferred terms and Chinese synonyms in the medical term library term_db and the Top 30 are taken; term pairs are constructed and input to the second-stage model $std\_model_{zh}$ to obtain scores, and the Top 5 are taken as the candidate set $C_{zh\text{-}top5}$; in the same way, $Q^{EN}$ is retrieved against the English preferred terms and English synonyms in term_db and the Top 30 are taken; term pairs are constructed and input to the second-stage model $std\_model_{en}$ to obtain scores, and the Top 5 are taken as the candidate set $C_{en\text{-}top5}$;

S4.2.5.3, integrating the final result: Chinese and English scoring weights $\lambda_{zh}$ and $\lambda_{en}$ are set [formula not reproduced in this text] and applied to the scores of $C_{zh\text{-}top5}$ and $C_{en\text{-}top5}$ respectively; the scores of candidates with the same concept CUI are averaged to form the final standardized set C, and the Top 1 of C is taken as the optimal standardization result;
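The integration step can be read as weighting each language's candidate scores, averaging per CUI, and taking the best. The sketch below follows that reading; the exact weighting formula is missing from this text, so multiplying each score by its language weight is an assumption, as are the function name and default weights.

```python
from collections import defaultdict


def integrate(c_zh, c_en, w_zh=0.5, w_en=0.5):
    """Merge two Top 5 candidate lists of (cui, score) into one ranked list.

    Scores are weighted by language, then averaged over candidates that
    share a CUI. Returns [(cui, score), ...] sorted best first.
    """
    by_cui = defaultdict(list)
    for cui, score in c_zh:
        by_cui[cui].append(w_zh * score)
    for cui, score in c_en:
        by_cui[cui].append(w_en * score)
    ranked = sorted(((sum(v) / len(v), cui) for cui, v in by_cui.items()), reverse=True)
    return [(cui, score) for score, cui in ranked]


merged = integrate([("C1", 0.9), ("C2", 0.6)], [("C1", 0.8), ("C3", 0.7)])
best_cui = merged[0][0]  # Top 1 of the standardized set
```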

5. constructing the medicine knowledge graph

After being parsed by the natural language processing model, a medicine specification yields a basic graph, which mainly comprises the defined entity labels, entities, relation labels and relations; converting the basic graph into the final target graph requires a further data processing step: the graph schema is described with RDFS/OWL, and reasoning is performed over the basic graph with the SWRL language to form the target graph;

6. constructing the continuous data update module

The deep learning natural language processing application services are packaged with a microservice architecture, and an MLOps pipeline is built on Docker, Kubernetes and GitLab; this pipeline deploys services automatically and processes data continuously; for a new medical institution or pharmacy, only the medicine specification data needs to be uploaded, and the continuous data update module automatically processes the data with the natural language processing model, constructs the medicine knowledge graph and deploys the medicine applications.

Further, the medicine knowledge graph structure includes blank nodes so that it can be extended to different scenarios.

Further, in the medicine knowledge graph, the component field records information including the main components and excipients of the medicine; the indication field records the diseases and symptoms for which the medicine is applicable; the adverse reaction field records information including the name, type and frequency grade of each adverse reaction of the medicine; the contraindication field records the allergy contraindications of the medicine, its contraindications for particular symptoms and diseases, and its contraindications for specific populations; the interaction field records specific interaction information between medicine components and between medicine classes; the usage and dosage field records the purpose of administration, route of administration, time of administration, population, frequency of administration, number of administrations, dose type, single-dose value and dosage unit of the medicine.

Further, in the above method, the label system design is an iterative process: when the performance of the natural language processing model does not meet requirements, the label system needs to be revised accordingly.

The invention has the beneficial effects that: the invention constructs a reasonable medication system based on natural language processing and a medicine knowledge graph, and the system can automatically carry out data management on medicine specification data and form the medicine knowledge graph to realize the function of the reasonable medication system based on the knowledge graph.

Drawings

Fig. 1 is a flow chart of the construction of a natural language processing model in embodiment 2 of the present invention.

Detailed Description

The present invention will be further described below, and it should be noted that the present embodiment is based on the technical solution, and a detailed implementation manner and a specific operation process are provided, but the protection scope of the present invention is not limited to the present embodiment.

Example 1

This embodiment provides a rational medication system based on natural language processing and a medicine knowledge graph, which comprises a prescription automatic auditing module, a medication recommending module, a prompt information backtracking module, a medicine information query module and a data updating and maintaining module;

the prescription automatic auditing module: used, after a doctor issues a prescription in a diagnosis and treatment scenario, for automatically auditing the rationality of the prescription against the medicine knowledge graph and returning an audit result, based on the input basic information of the patient, the diagnosis and treatment information, and the information on the specific medicines in the prescription. The rationality audit covers: whether the medicines in the prescription match the symptoms and disease diagnoses in the diagnosis and treatment information; whether the medicines in the prescription interact with medicines the patient is already taking; whether the patient's population type belongs to a contraindicated population for any medicine in the prescription; whether the patient has a history of allergy to any medicine in the prescription; whether the prescription contains repeated medication; and whether the usage and dosage of each medicine in the prescription are consistent with those recorded for that medicine in the medicine knowledge graph. The basic information of the patient includes age, sex, population type (pregnant women, children, the elderly, etc.), allergy history and recent medication; the diagnosis and treatment information comprises symptoms and disease diagnoses; the information on the specific medicines in the prescription comprises the generic name, manufacturer, specification, approval number, and usage and dosage of each medicine.

A medication recommending module: used for generating a recommended medication list according to the input basic information and the disease and symptom information of the patient, in a diagnosis and treatment scenario or a medication consultation scenario. Specifically, the medication recommending module first screens out applicable medicines from the medicine knowledge graph according to the input disease and symptom information of the patient, and then filters the screened medicines according to the patient's population type, allergy history and recent medication, removing medicines that are unsuitable for the patient's population type, that may cause an allergic reaction in the patient, or that duplicate or interact with medicines the patient is already taking; the remaining medicines are sorted according to a sorting rule set by the user to generate the recommended medication list. Filtering by population type and allergy history avoids recommending a medicine to persons whom the contraindications field of its specification explicitly identifies as unsuitable. Filtering by recent medication avoids recommending a medicine that interacts with one the patient is already taking, and avoids repeated medication.

The information backtracking module: used for providing the sources of the prescription rationality audit results and the recommended medication list, the sources comprising the original text of the medicine specification and an interaction database. In both the diagnosis and treatment scenario and the medication recommendation scenario, the information sources behind an audit result or a recommendation need to be checkable, in order to ensure that medication is rational and safe. Information can be traced back in two directions: to the original text of the medicine specification, and to the interaction database. For the original text of the specification, the prescription rationality audit results and the recommended medication list are derived from fields such as the indications, interactions, contraindications, and usage and dosage of the medicine specification. The interaction database explicitly records the interactions between components, and the medication recommending module or the prescription automatic auditing module looks up interactions in this database using the components of each medicine recorded in the medicine knowledge graph, so that when an interaction result is traced back, its source is the interaction detail record retrieved from the database.

A medicine information query module: used for providing a query page and a programmatic interface for querying medicine information, wherein the medicine information comprises the original text of the specification, the medicine knowledge graph, adverse reactions, indications, contraindications, and usage and dosage. It should be noted that both the diagnosis and treatment scenario and the pharmacy scenario require medicine information queries, so that medical staff and pharmacy staff can look up relevant medicine information for a patient's specific situation. Queries for the original text of the specification, the knowledge graph, and the usage and dosage return specific information according to the generic name, manufacturer, specification and approval number of the medicine, while queries for adverse reactions, indications and contraindications support both forward and reverse queries; a forward query returns information for a medicine identified by its generic name, manufacturer, specification and approval number, and a reverse query returns the list of medicines associated with a specific adverse reaction, indication or contraindication.

The data update maintenance module: used for uploading medicine specifications. It provides an upload user interface and a programmatic interface, and supports uploading medicine specifications in several data formats, including PDF, IMG (image) and text. For text data, the module processes the uploaded specification text with the natural language processing model and updates the medicine knowledge graph accordingly. It also provides a data quality control function: the user compares the automatic annotation and graph construction results for a text against the specification original text to ensure data quality. For PDF and IMG data, the module first extracts the content and converts it into text using image processing technology, and then processes the extracted text with the natural language processing model.

In addition, to make the system easier to deploy at different institutions, the data update maintenance module provides a drug catalog docking function. Through this function a user can upload drug data comprising the generic name, manufacturer, specification and approval number fields, and the module matches the uploaded drug data to the existing drug catalog using the automatic alignment algorithm of the natural language processing model. For medicines that do not exist in the original drug catalog, the module automatically structures them from the uploaded medicine specifications to form a medicine knowledge graph. The docking of the drug data is then complete and the data is put into use.

Example 2

The present embodiment provides a method for constructing the system described in embodiment 1, which includes the following specific processes:

The system described in embodiment 1 is a rational medication system implemented on the basis of natural language processing and knowledge graph technology. The method for constructing it is explained below in terms of medicine knowledge graph design, label system design, data annotation, natural language processing model construction, medicine knowledge graph construction, and construction of the continuous data update process.

1. Designing the medicine knowledge graph

The system of embodiment 1 realizes its various rational medication prompts on the basis of knowledge graph technology, so the knowledge graph design must fit the application scenarios. Based on the business requirements of a rational medication system, a medicine knowledge graph structure containing the corresponding fields is designed; the fields are ingredients, indications, adverse reactions, contraindications, interactions, and usage and dosage. Since the way a medicine is used differs under different conditions (different patients, different times and so on), the structure includes blank nodes so that different scenarios can be represented. The ingredient field records information including the main ingredients and excipients of the medicine; the indication field records the diseases and symptoms for which the medicine is applicable; the adverse reaction field records information including the name, type and occurrence frequency grade of each adverse reaction; the contraindication field records allergy, symptom and disease contraindications as well as contraindications for specific populations; the interaction field records specific interaction information between medicine ingredients and between medicine categories; the usage and dosage field records the purpose of administration, administration route, administration time, population, administration frequency, number of administrations, dose type, single-dose value and dose unit.
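As an illustration of the graph structure described above, the following is a minimal sketch using plain Python dataclasses. The class and attribute names are assumptions made for this example, not the patent's schema. Each `Dosage` instance plays the role of a blank node: it groups the conditions (population, route, timing) under which one dosing rule applies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dosage:
    """One usage-and-dosage scenario (stands in for a blank node)."""
    purpose: Optional[str] = None
    route: Optional[str] = None
    timing: Optional[str] = None
    population: Optional[str] = None
    frequency: Optional[str] = None
    times: Optional[int] = None
    dose_type: Optional[str] = None
    dose_value: Optional[float] = None
    dose_unit: Optional[str] = None


@dataclass
class AdverseReaction:
    name: str
    reaction_type: Optional[str] = None
    frequency_grade: Optional[str] = None


@dataclass
class DrugNode:
    """Top-level medicine node with the six fields named in the text."""
    generic_name: str
    ingredients: List[str] = field(default_factory=list)
    excipients: List[str] = field(default_factory=list)
    indications: List[str] = field(default_factory=list)
    adverse_reactions: List[AdverseReaction] = field(default_factory=list)
    contraindications: List[str] = field(default_factory=list)
    interactions: List[str] = field(default_factory=list)
    dosages: List[Dosage] = field(default_factory=list)


# Hypothetical placeholder values: the same medicine with two dosing scenarios.
example = DrugNode(
    generic_name="Drug A",
    ingredients=["ingredient X"],
    dosages=[
        Dosage(population="adult", route="oral", dose_value=1.0, dose_unit="tablet"),
        Dosage(population="child", route="oral", dose_value=0.5, dose_unit="tablet"),
    ],
)
```

Keeping each scenario as its own object means that a new condition (for example a different administration time) extends the list without changing the node's other fields.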

2. Designing a label system:

The label system is defined according to the design of the medicine knowledge graph, so that converting annotation results into the medicine knowledge graph is easier. The difficulty of natural language processing must also be considered when designing labels. Label system design is an iterative process: when the performance of the natural language processing model does not meet requirements, the label system is modified accordingly. A standard label system is finally formed through this iteration.

3. Data annotation:

After the label system is designed, sample text data from medicine specifications is annotated with an annotation tool. The annotation work has two stages, annotation and review. An administrator creates tasks and assigns them to specific annotators. When an annotator finishes, the annotated data is submitted to the administrator for review; data that fails review is returned to the annotator for checking and correction, and data that passes review can be used in the subsequent training of the natural language processing model. Annotation tasks are divided according to the fields of the medicine knowledge graph structure. Each field involves three annotation tasks: named entity annotation, entity relationship annotation and term alignment annotation. Named entity annotation uses the annotation tool to mark the span of specific words and select their entity type according to the entity labels, and covers both continuous and discontinuous entities. Entity relationship annotation specifies the type of directed relationship between annotated named entities. Term alignment annotation determines which standardized term an entity is aligned to, based on the named entity and its entity type label. To align entities to standard terms more fully, this embodiment adopts three alignment modes: equivalent, broader (hypernym) and narrower (hyponym).

In addition, during annotation an active learning approach is used to dynamically evaluate the annotators' results and to surface the data on which the model is most confused, so that annotators can focus on and reinforce those cases. Data annotation and model training alternate continuously to obtain better model performance, and further annotation tasks are added for data on which the model performs poorly.
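The selection of "most confused" data can be sketched as uncertainty sampling. The source does not say how model confusion is measured; the sketch below assumes it is the entropy of the model's predicted class distribution, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one predicted distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(predictions: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k samples whose predictions have the highest entropy."""
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:k]
```

In an alternating annotate-and-train loop, the selected samples would be sent to annotators, added to the training set, and the model retrained before the next selection round.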

4. Natural language processing model construction

The embodiment automatically analyzes text data obtained from a medicine specification uploaded by a user by using a natural language processing model, extracts entities and entity relationship information in the text data, and further aligns the text data to standard terms to construct a medicine knowledge graph. Therefore, the text data processing flow of the medicine specification is mainly divided into two modules, namely an information extraction module and a medical term alignment module. As shown in fig. 1, the process of constructing the natural language processing model includes:

4.1. Constructing the information extraction module

This embodiment builds on the model of paper [1] and extends it into a model that covers tasks such as drug text classification, chapter structure analysis, and joint entity and relationship extraction; this model serves as the information extraction module.

The BERT encoding task encodes the text with a BERT model pretrained on massive unsupervised corpora or adaptively pretrained on the medical domain. The BERT (Bidirectional Encoder Representations from Transformers) model was proposed by Google AI in 2018 [2]. Compared with the earliest count-based language models and the later word vector technique Word2vec [3], it offers richer semantic representation and embedded knowledge.

The chapter structure analysis task divides the input medicine specification text into medical semantic blocks. A medicine specification text covers several field roles, such as indications, contraindications, and usage and dosage; the task splits the text into segments by field and identifies the medical semantic role to which each segment belongs.

The medical entity recognition task extracts the medically relevant entities, such as diseases, symptoms and medicines, mentioned in the input medicine specification text.

The medical relationship extraction task identifies the specific relationship that holds between a pair of entities in the input medicine specification text. For example, in "gastric hyperacidity leads to stomach pain", the entities "gastric hyperacidity" (manifestation) and "stomach pain" (manifestation) stand in a "leads to" relationship.

The model that integrates these tasks comprises a BERT encoder module, a head entity tagging module, a relation-specific tail entity tagging module, a chapter extraction module, a specification type classification module and a loss function calculation module. All tasks share one BERT encoding layer, and the model is optimized by back-propagation under multi-task joint learning. The data processing process of the model is as follows:

Suppose the text sequence is $X = (x_1, x_2, x_3, \ldots, x_n)$, where $x_t$ denotes the character or word at the t-th position and n denotes the length of the text sequence.

S4.1.1, the BERT encoder module adds the start symbol [CLS] at the beginning of the text sequence X and then performs BERT encoding:

$$H = \mathrm{BERT}([\mathrm{CLS}] + X) \tag{1}$$

wherein H denotes the hidden state of the text sequence X after encoding by the function BERT(·), $H \in \mathbb{R}^{n \times d}$, n denotes the number of characters or words in the text sequence X, and d denotes the vector dimension after encoding;

wherein the tagging probability denotes the probability that the i-th character or word marks the end of the head entity; $W_s \in \mathbb{R}^{d \times 1}$, $W_e \in \mathbb{R}^{d \times 1}$, $b_s \in \mathbb{R}$ and $b_e \in \mathbb{R}$ denote the parameters to be learned, $\mathbb{R}$ denotes the set of real numbers, and $\sigma(\cdot)$ denotes the activation function.

The likelihood function of the head entity can thus be calculated:

where s denotes the head entity, $I(\cdot)$ denotes the indicator function, and the tag that marks whether the i-th character or word is a start position or an end position takes a value in {0, 1}.
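The head entity tagging and likelihood formulas themselves are not reproduced in this text. One plausible form, following the cascade binary tagging framework of reference [1] and using the parameters $W_s$, $W_e$, $b_s$, $b_e$, the hidden vectors $h_i$ and the indicator function defined above, is given below. The superscripts $start\_s$ and $end\_s$ and the tag symbol $y_i^{t}$ are notational assumptions borrowed from [1], not symbols confirmed by the source.

```latex
p_i^{start\_s} = \sigma\left(W_s^{\top} h_i + b_s\right), \qquad
p_i^{end\_s} = \sigma\left(W_e^{\top} h_i + b_e\right)

p_{\theta}(s \mid X) = \prod_{t \in \{start\_s,\, end\_s\}} \prod_{i=1}^{n}
  \left(p_i^{t}\right)^{I\{y_i^{t} = 1\}}
  \left(1 - p_i^{t}\right)^{I\{y_i^{t} = 0\}}
```

Under this reading, each position receives two independent binary decisions (start and end), and the likelihood is the product of the probabilities assigned to the observed 0/1 tags.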

wherein the tagging probability denotes the probability that the i-th character or word marks the end of the tail entity; $v^k \in \mathbb{R}^{d}$ denotes the k-th head entity vector, d denotes the encoding dimension, and $h_i \in H$ denotes the hidden vector of the character or word; if an entity consists of several characters or words, their vectors are averaged. The remaining symbols denote the parameters to be learned by the model, m denotes the number of relation types, $\mathbb{R}$ denotes the set of real numbers, and $\sigma(\cdot)$ denotes the activation function.

Thus, the likelihood function of the relation and the entity can be calculated:

where o denotes the tail entity and $I(\cdot)$ denotes the indicator function.

S4.1.3, assuming the output tag sequence is $y = (y_1, y_2, y_3, \ldots, y_n)$, the overall score of the chapter analysis sequence calculated by the chapter extraction module is as follows:

wherein $A \in \mathbb{R}^{n \times n}$ is the transition matrix, and $W_{crf} \in \mathbb{R}^{d \times n}$ and $b_{crf} \in \mathbb{R}^{n}$ denote the parameters to be learned.

The probability of the target sequence given the text sequence is calculated as follows:

wherein $Y_x$ denotes the set of possible target sequences.

S4.1.4, the specification type classification module calculates the text category probability:

$$p_c = W_c \times h_{cls} + b_c \tag{11}$$

wherein $W_c \in \mathbb{R}^{d}$ and $b_c \in \mathbb{R}$ denote the parameters to be learned, and $h_{cls}$ denotes the hidden vector of [CLS].

S4.1.5, the loss function calculation module calculates the loss function as follows:

$$L = -(l_{s+o} + l_{crf} + l_c) \tag{12}$$

where M denotes the total number of samples, n denotes the text length, and λ and γ are regularization parameters.

4.2. Constructing the term standardization module

With reference to [4], the term standardization module is mainly based on a two-stage term standardization model: candidate terms are first recalled from the standard terms, and the candidates are then scored by a fine-ranking model. This embodiment builds on that framework and incorporates a Chinese medical term library to enlarge the pool of recalled terms. The construction process of the term standardization module is as follows:

S4.2.1, collecting Chinese and English term corpora and open term libraries and, after collation, constructing a medical term library term_db whose fields include the unified code CUI, English standard words, English synonyms, Chinese standard words, Chinese synonyms and the like;

S4.2.2, building indexes for Chinese and English with an index tool (such as ES) or with a custom index; when a Chinese term is queried, it is translated into English, and the English and Chinese terms are used together as input;

S4.2.3, building a keyword search engine that combines several recall channels into an integrated recall score $s_{recall}$. Suppose $s_1$ and $s_2$ are two terms:

The BM25 score $s_{bm25}$ is calculated as follows:

wherein $\omega_i$ denotes the i-th word segment of the query term $s_1$; $f_i$ is the frequency with which $\omega_i$ occurs in the term $s_2$; $k_1$ and b are adjustment factors; len(·) is a function that computes sentence length; avgsl is the average length of all documents in the index; N denotes the number of all documents in the index; and $N(\omega_i)$ is the number of documents containing $\omega_i$.
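The BM25 formula itself is not reproduced in this text. The sketch below assumes the common Okapi BM25 form, which uses exactly the quantities defined above ($f_i$, $k_1$, b, len, avgsl, N, $N(\omega_i)$). The IDF variant with the added 1 inside the logarithm and the default values of `k1` and `b` are assumptions, and the function name is hypothetical.

```python
import math
from typing import List


def bm25_score(query_tokens: List[str], doc_tokens: List[str],
               corpus: List[List[str]], k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25 score of doc_tokens (term s2) for query_tokens (term s1)."""
    n_docs = len(corpus)                                   # N
    avgsl = sum(len(doc) for doc in corpus) / n_docs       # average document length
    score = 0.0
    for w in query_tokens:
        f = doc_tokens.count(w)                            # f_i: frequency of w in s2
        n_w = sum(1 for doc in corpus if w in doc)         # N(w): documents containing w
        idf = math.log((n_docs - n_w + 0.5) / (n_w + 0.5) + 1.0)
        norm = f + k1 * (1.0 - b + b * len(doc_tokens) / avgsl)
        score += idf * f * (k1 + 1.0) / norm
    return score
```

Index tools such as ES expose `k1` and `b` as tunable similarity settings, so in practice these values would be chosen on the term library rather than hard-coded.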

The Jaccard coefficient $s_{Jaccard}$ is calculated as follows:

wherein A and B denote the word segmentation sets of $s_1$ and $s_2$ respectively, and |·| denotes the number of elements in a set.
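The Jaccard formula is not reproduced in this text. Assuming the standard definition, the size of the intersection of A and B divided by the size of their union, a minimal sketch is as follows. Returning 0.0 for two empty sets is a convention chosen here, not something the source specifies.

```python
from typing import Iterable


def jaccard(a_tokens: Iterable[str], b_tokens: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B| over the word segmentation sets of two terms."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```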

The edit distance (MED) similarity score $s_{MED}$ is calculated as follows:

where len(·) is a function that computes sentence length and $d(s_1, s_2)$ denotes the edit distance between $s_1$ and $s_2$.
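The $s_{MED}$ formula is not reproduced in this text. The sketch below computes the Levenshtein edit distance $d(s_1, s_2)$ by dynamic programming and then, as an assumption, normalizes it to a similarity as 1 minus the distance divided by the longer of the two lengths. The source only indicates that len(·) and $d(s_1, s_2)$ are involved, so the exact normalization may differ.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution or match
        prev = cur
    return prev[-1]


def med_similarity(s1: str, s2: str) -> float:
    """Assumed normalization: 1 - d(s1, s2) / max(len(s1), len(s2))."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / longest
```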

The DICE distance similarity score $s_{DICE}$ is calculated as follows:
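The formula for $s_{DICE}$ is not reproduced in this text. Assuming the standard Dice coefficient, twice the size of the intersection divided by the sum of the two set sizes, and assuming it is computed over the same word segmentation sets used for the Jaccard coefficient, a minimal sketch is as follows. The function name and the empty-set convention are choices made for this example.

```python
from typing import Iterable


def dice(a_tokens: Iterable[str], b_tokens: Iterable[str]) -> float:
    """2 * |A ∩ B| / (|A| + |B|) over the word segmentation sets of two terms."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))
```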

S4.2.4, training the fine-ranking model:

S4.2.4.1, constructing positive samples. The Chinese and English words under the same concept CUI code in the term library term_db are divided into two groups, the first being the Chinese term set and the second the English term set; the terms within each group are combined pairwise into term training pairs, forming the positive sample set $set^{+}$;

S4.2.4.2, constructing negative samples. For each element Q of each group under a concept CUI code, similarity to the preferred words in the medical term library term_db is calculated by formula (16); the Top100 preferred words are paired with Q to form a set of term pairs, and the elements of the positive sample set $set^{+}$ are removed to obtain the negative sample set $set^{-}$;

S4.2.4.3, constructing training samples. $set^{+}$ and $set^{-}$ are each randomly shuffled, negative samples are taken from $set^{-}$ at a ratio of 1:10 and merged into $set^{+}$ to obtain the training sample set; besides the term pair, each sample carries a {0, 1} label, where 0 denotes a negative sample and 1 denotes a positive sample;
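Steps S4.2.4.1 to S4.2.4.3 can be sketched as follows. The sketch assumes that the 1:10 ratio means up to ten negative pairs per positive pair; the function name, the fixed random seed and the tuple layout `(term_1, term_2, label)` are choices made for this example.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def build_training_set(positives: Sequence[Pair], negatives: Sequence[Pair],
                       neg_per_pos: int = 10, seed: int = 0) -> List[Tuple[str, str, int]]:
    """Shuffle both sets, sample negatives at 1:neg_per_pos, and attach 0/1 labels."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_neg = min(len(neg), neg_per_pos * len(pos))
    samples = [(a, b, 1) for a, b in pos] + [(a, b, 0) for a, b in neg[:n_neg]]
    rng.shuffle(samples)
    return samples
```

Because the negatives come from the Top100 recall results for each term, they are hard negatives (lexically similar but belonging to a different concept), which is what the second-stage model needs to learn to separate.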

S4.2.4.4, scoring with a BERT model. Each term pair in the training sample set is formed into the input sequence "[CLS] s1 [SEP] s2" with a label y ∈ {0, 1}, making one example of a sequence classification task; the loss function is the cross-entropy loss;

S4.2.4.5, dividing the samples into a training set and a test set and training; the model is evaluated by F1 value, and the model with the highest F1 value on the test set is saved, forming the Chinese and English fine-ranking models std_model_zh and std_model_en.

S4.2.5, two-stage term standard prediction:

S4.2.5.1, input the Chinese term Q to be standardized and translate it into English as $Q^{EN}$;

S4.2.5.2, based on formula (16), Q is used to retrieve and recall candidates from the Chinese preferred words and Chinese synonyms in the medical term library term_db; the Top30 are taken, term pairs are constructed and input to the second-stage model std_model_zh to obtain scores, and the Top5 form the candidate set $C_{zh\text{-}top5}$. In the same way, $Q^{EN}$ is used to retrieve and recall candidates from the English preferred words and English synonyms in term_db; the Top30 are taken, term pairs are constructed and input to the second-stage model std_model_en to obtain scores, and the Top5 form the candidate set $C_{en\text{-}top5}$;

S4.2.5.3, integrating the final result. The Chinese and English scoring weights $\lambda_{zh}$ and $\lambda_{en}$ are set,

scores are calculated for $C_{zh\text{-}top5}$ and $C_{en\text{-}top5}$ respectively, the scores of candidates with the same concept CUI are then averaged to form the final standardized set C, and Top1 of C is taken as the best standardization result.
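The weighting formula is not reproduced in this text. The sketch below assumes that each candidate score is multiplied by the weight of its language ($\lambda_{zh}$ or $\lambda_{en}$), that weighted scores sharing a CUI are averaged, and that the CUI with the highest average is returned first. The function name, the default weights and the `(cui, score)` tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (CUI, second-stage model score)


def integrate(c_zh_top5: Sequence[Candidate], c_en_top5: Sequence[Candidate],
              lambda_zh: float = 0.5, lambda_en: float = 0.5) -> List[Candidate]:
    """Weight scores by language, average per CUI, and rank in descending order."""
    per_cui: Dict[str, List[float]] = defaultdict(list)
    for cui, score in c_zh_top5:
        per_cui[cui].append(lambda_zh * score)
    for cui, score in c_en_top5:
        per_cui[cui].append(lambda_en * score)
    merged = [(cui, sum(vals) / len(vals)) for cui, vals in per_cui.items()]
    return sorted(merged, key=lambda item: item[1], reverse=True)
```

The first element of the returned list corresponds to the Top1 standardization result; the remaining elements make up the rest of the standardized set C.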

5. Medicine knowledge graph construction

After a medicine specification is parsed by the natural language processing model, a basic graph is obtained, which mainly consists of the defined entity labels, entities, relationship labels and relationships. This basic graph is not yet the final target graph; converting it into the target graph requires a further data processing step. This embodiment uses RDFS/OWL/SWRL techniques for the conversion. Because several reasoning languages are available for knowledge graphs built on RDFS/OWL, RDFS/OWL is used to describe the graph schema and the SWRL language is used to reason over the basic graph, finally forming the target graph.

6. Constructing the continuous data update module

In this embodiment, a microservice architecture is used to encapsulate the deep learning natural language processing application services, and an MLOps process is built on Docker + Kubernetes + GitLab. This technology stack is used to deploy services automatically and to process data continuously. For a new medical institution or pharmacy, only the medicine specification data needs to be uploaded; the continuous data update module then automatically processes the data with the natural language processing model, constructs the medicine knowledge graph and deploys the medicine application.

References:

[1] Wei, Zhepei, et al. "A novel cascade binary tagging framework for relational triple extraction." arXiv preprint arXiv:1909.03227 (2019).

[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[3] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

[4] sunyueh et al. "BERT-based clinical terminology standardization." Journal of Chinese Information Processing 35.4 (2021): 8.

Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims

1. A rational medication system based on natural language processing and a medicine knowledge graph is characterized by comprising an automatic prescription auditing module, a medication recommending module, a prompt information backtracking module, a medicine information inquiring module and a data updating and maintaining module;

the automatic prescription auditing module: the system is used for automatically auditing the rationality of a prescription and giving an auditing result according to a medicine knowledge graph after a doctor makes the prescription in a diagnosis and treatment scene and according to input basic information, diagnosis and treatment information of a patient and information of specific medicines in the prescription, wherein the rationality auditing content comprises whether the medicines in the prescription are suitable with symptoms and disease diagnosis results in the diagnosis and treatment information or not, whether interaction exists between the medicines in the prescription and medicines being taken by the patient or not, whether the crowd type of the patient belongs to taboo crowds of the medicines in the prescription, whether the patient has an allergy history of taking the medicines in the prescription, whether repeated medicine taking exists in the prescription or not, and whether the usage amount of the medicines in the prescription is consistent with the usage amount of the corresponding medicines in the medicine knowledge graph or not; the basic information of the patient comprises age, sex, population type, allergic history and recent medication condition; the diagnosis and treatment information comprises symptoms and disease diagnosis results; the information of the specific medicines in the prescription comprises the universal name, manufacturer, specification, standard size and usage amount of each medicine;

the information backtracking module: a query interface for providing sources of prescription rationality review results and recommended medication lists, the sources including drug description text and interaction databases; for the original text of the specification, the results of checking the reasonability of the prescription and recommending a medication list come from indications, interaction, contraindications and usage and dosage of the medicine specification; the interaction database definitely indicates the interaction among the components, and the recommended medication module or the automatic prescription auditing module searches the interaction database for the interaction through the components of the medicines in the medicine knowledge map so as to give the interaction, so that the interaction result is the interaction detail information inquired in the database when tracing the source of the interaction result;

the drug information query module: the system is used for providing an inquiry interface and an interface for inquiring medicine information, wherein the medicine information comprises a specification original text, a medicine knowledge map, adverse reactions, indications, contraindications and a dosage;

2. The rational medication system according to claim 1, wherein the medication recommending module screens out applicable drugs by using a drug knowledge graph according to inputted disease and symptom information of the patient, and then filters out the screened drugs according to the crowd type, allergy history and recent medication condition of the patient, so as to filter out drugs which are not suitable for the crowd type corresponding to the patient to take, can cause allergy of the patient and have repeated or interactive effects with the drugs being taken by the patient; and sorting the rest of the filtered medicines according to a sorting rule set by a user to generate a recommended medication list.

3. The rational medication system of claim 1, wherein when the original text of the specification, the knowledge map of the drug, and the usage amount are queried, specific information is returned according to the universal name, manufacturer, specification, and quasi-character number of the drug; the query of adverse reactions, indications and taboos supports forward query and reverse query, wherein the forward query is to return information by the common name, manufacturer, specification and standard letter number of the medicine, and the reverse query is to return a specific medicine list by a specific adverse reaction, indication or taboo.

4. The rational drug administration system of claim 1, wherein the data update maintenance module supports uploading of drug specifications in a plurality of data formats, including PDF, IMG, text; for the text data, the data updating and maintaining module directly utilizes the natural language processing model to process the uploaded specification text data and updates the medicine knowledge graph according to the uploaded specification text data; for PDF and IMG type data, the data updating maintenance module extracts contents and converts the contents into texts by using an image processing technology, and then processes the extracted text data by using a natural language processing model.

5. The rational medicine system of claim 1 or 4, wherein the data update maintenance module provides data quality control function, and the user can compare the automatic labeling and mapping result with the original text of the specification according to the text to ensure the data quality.

6. The rational medication system according to claim 1, wherein the data update maintenance module provides a drug catalog docking function through which a user can upload drug data including a drug universal name, a manufacturer, a specification, and a quasi-character number field, and the data update maintenance module matches the drug data uploaded by the user to an existing drug catalog through an automatic alignment algorithm of a natural language processing model; for the medicines which do not exist in the original medicine catalog, the data updating and maintaining module automatically structures the medicines through the uploaded medicine specification to form a medicine knowledge graph; and finally, completing the butt joint work of the medicine data and putting the medicine data into use.

7. A method for constructing the system of any one of claims 1 to 6, characterized in that the specific process is as follows:

1. designing a knowledge graph of the medicine:

2. designing a label system:

defining a label system according to the design of a medicine knowledge graph;

3. data labeling:

after the label system is designed, marking sample text data of the medicine specification by using a marking tool; the data annotation work is performed according to field division tasks designed by the medicine knowledge graph structure; each different field comprises three labeling tasks of named entity labeling, entity relationship labeling and labeling term alignment; the named entity labeling is to perform range division and entity type selection on specific words according to entity labels through a labeling tool, and comprises continuous entity labeling and discontinuous entity labeling; the entity relation marking is to specify the type of the directed relation between the marked named entities; the entity alignment marking is to determine which standardized term the entity is aligned to according to the named entity and the entity type label;

4. constructing a natural language processing model:

the natural language processing model automatically analyzes text data obtained from a medicine specification uploaded by a user, extracts entity and entity relation information in the text data, and further aligns the text data to standard terms to construct a medicine knowledge graph; the construction process of the natural language processing model comprises the following steps:

4.1, construct the information extraction Module

the BERT coding task is used for coding by adopting massive unsupervised corpora or self-adaptive pretrained BERT in the medical field;

the medical relation extraction task is used for identifying and judging a specific relation existing between entity pairs in the input medicine specification text;

the structure of the model integrating the tasks comprises a BERT encoder module, a head entity marking module, a tail entity marking module of the relationship, a chapter extraction module, a specification type classification module and a loss function calculation module; the whole process is realized through a shared BERT coding layer and a multi-task joint learning reverse optimization model; the data processing process of the model is as follows:

H＝BERT([CLS]+X)...............(1)；

wherein the tagging probability denotes the probability that the i-th character or word marks the end of the head entity; $W_s \in \mathbb{R}^{d \times 1}$, $W_e \in \mathbb{R}^{d \times 1}$, $b_s \in \mathbb{R}$ and $b_e \in \mathbb{R}$ denote the parameters to be learned, $\mathbb{R}$ denotes the set of real numbers, and $\sigma(\cdot)$ denotes the activation function;

the likelihood function of the head entity can thus be calculated:

where s represents the head entity, I (-) represents the indicator function,

the tag that marks whether the i-th character or word is a start position or an end position takes a value in {0, 1};

wherein the tagging probability denotes the probability that the i-th character or word marks the end of the tail entity; $v^k \in \mathbb{R}^{d}$ denotes the k-th head entity vector, d denotes the encoding dimension, and $h_i \in H$ denotes the hidden vector of the character or word; if an entity consists of several characters or words, their vectors are averaged;

thus, the likelihood function of the relation and the entity can be calculated:

where o represents the tail entity, I (-) represents the indicator function,

the tag that marks whether the i-th character or word is a start position or an end position takes a value in {0, 1};

S4.1.3, assuming the output tag sequence is $y = (y_1, y_2, y_3, \ldots, y_n)$, the overall score of the chapter analysis sequence calculated by the chapter extraction module is as follows:

wherein, Y _x Representing a set of possible target sequences;

$$p_c = W_c \times h_{cls} + b_c \tag{11}$$

wherein $W_c \in \mathbb{R}^{d}$ and $b_c \in \mathbb{R}$ denote the parameters to be learned, and $h_{cls}$ denotes the hidden vector of [CLS];

$$L = -(l_{s+o} + l_{crf} + l_c) \tag{12}$$

4.2. Constructing the term standardization module

S4.2.3, building a keyword search engine that combines several recall channels into an integrated recall score $s_{recall}$. Suppose $s_1$ and $s_2$ are two terms:

wherein the BM25 score $s_{bm25}$ is calculated as follows:

the Jaccard coefficient $s_{Jaccard}$ is calculated as follows:

wherein A and B denote the word segmentation sets of $s_1$ and $s_2$ respectively, and |·| denotes the number of elements in a set;

the edit distance (MED) similarity score $s_{MED}$ is calculated as follows:

the DICE distance similarity score $s_{DICE}$ is calculated as follows:

s4.2.4, training a refined training model:

s4.2.4.1, constructing a positive sample: dividing Chinese and English of all words in the same concept CUI code in term library term _ db into two groups of terms, the first group is Chinese term set, and the second group is English term set; and combining every two term sets of each group to form a term training pair to form a positive sample set ⁺ ；

S4.2.4.2, constructing a negative sample: traversing each group of set elements Q in the concept CUI code, performing similarity calculation on the first-choice words in a medical term library term _ db through a formula (16), then taking the first-choice words of Top100 and Q to form a term pair set, and finally removing the positive sample set ⁺ Element obtaining negative sample set ^- ；

S4.2.4.3, constructing a training sample: respectively randomly scrambling set ⁺ 、set ^- The ratio of 1: ratio of 10 from set ^- Taking negative samples and merging to set ⁺ Obtaining a training sample set, wherein in the set, besides the term pair, a {0,1} mark is added, 0 represents a negative sample, and 1 represents a positive sample;

S4.2.4.4, scoring with a BERT model: each term pair in set forms an input sequence "[CLS] s1 [SEP] s2" with label y ∈ {0,1}, constituting one example of a sequence classification task; the loss function is the cross-entropy loss;

S4.2.4.5, the samples are divided into a training set and a test set for training; model evaluation uses the F1 value, and the model with the highest F1 value on the test set is saved, forming the Chinese and English fine-ranking dual models std_model_zh and std_model_en;
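The sample construction and F1-based model selection in S4.2.4 can be sketched as follows. This is a simplified illustration, not the method's actual implementation: the data layout (`concept_terms`, `preferred_terms`), all function names, and the `similarity` callable standing in for formula (16) are hypothetical. It also assumes that the 1:10 ratio means ten negatives per positive, and that a term is never paired with itself.

```python
import random
from itertools import combinations


def build_positive_pairs(concept_terms):
    """concept_terms: {cui: {"zh": [terms], "en": [terms]}} -> pairwise combinations per group."""
    positives = set()
    for groups in concept_terms.values():
        for terms in groups.values():
            positives.update(combinations(sorted(set(terms)), 2))
    return positives


def build_negative_pairs(concept_terms, preferred_terms, similarity, positives, top_k=100):
    """Pair each term Q with its top_k most similar preferred terms, then drop positive pairs."""
    negatives = set()
    for groups in concept_terms.values():
        for lang, terms in groups.items():
            for q in terms:
                ranked = sorted(preferred_terms[lang],
                                key=lambda t: similarity(q, t), reverse=True)
                for cand in ranked[:top_k]:
                    if cand == q:  # assumption: skip self-pairs
                        continue
                    pair = tuple(sorted((q, cand)))
                    if pair not in positives:
                        negatives.add(pair)
    return negatives


def build_training_set(positives, negatives, neg_per_pos=10, seed=0):
    """Shuffle both sets, keep up to neg_per_pos negatives per positive, attach {0,1} labels."""
    rng = random.Random(seed)
    pos, neg = sorted(positives), sorted(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    neg = neg[: neg_per_pos * len(pos)]
    samples = [(a, b, 1) for a, b in pos] + [(a, b, 0) for a, b in neg]
    rng.shuffle(samples)
    return samples


def f1_score(y_true, y_pred):
    """F1 value of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_model(test_predictions, y_true):
    """test_predictions: {model_name: predicted labels on the test set}; pick the highest F1."""
    return max(test_predictions, key=lambda name: f1_score(y_true, test_predictions[name]))
```

Because the negatives come from the top of a similarity ranking, they are hard negatives: terms that look alike but belong to different concepts, which is what the fine-ranking model has to learn to separate.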

S4.2.5, two-stage term standardization prediction:

S4.2.5.1, the Chinese term Q to be standardized is input and translated into English as $Q^{EN}$;

S4.2.5.2, based on formula (16), Q is used to retrieve and recall from the Chinese preferred terms and synonyms in the medical term library term_db; the Top30 are taken to construct term pairs, which are input into the second-stage model std_model_zh to obtain scores, and the Top5 form the candidate set $C_{zh\text{-}top5}$; in the same way, $Q^{EN}$ is used to retrieve and recall from the English preferred terms and synonyms in term_db; the Top30 are taken to construct term pairs, which are input into the second-stage model std_model_en to obtain scores, and the Top5 form the candidate set $C_{en\text{-}top5}$;

scores are calculated for $C_{zh\text{-}top5}$ and $C_{en\text{-}top5}$ respectively; the scores of candidates sharing the same concept CUI are averaged to form the final standardized set C, and the Top1 of C is taken as the optimal standardization result;
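The two-stage prediction above (recall Top30, fine-rank to Top5 per language, average by CUI, take Top1) can be sketched as follows. The recall and scoring callables are hypothetical placeholders for the formula (16) search and for the std_model_zh and std_model_en models; their signatures are assumptions made for this illustration.

```python
from collections import defaultdict


def standardize(query_zh, query_en, recall_zh, recall_en, score_zh, score_en,
                recall_k=30, rank_k=5):
    """
    recall_*: callable(query, k) -> list of (term, cui) candidates (first stage).
    score_*:  callable(query, term) -> float match score (second-stage model).
    Returns (best_cui, {cui: averaged score}).
    """
    def rank(query, recall, score):
        scored = [(term, cui, score(query, term)) for term, cui in recall(query, recall_k)]
        return sorted(scored, key=lambda item: item[2], reverse=True)[:rank_k]

    # Merge the Chinese and English Top5 candidate sets.
    candidates = rank(query_zh, recall_zh, score_zh) + rank(query_en, recall_en, score_en)
    if not candidates:
        raise ValueError("no candidates were recalled for the query")

    # Average the scores of candidates that share the same concept CUI.
    by_cui = defaultdict(list)
    for _term, cui, s in candidates:
        by_cui[cui].append(s)
    fused = {cui: sum(scores) / len(scores) for cui, scores in by_cui.items()}

    best_cui = max(fused, key=fused.get)
    return best_cui, fused
```

Averaging per CUI means a concept supported by both the Chinese and the English path is judged on its combined evidence, while a concept found by only one path keeps that single score.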

5. Construction of the drug knowledge graph

After the drug specification is parsed by the natural language processing model, a basic graph is obtained, consisting mainly of the defined entity labels, entities, relation labels and relations; the basic graph is converted into the final target graph through a data processing procedure, in which the graph schema is described using RDFS/OWL and inference over the basic graph is performed using the SWRL language, finally forming the target graph;
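To illustrate the kind of rule-based inference that turns a basic graph into a target graph, the sketch below applies one forward-chaining rule to a set of (subject, predicate, object) triples in plain Python. It is a stand-in for an SWRL rule engine, not the method's actual rule set: the predicate names `hasIngredient`, `interactsWith` and `mayInteractWith`, and the rule itself, are hypothetical.

```python
def infer_ingredient_interactions(triples):
    """
    Applies one rule, written here in SWRL-like form:
        hasIngredient(?d, ?x) ^ interactsWith(?x, ?y) -> mayInteractWith(?d, ?y)
    Returns the original triples plus every inferred triple.
    """
    triples = set(triples)
    ingredients = [(s, o) for s, p, o in triples if p == "hasIngredient"]
    interactions = [(s, o) for s, p, o in triples if p == "interactsWith"]
    inferred = {
        (drug, "mayInteractWith", other)
        for drug, ingredient in ingredients
        for source, other in interactions
        if source == ingredient
    }
    return triples | inferred


# Hypothetical basic graph with placeholder entity names.
basic_graph = {
    ("drug_a", "hasIngredient", "ingredient_x"),
    ("ingredient_x", "interactsWith", "ingredient_y"),
}
target_graph = infer_ingredient_interactions(basic_graph)
```

In a production setting the schema and rules would live in the RDFS/OWL ontology and an SWRL-capable reasoner rather than in application code.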

6. Construction of the continuous data updating module

The deep learning natural language processing application services are packaged using a micro-service architecture, and an MLOps process is built on Docker + Kubernetes + GitLab; with this technology, services are deployed automatically and data is processed continuously; when a new medical institution or pharmacy is onboarded, only the drug specification data needs to be uploaded, and the continuous data updating module automatically uses the natural language processing model to process the data, construct the drug knowledge graph and deploy the drug applications.

8. The method of claim 7, wherein the drug knowledge-graph structure comprises blank nodes to extend different scenarios in the drug knowledge-graph.

9. The method of claim 7, wherein in the knowledge graph of the drug, the component field is used for recording information including main components and auxiliary materials of the drug; the indication field is used for recording diseases and symptoms applicable to the medicine; the adverse reaction field is used for recording information including the name, type and occurrence frequency grade of the adverse reaction of the medicine; the contraindication field is used for recording the allergic contraindication of the medicine, the corresponding contraindications of symptoms and diseases and the contraindication information of specific crowds; the interaction field is used for recording specific interaction information between the medicine components and between the medicine categories; the usage amount field is used to record information on the purpose of administration, administration route, administration time, population, administration frequency, number of administrations, dose type, dose value per dose and dose unit of the drug.

10. The method of claim 7, wherein the tag hierarchy design is an iterative design process that requires dynamic modification of the tag hierarchy when the natural language processing model performance does not meet requirements.