WO2023065858A1 - Medical term standardization system and method based on heterogeneous graph neural network - Google Patents
Medical term standardization system and method based on heterogeneous graph neural network
Info
- Publication number
- WO2023065858A1 (PCT/CN2022/116967; CN2022116967W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- medical
- nodes
- training
- content
- Prior art date
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 79
- 238000000034 method Methods 0.000 title claims abstract description 37
- 238000012549 training Methods 0.000 claims abstract description 84
- 238000009826 distribution Methods 0.000 claims abstract description 17
- 239000013598 vector Substances 0.000 claims description 50
- 230000008569 process Effects 0.000 claims description 21
- 229940079593 drug Drugs 0.000 claims description 20
- 239000003814 drug Substances 0.000 claims description 20
- 238000005259 measurement Methods 0.000 claims description 20
- 238000004364 calculation method Methods 0.000 claims description 18
- 238000002372 labelling Methods 0.000 claims description 12
- 201000010099 disease Diseases 0.000 claims description 10
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 claims description 10
- 230000006870 function Effects 0.000 claims description 9
- 238000010606 normalization Methods 0.000 claims description 9
- 238000006243 chemical reaction Methods 0.000 claims description 7
- 238000007689 inspection Methods 0.000 claims description 6
- 239000011159 matrix material Substances 0.000 claims description 6
- 230000000694 effects Effects 0.000 claims description 4
- 238000012360 testing method Methods 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000012546 transfer Methods 0.000 claims description 2
- WCUXLLCKKVVCTQ-UHFFFAOYSA-M Potassium chloride Chemical compound [Cl-].[K+] WCUXLLCKKVVCTQ-UHFFFAOYSA-M 0.000 description 8
- 239000000243 solution Substances 0.000 description 5
- 238000013461 design Methods 0.000 description 4
- 230000014509 gene expression Effects 0.000 description 4
- 239000001103 potassium chloride Substances 0.000 description 4
- 235000011164 potassium chloride Nutrition 0.000 description 4
- 238000011425 standardization method Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000012986 modification Methods 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- WQZGKKKJIJFFOK-GASJEMHNSA-N Glucose Chemical compound OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O WQZGKKKJIJFFOK-GASJEMHNSA-N 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 229940093181 glucose injection Drugs 0.000 description 2
- 230000001788 irregular Effects 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- GSDSWSVVBLHKDQ-JTQLQIEISA-N Levofloxacin Chemical compound C([C@@H](N1C2=C(C(C(C(O)=O)=C1)=O)C=C1F)C)OC2=C1N1CCN(C)CC1 GSDSWSVVBLHKDQ-JTQLQIEISA-N 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 239000002552 dosage form Substances 0.000 description 1
- 239000004615 ingredient Substances 0.000 description 1
- 229940090044 injection Drugs 0.000 description 1
- 239000007924 injection Substances 0.000 description 1
- 238000002347 injection Methods 0.000 description 1
- 230000009191 jumping Effects 0.000 description 1
- 229960003376 levofloxacin Drugs 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 229940073414 potassium chloride oral solution Drugs 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 239000008354 sodium chloride injection Substances 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Definitions
- the invention belongs to the technical field of standardization of Chinese medical terms and a multi-center medical information platform, and in particular relates to a medical term standardization system and method based on a heterogeneous graph neural network.
- Medical terminology mainly includes terms such as drugs, medical examinations, and diseases generated during clinical operations. Different types of medical terms will contain information of specific key dimensions, which we define as information units of medical terms.
- the drug term "5% glucose injection (base) 500 ml" contains information elements as shown in Table 1:
- Some information units are composed of other finer-grained information units, which are respectively defined as first-level information units and second-level information units.
- the traditional standardization method of medical terms is to understand the meaning of each medical term through machine learning or manual verification for a single category of medical terms, and mark the medical terms with the same semantics. This method regards each medical term as a whole, ignoring the inherent information unit structure within the medical term.
- the main disadvantages are: (1) knowledge about the associations and differences between information units cannot be used effectively; the associations and differences between information units of different dimensions of the same medical term carry rich medical domain knowledge, but existing practice does not explicitly structure or exploit this knowledge; (2) different types of medical terms contain identical or related information units, yet past standardization work has built independent systems for each single category of medical terms, which both multiplies the workload and prevents comprehensive use of the knowledge across term types.
- the purpose of the present invention is to address the shortcomings of current medical term standardization methods and, based on the characteristics of medical terms themselves, to propose a medical term standardization system and method based on a heterogeneous graph neural network.
- the present invention constructs a new information-unit-based knowledge graph covering all medical terms and, on this knowledge graph, standardizes medical terms through an improved heterogeneous graph neural network, effectively exploiting the knowledge in the information units of medical terms and obtaining more accurate standardization results.
- in order to make full use of the medical domain knowledge contained in medical terms themselves during standardization, the present invention first constructs key information units for each type of medical term, realizing a structured representation of the terms, and builds a knowledge graph containing all types of medical terms based on these information units. On this knowledge graph a heterogeneous graph neural network covering all types of medical terms is constructed; during its training, the adjacent-node distribution and the node content encodings are considered jointly for medical term standardization.
- the present invention can make full use of the knowledge about associations and differences between the information units of related medical terms, accommodates all types of medical terms in one system, learns medical domain knowledge comprehensively, and allows new types of medical terms to be added conveniently, reducing the workload of standardizing new term types.
- in the process of extracting information units from medical terms, redundant characters and information are discarded to avoid introducing excessive noise and errors.
- One aspect of the present invention discloses a medical term standardization system based on a heterogeneous graph neural network, including:
- Information unit building module: defines key information units for each type of medical term; the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms and builds an information unit library.
- Medical terminology knowledge graph module: constructs a medical terminology knowledge graph based on the relationships between medical terms and information units.
- the nodes of the knowledge graph include medical term nodes and information unit nodes.
- the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side.
- Heterogeneous graph neural network module: trains the heterogeneous graph neural network based on the adjacent-node distribution and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the knowledge graph; the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;
- for a node whose content is text, its content encoding is obtained through a pretrained language model.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed;
- the training goal is to maximize the similarity between medical term nodes with the same meaning.
- Prediction result output module: feeds the medical term node to be standardized into the trained heterogeneous graph neural network, obtains a ranking of its similarity to the other medical term nodes in the medical terminology knowledge graph, and outputs the medical term standardization result.
- the sequence labeling model is a BiLSTM-CRF model; on the medical terms used as training data, the span of each information unit is annotated and characters belonging to no information unit are also marked, so that the sequence labeling model can discard extraneous characters that have no effect on the overall meaning of the medical term.
- values and measurement units are preliminarily standardized: the original measurement unit is normalized into a single basic unit, or into several basic units combined through operation symbols, and the value is converted accordingly.
- let V denote the set of all nodes in the medical terminology knowledge graph; for v_i ∈ V, value(v_i) denotes its node content and e(v_i) its content encoding. For a node v_i whose content is numerical, the content encoding is e(v_i) = value(v_i) · e_I, where:
- value(v_i) is the numerical value of node v_i itself;
- e_I is a unit vector, randomly initialized and learned during training of the heterogeneous graph neural network.
- for a node whose content is a measurement unit, the semantic vectors of the basic units and operation symbols it contains are concatenated with the vector concatenation operator and passed through a nonlinear transformation, where:
- M_0 is the parameter matrix of this transformation, learned during training of the heterogeneous graph neural network;
- e(q_l) is the semantic vector of each basic unit or operation symbol, randomly initialized and learned during training of the heterogeneous graph neural network.
- for a node whose content is text, the pretrained language model is a BERT model; in the corresponding calculation:
- Z_{k+1} is the hidden state of layer k+1 of the BERT model;
- M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training;
- d is the dimension of Z_{k+1};
- Z_k is the hidden state of layer k of the BERT model.
- the vector representation of each node is calculated from the content encoding of the node itself and of its adjacent nodes in the medical terminology knowledge graph; for a node v_i ∈ V, let N_1(v_i) denote the set of nodes directly pointed to by edges starting from v_i; if v_i is a medical term node, N_1(v_i) is the set of its first-level information units, and the nodes reached by one further hop form the set of its second-level information units; the adjacent node set of v_i is defined as the union of these two sets.
- M_6 and M_7 are matrix parameters obtained from training, and f(·) is a nonlinear activation function.
- denoting the set of trainable parameters by θ, the goal of the first training stage is to optimize an objective function that maximizes, over all nodes, the conditional probability of each node's adjacent nodes given that node.
- in the second stage, v_i and v_j are medical term nodes in the medical terminology knowledge graph, sim(v_i, v_j) is the similarity between v_i and v_j, and W and b are parameters obtained from training;
- the goal of the second stage is to minimize a loss function L defined over labeled pairs of medical term nodes.
- for a medical term node v* to be standardized, the similarity between v* and the other medical term nodes in the medical terminology knowledge graph is computed with the trained heterogeneous graph neural network and sorted, and the medical term node with the greatest similarity to v* is taken.
- Another aspect of the present invention discloses a medical term standardization method based on a heterogeneous graph neural network, comprising the following steps:
- define key information units for each type of medical term; the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; use a sequence labeling model to identify, at the character level, the information units contained in all medical terms and build an information unit library;
- the nodes of the knowledge graph include medical terminology nodes and information unit nodes.
- the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side;
- the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the medical terminology knowledge graph;
- the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;
- for a node whose content is text, its content encoding is obtained through a pretrained language model.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed;
- the training goal is to maximize the similarity between medical term nodes with the same meaning.
- the present invention defines a unified information unit structure for different types of medical terms and realizes a relatively unified structured representation, so that medical domain knowledge can be better exploited during standardization and the associations and differences between the information units of terms of the same kind and of different kinds can be fully learned.
- by integrating all medical terms into the knowledge graph, a single heterogeneous graph neural network standardizes different types of medical terms, improving the completeness and consistency of the output while improving the efficiency of medical term standardization.
- FIG. 1 is a structural diagram of a medical term standardization system based on a heterogeneous graph neural network provided by an embodiment of the present invention
- Fig. 2 illustrates the training data of the sequence labeling model provided by an embodiment of the present invention.
- Fig. 3 is a schematic diagram of the medical terminology knowledge graph provided by an embodiment of the present invention.
- standardization of medical terms refers to the process of combining medical domain knowledge with natural language processing methods to analyze the medical terms generated in real clinical environments, identifying terms with the same meaning and distinguishing terms with different meanings, so that medical terminology within a given scope is harmonized to achieve optimal order and social benefit.
- establishing a unified medical terminology standard and term set helps to solve problems such as term duplication, unclear connotation, and inconsistent semantic expression and understanding, and is of great significance for promoting the dissemination, sharing and use of medical information on a wider and deeper level.
- Heterogeneous graph neural network refers to: Traditional deep learning methods have achieved great success on linear and matrix-shaped data, but the data in many practical application scenarios is graph-structured. In recent years, researchers have used the ideas of convolutional networks and recurrent networks to define and design graph neural network models for processing graph data. Ordinary graph neural networks can achieve good performance by only using the adjacent node information of graphs for graphs with a single node and relationship type. However, graph data in the real world usually has many types of nodes and relationships with large differences. This type of graph is called a heterogeneous graph.
- An embodiment of the present invention provides a medical terminology standardization system based on a heterogeneous graph neural network, as shown in Figure 1, the system includes the following modules:
- Information unit building module, including:
- the medical term types include drug terms, disease terms, surgical terms, test terms and inspection terms;
- the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels.
- Medical terminology knowledge graph module: constructs a medical terminology knowledge graph based on the relationships between medical terms and information units.
- the nodes of the knowledge graph include medical term nodes and information unit nodes.
- the edges are directed and cover two types of relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side.
- Heterogeneous graph neural network module: trains the heterogeneous graph neural network based on the adjacent-node distribution and node content encodings of the medical terminology knowledge graph.
- the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the medical terminology knowledge graph.
- the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed;
- the training goal is to maximize the similarity between medical term nodes with the same meaning.
- Prediction result output module: feeds the medical term node to be standardized into the trained heterogeneous graph neural network, obtains a ranking of its similarity to the other medical term nodes in the medical terminology knowledge graph, and outputs the medical term standardization result.
- Defining the information units of medical terms: at present there are several international medical terminology standard sets that define information units of key dimensions for one specific category of medical terms. However, the standard sets for different term types do not establish associations between their information units, so past standardization work could only use information within a single term category and ignored much useful information.
- the present invention combines existing international medical terminology standard sets with expert knowledge from actual clinical practice, uniformly defines the key information units for the various types of medical terms, and specifies a detailed structure of first-level and second-level information units.
- the types of medical terms realized in the present invention include drug terms, disease terms, surgical terms, test terms and inspection terms. If a new type of medical term needs to be standardized later, the system can be conveniently extended by defining information units for the new term type.
- the specific definitions of the information units of the implemented medical term types are shown in Table 3.
- a sequence labeling model is used to predict, for each character of a medical term, the probability of belonging to each information unit, so as to identify all the information units contained in the term and realize its structured representation.
- the sequence labeling model used in this embodiment is a BiLSTM-CRF model: a BiLSTM network first captures the contextual information of the medical term, then state and transition probability matrices are constructed from the BiLSTM outputs at each character position to build a CRF model; this architecture has achieved good results on sequence labeling tasks.
- the process of constructing training data for the sequence labeling model is shown in Figure 2.
- on the medical terms used as training data, the span of each information unit is annotated and the characters belonging to no information unit are also marked, so that the sequence labeling model can discard redundant characters that have no effect on the overall meaning of the term, avoiding the introduction of excess noise into the subsequent heterogeneous graph neural network.
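A minimal PyTorch sketch of the character-level tagging component described above: a BiLSTM produces per-character emission scores over an information-unit tag set; the CRF layer that the embodiment stacks on top of these scores is only indicated in a comment. The tag names, vocabulary size and dimensions are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """Character-level BiLSTM emitting per-character tag scores.

    In the described embodiment a CRF layer (e.g. an implementation such as
    the pytorch-crf package) would be stacked on these emission scores to
    model tag transitions; only the BiLSTM emission part is sketched here.
    """

    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)
        self.emit = nn.Linear(hidden_dim, num_tags)

    def forward(self, char_ids):                 # (batch, seq_len)
        h, _ = self.bilstm(self.embed(char_ids))
        return self.emit(h)                      # (batch, seq_len, num_tags)

# Illustrative BIO-style tag set for drug-term information units ("O" marks
# characters that belong to no information unit and can be discarded).
TAGS = ["O", "B-ingredient", "I-ingredient", "B-dosage_form", "I-dosage_form",
        "B-dose", "I-dose", "B-spec_value", "I-spec_value",
        "B-spec_unit", "I-spec_unit"]

model = CharBiLSTMTagger(vocab_size=5000, num_tags=len(TAGS))
scores = model(torch.randint(1, 5000, (2, 20)))  # dummy batch of 2 terms
print(scores.shape)                              # torch.Size([2, 20, 11])
```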
- the operation symbols include multiplication and division.
- a total of 90 normalized units of measure are generated. For example: the original measurement unit is l (liter), the corresponding value is 1, the standardized measurement unit is ml (milliliter), and the corresponding value is converted to 1000 accordingly.
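A small sketch of this value and measurement-unit normalization step, assuming a hand-written conversion table; the embodiment reportedly works with about 90 normalized units, of which only a few invented entries are shown here.

```python
from fractions import Fraction

# Illustrative conversion table: original unit -> (factor, normalized base unit).
UNIT_TABLE = {
    "l":     (Fraction(1000), "ml"),
    "ml":    (Fraction(1),    "ml"),
    "g":     (Fraction(1000), "mg"),
    "mg":    (Fraction(1),    "mg"),
    "mg/ml": (Fraction(1),    "mg/ml"),  # compound unit built with a division symbol
}

def normalize_measure(value, unit):
    """Rewrite a (value, unit) pair in terms of the chosen base unit."""
    factor, base = UNIT_TABLE[unit.lower()]
    return float(Fraction(str(value)) * factor), base

print(normalize_measure(1, "l"))     # (1000.0, 'ml')  -- the example from the text
print(normalize_measure(0.5, "g"))   # (500.0, 'mg')
```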
- a knowledge graph containing all types of medical terms is constructed, as shown in Figure 3. It contains two kinds of nodes: circular nodes represent medical term nodes and rectangular nodes represent information unit nodes, and each broad node type contains nodes of finer subtypes; for example, medical term nodes include "drug term" nodes, "disease term" nodes, etc., and information unit nodes include "drug dose" nodes, "value" nodes, etc.
- edges include two kinds of relations: 1) the inclusion relationship between a medical term and an information unit; 2) the inclusion relationship between a first-level information unit and a second-level information unit.
- the division into first-level and second-level information units may differ across term types; for example, for disease terms "disease subject" is a first-level information unit, whereas for surgical terms "disease subject" is a second-level information unit contained in the first-level information unit "nature of disease".
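To make the graph structure concrete, the following sketch builds a tiny fragment of such a knowledge graph for the drug term of Table 1 using networkx; all node names and attribute keys are illustrative assumptions. It also lists the nodes reachable within two hops along the edge direction, i.e. the adjacent nodes used later by the heterogeneous graph neural network.

```python
import networkx as nx

# Directed graph: edges run from the containing side to the contained side.
kg = nx.DiGraph()

# One medical term node (circular in Fig. 3) and its information unit nodes
# (rectangular in Fig. 3); node names and attributes are illustrative.
term = "term:5% glucose injection (base) 500 ml"
kg.add_node(term, kind="drug_term")
kg.add_node("iu:ingredient=glucose",    kind="info_unit", level=1)
kg.add_node("iu:dosage_form=injection", kind="info_unit", level=1)
kg.add_node("iu:spec=500 ml",           kind="info_unit", level=1)
kg.add_node("iu:value=500",             kind="info_unit", level=2)
kg.add_node("iu:unit=ml",               kind="info_unit", level=2)

# Relation type 1: medical term -> information unit it contains.
for iu in ["iu:ingredient=glucose", "iu:dosage_form=injection", "iu:spec=500 ml"]:
    kg.add_edge(term, iu, relation="term_contains")

# Relation type 2: first-level information unit -> second-level information unit.
kg.add_edge("iu:spec=500 ml", "iu:value=500", relation="unit_contains")
kg.add_edge("iu:spec=500 ml", "iu:unit=ml",   relation="unit_contains")

# "Adjacent nodes" of the term: everything reachable within two hops
# along the edge direction.
adjacent = set(nx.single_source_shortest_path_length(kg, term, cutoff=2)) - {term}
print(sorted(adjacent))
```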
- a heterogeneous graph refers to a graph with complex nodes and relationship types.
- the medical terminology knowledge graph shown in Figure 3 is a heterogeneous graph.
- for graphs with relatively uniform node and relationship types, ordinary graph neural networks can achieve good performance relying only on adjacent-node information.
- the present invention therefore designs appropriate calculation methods for the different node types.
- for a node v_i whose content is numerical, the content encoding is e(v_i) = value(v_i) · e_I, where value(v_i) is the numerical value of node v_i itself and e_I is a unit vector, randomly initialized and learned during training of the heterogeneous graph neural network.
- for a node whose content is a measurement unit, the semantic vectors of the basic units and operation symbols it contains are concatenated with the vector concatenation operator and passed through a nonlinear transformation, where M_0 is the parameter matrix of this transformation, learned during training of the heterogeneous graph neural network, and e(q_l) is the semantic vector of each basic unit or operation symbol, randomly initialized and learned during training of the heterogeneous graph neural network.
- the pretrained language model used in this embodiment is a BERT model; in the corresponding calculation:
- Z_{k+1} is the hidden state of layer k+1 of the BERT model;
- M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training;
- d is the dimension of Z_{k+1};
- Z_k is the hidden state of layer k of the BERT model.
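The exact encoding formulas are not reproduced in this excerpt, so the sketch below only mirrors the prose: a numeric node is encoded as its value times a learned vector e_I, a measurement-unit node by concatenating learned symbol vectors and applying a transformation M_0 with a nonlinearity, and a text node by projecting a pretrained-language-model vector. The dimensions, the fixed-length padding and the projection for text nodes are assumptions.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Content encodings for the three node-content types described above.
    Dimensions, the fixed-length padding and the LeakyReLU nonlinearity are
    illustrative assumptions; only the overall structure follows the text."""

    def __init__(self, dim=128, num_symbols=100, max_unit_len=4, text_dim=768):
        super().__init__()
        self.e_I = nn.Parameter(torch.randn(dim))         # learned vector e_I
        self.symbol_emb = nn.Embedding(num_symbols, dim)  # base units and * / symbols
        self.M0 = nn.Linear(max_unit_len * dim, dim)      # applied after concatenation
        self.text_proj = nn.Linear(text_dim, dim)         # assumed dimension-matching layer
        self.act = nn.LeakyReLU(0.2)                      # f(x) = max(0,x) + 0.2*min(0,x)
        self.max_unit_len = max_unit_len

    def encode_numeric(self, value: float) -> torch.Tensor:
        # e(v_i) = value(v_i) * e_I
        return value * self.e_I

    def encode_unit(self, symbol_ids: torch.Tensor) -> torch.Tensor:
        # Concatenate the semantic vectors of the node's base units / operation
        # symbols (padded to a fixed length), then apply M_0 and the nonlinearity.
        emb = self.symbol_emb(symbol_ids)                               # (L, dim)
        pad = torch.zeros(self.max_unit_len - emb.size(0), emb.size(1))
        return self.act(self.M0(torch.cat([emb, pad]).flatten()))

    def encode_text(self, lm_vector: torch.Tensor) -> torch.Tensor:
        # lm_vector stands in for the pretrained-language-model vector (Z_m in
        # the text); the projection is added only to match dimensions.
        return self.act(self.text_proj(lm_vector))

enc = ContentEncoder()
print(enc.encode_numeric(500.0).shape)              # torch.Size([128])
print(enc.encode_unit(torch.tensor([3, 7])).shape)  # e.g. ids for "mg" and "/"
print(enc.encode_text(torch.randn(768)).shape)      # torch.Size([128])
```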
- the vector representation of each node is calculated from the content encoding of the node itself and of its adjacent nodes in the medical terminology knowledge graph.
- let N_1(v_i) denote the set of nodes directly pointed to by edges starting from v_i; if v_i is a medical term node, N_1(v_i) is the set of its first-level information units, and the nodes reached by one further hop form the set of its second-level information units.
- the adjacent node set of v_i is the union of these two sets, and its vector representation F(v_i) is computed from the content encodings over this set,
- where M_6 and M_7 are matrix parameters obtained from training and f(·) is a nonlinear activation function.
- in this embodiment f(x) = max(0, x) + 0.2 · min(0, x) is used. Since the relative importance of nodes is asymmetric, the corresponding weighting is also asymmetric.
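The precise formula for F(v_i) is likewise not reproduced in this excerpt; the sketch below shows one plausible form consistent with the description, passing the node's own content encoding through M_6 and the averaged encodings of its adjacent nodes through M_7 before the activation f. This concrete form is an assumption, not the patent's exact equation.

```python
import torch
import torch.nn as nn

class NodeRepresentation(nn.Module):
    """One plausible reading of F(v_i): own content encoding through M6,
    mean of adjacent-node encodings through M7, summed and passed through
    the leaky-ReLU-style activation f."""

    def __init__(self, dim=128):
        super().__init__()
        self.M6 = nn.Linear(dim, dim, bias=False)
        self.M7 = nn.Linear(dim, dim, bias=False)
        self.f = nn.LeakyReLU(0.2)   # f(x) = max(0, x) + 0.2 * min(0, x)

    def forward(self, e_self: torch.Tensor, e_neighbors: torch.Tensor):
        # e_self: (dim,), e_neighbors: (num_adjacent, dim)
        return self.f(self.M6(e_self) + self.M7(e_neighbors.mean(dim=0)))

rep = NodeRepresentation()
F_vi = rep(torch.randn(128), torch.randn(5, 128))
print(F_vi.shape)    # torch.Size([128])
```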
- the training process is divided into two stages: 1) the adjacent-node distribution and the node content encodings are used as input, the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, and a vector representation is obtained for each node; 2) the node vector representations are used as input to compute the similarity of any two medical term nodes, and the training goal is to maximize the similarity of medical term nodes with the same meaning.
- denoting the set of trainable parameters by θ, the goal of the first stage is to optimize an objective function that maximizes, over all nodes, the probability of predicting each node's adjacent nodes from that node.
- in the second stage, for medical term nodes v_i and v_j in the medical terminology knowledge graph, the similarity sim(v_i, v_j) is computed from their vector representations,
- where W and b are parameters obtained from training.
- the goal of the second stage is to minimize a loss function L defined over the labeled pairs of same-meaning and different-meaning medical term nodes.
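The objective and loss functions themselves are not reproduced in this excerpt, so the sketch below substitutes standard choices: a skip-gram-style objective with negative sampling for the first stage, and a sigmoid similarity head with parameters W and b trained with binary cross-entropy over labeled term pairs for the second stage. These concrete forms are assumptions consistent with, but not dictated by, the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128

# Stage 1 (assumed skip-gram-with-negative-sampling form): pull each node's
# vector towards the vectors of its adjacent nodes, push it away from
# randomly sampled nodes.
def stage1_loss(node_vec, adj_vecs, neg_vecs):
    pos = F.logsigmoid(adj_vecs @ node_vec).sum()
    neg = F.logsigmoid(-(neg_vecs @ node_vec)).sum()
    return -(pos + neg)          # minimizing this maximizes neighbour probability

# Stage 2: similarity of two medical-term nodes with trained parameters W and b
# (concatenation + linear layer + sigmoid is an assumed concrete form).
class PairSimilarity(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(2 * dim, 1)     # weights W and bias b
    def forward(self, fi, fj):
        return torch.sigmoid(self.W(torch.cat([fi, fj], dim=-1))).squeeze(-1)

l1 = stage1_loss(torch.randn(dim), torch.randn(4, dim), torch.randn(4, dim))

sim = PairSimilarity(dim)
fi, fj = torch.randn(dim), torch.randn(dim)
label = torch.tensor(1.0)                  # y = 1: same meaning, y = 0: different
l2 = F.binary_cross_entropy(sim(fi, fj), label)
print(float(l1), float(l2))
```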
- Example similarity ranking (candidate medical term | similarity):
- Potassium chloride needle 10ml 1g Otsuka Pharmaceutical Co., Ltd. | 0.96021
- Potassium chloride injection (base) 1000mg/10ml | 0.90966
- (10ml) Potassium Chloride Oral Solution 10%*1 stick | 0.80715
- Sodium Chloride Injection 0.9% 100ml*1 bag | 0.61092
- the embodiment of the present invention also provides a medical term standardization method based on a heterogeneous graph neural network, the method comprising:
- the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms and constructs the information unit library; the implementation of this step follows the information unit building module.
- the nodes of the knowledge graph include medical terminology nodes and information unit nodes.
- the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side.
- the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the medical terminology knowledge graph.
- the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;
- for a node whose content is text, its content encoding is obtained through a pretrained language model.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed; the training goal is to maximize the similarity between medical term nodes with the same meaning.
- the implementation of this step refers to the heterogeneous graph neural network module.
- the invention defines information units for the various types of medical terms and identifies the information units contained in each term, realizing a structured representation of medical terms.
- this structured representation not only improves the standardization of medical terms but can also greatly promote other aspects of medical informatization work;
- the invention builds a new type of knowledge graph for medical terminology based on these information units, which can effectively support various medical informatization tasks including terminology standardization;
- the present invention constructs a new type of heterogeneous graph neural network for medical term standardization, realizing the standardization of different types of medical terms with a unified model.
- appropriate content encoding methods are realized for the different kinds of node content, and a staged training method is designed for the heterogeneous graph neural network.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Public Health (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Animal Behavior & Ethology (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
Provided are a medical term standardization system and method based on a heterogeneous graph neural network. The method comprises: establishing key information units for the various types of medical terms so as to achieve a structured representation of the terms and, on the basis of these information units, building a knowledge graph covering the various types of medical terms; on the basis of the knowledge graph, building a heterogeneous graph neural network covering the various types of medical terms and, while training it, jointly considering the adjacent-node distribution and node content encodings of the graph so as to perform medical term standardization. Knowledge about the associations and differences between the information units of related medical terms can be fully utilized, all types of medical terms are accommodated at the same time, knowledge of the medical field can be learned comprehensively, and new types of medical terms can be conveniently added to the system, reducing the workload of standardizing new term types.
Description
The invention belongs to the technical field of Chinese medical term standardization and multi-center medical information platforms, and in particular relates to a medical term standardization system and method based on a heterogeneous graph neural network.
An important research direction in medical informatization is to apply higher-performance machine learning and artificial intelligence techniques to practical clinical problems. One advantage of artificial intelligence is its ability to discover complex patterns and features in massive data; comprehensively analyzing and mining the medical data of multiple medical institutions and building models on it, so as to support medical research and clinical decision-making, has therefore become an inevitable trend in medical informatization. However, because different medical institutions adopt numerous information standards and frequently produce semi-structured and unstructured data, integrating and using medical data from different sources is extremely difficult. Medical terms are the basic elements of medical data; a complete medical term standardization system can align medical data from different sources to a unified standard and structure, providing larger-scale and higher-quality data for clinical decision-making and medical research. Medical terminology mainly includes terms for drugs, medical examinations, diseases and the like generated in clinical practice. Different types of medical terms contain information along specific key dimensions, which we define as the information units of medical terms. For example, the drug term "5% glucose injection (base) 500 ml" contains the information units shown in Table 1:
Table 1. Example of drug term information units
Information unit name | Drug ingredient | Dosage form | Drug dose | Drug specification
Information unit value | Glucose | Injection | 5% | 500 ml
The examination term "左手指正侧位_X" ("left finger anteroposterior and lateral views, X-ray") contains the information units shown in Table 2:

Table 2. Example of examination term information units
Information unit name | Body part | Body part side | Examination view | Examination method
Information unit value | Finger | Left side | Anteroposterior + lateral | X-ray photography
Some information units are composed of other, finer-grained information units; these are defined as first-level information units and second-level information units, respectively. For example, the drug term in Table 1 contains the first-level information units "drug ingredient", "dosage form", "drug dose" and "drug specification", where the "drug specification" information unit is composed of the second-level information units "value" (500) and "measurement unit" (ml). Given the set of information units of a medical term, the complete medical term is determined.
In actual clinical practice, differences between the information standards adopted by medical institutions and differences in the personal habits of medical staff produce a large number of non-standard medical terms, typically showing redundant or missing key information units, irregular expressions, and inconsistent quantity units. For example, the following drug terms have exactly the same meaning but differ greatly in form: "Levofloxacin tablets (Cravit) 500 mg" and "Cravit 0.5 g/tablet". The goal of medical term standardization is to identify medical terms with identical meaning but different literal forms so that their expression can be unified, while distinguishing terms with different meanings, ultimately promoting the standardization of medical data as a whole.
The traditional approach to medical term standardization addresses a single category of medical terms, using machine learning or manual verification to understand the meaning of each term and to label terms with the same semantics. This approach treats each medical term as a whole and ignores the information unit structure inherent within it. Its main disadvantages are: (1) knowledge about the associations and differences between information units cannot be used effectively; the associations and differences between information units of different dimensions of the same medical term carry rich medical domain knowledge, but existing practice does not explicitly structure or exploit this knowledge; (2) different types of medical terms contain identical or related information units, yet past standardization work has built independent systems for each single category of medical terms, which both multiplies the workload and prevents comprehensive use of the knowledge in the information units of different term types; (3) redundant information is taken into account; owing to irregular expression, most medical terms contain, besides the key information units, redundant characters that are almost unrelated to the overall meaning of the term and, as noise, bias its meaning.
Summary of the Invention

The purpose of the present invention is to address the shortcomings of current medical term standardization methods and, based on the characteristics of medical terms themselves, to propose a medical term standardization system and method based on a heterogeneous graph neural network. The invention constructs a new information-unit-based knowledge graph covering all medical terms and, on this knowledge graph, standardizes medical terms through an improved heterogeneous graph neural network, effectively exploiting the knowledge in the information units of medical terms and obtaining more accurate standardization results.

The purpose of the present invention is achieved through the following technical solution. In order to make full use of the medical domain knowledge contained in medical terms themselves during standardization, the invention first constructs key information units for each type of medical term, realizing a structured representation of the terms, and builds a knowledge graph containing all types of medical terms based on these information units. On this knowledge graph a heterogeneous graph neural network covering all types of medical terms is constructed; during its training, the adjacent-node distribution and the node content encodings of the graph are considered jointly for medical term standardization. In this way the invention can make full use of the knowledge about associations and differences between the information units of related medical terms, accommodates all types of medical terms in one system, learns medical domain knowledge comprehensively, and allows new types of medical terms to be added conveniently, reducing the workload of standardizing new term types. Redundant characters and information are discarded when extracting information units from medical terms, avoiding the introduction of excessive noise and error.
One aspect of the present invention discloses a medical term standardization system based on a heterogeneous graph neural network, including:

(1) Information unit building module: defines key information units for each type of medical term; the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms and builds an information unit library;

(2) Medical terminology knowledge graph module: constructs a medical terminology knowledge graph based on the relationships between medical terms and information units; the nodes of the knowledge graph include medical term nodes and information unit nodes; the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side;
(3) Heterogeneous graph neural network module: trains the heterogeneous graph neural network based on the adjacent-node distribution and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the knowledge graph; the node content encoding is defined as follows:

For a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;

for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;

for a node whose content is text, its content encoding is obtained through a pretrained language model.

In the first stage of training, the adjacent-node distribution and the node content encodings are used as input; the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.

In the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed; the training goal is to maximize the similarity between medical term nodes with the same meaning.
(4) Prediction result output module: feeds the medical term node to be standardized into the trained heterogeneous graph neural network, obtains a ranking of its similarity to the other medical term nodes in the medical terminology knowledge graph, and outputs the medical term standardization result.

Further, the types of medical terms include drug terms, disease terms, surgical terms, test terms and inspection terms.

Further, in the information unit building module, the sequence labeling model is a BiLSTM-CRF model; on the medical terms used as training data the span of each information unit is annotated and characters belonging to no information unit are also marked, so that the sequence labeling model can discard extraneous characters that have no effect on the overall meaning of the medical term.

Further, in the information unit building module, values and measurement units are preliminarily standardized: the original measurement unit is normalized into a single basic unit, or into several basic units combined through operation symbols, and the value is converted accordingly.
Further, in the heterogeneous graph neural network module, let V denote the set of all nodes in the medical terminology knowledge graph; for v_i ∈ V, value(v_i) denotes its node content and e(v_i) its content encoding. For a node v_i whose content is numerical, the content encoding is

e(v_i) = value(v_i) · e_I

where value(v_i) is the numerical value of node v_i itself, and e_I is a unit vector that is randomly initialized and learned during training of the heterogeneous graph neural network.
For a node v_i whose content is a measurement unit, the node content is a sequence composed of basic units and operation symbols. Let value(v_i) = (q_1, q_2, ..., q_l, ..., q_L), where q_l is a basic unit or operation symbol and L is the sequence length of v_i. The content encoding e(v_i) is then obtained by concatenating the semantic vectors e(q_1), ..., e(q_L) with the vector concatenation operator and applying a nonlinear transformation, where M_0 is the parameter matrix of this transformation, learned during training of the heterogeneous graph neural network, and e(q_l) is the semantic vector of each basic unit or operation symbol, randomly initialized and learned during training of the heterogeneous graph neural network.
For a node v_i whose content is text, the semantic vector of v_i computed by a pretrained language model is used as the initial content encoding of v_i, and the encoding is further trained by the subsequent heterogeneous graph neural network.
Further, for a node v_i whose content is text, the pretrained language model is a BERT model. In the corresponding calculation, Z_{k+1} is the hidden state of layer k+1 of the BERT model, computed from the input of layer k+1; M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training, d is the dimension of Z_{k+1}, and Z_k is the hidden state of layer k of the BERT model. If the BERT model has m layers in total, the initial content encoding of node v_i is e(v_i) = Z_m.
Further, in the heterogeneous graph neural network module, the vector representation of each node is calculated from the content encoding of the node itself and of its adjacent nodes in the medical terminology knowledge graph. For a node v_i ∈ V, let N_1(v_i) denote the set of nodes directly pointed to by edges starting from v_i; if v_i is a medical term node, N_1(v_i) is the set of its first-level information units, and the nodes reached by one further hop form the set of its second-level information units. The adjacent node set of v_i is defined as the union of these two sets. The vector representation F(v_i) of v_i is then computed from the content encodings over this set, where M_6 and M_7 are matrix parameters obtained from training and f(·) is a nonlinear activation function.
Further, in the heterogeneous graph neural network module, in the first stage of training the set of trainable parameters is denoted θ, and the training goal is to optimize an objective function that maximizes the probability of predicting each adjacent node v from node v_i.

In the second stage of training, the similarity sim(v_i, v_j) of any two medical term nodes v_i and v_j in the medical terminology knowledge graph is computed from their vector representations, where W and b are parameters obtained from training.

In the medical term standardization training data, let V_i^+ be the set of medical term nodes with the same meaning as the medical term node v_i, and V_i^- be the set of medical term nodes with a different meaning from v_i; the label y_i(v) of a training sample is defined accordingly. The goal of the second stage is to minimize a loss function L defined over these labeled samples.
Further, in the prediction result output module, for a medical term node v* to be standardized, the similarity between v* and the other medical term nodes in the medical terminology knowledge graph is computed with the trained heterogeneous graph neural network and sorted, and the medical term node with the greatest similarity to v* is taken. A threshold c is set on the similarity: if the greatest similarity exceeds c, v* is considered to have the same meaning as that node, which gives the standardization result for v*; otherwise v* is considered to differ in meaning from all other medical term nodes in the knowledge graph and to have an independent meaning.
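A minimal sketch of this prediction step: rank all medical term nodes by similarity to the term being standardized, take the top candidate, and accept it only if its similarity reaches the threshold c. The value of c, the cosine stand-in for the trained similarity head, and the candidate entries are illustrative assumptions.

```python
import torch

def normalize_term(query_vec, candidates, sim_fn, c=0.9):
    """Rank candidate standard-term vectors by similarity to the query term
    vector; return the top candidate only if its similarity reaches the
    threshold c, otherwise report that the query has an independent meaning."""
    scored = sorted(((name, float(sim_fn(query_vec, vec))) for name, vec in candidates),
                    key=lambda pair: pair[1], reverse=True)
    best_name, best_sim = scored[0]
    return (best_name, best_sim) if best_sim >= c else None

# Toy usage: cosine similarity stands in for the trained sim(., .) head,
# and the candidate vectors are random placeholders.
cos = lambda a, b: torch.nn.functional.cosine_similarity(a, b, dim=0)
candidates = [("Potassium chloride injection (base) 1000mg/10ml", torch.randn(128)),
              ("Sodium Chloride Injection 0.9% 100ml*1 bag", torch.randn(128))]
print(normalize_term(torch.randn(128), candidates, cos, c=0.5))
```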
本发明另一方面公开了一种基于异构图神经网络的医疗术语规范化方法,包括以下步骤:Another aspect of the present invention discloses a medical term standardization method based on a heterogeneous graph neural network, comprising the following steps:
(1) Define key information units for each type of medical term; the information units include first-level information units, second-level information units, and the inclusion relations between the two levels; use a sequence labeling model to identify, at the character level, the information units contained in every medical term, and build an information unit library;
(2) Based on the relationships between medical terms and information units, construct a medical terminology knowledge graph; the nodes of the knowledge graph include medical term nodes and information unit nodes, the edges are directed, and the edges cover two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, with each edge pointing from the containing side to the contained side;
(3) Train the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph; the node content encoding is specified as follows:
For a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
For a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
For a node whose content is text, its content encoding is obtained from a pre-trained language model;
The first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
The second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) Input the medical term node to be normalized into the trained heterogeneous graph neural network, obtain a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and output the medical term normalization result.
The beneficial effects of the present invention are as follows. The present invention defines a unified information-unit structure for different types of medical terms and thereby achieves a relatively unified structured representation, so that medical-domain knowledge can be exploited more effectively during medical term normalization and the associations and differences among the information units of medical terms of the same type and of different types can be fully learned. By integrating all medical terms into one knowledge graph, a single heterogeneous graph neural network performs normalization for the different types of medical terms, which improves the efficiency of the normalization work while enhancing the completeness and consistency of the output results.
FIG. 1 is a structural diagram of the medical term normalization system based on a heterogeneous graph neural network provided by an embodiment of the present invention;
FIG. 2 shows training data for the sequence labeling model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the medical terminology knowledge graph provided by an embodiment of the present invention.
In order to make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In the following description, many specific details are set forth to facilitate a thorough understanding of the present invention. The present invention can, however, also be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the present invention; the present invention is therefore not limited to the specific embodiments disclosed below.
In the present invention, medical term normalization refers to the process of combining medical-domain knowledge with natural language processing methods to analyze the various medical terms produced in real clinical environments, identify medical terms with the same meaning and distinguish medical terms with different meanings, so that the medical terms within a given scope are unified to achieve optimal order and social benefit. Establishing a unified medical terminology standard and term set helps to resolve problems such as duplicated terms, unclear connotations, and inconsistent semantic expression and understanding, and is of great significance for effectively promoting the dissemination, sharing and use of medical information on a broader and deeper level.
Heterogeneous graph neural networks: traditional deep learning methods have achieved great success on linear and matrix-shaped data, but the data in many practical application scenarios are graph-structured. In recent years, researchers have drawn on the ideas of convolutional and recurrent networks to define and design graph neural network models for processing graph data. An ordinary graph neural network targets graphs with a single node and relation type and can obtain good performance using only the adjacent-node information of the graph. Real-world graph data, however, usually contain many node and relation types with large differences; this type of graph is called a heterogeneous graph. When training a heterogeneous graph neural network, because the contents of different node types contain very different features and have different information dimensions, the content encoding information of the nodes must be considered alongside the adjacent-node information of the graph.
An embodiment of the present invention provides a medical term normalization system based on a heterogeneous graph neural network. As shown in FIG. 1, the system includes the following modules:
1. Information unit construction module, including:
(1) Define key information units for each type of medical term; the medical term types include drug terms, disease terms, surgical terms, test terms and inspection terms, and the information units include first-level information units, second-level information units, and the inclusion relations between the two levels;
(2) Use a sequence labeling model to identify, at the character level, the information units contained in every medical term, and build an information unit library;
2. Medical terminology knowledge graph module: based on the relationships between medical terms and information units, construct a medical terminology knowledge graph; the nodes of the knowledge graph include medical term nodes and information unit nodes, the edges are directed, and the edges cover two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, with each edge pointing from the containing side to the contained side;
3. Heterogeneous graph neural network module: train the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph;
The adjacent nodes of a node are all nodes reached by starting from that node and jumping two levels along the direction of the edges of the medical terminology knowledge graph;
The node content encoding is specified as follows:
For a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
For a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
For a node whose content is text, its content encoding is obtained from a pre-trained language model;
The first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
The second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
4. Prediction result output module: input the medical term node to be normalized into the trained heterogeneous graph neural network, obtain a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and output the medical term normalization result.
The implementation of each module is described in detail below:
1. Information unit construction module
(1) Defining the information units of medical terms. Several internationally used sets of standard medical terminology already exist and define key-dimension information units for a specific, single category of medical terms; however, the different types of standard terminology sets do not establish associations among their information units, so the information that could be exploited in past medical term normalization work was confined to a single category of medical terms and a large amount of useful information was ignored. The present invention combines the existing international standard medical terminology sets with expert knowledge from actual clinical practice to uniformly define key information units for the various types of medical terms, together with a detailed structure of first-level and second-level information units. The types of medical terms currently implemented in the present invention include drug terms, disease terms, surgical terms, test terms and inspection terms; if a new type of medical term needs to be normalized later, it can easily be added to the system of the present invention once its information units have been defined. The specific definitions of the information units for the implemented medical terms are shown in Table 3.
Table 3 Information units of medical terms
(2) Building the information unit library. A sequence labeling model predicts, for every character of a medical term, the probability that it belongs to each kind of information unit, thereby identifying all information units contained in the term and producing a structured representation of the medical term. The sequence labeling model used in this embodiment is a BiLSTM-CRF model: the BiLSTM network first captures the context of the medical term, state-probability and transition-probability matrices are then constructed from the BiLSTM outputs at every character position, and a CRF model is built on top of them; this approach achieves good results on sequence labeling tasks. The construction of training data for the sequence labeling model is shown in FIG. 2: the span of every information unit is annotated on the medical terms used as training data, and characters that belong to no information unit are also marked, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the term and avoid introducing excessive noise into the subsequent heterogeneous graph neural network.
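As an illustration only, the following minimal sketch shows how such a character-level tagger could be organized in PyTorch; the vocabulary size, tag set and layer sizes are assumptions, and the CRF layer of the BiLSTM-CRF model described above is omitted (a per-character argmax stands in for CRF decoding), so this is a sketch rather than the patent's implementation.

```python
# Minimal character-level tagger sketch for information-unit extraction.
# Assumptions: toy vocabulary/tag sizes; the CRF layer is omitted for brevity.
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)      # per-character emission scores

    def forward(self, char_ids):                         # (batch, seq_len)
        h, _ = self.lstm(self.emb(char_ids))             # (batch, seq_len, 2*hidden)
        return self.emit(h)                              # (batch, seq_len, num_tags)

# Usage: predict a BIO-style information-unit tag for every character of a term.
model = CharBiLSTMTagger(vocab_size=5000, num_tags=20)
scores = model(torch.randint(0, 5000, (1, 12)))          # one 12-character medical term
tags = scores.argmax(-1)                                 # most likely tag per character
```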
(3) Note that several of the first-level information units in Table 3 contain numerical-value and measurement-unit second-level information units, and the raw values and measurement units found in medical terms are widely and sparsely distributed, which increases the difficulty of training the heterogeneous graph neural network. To address this, the values and measurement units are first given a preliminary normalization: each original measurement unit is normalized into a single basic unit, or into several basic units combined with operator symbols, and the value is converted accordingly. The basic units include ml (millilitre), mg (milligram), mm (millimetre), s (second), mol (amount of substance), u (unit), iu (international unit), count, 型 (type), 级 (grade) and 期 (stage); the operator symbols include multiplication and division. A total of 90 normalized measurement units are produced. For example, if the original measurement unit is l (litre) with a value of 1, the normalized measurement unit is ml (millilitre) and the value is converted to 1000.
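A minimal sketch of this preliminary value/unit normalization is given below; the conversion table lists only a few illustrative factors and does not reproduce the patent's full set of 90 normalized measurement units.

```python
# Sketch of the preliminary value/unit normalization described above.
# Assumptions: only a few illustrative conversion factors are listed here.
BASE_UNIT = {          # original unit -> (base unit, factor applied to the value)
    "l":  ("ml", 1000.0),
    "g":  ("mg", 1000.0),
    "m":  ("mm", 1000.0),
    "ml": ("ml", 1.0),
    "mg": ("mg", 1.0),
}

def normalize_quantity(value: float, unit: str):
    base, factor = BASE_UNIT[unit.lower()]
    return value * factor, base

print(normalize_quantity(1, "l"))   # (1000.0, 'ml') — 1 l becomes 1000 ml
```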
2. Medical terminology knowledge graph module
Based on the information unit library produced by the information unit construction module, a knowledge graph containing multiple types of medical terms is constructed, as shown in FIG. 3. It contains two broad types of nodes: circular nodes represent medical term nodes and rectangular nodes represent information unit nodes, and each broad type is further divided into subtypes; for example, the medical term nodes include "drug term" nodes and "disease term" nodes, while the information unit nodes include "drug dose" nodes and "numerical value" nodes. The edges cover two kinds of relations: 1) the inclusion relation between a medical term and an information unit; 2) the inclusion relation between a first-level information unit and a second-level information unit. Whether a unit is first-level or second-level can differ between types of medical terms: for disease terms, "disease subject" is a first-level information unit, whereas for surgical terms "disease subject" is a second-level information unit contained in the first-level information unit "disease nature".
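The following sketch illustrates the node and edge structure of such a graph using networkx; the node identifiers and attribute names are hypothetical and chosen only to mirror the containment relations described above.

```python
# Sketch of the medical terminology knowledge graph as a directed graph.
# Assumptions: node identifiers and attribute names are illustrative only.
import networkx as nx

G = nx.DiGraph()

# A medical term node and its information-unit nodes.
G.add_node("term:氯化钾针10ml:1g", kind="medical_term", term_type="drug")
G.add_node("unit1:药物剂量", kind="info_unit", level=1)
G.add_node("unit2:数值:10", kind="info_unit", level=2, dtype="numeric")
G.add_node("unit2:计量单位:ml", kind="info_unit", level=2, dtype="measure_unit")

# Directed containment edges, always from the containing side to the contained side.
G.add_edge("term:氯化钾针10ml:1g", "unit1:药物剂量", relation="term_contains_unit")
G.add_edge("unit1:药物剂量", "unit2:数值:10", relation="unit_contains_unit")
G.add_edge("unit1:药物剂量", "unit2:计量单位:ml", relation="unit_contains_unit")

# Two-hop neighborhood along edge directions = candidate adjacent nodes.
two_hop = set(nx.single_source_shortest_path_length(
    G, "term:氯化钾针10ml:1g", cutoff=2)) - {"term:氯化钾针10ml:1g"}
print(two_hop)
```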
3. Heterogeneous graph neural network module
(1) A heterogeneous graph is a graph with relatively complex node and relation types; the medical terminology knowledge graph in FIG. 3 is such a graph. An ordinary graph neural network, aimed at graphs with a single node and relation type, can obtain good performance using only the adjacent-node information of the graph. When training a heterogeneous graph neural network, however, the contents of different node types contain very different features and have different information dimensions, so both the adjacent-node distribution information of the graph and the content encoding information of the nodes must be considered. For the node content encodings, the present invention designs a suitable computation method for each node type.
(2) Computing the content encodings of the different node types. Let V denote the set of all nodes in the medical terminology knowledge graph of FIG. 3; for v_i ∈ V, let value(v_i) denote its node content and e(v_i) its content encoding. The content encodings of the different node types are computed as follows:
For a node v_i whose content is numerical, its content encoding is:
e(v_i) = value(v_i) · e_I
where value(v_i) is the numerical value of node v_i itself, and e_I denotes a unit vector that is randomly initialized and learned by training the heterogeneous graph neural network;
For a node v_i whose content is a unit of measurement, the node content is a sequence of basic units and operator symbols. Let value(v_i) = (q_1, q_2, ..., q_l, ..., q_L), where q_l is a basic unit or operator symbol and L is the sequence length of v_i; the content encoding is then:
where M_0 is a parameter matrix obtained by training the heterogeneous graph neural network, e(q_l) is the semantic vector of each basic unit or operator symbol, randomly initialized and learned during training, and the operator shown is the vector concatenation operator;
For a node v_i whose content is text, a pre-trained language model is used to compute the semantic vector of v_i as the initial content encoding of v_i, and the content encoding is then further trained by the subsequent heterogeneous graph neural network. The pre-trained language model used in this embodiment is the BERT model, computed as:
where Z_{k+1} is the hidden state of layer k+1 of the BERT model and the intermediate quantity above is the input value of layer k+1; M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training, d is the dimension of Z_{k+1}, and Z_k is the hidden state of layer k of the BERT model. If the BERT model has m layers in total, the initial content encoding of node v_i is e(v_i) = Z_m; this embodiment uses m = 12.
(3) In the heterogeneous graph neural network, the vector representation of each node is computed from the content encodings of the node itself and its adjacent nodes in the medical terminology knowledge graph. For a node v_i ∈ V in the medical terminology knowledge graph, N_1(v_i) denotes the set of nodes directly pointed to by arrows starting from v_i; if v_i is a medical term node, then N_1(v_i) is the set of first-level information units of v_i, and the set reached in two hops is the set of second-level information units of v_i. The adjacent node set N_1(v_i) of v_i is defined as:
The vector representation F(v_i) of v_i is then computed as:
where the weight parameter represents the importance of node v to node v_i, in which v may be v_i itself or an adjacent node of v_i; it is computed as follows:
where M_6 and M_7 are matrix parameters obtained from training and f(·) is a nonlinear activation function; this embodiment uses f(x) = max(0, x) + 0.2·min(0, x). Because the relative importance between nodes is asymmetric, these weights are likewise asymmetric.
(4) Training of the heterogeneous graph neural network. The training process has two stages: 1) the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node; 2) the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning.
In the first stage of the training process, the set of trainable parameters is denoted θ, and the training objective is to optimize the following objective function:
where the term above denotes the probability of predicting its adjacent node v from node v_i.
In the second stage of the training process, the similarity between any two medical term nodes is computed as:
where v_i and v_j are medical term nodes in the medical terminology knowledge graph, sim(v_i, v_j) is the similarity between v_i and v_j, and W and b are parameters obtained from training. In the medical term normalization training data, let V_i^+ be the set of medical term nodes with the same meaning as medical term node v_i and V_i^- be the set of medical term nodes with a different meaning from v_i; the label y_i(v) of a training sample is then:
The goal of the second stage is to minimize the following loss function L:
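A minimal sketch of the second training stage follows; taking sim(v_i, v_j) to be a sigmoid over a linear layer applied to the concatenated vector representations, and using a binary cross-entropy style loss over V_i^+ and V_i^-, are assumptions consistent with, but not dictated by, the description above.

```python
# Sketch of stage 2: pairwise similarity of medical term nodes and its loss.
# Assumptions: sim = sigmoid(W[F(v_i); F(v_j)] + b) and a BCE-style loss.
import torch
import torch.nn as nn

DIM = 64
W = nn.Linear(2 * DIM, 1)                                  # parameters W and b

def sim(F_vi, F_vj):
    return torch.sigmoid(W(torch.cat([F_vi, F_vj]))).squeeze()

def stage2_loss(F_vi, positives, negatives):
    """positives/negatives: (P, DIM) / (N, DIM) vectors of nodes in V_i+ / V_i-."""
    loss = torch.tensor(0.0)
    for F_vj in positives:                                 # label y_i(v) = 1
        loss = loss - torch.log(sim(F_vi, F_vj) + 1e-8)
    for F_vj in negatives:                                 # label y_i(v) = 0
        loss = loss - torch.log(1 - sim(F_vi, F_vj) + 1e-8)
    return loss

print(stage2_loss(torch.randn(DIM), torch.randn(3, DIM), torch.randn(3, DIM)))
```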
4. Prediction result output module
For a medical term node v* to be normalized, the similarities between v* and the other medical term nodes in the medical terminology knowledge graph are computed and ranked by the trained heterogeneous graph neural network, and the medical term node with the greatest similarity to v* is selected.
A threshold c is set on the similarity: if the greatest similarity exceeds c, v* is considered to have the same meaning as the selected node, i.e., the normalization result of v* is obtained; otherwise v* is considered to differ in meaning from every other medical term node in the knowledge graph and to carry an independent meaning. This embodiment uses c = 0.9.
For example, when normalizing the drug term "氯化钾针(大冢生产)10%10毫升*1支" (potassium chloride injection, made by Otsuka, 10% 10 ml × 1), its similarities to other drug term nodes are computed as shown in Table 4; the node with the same meaning is the most similar one, "氯化钾针10ml∶1g大冢制药有限公司" (potassium chloride injection 10 ml:1 g, Otsuka Pharmaceutical Co., Ltd.).
Table 4 Medical term node similarities computed by the heterogeneous graph neural network

| Drug term node | Similarity |
| --- | --- |
| 氯化钾针10ml∶1g大冢制药有限公司 (potassium chloride injection 10 ml:1 g, Otsuka Pharmaceutical Co., Ltd.) | 0.96021 |
| 氯化钾注射液(基)1000mg/10ml (potassium chloride injection (base) 1000 mg/10 ml) | 0.90966 |
| (10ml)氯化钾口服溶液10%*1支 (10 ml potassium chloride oral solution 10% × 1) | 0.80715 |
| 氯化钠注射液0.9%100毫升*1袋 (sodium chloride injection 0.9% 100 ml × 1 bag) | 0.61092 |
An embodiment of the present invention further provides a medical term normalization method based on a heterogeneous graph neural network, the method comprising:
(1) Define key information units for each type of medical term; the information units include first-level information units, second-level information units, and the inclusion relations between the two levels; use a sequence labeling model to identify, at the character level, the information units contained in every medical term, and build an information unit library. This step is implemented as in the information unit construction module.
(2) Based on the relationships between medical terms and information units, construct a medical terminology knowledge graph; the nodes of the knowledge graph include medical term nodes and information unit nodes, the edges are directed, and the edges cover two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, with each edge pointing from the containing side to the contained side.
(3) Train the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph; the node content encoding is specified as follows:
For a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
For a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
For a node whose content is text, its content encoding is obtained from a pre-trained language model;
The first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
The second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
This step is implemented as in the heterogeneous graph neural network module.
(4) Input the medical term node to be normalized into the trained heterogeneous graph neural network, obtain a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and output the medical term normalization result. This step is implemented as in the prediction result output module.
The present invention defines and identifies the information units contained in multiple kinds of medical terms, realizing a structured representation of medical terms. This structured representation not only improves the effect of medical term normalization but also greatly benefits every aspect of medical informatization. Based on the information units of medical terms, the present invention builds a new type of knowledge graph for medical terms that can effectively support medical informatization work, including medical term normalization. For the normalization task itself, the present invention constructs a new heterogeneous graph neural network in which a single model normalizes different types of medical terms, implements a suitable content encoding method for each type of information unit, and adopts a staged training scheme for the heterogeneous graph neural network.
The above are only preferred embodiments of the present invention; although the present invention has been disclosed above by means of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution of the present invention, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.
Claims (10)
- A medical term normalization system based on a heterogeneous graph neural network, characterized in that the system comprises:
(1) an information unit construction module: defining key information units for each type of medical term, the information units including first-level information units, second-level information units, and the inclusion relations between the two levels; using a sequence labeling model to identify, at the character level, the information units contained in every medical term, and building an information unit library;
(2) a medical terminology knowledge graph module: constructing a medical terminology knowledge graph based on the relationships between medical terms and information units, the nodes of the knowledge graph including medical term nodes and information unit nodes, the edges being directed and covering two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, each edge pointing from the containing side to the contained side;
(3) a heterogeneous graph neural network module: training the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph, the adjacent nodes of a node being all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph, the node content encoding being specified as follows:
for a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
for a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
for a node whose content is text, its content encoding is obtained from a pre-trained language model;
a first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
a second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) a prediction result output module: inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and outputting the medical term normalization result.
- The system according to claim 1, characterized in that the types of medical terms include drug terms, disease terms, surgical terms, test terms and inspection terms.
- The system according to claim 1, characterized in that, in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model; the span of every information unit is annotated on the medical terms used as training data, and characters belonging to no information unit are also marked, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the medical term.
- The system according to claim 1, characterized in that, in the information unit construction module, the values and measurement units are given a preliminary normalization: each original measurement unit is normalized into a single basic unit, or into several basic units combined with operator symbols, and the value is converted accordingly.
- The system according to claim 1, characterized in that, in the heterogeneous graph neural network module, V denotes the set of all nodes in the medical terminology knowledge graph; for v_i ∈ V, value(v_i) denotes its node content and e(v_i) its content encoding; for a node v_i whose content is numerical, its content encoding is:
e(v_i) = value(v_i) · e_I
where value(v_i) is the numerical value of node v_i itself and e_I denotes a unit vector that is randomly initialized and learned by training the heterogeneous graph neural network;
for a node v_i whose content is a unit of measurement, the node content is a sequence of basic units and operator symbols; let value(v_i) = (q_1, q_2, ..., q_l, ..., q_L), where q_l is a basic unit or operator symbol and L is the sequence length of v_i; the content encoding is then computed as above, where M_0 is a parameter matrix obtained by training the heterogeneous graph neural network, e(q_l) is the semantic vector of each basic unit or operator symbol, randomly initialized and learned during training, and the operator shown is the vector concatenation operator;
for a node v_i whose content is text, a pre-trained language model is used to compute the semantic vector of v_i as the initial content encoding of v_i, and the content encoding is then further trained by the subsequent heterogeneous graph neural network.
- The system according to claim 5, characterized in that, for a node v_i whose content is text, the pre-trained language model is a BERT model, in which Z_{k+1} is the hidden state of layer k+1 of the BERT model and the corresponding intermediate quantity is the input value of layer k+1; M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training, d is the dimension of Z_{k+1}, and Z_k is the hidden state of layer k of the BERT model; if the BERT model has m layers in total, the initial content encoding of node v_i is e(v_i) = Z_m.
- The system according to claim 1, characterized in that, in the heterogeneous graph neural network module, the vector representation of each node is computed from the content encodings of the node itself and its adjacent nodes in the medical terminology knowledge graph; for a node v_i ∈ V in the medical terminology knowledge graph, N_1(v_i) denotes the set of nodes directly pointed to by arrows starting from v_i; if v_i is a medical term node, then N_1(v_i) is the set of first-level information units of v_i, and the set reached in two hops is the set of second-level information units of v_i; the adjacent node set N_1(v_i) of v_i is defined accordingly, and the vector representation F(v_i) of v_i is computed from it.
- The system according to claim 1, characterized in that, in the heterogeneous graph neural network module, in the first stage of training the set of trainable parameters is denoted θ and the training objective is to optimize the following objective function:
where the term above denotes the probability of predicting the adjacent node v from node v_i;
in the second stage of training, the similarity between any two medical term nodes is computed as:
where v_i and v_j are medical term nodes in the medical terminology knowledge graph, sim(v_i, v_j) is the similarity between v_i and v_j, and W and b are parameters obtained from training;
in the medical term normalization training data, let V_i^+ be the set of medical term nodes with the same meaning as medical term node v_i and V_i^- be the set of medical term nodes with a different meaning from v_i; the label y_i(v) of a training sample is then:
the goal of the second stage is to minimize the following loss function L:
- The system according to claim 1, characterized in that, in the prediction result output module, for a medical term node v* to be normalized, the similarities between v* and the other medical term nodes in the medical terminology knowledge graph are computed and ranked by the trained heterogeneous graph neural network, and the medical term node with the greatest similarity to v* is selected; a threshold c is set on the similarity: if the greatest similarity exceeds c, v* is considered to have the same meaning as the selected node, i.e., the normalization result of v* is obtained; otherwise v* is considered to differ in meaning from every other medical term node in the medical terminology knowledge graph and to carry an independent meaning.
- A medical term normalization method based on a heterogeneous graph neural network, characterized in that it comprises the following steps:
(1) defining key information units for each type of medical term, the information units including first-level information units, second-level information units, and the inclusion relations between the two levels; using a sequence labeling model to identify, at the character level, the information units contained in every medical term, and building an information unit library;
(2) constructing a medical terminology knowledge graph based on the relationships between medical terms and information units, the nodes of the knowledge graph including medical term nodes and information unit nodes, the edges being directed and covering two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, each edge pointing from the containing side to the contained side;
(3) training the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph, the adjacent nodes of a node being all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph, the node content encoding being specified as follows:
for a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
for a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
for a node whose content is text, its content encoding is obtained from a pre-trained language model;
a first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
a second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and outputting the medical term normalization result.
Also Published As
Publication number | Publication date |
---|---|
CN113656604A (en) | 2021-11-16 |
JP2024500400A (en) | 2024-01-09 |
CN113656604B (en) | 2022-02-22 |
JP7432802B2 (en) | 2024-02-16 |
Similar Documents
Publication | Title |
---|---|
WO2023065858A1 (en) | Medical term standardization system and method based on heterogeneous graph neural network |
CN113871003B (en) | Disease auxiliary differential diagnosis system based on causal medical knowledge graph | |
CN110032648B (en) | Medical record structured analysis method based on medical field entity | |
US20220277858A1 (en) | Medical Prediction Method and System Based on Semantic Graph Network | |
CN106934235B (en) | Patient's similarity measurement migratory system between a kind of disease areas based on transfer learning | |
CN110189831B (en) | Medical record knowledge graph construction method and system based on dynamic graph sequence | |
CN106682397A (en) | Knowledge-based electronic medical record quality control method | |
Tashkandi et al. | Efficient in-database patient similarity analysis for personalized medical decision support systems | |
CN111222340A (en) | Breast electronic medical record entity recognition system based on multi-standard active learning | |
CN113918694B (en) | Question analysis method for medical knowledge graph questions and answers | |
CN116680377B (en) | Chinese medical term self-adaptive alignment method based on log feedback | |
TW202101477A (en) | Method for applying a label made after sampling to neural network training model | |
CN113707339A (en) | Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases | |
CN112749277A (en) | Medical data processing method and device and storage medium | |
Lu et al. | Chinese clinical named entity recognition with word-level information incorporating dictionaries | |
Leng et al. | Bi-level artificial intelligence model for risk classification of acute respiratory diseases based on Chinese clinical data | |
Sudharson et al. | Enhancing the Efficiency of Lung Disease Prediction using CatBoost and Expectation Maximization Algorithms | |
Fang et al. | Multi-modal sarcasm detection based on Multi-Channel Enhanced Fusion model | |
CN113360643A (en) | Electronic medical record data quality evaluation method based on short text classification | |
Wang et al. | Research on named entity recognition of doctor-patient question answering community based on bilstm-crf model | |
Ning et al. | Research on a vehicle-mounted intelligent TCM syndrome differentiation system based on deep belief network | |
CN116630062A (en) | Medical insurance fraud detection method, system and storage medium | |
Lu et al. | Towards semi-structured automatic ICD coding via tree-based contrastive learning | |
Zhang et al. | Conco-ernie: Complex user intent detect model for smart healthcare cognitive bot | |
Liu et al. | Disease Topic Modeling of Users' Inquiry Texts: A Text Mining-Based PQDR-LDA Model for Analyzing the Online Medical Records |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22882473; Country of ref document: EP; Kind code of ref document: A1 |
WWE | WIPO information: entry into national phase | Ref document number: 2023536585; Country of ref document: JP |
NENP | Non-entry into the national phase | Ref country code: DE |