WO2023065858A1 - Medical term standardization system and method based on heterogeneous graph neural network - Google Patents
Medical term standardization system and method based on heterogeneous graph neural network
Info
- Publication number
- WO2023065858A1 (PCT/CN2022/116967; CN2022116967W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- medical
- nodes
- training
- content
- Prior art date
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 79
- 238000000034 method Methods 0.000 title claims abstract description 37
- 238000012549 training Methods 0.000 claims abstract description 84
- 238000009826 distribution Methods 0.000 claims abstract description 17
- 239000013598 vector Substances 0.000 claims description 50
- 230000008569 process Effects 0.000 claims description 21
- 229940079593 drug Drugs 0.000 claims description 20
- 239000003814 drug Substances 0.000 claims description 20
- 238000005259 measurement Methods 0.000 claims description 20
- 238000004364 calculation method Methods 0.000 claims description 18
- 238000002372 labelling Methods 0.000 claims description 12
- 201000010099 disease Diseases 0.000 claims description 10
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 claims description 10
- 230000006870 function Effects 0.000 claims description 9
- 238000010606 normalization Methods 0.000 claims description 9
- 238000006243 chemical reaction Methods 0.000 claims description 7
- 238000007689 inspection Methods 0.000 claims description 6
- 239000011159 matrix material Substances 0.000 claims description 6
- 230000000694 effects Effects 0.000 claims description 4
- 238000012360 testing method Methods 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000012546 transfer Methods 0.000 claims description 2
- WCUXLLCKKVVCTQ-UHFFFAOYSA-M Potassium chloride Chemical compound [Cl-].[K+] WCUXLLCKKVVCTQ-UHFFFAOYSA-M 0.000 description 8
- 239000000243 solution Substances 0.000 description 5
- 238000013461 design Methods 0.000 description 4
- 230000014509 gene expression Effects 0.000 description 4
- 239000001103 potassium chloride Substances 0.000 description 4
- 235000011164 potassium chloride Nutrition 0.000 description 4
- 238000011425 standardization method Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000012986 modification Methods 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- WQZGKKKJIJFFOK-GASJEMHNSA-N Glucose Chemical compound OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O WQZGKKKJIJFFOK-GASJEMHNSA-N 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 229940093181 glucose injection Drugs 0.000 description 2
- 230000001788 irregular Effects 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- GSDSWSVVBLHKDQ-JTQLQIEISA-N Levofloxacin Chemical compound C([C@@H](N1C2=C(C(C(C(O)=O)=C1)=O)C=C1F)C)OC2=C1N1CCN(C)CC1 GSDSWSVVBLHKDQ-JTQLQIEISA-N 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 239000002552 dosage form Substances 0.000 description 1
- 239000004615 ingredient Substances 0.000 description 1
- 229940090044 injection Drugs 0.000 description 1
- 239000007924 injection Substances 0.000 description 1
- 238000002347 injection Methods 0.000 description 1
- 230000009191 jumping Effects 0.000 description 1
- 229960003376 levofloxacin Drugs 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 229940073414 potassium chloride oral solution Drugs 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 239000008354 sodium chloride injection Substances 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Definitions
- the invention belongs to the technical field of standardization of Chinese medical terms and a multi-center medical information platform, and in particular relates to a medical term standardization system and method based on a heterogeneous graph neural network.
- Medical terminology mainly includes terms such as drugs, medical examinations, and diseases generated during clinical operations. Different types of medical terms will contain information of specific key dimensions, which we define as information units of medical terms.
- the drug term "5% glucose injection (base) 500 ml" contains information elements as shown in Table 1:
- Some information units are composed of other finer-grained information units, which are respectively defined as first-level information units and second-level information units.
- the traditional standardization method of medical terms is to understand the meaning of each medical term through machine learning or manual verification for a single category of medical terms, and mark the medical terms with the same semantics. This method regards each medical term as a whole, ignoring the inherent information unit structure within the medical term.
- the main disadvantages are: (1) knowledge about the associations and differences between information units cannot be used effectively; the associations and differences between information units of different dimensions of the same medical term carry rich medical domain knowledge, but existing practice does not explicitly structure or exploit this knowledge; (2) different types of medical terms contain identical or related information units, yet past standardization work has built independent systems for each single category of medical terms, which both multiplies the workload and prevents comprehensive use of the knowledge across term types.
- the purpose of the present invention is to address the shortcomings of current medical term standardization methods and, based on the characteristics of medical terms themselves, to propose a medical term standardization system and method based on a heterogeneous graph neural network.
- the present invention constructs a new information-unit-based knowledge graph covering all medical terms and, on this knowledge graph, standardizes medical terms through an improved heterogeneous graph neural network, effectively exploiting the knowledge in the information units of medical terms and obtaining more accurate standardization results.
- in order to make full use of the medical domain knowledge contained in medical terms themselves during standardization, the present invention first constructs key information units for each type of medical term, realizing a structured representation of the terms, and builds a knowledge graph containing all types of medical terms based on these information units. On this knowledge graph a heterogeneous graph neural network covering all types of medical terms is constructed; during its training, the adjacent-node distribution and the node content encodings are considered jointly for medical term standardization.
- the present invention can make full use of the knowledge about associations and differences between the information units of related medical terms, accommodates all types of medical terms in one system, learns medical domain knowledge comprehensively, and allows new types of medical terms to be added conveniently, reducing the workload of standardizing new term types.
- in the process of extracting information units from medical terms, redundant characters and information are discarded to avoid introducing excessive noise and errors.
- One aspect of the present invention discloses a medical term standardization system based on a heterogeneous graph neural network, including:
- Information unit building module: defines key information units for each type of medical term; the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms and builds an information unit library.
- Medical terminology knowledge graph module: constructs a medical terminology knowledge graph based on the relationships between medical terms and information units.
- the nodes of the knowledge graph include medical term nodes and information unit nodes.
- the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side.
- Heterogeneous graph neural network module: trains the heterogeneous graph neural network based on the adjacent-node distribution and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the knowledge graph; the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;
- for a node whose content is text, its content encoding is obtained through a pretrained language model.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed;
- the training goal is to maximize the similarity between medical term nodes with the same meaning.
- Prediction result output module: feeds the medical term node to be standardized into the trained heterogeneous graph neural network, obtains a ranking of its similarity to the other medical term nodes in the medical terminology knowledge graph, and outputs the medical term standardization result.
- the sequence labeling model is a BiLSTM-CRF model; on the medical terms used as training data, the span of each information unit is annotated and characters belonging to no information unit are also marked, so that the sequence labeling model can discard extraneous characters that have no effect on the overall meaning of the medical term.
- values and measurement units are preliminarily standardized: the original measurement unit is normalized into a single basic unit, or into several basic units combined through operation symbols, and the value is converted accordingly.
- let V denote the set of all nodes in the medical terminology knowledge graph; for v_i ∈ V, value(v_i) denotes its node content and e(v_i) its content encoding. For a node v_i whose content is numerical, the content encoding is e(v_i) = value(v_i) · e_I, where:
- value(v_i) is the numerical value of node v_i itself;
- e_I is a unit vector, randomly initialized and learned during training of the heterogeneous graph neural network.
- for a node whose content is a measurement unit, the semantic vectors of the basic units and operation symbols it contains are concatenated with the vector concatenation operator and passed through a nonlinear transformation, where:
- M_0 is the parameter matrix of this transformation, learned during training of the heterogeneous graph neural network;
- e(q_l) is the semantic vector of each basic unit or operation symbol, randomly initialized and learned during training of the heterogeneous graph neural network.
- for a node whose content is text, the pretrained language model is a BERT model; in the corresponding calculation:
- Z_{k+1} is the hidden state of layer k+1 of the BERT model;
- M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training;
- d is the dimension of Z_{k+1};
- Z_k is the hidden state of layer k of the BERT model.
- the vector representation of each node is calculated from the content encoding of the node itself and of its adjacent nodes in the medical terminology knowledge graph; for a node v_i ∈ V, let N_1(v_i) denote the set of nodes directly pointed to by edges starting from v_i; if v_i is a medical term node, N_1(v_i) is the set of its first-level information units, and the nodes reached by one further hop form the set of its second-level information units; the adjacent node set of v_i is defined as the union of these two sets.
- M_6 and M_7 are matrix parameters obtained from training, and f(·) is a nonlinear activation function.
- denoting the set of trainable parameters by θ, the goal of the first training stage is to optimize an objective function that maximizes, over all nodes, the conditional probability of each node's adjacent nodes given that node.
- in the second stage, v_i and v_j are medical term nodes in the medical terminology knowledge graph, sim(v_i, v_j) is the similarity between v_i and v_j, and W and b are parameters obtained from training;
- the goal of the second stage is to minimize a loss function L defined over labeled pairs of medical term nodes.
- for a medical term node v* to be standardized, the similarity between v* and the other medical term nodes in the medical terminology knowledge graph is computed with the trained heterogeneous graph neural network and sorted, and the medical term node with the greatest similarity to v* is taken.
- Another aspect of the present invention discloses a medical term standardization method based on a heterogeneous graph neural network, comprising the following steps:
- define key information units for each type of medical term; the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; use a sequence labeling model to identify, at the character level, the information units contained in all medical terms and build an information unit library;
- the nodes of the knowledge graph include medical terminology nodes and information unit nodes.
- the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side;
- the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the medical terminology knowledge graph;
- the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;
- for a node whose content is text, its content encoding is obtained through a pretrained language model.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed;
- the training goal is to maximize the similarity between medical term nodes with the same meaning.
- the present invention defines a unified information unit structure for different types of medical terms and realizes a relatively unified structured representation, so that medical domain knowledge can be better exploited during standardization and the associations and differences between the information units of terms of the same kind and of different kinds can be fully learned.
- by integrating all medical terms into the knowledge graph, a single heterogeneous graph neural network standardizes different types of medical terms, improving the completeness and consistency of the output while improving the efficiency of medical term standardization.
- FIG. 1 is a structural diagram of a medical term standardization system based on a heterogeneous graph neural network provided by an embodiment of the present invention
- Fig. 2 illustrates the training data of the sequence labeling model provided by an embodiment of the present invention.
- Fig. 3 is a schematic diagram of the medical terminology knowledge graph provided by an embodiment of the present invention.
- standardization of medical terms refers to the process of combining medical domain knowledge with natural language processing methods to analyze the medical terms generated in real clinical environments, identifying terms with the same meaning and distinguishing terms with different meanings, so that medical terminology within a given scope is harmonized to achieve optimal order and social benefit.
- establishing a unified medical terminology standard and term set helps to solve problems such as term duplication, unclear connotation, and inconsistent semantic expression and understanding, and is of great significance for promoting the dissemination, sharing and use of medical information on a wider and deeper level.
- Heterogeneous graph neural network refers to: Traditional deep learning methods have achieved great success on linear and matrix-shaped data, but the data in many practical application scenarios is graph-structured. In recent years, researchers have used the ideas of convolutional networks and recurrent networks to define and design graph neural network models for processing graph data. Ordinary graph neural networks can achieve good performance by only using the adjacent node information of graphs for graphs with a single node and relationship type. However, graph data in the real world usually has many types of nodes and relationships with large differences. This type of graph is called a heterogeneous graph.
- An embodiment of the present invention provides a medical terminology standardization system based on a heterogeneous graph neural network, as shown in Figure 1, the system includes the following modules:
- Information unit building module, including:
- the medical term types include drug terms, disease terms, surgical terms, test terms and inspection terms;
- the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels.
- Medical terminology knowledge graph module: constructs a medical terminology knowledge graph based on the relationships between medical terms and information units.
- the nodes of the knowledge graph include medical term nodes and information unit nodes.
- the edges are directed and cover two types of relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side.
- Heterogeneous graph neural network module: trains the heterogeneous graph neural network based on the adjacent-node distribution and node content encodings of the medical terminology knowledge graph.
- the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the medical terminology knowledge graph.
- the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed;
- the training goal is to maximize the similarity between medical term nodes with the same meaning.
- Prediction result output module: feeds the medical term node to be standardized into the trained heterogeneous graph neural network, obtains a ranking of its similarity to the other medical term nodes in the medical terminology knowledge graph, and outputs the medical term standardization result.
- Defining the information units of medical terms: at present there are several international medical terminology standard sets that define information units of key dimensions for one specific category of medical terms. However, the standard sets for different term types do not establish associations between their information units, so past standardization work could only use information within a single term category and ignored much useful information.
- the present invention combines existing international medical terminology standard sets with expert knowledge from actual clinical practice, uniformly defines the key information units for the various types of medical terms, and specifies a detailed structure of first-level and second-level information units.
- the types of medical terms realized in the present invention include drug terms, disease terms, surgical terms, test terms and inspection terms. If a new type of medical term needs to be standardized later, the system can be conveniently extended by defining information units for the new term type.
- the specific definitions of the information units of the implemented medical term types are shown in Table 3.
- a sequence labeling model is used to predict, for each character of a medical term, the probability of belonging to each information unit, so as to identify all the information units contained in the term and realize its structured representation.
- the sequence labeling model used in this embodiment is a BiLSTM-CRF model: a BiLSTM network first captures the contextual information of the medical term, then state and transition probability matrices are constructed from the BiLSTM outputs at each character position to build a CRF model; this architecture has achieved good results on sequence labeling tasks.
- the process of constructing training data for the sequence labeling model is shown in Figure 2.
- on the medical terms used as training data, the span of each information unit is annotated and the characters belonging to no information unit are also marked, so that the sequence labeling model can discard redundant characters that have no effect on the overall meaning of the term, avoiding the introduction of excess noise into the subsequent heterogeneous graph neural network.
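A minimal PyTorch sketch of the character-level tagging component described above: a BiLSTM produces per-character emission scores over an information-unit tag set; the CRF layer that the embodiment stacks on top of these scores is only indicated in a comment. The tag names, vocabulary size and dimensions are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """Character-level BiLSTM emitting per-character tag scores.

    In the described embodiment a CRF layer (e.g. an implementation such as
    the pytorch-crf package) would be stacked on these emission scores to
    model tag transitions; only the BiLSTM emission part is sketched here.
    """

    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)
        self.emit = nn.Linear(hidden_dim, num_tags)

    def forward(self, char_ids):                 # (batch, seq_len)
        h, _ = self.bilstm(self.embed(char_ids))
        return self.emit(h)                      # (batch, seq_len, num_tags)

# Illustrative BIO-style tag set for drug-term information units ("O" marks
# characters that belong to no information unit and can be discarded).
TAGS = ["O", "B-ingredient", "I-ingredient", "B-dosage_form", "I-dosage_form",
        "B-dose", "I-dose", "B-spec_value", "I-spec_value",
        "B-spec_unit", "I-spec_unit"]

model = CharBiLSTMTagger(vocab_size=5000, num_tags=len(TAGS))
scores = model(torch.randint(1, 5000, (2, 20)))  # dummy batch of 2 terms
print(scores.shape)                              # torch.Size([2, 20, 11])
```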
- the operation symbols include multiplication and division.
- a total of 90 normalized units of measure are generated. For example: the original measurement unit is l (liter), the corresponding value is 1, the standardized measurement unit is ml (milliliter), and the corresponding value is converted to 1000 accordingly.
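A small sketch of this value and measurement-unit normalization step, assuming a hand-written conversion table; the embodiment reportedly works with about 90 normalized units, of which only a few invented entries are shown here.

```python
from fractions import Fraction

# Illustrative conversion table: original unit -> (factor, normalized base unit).
UNIT_TABLE = {
    "l":     (Fraction(1000), "ml"),
    "ml":    (Fraction(1),    "ml"),
    "g":     (Fraction(1000), "mg"),
    "mg":    (Fraction(1),    "mg"),
    "mg/ml": (Fraction(1),    "mg/ml"),  # compound unit built with a division symbol
}

def normalize_measure(value, unit):
    """Rewrite a (value, unit) pair in terms of the chosen base unit."""
    factor, base = UNIT_TABLE[unit.lower()]
    return float(Fraction(str(value)) * factor), base

print(normalize_measure(1, "l"))     # (1000.0, 'ml')  -- the example from the text
print(normalize_measure(0.5, "g"))   # (500.0, 'mg')
```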
- a knowledge graph containing all types of medical terms is constructed, as shown in Figure 3. It contains two kinds of nodes: circular nodes represent medical term nodes and rectangular nodes represent information unit nodes, and each broad node type contains nodes of finer subtypes; for example, medical term nodes include "drug term" nodes, "disease term" nodes, etc., and information unit nodes include "drug dose" nodes, "value" nodes, etc.
- edges include two kinds of relations: 1) the inclusion relationship between a medical term and an information unit; 2) the inclusion relationship between a first-level information unit and a second-level information unit.
- the division into first-level and second-level information units may differ across term types; for example, for disease terms "disease subject" is a first-level information unit, whereas for surgical terms "disease subject" is a second-level information unit contained in the first-level information unit "nature of disease".
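To make the graph structure concrete, the following sketch builds a tiny fragment of such a knowledge graph for the drug term of Table 1 using networkx; all node names and attribute keys are illustrative assumptions. It also lists the nodes reachable within two hops along the edge direction, i.e. the adjacent nodes used later by the heterogeneous graph neural network.

```python
import networkx as nx

# Directed graph: edges run from the containing side to the contained side.
kg = nx.DiGraph()

# One medical term node (circular in Fig. 3) and its information unit nodes
# (rectangular in Fig. 3); node names and attributes are illustrative.
term = "term:5% glucose injection (base) 500 ml"
kg.add_node(term, kind="drug_term")
kg.add_node("iu:ingredient=glucose",    kind="info_unit", level=1)
kg.add_node("iu:dosage_form=injection", kind="info_unit", level=1)
kg.add_node("iu:spec=500 ml",           kind="info_unit", level=1)
kg.add_node("iu:value=500",             kind="info_unit", level=2)
kg.add_node("iu:unit=ml",               kind="info_unit", level=2)

# Relation type 1: medical term -> information unit it contains.
for iu in ["iu:ingredient=glucose", "iu:dosage_form=injection", "iu:spec=500 ml"]:
    kg.add_edge(term, iu, relation="term_contains")

# Relation type 2: first-level information unit -> second-level information unit.
kg.add_edge("iu:spec=500 ml", "iu:value=500", relation="unit_contains")
kg.add_edge("iu:spec=500 ml", "iu:unit=ml",   relation="unit_contains")

# "Adjacent nodes" of the term: everything reachable within two hops
# along the edge direction.
adjacent = set(nx.single_source_shortest_path_length(kg, term, cutoff=2)) - {term}
print(sorted(adjacent))
```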
- a heterogeneous graph refers to a graph with complex nodes and relationship types.
- the medical terminology knowledge graph shown in Figure 3 is a heterogeneous graph.
- for graphs with relatively uniform node and relationship types, ordinary graph neural networks can achieve good performance relying only on adjacent-node information.
- the present invention therefore designs appropriate calculation methods for the different node types.
- for a node v_i whose content is numerical, the content encoding is e(v_i) = value(v_i) · e_I, where value(v_i) is the numerical value of node v_i itself and e_I is a unit vector, randomly initialized and learned during training of the heterogeneous graph neural network.
- for a node whose content is a measurement unit, the semantic vectors of the basic units and operation symbols it contains are concatenated with the vector concatenation operator and passed through a nonlinear transformation, where M_0 is the parameter matrix of this transformation, learned during training of the heterogeneous graph neural network, and e(q_l) is the semantic vector of each basic unit or operation symbol, randomly initialized and learned during training of the heterogeneous graph neural network.
- the pretrained language model used in this embodiment is a BERT model; in the corresponding calculation:
- Z_{k+1} is the hidden state of layer k+1 of the BERT model;
- M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training;
- d is the dimension of Z_{k+1};
- Z_k is the hidden state of layer k of the BERT model.
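The exact encoding formulas are not reproduced in this excerpt, so the sketch below only mirrors the prose: a numeric node is encoded as its value times a learned vector e_I, a measurement-unit node by concatenating learned symbol vectors and applying a transformation M_0 with a nonlinearity, and a text node by projecting a pretrained-language-model vector. The dimensions, the fixed-length padding and the projection for text nodes are assumptions.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Content encodings for the three node-content types described above.
    Dimensions, the fixed-length padding and the LeakyReLU nonlinearity are
    illustrative assumptions; only the overall structure follows the text."""

    def __init__(self, dim=128, num_symbols=100, max_unit_len=4, text_dim=768):
        super().__init__()
        self.e_I = nn.Parameter(torch.randn(dim))         # learned vector e_I
        self.symbol_emb = nn.Embedding(num_symbols, dim)  # base units and * / symbols
        self.M0 = nn.Linear(max_unit_len * dim, dim)      # applied after concatenation
        self.text_proj = nn.Linear(text_dim, dim)         # assumed dimension-matching layer
        self.act = nn.LeakyReLU(0.2)                      # f(x) = max(0,x) + 0.2*min(0,x)
        self.max_unit_len = max_unit_len

    def encode_numeric(self, value: float) -> torch.Tensor:
        # e(v_i) = value(v_i) * e_I
        return value * self.e_I

    def encode_unit(self, symbol_ids: torch.Tensor) -> torch.Tensor:
        # Concatenate the semantic vectors of the node's base units / operation
        # symbols (padded to a fixed length), then apply M_0 and the nonlinearity.
        emb = self.symbol_emb(symbol_ids)                               # (L, dim)
        pad = torch.zeros(self.max_unit_len - emb.size(0), emb.size(1))
        return self.act(self.M0(torch.cat([emb, pad]).flatten()))

    def encode_text(self, lm_vector: torch.Tensor) -> torch.Tensor:
        # lm_vector stands in for the pretrained-language-model vector (Z_m in
        # the text); the projection is added only to match dimensions.
        return self.act(self.text_proj(lm_vector))

enc = ContentEncoder()
print(enc.encode_numeric(500.0).shape)              # torch.Size([128])
print(enc.encode_unit(torch.tensor([3, 7])).shape)  # e.g. ids for "mg" and "/"
print(enc.encode_text(torch.randn(768)).shape)      # torch.Size([128])
```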
- the vector representation of each node is calculated from the content encoding of the node itself and of its adjacent nodes in the medical terminology knowledge graph.
- let N_1(v_i) denote the set of nodes directly pointed to by edges starting from v_i; if v_i is a medical term node, N_1(v_i) is the set of its first-level information units, and the nodes reached by one further hop form the set of its second-level information units.
- the adjacent node set of v_i is the union of these two sets, and its vector representation F(v_i) is computed from the content encodings over this set,
- where M_6 and M_7 are matrix parameters obtained from training and f(·) is a nonlinear activation function.
- in this embodiment f(x) = max(0, x) + 0.2 · min(0, x) is used. Since the relative importance of nodes is asymmetric, the corresponding weighting is also asymmetric.
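The precise formula for F(v_i) is likewise not reproduced in this excerpt; the sketch below shows one plausible form consistent with the description, passing the node's own content encoding through M_6 and the averaged encodings of its adjacent nodes through M_7 before the activation f. This concrete form is an assumption, not the patent's exact equation.

```python
import torch
import torch.nn as nn

class NodeRepresentation(nn.Module):
    """One plausible reading of F(v_i): own content encoding through M6,
    mean of adjacent-node encodings through M7, summed and passed through
    the leaky-ReLU-style activation f."""

    def __init__(self, dim=128):
        super().__init__()
        self.M6 = nn.Linear(dim, dim, bias=False)
        self.M7 = nn.Linear(dim, dim, bias=False)
        self.f = nn.LeakyReLU(0.2)   # f(x) = max(0, x) + 0.2 * min(0, x)

    def forward(self, e_self: torch.Tensor, e_neighbors: torch.Tensor):
        # e_self: (dim,), e_neighbors: (num_adjacent, dim)
        return self.f(self.M6(e_self) + self.M7(e_neighbors.mean(dim=0)))

rep = NodeRepresentation()
F_vi = rep(torch.randn(128), torch.randn(5, 128))
print(F_vi.shape)    # torch.Size([128])
```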
- the training process is divided into two stages: 1) the adjacent-node distribution and the node content encodings are used as input, the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, and a vector representation is obtained for each node; 2) the node vector representations are used as input to compute the similarity of any two medical term nodes, and the training goal is to maximize the similarity of medical term nodes with the same meaning.
- denoting the set of trainable parameters by θ, the goal of the first stage is to optimize an objective function that maximizes, over all nodes, the probability of predicting each node's adjacent nodes from that node.
- in the second stage, for medical term nodes v_i and v_j in the medical terminology knowledge graph, the similarity sim(v_i, v_j) is computed from their vector representations,
- where W and b are parameters obtained from training.
- the goal of the second stage is to minimize a loss function L defined over the labeled pairs of same-meaning and different-meaning medical term nodes.
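The objective and loss functions themselves are not reproduced in this excerpt, so the sketch below substitutes standard choices: a skip-gram-style objective with negative sampling for the first stage, and a sigmoid similarity head with parameters W and b trained with binary cross-entropy over labeled term pairs for the second stage. These concrete forms are assumptions consistent with, but not dictated by, the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128

# Stage 1 (assumed skip-gram-with-negative-sampling form): pull each node's
# vector towards the vectors of its adjacent nodes, push it away from
# randomly sampled nodes.
def stage1_loss(node_vec, adj_vecs, neg_vecs):
    pos = F.logsigmoid(adj_vecs @ node_vec).sum()
    neg = F.logsigmoid(-(neg_vecs @ node_vec)).sum()
    return -(pos + neg)          # minimizing this maximizes neighbour probability

# Stage 2: similarity of two medical-term nodes with trained parameters W and b
# (concatenation + linear layer + sigmoid is an assumed concrete form).
class PairSimilarity(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(2 * dim, 1)     # weights W and bias b
    def forward(self, fi, fj):
        return torch.sigmoid(self.W(torch.cat([fi, fj], dim=-1))).squeeze(-1)

l1 = stage1_loss(torch.randn(dim), torch.randn(4, dim), torch.randn(4, dim))

sim = PairSimilarity(dim)
fi, fj = torch.randn(dim), torch.randn(dim)
label = torch.tensor(1.0)                  # y = 1: same meaning, y = 0: different
l2 = F.binary_cross_entropy(sim(fi, fj), label)
print(float(l1), float(l2))
```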
- Example similarity ranking (candidate medical term | similarity):
- Potassium chloride needle 10ml 1g Otsuka Pharmaceutical Co., Ltd. | 0.96021
- Potassium chloride injection (base) 1000mg/10ml | 0.90966
- (10ml) Potassium Chloride Oral Solution 10%*1 stick | 0.80715
- Sodium Chloride Injection 0.9% 100ml*1 bag | 0.61092
- the embodiment of the present invention also provides a medical term standardization method based on a heterogeneous graph neural network, the method comprising:
- the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms and constructs the information unit library; the implementation of this step follows the information unit building module.
- the nodes of the knowledge graph include medical terminology nodes and information unit nodes.
- the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side.
- the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the medical terminology knowledge graph.
- the node content encoding is defined as follows:
- for a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;
- for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;
- for a node whose content is text, its content encoding is obtained through a pretrained language model.
- in the first stage of training, the adjacent-node distribution and the node content encodings are used as input;
- the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.
- in the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed; the training goal is to maximize the similarity between medical term nodes with the same meaning.
- the implementation of this step refers to the heterogeneous graph neural network module.
- the invention defines information units for the various types of medical terms and identifies the information units contained in each term, realizing a structured representation of medical terms.
- this structured representation not only improves the standardization of medical terms but can also greatly promote other aspects of medical informatization work;
- the invention builds a new type of knowledge graph for medical terminology based on these information units, which can effectively support various medical informatization tasks including terminology standardization;
- the present invention constructs a new type of heterogeneous graph neural network for medical term standardization, realizing the standardization of different types of medical terms with a unified model.
- appropriate content encoding methods are realized for the different kinds of node content, and a staged training method is designed for the heterogeneous graph neural network.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Public Health (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Animal Behavior & Ethology (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
Provided are a medical term standardization system and method based on a heterogeneous graph neural network. The method comprises: establishing key information units for the various types of medical terms so as to achieve a structured representation of the terms and, on the basis of these information units, building a knowledge graph covering the various types of medical terms; on the basis of the knowledge graph, building a heterogeneous graph neural network covering the various types of medical terms and, while training it, jointly considering the adjacent-node distribution and node content encodings of the graph so as to perform medical term standardization. Knowledge about the associations and differences between the information units of related medical terms can be fully utilized, all types of medical terms are accommodated at the same time, knowledge of the medical field can be learned comprehensively, and new types of medical terms can be conveniently added to the system, reducing the workload of standardizing new term types.
Description
The invention belongs to the technical field of Chinese medical term standardization and multi-center medical information platforms, and in particular relates to a medical term standardization system and method based on a heterogeneous graph neural network.
An important research direction in medical informatization is to apply higher-performance machine learning and artificial intelligence techniques to practical clinical problems. One advantage of artificial intelligence is its ability to discover complex patterns and features in massive data; comprehensively analyzing and mining the medical data of multiple medical institutions and building models on it, so as to support medical research and clinical decision-making, has therefore become an inevitable trend in medical informatization. However, because different medical institutions adopt numerous information standards and frequently produce semi-structured and unstructured data, integrating and using medical data from different sources is extremely difficult. Medical terms are the basic elements of medical data; a complete medical term standardization system can align medical data from different sources to a unified standard and structure, providing larger-scale and higher-quality data for clinical decision-making and medical research. Medical terminology mainly includes terms for drugs, medical examinations, diseases and the like generated in clinical practice. Different types of medical terms contain information along specific key dimensions, which we define as the information units of medical terms. For example, the drug term "5% glucose injection (base) 500 ml" contains the information units shown in Table 1:
Table 1. Example of drug term information units
Information unit name | Drug ingredient | Dosage form | Drug dose | Drug specification
Information unit value | Glucose | Injection | 5% | 500 ml
The examination term "左手指正侧位_X" ("left finger anteroposterior and lateral views, X-ray") contains the information units shown in Table 2:

Table 2. Example of examination term information units
Information unit name | Body part | Body part side | Examination view | Examination method
Information unit value | Finger | Left side | Anteroposterior + lateral | X-ray photography
Some information units are composed of other, finer-grained information units; these are defined as first-level information units and second-level information units, respectively. For example, the drug term in Table 1 contains the first-level information units "drug ingredient", "dosage form", "drug dose" and "drug specification", where the "drug specification" information unit is composed of the second-level information units "value" (500) and "measurement unit" (ml). Given the set of information units of a medical term, the complete medical term is determined.
In actual clinical practice, differences between the information standards adopted by medical institutions and differences in the personal habits of medical staff produce a large number of non-standard medical terms, typically showing redundant or missing key information units, irregular expressions, and inconsistent quantity units. For example, the following drug terms have exactly the same meaning but differ greatly in form: "Levofloxacin tablets (Cravit) 500 mg" and "Cravit 0.5 g/tablet". The goal of medical term standardization is to identify medical terms with identical meaning but different literal forms so that their expression can be unified, while distinguishing terms with different meanings, ultimately promoting the standardization of medical data as a whole.
The traditional approach to medical term standardization addresses a single category of medical terms, using machine learning or manual verification to understand the meaning of each term and to label terms with the same semantics. This approach treats each medical term as a whole and ignores the information unit structure inherent within it. Its main disadvantages are: (1) knowledge about the associations and differences between information units cannot be used effectively; the associations and differences between information units of different dimensions of the same medical term carry rich medical domain knowledge, but existing practice does not explicitly structure or exploit this knowledge; (2) different types of medical terms contain identical or related information units, yet past standardization work has built independent systems for each single category of medical terms, which both multiplies the workload and prevents comprehensive use of the knowledge in the information units of different term types; (3) redundant information is taken into account; owing to irregular expression, most medical terms contain, besides the key information units, redundant characters that are almost unrelated to the overall meaning of the term and, as noise, bias its meaning.
Summary of the Invention

The purpose of the present invention is to address the shortcomings of current medical term standardization methods and, based on the characteristics of medical terms themselves, to propose a medical term standardization system and method based on a heterogeneous graph neural network. The invention constructs a new information-unit-based knowledge graph covering all medical terms and, on this knowledge graph, standardizes medical terms through an improved heterogeneous graph neural network, effectively exploiting the knowledge in the information units of medical terms and obtaining more accurate standardization results.

The purpose of the present invention is achieved through the following technical solution. In order to make full use of the medical domain knowledge contained in medical terms themselves during standardization, the invention first constructs key information units for each type of medical term, realizing a structured representation of the terms, and builds a knowledge graph containing all types of medical terms based on these information units. On this knowledge graph a heterogeneous graph neural network covering all types of medical terms is constructed; during its training, the adjacent-node distribution and the node content encodings of the graph are considered jointly for medical term standardization. In this way the invention can make full use of the knowledge about associations and differences between the information units of related medical terms, accommodates all types of medical terms in one system, learns medical domain knowledge comprehensively, and allows new types of medical terms to be added conveniently, reducing the workload of standardizing new term types. Redundant characters and information are discarded when extracting information units from medical terms, avoiding the introduction of excessive noise and error.
One aspect of the present invention discloses a medical term standardization system based on a heterogeneous graph neural network, including:

(1) Information unit building module: defines key information units for each type of medical term; the information units include first-level information units and second-level information units, together with the inclusion relationship between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms and builds an information unit library;

(2) Medical terminology knowledge graph module: constructs a medical terminology knowledge graph based on the relationships between medical terms and information units; the nodes of the knowledge graph include medical term nodes and information unit nodes; the edges are directed and cover two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a first-level information unit and a second-level information unit; each edge points from the containing side to the contained side;
(3) Heterogeneous graph neural network module: trains the heterogeneous graph neural network based on the adjacent-node distribution and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached from it within two hops along the edge direction of the knowledge graph; the node content encoding is defined as follows:

For a node whose content is numerical, its content encoding equals the product of the node's numerical value and a unit vector obtained by training of the heterogeneous graph neural network;

for a node whose content is a measurement unit, its content encoding is computed by obtaining, through training of the heterogeneous graph neural network, a semantic vector for each basic unit and operation symbol, concatenating the semantic vectors of all basic units and operation symbols contained in the node, and applying a nonlinear transformation;

for a node whose content is text, its content encoding is obtained through a pretrained language model.

In the first stage of training, the adjacent-node distribution and the node content encodings are used as input; the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation for each node.

In the second stage of training, the node vector representations are used as input and the similarity between any two medical term nodes is computed; the training goal is to maximize the similarity between medical term nodes with the same meaning.
(4) Prediction result output module: feeds the medical term node to be standardized into the trained heterogeneous graph neural network, obtains a ranking of its similarity to the other medical term nodes in the medical terminology knowledge graph, and outputs the medical term standardization result.

Further, the types of medical terms include drug terms, disease terms, surgical terms, test terms and inspection terms.

Further, in the information unit building module, the sequence labeling model is a BiLSTM-CRF model; on the medical terms used as training data the span of each information unit is annotated and characters belonging to no information unit are also marked, so that the sequence labeling model can discard extraneous characters that have no effect on the overall meaning of the medical term.

Further, in the information unit building module, values and measurement units are preliminarily standardized: the original measurement unit is normalized into a single basic unit, or into several basic units combined through operation symbols, and the value is converted accordingly.
Further, in the heterogeneous graph neural network module, let V denote the set of all nodes in the medical terminology knowledge graph; for v_i ∈ V, value(v_i) denotes its node content and e(v_i) its content encoding. For a node v_i whose content is numerical, the content encoding is

e(v_i) = value(v_i) · e_I

where value(v_i) is the numerical value of node v_i itself, and e_I is a unit vector that is randomly initialized and learned during training of the heterogeneous graph neural network.
For a node v_i whose content is a measurement unit, the node content is a sequence composed of basic units and operation symbols. Let value(v_i) = (q_1, q_2, ..., q_l, ..., q_L), where q_l is a basic unit or operation symbol and L is the sequence length of v_i. The content encoding e(v_i) is then obtained by concatenating the semantic vectors e(q_1), ..., e(q_L) with the vector concatenation operator and applying a nonlinear transformation, where M_0 is the parameter matrix of this transformation, learned during training of the heterogeneous graph neural network, and e(q_l) is the semantic vector of each basic unit or operation symbol, randomly initialized and learned during training of the heterogeneous graph neural network.
For a node v_i whose content is text, the semantic vector of v_i computed by a pretrained language model is used as the initial content encoding of v_i, and the encoding is further trained by the subsequent heterogeneous graph neural network.
Further, for a node v_i whose content is text, the pretrained language model is a BERT model. In the corresponding calculation, Z_{k+1} is the hidden state of layer k+1 of the BERT model, computed from the input of layer k+1; M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training, d is the dimension of Z_{k+1}, and Z_k is the hidden state of layer k of the BERT model. If the BERT model has m layers in total, the initial content encoding of node v_i is e(v_i) = Z_m.
Further, in the heterogeneous graph neural network module, the vector representation of each node is calculated from the content encoding of the node itself and of its adjacent nodes in the medical terminology knowledge graph. For a node v_i ∈ V, let N_1(v_i) denote the set of nodes directly pointed to by edges starting from v_i; if v_i is a medical term node, N_1(v_i) is the set of its first-level information units, and the nodes reached by one further hop form the set of its second-level information units. The adjacent node set of v_i is defined as the union of these two sets. The vector representation F(v_i) of v_i is then computed from the content encodings over this set, where M_6 and M_7 are matrix parameters obtained from training and f(·) is a nonlinear activation function.
Further, in the heterogeneous graph neural network module, in the first stage of training the set of trainable parameters is denoted θ, and the training goal is to optimize an objective function that maximizes the probability of predicting each adjacent node v from node v_i.

In the second stage of training, the similarity sim(v_i, v_j) of any two medical term nodes v_i and v_j in the medical terminology knowledge graph is computed from their vector representations, where W and b are parameters obtained from training.

In the medical term standardization training data, let V_i^+ be the set of medical term nodes with the same meaning as the medical term node v_i, and V_i^- be the set of medical term nodes with a different meaning from v_i; the label y_i(v) of a training sample is defined accordingly. The goal of the second stage is to minimize a loss function L defined over these labeled samples.
Further, in the prediction result output module, for a medical term node v* to be standardized, the similarity between v* and the other medical term nodes in the medical terminology knowledge graph is computed with the trained heterogeneous graph neural network and sorted, and the medical term node with the greatest similarity to v* is taken. A threshold c is set on the similarity: if the greatest similarity exceeds c, v* is considered to have the same meaning as that node, which gives the standardization result for v*; otherwise v* is considered to differ in meaning from all other medical term nodes in the knowledge graph and to have an independent meaning.
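A minimal sketch of this prediction step: rank all medical term nodes by similarity to the term being standardized, take the top candidate, and accept it only if its similarity reaches the threshold c. The value of c, the cosine stand-in for the trained similarity head, and the candidate entries are illustrative assumptions.

```python
import torch

def normalize_term(query_vec, candidates, sim_fn, c=0.9):
    """Rank candidate standard-term vectors by similarity to the query term
    vector; return the top candidate only if its similarity reaches the
    threshold c, otherwise report that the query has an independent meaning."""
    scored = sorted(((name, float(sim_fn(query_vec, vec))) for name, vec in candidates),
                    key=lambda pair: pair[1], reverse=True)
    best_name, best_sim = scored[0]
    return (best_name, best_sim) if best_sim >= c else None

# Toy usage: cosine similarity stands in for the trained sim(., .) head,
# and the candidate vectors are random placeholders.
cos = lambda a, b: torch.nn.functional.cosine_similarity(a, b, dim=0)
candidates = [("Potassium chloride injection (base) 1000mg/10ml", torch.randn(128)),
              ("Sodium Chloride Injection 0.9% 100ml*1 bag", torch.randn(128))]
print(normalize_term(torch.randn(128), candidates, cos, c=0.5))
```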
本发明另一方面公开了一种基于异构图神经网络的医疗术语规范化方法,包括以下步骤:Another aspect of the present invention discloses a medical term standardization method based on a heterogeneous graph neural network, comprising the following steps:
(1) Define key information units for each type of medical term; the information units include first-level information units, second-level information units, and the inclusion relations between the two levels; use a sequence labeling model to identify, at the character level, the information units contained in every medical term, and build an information unit library;
(2) Based on the relationships between medical terms and information units, construct a medical terminology knowledge graph; the nodes of the knowledge graph include medical term nodes and information unit nodes, the edges are directed, and the edges cover two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, with each edge pointing from the containing side to the contained side;
(3) Train the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph; the node content encoding is specified as follows:
For a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
For a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
For a node whose content is text, its content encoding is obtained from a pre-trained language model;
The first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
The second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) Input the medical term node to be normalized into the trained heterogeneous graph neural network, obtain a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and output the medical term normalization result.
The beneficial effects of the present invention are as follows. The present invention defines a unified information-unit structure for different types of medical terms and thereby achieves a relatively unified structured representation, so that medical-domain knowledge can be exploited more effectively during medical term normalization and the associations and differences among the information units of medical terms of the same type and of different types can be fully learned. By integrating all medical terms into one knowledge graph, a single heterogeneous graph neural network performs normalization for the different types of medical terms, which improves the efficiency of the normalization work while enhancing the completeness and consistency of the output results.
FIG. 1 is a structural diagram of the medical term normalization system based on a heterogeneous graph neural network provided by an embodiment of the present invention;
FIG. 2 shows training data for the sequence labeling model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the medical terminology knowledge graph provided by an embodiment of the present invention.
In order to make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In the following description, many specific details are set forth to facilitate a thorough understanding of the present invention. The present invention can, however, also be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the present invention; the present invention is therefore not limited to the specific embodiments disclosed below.
In the present invention, medical term normalization refers to the process of combining medical-domain knowledge with natural language processing methods to analyze the various medical terms produced in real clinical environments, identify medical terms with the same meaning and distinguish medical terms with different meanings, so that the medical terms within a given scope are unified to achieve optimal order and social benefit. Establishing a unified medical terminology standard and term set helps to resolve problems such as duplicated terms, unclear connotations, and inconsistent semantic expression and understanding, and is of great significance for effectively promoting the dissemination, sharing and use of medical information on a broader and deeper level.
Heterogeneous graph neural networks: traditional deep learning methods have achieved great success on linear and matrix-shaped data, but the data in many practical application scenarios are graph-structured. In recent years, researchers have drawn on the ideas of convolutional and recurrent networks to define and design graph neural network models for processing graph data. An ordinary graph neural network targets graphs with a single node and relation type and can obtain good performance using only the adjacent-node information of the graph. Real-world graph data, however, usually contain many node and relation types with large differences; this type of graph is called a heterogeneous graph. When training a heterogeneous graph neural network, because the contents of different node types contain very different features and have different information dimensions, the content encoding information of the nodes must be considered alongside the adjacent-node information of the graph.
An embodiment of the present invention provides a medical term normalization system based on a heterogeneous graph neural network. As shown in FIG. 1, the system includes the following modules:
1. Information unit construction module, including:
(1) Define key information units for each type of medical term; the medical term types include drug terms, disease terms, surgical terms, test terms and inspection terms, and the information units include first-level information units, second-level information units, and the inclusion relations between the two levels;
(2) Use a sequence labeling model to identify, at the character level, the information units contained in every medical term, and build an information unit library;
2. Medical terminology knowledge graph module: based on the relationships between medical terms and information units, construct a medical terminology knowledge graph; the nodes of the knowledge graph include medical term nodes and information unit nodes, the edges are directed, and the edges cover two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, with each edge pointing from the containing side to the contained side;
3. Heterogeneous graph neural network module: train the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph;
The adjacent nodes of a node are all nodes reached by starting from that node and jumping two levels along the direction of the edges of the medical terminology knowledge graph;
The node content encoding is specified as follows:
For a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
For a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
For a node whose content is text, its content encoding is obtained from a pre-trained language model;
The first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
The second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
4. Prediction result output module: input the medical term node to be normalized into the trained heterogeneous graph neural network, obtain a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and output the medical term normalization result.
The implementation of each module is described in detail below:
1. Information unit construction module
(1) Defining the information units of medical terms. Several internationally used sets of standard medical terminology already exist and define key-dimension information units for a specific, single category of medical terms; however, the different types of standard terminology sets do not establish associations among their information units, so the information that could be exploited in past medical term normalization work was confined to a single category of medical terms and a large amount of useful information was ignored. The present invention combines the existing international standard medical terminology sets with expert knowledge from actual clinical practice to uniformly define key information units for the various types of medical terms, together with a detailed structure of first-level and second-level information units. The types of medical terms currently implemented in the present invention include drug terms, disease terms, surgical terms, test terms and inspection terms; if a new type of medical term needs to be normalized later, it can easily be added to the system of the present invention once its information units have been defined. The specific definitions of the information units for the implemented medical terms are shown in Table 3.
Table 3 Information units of medical terms
(2) Building the information unit library. A sequence labeling model predicts, for every character of a medical term, the probability that it belongs to each kind of information unit, thereby identifying all information units contained in the term and producing a structured representation of the medical term. The sequence labeling model used in this embodiment is a BiLSTM-CRF model: the BiLSTM network first captures the context of the medical term, state-probability and transition-probability matrices are then constructed from the BiLSTM outputs at every character position, and a CRF model is built on top of them; this approach achieves good results on sequence labeling tasks. The construction of training data for the sequence labeling model is shown in FIG. 2: the span of every information unit is annotated on the medical terms used as training data, and characters that belong to no information unit are also marked, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the term and avoid introducing excessive noise into the subsequent heterogeneous graph neural network.
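As an illustration only, the following minimal sketch shows how such a character-level tagger could be organized in PyTorch; the vocabulary size, tag set and layer sizes are assumptions, and the CRF layer of the BiLSTM-CRF model described above is omitted (a per-character argmax stands in for CRF decoding), so this is a sketch rather than the patent's implementation.

```python
# Minimal character-level tagger sketch for information-unit extraction.
# Assumptions: toy vocabulary/tag sizes; the CRF layer is omitted for brevity.
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)      # per-character emission scores

    def forward(self, char_ids):                         # (batch, seq_len)
        h, _ = self.lstm(self.emb(char_ids))             # (batch, seq_len, 2*hidden)
        return self.emit(h)                              # (batch, seq_len, num_tags)

# Usage: predict a BIO-style information-unit tag for every character of a term.
model = CharBiLSTMTagger(vocab_size=5000, num_tags=20)
scores = model(torch.randint(0, 5000, (1, 12)))          # one 12-character medical term
tags = scores.argmax(-1)                                 # most likely tag per character
```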
(3) Note that several of the first-level information units in Table 3 contain numerical-value and measurement-unit second-level information units, and the raw values and measurement units found in medical terms are widely and sparsely distributed, which increases the difficulty of training the heterogeneous graph neural network. To address this, the values and measurement units are first given a preliminary normalization: each original measurement unit is normalized into a single basic unit, or into several basic units combined with operator symbols, and the value is converted accordingly. The basic units include ml (millilitre), mg (milligram), mm (millimetre), s (second), mol (amount of substance), u (unit), iu (international unit), count, 型 (type), 级 (grade) and 期 (stage); the operator symbols include multiplication and division. A total of 90 normalized measurement units are produced. For example, if the original measurement unit is l (litre) with a value of 1, the normalized measurement unit is ml (millilitre) and the value is converted to 1000.
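A minimal sketch of this preliminary value/unit normalization is given below; the conversion table lists only a few illustrative factors and does not reproduce the patent's full set of 90 normalized measurement units.

```python
# Sketch of the preliminary value/unit normalization described above.
# Assumptions: only a few illustrative conversion factors are listed here.
BASE_UNIT = {          # original unit -> (base unit, factor applied to the value)
    "l":  ("ml", 1000.0),
    "g":  ("mg", 1000.0),
    "m":  ("mm", 1000.0),
    "ml": ("ml", 1.0),
    "mg": ("mg", 1.0),
}

def normalize_quantity(value: float, unit: str):
    base, factor = BASE_UNIT[unit.lower()]
    return value * factor, base

print(normalize_quantity(1, "l"))   # (1000.0, 'ml') — 1 l becomes 1000 ml
```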
2. Medical terminology knowledge graph module
Based on the information unit library produced by the information unit construction module, a knowledge graph containing multiple types of medical terms is constructed, as shown in FIG. 3. It contains two broad types of nodes: circular nodes represent medical term nodes and rectangular nodes represent information unit nodes, and each broad type is further divided into subtypes; for example, the medical term nodes include "drug term" nodes and "disease term" nodes, while the information unit nodes include "drug dose" nodes and "numerical value" nodes. The edges cover two kinds of relations: 1) the inclusion relation between a medical term and an information unit; 2) the inclusion relation between a first-level information unit and a second-level information unit. Whether a unit is first-level or second-level can differ between types of medical terms: for disease terms, "disease subject" is a first-level information unit, whereas for surgical terms "disease subject" is a second-level information unit contained in the first-level information unit "disease nature".
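The following sketch illustrates the node and edge structure of such a graph using networkx; the node identifiers and attribute names are hypothetical and chosen only to mirror the containment relations described above.

```python
# Sketch of the medical terminology knowledge graph as a directed graph.
# Assumptions: node identifiers and attribute names are illustrative only.
import networkx as nx

G = nx.DiGraph()

# A medical term node and its information-unit nodes.
G.add_node("term:氯化钾针10ml:1g", kind="medical_term", term_type="drug")
G.add_node("unit1:药物剂量", kind="info_unit", level=1)
G.add_node("unit2:数值:10", kind="info_unit", level=2, dtype="numeric")
G.add_node("unit2:计量单位:ml", kind="info_unit", level=2, dtype="measure_unit")

# Directed containment edges, always from the containing side to the contained side.
G.add_edge("term:氯化钾针10ml:1g", "unit1:药物剂量", relation="term_contains_unit")
G.add_edge("unit1:药物剂量", "unit2:数值:10", relation="unit_contains_unit")
G.add_edge("unit1:药物剂量", "unit2:计量单位:ml", relation="unit_contains_unit")

# Two-hop neighborhood along edge directions = candidate adjacent nodes.
two_hop = set(nx.single_source_shortest_path_length(
    G, "term:氯化钾针10ml:1g", cutoff=2)) - {"term:氯化钾针10ml:1g"}
print(two_hop)
```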
3. Heterogeneous graph neural network module
(1) A heterogeneous graph is a graph with relatively complex node and relation types; the medical terminology knowledge graph in FIG. 3 is such a graph. An ordinary graph neural network, aimed at graphs with a single node and relation type, can obtain good performance using only the adjacent-node information of the graph. When training a heterogeneous graph neural network, however, the contents of different node types contain very different features and have different information dimensions, so both the adjacent-node distribution information of the graph and the content encoding information of the nodes must be considered. For the node content encodings, the present invention designs a suitable computation method for each node type.
(2) Computing the content encodings of the different node types. Let V denote the set of all nodes in the medical terminology knowledge graph of FIG. 3; for v_i ∈ V, let value(v_i) denote its node content and e(v_i) its content encoding. The content encodings of the different node types are computed as follows:
For a node v_i whose content is numerical, its content encoding is:
e(v_i) = value(v_i) · e_I
where value(v_i) is the numerical value of node v_i itself, and e_I denotes a unit vector that is randomly initialized and learned by training the heterogeneous graph neural network;
For a node v_i whose content is a unit of measurement, the node content is a sequence of basic units and operator symbols. Let value(v_i) = (q_1, q_2, ..., q_l, ..., q_L), where q_l is a basic unit or operator symbol and L is the sequence length of v_i; the content encoding is then:
where M_0 is a parameter matrix obtained by training the heterogeneous graph neural network, e(q_l) is the semantic vector of each basic unit or operator symbol, randomly initialized and learned during training, and the operator shown is the vector concatenation operator;
For a node v_i whose content is text, a pre-trained language model is used to compute the semantic vector of v_i as the initial content encoding of v_i, and the content encoding is then further trained by the subsequent heterogeneous graph neural network. The pre-trained language model used in this embodiment is the BERT model, computed as:
where Z_{k+1} is the hidden state of layer k+1 of the BERT model and the intermediate quantity above is the input value of layer k+1; M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training, d is the dimension of Z_{k+1}, and Z_k is the hidden state of layer k of the BERT model. If the BERT model has m layers in total, the initial content encoding of node v_i is e(v_i) = Z_m; this embodiment uses m = 12.
(3) In the heterogeneous graph neural network, the vector representation of each node is computed from the content encodings of the node itself and its adjacent nodes in the medical terminology knowledge graph. For a node v_i ∈ V in the medical terminology knowledge graph, N_1(v_i) denotes the set of nodes directly pointed to by arrows starting from v_i; if v_i is a medical term node, then N_1(v_i) is the set of first-level information units of v_i, and the set reached in two hops is the set of second-level information units of v_i. The adjacent node set N_1(v_i) of v_i is defined as:
The vector representation F(v_i) of v_i is then computed as:
where the weight parameter represents the importance of node v to node v_i, in which v may be v_i itself or an adjacent node of v_i; it is computed as follows:
where M_6 and M_7 are matrix parameters obtained from training and f(·) is a nonlinear activation function; this embodiment uses f(x) = max(0, x) + 0.2·min(0, x). Because the relative importance between nodes is asymmetric, these weights are likewise asymmetric.
(4) Training of the heterogeneous graph neural network. The training process has two stages: 1) the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node; 2) the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning.
In the first stage of the training process, the set of trainable parameters is denoted θ, and the training objective is to optimize the following objective function:
where the term above denotes the probability of predicting its adjacent node v from node v_i.
In the second stage of the training process, the similarity between any two medical term nodes is computed as:
where v_i and v_j are medical term nodes in the medical terminology knowledge graph, sim(v_i, v_j) is the similarity between v_i and v_j, and W and b are parameters obtained from training. In the medical term normalization training data, let V_i^+ be the set of medical term nodes with the same meaning as medical term node v_i and V_i^- be the set of medical term nodes with a different meaning from v_i; the label y_i(v) of a training sample is then:
The goal of the second stage is to minimize the following loss function L:
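A minimal sketch of the second training stage follows; taking sim(v_i, v_j) to be a sigmoid over a linear layer applied to the concatenated vector representations, and using a binary cross-entropy style loss over V_i^+ and V_i^-, are assumptions consistent with, but not dictated by, the description above.

```python
# Sketch of stage 2: pairwise similarity of medical term nodes and its loss.
# Assumptions: sim = sigmoid(W[F(v_i); F(v_j)] + b) and a BCE-style loss.
import torch
import torch.nn as nn

DIM = 64
W = nn.Linear(2 * DIM, 1)                                  # parameters W and b

def sim(F_vi, F_vj):
    return torch.sigmoid(W(torch.cat([F_vi, F_vj]))).squeeze()

def stage2_loss(F_vi, positives, negatives):
    """positives/negatives: (P, DIM) / (N, DIM) vectors of nodes in V_i+ / V_i-."""
    loss = torch.tensor(0.0)
    for F_vj in positives:                                 # label y_i(v) = 1
        loss = loss - torch.log(sim(F_vi, F_vj) + 1e-8)
    for F_vj in negatives:                                 # label y_i(v) = 0
        loss = loss - torch.log(1 - sim(F_vi, F_vj) + 1e-8)
    return loss

print(stage2_loss(torch.randn(DIM), torch.randn(3, DIM), torch.randn(3, DIM)))
```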
4. Prediction result output module
For a medical term node v* to be normalized, the similarities between v* and the other medical term nodes in the medical terminology knowledge graph are computed and ranked by the trained heterogeneous graph neural network, and the medical term node with the greatest similarity to v* is selected.
A threshold c is set on the similarity: if the greatest similarity exceeds c, v* is considered to have the same meaning as the selected node, i.e., the normalization result of v* is obtained; otherwise v* is considered to differ in meaning from every other medical term node in the knowledge graph and to carry an independent meaning. This embodiment uses c = 0.9.
For example, when normalizing the drug term "氯化钾针(大冢生产)10%10毫升*1支" (potassium chloride injection, made by Otsuka, 10% 10 ml × 1), its similarities to other drug term nodes are computed as shown in Table 4; the node with the same meaning is the most similar one, "氯化钾针10ml∶1g大冢制药有限公司" (potassium chloride injection 10 ml:1 g, Otsuka Pharmaceutical Co., Ltd.).
Table 4 Medical term node similarities computed by the heterogeneous graph neural network

| Drug term node | Similarity |
| --- | --- |
| 氯化钾针10ml∶1g大冢制药有限公司 (potassium chloride injection 10 ml:1 g, Otsuka Pharmaceutical Co., Ltd.) | 0.96021 |
| 氯化钾注射液(基)1000mg/10ml (potassium chloride injection (base) 1000 mg/10 ml) | 0.90966 |
| (10ml)氯化钾口服溶液10%*1支 (10 ml potassium chloride oral solution 10% × 1) | 0.80715 |
| 氯化钠注射液0.9%100毫升*1袋 (sodium chloride injection 0.9% 100 ml × 1 bag) | 0.61092 |
An embodiment of the present invention further provides a medical term normalization method based on a heterogeneous graph neural network, the method comprising:
(1) Define key information units for each type of medical term; the information units include first-level information units, second-level information units, and the inclusion relations between the two levels; use a sequence labeling model to identify, at the character level, the information units contained in every medical term, and build an information unit library. This step is implemented as in the information unit construction module.
(2) Based on the relationships between medical terms and information units, construct a medical terminology knowledge graph; the nodes of the knowledge graph include medical term nodes and information unit nodes, the edges are directed, and the edges cover two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, with each edge pointing from the containing side to the contained side.
(3) Train the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph; the adjacent nodes of a node are all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph; the node content encoding is specified as follows:
For a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
For a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
For a node whose content is text, its content encoding is obtained from a pre-trained language model;
The first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
The second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
This step is implemented as in the heterogeneous graph neural network module.
(4) Input the medical term node to be normalized into the trained heterogeneous graph neural network, obtain a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and output the medical term normalization result. This step is implemented as in the prediction result output module.
The present invention defines and identifies the information units contained in multiple kinds of medical terms, realizing a structured representation of medical terms. This structured representation not only improves the effect of medical term normalization but also greatly benefits every aspect of medical informatization. Based on the information units of medical terms, the present invention builds a new type of knowledge graph for medical terms that can effectively support medical informatization work, including medical term normalization. For the normalization task itself, the present invention constructs a new heterogeneous graph neural network in which a single model normalizes different types of medical terms, implements a suitable content encoding method for each type of information unit, and adopts a staged training scheme for the heterogeneous graph neural network.
The above are only preferred embodiments of the present invention; although the present invention has been disclosed above by means of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution of the present invention, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.
Claims (10)
- A medical term normalization system based on a heterogeneous graph neural network, characterized in that the system comprises:
(1) an information unit construction module: defining key information units for each type of medical term, the information units including first-level information units, second-level information units, and the inclusion relations between the two levels; using a sequence labeling model to identify, at the character level, the information units contained in every medical term, and building an information unit library;
(2) a medical terminology knowledge graph module: constructing a medical terminology knowledge graph based on the relationships between medical terms and information units, the nodes of the knowledge graph including medical term nodes and information unit nodes, the edges being directed and covering two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, each edge pointing from the containing side to the contained side;
(3) a heterogeneous graph neural network module: training the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph, the adjacent nodes of a node being all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph, the node content encoding being specified as follows:
for a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
for a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
for a node whose content is text, its content encoding is obtained from a pre-trained language model;
a first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
a second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) a prediction result output module: inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and outputting the medical term normalization result.
- The system according to claim 1, characterized in that the types of medical terms include drug terms, disease terms, surgical terms, test terms and inspection terms.
- The system according to claim 1, characterized in that, in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model; the span of every information unit is annotated on the medical terms used as training data, and characters belonging to no information unit are also marked, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the medical term.
- The system according to claim 1, characterized in that, in the information unit construction module, the values and measurement units are given a preliminary normalization: each original measurement unit is normalized into a single basic unit, or into several basic units combined with operator symbols, and the value is converted accordingly.
- The system according to claim 1, characterized in that, in the heterogeneous graph neural network module, V denotes the set of all nodes in the medical terminology knowledge graph; for v_i ∈ V, value(v_i) denotes its node content and e(v_i) its content encoding; for a node v_i whose content is numerical, its content encoding is:
e(v_i) = value(v_i) · e_I
where value(v_i) is the numerical value of node v_i itself and e_I denotes a unit vector that is randomly initialized and learned by training the heterogeneous graph neural network;
for a node v_i whose content is a unit of measurement, the node content is a sequence of basic units and operator symbols; let value(v_i) = (q_1, q_2, ..., q_l, ..., q_L), where q_l is a basic unit or operator symbol and L is the sequence length of v_i; the content encoding is then computed as above, where M_0 is a parameter matrix obtained by training the heterogeneous graph neural network, e(q_l) is the semantic vector of each basic unit or operator symbol, randomly initialized and learned during training, and the operator shown is the vector concatenation operator;
for a node v_i whose content is text, a pre-trained language model is used to compute the semantic vector of v_i as the initial content encoding of v_i, and the content encoding is then further trained by the subsequent heterogeneous graph neural network.
- The system according to claim 5, characterized in that, for a node v_i whose content is text, the pre-trained language model is a BERT model, in which Z_{k+1} is the hidden state of layer k+1 of the BERT model and the corresponding intermediate quantity is the input value of layer k+1; M_1, M_2, M_3, M_4, M_5, b_1 and b_2 are parameters obtained from training, d is the dimension of Z_{k+1}, and Z_k is the hidden state of layer k of the BERT model; if the BERT model has m layers in total, the initial content encoding of node v_i is e(v_i) = Z_m.
- The system according to claim 1, characterized in that, in the heterogeneous graph neural network module, the vector representation of each node is computed from the content encodings of the node itself and its adjacent nodes in the medical terminology knowledge graph; for a node v_i ∈ V in the medical terminology knowledge graph, N_1(v_i) denotes the set of nodes directly pointed to by arrows starting from v_i; if v_i is a medical term node, then N_1(v_i) is the set of first-level information units of v_i, and the set reached in two hops is the set of second-level information units of v_i; the adjacent node set N_1(v_i) of v_i is defined accordingly, and the vector representation F(v_i) of v_i is computed from it.
- The system according to claim 1, characterized in that, in the heterogeneous graph neural network module, in the first stage of training the set of trainable parameters is denoted θ and the training objective is to optimize the following objective function:
where the term above denotes the probability of predicting the adjacent node v from node v_i;
in the second stage of training, the similarity between any two medical term nodes is computed as:
where v_i and v_j are medical term nodes in the medical terminology knowledge graph, sim(v_i, v_j) is the similarity between v_i and v_j, and W and b are parameters obtained from training;
in the medical term normalization training data, let V_i^+ be the set of medical term nodes with the same meaning as medical term node v_i and V_i^- be the set of medical term nodes with a different meaning from v_i; the label y_i(v) of a training sample is then:
the goal of the second stage is to minimize the following loss function L:
- The system according to claim 1, characterized in that, in the prediction result output module, for a medical term node v* to be normalized, the similarities between v* and the other medical term nodes in the medical terminology knowledge graph are computed and ranked by the trained heterogeneous graph neural network, and the medical term node with the greatest similarity to v* is selected; a threshold c is set on the similarity: if the greatest similarity exceeds c, v* is considered to have the same meaning as the selected node, i.e., the normalization result of v* is obtained; otherwise v* is considered to differ in meaning from every other medical term node in the medical terminology knowledge graph and to carry an independent meaning.
- A medical term normalization method based on a heterogeneous graph neural network, characterized in that it comprises the following steps:
(1) defining key information units for each type of medical term, the information units including first-level information units, second-level information units, and the inclusion relations between the two levels; using a sequence labeling model to identify, at the character level, the information units contained in every medical term, and building an information unit library;
(2) constructing a medical terminology knowledge graph based on the relationships between medical terms and information units, the nodes of the knowledge graph including medical term nodes and information unit nodes, the edges being directed and covering two kinds of relations: the inclusion relation between a medical term and an information unit, and the inclusion relation between a first-level information unit and a second-level information unit, each edge pointing from the containing side to the contained side;
(3) training the heterogeneous graph neural network based on the adjacent-node distributions and node content encodings of the medical terminology knowledge graph, the adjacent nodes of a node being all nodes reached by starting from that node and jumping two levels along the direction of the edges of the knowledge graph, the node content encoding being specified as follows:
for a node whose content is numerical, its content encoding equals the product of the node's own value and a unit vector obtained by training the heterogeneous graph neural network;
for a node whose content is a unit of measurement, the content encoding is computed as follows: the semantic vector of each basic unit and operator symbol is obtained by training the heterogeneous graph neural network, the semantic vectors of all basic units and operator symbols contained in the node are concatenated, and a nonlinear transformation is applied to obtain the content encoding;
for a node whose content is text, its content encoding is obtained from a pre-trained language model;
a first stage of training: the adjacent-node distributions and node content encodings are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, yielding a vector representation of each node;
a second stage of training: the vector representations of the nodes are taken as input to compute the similarity between any two medical term nodes, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining a similarity ranking between that node and the other medical term nodes in the medical terminology knowledge graph, and outputting the medical term normalization result.
Also Published As
Publication number | Publication date |
---|---|
CN113656604A (en) | 2021-11-16 |
JP2024500400A (en) | 2024-01-09 |
CN113656604B (en) | 2022-02-22 |
JP7432802B2 (en) | 2024-02-16 |
Similar Documents
Publication | Title |
---|---|
WO2023065858A1 (en) | Medical term standardization system and method based on heterogeneous graph neural network |
CN113871003B (en) | Disease auxiliary differential diagnosis system based on causal medical knowledge graph | |
CN110032648B (en) | Medical record structured analysis method based on medical field entity | |
US20220277858A1 (en) | Medical Prediction Method and System Based on Semantic Graph Network | |
CN106934235B (en) | Patient's similarity measurement migratory system between a kind of disease areas based on transfer learning | |
CN110189831B (en) | Medical record knowledge graph construction method and system based on dynamic graph sequence | |
CN106682397A (en) | Knowledge-based electronic medical record quality control method | |
Tashkandi et al. | Efficient in-database patient similarity analysis for personalized medical decision support systems | |
CN111222340A (en) | Breast electronic medical record entity recognition system based on multi-standard active learning | |
CN113918694B (en) | Question analysis method for medical knowledge graph questions and answers | |
CN116680377B (en) | Chinese medical term self-adaptive alignment method based on log feedback | |
TW202101477A (en) | Method for applying a label made after sampling to neural network training model | |
CN113707339A (en) | Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases | |
CN112749277A (en) | Medical data processing method and device and storage medium | |
Lu et al. | Chinese clinical named entity recognition with word-level information incorporating dictionaries | |
Leng et al. | Bi-level artificial intelligence model for risk classification of acute respiratory diseases based on Chinese clinical data | |
Sudharson et al. | Enhancing the Efficiency of Lung Disease Prediction using CatBoost and Expectation Maximization Algorithms | |
Fang et al. | Multi-modal sarcasm detection based on Multi-Channel Enhanced Fusion model | |
CN113360643A (en) | Electronic medical record data quality evaluation method based on short text classification | |
Wang et al. | Research on named entity recognition of doctor-patient question answering community based on bilstm-crf model | |
Ning et al. | Research on a vehicle-mounted intelligent TCM syndrome differentiation system based on deep belief network | |
CN116630062A (en) | Medical insurance fraud detection method, system and storage medium | |
Lu et al. | Towards semi-structured automatic ICD coding via tree-based contrastive learning | |
Zhang et al. | Conco-ernie: Complex user intent detect model for smart healthcare cognitive bot | |
Liu et al. | Disease Topic Modeling of Users' Inquiry Texts: A Text Mining-Based PQDR-LDA Model for Analyzing the Online Medical Records |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22882473; Country of ref document: EP; Kind code of ref document: A1 |
WWE | WIPO information: entry into national phase | Ref document number: 2023536585; Country of ref document: JP |
NENP | Non-entry into the national phase | Ref country code: DE |