CN113656604B - Medical term normalization system and method based on heterogeneous graph neural network - Google Patents
- Publication number
- CN113656604B CN202111213727.4A
- Authority
- CN
- China
- Prior art keywords
- nodes
- node
- medical term
- medical
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention discloses a medical term normalization system and method based on a heterogeneous graph neural network. A heterogeneous graph neural network covering multiple types of medical terms is constructed on the basis of a knowledge graph, and both the neighbor-node distribution and the node content encoding of the graph are taken into account when the network is trained for medical term normalization. The invention makes full use of knowledge about the associations and differences among the information units of medical terms of the same type, accommodates multiple types of medical terms at once, learns medical domain knowledge comprehensively, allows new types of medical terms to be added to the system easily, and reduces the workload of normalizing new types of medical terms.
Description
Technical Field
The invention belongs to the technical field of Chinese medical term normalization and multi-center medical information platforms, and in particular relates to a medical term normalization system and method based on a heterogeneous graph neural network.
Background
An important research direction in medical informatization is applying higher-performance machine learning and artificial intelligence technology to real clinical problems. One advantage of artificial intelligence is that it can discover complex patterns and features in massive data, so jointly using the medical data of multiple institutions for analysis, mining and model design, in support of medical research and clinical decision-making, has become an inevitable trend of medical informatization. Integrating medical data from different sources is extremely difficult, however, because different medical institutions adopt many different information standards and because semi-structured and unstructured data are frequently produced by hand. Medical terms are the basic elements of medical data. A sound medical term normalization system allows medical data from different sources to be aligned to a unified standard and structure, providing larger-scale, higher-quality data for clinical decision-making and medical research. Medical terms mainly include drug terms, medical examination terms, disease terms and other types of terms generated during clinical work. Each type of medical term contains information along particular key dimensions, which we define as the information units of the medical term. For example, the drug term "5% glucose injection (base) 500 ml" contains the information units shown in Table 1:
Table 1. Example information units of a drug term
The examination term "left hand means positive bit _ X" contains the information units shown in Table 2:
Table 2. Example information units of an examination term
Some information units are composed of other, finer-grained information units; the two levels are defined as primary information units and secondary information units respectively. For example, the drug term in Table 1 comprises the primary information units "pharmaceutical composition", "pharmaceutical dosage form", "pharmaceutical dosage" and "pharmaceutical specification", where the "pharmaceutical specification" information unit is composed of the secondary information units "number" (500) and "dosing unit" (ml). Given a set of information units, a complete medical term can be determined.
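The two-level structure described above can be sketched as a small data model. This is only an illustrative sketch: the class names `InfoUnit` and `MedicalTerm` and the helper `flatten` are hypothetical, and only the "pharmaceutical specification" unit is filled in because it is the one whose secondary units are spelled out in the text.

```python
from dataclasses import dataclass, field


@dataclass
class InfoUnit:
    kind: str                                   # e.g. "pharmaceutical specification"
    content: str                                # surface text of the unit
    children: list["InfoUnit"] = field(default_factory=list)  # secondary units


@dataclass
class MedicalTerm:
    text: str
    term_type: str
    units: list[InfoUnit]                       # primary information units


term = MedicalTerm(
    text="5% glucose injection (base) 500 ml",
    term_type="drug",
    units=[
        InfoUnit("pharmaceutical specification", "500 ml", children=[
            InfoUnit("number", "500"),
            InfoUnit("dosing unit", "ml"),
        ]),
    ],
)


def flatten(unit, level=1):
    """Yield (level, kind, content) for a unit and its nested units."""
    yield level, unit.kind, unit.content
    for child in unit.children:
        yield from flatten(child, level + 1)


rows = [row for unit in term.units for row in flatten(unit)]
for level, kind, content in rows:
    print("  " * (level - 1) + f"{kind}: {content}")
```

Level 1 rows correspond to primary information units and level 2 rows to secondary information units.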
In actual clinical work, differences in the information standards adopted by medical institutions, the personal habits of medical workers and similar causes produce a large number of irregular medical terms. The main problems are redundant or missing key information units, irregular expression, and inconsistent units of quantity. For example, the following medical terms have the same meaning but differ considerably in form: "levofloxacin tablet (clonidine) 500 mg" and "clonidine 0.5 g/tablet". The aim of medical term normalization is to identify medical terms that have the same meaning but different literal forms so that their expression can be unified, to distinguish medical terms with different meanings, and ultimately to promote the normalization of medical data as a whole.
Traditional medical term normalization methods work on a single category of medical terms: the meaning of each term is understood by machine learning or manual verification, and terms with the same semantics are marked. Such methods treat each medical term as a whole and ignore its inherent information unit structure. Their main disadvantages are: (1) Knowledge about the associations and differences among information units cannot be exploited effectively. The associations and differences among information units of different dimensions within the same medical term can carry rich medical domain knowledge, which existing practice neither structures explicitly nor uses. (2) Different types of medical terms can contain the same or related information units, yet traditional normalization work develops an independent system for each single category of medical terms. This makes the workload excessive and prevents the knowledge in the information units of different types of medical terms from being used jointly. (3) Superfluous information is taken into account. Because of irregular expression and similar causes, most medical terms contain redundant characters besides the key information units. These characters have little to do with the overall meaning of the term and act as noise that biases its interpretation.
Disclosure of Invention
In view of the shortcomings of existing medical term normalization methods and the characteristics of medical terms, the invention aims to provide a medical term normalization system and method based on a heterogeneous graph neural network. The invention constructs a novel information-unit-based knowledge graph for all medical terms and, on that basis, normalizes the medical terms with an improved heterogeneous graph neural network, thereby making effective use of the knowledge in the information units and obtaining more accurate normalization results.
The purpose of the invention is achieved by the following technical scheme. To make full use of the medical domain knowledge contained in medical terms during normalization, the invention first constructs key information units for multiple types of medical terms, giving the terms a structured representation, and builds a knowledge graph covering multiple types of medical terms from these information units. A heterogeneous graph neural network covering multiple types of medical terms is then constructed on the basis of the knowledge graph, and both the neighbor-node distribution and the node content encoding of the graph are taken into account when the network is trained for medical term normalization. In this way the invention makes full use of knowledge about the associations and differences among the information units of medical terms of the same type, accommodates multiple types of medical terms in one system, learns medical domain knowledge comprehensively, allows new types of medical terms to be added to the system easily, and reduces the workload of normalizing new types of medical terms. Redundant characters and information are discarded when the information units are extracted from a medical term, which avoids introducing excessive noise and errors.
In one aspect, the invention discloses a medical term normalization system based on a heterogeneous graph neural network, comprising:
(1) an information unit construction module: defines key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the inclusion relationships between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms, and an information unit library is constructed;
(2) a medical term knowledge graph module: constructs a medical term knowledge graph based on the relationships between medical terms and information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and represent two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;
(3) a heterogeneous graph neural network module: trains a heterogeneous graph neural network based on the neighbor-node distribution and node content encoding of the medical term knowledge graph; the neighbor nodes of a node are all nodes passed through when starting from that node and making two hops along the edge direction of the medical term knowledge graph; the node content encoding is as follows:
for a node whose content is a numerical value, the content encoding equals the product of the node's numerical value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training, the semantic vectors of all basic units and operation symbols contained in the node are concatenated, and the content encoding is obtained through a nonlinear transformation;
for a node whose content is of text type, the content encoding is obtained through a pre-trained language model;
the first stage of training: the neighbor-node distribution and the content encodings of the nodes are taken as input, and the training objective is to maximize the conditional probability of each node's neighbor nodes given that node, yielding a vector representation of each node;
the second stage of training: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity of medical term nodes with the same meaning;
(4) a prediction result output module: inputs the medical term node to be normalized into the trained heterogeneous graph neural network, obtains the similarity ranking between that node and the other medical term nodes in the medical term knowledge graph, and outputs the medical term normalization result.
Further, the types of medical terms include pharmaceutical terms, disease terms, surgical terms, test terms, and examination terms.
Further, in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model; the span of each information unit is labeled on the medical terms used as training data, and characters that belong to no information unit are labeled as well, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the medical term.
Further, in the information unit construction module, numerical values and measurement units are preliminarily normalized: each original measurement unit is normalized into a single basic unit or into several basic units combined through operation symbols, and the numerical value is converted accordingly.
Further, in the heterogeneous graph neural network module, let $V$ denote the set of all nodes in the medical term knowledge graph. For $v \in V$, let $C_v$ denote its node content and $f(C_v)$ its content encoding. For a node $v$ whose content is a numerical value, the content encoding is:
$$f(C_v) = x_v \cdot \mathbf{u}$$
where $x_v$ is the numerical value of node $v$ itself, and $\mathbf{u}$ is a unit vector that is randomly initialized and obtained through heterogeneous graph neural network training;
for a node $v$ whose content is a measurement unit, the node content is a sequence of basic units and operation symbols $C_v = (c_1, c_2, \ldots, c_n)$, where each $c_i$ is a basic unit or an operation symbol and $n$ is the length of $C_v$. The content encoding is:
$$f(C_v) = \sigma\left(W_c \left[ e_{c_1} \oplus e_{c_2} \oplus \cdots \oplus e_{c_n} \right]\right)$$
where $W_c$ is a parameter matrix obtained through heterogeneous graph neural network training; $e_{c_i}$ is the semantic vector of each basic unit or operation symbol, randomly initialized and obtained through heterogeneous graph neural network training; $\oplus$ is the vector concatenation operator; and $\sigma$ denotes the nonlinear transformation;
for a node $v$ whose content is of text type, a pre-trained language model computes the semantic vector of $C_v$ as the initial value of $f(C_v)$, and the content encoding continues to be trained by the subsequent heterogeneous graph neural network.
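The encodings for numerical and measurement-unit nodes can be illustrated with a short NumPy sketch. Several things here are assumptions rather than details from the patent: the embedding size `DIM`, zero-padding the unit sequence to a fixed `MAX_LEN` so that one parameter matrix fits every sequence, and `tanh` as the nonlinear transformation. The random values merely stand in for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8       # embedding size (assumed)
MAX_LEN = 3   # longest unit sequence handled, shorter ones are zero-padded (assumed)

# Randomly initialized stand-ins for trainable parameters.
unit_vector = rng.normal(size=DIM)
symbol_vectors = {s: rng.normal(size=DIM) for s in ["ml", "mg", "*", "/"]}
W = rng.normal(size=(DIM, DIM * MAX_LEN))


def encode_number(value):
    """Numerical node: the value scales a single shared vector."""
    return value * unit_vector


def encode_unit(symbols):
    """Unit node: concatenate symbol vectors, then apply a nonlinear map."""
    vectors = [symbol_vectors[s] for s in symbols]
    vectors += [np.zeros(DIM)] * (MAX_LEN - len(vectors))
    return np.tanh(W @ np.concatenate(vectors))


number_code = encode_number(500.0)
unit_code = encode_unit(["mg", "/", "ml"])
print(number_code.shape, unit_code.shape)
```

Because the numerical encoding is linear in the value, nodes holding 500 and 1000 lie on the same direction in the embedding space and differ only in magnitude.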
Further, for a node whose content is of text type, the pre-trained language model is a BERT model. Each layer of the BERT model computes its hidden state from its input values using parameters obtained in the training process [layer equations not reproduced]. With $k$ indexing the layers, if the BERT model has $m$ layers in total, the content encoding of the node is initialized to the hidden state of the $m$-th layer.
Further, in the heterogeneous graph neural network module, the vector representation of each node is computed from the content encodings of the node and of its neighbor nodes in the medical term knowledge graph. For a node $v$ of the medical term knowledge graph, let $N_1(v)$ denote the set of nodes directly pointed to by edges starting from $v$, and let $N_2(v) = \bigcup_{u \in N_1(v)} N_1(u)$. If $v$ is a medical term node, then $N_1(v)$ is the set of primary information units of $v$ and $N_2(v)$ is the set of secondary information units of $v$. The neighbor node set $N(v)$ of $v$ is defined as:
$$N(v) = N_1(v) \cup N_2(v)$$
The vector representation of $v$ is then computed from the content encoding of $v$ and the content encodings of the nodes in $N(v)$ [equation not reproduced], using parameter matrices obtained by training and a nonlinear activation function $\sigma$.
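The neighbor set, meaning every node reached within two hops along the edge direction, can be computed directly from an adjacency map. The sketch below is illustrative only; the node keys and the function name `neighbors` are hypothetical.

```python
from collections import defaultdict

# Directed edges point from the containing node to the contained node.
edges = [
    ("term:5% glucose injection (base) 500 ml", "unit:pharmaceutical specification=500 ml"),
    ("unit:pharmaceutical specification=500 ml", "unit:number=500"),
    ("unit:pharmaceutical specification=500 ml", "unit:dosing unit=ml"),
]

out_edges = defaultdict(set)
for source, target in edges:
    out_edges[source].add(target)


def neighbors(node):
    """Return all nodes reached in one or two hops along edge direction."""
    first_hop = set(out_edges.get(node, ()))
    second_hop = set()
    for middle in first_hop:
        second_hop |= out_edges.get(middle, set())
    return first_hop | second_hop


term_neighbors = neighbors("term:5% glucose injection (base) 500 ml")
print(sorted(term_neighbors))
```

For a medical term node the first hop yields its primary information units and the second hop its secondary information units; a leaf node such as a numerical value has an empty neighbor set.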
Further, in the heterogeneous graph neural network module, in the first training stage, let $\Theta$ denote the set of trainable parameters, $V$ the set of all nodes, and $N(v)$ the neighbor node set of node $v$. The goal of training is to optimize the following objective function, which maximizes the conditional probability of each node's neighbor nodes given that node:
$$\max_{\Theta} \prod_{v \in V} \prod_{u \in N(v)} p(u \mid v; \Theta)$$
in the second training stage, the similarity of any two medical term nodes is computed from their vector representations [equation not reproduced],
where $v_i$ and $v_j$ are medical term nodes in the medical term knowledge graph, $s(v_i, v_j)$ is the similarity of $v_i$ and $v_j$, and $W$ and $b$ are both parameters obtained by training;
in the training data for medical term normalization, let $P_i$ be the set of medical term nodes with the same meaning as medical term node $v_i$ and $Q_i$ the set of medical term nodes with a different meaning. The label $y_{ij}$ of a training sample is then:
$$y_{ij} = \begin{cases} 1, & v_j \in P_i \\ 0, & v_j \in Q_i \end{cases}$$
the goal of the second stage is to minimize the loss function $L$ [equation not reproduced].
Further, in the prediction result output module, for a medical term node $v_q$ to be normalized, the trained heterogeneous graph neural network computes the similarity $s(v_q, v_j)$ between $v_q$ and every other medical term node $v_j$ in the medical term knowledge graph and ranks the results, taking the medical term node $v^{*}$ with the greatest similarity to $v_q$:
$$v^{*} = \arg\max_{v_j} s(v_q, v_j)$$
A similarity threshold $\tau$ is set. If $s(v_q, v^{*}) \geq \tau$, then $v_q$ and $v^{*}$ are considered to have the same meaning, and $v^{*}$ is the normalization result of $v_q$; otherwise, $v_q$ is considered to differ in meaning from all other medical term nodes in the medical term knowledge graph, and $v_q$ has an independent meaning.
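The ranking-and-threshold decision can be sketched as follows. The function name `pick_normalized_term`, the toy similarity table and its scores are all made up for illustration; in the described system the similarity would come from the trained network, and treating a score equal to the threshold as a match is an assumption.

```python
def pick_normalized_term(query, candidates, similarity, threshold):
    """Return (best_match, score); best_match is None when below threshold."""
    scored = sorted(((similarity(query, c), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Toy similarity table standing in for the trained network (values are made up).
TOY_SCORES = {
    ("clonidine 0.5 g/tablet", "levofloxacin tablet (clonidine) 500 mg"): 0.93,
    ("clonidine 0.5 g/tablet", "5% glucose injection (base) 500 ml"): 0.08,
}


def toy_similarity(a, b):
    return TOY_SCORES.get((a, b), 0.0)


known_terms = [
    "levofloxacin tablet (clonidine) 500 mg",
    "5% glucose injection (base) 500 ml",
]
match, score = pick_normalized_term(
    "clonidine 0.5 g/tablet", known_terms, toy_similarity, threshold=0.5
)
print(match, score)
```

Raising the threshold above the best score makes the function return `None`, which corresponds to treating the query as a term with an independent meaning.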
In another aspect, the invention discloses a medical term normalization method based on a heterogeneous graph neural network, comprising the following steps:
(1) defining key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the inclusion relationships between the two levels; identifying, at the character level, the information units contained in all medical terms with a sequence labeling model, and constructing an information unit library;
(2) constructing a medical term knowledge graph based on the relationships between medical terms and information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and represent two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;
(3) training a heterogeneous graph neural network based on the neighbor-node distribution and node content encoding of the medical term knowledge graph; the neighbor nodes of a node are all nodes passed through when starting from that node and making two hops along the edge direction of the medical term knowledge graph; the node content encoding is as follows:
for a node whose content is a numerical value, the content encoding equals the product of the node's numerical value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training, the semantic vectors of all basic units and operation symbols contained in the node are concatenated, and the content encoding is obtained through a nonlinear transformation;
for a node whose content is of text type, the content encoding is obtained through a pre-trained language model;
the first stage of training: the neighbor-node distribution and the content encodings of the nodes are taken as input, and the training objective is to maximize the conditional probability of each node's neighbor nodes given that node, yielding a vector representation of each node;
the second stage of training: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity of medical term nodes with the same meaning;
(4) inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between that node and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result.
The beneficial effects of the invention are as follows. The invention defines a unified information unit structure for different types of medical terms and achieves a relatively uniform structured representation, so that medical domain knowledge is better used during medical term normalization and the associations and differences among information units, both within one type of medical term and across types, are fully learned. By integrating all medical terms into one knowledge graph, a single heterogeneous graph neural network normalizes different types of medical terms, which improves the efficiency of the normalization work as well as the completeness and consistency of the output.
Drawings
FIG. 1 is a block diagram of a medical term normalization system based on a heterogeneous graph neural network according to an embodiment of the present invention;
FIG. 2 shows the training data of the sequence labeling model provided in an embodiment of the present invention;
FIG. 3 is a schematic view of a medical term knowledge graph provided by an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The present invention may, however, be practiced in ways other than those specifically described here, and those of ordinary skill in the art can make similar extensions without departing from the spirit of the invention; the invention is therefore not limited to the specific embodiments disclosed below.
In the present invention, medical term normalization means the process of analyzing the various medical terms generated in a real clinical environment by combining medical domain knowledge with natural language processing methods, identifying medical terms with the same meaning and distinguishing medical terms with different meanings, and unifying the medical terms within a certain scope so as to obtain the best order and social benefit. Establishing a unified medical term standard and term set helps resolve problems such as duplicated terms and inconsistent connotation, semantic expression and understanding, and is of great significance for promoting the dissemination, sharing and use of medical information more widely and more deeply.
The heterogeneous graph neural network is explained as follows. Traditional deep learning methods have had great success on sequence-shaped and matrix-shaped data, but the data in many practical application scenarios has a graph structure. In recent years, researchers have drawn on the ideas of convolutional networks and recurrent networks to define and design graph neural network models for processing graph data. An ordinary graph neural network targets graphs with a single node type and a single relationship type, and can perform well using only the neighbor-node information of the graph. In contrast, real-world graph data usually has many, widely differing node and relationship types; a graph of this kind is called a heterogeneous graph. When a heterogeneous graph neural network is trained, the contents of different types of nodes differ greatly in their features and in their information dimensions, so the content encoding of the nodes must be considered together with the neighbor-node information of the graph.
The embodiment of the invention provides a medical term normalization system based on a heterogeneous graph neural network which, as shown in FIG. 1, comprises the following modules:
I. Information unit construction module, comprising:
(1) defining key information units for each type of medical term; the medical term types include drug terms, disease terms, surgical terms, test terms and examination terms; the information units comprise primary information units, secondary information units, and the inclusion relationships between primary and secondary information units;
(2) identifying, at the character level, the information units contained in all medical terms with a sequence labeling model, and constructing an information unit library;
II. Medical term knowledge graph module: constructs a medical term knowledge graph based on the relationships between medical terms and information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and represent two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;
III. Heterogeneous graph neural network module: trains a heterogeneous graph neural network based on the neighbor-node distribution and node content encoding of the medical term knowledge graph;
the neighbor nodes of a node are all nodes passed through when starting from that node and making two hops along the edge direction of the medical term knowledge graph;
the node content encoding is as follows:
for a node whose content is a numerical value, the content encoding equals the product of the node's numerical value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training, the semantic vectors of all basic units and operation symbols contained in the node are concatenated, and the content encoding is obtained through a nonlinear transformation;
for a node whose content is of text type, the content encoding is obtained through a pre-trained language model;
the first stage of training: the neighbor-node distribution and the content encodings of the nodes are taken as input, and the training objective is to maximize the conditional probability of each node's neighbor nodes given that node, yielding a vector representation of each node;
the second stage of training: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity of medical term nodes with the same meaning;
IV. Prediction result output module: inputs the medical term node to be normalized into the trained heterogeneous graph neural network, obtains the similarity ranking between that node and the other medical term nodes in the medical term knowledge graph, and outputs the medical term normalization result.
The implementation process of each module is described in detail as follows:
I. Information unit construction module
(1) Defining the information units of medical terms. Some internationally used medical term standard sets already exist, and each defines key-dimension information units for one specific category of medical terms. However, no associations between information units are established across the standard sets for different types of medical terms, so the information used in past normalization work was confined to a single category of medical terms and a large amount of useful information was ignored. The invention combines the existing international medical term standard sets with expert knowledge from actual clinical practice, defines key information units uniformly for multiple types of medical terms, and specifies a detailed structure of primary and secondary information units. The types of medical terms implemented so far are drug terms, disease terms, surgical terms, test terms and examination terms. If a new type of medical term needs to be normalized later, it can easily be added to the system once its information units have been defined. The information units of the implemented medical terms are defined as shown in Table 3.
Table 3. Information units of medical terms
(2) Constructing the information unit library. A sequence labeling model predicts the probability that each character of a medical term belongs to each information unit, thereby identifying all information units contained in the term and giving the term a structured representation. The sequence labeling model used in this embodiment is a BiLSTM-CRF model. The model first captures the context information of the medical term through a BiLSTM network, then builds the state probability and transition probability matrices from the BiLSTM outputs at each character position of the term and constructs a CRF model on top of them, which achieves better results on the sequence labeling task. The construction of training data for the sequence labeling model is shown in FIG. 2: the span of each information unit is labeled on the medical terms used as training data, and characters that belong to no information unit are labeled as well, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the term and excessive noise is kept out of the subsequent heterogeneous graph neural network.
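Turning labeled spans into character-level training tags might look like the sketch below. The BIO tagging scheme and the label names `NUM` and `UNIT` are assumptions made for illustration; the text only states that information unit spans and non-information-unit characters are both labeled.

```python
def find_span(text, piece, label):
    """Locate the first occurrence of piece and return (start, end, label)."""
    start = text.index(piece)
    return start, start + len(piece), label


def spans_to_tags(text, spans):
    """Convert labeled spans to one tag per character; 'O' marks other characters."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


text = "5% glucose injection (base) 500 ml"
spans = [find_span(text, "500", "NUM"), find_span(text, "ml", "UNIT")]
tags = spans_to_tags(text, spans)

for char, tag in zip(text, tags):
    if tag != "O":
        print(repr(char), tag)
```

Characters such as those in "(base)" keep the `O` tag, which is what lets a trained tagger drop them as redundant.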
(3) Note that in Table 3 several kinds of primary information units contain the secondary information units numerical value and measurement unit. The raw numerical values and measurement units found in medical terms are widely spread and sparse, which makes the heterogeneous graph neural network harder to train. To address this, numerical values and measurement units are first preliminarily normalized: each original measurement unit is normalized into a single basic unit or into several basic units combined through operation symbols, and the numerical value is converted accordingly. The basic units are: ml (milliliter), mg (milligram), mm (millimeter), s (second), mol (amount of substance), u (unit), iu (international unit), count, type, stage and period; the operation symbols are multiplication and division. In total, 90 normalized measurement units are produced. For example, if the original measurement unit is l (liter) and the corresponding value is 1, the normalized measurement unit is ml (milliliter) and the value is converted to 1000.
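A minimal sketch of this preliminary normalization is shown below. Only the liter-to-milliliter conversion comes from the text; the gram-to-milligram entry, the table name `TO_BASE` and the function `normalize_quantity` are illustrative assumptions.

```python
# Original unit -> (basic unit, multiplier). Only "l" -> "ml" is taken from the
# text; the "g" -> "mg" entry is an assumed example of the same idea.
TO_BASE = {
    "l": ("ml", 1000.0),
    "ml": ("ml", 1.0),
    "g": ("mg", 1000.0),
    "mg": ("mg", 1.0),
}


def normalize_quantity(value, unit):
    """Rewrite a unit such as 'g/l' in basic units and rescale the value."""
    tokens = unit.replace("*", " * ").replace("/", " / ").split()
    factor = 1.0
    operator = "*"
    parts = []
    for token in tokens:
        if token in ("*", "/"):
            operator = token
            parts.append(token)
            continue
        base, multiplier = TO_BASE[token.lower()]
        factor = factor * multiplier if operator == "*" else factor / multiplier
        parts.append(base)
    return value * factor, "".join(parts)


print(normalize_quantity(1, "l"))
print(normalize_quantity(0.5, "g"))
print(normalize_quantity(2, "g/l"))
```

With this, "0.5 g" and "500 mg" map to the same value and unit pair, so they end up at the same numerical and unit nodes in the knowledge graph.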
Second, medical term knowledge graph module
A knowledge graph containing multiple types of medical terms is constructed based on the information unit library built by the information unit construction module, as shown in FIG. 3. The graph has two major types of nodes: circular nodes represent medical term nodes and rectangular nodes represent information unit nodes. Each major type contains several subtypes; for example, the medical term nodes include "drug term" nodes, "disease term" nodes, and so on, and the information unit nodes include "drug dose" nodes, "numerical value" nodes, and so on. The edges express two relationships: 1) the containment relationship between medical terms and information units; 2) the containment relationship between primary information units and secondary information units. The division between primary and secondary information units may differ across types of medical terms. For example, for disease terms the "disease subject" is a primary information unit, whereas for surgery terms the "disease subject" is a secondary information unit contained in the primary information unit "disease property".
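A minimal sketch of such a graph is a directed adjacency list in which every edge points from the containing node to the contained node. The class `TermGraph`, the `(node type, content)` tuple representation, and the example term and values are hypothetical; only the two containment relationships and the node type names "drug term", "drug dose", and "numerical value" come from the text.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (node type, node content)


class TermGraph:
    """Directed graph whose edges run from the containing to the contained node."""

    def __init__(self) -> None:
        self.out_edges: Dict[Node, List[Node]] = {}

    def add_containment(self, container: Node, contained: Node) -> None:
        targets = self.out_edges.setdefault(container, [])
        if contained not in targets:
            targets.append(contained)
        # Make sure the contained node is also registered in the graph.
        self.out_edges.setdefault(contained, [])


# Hypothetical example content.
term: Node = ("drug term", "potassium chloride injection 10ml")
dose: Node = ("drug dose", "10ml")
value: Node = ("numerical value", "10")
unit: Node = ("measurement unit", "ml")

graph = TermGraph()
graph.add_containment(term, dose)    # medical term -> primary information unit
graph.add_containment(dose, value)   # primary unit -> secondary information unit
graph.add_containment(dose, unit)
```

Keeping the node type inside the node key lets two terms that contain the same unit share one information unit node, which is how related terms would end up connected in this sketch.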
Third, heterogeneous graph neural network module
(1) A heterogeneous graph is a graph with multiple node types and relationship types; the medical term knowledge graph shown in FIG. 3 is a heterogeneous graph. An ordinary graph neural network targets graphs with a single node type and a single relationship type, and can achieve good performance by relying only on the adjacent node information of the graph. When training a heterogeneous graph neural network, the contents of different node types differ greatly in their features and in their information dimensions, so both the adjacent node distribution of the graph and the content encodings of the nodes must be considered. For computing the content encodings, the invention designs a suitable calculation method for each node type.
(2) Computing the content encodings of the different node types. Let $V$ denote the set of all nodes in the medical term knowledge graph of FIG. 3. For $v \in V$, let $C_v$ be its node content and $f(C_v)$ its content encoding. The content encodings of the different node types are computed as follows:
for a node $v$ whose node content is a numerical value, the content encoding is:
$$f(C_v) = x_v \cdot e$$
where $x_v$ is the value of node $v$ itself, and $e$ denotes a unit vector that is randomly initialized and then learned through heterogeneous graph neural network training;
for a node $v$ whose node content is a measurement unit, the node content is a sequence $(u_1, u_2, \ldots, u_n)$ of basic units and operators, where $u_i$ is a basic unit or an operator and $n$ is the length of the sequence. The content encoding is:
$$f(C_v) = \sigma\big(W_u\,(e_{u_1} \oplus e_{u_2} \oplus \cdots \oplus e_{u_n})\big)$$
where $W_u$ is a parameter matrix obtained through heterogeneous graph neural network training; $e_{u_i}$ is the semantic vector of each basic unit or operator, randomly initialized and then learned through heterogeneous graph neural network training; $\oplus$ is the vector concatenation operator; and $\sigma$ is the nonlinear transformation;
for a node $v$ whose node content is text, a pre-trained language model computes the semantic vector of $C_v$ as the initial value of $f(C_v)$, and the content encoding continues to be trained by the subsequent heterogeneous graph neural network. The pre-trained language model used in this embodiment is a BERT model, and the calculation is as follows:
$$h^{k} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \qquad Q = h^{k-1} W_Q,\quad K = h^{k-1} W_K,\quad V = h^{k-1} W_V$$
where $h^{k}$ is the hidden state of layer $k$ of the BERT model; $Q$, $K$, and $V$ are the input values of layer $k$; $W_Q$, $W_K$, and $W_V$ are all parameters obtained by training; $d$ is the dimension of $K$; and $h^{k-1}$ is the hidden state of layer $k-1$ of the BERT model. If the BERT model has $m$ layers, the content encoding of node $v$ is initialized to $f(C_v) = h^{m}$. This example takes $m = 12$.
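The numerical and measurement-unit encodings described above can be sketched in plain Python. Several details here are assumptions rather than the patent's choices: the encoding size `DIM`, zero-padding the concatenated vectors to a fixed length `MAX_TOKENS` so that one parameter matrix fits every sequence, `tanh` as the nonlinear transformation, and all names. The trainable parameters are shown at random initial values; no training is performed.

```python
import math
import random
from typing import Dict, List, Sequence

DIM = 4          # encoding size (assumed)
MAX_TOKENS = 3   # longest unit sequence handled by this sketch (assumed)
_rng = random.Random(0)


def _random_vector(size: int) -> List[float]:
    return [_rng.uniform(-1.0, 1.0) for _ in range(size)]


def _unit_length(vec: List[float]) -> List[float]:
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


# Trainable parameters at their random initial values.
E: List[float] = _unit_length(_random_vector(DIM))  # unit vector e
TOKEN_VECTORS: Dict[str, List[float]] = {
    t: _random_vector(DIM) for t in ("ml", "mg", "*", "/")
}
W: List[List[float]] = [_random_vector(DIM * MAX_TOKENS) for _ in range(DIM)]


def encode_number(x: float) -> List[float]:
    """Numerical node: the value multiplied by the unit vector."""
    return [x * c for c in E]


def encode_unit(tokens: Sequence[str]) -> List[float]:
    """Measurement-unit node: concatenate token vectors, then W and tanh."""
    if len(tokens) > MAX_TOKENS:
        raise ValueError("unit sequence longer than MAX_TOKENS")
    concat: List[float] = []
    for token in tokens:
        concat.extend(TOKEN_VECTORS[token])
    concat.extend([0.0] * (DIM * MAX_TOKENS - len(concat)))  # zero padding
    return [math.tanh(sum(w * c for w, c in zip(row, concat))) for row in W]
```

Because the numerical encoding is linear in the value, two numerical nodes always lie on the same line through the origin and differ only in magnitude, which is one way to read the choice of a single shared unit vector.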
(3) In the heterogeneous graph neural network, the vector representation of each node is computed from the content encodings of the node itself and of its adjacent nodes in the medical term knowledge graph. For a node $v$ of the medical term knowledge graph, let $A(v)$ denote the set of nodes directly pointed to by arrows starting from $v$. If $v$ is a medical term node, then $A(v)$ is the set of primary information units of $v$, and $\bigcup_{u \in A(v)} A(u)$ is the set of secondary information units of $v$. The set of adjacent nodes $N(v)$ of $v$ is defined as:
$$N(v) = A(v) \cup \bigcup_{u \in A(v)} A(u)$$
The vector representation $h_v$ of node $v$ is:
$$h_v = \sum_{u \in N(v) \cup \{v\}} \alpha_{v,u}\, f(C_u)$$
where $\alpha_{v,u}$ is a weight parameter representing the importance of node $u$ to node $v$, and $u$ can be $v$ itself or one of its adjacent nodes. The weights $\alpha_{v,u}$ are computed from matrix parameters obtained by training together with a nonlinear activation function $\sigma$. Since the relative importance between nodes is asymmetric, the weights are also asymmetric, i.e. $\alpha_{v,u} \neq \alpha_{u,v}$.
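The adjacent node set (all nodes reached in one or two hops along the edge direction) and a weighted aggregation over it can be sketched as follows. The scoring used for the weights here, a softmax over dot products of content encodings, is a stand-in chosen for illustration and is not the patent's weight formula, which relies on trained matrices and an activation function; the function names and string node identifiers are likewise assumptions.

```python
import math
from typing import Dict, List, Sequence


def neighbor_set(out_edges: Dict[str, List[str]], v: str) -> List[str]:
    """Nodes reached from v in one or two hops along the edge direction."""
    first_hop = out_edges.get(v, [])
    found: List[str] = []
    for u in first_hop:
        if u != v and u not in found:
            found.append(u)
    for u in first_hop:
        for w in out_edges.get(u, []):
            if w != v and w not in found:
                found.append(w)
    return found


def aggregate(v: str, out_edges: Dict[str, List[str]],
              enc: Dict[str, Sequence[float]]) -> List[float]:
    """Weighted sum of the encodings of v and its adjacent nodes.

    Weights: softmax of dot-product scores (an assumed stand-in).
    """
    members = [v] + neighbor_set(out_edges, v)
    scores = [sum(a * b for a, b in zip(enc[v], enc[u])) for u in members]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(enc[v])
    return [sum(w * enc[u][i] for w, u in zip(weights, members))
            for i in range(dim)]
```

Since the softmax is normalized separately over each node's own neighborhood, the weight of `u` for `v` generally differs from the weight of `v` for `u`, so this stand-in shares the asymmetry property stated in the text.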
(4) Training the heterogeneous graph neural network. The training process is divided into two stages: 1) taking the adjacent node distribution and the node content encodings as input, the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, which yields the vector representation of each node; 2) taking the vector representations of the nodes as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity of medical term nodes with the same meaning.
In the first stage of the training process, let $\Theta$ denote the set of trainable parameters; the goal of training is to optimize the following objective function:
$$\max_{\Theta} \prod_{v \in V} \prod_{u \in N(v)} p(u \mid v; \Theta)$$
In the second stage of the training process, the similarity $s(v_i, v_j)$ of any two medical term nodes is computed from their vector representations, where $v_i$ and $v_j$ are medical term nodes in the medical term knowledge graph, $s(v_i, v_j)$ is the similarity of $v_i$ and $v_j$, and $W$ and $b$ are parameters obtained by training. In the medical term normalization training data, let $S(v_i)$ be the set of medical term nodes with the same meaning as medical term node $v_i$, and $D(v_i)$ the set of medical term nodes with a different meaning; the label $y_{ij}$ of a training sample $(v_i, v_j)$ is then:
$$y_{ij} = \begin{cases} 1, & v_j \in S(v_i) \\ 0, & v_j \in D(v_i) \end{cases}$$
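The construction of second-stage training samples can be sketched as pairing medical term nodes and labeling each pair by whether the two terms share a meaning. The mapping `concept_of` from a term to an identifier of its meaning, the function names, and the use of binary cross-entropy as the per-sample loss are all assumptions; the text only states that same-meaning pairs should receive maximal similarity.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def build_pairs(concept_of: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Label every unordered pair of terms: 1 if same meaning, else 0."""
    pairs: List[Tuple[str, str, int]] = []
    for a, b in combinations(sorted(concept_of), 2):
        label = 1 if concept_of[a] == concept_of[b] else 0
        pairs.append((a, b, label))
    return pairs


def binary_cross_entropy(similarity: float, label: int) -> float:
    """Assumed per-sample loss for a similarity in (0, 1) and a 0/1 label."""
    p = min(max(similarity, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

With this loss, pushing the similarity of a same-meaning pair toward 1 and that of a different-meaning pair toward 0 both lower the loss, which is consistent with the stated training objective.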
Fourth, prediction result output module
For a medical term node $v$ to be normalized, the trained heterogeneous graph neural network computes the similarity between $v$ and the other medical term nodes in the medical term knowledge graph and sorts them; the medical term node $v^{*}$ with the maximum similarity to $v$ is taken:
$$v^{*} = \arg\max_{u \neq v} s(v, u)$$
A similarity threshold $\tau$ is set. If $s(v, v^{*})$ reaches the threshold $\tau$, then $v$ and $v^{*}$ are considered to have the same meaning, i.e. $v^{*}$ is the normalization result of $v$; otherwise, the meaning of $v$ is considered different from that of every other medical term node in the medical term knowledge graph, and $v$ has an independent meaning. This example uses a fixed value of $\tau$.
For example, when the drug term "potassium chloride needle (tsukau production) 10% 10ml by 1" is normalized, its similarity to the other drug term nodes is computed as shown in Table 4; the drug term node with the highest similarity, and hence the one with the same meaning, is "potassium chloride needle 10ml:1g tsukau pharmaceutical company limited".
Table 4. Medical term node similarity computed by the heterogeneous graph neural network
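The ranking and threshold decision of the prediction result output module can be sketched as below. The function `normalize_term`, the use of `>=` for the threshold comparison, and the threshold and similarity values in the usage note are assumptions for illustration, not values from the patent.

```python
from typing import Dict, List, Optional, Tuple


def normalize_term(similarities: Dict[str, float],
                   threshold: float) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Rank candidate terms by similarity and apply the threshold.

    Returns (best candidate or None, ranked list). None means the query
    term is treated as having an independent meaning.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None, ranked
    best, score = ranked[0]
    return (best if score >= threshold else None), ranked
```

As a usage example with made-up scores, `normalize_term({"term A": 0.91, "term B": 0.40}, 0.5)` selects "term A", whereas raising the threshold above 0.91 would return `None` and leave the query term as its own standard entry.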
The embodiment of the invention also provides a medical term normalization method based on a heterogeneous graph neural network, which comprises the following steps:
(1) Defining key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the containment relationships between the two levels of information units; identifying, at the character level, the information units contained in all medical terms by using a sequence labeling model, and constructing an information unit library; for the implementation of this step, refer to the information unit construction module.
(2) Constructing a medical term knowledge graph based on the relationships between the medical terms and the information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes, the edges are directed edges, and the edges express two relationships: the containment relationship between medical terms and information units, and the containment relationship between primary information units and secondary information units; the direction of each edge is from the containing side to the contained side.
(3) Training a heterogeneous graph neural network based on the adjacent node distribution and the node content encodings of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached from that node in one to two hops along the edge direction of the medical term knowledge graph; the node content encoding is specifically as follows:
for a node whose node content is a numerical value, the content encoding equals the product of the node's value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose node content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operator are obtained through heterogeneous graph neural network training, the semantic vectors of all basic units and operators contained in the node are concatenated, and the content encoding is obtained through a nonlinear transformation;
for a node whose node content is text, the content encoding is obtained through a pre-trained language model;
the first stage of training: taking the adjacent node distribution and the node content encodings as input, the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, which yields the vector representation of each node;
the second stage of training: taking the vector representations of the nodes as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity of medical term nodes with the same meaning;
for the implementation of this step, refer to the heterogeneous graph neural network module.
(4) Inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between the medical term node to be normalized and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result; for the implementation of this step, refer to the prediction result output module.
The invention defines and identifies the information units contained in multiple types of medical terms and thereby achieves a structured representation of medical terms. The structured representation not only improves the effect of medical term normalization but can also greatly advance many other aspects of medical informatization work. Based on the information units of medical terms, the invention constructs a novel knowledge graph for medical terms, which can effectively support various medical informatization tasks, including medical term normalization. The invention also constructs a novel heterogeneous graph neural network for medical term normalization, which normalizes different types of medical terms with a single unified model, implements a suitable content encoding for each type of information unit, and uses a staged training scheme for the heterogeneous graph neural network.
The foregoing is only a preferred embodiment of the present invention. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change, or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.
Claims (10)
1. A medical term normalization system based on a heterogeneous graph neural network, the system comprising:
(1) an information unit construction module: defining key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the containment relationships between the two levels of information units; identifying, at the character level, the information units contained in all medical terms by using a sequence labeling model, and constructing an information unit library;
(2) a medical term knowledge graph module: constructing a medical term knowledge graph based on the relationships between the medical terms and the information units, wherein the nodes of the knowledge graph comprise medical term nodes and information unit nodes, the edges are directed edges, and the edges express two relationships: the containment relationship between medical terms and information units, and the containment relationship between primary information units and secondary information units; the direction of each edge is from the containing side to the contained side;
(3) a heterogeneous graph neural network module: training a heterogeneous graph neural network based on the adjacent node distribution and the node content encodings of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached from that node in one to two hops along the edge direction of the medical term knowledge graph; the node content encoding is specifically as follows:
for a node whose node content is a numerical value, the content encoding equals the product of the node's value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose node content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operator are obtained through heterogeneous graph neural network training, the semantic vectors of all basic units and operators contained in the node are concatenated, and the content encoding is obtained through a nonlinear transformation;
for a node whose node content is text, the content encoding is obtained through a pre-trained language model;
the first stage of training: taking the adjacent node distribution and the node content encodings as input, the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, which yields the vector representation of each node;
the second stage of training: taking the vector representations of the nodes as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity of medical term nodes with the same meaning;
(4) a prediction result output module: inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between the medical term node to be normalized and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result.
2. The system of claim 1, wherein the types of medical terms include pharmaceutical terms, disease terms, surgical terms, test terms, and examination terms.
3. The system of claim 1, wherein in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model; the span of each information unit is labeled on the medical terms used as training data, and characters belonging to no information unit are labeled as well, so that the sequence labeling model can discard redundant characters that have no influence on the overall meaning of the medical term.
4. The system according to claim 1, wherein in the information unit construction module, the numerical values and measurement units are preliminarily normalized: the original measurement unit is normalized into a single basic unit, or into several basic units combined through operators, and the numerical value is converted accordingly.
5. The system of claim 1, wherein in the heterogeneous graph neural network module, $V$ denotes the set of all nodes in the medical term knowledge graph; for $v \in V$, $C_v$ denotes its node content and $f(C_v)$ its content encoding; for a node $v$ whose node content is a numerical value, the content encoding is:
$$f(C_v) = x_v \cdot e$$
where $x_v$ is the value of node $v$ itself, and $e$ denotes a unit vector that is randomly initialized and then learned through heterogeneous graph neural network training;
for a node $v$ whose node content is a measurement unit, the node content is a sequence $(u_1, u_2, \ldots, u_n)$ of basic units and operators, where $u_i$ is a basic unit or an operator and $n$ is the length of the sequence; the content encoding is:
$$f(C_v) = \sigma\big(W_u\,(e_{u_1} \oplus e_{u_2} \oplus \cdots \oplus e_{u_n})\big)$$
where $W_u$ is a parameter matrix obtained through heterogeneous graph neural network training; $e_{u_i}$ is the semantic vector of each basic unit or operator, randomly initialized and then learned through heterogeneous graph neural network training; $\oplus$ is the vector concatenation operator; and $\sigma$ is the nonlinear transformation;
6. The system of claim 5, wherein for a node $v$ whose node content is text, the pre-trained language model is a BERT model, and the calculation is as follows:
$$h^{k} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \qquad Q = h^{k-1} W_Q,\quad K = h^{k-1} W_K,\quad V = h^{k-1} W_V$$
where $h^{k}$ is the hidden state of layer $k$ of the BERT model; $Q$, $K$, and $V$ are the input values of layer $k$; $W_Q$, $W_K$, and $W_V$ are all parameters obtained by training; $d$ is the dimension of $K$; and $h^{k-1}$ is the hidden state of layer $k-1$ of the BERT model; if the BERT model has $m$ layers, the content encoding of node $v$ is initialized to $f(C_v) = h^{m}$.
7. The system according to claim 1, wherein in the heterogeneous graph neural network module, the vector representation of each node is computed from the content encodings of the node itself and of its adjacent nodes in the medical term knowledge graph; for a node $v$ of the medical term knowledge graph, $A(v)$ denotes the set of nodes directly pointed to by arrows starting from $v$; if $v$ is a medical term node, then $A(v)$ is the set of primary information units of $v$, and $\bigcup_{u \in A(v)} A(u)$ is the set of secondary information units of $v$; the set of adjacent nodes $N(v)$ of $v$ is defined as:
$$N(v) = A(v) \cup \bigcup_{u \in A(v)} A(u)$$
8. The system of claim 1, wherein in the heterogeneous graph neural network module, in the first stage of training, $\Theta$ denotes the set of trainable parameters, and the goal of training is to optimize the following objective function:
$$\max_{\Theta} \prod_{v \in V} \prod_{u \in N(v)} p(u \mid v; \Theta)$$
in the second stage of training, the similarity $s(v_i, v_j)$ of any two medical term nodes is computed from their vector representations,
where $v_i$ and $v_j$ are medical term nodes in the medical term knowledge graph, $s(v_i, v_j)$ is the similarity of $v_i$ and $v_j$, and $W$ and $b$ are parameters obtained by training;
in the medical term normalization training data, let $S(v_i)$ be the set of medical term nodes with the same meaning as medical term node $v_i$, and $D(v_i)$ the set of medical term nodes with a different meaning; the label $y_{ij}$ of a training sample $(v_i, v_j)$ is then:
$$y_{ij} = \begin{cases} 1, & v_j \in S(v_i) \\ 0, & v_j \in D(v_i) \end{cases}$$
the goal of the second stage is to minimize the loss function $L$ computed from the similarities $s(v_i, v_j)$ and the labels $y_{ij}$.
9. The system of claim 1, wherein in the prediction result output module, for a medical term node $v$ to be normalized, the trained heterogeneous graph neural network computes the similarity between $v$ and the other medical term nodes in the medical term knowledge graph and sorts them, and the medical term node $v^{*}$ with the maximum similarity to $v$ is taken:
$$v^{*} = \arg\max_{u \neq v} s(v, u)$$
10. A medical term normalization method based on a heterogeneous graph neural network is characterized by comprising the following steps:
(1) defining key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the containment relationships between the two levels of information units; identifying, at the character level, the information units contained in all medical terms by using a sequence labeling model, and constructing an information unit library;
(2) constructing a medical term knowledge graph based on the relationships between the medical terms and the information units, wherein the nodes of the knowledge graph comprise medical term nodes and information unit nodes, the edges are directed edges, and the edges express two relationships: the containment relationship between medical terms and information units, and the containment relationship between primary information units and secondary information units; the direction of each edge is from the containing side to the contained side;
(3) training a heterogeneous graph neural network based on the adjacent node distribution and the node content encodings of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached from that node in one to two hops along the edge direction of the medical term knowledge graph; the node content encoding is specifically as follows:
for a node whose node content is a numerical value, the content encoding equals the product of the node's value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose node content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operator are obtained through heterogeneous graph neural network training, the semantic vectors of all basic units and operators contained in the node are concatenated, and the content encoding is obtained through a nonlinear transformation;
for a node whose node content is text, the content encoding is obtained through a pre-trained language model;
the first stage of training: taking the adjacent node distribution and the node content encodings as input, the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, which yields the vector representation of each node;
the second stage of training: taking the vector representations of the nodes as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity of medical term nodes with the same meaning;
(4) inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between the medical term node to be normalized and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111213727.4A CN113656604B (en) | 2021-10-19 | 2021-10-19 | Medical term normalization system and method based on heterogeneous graph neural network |
JP2023536585A JP7432802B2 (en) | 2021-10-19 | 2022-09-05 | Medical terminology normalization system and method based on heterogeneous graph neural network |
PCT/CN2022/116967 WO2023065858A1 (en) | 2021-10-19 | 2022-09-05 | Medical term standardization system and method based on heterogeneous graph neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111213727.4A CN113656604B (en) | 2021-10-19 | 2021-10-19 | Medical term normalization system and method based on heterogeneous graph neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113656604A CN113656604A (en) | 2021-11-16 |
CN113656604B true CN113656604B (en) | 2022-02-22 |
Family
ID=78494655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111213727.4A Active CN113656604B (en) | 2021-10-19 | 2021-10-19 | Medical term normalization system and method based on heterogeneous graph neural network |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP7432802B2 (en) |
CN (1) | CN113656604B (en) |
WO (1) | WO2023065858A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113656604B (en) * | 2021-10-19 | 2022-02-22 | 之江实验室 | Medical term normalization system and method based on heterogeneous graph neural network |
CN114003791B (en) * | 2021-12-30 | 2022-04-08 | 之江实验室 | Depth map matching-based automatic classification method and system for medical data elements |
CN116386895B (en) * | 2023-04-06 | 2023-11-28 | 之江实验室 | Epidemic public opinion entity identification method and device based on heterogeneous graph neural network |
CN116312915B (en) * | 2023-05-19 | 2023-09-19 | 之江实验室 | Method and system for standardized association of drug terms in electronic medical records |
CN117009839B (en) * | 2023-09-28 | 2024-01-09 | 之江实验室 | Patient clustering method and device based on heterogeneous hypergraph neural network |
CN117497111B (en) * | 2023-12-25 | 2024-03-15 | 四川省医学科学院·四川省人民医院 | System for realizing disease name standardization and classification based on deep learning |
CN117688974B (en) * | 2024-02-01 | 2024-04-26 | 中国人民解放军总医院 | Knowledge graph-based generation type large model modeling method, system and equipment |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7788213B2 (en) | 2007-06-08 | 2010-08-31 | International Business Machines Corporation | System and method for a multiple disciplinary normalization of source for metadata integration with ETL processing layer of complex data across multiple claim engine sources in support of the creation of universal/enterprise healthcare claims record |
WO2018209254A1 (en) * | 2017-05-11 | 2018-11-15 | Hubspot, Inc. | Methods and systems for automated generation of personalized messages |
EP3637435A1 (en) * | 2018-10-12 | 2020-04-15 | Fujitsu Limited | Medical diagnostic aid and method |
US11381651B2 (en) * | 2019-05-29 | 2022-07-05 | Adobe Inc. | Interpretable user modeling from unstructured user data |
CN110349639B (en) * | 2019-07-12 | 2022-01-04 | 之江实验室 | Multi-center medical term standardization system based on general medical term library |
CN111400560A (en) * | 2020-03-10 | 2020-07-10 | 支付宝(杭州)信息技术有限公司 | Method and system for predicting based on heterogeneous graph neural network model |
CN112035451A (en) | 2020-08-25 | 2020-12-04 | 上海灵长软件科技有限公司 | Data verification optimization processing method and device, electronic equipment and storage medium |
CN112271001B (en) * | 2020-11-17 | 2022-08-16 | 中山大学 | Medical consultation dialogue system and method applying heterogeneous graph neural network |
CN112541056A (en) | 2020-12-18 | 2021-03-23 | 卫宁健康科技集团股份有限公司 | Medical term standardization method, device, electronic equipment and storage medium |
CN112542223A (en) * | 2020-12-21 | 2021-03-23 | 西南科技大学 | Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record |
CN113010685B (en) | 2021-02-23 | 2022-12-06 | 安徽讯飞医疗股份有限公司 | Medical term standardization method, electronic device, and storage medium |
CN113191156A (en) * | 2021-04-29 | 2021-07-30 | 浙江禾连网络科技有限公司 | Medical examination item standardization system and method based on medical knowledge graph and pre-training model |
CN113377897B (en) * | 2021-05-27 | 2022-04-22 | 杭州莱迈医疗信息科技有限公司 | Multi-language medical term standard standardization system and method based on deep confrontation learning |
CN113345545B (en) | 2021-07-28 | 2021-10-29 | 北京惠每云科技有限公司 | Clinical data checking method and device, electronic equipment and readable storage medium |
CN113436698B (en) | 2021-08-27 | 2021-12-07 | 之江实验室 | Automatic medical term standardization system and method integrating self-supervision and active learning |
CN113656604B (en) * | 2021-10-19 | 2022-02-22 | 之江实验室 | Medical term normalization system and method based on heterogeneous graph neural network |
2021
- 2021-10-19 CN CN202111213727.4A patent/CN113656604B/en active Active
2022
- 2022-09-05 WO PCT/CN2022/116967 patent/WO2023065858A1/en active Application Filing
- 2022-09-05 JP JP2023536585A patent/JP7432802B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN113656604A (en) | 2021-11-16 |
WO2023065858A1 (en) | 2023-04-27 |
JP7432802B2 (en) | 2024-02-16 |
JP2024500400A (en) | 2024-01-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||