CN113656604B - Medical term normalization system and method based on heterogeneous graph neural network - Google Patents

Info

Publication number
CN113656604B
Authority
CN
China
Prior art keywords
nodes
node
medical term
medical
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111213727.4A
Other languages
Chinese (zh)
Other versions
CN113656604A (en)
Inventor
李劲松
杨宗峰
辛然
田雨
周天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202111213727.4A
Publication of CN113656604A
Application granted
Publication of CN113656604B
Priority to JP2023536585A (JP7432802B2)
Priority to PCT/CN2022/116967 (WO2023065858A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Abstract

The invention discloses a medical term normalization system and method based on a heterogeneous graph neural network. A heterogeneous graph neural network covering multiple types of medical terms is constructed on the basis of a knowledge graph, and both the adjacent-node distribution and the node content encoding of the graph are taken into account when training the network for medical term normalization. The invention makes full use of the knowledge contained in the associations and differences among information units of the same type of medical term, accommodates multiple types of medical terms at the same time, and can therefore learn medical domain knowledge comprehensively. New types of medical terms can be added to the system conveniently, which reduces the workload of normalizing them.

Description

Medical term normalization system and method based on heterogeneous graph neural network
Technical Field
The invention belongs to the technical field of Chinese medical term normalization and multi-center medical information platforms, and particularly relates to a medical term normalization system and method based on a heterogeneous graph neural network.
Background
An important research direction in medical informatization is the application of higher-performance machine learning and artificial intelligence techniques to practical clinical problems. One advantage of artificial intelligence is that it can discover complex rules and features in massive data. Consequently, pooling the medical data of multiple medical institutions for analysis, mining and model design, in support of medical research and clinical decision-making, has become an inevitable trend in medical informatization. However, different medical institutions adopt many different information standards, and semi-structured and unstructured data are frequently produced by hand, which makes integrating medical data from different sources extremely difficult. Medical terms are the basic elements of medical data. Establishing a sound medical term normalization system allows medical data from different sources to be aligned to a unified standard and structure, providing larger-scale and higher-quality data for clinical decision-making and medical research. Medical terms mainly include terms for drugs, medical examinations, diseases and the like that are generated during clinical practice. Each type of medical term carries information along particular key dimensions, which are defined here as the information units of the medical term. For example, the pharmaceutical term "5% glucose injection (base) 500 ml" contains the information units shown in Table 1:
Table 1. Example information units of a pharmaceutical term
[Table 1 appears as an image in the original publication.]
The examination term "left-hand finger frontal view_X" contains the information units shown in Table 2:
Table 2. Example information units of an examination term
[Table 2 appears as an image in the original publication.]
Some information units are themselves composed of finer-grained information units; the former are defined as primary information units and the latter as secondary information units. For example, the pharmaceutical term in Table 1 comprises the primary information units "pharmaceutical composition", "pharmaceutical dosage form", "pharmaceutical dosage" and "pharmaceutical specification", of which the "pharmaceutical specification" unit is composed of the secondary information units "number" (500) and "dosing unit" (ml). Given the set of information units of a medical term, the complete medical term can be determined.
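The two-level structure described above can be illustrated with a small data model. This is only a sketch: the class names `InfoUnit` and `MedicalTerm` and the helper `flatten` are hypothetical, and only the specification unit, whose decomposition is spelled out in the text, is filled in.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class InfoUnit:
    """An information unit; `children` holds its secondary units, if any."""
    name: str
    value: str
    children: List["InfoUnit"] = field(default_factory=list)


@dataclass
class MedicalTerm:
    text: str
    category: str
    units: List[InfoUnit]


def flatten(units: List[InfoUnit], level: int = 1) -> Iterator[Tuple[int, str, str]]:
    """Yield (level, name, value) for each unit, parents before children."""
    for unit in units:
        yield level, unit.name, unit.value
        yield from flatten(unit.children, level + 1)


# Only the specification unit is spelled out in the text; the other primary
# units of Table 1 are omitted here rather than guessed.
term = MedicalTerm(
    text="5% glucose injection (base) 500 ml",
    category="drug",
    units=[
        InfoUnit("pharmaceutical specification", "500 ml", children=[
            InfoUnit("number", "500"),
            InfoUnit("dosing unit", "ml"),
        ]),
    ],
)

for level, name, value in flatten(term.units):
    print("  " * (level - 1) + f"{name}: {value}")
```

Level 1 entries correspond to primary information units and level 2 entries to secondary information units.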
In actual clinical practice, differences in the information standards adopted by medical institutions, the personal habits of medical staff and similar factors give rise to a large number of non-standard medical terms. The main problems are redundant or missing key information units, irregular expressions, and inconsistent measurement units. For example, the following medical terms have the same meaning but differ considerably in form: "levofloxacin tablet (clonidine) 500 mg" and "clonidine 0.5 g/tablet". The aim of medical term normalization is to identify medical terms that have the same meaning but different literal forms so that their expression can be unified, to distinguish medical terms with different meanings, and ultimately to promote the normalization of medical data as a whole.
Traditional medical term normalization methods address a single category of medical terms: the meaning of each term is interpreted by machine learning or manual review, and terms with the same semantics are labeled. Such methods treat each medical term as a whole and ignore the structure of the information units inherent in it. Their main disadvantages are: (1) Knowledge about the associations and differences among information units cannot be exploited effectively. The associations and differences among information units of different dimensions within the same medical term can carry rich medical domain knowledge, which existing practice neither structures explicitly nor uses. (2) Different types of medical terms can contain the same or related information units, yet traditional normalization work develops a separate system for each category of medical terms. This makes the workload excessive and prevents the knowledge in the information units of different types of medical terms from being used together. (3) Superfluous information is taken into account. Because of irregular expression and similar causes, most medical terms contain redundant characters in addition to the key information units. These characters have little bearing on the overall meaning of the term and act as noise that distorts its interpretation.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing medical term normalization methods by providing, on the basis of the characteristics of medical terms, a medical term normalization system and method based on a heterogeneous graph neural network. The invention constructs a novel knowledge graph, built on information units, for all medical terms, and normalizes the medical terms on this knowledge graph with an improved heterogeneous graph neural network. The knowledge contained in the information units of medical terms is thereby used effectively, and a more accurate normalization result is obtained.
The purpose of the invention is achieved through the following technical scheme. To make full use of the medical domain knowledge contained in medical terms during normalization, the invention first constructs key information units for the various types of medical terms, giving the terms a structured representation, and builds a knowledge graph containing the various types of medical terms from these information units. A heterogeneous graph neural network covering the various types of medical terms is then constructed on the knowledge graph, and both the adjacent-node distribution and the node content encoding of the graph are taken into account when training the network for medical term normalization. In this way, the invention makes full use of the knowledge in the associations and differences among information units of the same type of medical term, accommodates multiple types of medical terms in one system, learns medical domain knowledge comprehensively, and allows new types of medical terms to be added conveniently, reducing the workload of normalizing them. Redundant characters and information can be discarded when the information units are extracted from the medical terms, which avoids excessive noise and error.
In one aspect, the invention discloses a medical term normalization system based on a heterogeneous graph neural network, comprising:
(1) an information unit construction module: key information units are defined for each type of medical term; the information units comprise primary information units, secondary information units, and the inclusion relationships between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms, and an information unit library is constructed;
(2) a medical term knowledge graph module: a medical term knowledge graph is constructed from the relationships between medical terms and information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and express two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;
(3) a heterogeneous graph neural network module: a heterogeneous graph neural network is trained on the adjacent-node distribution and the node content encoding of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached within two hops from that node along the direction of the edges of the medical term knowledge graph; the node content encoding is specified as follows:
for a node whose content is a numerical value, the content encoding equals the product of the node's value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training; the semantic vectors of all basic units and operation symbols contained in the node are concatenated; and the content encoding is obtained through a nonlinear transformation;
for a node whose content is text, the content encoding is obtained through a pre-trained language model;
the first training stage: the adjacent-node distribution and the content encodings of the nodes are taken as input, and the training objective is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation of each node;
the second training stage: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) a prediction result output module: the medical term node to be normalized is input into the trained heterogeneous graph neural network, a similarity ranking between that node and the other medical term nodes in the medical term knowledge graph is obtained, and the medical term normalization result is output.
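The knowledge graph of module (2) and the two-hop adjacency of module (3) can be sketched as follows. The class `TermGraph`, its method names and the node identifiers are hypothetical; the sketch assumes that information unit nodes are identified by their content, so that terms containing the same unit share a node.

```python
from collections import defaultdict


class TermGraph:
    """Directed graph whose edges point from the containing node to the contained node."""

    def __init__(self):
        self.node_type = {}                # node id -> "term" | "primary" | "secondary"
        self.out_edges = defaultdict(set)  # containing node -> contained nodes

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_edge(self, container, contained):
        self.out_edges[container].add(contained)

    def adjacent(self, node, max_hops=2):
        """All nodes reached within `max_hops` hops along the edge direction."""
        reached, frontier = set(), {node}
        for _ in range(max_hops):
            frontier = {m for n in frontier for m in self.out_edges.get(n, ())}
            frontier -= reached | {node}
            reached |= frontier
        return reached


g = TermGraph()
term = "5% glucose injection (base) 500 ml"
g.add_node(term, "term")
g.add_node("specification: 500 ml", "primary")
g.add_node("number: 500", "secondary")
g.add_node("dosing unit: ml", "secondary")
g.add_edge(term, "specification: 500 ml")
g.add_edge("specification: 500 ml", "number: 500")
g.add_edge("specification: 500 ml", "dosing unit: ml")

print(sorted(g.adjacent(term)))
```

For a medical term node, the first hop reaches its primary information units and the second hop reaches their secondary information units.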
Further, the types of medical terms include pharmaceutical terms, disease terms, surgical terms, test terms, and examination terms.
Further, in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model. The span of each information unit in a medical term is labeled as training data, and the characters that do not belong to any information unit are labeled as well, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the medical term.
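One common way to prepare such character-level training data is the BIO scheme, sketched below. The use of BIO tags, the function name `spans_to_bio` and the label names are assumptions made for illustration; the annotation format actually used for the sequence labeling model is not specified here.

```python
def spans_to_bio(text, spans):
    """Convert labeled character spans into per-character BIO tags.

    spans: list of (start, end_exclusive, label). Characters outside every
    span are tagged "O", i.e. they belong to no information unit.
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


text = "glucose 500ml"
# Hypothetical labels: COMP = composition, NUM = number, UNIT = dosing unit.
spans = [(0, 7, "COMP"), (8, 11, "NUM"), (11, 13, "UNIT")]
tags = spans_to_bio(text, spans)
for ch, tag in zip(text, tags):
    print(repr(ch), tag)
```

Characters tagged "O" (here the space) are the ones a trained model would learn to discard.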
Further, in the information unit construction module, numerical values and measurement units are preliminarily normalized: the original measurement unit is normalized into a single basic unit, or into several basic units combined through operation symbols, and the numerical value is converted accordingly.
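A minimal sketch of this preliminary normalization is given below. The conversion table `TO_BASIC`, the choice of "g" and "ml" as basic units, and the restriction to the operation symbols "*" and "/" are all assumptions made for illustration.

```python
import re

# Hypothetical conversion table: original unit -> (basic unit, multiplier).
TO_BASIC = {
    "g": ("g", 1.0),
    "mg": ("g", 0.001),
    "ml": ("ml", 1.0),
    "l": ("ml", 1000.0),
    "tablet": ("tablet", 1.0),
}


def normalize_quantity(value, unit):
    """Return (converted value, list of basic units and operation symbols)."""
    # The capturing group keeps the operation symbols in the split result.
    parts = re.split(r"([*/])", unit.replace(" ", "").lower())
    factor, symbols, op = 1.0, [], "*"
    for part in parts:
        if part in ("*", "/"):
            op = part
            symbols.append(part)
        else:
            basic, multiplier = TO_BASIC[part]
            factor = factor * multiplier if op == "*" else factor / multiplier
            symbols.append(basic)
    return value * factor, symbols


print(normalize_quantity(500, "mg"))
print(normalize_quantity(0.5, "g/tablet"))
```

After this step, "500 mg" and "0.5 g/tablet" both carry the value 0.5 with gram as the basic mass unit, which makes the two quantities directly comparable.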
Further, in the heterogeneous graph neural network module, let $V$ denote the set of all nodes in the medical term knowledge graph. For $v \in V$, let $c_v$ denote its node content and $e_v$ its content encoding. For a node $v$ whose content is a numerical value, the content encoding is:

$$e_v = c_v \cdot \boldsymbol{\mu}$$

where $c_v$ is the numerical value of node $v$ itself, and $\boldsymbol{\mu}$ is a unit vector that is randomly initialized and obtained through heterogeneous graph neural network training.

For a node $v$ whose content is a measurement unit, the node content is a sequence of basic units and operation symbols $c_v = (s_1, s_2, \dots, s_n)$, where each $s_i$ is a basic unit or an operation symbol. The content encoding of $v$ is:

$$e_v = \sigma\big(W_c\,(x_{s_1} \oplus x_{s_2} \oplus \cdots \oplus x_{s_n})\big)$$

where $W_c$ is a parameter matrix obtained through heterogeneous graph neural network training; $x_{s_i}$ is the semantic vector of the basic unit or operation symbol $s_i$, randomly initialized and obtained through heterogeneous graph neural network training; $\oplus$ is the vector concatenation operator; and $\sigma$ denotes the nonlinear transformation.

For a node $v$ whose content is text, a pre-trained language model is used to compute the semantic vector of $v$, which serves as the content encoding of $v$ and continues to be trained through the subsequent heterogeneous graph neural network.
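The numerical and measurement-unit encodings described above can be sketched in plain Python. The embedding size, the use of tanh as the nonlinear transformation, the zero-padding of short unit sequences to a fixed length and all variable names are assumptions; the random values stand in for parameters that would be learned during training.

```python
import math
import random

DIM = 4          # embedding size (assumption)
MAX_SYMBOLS = 3  # longest unit sequence supported by the matrix W (assumption)
rng = random.Random(0)


def random_vector(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


# Randomly initialized stand-ins for trainable parameters.
unit_vector = random_vector(DIM)
symbol_vectors = {s: random_vector(DIM) for s in ("g", "ml", "tablet", "/", "*")}
W = [random_vector(MAX_SYMBOLS * DIM) for _ in range(DIM)]


def encode_number(value):
    """Content encoding of a numerical node: value times the unit vector."""
    return [value * x for x in unit_vector]


def encode_unit(symbols):
    """Content encoding of a measurement-unit node: concatenate, project, squash."""
    if len(symbols) > MAX_SYMBOLS:
        raise ValueError("unit sequence too long for this sketch")
    concat = [x for s in symbols for x in symbol_vectors[s]]
    concat += [0.0] * (MAX_SYMBOLS * DIM - len(concat))  # zero-padding (assumption)
    return [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W]


print(encode_number(0.5))
print(encode_unit(["g", "/", "tablet"]))
```

Because the numerical encoding is linear in the value, nodes with close values receive close encodings, which a purely textual encoding of digits would not guarantee.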
Further, for a node $v$ whose node content is of text type, the pre-trained language model is a BERT model, and the computation is as follows:

[formula image 017 in the original publication]

where one symbol denotes the hidden state of a given layer of the BERT model and another denotes the input to a given layer:

[formula image 022 in the original publication]

The two parameters appearing in this expression are obtained through training, and the expression also uses the dimension of one of its terms. The hidden state of the $k$-th layer of the BERT model is defined in this way; if the BERT model has $m$ layers in total, node $v$ is initialized to:

[formula image 028 in the original publication]
Further, in the heterogeneous graph neural network module, the vector representation of each node is computed from the content encodings of the node and of its adjacent nodes in the medical term knowledge graph. For a node $v$ of the medical term knowledge graph, let $N_1(v)$ denote the set of nodes directly pointed to by the edges starting from $v$, and let $N_2(v)$ denote the set of nodes directly pointed to by the edges starting from the nodes in $N_1(v)$. If $v$ is a medical term node, then $N_1(v)$ is the set of primary information units of $v$ and $N_2(v)$ is the set of secondary information units of $v$. The set of adjacent nodes of $v$ is defined as:

$$N(v) = N_1(v) \cup N_2(v)$$

The vector representation $h_v$ of $v$ is then computed as:

[formula image 034 in the original publication]

where the weight parameters are computed as:

[formula image 036 in the original publication]

in which three parameter matrices are obtained through training and a nonlinear activation function is applied.
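The general idea of combining a node's content encoding with those of its adjacent nodes through learned weights can be sketched as follows. The exact weighting formula is not reproduced in this text, so the sketch substitutes a generic softmax attention: the scoring function, the use of tanh, and the parameters `W` and `a` are assumptions and may differ from the formula in the publication.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def aggregate(e_v, neighbor_encodings, W, a):
    """Weighted sum of the encodings of v and its adjacent nodes.

    Assumed scoring: score(u) = a . tanh(W [e_v ; e_u]); weights = softmax(scores).
    """
    candidates = [e_v] + list(neighbor_encodings)
    scores = []
    for e_u in candidates:
        hidden = [math.tanh(x) for x in matvec(W, e_v + e_u)]  # list concatenation
        scores.append(sum(ai * hi for ai, hi in zip(a, hidden)))
    alpha = softmax(scores)
    h_v = [sum(w * e_u[i] for w, e_u in zip(alpha, candidates))
           for i in range(len(e_v))]
    return h_v, alpha


# With an all-zero W every score is 0, so the weights are uniform.
W = [[0.0] * 4 for _ in range(2)]
a = [1.0, 1.0]
h_v, alpha = aggregate([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]], W, a)
print(h_v, alpha)
```

In training, `W` and `a` would be updated so that informative adjacent nodes receive larger weights than uninformative ones.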
Further, in the heterogeneous graph neural network module, in the first training stage, let $\Theta$ denote the set of trainable parameters. The goal of training is to optimize the following objective function:

$$\max_{\Theta} \prod_{v \in V} \prod_{u \in N(v)} p(u \mid v; \Theta)$$

where $p(u \mid v; \Theta)$ is the probability of predicting the adjacent node $u$ from node $v$.

In the second training stage, the similarity between any two medical term nodes is computed as:

[formula image 045 in the original publication]

where $v$ and $u$ are medical term nodes in the medical term knowledge graph, $s(v, u)$ is the similarity of $v$ and $u$, and $W$ and $b$ are parameters obtained through training.

In the training data for medical term normalization, let $S_v$ denote the set of medical term nodes with the same meaning as the medical term node $v$, and $D_v$ the set of medical term nodes with a different meaning from $v$. The label $y$ of a training sample is then:

[formula image 051 in the original publication]

The goal of the second stage is to minimize the loss function $L$:

[formula image 052 in the original publication]
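The second training stage can be sketched as a pairwise similarity model with a classification loss. The bilinear form passed through a sigmoid, the labels 1 (same meaning) and 0 (different meaning), and the binary cross-entropy loss are all assumptions chosen for illustration; only the presence of trained parameters W and b, a label per sample and a loss L to minimize comes from the text.

```python
import math


def similarity(h_v, h_u, W, b):
    """Assumed form: sigmoid(h_v^T W h_u + b)."""
    bilinear = sum(h_v[i] * W[i][j] * h_u[j]
                   for i in range(len(h_v)) for j in range(len(h_u)))
    return 1.0 / (1.0 + math.exp(-(bilinear + b)))


def bce_loss(pairs, W, b, eps=1e-12):
    """Mean binary cross-entropy over (h_v, h_u, label) triples.

    label = 1 for medical term nodes with the same meaning, 0 otherwise (assumption).
    """
    total = 0.0
    for h_v, h_u, y in pairs:
        s = similarity(h_v, h_u, W, b)
        total -= y * math.log(s + eps) + (1 - y) * math.log(1.0 - s + eps)
    return total / len(pairs)


identity = [[1.0, 0.0], [0.0, 1.0]]
pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1),  # same meaning
    ([1.0, 0.0], [0.0, 1.0], 0),  # different meaning
]
print(similarity([1.0, 0.0], [1.0, 0.0], identity, 0.0))
print(bce_loss(pairs, identity, 0.0))
```

Minimizing this loss pushes the similarity of same-meaning pairs toward 1 and that of different-meaning pairs toward 0.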
Further, in the prediction result output module, for a medical term node $q$ to be normalized, the trained heterogeneous graph neural network computes the similarity between $q$ and every other medical term node in the medical term knowledge graph and ranks the results. The medical term node $u^*$ with the greatest similarity to $q$ is taken:

$$u^* = \arg\max_{u} s(q, u)$$

A similarity threshold $\tau$ is set. If $s(q, u^*) \ge \tau$, then $q$ and $u^*$ are considered to have the same meaning, and $u^*$ is the normalization result of $q$. Otherwise, the meaning of $q$ is considered different from that of every other medical term node in the medical term knowledge graph, and $q$ has an independent meaning.
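The ranking and threshold decision can be sketched as follows. The function name `normalize_term`, the fixed score table that stands in for the trained network, and the use of "greater than or equal to" at the threshold are assumptions.

```python
def normalize_term(query, candidates, similarity_fn, threshold):
    """Rank candidates by similarity to `query` and apply the threshold.

    Returns (match, ranking): `match` is the best candidate if its similarity
    reaches the threshold, otherwise None (the query has an independent meaning).
    """
    ranking = sorted(
        ((similarity_fn(query, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best_score, best = ranking[0]
    return (best if best_score >= threshold else None), ranking


# Hypothetical similarity scores standing in for the trained network.
scores = {"term A": 0.92, "term B": 0.40}
sim = lambda q, c: scores[c]

match, ranking = normalize_term("query term", ["term A", "term B"], sim, threshold=0.8)
print(match, ranking)
```

Raising the threshold makes the system more conservative: more input terms are treated as having an independent meaning rather than being mapped to an existing term.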
In another aspect, the invention discloses a medical term normalization method based on a heterogeneous graph neural network, comprising the following steps:
(1) key information units are defined for each type of medical term; the information units comprise primary information units, secondary information units, and the inclusion relationships between the two levels; a sequence labeling model identifies, at the character level, the information units contained in all medical terms, and an information unit library is constructed;
(2) a medical term knowledge graph is constructed from the relationships between medical terms and information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and express two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;
(3) a heterogeneous graph neural network is trained on the adjacent-node distribution and the node content encoding of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached within two hops from that node along the direction of the edges of the medical term knowledge graph; the node content encoding is specified as follows:
for a node whose content is a numerical value, the content encoding equals the product of the node's value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training; the semantic vectors of all basic units and operation symbols contained in the node are concatenated; and the content encoding is obtained through a nonlinear transformation;
for a node whose content is text, the content encoding is obtained through a pre-trained language model;
the first training stage: the adjacent-node distribution and the content encodings of the nodes are taken as input, and the training objective is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation of each node;
the second training stage: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
(4) the medical term node to be normalized is input into the trained heterogeneous graph neural network, a similarity ranking between that node and the other medical term nodes in the medical term knowledge graph is obtained, and the medical term normalization result is output.
The beneficial effects of the invention are as follows. The invention defines a unified information unit structure for different types of medical terms and achieves a relatively uniform structured representation. Medical domain knowledge can therefore be used more effectively during medical term normalization, and the associations and differences among information units can be learned both within a single type of medical term and across different types. By integrating all medical terms into one knowledge graph, a single heterogeneous graph neural network normalizes different types of medical terms, which improves the efficiency of the normalization work as well as the completeness and consistency of the output.
Drawings
FIG. 1 is a block diagram of a medical term normalization system based on a heterogeneous graph neural network according to an embodiment of the present invention;
FIG. 2 shows training data for the sequence labeling model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a medical term knowledge graph according to an embodiment of the present invention.
Detailed Description
To make the above objects, features and advantages of the present invention easier to understand, specific embodiments are described in detail below with reference to the accompanying drawings.
Numerous specific details are set forth in the following description to provide a thorough understanding of the present invention. The present invention may, however, be practiced in ways other than those specifically described here, as will be readily apparent to those of ordinary skill in the art, without departing from the spirit of the present invention. The present invention is therefore not limited to the specific embodiments disclosed below.
In the present invention, medical term normalization refers to the process of analyzing the various medical terms generated in real clinical settings by combining medical domain knowledge with natural language processing methods, identifying medical terms with the same meaning, distinguishing medical terms with different meanings, and unifying medical terms within a given scope so as to obtain the best order and social benefit. Establishing unified medical term standards and term sets helps to resolve problems such as duplicated terms and inconsistent connotation, semantic expression and understanding, and is of great significance for promoting the dissemination, sharing and use of medical information more widely and at a deeper level.
A heterogeneous graph neural network is understood as follows. Traditional deep learning methods have achieved great success on linear and matrix-shaped data, but the data in many practical application scenarios have a graph structure. In recent years, researchers have drawn on the ideas of convolutional networks and recurrent networks to define and design graph neural network models for processing graph data. Common graph neural networks target graphs with a single type of node and a single type of relation, and can achieve good performance using only the adjacent-node information of the graph. In contrast, real-world graph data usually contain many types of nodes and relations that differ greatly from one another; such a graph is called a heterogeneous graph. When training a heterogeneous graph neural network, because the contents of different node types differ greatly in their features and information dimensions, the content encoding of the nodes must be considered in addition to the adjacent-node information of the graph.
An embodiment of the invention provides a medical term normalization system based on a heterogeneous graph neural network which, as shown in FIG. 1, comprises the following modules:
I. Information unit construction module, comprising:
(1) defining key information units for each type of medical term; the types of medical terms include pharmaceutical terms, disease terms, surgical terms, test terms and examination terms; the information units comprise primary information units, secondary information units, and the inclusion relationships between the primary and secondary information units;
(2) identifying, at the character level, the information units contained in all medical terms with a sequence labeling model, and constructing an information unit library;
II. Medical term knowledge graph module: a medical term knowledge graph is constructed from the relationships between medical terms and information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and express two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;
III. Heterogeneous graph neural network module: a heterogeneous graph neural network is trained on the adjacent-node distribution and the node content encoding of the medical term knowledge graph;
the adjacent nodes of a node are all nodes passed through when starting from that node and moving two hops along the direction of the edges of the medical term knowledge graph;
the node content encoding is specified as follows:
for a node whose content is a numerical value, the content encoding equals the product of the node's value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose content is a measurement unit, the content encoding is computed as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training; the semantic vectors of all basic units and operation symbols contained in the node are concatenated; and the content encoding is obtained through a nonlinear transformation;
for a node whose content is text, the content encoding is obtained through a pre-trained language model;
the first training stage: the adjacent-node distribution and the content encodings of the nodes are taken as input, and the training objective is to maximize the conditional probability of each node's adjacent nodes given that node, yielding a vector representation of each node;
the second training stage: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is computed, and the training objective is to maximize the similarity between medical term nodes with the same meaning;
IV. Prediction result output module: the medical term node to be normalized is input into the trained heterogeneous graph neural network, a similarity ranking between that node and the other medical term nodes in the medical term knowledge graph is obtained, and the medical term normalization result is output.
The implementation process of each module is described in detail as follows:
I. Information unit construction module
(1) Defining the information units of medical terms. Several internationally used medical terminology standard sets already exist, each defining key-dimension information units for medical terms of a single category. However, no correlations have been established between the information units of the different types of standard sets, so the information used in earlier medical term normalization was confined to a single category of medical terms and a large amount of useful information was ignored. The invention combines the existing international medical terminology standard sets with expert knowledge from actual clinical practice, uniformly defines key information units for the various types of medical terms, and defines a detailed structure of primary and secondary information units. The types of medical terms implemented so far include drug terms, disease terms, surgical terms, test terms and examination terms. If a new type of medical term later needs to be normalized, it can easily be added to the system once its information units have been defined. The information units of the implemented medical terms are defined in Table 3.
Table 3. Information units of medical terms
[Table 3 is provided as an image in the original publication.]
(2) Constructing the information unit library. A sequence labeling model predicts the probability that each character of a medical term belongs to each information unit, thereby identifying all information units contained in the medical term and yielding a structured representation of it. The sequence labeling model used in this embodiment is a BiLSTM-CRF model: the BiLSTM network first captures the context information of the medical term; state probability and transition probability matrices are then built from the BiLSTM output at each character position, and a CRF model is constructed on them, which gives better results on the sequence labeling task. The process of constructing training data for the sequence labeling model is shown in Fig. 2: the span of each information unit is labeled on the medical terms used as training data, and characters that do not belong to any information unit are labeled as well, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the medical term and avoid introducing excessive noise into the subsequent heterogeneous graph neural network.
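The labeling of training data described above can be sketched as a conversion from annotated spans to one tag per character. This is only an illustration under stated assumptions: the BIO tag scheme, the function name `spans_to_tags`, the example term and the information unit names are all hypothetical and are not taken from the patent, which only states that information unit spans and non-information-unit characters are both labeled.

```python
from typing import List, Tuple


def spans_to_tags(term: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, unit_name) spans (end exclusive) into per-character tags.

    Characters outside every span receive "O", marking them as
    non-information-unit characters that a tagger can learn to discard.
    """
    tags = ["O"] * len(term)
    for start, end, unit in spans:
        if not (0 <= start < end <= len(term)):
            raise ValueError(f"span ({start}, {end}) is outside the term")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("information unit spans must not overlap")
        tags[start] = f"B-{unit}"          # first character of the unit
        for i in range(start + 1, end):
            tags[i] = f"I-{unit}"          # remaining characters of the unit
    return tags


# Hypothetical example: an ingredient, a filler word, a value and a unit.
term = "KCl inj 10ml"
spans = [(0, 3, "ingredient"), (8, 10, "value"), (10, 12, "unit")]
tags = spans_to_tags(term, spans)
```

Tag sequences of this kind are the usual training target for a BiLSTM-CRF tagger; the "O" characters correspond to the redundant characters the text says the model should learn to drop.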
(3) Note that in Table 3 every primary information unit includes numerical value and measurement unit secondary information units. The original numerical values and measurement units in medical terms span a wide range and are sparsely distributed, which makes the heterogeneous graph neural network harder to train. To address this, the numerical values and measurement units are first preliminarily normalized: each original measurement unit is normalized into a single basic unit or into several basic units combined through operation symbols, and the numerical value is converted accordingly. The basic units are: ml (milliliter), mg (milligram), mm (millimeter), s (second), mol (amount of substance), u (unit), iu (international unit), count, type, stage and period; the operation symbols are multiplication and division. In total, 90 normalized measurement units are produced. For example, if the original measurement unit is l (liter) with a value of 1, the normalized measurement unit is ml (milliliter) and the value is converted to 1000.
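The preliminary normalization of values and measurement units can be sketched as a table lookup plus a conversion factor. In the sketch below only the row l to ml (factor 1000) comes from the text; the other table rows, the function name `normalize_measure` and the "/" and "*" string syntax for compound units are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Original unit -> (basic unit, multiplicative factor).
# Only "l" -> "ml" (x1000) is stated in the text; the other rows are assumptions.
TO_BASE: Dict[str, Tuple[str, float]] = {
    "l": ("ml", 1000.0),
    "ml": ("ml", 1.0),
    "g": ("mg", 1000.0),
    "mg": ("mg", 1.0),
    "cm": ("mm", 10.0),
    "mm": ("mm", 1.0),
    "min": ("s", 60.0),
    "s": ("s", 1.0),
}


def normalize_measure(value: float, unit: str) -> Tuple[float, str]:
    """Normalize a value with a simple or compound unit such as "g/l".

    Units before the first "/" multiply the factor, units after it divide it.
    An unknown unit raises KeyError.
    """
    factor = 1.0
    parts: List[str] = []
    for group_index, group in enumerate(unit.split("/")):
        names: List[str] = []
        for name in group.split("*"):
            base, f = TO_BASE[name.strip().lower()]
            factor = factor * f if group_index == 0 else factor / f
            names.append(base)
        parts.append("*".join(names))
    return value * factor, "/".join(parts)
```

With this table, `normalize_measure(1, "l")` gives the value 1000 with unit `ml`, matching the example in the text, and a compound unit such as `g/l` becomes `mg/ml` with an unchanged value.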
II. Medical term knowledge graph module
Based on the information unit library built by the information unit construction module, a knowledge graph containing the various types of medical terms is constructed, as shown in Fig. 3. It contains two major types of nodes: circular nodes represent medical term nodes and rectangular nodes represent information unit nodes. Each major type comprises several subtypes; for example, medical term nodes include "drug term" nodes, "disease term" nodes and so on, and information unit nodes include "drug dose" nodes, "numerical value" nodes and so on. The edges represent two relationships: 1) the inclusion relationship between a medical term and its information units; 2) the inclusion relationship between a primary information unit and its secondary information units. How information units are divided into primary and secondary may differ between types of medical terms: for disease terms, for example, "disease subject" is a primary information unit, whereas for surgical terms "disease subject" is a secondary information unit contained in the primary information unit "disease property".
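A minimal sketch of such a graph, assuming a plain adjacency-set representation, is given below. The class name `TermGraph`, the node identifiers and the example drug term are hypothetical; only the typed nodes and the container-to-contained edge direction follow the text.

```python
from collections import defaultdict
from typing import Dict, Set


class TermGraph:
    """Directed graph with typed nodes; edges point from container to contained."""

    def __init__(self) -> None:
        self.node_type: Dict[str, str] = {}
        self.out_edges: Dict[str, Set[str]] = defaultdict(set)

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_type[node_id] = node_type

    def add_contains(self, container: str, contained: str) -> None:
        """Add a directed inclusion edge container -> contained."""
        if container not in self.node_type or contained not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[container].add(contained)


# Hypothetical fragment: one drug term, one primary unit, two secondary units.
g = TermGraph()
g.add_node("term:kcl-inj-10ml", "drug term")
g.add_node("unit:dose", "drug dose")
g.add_node("unit:10", "numerical value")
g.add_node("unit:ml", "measurement unit")
g.add_contains("term:kcl-inj-10ml", "unit:dose")   # medical term -> primary unit
g.add_contains("unit:dose", "unit:10")             # primary unit -> secondary unit
g.add_contains("unit:dose", "unit:ml")
```

Storing the node type next to each node keeps the graph heterogeneous: later steps can choose a content encoding method by looking up `node_type`.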
III. Heterogeneous graph neural network module
(1) A heterogeneous graph is a graph with multiple types of nodes and relationships; the medical term knowledge graph shown in Fig. 3 is a heterogeneous graph. Ordinary graph neural networks target graphs with a single type of node and relationship and can achieve good performance using only the adjacent node information of the graph. When training a heterogeneous graph neural network, the contents of different node types differ greatly in their features and information dimensions, so both the adjacent node distribution and the node content coding of the graph must be considered. For computing the content codes of nodes, the invention designs a suitable calculation method for each type of node.
(2) Calculating the content codes of different types of nodes. Let $V$ denote the set of all nodes in the medical term knowledge graph of Fig. 3. For a node $v \in V$, let $c_v$ denote its node content and $f(v)$ its content encoding. The content encodings of the different types of nodes are calculated as follows:
For a node $v$ whose node content is numerical, the content encoding is:
$$f(v) = c_v \cdot \mathbf{u}$$
where $c_v$ is the numerical value of node $v$ itself, and $\mathbf{u}$ is a unit vector that is randomly initialized and obtained through heterogeneous graph neural network training;
For a node $v$ whose node content is a measurement unit, the node content is a sequence of basic units and operation symbols $c_v = (s_1, s_2, \ldots, s_n)$, where each $s_i$ is a basic unit or an operation symbol. The content encoding of $v$ is:
$$f(v) = \sigma\left(W\left[e_{s_1} \oplus e_{s_2} \oplus \cdots \oplus e_{s_n}\right]\right)$$
where $W$ is a parameter matrix obtained through heterogeneous graph neural network training; $e_{s_i}$ is the semantic vector of each basic unit or operation symbol, randomly initialized and obtained through heterogeneous graph neural network training; $\oplus$ is the vector concatenation operator; and $\sigma$ denotes the nonlinear transformation;
For a node $v$ whose node content is text, a pre-trained language model is used to compute the semantic vector of $v$, which serves as the initial content encoding of $v$ and continues to be trained by the subsequent heterogeneous graph neural network. The pre-trained language model used in this embodiment is a BERT model, calculated as follows:
[equation image]
where [symbol image] is the hidden state of layer [symbol image] of the BERT model and [symbol image] is the input value of layer [symbol image]:
[equation image]
where [symbol image] and [symbol image] are parameters obtained through training, [symbol image] is the dimension of [symbol image], and [symbol image] is the hidden state of layer $k$ of the BERT model. If the BERT model has $m$ layers in total, the content encoding of node $v$ is initialized to [symbol image]. This embodiment takes $m = 12$.
(3) In the heterogeneous graph neural network, the vector representation of each node is computed from the content encodings of the node itself and of its adjacent nodes in the medical term knowledge graph. For a node $v \in V$ of the medical term knowledge graph, let $N_1(v)$ denote the set of nodes directly pointed to by the arrows starting from $v$; if $v$ is a medical term node, $N_1(v)$ is the set of primary information units of $v$ and $N_2(v)$ is the set of secondary information units of $v$. The set of adjacent nodes $N(v)$ of $v$ is defined as:
$$N(v) = N_1(v) \cup N_2(v)$$
The vector representation $h_v$ of $v$ is then calculated as:
$$h_v = \sum_{u \in \{v\} \cup N(v)} \alpha_{v,u}\, f(u)$$
where $f(u)$ is the content encoding of node $u$ and $\alpha_{v,u}$ is a weight parameter representing the importance of node $u$ to node $v$; $u$ can be $v$ itself or an adjacent node of $v$. The weight is calculated as:
[equation image]
where [symbol image], [symbol image] and [symbol image] are parameter matrices obtained through training, and $\sigma$ is a nonlinear activation function, in this embodiment [symbol image]. Since the relative importance between nodes is asymmetric, the weights are also asymmetric, i.e. $\alpha_{v,u} \neq \alpha_{u,v}$.
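The two-level neighborhood and the weighted combination of content codes can be sketched as below. The neighbor collection follows the text; the weight formula does not, because the original formula is only available as an image. The score used here (LeakyReLU of a dot product between a parameter vector `a` and the concatenation of the two content codes, followed by a softmax) is an assumed stand-in in the style of graph attention networks, and all names are hypothetical.

```python
import math
from typing import Dict, List, Set


def two_hop_neighbors(out_edges: Dict[str, Set[str]], v: str) -> Set[str]:
    """Nodes reached from v by following directed edges for one or two hops."""
    first = set(out_edges.get(v, set()))
    second: Set[str] = set()
    for u in first:
        second |= out_edges.get(u, set())
    return (first | second) - {v}


def leaky_relu(x: float, slope: float = 0.01) -> float:
    return x if x > 0 else slope * x


def aggregate(v: str, neighbors: Set[str], code: Dict[str, List[float]],
              a: List[float]) -> List[float]:
    """Weighted sum of the content codes of v and its adjacent nodes.

    The score function is an assumption; the concatenation order [v ; u]
    is what makes the resulting weights asymmetric.
    """
    members = [v] + sorted(neighbors)
    scores = [leaky_relu(sum(w * x for w, x in zip(a, code[v] + code[u])))
              for u in members]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(code[v])
    return [sum(al * code[u][i] for al, u in zip(alphas, members))
            for i in range(dim)]


edges = {"term": {"dose"}, "dose": {"10", "ml"}}
codes = {"term": [1.0, 0.0], "dose": [0.0, 1.0], "10": [1.0, 1.0], "ml": [0.5, 0.5]}
nbrs = two_hop_neighbors(edges, "term")
h = aggregate("term", nbrs, codes, a=[0.1, 0.2, 0.3, 0.4])
```

Because the weights come from a softmax, the result is a convex combination of the content codes; with an all-zero parameter vector every member receives the same weight.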
(4) Training the heterogeneous graph neural network. The training process is divided into two stages: 1) the adjacent node distribution and the content codes of the nodes are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, thereby obtaining the vector representation of each node; 2) the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is calculated, and the training objective is to maximize the similarity of medical term nodes with the same meaning.
In the first stage of training, the set of trainable parameters is denoted $\Theta$, and the goal of training is to optimize the following objective function:
$$\max_{\Theta} \prod_{v \in V} \prod_{u \in N(v)} p(u \mid v; \Theta)$$
where $V$ is the set of all nodes, $N(v)$ is the set of adjacent nodes of $v$, and $p(u \mid v; \Theta)$ is the probability of predicting the adjacent node $u$ from node $v$.
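The first-stage objective can be illustrated by computing the log-likelihood of the adjacent nodes under a simple probability model. The text does not state how the conditional probability is parameterized, so the softmax over dot products of node vectors used here (a skip-gram style choice) is an assumption, as are the function names and the toy vectors.

```python
import math
from typing import Dict, List, Set


def log_prob(u: str, v: str, emb: Dict[str, List[float]]) -> float:
    """log p(u | v) under a softmax over dot products (an assumed form)."""
    dots = {k: sum(a * b for a, b in zip(emb[k], emb[v])) for k in emb if k != v}
    top = max(dots.values())
    log_z = top + math.log(sum(math.exp(d - top) for d in dots.values()))
    return dots[u] - log_z


def neighbor_log_likelihood(neighbors: Dict[str, Set[str]],
                            emb: Dict[str, List[float]]) -> float:
    """Sum of log p(u | v) over every node v and every adjacent node u of v."""
    return sum(log_prob(u, v, emb)
               for v, us in neighbors.items() for u in sorted(us))


neighbors = {"a": {"b"}}
far = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 1.0]}
near = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
ll_far = neighbor_log_likelihood(neighbors, far)
ll_near = neighbor_log_likelihood(neighbors, near)
```

Maximizing the product of probabilities is equivalent to maximizing this sum of logarithms. In the toy data, moving the vector of `b` closer to that of `a` raises the log-likelihood, which is the behavior the first stage rewards.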
In the second stage of training, the similarity of any two medical term nodes is calculated as:
[equation image]
where $v$ and $u$ are medical term nodes in the medical term knowledge graph, $s(v, u)$ is the similarity of $v$ and $u$, and $W$ and $b$ are parameters obtained through training. In the medical term normalization training data, let the set of medical term nodes with the same meaning as medical term node $v$ be $P_v$, and the set of medical term nodes with a different meaning from $v$ be $Q_v$. The label $y_{v,u}$ of a training sample is then:
[equation image]
The goal of the second stage is to minimize the loss function $L$:
[equation image]
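A sketch of the second stage is given below. The original similarity and loss formulas are only available as images, so both forms here are assumptions: a sigmoid of a weighted elementwise product plus a bias for the similarity, labels of 1 for same-meaning pairs and 0 for different-meaning pairs, and binary cross-entropy as the loss. The names `similarity` and `bce_loss` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def similarity(hv: Sequence[float], hu: Sequence[float],
               w: Sequence[float], b: float) -> float:
    """Assumed form: sigmoid(w . (hv * hu) + b), a score in (0, 1)."""
    z = sum(wi * x * y for wi, x, y in zip(w, hv, hu)) + b
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(pairs: List[Tuple[float, int]]) -> float:
    """Mean binary cross-entropy over (similarity, label) pairs.

    Label 1 marks a same-meaning pair and 0 a different-meaning pair
    (an assumed labeling convention).
    """
    eps = 1e-12
    total = sum(y * math.log(s + eps) + (1 - y) * math.log(1.0 - s + eps)
                for s, y in pairs)
    return -total / len(pairs)


good = bce_loss([(0.9, 1), (0.1, 0)])   # similarities agree with the labels
bad = bce_loss([(0.1, 1), (0.9, 0)])    # similarities contradict the labels
```

Under these assumptions, minimizing the loss pushes the similarity of same-meaning pairs toward 1 and that of different-meaning pairs toward 0, which matches the stated training aim.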
IV. Prediction result output module
For a medical term node $v_0$ to be normalized, the trained heterogeneous graph neural network is used to compute the similarity between $v_0$ and the other medical term nodes in the medical term knowledge graph, and the results are sorted. The medical term node $u^*$ with the greatest similarity to $v_0$ is taken:
$$u^* = \arg\max_{u} s(v_0, u)$$
A similarity threshold $\theta$ is set. If the similarity of $v_0$ and $u^*$ satisfies the threshold condition [equation image], $v_0$ and $u^*$ are considered to have the same meaning, i.e. $u^*$ is the normalization result of $v_0$; otherwise, $v_0$ is considered to differ in meaning from the other medical term nodes in the medical term knowledge graph, and $v_0$ has an independent meaning. This embodiment takes [threshold value: symbol image] as an example. When the drug term "potassium chloride injection (tsukau production) 10% 10ml × 1" is normalized, its similarity to the other drug term nodes is calculated as shown in Table 4; the drug term node with the same meaning is the one with the highest similarity, "potassium chloride injection 10ml:1g tsukau pharmaceutical company limited".
Table 4. Medical term node similarities computed by the heterogeneous graph neural network
[Table 4 is provided as an image in the original publication.]
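The decision rule of the prediction result output module can be sketched as taking the best-scoring candidate and comparing it with the threshold. The function name `normalize_term`, the use of "greater than or equal to" for the comparison and the candidate names and scores in the usage note are hypothetical; the values of Table 4 are not reproduced here.

```python
from typing import Dict, Optional, Tuple


def normalize_term(similarities: Dict[str, float],
                   threshold: float) -> Tuple[Optional[str], float]:
    """Return (best matching term, score), or (None, score) below the threshold.

    similarities maps each candidate medical term node to its similarity
    with the term being normalized. None means the term is treated as
    having an independent meaning.
    """
    if not similarities:
        return None, 0.0
    best = max(similarities, key=similarities.get)
    score = similarities[best]
    # ">=" is an assumption; the original condition is only given as an image.
    return (best, score) if score >= threshold else (None, score)
```

For example, `normalize_term({"candidate A": 0.95, "candidate B": 0.40}, 0.8)` selects candidate A, while `normalize_term({"candidate A": 0.5}, 0.8)` returns `None` as the match, so the input term would be kept as a new, independent term.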
An embodiment of the invention also provides a medical term normalization method based on a heterogeneous graph neural network, comprising the following steps:
(1) Key information units are defined for each type of medical term; the information units comprise primary information units, secondary information units, and the inclusion relationship between the primary and secondary information units. The information units contained in all medical terms are identified at the character level using a sequence labeling model, and an information unit library is constructed. For the implementation of this step, refer to the information unit construction module.
(2) A medical term knowledge graph is constructed based on the relationships between the medical terms and the information units. The nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed edges and represent two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units. Each edge points from the containing side to the contained side.
(3) A heterogeneous graph neural network is trained based on the adjacent node distribution and the node content coding of the medical term knowledge graph; the adjacent nodes of a node are all nodes passed through when starting from that node and jumping two levels along the direction of the edges of the medical term knowledge graph; the node content coding specifically comprises the following:
for a node whose node content is of the numerical type, the content code equals the product of the node's numerical value and a unit vector obtained through training of the heterogeneous graph neural network;
for a node whose node content is a measurement unit, the content code is calculated as follows: semantic vectors of each basic unit and operation symbol are obtained through training of the heterogeneous graph neural network, the semantic vectors of all basic units and operation symbols contained in the node are concatenated, and the content code is obtained through a nonlinear transformation;
for a node whose node content is of the text type, the content code is obtained through a pre-trained language model;
the first stage of training: the adjacent node distribution and the content codes of the nodes are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, thereby obtaining the vector representation of each node;
the second stage of training: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is calculated, and the training objective is to maximize the similarity of medical term nodes having the same meaning;
for the implementation of this step, refer to the heterogeneous graph neural network module.
(4) The medical term node to be normalized is input into the trained heterogeneous graph neural network, the similarity ranking between it and the other medical term nodes in the medical term knowledge graph is obtained, and the medical term normalization result is output. For the implementation of this step, refer to the prediction result output module.
The invention defines and identifies the information units contained in multiple types of medical terms, achieving a structured representation of medical terms. This structured representation not only improves medical term normalization but also greatly benefits many other aspects of medical informatization. Based on the information units of medical terms, the invention constructs a new knowledge graph for medical terms, which can effectively support various medical informatization tasks, including medical term normalization. The invention also constructs a new heterogeneous graph neural network for medical term normalization, which normalizes different types of medical terms with a single unified model, implements a suitable content coding method for each type of information unit, and uses a staged training scheme.
The foregoing is only a preferred embodiment of the invention; although the invention has been disclosed through preferred embodiments, they are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the invention.

Claims (10)

1. A medical term normalization system based on a heterogeneous graph neural network, the system comprising:
(1) an information unit construction module: defining key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the inclusion relationship between the primary and secondary information units; identifying the information units contained in all medical terms at the character level using a sequence labeling model, and constructing an information unit library;
(2) a medical term knowledge graph module: constructing a medical term knowledge graph based on the relationships between the medical terms and the information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed edges and represent two relationships: the inclusion relationship between a medical term and its information units, and the inclusion relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;
(3) a heterogeneous graph neural network module: training a heterogeneous graph neural network based on the adjacent node distribution and the node content coding of the medical term knowledge graph; the adjacent nodes of a node are all nodes passed through when starting from that node and jumping two levels along the direction of the edges of the medical term knowledge graph; the node content coding specifically comprises the following:
for a node whose node content is of the numerical type, the content code equals the product of the node's numerical value and a unit vector obtained through training of the heterogeneous graph neural network;
for a node whose node content is a measurement unit, the content code is calculated as follows: semantic vectors of each basic unit and operation symbol are obtained through training of the heterogeneous graph neural network, the semantic vectors of all basic units and operation symbols contained in the node are concatenated, and the content code is obtained through a nonlinear transformation;
for a node whose node content is of the text type, the content code is obtained through a pre-trained language model;
the first stage of training: the adjacent node distribution and the content codes of the nodes are taken as input, and the training objective is to maximize, for each node, the conditional probability of its adjacent nodes given that node, thereby obtaining the vector representation of each node;
the second stage of training: the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is calculated, and the training objective is to maximize the similarity of medical term nodes having the same meaning;
(4) a prediction result output module: inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between it and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result.
2. The system of claim 1, wherein the types of medical terms include drug terms, disease terms, surgical terms, test terms and examination terms.
3. The system of claim 1, wherein in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model; the span of each information unit is labeled on the medical terms used as training data, and characters that do not belong to any information unit are labeled as well, so that the sequence labeling model can discard redundant characters that have no influence on the overall meaning of the medical term.
4. The system of claim 1, wherein in the information unit construction module, the numerical values and measurement units are preliminarily normalized: each original measurement unit is normalized into a single basic unit or into several basic units combined through operation symbols, and the numerical value is converted accordingly.
5. The system of claim 1, wherein in the heterogeneous graph neural network module, $V$ denotes the set of all nodes in the medical term knowledge graph; for a node $v \in V$, $c_v$ denotes its node content and $f(v)$ its content encoding;
for a node $v$ whose node content is numerical, the content encoding is:
$$f(v) = c_v \cdot \mathbf{u}$$
where $c_v$ is the numerical value of node $v$ itself, and $\mathbf{u}$ is a unit vector that is randomly initialized and obtained through heterogeneous graph neural network training;
for a node $v$ whose node content is a measurement unit, the node content is a sequence of basic units and operation symbols $c_v = (s_1, s_2, \ldots, s_n)$, where each $s_i$ is a basic unit or an operation symbol, and the content encoding of $v$ is:
$$f(v) = \sigma\left(W\left[e_{s_1} \oplus e_{s_2} \oplus \cdots \oplus e_{s_n}\right]\right)$$
where $W$ is a parameter matrix obtained through heterogeneous graph neural network training; $e_{s_i}$ is the semantic vector of each basic unit or operation symbol, randomly initialized and obtained through heterogeneous graph neural network training; $\oplus$ is the vector concatenation operator; and $\sigma$ denotes the nonlinear transformation;
for a node $v$ whose node content is text, a pre-trained language model is used to compute the semantic vector of $v$, which serves as the initial content encoding of $v$ and continues to be trained by the subsequent heterogeneous graph neural network.
6. The system of claim 5, wherein for a node $v$ whose node content is text, the pre-trained language model is a BERT model, calculated as follows:
[equation image]
where [symbol image] is the hidden state of layer [symbol image] of the BERT model and [symbol image] is the input value of layer [symbol image]:
[equation image]
where [symbol image] and [symbol image] are parameters obtained through training, [symbol image] is the dimension of [symbol image], and [symbol image] is the hidden state of layer $k$ of the BERT model; if the BERT model has $m$ layers in total, the content encoding of node $v$ is initialized to [symbol image].
7. The system of claim 1, wherein in the heterogeneous graph neural network module, the vector representation of each node is computed from the content encodings of the node itself and of its adjacent nodes in the medical term knowledge graph; for a node $v \in V$ of the medical term knowledge graph, $N_1(v)$ denotes the set of nodes directly pointed to by the arrows starting from $v$; if $v$ is a medical term node, $N_1(v)$ is the set of primary information units of $v$ and $N_2(v)$ is the set of secondary information units of $v$; the set of adjacent nodes $N(v)$ of $v$ is defined as:
$$N(v) = N_1(v) \cup N_2(v)$$
the vector representation $h_v$ of $v$ is then calculated as:
$$h_v = \sum_{u \in \{v\} \cup N(v)} \alpha_{v,u}\, f(u)$$
where $f(u)$ is the content encoding of node $u$ and $\alpha_{v,u}$ is a weight parameter, calculated as:
[equation image]
where [symbol image], [symbol image] and [symbol image] are parameter matrices obtained through training, and $\sigma$ is a nonlinear activation function.
8. The system of claim 1, wherein in the heterogeneous graph neural network module, in the first stage of training, the set of trainable parameters is denoted $\Theta$, and the goal of training is to optimize the following objective function:
$$\max_{\Theta} \prod_{v \in V} \prod_{u \in N(v)} p(u \mid v; \Theta)$$
where $V$ is the set of all nodes, $N(v)$ is the set of adjacent nodes of $v$, and $p(u \mid v; \Theta)$ is the probability of predicting the adjacent node $u$ from node $v$;
in the second stage of training, the similarity of any two medical term nodes is calculated as:
[equation image]
where $v$ and $u$ are medical term nodes in the medical term knowledge graph, $s(v, u)$ is the similarity of $v$ and $u$, and $W$ and $b$ are parameters obtained through training;
in the medical term normalization training data, the set of medical term nodes with the same meaning as medical term node $v$ is $P_v$, and the set of medical term nodes with a different meaning from $v$ is $Q_v$; the label $y_{v,u}$ of a training sample is then:
[equation image]
the goal of the second stage is to minimize the loss function $L$:
[equation image]
9. The system of claim 1, wherein in the prediction result output module, for a medical term node $v_0$ to be normalized, the trained heterogeneous graph neural network is used to compute the similarity between $v_0$ and the other medical term nodes in the medical term knowledge graph, and the results are sorted; the medical term node $u^*$ with the greatest similarity to $v_0$ is taken:
$$u^* = \arg\max_{u} s(v_0, u)$$
a similarity threshold $\theta$ is set; if the similarity of $v_0$ and $u^*$ satisfies the threshold condition [equation image], $v_0$ and $u^*$ are considered to have the same meaning, i.e. $u^*$ is the normalization result of $v_0$; otherwise, $v_0$ is considered to differ in meaning from the other medical term nodes in the medical term knowledge graph, and $v_0$ has an independent meaning.
10. A medical term normalization method based on a heterogeneous graph neural network is characterized by comprising the following steps:
(1) defining a key information unit for each type of medical term; the information units comprise a first-level information unit, a second-level information unit and an inclusion relationship between the two-level information units; identifying information units contained in all medical terms on a character level by using a sequence labeling model, and constructing an information unit library;
(2) based on the relationship between the medical terms and the information units, a medical term knowledge graph is constructed, the nodes of the knowledge graph comprise medical term nodes and information unit nodes, the edges are directed edges, and the edges comprise two relationships: the medical term and the inclusion relationship between the information units, the inclusion relationship between the primary information unit and the secondary information unit, and the direction of the edge is from the inclusion side to the inclusion side;
(3) training a heterogeneous graph neural network based on the adjacent-node distribution and the node content codes of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached from that node by jumping two levels along the edge direction of the medical term knowledge graph; the node content coding specifically comprises the following:
for a node whose content is of numeric type, the content code equals the product of the node's numeric value and a unit vector obtained through heterogeneous graph neural network training;
for a node whose content is a unit of measurement, the content code is calculated as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training, the semantic vectors of all basic units and operation symbols contained in the node are concatenated, and the content code is obtained through a nonlinear transformation;
for a node whose content is of text type, the content code is obtained through a pre-trained language model;
the first stage of training: taking the adjacent-node distribution and the node content codes as input, with the training objective of maximizing the conditional probability of each node's adjacent nodes given that node, to obtain a vector representation of each node;
the second stage of training: taking the vector representations of the nodes as input, calculating the similarity between any two medical term nodes, with the training objective of maximizing the similarity between medical term nodes that have the same meaning;
(4) inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between the medical term node to be normalized and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result.
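
As an illustration only, the numeric content coding of step (3) and the ranking and threshold decision of step (4) and claim 9 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not name a similarity measure or the direction of the threshold comparison, so cosine similarity and a greater-than-or-equal test are assumed here, and the function names `encode_numeric`, `cosine` and `normalize_term`, as well as the toy vectors, are hypothetical. In the claimed method the node vectors come from the trained heterogeneous graph neural network; here they are supplied by hand.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def encode_numeric(value: float, unit_vector: Vector) -> Vector:
    """Content code of a numeric node: its value times a unit vector.

    In the claims the unit vector is learned during training; here it is
    simply passed in (assumption for illustration).
    """
    return [value * x for x in unit_vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity (an assumed choice; the claims name no measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize_term(
    query: Vector,
    graph_terms: Dict[str, Vector],
    threshold: float,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Rank graph terms by similarity to the query and apply the threshold.

    Returns (matched term or None, full ranking). None means the query is
    treated as having an independent meaning.
    """
    ranking = sorted(
        ((name, cosine(query, vec)) for name, vec in graph_terms.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Assumption: ">=" is used; the claim only says a threshold is set.
    if ranking and ranking[0][1] >= threshold:
        return ranking[0][0], ranking
    return None, ranking


if __name__ == "__main__":
    # Hypothetical two-dimensional node vectors.
    terms = {"hypertension": [1.0, 0.0], "diabetes mellitus": [0.0, 1.0]}
    print(normalize_term([0.9, 0.1], terms, threshold=0.8)[0])
    print(normalize_term([1.0, 1.0], terms, threshold=0.8)[0])
```

With these toy vectors the first query is mapped to "hypertension", while the second query is equally distant from both terms, falls below the threshold, and is therefore returned as `None`, that is, as a term with an independent meaning.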
CN202111213727.4A 2021-10-19 2021-10-19 Medical term normalization system and method based on heterogeneous graph neural network Active CN113656604B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202111213727.4A CN113656604B (en) 2021-10-19 2021-10-19 Medical term normalization system and method based on heterogeneous graph neural network
JP2023536585A JP7432802B2 (en) 2021-10-19 2022-09-05 Medical terminology normalization system and method based on heterogeneous graph neural network
PCT/CN2022/116967 WO2023065858A1 (en) 2021-10-19 2022-09-05 Medical term standardization system and method based on heterogeneous graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111213727.4A CN113656604B (en) 2021-10-19 2021-10-19 Medical term normalization system and method based on heterogeneous graph neural network

Publications (2)

Publication Number Publication Date
CN113656604A CN113656604A (en) 2021-11-16
CN113656604B CN113656604B (en) 2022-02-22

Family

ID=78494655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111213727.4A Active CN113656604B (en) 2021-10-19 2021-10-19 Medical term normalization system and method based on heterogeneous graph neural network

Country Status (3)

Country Link
JP (1) JP7432802B2 (en)
CN (1) CN113656604B (en)
WO (1) WO2023065858A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656604B (en) * 2021-10-19 2022-02-22 之江实验室 Medical term normalization system and method based on heterogeneous graph neural network
CN114003791B (en) * 2021-12-30 2022-04-08 之江实验室 Depth map matching-based automatic classification method and system for medical data elements
CN116386895B (en) * 2023-04-06 2023-11-28 之江实验室 Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
CN116312915B (en) * 2023-05-19 2023-09-19 之江实验室 Method and system for standardized association of drug terms in electronic medical records
CN117009839B (en) * 2023-09-28 2024-01-09 之江实验室 Patient clustering method and device based on heterogeneous hypergraph neural network
CN117497111B (en) * 2023-12-25 2024-03-15 四川省医学科学院·四川省人民医院 System for realizing disease name standardization and classification based on deep learning
CN117688974B (en) * 2024-02-01 2024-04-26 中国人民解放军总医院 Knowledge graph-based generation type large model modeling method, system and equipment

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7788213B2 (en) 2007-06-08 2010-08-31 International Business Machines Corporation System and method for a multiple disciplinary normalization of source for metadata integration with ETL processing layer of complex data across multiple claim engine sources in support of the creation of universal/enterprise healthcare claims record
WO2018209254A1 (en) * 2017-05-11 2018-11-15 Hubspot, Inc. Methods and systems for automated generation of personalized messages
EP3637435A1 (en) * 2018-10-12 2020-04-15 Fujitsu Limited Medical diagnostic aid and method
US11381651B2 (en) * 2019-05-29 2022-07-05 Adobe Inc. Interpretable user modeling from unstructured user data
CN110349639B (en) * 2019-07-12 2022-01-04 之江实验室 Multi-center medical term standardization system based on general medical term library
CN111400560A (en) * 2020-03-10 2020-07-10 支付宝(杭州)信息技术有限公司 Method and system for predicting based on heterogeneous graph neural network model
CN112035451A (en) 2020-08-25 2020-12-04 上海灵长软件科技有限公司 Data verification optimization processing method and device, electronic equipment and storage medium
CN112271001B (en) * 2020-11-17 2022-08-16 中山大学 Medical consultation dialogue system and method applying heterogeneous graph neural network
CN112541056A (en) 2020-12-18 2021-03-23 卫宁健康科技集团股份有限公司 Medical term standardization method, device, electronic equipment and storage medium
CN112542223A (en) * 2020-12-21 2021-03-23 西南科技大学 Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record
CN113010685B (en) 2021-02-23 2022-12-06 安徽讯飞医疗股份有限公司 Medical term standardization method, electronic device, and storage medium
CN113191156A (en) * 2021-04-29 2021-07-30 浙江禾连网络科技有限公司 Medical examination item standardization system and method based on medical knowledge graph and pre-training model
CN113377897B (en) * 2021-05-27 2022-04-22 杭州莱迈医疗信息科技有限公司 Multi-language medical term standard standardization system and method based on deep confrontation learning
CN113345545B (en) 2021-07-28 2021-10-29 北京惠每云科技有限公司 Clinical data checking method and device, electronic equipment and readable storage medium
CN113436698B (en) 2021-08-27 2021-12-07 之江实验室 Automatic medical term standardization system and method integrating self-supervision and active learning
CN113656604B (en) * 2021-10-19 2022-02-22 之江实验室 Medical term normalization system and method based on heterogeneous graph neural network

Also Published As

Publication number Publication date
CN113656604A (en) 2021-11-16
WO2023065858A1 (en) 2023-04-27
JP7432802B2 (en) 2024-02-16
JP2024500400A (en) 2024-01-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant