CN113656604B

CN113656604B - Medical term normalization system and method based on heterogeneous graph neural network

Info

Publication number: CN113656604B
Application number: CN202111213727.4A
Authority: CN
Inventors: 李劲松; 杨宗峰; 辛然; 田雨; 周天舒
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2021-10-19
Filing date: 2021-10-19
Publication date: 2022-02-22
Anticipated expiration: 2041-10-19
Also published as: CN113656604A; WO2023065858A1; JP7432802B2; JP2024500400A

Abstract

The invention discloses a medical term normalization system and method based on a heterogeneous graph neural network. A heterogeneous graph neural network covering multiple types of medical terms is constructed on the basis of a knowledge graph, and both the adjacent-node distribution and the node content encoding of the graph are taken into account when training the network for medical term normalization. The invention makes full use of the associations and differences among the information units of medical terms of the same type, accommodates multiple types of medical terms at once, learns medical domain knowledge comprehensively, allows new types of medical terms to be added to the system easily, and reduces the workload of normalizing new types of medical terms.

Description

Medical term normalization system and method based on heterogeneous graph neural network

Technical Field

The invention belongs to the technical field of Chinese medical term normalization and multi-center medical information platforms, and particularly relates to a medical term normalization system and method based on a heterogeneous graph neural network.

Background

An important research direction in medical informatization is applying higher-performance machine learning and artificial intelligence technology to real clinical problems. One advantage of artificial intelligence is that it can discover complex rules and features in massive data. Comprehensively using the medical data of multiple medical institutions for analysis, mining, and model design, and thereby supporting medical research and clinical decision-making, has therefore become an inevitable trend in medical informatization. However, different medical institutions adopt many different information standards, and semi-structured and unstructured data are frequently produced by hand, which makes integrating medical data from different sources extremely difficult. Medical terms are the basic elements of medical data. Establishing a complete medical term normalization system allows medical data from different sources to be aligned to a unified standard and structure, providing larger-scale and higher-quality data for clinical decision-making and medical research. Medical terms mainly include terms for drugs, medical examinations, diseases, and other categories generated during clinical work. Different types of medical terms contain information along particular key dimensions, which is defined here as the information units of the medical term. For example, the drug term "5% glucose injection (base) 500 ml" contains the information units shown in Table 1:

Table 1. Example information units of a drug term

The examination term "left hand means positive bit _ X" contains the information units shown in Table 2:

Table 2. Example information units of an examination term

Some information units are composed of other, finer-grained information units; the two levels are defined as primary information units and secondary information units, respectively. For example, the drug term in Table 1 comprises the primary information units "pharmaceutical composition", "pharmaceutical dosage form", "pharmaceutical dosage", and "pharmaceutical specification", where the "pharmaceutical specification" information unit is composed of the secondary information units "number" (500) and "dosing unit" (ml). Given the set of information units of a medical term, the complete medical term can be determined.

In actual clinical work, a large number of irregular medical terms are produced because medical institutions adopt different information standards and medical workers have different personal habits. The irregularities mainly take the form of redundant or missing key information units, non-standard expressions, and inconsistent units of measurement. For example, the following medical terms have the same meaning but differ considerably in form: "levofloxacin tablet (clonidine) 500 mg" and "clonidine 0.5 g/tablet". The aim of medical term normalization is to identify medical terms that have the same meaning but different literal forms so that their expression can be unified, to distinguish medical terms with different meanings, and ultimately to promote the normalization of medical data as a whole.

Traditional medical term normalization methods handle a single category of medical terms: the meaning of each term is determined by machine learning or manual verification, and terms with the same semantics are labeled. Such methods treat each medical term as a whole and ignore the information unit structure inherent in it. Their main disadvantages are: (1) The associations and differences among information units cannot be exploited effectively. The associations and differences between information units of different dimensions of the same medical term contain rich medical domain knowledge, but existing practice does not explicitly structure or use this knowledge. (2) Different types of medical terms may contain the same or related information units, yet traditional normalization work develops an independent system for each single category of medical terms. The workload is therefore excessive, and the knowledge in the information units of different types of medical terms cannot be used jointly. (3) Superfluous information is taken into account. Because of non-standard expression and similar causes, most medical terms contain redundant characters in addition to the key information units. These characters have little to do with the overall meaning of the medical term and act as noise that distorts the meaning.

Disclosure of Invention

The invention aims to provide a medical term normalization system and method based on a heterogeneous graph neural network, addressing the shortcomings of conventional medical term normalization methods and building on the characteristics of medical terms. The invention constructs a novel knowledge graph based on information units for all medical terms and, on the basis of this knowledge graph, normalizes medical terms through an improved heterogeneous graph neural network, thereby making effective use of the knowledge in the information units of medical terms and obtaining more accurate normalization results.

The purpose of the invention is achieved by the following technical scheme. To make full use of the medical domain knowledge contained in medical terms during normalization, the invention first defines key information units for multiple types of medical terms, which gives the medical terms a structured representation, and constructs a knowledge graph containing multiple types of medical terms based on the information units. A heterogeneous graph neural network covering multiple types of medical terms is then constructed on the basis of the knowledge graph, and both the adjacent-node distribution and the node content encoding of the graph are taken into account when training the network for medical term normalization. In this way, the invention makes full use of the associations and differences among the information units of medical terms of the same type, accommodates multiple types of medical terms in one system, learns medical domain knowledge comprehensively, allows new types of medical terms to be added to the system easily, and reduces the workload of normalizing new types of medical terms. Redundant characters and information are discarded when the information units are extracted from a medical term, which avoids excessive noise and errors.

In one aspect, the invention discloses a medical term normalization system based on a heterogeneous graph neural network, comprising:

(1) an information unit construction module: defining key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the containment relationship between the two levels; identifying, at the character level, the information units contained in all medical terms by using a sequence labeling model, and constructing an information unit library;

(2) a medical term knowledge graph module: constructing a medical term knowledge graph based on the relationship between the medical terms and the information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and represent two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;

(3) a heterogeneous graph neural network module: training a heterogeneous graph neural network based on the adjacent-node distribution and the node content encoding of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached from that node within two hops along the edge direction of the medical term knowledge graph; the node content encoding is specifically as follows:

for a node whose content is a numerical value, the content encoding equals the product of the node's numerical value and a unit vector obtained through heterogeneous graph neural network training;

for a node whose content is a unit of measurement, the content encoding is calculated as follows: semantic vectors of each basic unit and operation symbol are obtained through heterogeneous graph neural network training; the semantic vectors of all basic units and operation symbols contained in the node are concatenated, and the content encoding is obtained through a nonlinear transformation;

for a node whose content is text, the content encoding is obtained through a pre-trained language model;

the first stage of training: taking the adjacent-node distribution and the content encodings of the nodes as input, with the training objective of maximizing the conditional probability of each node's adjacent nodes given that node, to obtain the vector representation of each node;

the second stage of training: taking the vector representations of the nodes as input and calculating the similarity of any two medical term nodes, with the training objective of maximizing the similarity of medical term nodes that have the same meaning;

(4) a prediction result output module: inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between that node and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result.

Further, the types of medical terms include pharmaceutical terms, disease terms, surgical terms, test terms, and examination terms.

Furthermore, in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model; the span of each information unit is labeled on the medical terms used as training data, and the characters that do not belong to any information unit are labeled as well, so that the sequence labeling model can discard redundant characters that have no influence on the overall meaning of the medical term.

Furthermore, in the information unit construction module, the numerical values and units of measurement are preliminarily normalized: each original unit of measurement is normalized into a single basic unit or into several basic units combined through operation symbols, and the numerical value is converted accordingly.

Further, in the heterogeneous graph neural network module, let $V$ denote the set of all nodes in the medical term knowledge graph; for $v \in V$, let $C_v$ denote its node content and $E_v$ its content encoding.

For a node $v$ whose content is numerical, the content encoding is:

$$E_v = x_v \cdot u$$

where $x_v$ is the numerical value of node $v$ itself, and $u$ is a unit vector that is randomly initialized and obtained through heterogeneous graph neural network training.

For a node $v$ whose content is a unit of measurement, the node content is a sequence of basic units and operation symbols $C_v = (c_1, c_2, \ldots, c_n)$, where each $c_i$ is a basic unit or an operation symbol and $n$ is the number of elements in the sequence. The content encoding is:

$$E_v = f\left(W \left[e_{c_1} \oplus e_{c_2} \oplus \cdots \oplus e_{c_n}\right]\right)$$

where $W$ is a parameter matrix obtained through heterogeneous graph neural network training; $e_{c_i}$ is the semantic vector of a basic unit or operation symbol, randomly initialized and obtained through heterogeneous graph neural network training; $\oplus$ is the vector concatenation operator; and $f$ is the nonlinear transformation.

For a node $v$ whose content is text, a pre-trained language model computes the semantic vector of $C_v$ as the initial value of $E_v$, and training of the content encoding continues through the subsequent heterogeneous graph neural network.

Further, for a node $v$ whose content is text, the pre-trained language model is a BERT model. The hidden state of each BERT layer is calculated from the input value of that layer, using parameters obtained through training and the dimension of the corresponding vectors. Writing $h^{k}$ for the hidden state of the $k$-th layer of the BERT model, if the BERT model has $m$ layers in total, the content encoding of node $v$ is initialized from the hidden state $h^{m}$ of the final layer.

Further, in the heterogeneous graph neural network module, the vector representation of each node is calculated from the content encodings of the node itself and of its adjacent nodes in the medical term knowledge graph. For a node $v$ of the medical term knowledge graph, let $N(v)$ denote the set of nodes directly pointed to by the edges starting from $v$; if $v$ is a medical term node, then $N(v)$ is the set of primary information units of $v$ and $N(N(v))$ is the set of secondary information units of $v$. The set of adjacent nodes of $v$, written $A(v)$, is defined as:

$$A(v) = N(v) \cup N(N(v))$$

The vector representation $h_v$ of $v$ is then calculated as a weighted combination of content encodings, where the weight parameters are calculated from parameter matrices obtained through training and a nonlinear activation function.

Further, in the heterogeneous graph neural network module, in the first training stage, the set of trainable parameters is denoted $\Theta$, and the goal of training is to optimize an objective function over $\Theta$ defined in terms of $p(u \mid v; \Theta)$, the probability of predicting an adjacent node $u$ from node $v$.

In the second training stage, the similarity $s(v_i, v_j)$ of any two medical term nodes $v_i$ and $v_j$ in the medical term knowledge graph is calculated by a formula whose parameters $W$ and $b$ are both obtained through training.

In the training data for medical term normalization, for each medical term node $v_i$ there is a set of medical term nodes with the same meaning and a set of medical term nodes with different meanings; the label of a training sample is defined according to which of these two sets the paired node belongs to.

The goal of the second stage is to minimize the loss function $L$.

Further, in the prediction result output module, for a medical term node $v$ to be normalized, the trained heterogeneous graph neural network calculates the similarity $s(v, u)$ between $v$ and every other medical term node $u$ in the medical term knowledge graph and ranks the results. The medical term node $v^{*}$ with the maximum similarity to $v$ is taken:

$$v^{*} = \arg\max_{u} \, s(v, u)$$

A similarity threshold $\delta$ is set. If the similarity between $v$ and $v^{*}$ passes the threshold, $v$ and $v^{*}$ are considered to have the same meaning, that is, $v^{*}$ is the normalization result of $v$. Otherwise, the meaning of $v$ is considered different from that of every other medical term node in the medical term knowledge graph, and $v$ has an independent meaning.
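The ranking and threshold step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `normalize_term`, the callable `similarity`, and the use of `>=` for the threshold comparison are assumptions, since the source text does not state whether the comparison is strict.

```python
def normalize_term(query, candidates, similarity, threshold):
    """Rank candidates by similarity to `query` and apply a threshold.

    Returns (best_match_or_None, ranked), where `ranked` is a list of
    (score, candidate) pairs sorted from most to least similar.
    `similarity` is a hypothetical callable standing in for the trained network.
    """
    ranked = sorted(
        ((similarity(query, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Assumption: a score equal to the threshold counts as a match.
    if ranked and ranked[0][0] >= threshold:
        return ranked[0][1], ranked
    # No candidate is similar enough: the query keeps an independent meaning.
    return None, ranked


# Toy similarity table (made-up scores, for illustration only).
scores = {("q", "a"): 0.9, ("q", "b"): 0.2}
match, ranking = normalize_term("q", ["a", "b"], lambda x, y: scores[(x, y)], 0.8)
```

Returning the full ranking alongside the decision keeps the similarity ordering available to a reviewer, which matches the module's description of outputting a similarity ranking.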

In another aspect, the invention discloses a medical term normalization method based on a heterogeneous graph neural network, comprising the following steps:

(1) defining key information units for each type of medical term; the information units comprise primary information units, secondary information units, and the containment relationship between the two levels; identifying, at the character level, the information units contained in all medical terms by using a sequence labeling model, and constructing an information unit library;

(2) constructing a medical term knowledge graph based on the relationship between the medical terms and the information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and represent two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;

(3) training a heterogeneous graph neural network based on the adjacent-node distribution and the node content encoding of the medical term knowledge graph; the adjacent nodes of a node are all nodes reached from that node within two hops along the edge direction of the medical term knowledge graph; the node content encoding and the two training stages are as described for the system above;

(4) inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between that node and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result.

The beneficial effects of the invention are as follows. The invention defines a uniform information unit structure for different types of medical terms and gives them a relatively uniform structured representation, so that medical domain knowledge is better used during medical term normalization and the associations and differences among the information units are fully learned, both within the same type of medical terms and across different types. By integrating all medical terms into one knowledge graph, a single heterogeneous graph neural network normalizes different types of medical terms, which improves the efficiency of normalization work as well as the completeness and consistency of the output.

Drawings

FIG. 1 is a block diagram of a medical term normalization system based on a heterogeneous graph neural network according to an embodiment of the present invention;

FIG. 2 shows training data for the sequence labeling model provided in an embodiment of the present invention;

FIG. 3 is a schematic view of a medical term knowledge graph provided by an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may, however, be practiced in ways other than those specifically described here, and those of ordinary skill in the art can make similar generalizations without departing from the spirit of the present invention. The present invention is therefore not limited to the specific embodiments disclosed below.

In the present invention, medical term normalization means the process of analyzing the various medical terms generated in a real clinical environment by combining medical domain knowledge with natural language processing methods, identifying medical terms with the same meaning, distinguishing medical terms with different meanings, and unifying the medical terms within a certain scope so as to obtain the best order and social benefit. Establishing a unified medical term standard and term set helps solve problems such as duplicated terms and inconsistent connotation, semantic expression, and understanding, and is of great significance for promoting the dissemination, sharing, and use of medical information more widely and at a deeper level.

The heterogeneous graph neural network is explained as follows. Traditional deep learning methods have had great success on linear and matrix-shaped data, but the data in many practical application scenarios has a graph structure. In recent years, researchers have borrowed ideas from convolutional networks and recurrent networks to define and design graph neural network models for processing graph data. Common graph neural networks target graphs with a single type of node and a single type of relationship, where good performance can be obtained using only the adjacent-node information of the graph. In contrast, real-world graph data usually has many node types and relationship types that differ considerably from one another, and a graph of this kind is called a heterogeneous graph. When training a heterogeneous graph neural network, the contents of different types of nodes have very different features and different information dimensions, so the content encoding information of the nodes must be considered together with the adjacent-node information of the graph.

The embodiment of the invention provides a medical term normalization system based on a heterogeneous graph neural network, which comprises the following modules, as shown in FIG. 1:

I. An information unit construction module, which:

(1) defines key information units for each type of medical term; the medical term types include drug terms, disease terms, surgical terms, test terms, and examination terms; the information units include primary information units, secondary information units, and the containment relationship between the two levels;

(2) identifies, at the character level, the information units contained in all medical terms by using a sequence labeling model, and constructs an information unit library;

II. A medical term knowledge graph module: a medical term knowledge graph is constructed based on the relationship between the medical terms and the information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes; the edges are directed and represent two relationships: the containment relationship between a medical term and its information units, and the containment relationship between a primary information unit and its secondary information units; each edge points from the containing side to the contained side;

III. A heterogeneous graph neural network module: a heterogeneous graph neural network is trained based on the adjacent-node distribution and the node content encoding of the medical term knowledge graph;

the adjacent nodes of a node are all nodes passed through when starting from that node and moving two hops along the edge direction of the medical term knowledge graph;

the node content encoding is calculated separately for each node type, as detailed in the description of this module below;

IV. A prediction result output module: the medical term node to be normalized is input into the trained heterogeneous graph neural network, the similarity ranking between that node and the other medical term nodes in the medical term knowledge graph is obtained, and the medical term normalization result is output.

The implementation process of each module is described in detail as follows:

I. Information unit construction module

(1) Defining the information units of medical terms. Several internationally used medical term standard sets already exist, and each defines information units along key dimensions for a specific single category of medical terms. However, no associations between information units have been established across the standard sets for different types of medical terms, so the information used in earlier medical term normalization was confined to a single category of medical terms and a large amount of useful information was ignored. The invention combines the existing international medical term standard sets with expert knowledge from actual clinical practice, uniformly defines key information units for multiple types of medical terms, and defines a detailed structure of primary and secondary information units. The types of medical terms implemented so far include drug terms, disease terms, surgical terms, test terms, and examination terms. If a new type of medical term needs to be normalized later, it can easily be added to the system once its information units have been defined. The information units of the implemented medical terms are defined as shown in Table 3.

Table 3. Information units of medical terms

(2) Constructing the information unit library. A sequence labeling model predicts the probability that each character in a medical term belongs to each information unit, thereby identifying all information units contained in the medical term and giving the medical term a structured representation. The sequence labeling model used in this embodiment is a BiLSTM-CRF model. The model first captures the context information of the medical term through a BiLSTM network, then constructs the state probability matrix and the transition probability matrix from the output of the BiLSTM network at each character position of the medical term, and builds a CRF model on top of them, which achieves better results on the sequence labeling task. The process of constructing training data for the sequence labeling model is shown in FIG. 2: the span of each information unit is labeled on the medical terms used as training data, and the characters that do not belong to any information unit are labeled as well, so that the sequence labeling model can discard redundant characters that do not affect the overall meaning of the medical term and excessive noise is kept out of the subsequent heterogeneous graph neural network.

(3) It should be noted that, in Table 3, many primary information units contain "number" and "unit of measurement" secondary information units. The original numbers and units of measurement in medical terms are spread over a wide range and are sparse, which makes training the heterogeneous graph neural network more difficult. To solve this problem, the numerical values and units of measurement are first preliminarily normalized: each original unit of measurement is normalized into a single basic unit or into several basic units combined through operation symbols, and the numerical value is converted accordingly. The basic units comprise: ml (milliliter), mg (milligram), mm (millimeter), s (second), mol (amount of substance), u (unit), iu (international unit), count, type, stage, and period; the operation symbols comprise multiplication and division. A total of 90 normalized units of measurement are produced. For example, if the original unit of measurement is l (liter) and the corresponding value is 1, the normalized unit of measurement is ml (milliliter) and the value is converted to 1000.
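Steps (2) and (3) can be illustrated together with a short sketch. It assumes a BIO tagging scheme with hypothetical label names (`INGREDIENT`, `DOSAGE_FORM`, `NUMBER`, `UNIT`), a hand-written tag sequence in place of real BiLSTM-CRF output, and a small hypothetical conversion table. Only the l to ml conversion comes from the text; the other rows are ordinary metric conversions added for illustration.

```python
import re


def extract_units(tokens, tags):
    """Group tokens into (label, text) spans; tokens tagged 'O' are discarded."""
    units, label, buf = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == label:
            buf.append(tok)  # continue the currently open span
            continue
        if label is not None:  # any other tag closes the open span
            units.append((label, "".join(buf)))
        label, buf = (tag[2:], [tok]) if tag.startswith("B-") else (None, [])
    if label is not None:
        units.append((label, "".join(buf)))
    return units


# Hypothetical table: original unit -> (basic unit, multiplier).
TO_BASE = {
    "l": ("ml", 1000.0), "ml": ("ml", 1.0),
    "g": ("mg", 1000.0), "mg": ("mg", 1.0),
    "cm": ("mm", 10.0), "mm": ("mm", 1.0),
}


def normalize(value, unit):
    """Rewrite a simple or compound unit (e.g. 'g/l') in basic units and convert the value."""
    factor, op, parts = 1.0, "*", []
    for tok in re.split(r"([*/])", unit.replace(" ", "").lower()):
        if tok in ("*", "/"):
            op = tok
            parts.append(tok)
        else:
            base, mult = TO_BASE[tok]
            factor = factor * mult if op == "*" else factor / mult
            parts.append(base)
    return value * factor, "".join(parts)


# Hand-written example: the bracketed tokens are tagged 'O' and dropped.
tokens = ["glucose", "injection", "(", "x", ")", "0.5", "l"]
tags = ["B-INGREDIENT", "B-DOSAGE_FORM", "O", "O", "O", "B-NUMBER", "B-UNIT"]
units = dict(extract_units(tokens, tags))
value, unit = normalize(float(units["NUMBER"]), units["UNIT"])
```

With this table, "0.5 g" and "500 mg" normalize to the same value and basic unit, which is the kind of literal difference the preliminary normalization is meant to remove before graph training.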

II. Medical term knowledge graph module

A knowledge graph containing multiple types of medical terms is constructed from the information unit library built by the information unit construction module, as shown in FIG. 3. The graph contains two major types of nodes: circular nodes represent medical term nodes, and rectangular nodes represent information unit nodes. Each major type contains several subtypes; for example, the medical term nodes include "medicine term" nodes, "disease term" nodes, and so on, and the information unit nodes include "medicine dose" nodes, "numerical value" nodes, and so on. The edges represent two relationships: 1) the containment relationship between a medical term and its information units; 2) the containment relationship between a primary information unit and its secondary information units. The division into primary and secondary information units may vary between types of medical terms. For example, for disease terms the "disease subject" is a primary information unit, whereas for surgery terms the "disease subject" is a secondary information unit contained in the primary information unit "disease property".
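The graph structure described above can be sketched with a plain adjacency map. The class name `TermGraph`, the node identifiers, and the node type strings are hypothetical; the only properties taken from the text are that edges are directed from the containing node to the contained node and that the adjacent nodes of a node are those reached within two hops along edge direction.

```python
class TermGraph:
    """Directed graph whose edges point from a containing node to a contained node."""

    def __init__(self):
        self.node_type = {}   # node id -> type label (hypothetical strings)
        self.out_edges = {}   # node id -> list of directly contained node ids

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type
        self.out_edges.setdefault(node_id, [])

    def add_edge(self, container, contained):
        self.out_edges.setdefault(container, []).append(contained)

    def neighbors(self, node_id):
        """All nodes reached within two hops along edge direction, without duplicates."""
        first = self.out_edges.get(node_id, [])
        second = [w for u in first for w in self.out_edges.get(u, [])]
        seen, result = set(), []
        for n in list(first) + second:
            if n not in seen:
                seen.add(n)
                result.append(n)
        return result


g = TermGraph()
g.add_node("term:glucose-injection", "drug_term")
g.add_node("ingredient:glucose", "drug_ingredient")
g.add_node("spec:500ml", "drug_specification")
g.add_node("number:500", "number")
g.add_node("unit:ml", "unit")
g.add_edge("term:glucose-injection", "ingredient:glucose")  # term contains primary unit
g.add_edge("term:glucose-injection", "spec:500ml")
g.add_edge("spec:500ml", "number:500")                      # primary contains secondary
g.add_edge("spec:500ml", "unit:ml")
```

Because information unit nodes such as `unit:ml` can be shared by terms of different types, a single graph of this shape lets drug, test, and examination terms reuse the same number and unit nodes.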

III. Heterogeneous graph neural network module

(1) A heterogeneous graph is a graph with more complex node types and relationship types; the medical term knowledge graph shown in FIG. 3 is a heterogeneous graph. Common graph neural networks target graphs with a single type of node and a single type of relationship, where good performance can be obtained from the adjacent-node information of the graph alone. When training a heterogeneous graph neural network, the contents of different types of nodes have very different features and different information dimensions, so the adjacent-node distribution of the graph and the content encoding of the nodes must be considered at the same time. For calculating the content encoding of the nodes, the invention designs a suitable calculation method for each type of node.

(2) Calculating the content encodings of different types of nodes. Let $V$ denote the set of all nodes in the medical term knowledge graph of FIG. 3; for $v \in V$, let $C_v$ denote its node content and $E_v$ its content encoding. The content encodings of the different node types are calculated as follows.

For a node $v$ whose content is numerical, the content encoding is:

$$E_v = x_v \cdot u$$

where $x_v$ is the numerical value of node $v$ itself and $u$ is the unit vector obtained through training.

For a node $v$ whose content is a unit of measurement, the node content is a sequence of basic units and operation symbols $C_v = (c_1, c_2, \ldots, c_n)$, where each $c_i$ is a basic unit or an operation symbol and $n$ is the number of elements in the sequence. The content encoding is:

$$E_v = f\left(W \left[e_{c_1} \oplus e_{c_2} \oplus \cdots \oplus e_{c_n}\right]\right)$$

where $e_{c_i}$ is the semantic vector of $c_i$, $W$ is a trained parameter matrix, $f$ is the nonlinear transformation, and $\oplus$ is the vector concatenation operator.

For a node $v$ whose content is text, a pre-trained language model computes the semantic vector of $C_v$ as the initial value of $E_v$, and training of the content encoding continues through the subsequent heterogeneous graph neural network. The pre-trained language model used in this embodiment is a BERT model: the hidden state of each layer is calculated from the input value of that layer, using parameters obtained through training and the dimension of the corresponding vectors, and with $m$ layers in total, $E_v$ is initialized from the hidden state of the final layer. This example takes $m = 12$.
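The numerical and unit-of-measurement encodings can be sketched in plain Python. Several details are assumptions made only so the example runs: the embedding size `DIM`, zero-padding of unit sequences to a fixed length `MAX_LEN`, and `tanh` as the nonlinear transformation (the source says only that a nonlinear transformation is applied). Parameters are randomly initialized and not trained here, and the text-node case is omitted because it requires a pre-trained BERT model.

```python
import math
import random

DIM = 4       # hypothetical embedding size
MAX_LEN = 3   # hypothetical maximum number of symbols in a unit sequence
rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


# Trainable parameters (random here; training is outside this sketch).
unit_vector = rand_vec(DIM)
symbol_vectors = {s: rand_vec(DIM) for s in ["ml", "mg", "mm", "s", "*", "/"]}
W = [rand_vec(DIM * MAX_LEN) for _ in range(DIM)]  # DIM x (DIM * MAX_LEN)


def encode_number(value):
    """Numerical node: the node's value times the shared unit vector."""
    return [value * u for u in unit_vector]


def encode_unit(symbols):
    """Unit node: concatenate symbol vectors, then apply a nonlinear map."""
    if len(symbols) > MAX_LEN:
        raise ValueError("unit sequence longer than MAX_LEN")
    concat = [x for s in symbols for x in symbol_vectors[s]]
    concat += [0.0] * (DIM * MAX_LEN - len(concat))  # assumption: zero padding
    # Assumption: tanh stands in for the unspecified nonlinear transformation.
    return [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W]


e_500 = encode_number(500.0)
e_mg_per_ml = encode_unit(["mg", "/", "ml"])
```

One consequence of the product form is that numerical encodings are collinear: the encoding of 500 is exactly 500 times the encoding of 1, so magnitude differences between values are preserved in the vector space.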

(3) In the heterogeneous graph neural network, the vector representation of each node is computed from the content encodings of the node itself and of its adjacent nodes in the medical term knowledge graph. For a node of the medical term knowledge graph, consider the set of nodes that the arrows starting from it point to directly. If the node is a medical term node, this set is its set of first-level information units, and a further set is its set of second-level information units. The set of adjacent nodes of the node is then defined [formula not reproduced in this text], and the vector representation of the node is calculated as follows:

[formula not reproduced in this text]

where the weight parameter expresses the importance of one node to the node being represented; the former may be the represented node itself or one of its adjacent nodes. The weight is calculated as follows:

[formula not reproduced in this text]

where three of the symbols are parameter matrices obtained by training and one is a non-linear activation function, whose specific choice in this example is given in the source formula. Since the relative importance between nodes is asymmetric, the weight parameters are also asymmetric: the weight of one node for another generally differs from the weight in the opposite direction.
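
The weighted aggregation described in (3) can be sketched in a few lines of Python. This is a simplified illustration, not the patent's formula: it assumes a single scoring vector `u` in place of the trained parameter matrices, LeakyReLU as the non-linear activation, and a softmax over the node itself plus its adjacent nodes. All names are hypothetical.

```python
import math


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Non-linear activation (assumed choice for this sketch)."""
    return x if x > 0 else slope * x


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_weights(center, candidates, u):
    """Softmax over scores LeakyReLU(u . [center (+) candidate]).

    `center + cand` is list concatenation, i.e. vector splicing.
    """
    scores = [leaky_relu(dot(u, center + cand)) for cand in candidates]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(center, neighbors, u):
    """Vector representation: weighted sum over the node and its neighbors."""
    candidates = [center] + neighbors
    alphas = attention_weights(center, candidates, u)
    return [
        sum(alpha * cand[i] for alpha, cand in zip(alphas, candidates))
        for i in range(len(center))
    ]
```

Because the score depends on which node is the center and on that node's own neighbor set, the weight of node A for node B generally differs from the weight of B for A, which matches the asymmetry noted in the text.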

(4) Training the heterogeneous graph neural network. The training process is divided into two stages: 1) the adjacent-node distribution and the node content encodings are taken as input, and the training goal is to maximize the conditional probability of each node's adjacent nodes given that node, which yields the vector representation of each node; 2) the vector representations of the nodes are taken as input, the similarity of any two medical term nodes is calculated, and the training goal is to maximize the similarity of medical term nodes that have the same meaning.

In the first stage of the training process, a symbol denotes the set of trainable parameters, and the goal of training is to optimize the following objective function over this set:

[formula not reproduced in this text]

where the probability term is the probability of predicting an adjacent node from a given node.
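
A first-stage objective of this kind can be sketched as a log-likelihood over (node, adjacent node) pairs. The sketch below assumes, purely for illustration, that the probability of predicting an adjacent node is a softmax over dot products of node vectors; the patent's actual probability model is not reproduced in this text, and all names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neighbor_log_prob(h, v, c):
    """log p(c | v): softmax over dot products with all other nodes (assumption)."""
    candidates = [k for k in h if k != v]
    scores = [dot(h[v], h[k]) for k in candidates]
    top = max(scores)
    # log-sum-exp, shifted by the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return dot(h[v], h[c]) - log_norm


def log_likelihood(h, neighbors):
    """Sum of log p(c | v) over every node v and each adjacent node c.

    h maps node id -> vector; neighbors maps node id -> list of adjacent ids.
    Training would adjust the parameters that produce h to maximize this.
    """
    return sum(
        neighbor_log_prob(h, v, c)
        for v, adjacent in neighbors.items()
        for c in adjacent
    )
```

With vectors where a term node is close to its true adjacent node, the log-likelihood is higher than when the adjacent node points in the opposite direction, which is the behavior the first stage is meant to reward.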

In the second stage of the training process, the similarity of any two medical term nodes is calculated as follows:

[formula not reproduced in this text]

where the two inputs are medical term nodes in the medical term knowledge graph, the output is their similarity, and W and b are parameters obtained by training. In the medical term normalization training data, each medical term node has a set of medical term nodes with the same meaning and a set of medical term nodes with different meanings, and the label of a training sample is defined from these two sets:

[formula not reproduced in this text]

The goal of the second stage is to minimize the loss function:

[formula not reproduced in this text]
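
The second stage can be illustrated with a small Python sketch. Only the parameter names `W` and `b` come from the text; the functional form used here (a sigmoid over a linear function of the two concatenated node vectors), the 0/1 labels, and the binary cross-entropy loss are assumptions chosen for illustration, since the patent's formulas are not reproduced in this text.

```python
import math


def similarity(h1, h2, W, b):
    """Similarity in (0, 1): sigmoid(W . [h1 (+) h2] + b) (assumed form)."""
    z = sum(w * x for w, x in zip(W, h1 + h2)) + b
    return 1.0 / (1.0 + math.exp(-z))


def label(v, u, same_meaning):
    """1.0 if u is in the same-meaning set of v, else 0.0 (assumed labeling)."""
    return 1.0 if u in same_meaning.get(v, set()) else 0.0


def bce_loss(samples):
    """Mean binary cross-entropy over (similarity, label) pairs (assumed loss)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, y in samples:
        total -= y * math.log(s + eps) + (1.0 - y) * math.log(1.0 - s + eps)
    return total / len(samples)
```

Minimizing such a loss pushes the similarity of same-meaning pairs toward 1 and that of different-meaning pairs toward 0, which is consistent with the stated goal of maximizing the similarity of medical term nodes with the same meaning.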

Fourth, prediction result output module

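For a medical term node to be normalized, the trained heterogeneous graph neural network is used to find the medical term node with the greatest similarity to it:

[formula not reproduced in this text]

A similarity threshold is set. If the greatest similarity satisfies the threshold condition, the two nodes are considered to have the same meaning, that is, the most similar node is the normalization result; otherwise, the node to be normalized is considered to have an independent meaning. The threshold value used in this example is not reproduced in this text.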
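
The decision rule of the prediction result output module can be sketched as follows. The function name, the use of `>=` for the threshold comparison, and the threshold value in the usage note are assumptions; the source text does not preserve the exact condition or value.

```python
from typing import Optional


def normalize(similarities: dict[str, float], threshold: float) -> Optional[str]:
    """Return the most similar known term if it passes the threshold.

    similarities maps each candidate medical term node to its similarity
    with the node being normalized. None means the node is treated as
    having an independent meaning.
    """
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    # ">=" is an assumed comparison; the source condition is not reproduced.
    return best if similarities[best] >= threshold else None
```

For example, with a hypothetical threshold of 0.5, `normalize({"term_a": 0.9, "term_b": 0.3}, 0.5)` selects `"term_a"`, while `normalize({"term_a": 0.4}, 0.5)` yields `None`.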

For example, when the drug term "potassium chloride needle (tsukau production) 10% 10ml by 1" is normalized, its similarity to the other drug term nodes is calculated as shown in Table 4. The drug term node with the same meaning is the one with the highest similarity, "potassium chloride needle 10ml:1g tsukau pharmaceutical company limited".

Table 4. Medical term node similarities calculated by the heterogeneous graph neural network

The embodiment of the invention also provides a medical term normalization method based on a heterogeneous graph neural network, which comprises the following steps:

(1) Defining the key information units for each type of medical term; the information units comprise first-level information units, second-level information units, and the inclusion relationships between the two levels; identifying, at the character level, the information units contained in all medical terms by using a sequence labeling model, and constructing an information unit library; for the implementation of this step, refer to the information unit construction module.

(2) Constructing a medical term knowledge graph based on the relationships between the medical terms and the information units; the nodes of the knowledge graph comprise medical term nodes and information unit nodes, the edges are directed and express two kinds of inclusion relationship, and each edge points from the containing side to the contained side.

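(3) Constructing and training a heterogeneous graph neural network on the medical term knowledge graph; for the implementation of this step, refer to the heterogeneous graph neural network module.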

(4) Inputting the medical term node to be normalized into the trained heterogeneous graph neural network, obtaining the similarity ranking between that node and the other medical term nodes in the medical term knowledge graph, and outputting the medical term normalization result; for the implementation of this step, refer to the prediction result output module.
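
The directed knowledge graph built in the method's second step can be sketched with a small adjacency structure. This is a hedged illustration: the class name `TermGraph`, its methods, and the example node identifiers are hypothetical, and the sketch assumes the two edge kinds are "medical term contains first-level information unit" and "first-level information unit contains second-level information unit".

```python
from collections import defaultdict


class TermGraph:
    """Directed graph whose edges point from the containing side to the
    contained side (hypothetical helper, not the patent's implementation)."""

    def __init__(self) -> None:
        self.out_edges: dict[str, set[str]] = defaultdict(set)
        self.kind: dict[str, str] = {}

    def add_node(self, node: str, kind: str) -> None:
        # kind is e.g. "term", "unit_level_1", "unit_level_2" (assumed labels)
        self.kind[node] = kind

    def add_contains(self, container: str, contained: str) -> None:
        self.out_edges[container].add(contained)

    def first_level_units(self, term: str) -> set[str]:
        """Nodes directly pointed to by arrows starting from the term."""
        return set(self.out_edges.get(term, ()))

    def second_level_units(self, term: str) -> set[str]:
        """Nodes reached by following containment edges one step further."""
        return {
            second
            for first in self.out_edges.get(term, ())
            for second in self.out_edges.get(first, ())
        }
```

A usage example with made-up identifiers: after `g.add_contains("term:glucose", "spec")`, `g.add_contains("spec", "500")`, and `g.add_contains("spec", "ml")`, the call `g.first_level_units("term:glucose")` returns `{"spec"}` and `g.second_level_units("term:glucose")` returns `{"500", "ml"}`.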

The invention defines and identifies the information units contained in multiple types of medical terms and thereby realizes a structured representation of medical terms. This structured representation not only improves the effect of medical term normalization but also supports other medical informatization work. Based on the information units of medical terms, the invention constructs a novel knowledge graph for medical terms, which can support various medical informatization tasks including medical term normalization. The invention also constructs a novel heterogeneous graph neural network for medical term normalization, which normalizes different types of medical terms with a single unified model, implements a suitable content encoding method for each type of information unit, and uses a staged training scheme for the heterogeneous graph neural network.

The foregoing is only a preferred embodiment of the present invention. Although the present invention has been disclosed through the preferred embodiment, the embodiment is not intended to limit the invention. Those skilled in the art can, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible variations and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A medical term normalization system based on a heterogeneous graph neural network, the system comprising:

2. The system of claim 1, wherein the types of medical terms include pharmaceutical terms, disease terms, surgical terms, test terms, and examination terms.

3. The system of claim 1, wherein in the information unit construction module, the sequence labeling model is a BiLSTM-CRF model; the span of each information unit in a medical term is labeled as training data, and characters that do not belong to any information unit are labeled as well, so that the sequence labeling model can discard redundant characters that have no influence on the overall meaning of the medical term.

4. The system according to claim 1, wherein in the information unit construction module, numerical values and units of measurement are preliminarily normalized: the original unit of measurement is normalized into a single basic unit or into several basic units combined through operators, and the numerical value is converted accordingly.

5. The system of claim 1, wherein in the heterogeneous graph neural network module, for each node in the set of all nodes of the medical term knowledge graph, one symbol denotes its node content and another denotes its content encoding; for a node whose content is a numerical value, the content encoding is computed from the numerical value of the node itself [formula not reproduced in this text]; for a node whose content is a unit of measurement, the node content is a sequence of basic units and operators, each element being a basic unit or an operator, and the content encoding is formed from the elements of this sequence with the vector concatenation (splicing) operator [formula not reproduced in this text]; for a node whose content is text, a pre-trained language model computes a semantic vector of the node content, which is used as the content encoding.

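6. The system of claim 5, wherein for a node whose content is text, the content encoding is calculated with a BERT model [formula not reproduced in this text], in which the hidden state of each layer of the BERT model is computed from the input value of that layer [formula not reproduced in this text]; the symbols appearing in the latter formula are parameters obtained by training, the dimension of one of the quantities involved, and a quantity with a stated initial value.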

7. The system according to claim 1, wherein in the heterogeneous graph neural network module, the vector representation of each node is calculated from the content encodings of the node itself and of its adjacent nodes in the medical term knowledge graph; for a node of the medical term knowledge graph, the set of nodes directly pointed to by the arrows starting from that node is considered; if the node is a medical term node, this set is its set of first-level information units, and a further set is its set of second-level information units; the set of adjacent nodes of the node is defined accordingly [formula not reproduced in this text], and the vector representation of the node is calculated as a weighted combination [formula not reproduced in this text], in which the weight parameter is calculated by a formula [not reproduced in this text] whose symbols are three parameter matrices obtained by training and a non-linear activation function.

8. The system of claim 1, wherein in the first stage of training in the heterogeneous graph neural network module, a symbol denotes the set of trainable parameters, and the goal of training is to optimize the following objective function: