WO2023124191A1 - Depth map matching-based automatic classification method and system for medical data elements

Info

Publication number: WO2023124191A1
Application number: PCT/CN2022/116971
Authority: WO
Inventors: 李劲松; 辛然; 杨宗峰; 周天舒; 田雨
Original assignee: 之江实验室
Priority date: 2021-12-30
Filing date: 2022-09-05
Publication date: 2023-07-06
Also published as: JP2024502730A; CN114003791B; CN114003791A; JP7432801B2

Abstract

Disclosed are a depth map matching-based automatic classification method and system for medical data elements. The present invention defines a medical data element graph data model based on minimal metadata information, which makes the depth map matching model applicable to local data swamps that have extremely little metadata information, so that automatic classification of data elements is completed using the least metadata information, and which ensures that graph structure data collected under the graph data model standard is suitable for training the depth map matching model. A vector representation of each medical data element is calculated by a representation learning method, and effective data elements that can be mapped to a standard data model are quickly and automatically screened by classifying the vector representations. The vector representation of each column vertex is calculated using a graph attention mechanism, and the depth map matching model is constructed to complete automatic classification of the medical data elements. The method and system of the present invention have good scalability and can be used to handle various data swamp-to-data lake conversion problems.

Description

Method and system for automatic classification of medical data elements based on depth map matching

Technical field

The invention belongs to the field of regional medical big data centers and data production platforms, and in particular relates to an automatic classification method and system for medical data elements based on depth map matching.

Background art

With the construction and development of medical informatization, the combination of big data and medical services has driven continuous improvement in smart medical technology, and smart medical care has begun to take shape. Regional medical institutions forming medical consortia or medical communities and building unified medical big data centers has become an inevitable trend in the development of smart medical data governance systems. However, the structurally complex information platforms, software and systems of different kinds of medical institutions make data sharing and interaction between institutional platforms impossible, so the data is fragmented and forms data islands. When building a medical big data center among regional medical institutions, it is often found that data within an institution (especially long-standing historical data) lacks management, information system documents are not effectively maintained, field notes are lost, and document quality is low, which makes it difficult to trace data lineage quickly and effectively and results in local data swamps. In the traditional development process of a medical big data center, the responsible personnel of each medical institution's information department and the information system providers must cooperate with the developers of the medical big data center to develop data interfaces (including database views and data dictionaries) based on a standard data model (such as OMOP CDM) in order to complete data discovery, classification and data association mapping; the manually classified and mapped data is stored in the standard database corresponding to the standard data model. The diversity of data sources and the density and unpredictability of data swamps generally lead to long data interface development cycles, complex coordination processes and frequent rework, which consume large amounts of manpower, material and financial resources, hinder the rapid automated construction of regional medical big data centers, and create many difficulties for the subsequent in-depth utilization of medical data.

The data discovery, classification and data association mapping tasks in the development of a medical big data center can be abstracted as the screening and classification of medical data elements and the association mapping of the classified medical data elements. First, the designers of the platform development plan define the standard data element classification system and the corresponding data interface specifications based on the standard data model. Developers then filter and identify the data elements that match the data interface specifications through rule-based search and manual search. This process is called data discovery; it determines which data elements in the medical institution's data lake should be collected. Developers then develop data interfaces based on the results of data discovery and complete the data collection. Finally, developers classify the multi-source heterogeneous data elements in the data lake of the medical institution according to the standard data element classification system, and integrate and map them to that system.

The disadvantages of the prior art are mainly reflected in the following two aspects:

1) Medical institutions have a large number of information systems from different providers, the data collection process is complex, and a great deal of manual work is required, which hinders the construction of medical big data centers and the effective development of big data applications. A tertiary medical institution may have as many as 100 to 300 information systems, forming a huge data lake. Because of the large volume of data in the data lake and the intricate relationships within it, data discovery in the data interface development stage depends on long-term cooperation between the responsible personnel of the institution's information department and the information system providers. The data interfaces are interconnected, so data discovery is labor-intensive and time-consuming, and once an intermediate link fails, troubleshooting is very complicated.

2) Common problems such as frequent changes to the information systems of medical institutions, difficulty in maintaining historical system documents, and severe data loss create local data swamps in the data lakes of medical institutions, further increasing the difficulty of data interface development. Medical data includes diagnosis and treatment data generated during patient care and observation data generated during the operation of medical institutions, with diverse sources and complex relationships. As information system versions change, historical data lies dormant in the institution's data lake without effective management, forming a local data swamp. Building a medical big data center requires integrating this historical data to complete the transformation from data swamp to data lake. Because personnel in the information departments of medical institutions and at information system providers turn over frequently, historical system documents are lost from time to time. Faced with lost documents, data interface developers can only screen all candidate data in the institution's data lake manually, by repeated trial and error, to complete data discovery. Because the information systems are numerous and their relationships complex, manual screening can hardly make effective use of the global information in the data lake; it takes a long time and has a high error rate, which greatly increases the duration and difficulty of data discovery. When the association structure among the data in the data lake is too complex for humans to handle, development of the corresponding data interface can only be abandoned, so that no data can be found for association mapping to the corresponding category and the data of that category is lost.

Summary of the invention

During the construction of the medical big data center, problems such as local data swamps in medical institutions generally exist, resulting in long data interface development time and difficult maintenance. Traditional solutions rely on manual processing, and it is difficult to complete data discovery, classification and association mapping of massive data on a large scale. The multi-source heterogeneous data in the data lake of medical institutions can be abstracted into a set of medical data elements to be screened composed of unknown classification data elements. In the past few years, the rise and application of graph neural networks have successfully promoted the development of deep learning paradigms for graph-structured data.

The present invention uses a deep graph matching algorithm based on graph neural networks to improve on data element classification by manual processing. It minimizes dependence on the documents of the information systems and needs to obtain only a small amount of metadata information from the data lake of the medical institution. It screens effective data elements quickly based on the text semantics of the medical data, which automates data discovery in the data lake of the medical institution; it classifies medical data elements quickly using the depth map matching algorithm; and it automatically classifies and associatively maps the data elements in the data lake of the medical institution to the standard data element classification system, which greatly improves the efficiency of data interface development for a medical big data center. The data element classification method provided by the present invention has good scalability and can be applied to various data swamp-to-data lake conversion problems.

The purpose of the present invention is achieved through the following technical solutions:

One aspect of the present invention discloses a method for automatic classification of medical data elements based on depth map matching, the method includes the following steps:

(1) Define a medical data element graph data model based on minimal metadata information; the multi-source heterogeneous data elements stored in the data lake of the medical institution form a set of medical data elements to be screened, which is automatically mapped to the medical data element graph data model, and the mapping result is stored as the medical data element graph data to be screened;

(2) Calculate the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; construct a medical data element screening model and, based on the importance of each column vertex, calculate the possibility that the column corresponding to each column vertex can be mapped to the standard data model, thereby screening out the effective column vertices; the effective column vertices are associated to form the medical data element graph data to be classified, and the set of columns corresponding to the effective column vertices forms the medical data element set to be classified;

(3) Determine, from the medical data element graph data to be classified, the seed vertex set of the standard classification medical data element graph data; cut subgraphs from the medical data element graph data to be classified based on the seed vertex set; and use the depth map matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to the column vertices.

Further, the medical data element graph data model is modeled using a directed attribute graph, and the graph is composed of two graph elements: vertices and edges;

The vertex is composed of a label and an attribute group corresponding to the label. The label represents the type of the vertex, and the attribute group represents one or more attributes owned by the label. The ontology information of the vertices includes the vertex types and the attribute information corresponding to each vertex type. The vertex types are database vertex, table vertex and column vertex. The attribute information of a database vertex includes the database vertex index and the database type information; the attribute information of a table vertex includes the table vertex index; and the attribute information of a column vertex includes the column vertex index, the column data type information and the column vector representation;

The edge is composed of an edge type and edge attributes, and every edge is a directed edge. The ontology information of the edges includes the edge types and the attribute information corresponding to each edge type. The edge types are: the parent-child association whose starting point is a database vertex and whose end point is a table vertex; the parent-child association whose starting point is a table vertex and whose end point is a column vertex; and the foreign key whose starting point and end point are both column vertices. The attribute information of each of the three edge types is the edge index.
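The directed attribute graph described above can be sketched as follows. This is only an illustrative sketch: the class names (`Vertex`, `Edge`, `ElementGraph`), the edge type strings and the automatic edge index are assumptions made for the example, not identifiers from the invention.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Vertex labels named in the text.
VERTEX_LABELS = {"database", "table", "column"}

# Each edge type fixes the label of its start and end vertex (names are illustrative).
EDGE_TYPES = {
    "db_table": ("database", "table"),     # parent-child: database -> table
    "table_column": ("table", "column"),   # parent-child: table -> column
    "foreign_key": ("column", "column"),   # foreign key: column -> column
}


@dataclass
class Vertex:
    index: str                              # vertex index attribute
    label: str                              # vertex type
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    index: str                              # edge index, the only edge attribute
    edge_type: str
    start: str
    end: str


class ElementGraph:
    def __init__(self) -> None:
        self.vertices: Dict[str, Vertex] = {}
        self.edges: List[Edge] = []

    def add_vertex(self, index: str, label: str, **attrs: object) -> Vertex:
        if label not in VERTEX_LABELS:
            raise ValueError(f"unknown vertex label: {label}")
        self.vertices[index] = Vertex(index, label, dict(attrs))
        return self.vertices[index]

    def add_edge(self, edge_type: str, start: str, end: str) -> Edge:
        src_label, dst_label = EDGE_TYPES[edge_type]
        # Reject edges whose endpoints do not have the labels the edge type requires.
        if (self.vertices[start].label != src_label
                or self.vertices[end].label != dst_label):
            raise ValueError(f"{edge_type} cannot connect {start} -> {end}")
        edge = Edge(f"e{len(self.edges)}", edge_type, start, end)
        self.edges.append(edge)
        return edge
```

Validating the endpoint labels when an edge is added keeps the stored graph consistent with the three edge types listed above, which later steps (subgraph cutting, neighbour lookup) rely on.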

Further, the mapping of the multi-source heterogeneous data elements to the medical data element graph data model includes:

Collect heterogeneous medical data from multiple sources from the data lake to form a collection of medical data elements to be screened;

Use the metadata collection tool to capture the metadata stored in the data lake;

Use the column vector generator to traverse the data stored in each column of each table in the medical data element set to be screened, and use the column vector representation model to predict and obtain the column vector representation of each column of each table;

Through graph data association mapping, the collected metadata and the generated column vector representations are mapped to the medical data element graph data model to obtain the medical data element graph data to be screened.
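The collection and mapping steps above can be sketched for a single source. The example assumes, purely for illustration, that the source is a SQLite database and uses SQLite's `PRAGMA table_info` and `PRAGMA foreign_key_list` as the metadata collection tool; the function name `collect_graph`, the index format and the `column_vector` callback (standing in for the column vector representation model) are hypothetical.

```python
import sqlite3


def collect_graph(conn, db_name, column_vector):
    """Map one SQLite database to (vertices, edges).

    vertices: dict index -> attribute dict (with a "label" key)
    edges:    list of (edge_type, start_index, end_index)
    column_vector: callable taking the list of values in a column and
                   returning its vector representation (assumed, caller-supplied)
    """
    vertices = {db_name: {"label": "database", "db_type": "sqlite"}}
    edges = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]

    for table in tables:
        t_idx = f"{db_name}.{table}"
        vertices[t_idx] = {"label": "table"}
        edges.append(("db_table", db_name, t_idx))
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _cid, col, col_type, *_rest in conn.execute(f'PRAGMA table_info("{table}")'):
            c_idx = f"{t_idx}.{col}"
            values = [r[0] for r in conn.execute(f'SELECT "{col}" FROM "{table}"')]
            vertices[c_idx] = {
                "label": "column",
                "data_type": col_type,
                "vector": column_vector(values),
            }
            edges.append(("table_column", t_idx, c_idx))

    for table in tables:
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        for _id, _seq, ref_table, src, dst, *_rest in conn.execute(
                f'PRAGMA foreign_key_list("{table}")'):
            edges.append(("foreign_key",
                          f"{db_name}.{table}.{src}",
                          f"{db_name}.{ref_table}.{dst}"))
    return vertices, edges
```

Only table names, column names, declared column types and foreign keys are read, which matches the idea of relying on minimal metadata; other database engines would need their own catalog queries.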

Further, the column vector generator uses a single column in the data table as a data element unit, uses the column vector representation model to convert the data stored in each column, and calculates the vector representation of each column;

The training of the column vector representation model includes: the training data of the column vector representation model is the column data stored in the standard database for which medical data element classification has been completed manually and whose data structure conforms to the standard data model, denoted standard classification columns; the column vertices in the standard classification medical data element graph data correspond one-to-one with the standard classification columns;

Let the column vertex set in the standard classification medical data element graph data be C={c _{k, j} }, where c _{k, j} denotes the data in the jth row of the kth standard classification column corresponding to the column vertex set, c _{k, j} ={w _t } _{t=1, 2,..., m} , m is the total number of characters in row j, and w _t is a character constituting the data c _{k, j} . The initial vector representation h(w _t ) of character w _t is calculated by the text representation model h. n rows of data {c _{k, j} } _{j=1, 2,..., n} are randomly sampled under the column vertex C _k of the standard classification medical data element graph data, and the vector of the jth row of data is expressed as:

According to the calculation of the self-attention mechanism, the correlation of each row of data under the column vertex C _k in the standard classified medical data element graph data is obtained, and the column vector representation H(C _k ) of the column vertex C _k is obtained. The calculation formula is:

Where v(C _k ) is the vector representation of column vertex C _k , d _k is the dimension of v(C _k ), and softmax is the softmax function;

The prediction of the column vector representation model includes: the prediction data of the column vector representation model is the set of medical data elements to be screened, composed of every column of every table in every database in the data lake; the set is traversed with the column as the traversal unit; the column vector representation model is used to calculate a column vector representation for each random sampling of a column vertex; and the average of the column vector representations predicted over multiple random samplings is taken as the final column vector representation of the column vertex.
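The column vector computation can be illustrated with a small NumPy sketch. The formulas themselves are not reproduced in this text, so the sketch rests on stated assumptions: the text representation model h is replaced by a fixed pseudo-random vector per character, a row vector is taken as the mean of its character vectors, and the column representation is mean-pooled scaled dot-product self-attention, softmax(v v^T / sqrt(d_k)) v, over the sampled rows. Only the averaging over several random samplings is taken directly from the description. All function names are hypothetical.

```python
import zlib

import numpy as np

DIM = 16  # dimension d_k of the vectors (illustrative value)


def char_vec(ch):
    # Stand-in for the text representation model h: one fixed vector per character.
    rng = np.random.default_rng(zlib.crc32(ch.encode("utf-8")))
    return rng.standard_normal(DIM)


def row_vec(text):
    # Assumed aggregation: mean of the character vectors; empty text maps to zeros.
    if not text:
        return np.zeros(DIM)
    return np.mean([char_vec(ch) for ch in text], axis=0)


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attend(rows):
    # rows: (n, DIM) matrix of sampled row vectors.
    # Scaled dot-product self-attention followed by mean pooling (assumed pooling).
    weights = softmax(rows @ rows.T / np.sqrt(rows.shape[1]), axis=-1)
    return (weights @ rows).mean(axis=0)


def column_vector(values, n=8, repeats=5, seed=0):
    """Average the attention-pooled representation over several random samplings.

    values must be a non-empty list; items are converted to strings.
    """
    rng = np.random.default_rng(seed)
    texts = [str(v) for v in values]
    reps = []
    for _ in range(repeats):
        sample = rng.choice(texts, size=min(n, len(texts)), replace=False)
        rows = np.stack([row_vec(str(s)) for s in sample])
        reps.append(attend(rows))
    return np.mean(reps, axis=0)
```

Averaging over repeated samplings reduces the dependence of the representation on which rows happened to be drawn, which matters when a column is far longer than the sample size n.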

Further, the calculation of the importance of each column vertex stored in the medical data element graph data to be screened in the medical data element graph data model includes:

For a column vertex C _k stored in the medical data element graph data to be screened, randomly select p column vertices {C _t } _{t=1, 2, ..., p} from the set of column vertices other than C _k , and calculate the correlation between column vertex C _k and the selected column vertices as the importance score Im(C _k ) of C _k in the medical data element graph data model. Im(C _k ) is defined as:

where Importance_score is an importance function.
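As an illustration of the importance score, the sketch below assumes one possible choice of Importance_score, namely the mean cosine similarity between the vector of C _k and the vectors of p randomly drawn other column vertices. The description does not fix the importance function, so this choice, and the names `cosine` and `importance`, are assumptions.

```python
import numpy as np


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def importance(k, vectors, p=3, seed=0):
    """Importance of column k given vectors: dict column index -> 1-D array.

    Assumed importance function: mean cosine similarity to p other columns
    drawn at random without replacement. Requires at least one other column.
    """
    rng = np.random.default_rng(seed)
    others = [c for c in vectors if c != k]
    picked = rng.choice(len(others), size=min(p, len(others)), replace=False)
    return float(np.mean([cosine(vectors[k], vectors[others[i]]) for i in picked]))
```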

Further, the training and prediction of the medical data element screening model are as follows:

The standard classification medical data element set, constructed through manual classification and association mapping according to the standard data element classification system, is converted into standard classification medical data element graph data. Let the set of column vertices stored in the standard classification medical data element graph data be S={s _k }, and let the set of column vertices corresponding to the columns excluded by manual screening during construction of the standard classification medical data element set be S′={s′ _k };

During training, q column vertices are randomly selected from the set S as the positive sample set {s _t } _{t=1, 2, ..., q} , and q column vertices are randomly selected from the set S′ as the negative sample set {s′ _t } _{t=1, 2, ..., q} ; Let the importance score of the sample (s _i , y _i ) be Im(s _i ), s _i represents the i-th column vertex, y _i ∈ {0, 1 } represents the true category of the sample, then calculate the loss function Loss of the medical data element screening model based on the importance score:

When predicting, the medical data element screening model judges whether the column in the medical data element set to be screened corresponding to the column vertex C _k is a valid data element by calculating the threshold L', and the calculation formula of the threshold L' is:

If L'≥0.5, it indicates that the column vertex C _k is a valid column vertex, and the corresponding column is a valid data element;

The medical data element graph data to be classified is formed by association of the filtered effective column vertex sets, and the corresponding filtered column sets form the medical data element set to be classified.
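The training and prediction of the screening model can be sketched as a small binary classifier on the importance score. The loss and threshold formulas are not reproduced in this text; the sketch assumes a logistic model sigmoid(a * Im + b) trained with binary cross-entropy by plain gradient descent, and keeps only the decision rule L' ≥ 0.5 from the description. The names `fit`, `bce` and `is_valid`, and the example scores, are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(prob, y):
    # Binary cross-entropy, with a small epsilon to avoid log(0).
    eps = 1e-12
    return float(-np.mean(y * np.log(prob + eps) + (1 - y) * np.log(1 - prob + eps)))


def fit(scores, labels, lr=0.5, steps=2000):
    """Fit a and b of sigmoid(a * score + b) on importance scores.

    labels: 1 for column vertices kept in S, 0 for vertices excluded into S'.
    """
    x = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad = sigmoid(a * x + b) - y      # gradient of the loss w.r.t. the logit
        a -= lr * float(np.mean(grad * x))
        b -= lr * float(np.mean(grad))
    return a, b


def is_valid(score, a, b):
    # A column vertex is treated as effective when the predicted value is >= 0.5.
    return bool(sigmoid(a * score + b) >= 0.5)
```

For example, fitting on equal numbers of positive and negative samples and then calling `is_valid` on the importance score of an unseen column vertex reproduces the screening decision described above, under the stated assumptions.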

Further, the determination of the seed vertex set of the standard classified medical data meta graph data from the medical data meta graph data to be classified includes:

Let E be the set of all standard classifications in the standard data element classification system defined by the standard data model, let D be the set of column vertices in the standard classification medical data element graph data, and let the classification of D _i ∈ D in the standard data element classification system be E _i ∈ E. Let C be the set of column vertices stored in the medical data element graph data to be classified. The medical data element classification process is abstracted as finding, in D, the column vertex D _i with the highest matching degree with a column vertex C _k ∈ C, thereby determining that the classification of the column corresponding to C _k is E _i ;

For a column vertex D _i ∈ D, randomly select r ₀ data from the column corresponding to D _i

For a column vertex C _k ∈ C, randomly sample r ₀ data from the column corresponding to C _k

Then the matching degree match_1(D _i , C _k ) of D _i and C _k is:

where v(x) denotes the vector representation of data x. The seed vertex corresponding to D _i is then the column vertex in C with the highest matching degree with D _i

That is:
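The seed vertex selection can be sketched as an argmax over candidate column vertices. The expression for match_1 is not reproduced in this text; the sketch assumes it is the mean pairwise cosine similarity between the vectors of the r ₀ data items sampled from each column, and assumes the sampled items have already been converted to non-zero vectors by v. The function names are hypothetical.

```python
import numpy as np


def match_1(d_samples, c_samples):
    """Assumed matching degree: mean pairwise cosine similarity.

    d_samples, c_samples: (r0, dim) matrices of sampled data vectors,
    with no all-zero rows.
    """
    d = d_samples / np.linalg.norm(d_samples, axis=1, keepdims=True)
    c = c_samples / np.linalg.norm(c_samples, axis=1, keepdims=True)
    return float((d @ c.T).mean())


def seed_vertex(d_samples, candidates):
    """Return the candidate column index with the highest match_1 to D_i.

    candidates: dict column index -> (r0, dim) matrix of sampled data vectors.
    """
    return max(candidates, key=lambda k: match_1(d_samples, candidates[k]))
```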

Further, the subgraph cutting of the medical data element graph data to be classified based on the seed vertex set includes:

For a seed vertex, one set collects the column vertices in the medical data element graph data to be classified that have a parent-child relationship with the seed vertex, and a second set collects the column vertices in the medical data element graph data to be classified that have a foreign key relationship with the seed vertex. The subgraph obtained by cutting based on the seed vertex is defined in terms of these two sets.

Let N(D _i ) denote the set of column vertices associated with the same parent vertex as D _i in the standard classification medical data element graph data. The goal of the depth map matching model is then to search for a subgraph within the subgraph cut from the seed vertex, such that the column vertices in the searched subgraph match the column vertices in N(D _i ) one by one, thereby classifying the medical data elements corresponding to the column vertices in the cut subgraph.
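A minimal sketch of subgraph cutting is given below. The cutting formula is not reproduced in this text; the sketch assumes that the cut subgraph consists of the seed vertex, the column vertices that share a parent table vertex with it, and the column vertices linked to it by a foreign key edge in either direction. The edge tuple layout and the function name `cut_subgraph` are illustrative.

```python
def cut_subgraph(seed, edges):
    """Return the set of column vertex indices in the subgraph cut around seed.

    edges: iterable of (edge_type, start, end) tuples, where edge_type is
    "table_column" for table -> column and "foreign_key" for column -> column.
    """
    edges = list(edges)
    # Parent table(s) of the seed column.
    parents = {s for t, s, e in edges if t == "table_column" and e == seed}
    # Columns under the same parent table(s).
    siblings = {e for t, s, e in edges if t == "table_column" and s in parents}
    # Columns connected to the seed by a foreign key, in either direction.
    fk_linked = set()
    for t, s, e in edges:
        if t == "foreign_key":
            if s == seed:
                fk_linked.add(e)
            elif e == seed:
                fk_linked.add(s)
    return siblings | fk_linked | {seed}
```

Restricting the search to this neighbourhood keeps the later matching step local to the seed vertex instead of comparing against every column in the data lake.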

Further, the use of the depth map matching model to complete the classification of column vertices in the medical data element graph data to be classified includes:

According to the graph attention mechanism, the vector representation V(D _i ) of the column vertex D _i in the standard classification medical data element graph data is calculated from r ₁ pieces of data randomly sampled from the column corresponding to each column vertex D′ in N(D _i ), where w(D′, D _i ) denotes the weight function of a column vertex D′ in N(D _i ) with respect to the column vertex D _i ;

Likewise, according to the graph attention mechanism, the vector representation of a column vertex of the medical data element graph data to be classified is calculated from r ₁ pieces of data randomly sampled from the column corresponding to each neighbouring column vertex C′, using a weight function of C′ with respect to the column vertex being represented;

The matching degree match_2(D′, C′) is calculated between a column vertex D′∈N(D _i ) and a column vertex C′ of the medical data element graph data to be classified;

The column vertex D′ with the highest matching degree with C′ is taken, and the classification of the column corresponding to the column vertex C′ in the medical data element graph data to be classified is the category corresponding to that D′ in the standard data element classification system.
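The attention-based matching step can be illustrated as follows. The formulas for the vertex representations and for match_2 are not reproduced in this text, so the sketch is built on assumptions: the weight function is the softmax of dot products between a vertex vector and its neighbours, each vertex is represented by its own vector concatenated with the attention-weighted mix of its neighbours, and match_2 is the cosine similarity of these representations. Column vectors are assumed to be given, and every graph is assumed to hold at least two column vertices. All names are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(center, neighbours):
    # Attention-weighted mix of neighbour vectors (assumed weight function).
    weights = softmax(neighbours @ center)
    return weights @ neighbours


def context_vectors(vecs):
    """vecs: dict index -> vector. Returns each vector joined with its context."""
    out = {}
    for k, v in vecs.items():
        others = np.stack([u for j, u in vecs.items() if j != k])
        out[k] = np.concatenate([v, attend(v, others)])
    return out


def match_2(a, b):
    # Assumed matching degree: cosine similarity of context-aware vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def classify(sub_vecs, std_vecs, std_labels):
    """Assign to each column vertex C' the category of its best-matching D'.

    sub_vecs:   column vectors of the cut subgraph to be classified
    std_vecs:   column vectors of N(D_i) in the standard classification graph
    std_labels: dict D' index -> category in the standard classification system
    """
    c_ctx, d_ctx = context_vectors(sub_vecs), context_vectors(std_vecs)
    result = {}
    for c, cv in c_ctx.items():
        best = max(d_ctx, key=lambda d: match_2(cv, d_ctx[d]))
        result[c] = std_labels[best]
    return result
```

Including the neighbourhood context in each representation means two columns match well only when both their own content and the columns around them are similar, which is the intuition behind matching subgraphs rather than isolated columns.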

Another aspect of the present invention discloses a medical data element automatic classification system based on depth map matching, the system includes:

Standardized acquisition and mapping module of multi-source heterogeneous data elements: defines the medical data element graph data model based on minimal metadata information; forms a set of medical data elements to be screened from the multi-source heterogeneous data elements stored in the data lake of the medical institution, automatically maps the set to the medical data element graph data model, and stores the mapping result as the medical data element graph data to be screened;

Effective medical data element screening module: Calculate the importance of each column vertex stored in the medical data element graph data to be screened in the medical data element graph data model; build a medical data element screening model, and calculate each column based on the importance of each column vertex The possibility that the column corresponding to the vertex is mapped to the standard data model, and the valid column vertex is screened out. The corresponding column is a valid medical data element, and the medical data element graph data to be classified is composed of the valid column vertex set, and the column set corresponding to the valid column vertex Form a collection of medical data elements to be classified;

Medical data element classification module based on depth graph matching model: determine the seed vertex set of standard classified medical data element graph data from the medical data element graph data to be classified; perform subgraph cutting of the medical data element graph data to be classified based on the seed vertex set ; Use the deep graph matching model to complete the classification of the column vertices in the medical data element graph data to be classified, so as to obtain the classification of the medical data elements corresponding to the column vertices.

The beneficial effects of the present invention are:

1) The present invention uses only the minimal metadata information stored in the data lake of the medical institution, and uses the medical data element graph data model to achieve standardized collection of the medical data elements in the medical institution and to make full use of the relationship information between the medical data elements to be screened and classified.

2) The method of the present invention reduces the dependence of the data discovery, classification and association mapping process on the historical documents of the information system of the medical institution, and the absence and error of the historical documents have little influence on the classification results of the medical data elements.

3) The method of the present invention greatly reduces manual intervention in data discovery, classification and association mapping, and classifies the medical data elements to be classified by means of an artificial intelligence algorithm. It provides a heuristic solution to the difficult problem of automatic classification of medical data elements that arises in meeting the needs for real-time update, dynamic aggregation and in-depth utilization of medical big data center data.

Description of drawings

Fig. 1 is the overall flowchart of the method of the present invention;

Fig. 2 is the flowchart of traditional medical data element classification method;

Fig. 3 is a schematic diagram of the implementation process of the automatic classification method for medical data elements based on depth map matching provided by the present invention;

Fig. 4 is an example of medical data element diagram data model;

Fig. 5 is a schematic diagram of the mapping of multi-source heterogeneous data elements to the medical data element graph data model.

Detailed description

In order to make the above objects, features and advantages of the present invention more comprehensible, specific implementations of the present invention will be described in detail below in conjunction with the accompanying drawings.

In the following description, many specific details are set forth so that the present invention can be fully understood. The present invention can, however, also be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the present invention; the present invention is therefore not limited to the specific embodiments disclosed below.

The terms involved in the present invention are first described below:

Metadata: data that describes other data, that is, data about data. The term does not always refer to a single item of data; it can be understood as a group of information or data used to describe data. When every item in such a group describes or reflects some aspect of a given piece of data, the group can be called metadata. Metadata can describe the elements or properties of data (name, size, data type, etc.), its structure (length, fields, data columns), or related information (where it is located, how it is associated, who owns it). In everyday life, metadata is ubiquitous: wherever there is a class of things, a set of metadata can be defined.

Data element: can be understood as the basic unit of data. The basic data elements of health information standardize and define the unique Chinese names and codes of all relevant information in the field of medicine and health, and the codes are expressed in letters, Chinese characters, and digital strings. A data element enumerates and defines an information resource in a specific semantic environment. Complete data element name = object class term + feature class term + representation class term + (qualified class term).

Differences and connections between data elements and metadata: Metadata cannot cover all the information necessary to understand the data that a data element is intended to represent. Information about data elements is an integral part of any (organizational) metadata. Each element of metadata is a data element, and metadata is described using attributes and description methods that conform to data element standards. Storing and codifying metadata in a repository requires modeling, which requires obtaining metadata from a registry of data elements or from a repository. Metadata is thus data elements expressed in a consistent, standard way. Both the metadata format and the data element dictionary format are composed of attributes such as line number, Chinese name, English name, identifier (phrase), definition, constraint/condition, maximum number of occurrences, data type, and data value range. The difference is that the data element dictionary format has additional attributes such as context and synonym name.

Data Lake: A data lake is a method of storing data in a natural format in a system or repository, which facilitates the configuration of data in various schema and structural forms, usually object blocks or files. The main idea of a data lake is the unified storage of all data in an enterprise, from raw data (an exact copy of source system data) to target data for various tasks such as reporting, visualization, analysis, and machine learning. In China, the entire HDFS is generally called a data warehouse (in a broad sense), that is, the place where all data is stored, while in foreign countries it is generally called a data lake. When data lakes are left unmanaged, data swamps form. It is easy to build a data lake, but it is difficult to make the data lake play a role. In the end, the data lake just pours data into it all the time, and there are very few application scenarios, with no output or very little output, forming a one-way lake. Most enterprises that use data lakes often fail to use the data because the quality of the data in the data lake is too poor when the data really needs to be used.

Graph Neural Networks: In the past few years, the rise and application of neural networks has successfully advanced research in pattern recognition and data mining. Many machine learning tasks (such as object detection, machine translation, and speech recognition) that once relied heavily on manually extracted features have been revolutionized by various end-to-end deep learning paradigms. Although traditional deep learning methods have been applied with great success to extracting features from Euclidean space data, the data in many practical application scenarios is generated from non-Euclidean spaces, and the performance of traditional deep learning methods on non-Euclidean space data remains unsatisfactory. Each data sample (node) in a graph has edges relating it to other data samples in the graph, and this information can be used to capture the interdependencies between instances. A graph neural network is a neural network applied to graph-structured data (non-Euclidean space).

Deep graph matching: Graph matching is a classic problem in artificial intelligence with important applications in several fields, such as matching 2D/3D shapes in computer vision, matching protein networks in bioinformatics, and matching users across different social networks. Deep graph matching is a method that solves the graph matching problem with graph neural networks.

As shown in Figure 1, the present invention provides an automatic classification method for medical data elements based on depth map matching. The method comprises the following steps:

(1) Standardized collection and mapping of multi-source heterogeneous data elements, including:

Define a medical data element graph data model based on minimal metadata information;

Form the multi-source heterogeneous data elements stored in the data lake of the medical institution into a set of medical data elements to be screened, map the set automatically to the medical data element graph data model, and store the mapping result as the medical data element graph data to be screened;

Fig. 2 is a flowchart of the traditional medical data element classification method. The implementation of each part of the method of the present invention is described in detail below with reference to Fig. 3.

1. Standardized collection and mapping of multi-source heterogeneous data elements

1.1 Definition of medical data element graph data model

The data of medical institutions are aggregated into a data lake. The data in the lake are multi-source and heterogeneous, including observation data from both the diagnosis and treatment process and the operation of the medical institution. Observation databases differ in purpose and design: electronic medical records produced during diagnosis and treatment are designed to support clinical practice, while operational data are built for in-hospital management and medical insurance reimbursement. Because each is collected for a different purpose, the data have different logical organizations and physical formats.

A data model is a tool used in database design to abstract the real world. Establishing a standard, unified data model that defines data structures, data operations, and data constraints effectively ensures the quality of the collected data and the controllability of the data representation standard. A graph data model is a data model developed on a graph database.

Because the data lake contains different types of databases, the relationships between data tables and data columns are complex. Observation data in medical institutions span a long period, and missing information in database documentation is common. The present invention aims to make the deep graph matching model applicable to local data swamps with extremely little metadata, to complete the automatic classification of data elements with the minimum metadata information, and to ensure that the graph-structured data collected under the graph data model standard are suitable for training the deep graph matching model. To this end, based on the minimum metadata information of the databases in the data lake, the present invention defines a medical data element graph data model, which provides a heuristic solution for the automatic classification of medical data elements when a medical big data center is built.

The graph data model is modeled as a directed property graph, which consists of two kinds of graph elements: vertices (Vertex) and edges (Edge). A vertex is composed of a label and an attribute group corresponding to the label. The label represents the type of the vertex, and the attribute group represents one or more attributes owned by the label. Vertex ontology information includes the vertex types and the attribute information corresponding to each type of vertex.

The ontology information of the vertex of the medical data element graph data model defined by the present invention is shown in the following table:

Table 1 The ontology information table of the vertices of the medical data element graph data model

Here, vid is the unique index id of each vertex in the graph, which can be generated uniformly by hash coding, and vector_embeddings is the column vector representation predicted by the column vector representation model.

In the graph data model, an edge is composed of an edge type and an edge attribute, and each edge is a directed edge, and a directed edge indicates an association relationship between one vertex (start point src) and another vertex (end point dst). Edge ontology information includes edge types and attribute information corresponding to each type of edge.

The ontology information of the edge of the medical data element graph data model defined by the present invention is shown in the following table:

Table 2 The ontology information table of the edge of the medical data element graph data model

Start label	End label	Edge type	Attribute	Attribute description
Database	Table	Parent-child association	eid	Edge index
Table	Column	Parent-child association	eid	Edge index
Column	Column	Foreign key	eid	Edge index

Figure 4 is an example of a medical data element graph data model.
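The vertex and edge ontology described above can be sketched as a small directed property graph. This is a minimal illustration under stated assumptions, not the implementation of the invention: the class names, the `db_type` and `data_type` attribute names, and the edge-type strings are hypothetical; only `vid`, `eid`, and `vector_embeddings` appear in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Allowed (start label, end label) -> edge type, following Table 2.
# The edge-type strings are illustrative names, not taken from the source.
ALLOWED_EDGES: Dict[Tuple[str, str], str] = {
    ("Database", "Table"): "parent_child",
    ("Table", "Column"): "parent_child",
    ("Column", "Column"): "foreign_key",
}


@dataclass
class Vertex:
    vid: str      # unique vertex index
    label: str    # "Database", "Table" or "Column"
    props: Dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    eid: str      # edge index
    src: str      # vid of the start vertex
    dst: str      # vid of the end vertex
    edge_type: str


class ElementGraph:
    """Directed property graph restricted to the edge types of Table 2."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Vertex] = {}
        self.edges: List[Edge] = []

    def add_vertex(self, vid: str, label: str, **props: object) -> Vertex:
        v = Vertex(vid, label, dict(props))
        self.vertices[vid] = v
        return v

    def add_edge(self, src: str, dst: str) -> Edge:
        # The edge type is fully determined by the labels of its endpoints.
        key = (self.vertices[src].label, self.vertices[dst].label)
        if key not in ALLOWED_EDGES:
            raise ValueError(f"edge {key[0]} -> {key[1]} is not in the model")
        e = Edge(f"e{len(self.edges)}", src, dst, ALLOWED_EDGES[key])
        self.edges.append(e)
        return e


g = ElementGraph()
g.add_vertex("db1", "Database", db_type="MySQL")
g.add_vertex("t1", "Table")
g.add_vertex("c1", "Column", data_type="string", vector_embeddings=None)
g.add_edge("db1", "t1")
g.add_edge("t1", "c1")
```

Deriving the edge type from the endpoint labels keeps the graph within the three edge types of Table 2; any other combination, such as Column to Database, is rejected.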

1.2 Mapping of multi-source heterogeneous data elements to medical data element graph data model

The data collection and association mapping process of the present invention collects heterogeneous medical data from multiple sources from the data lake to form a set of medical data elements to be screened. Use the metadata collection tool to capture the metadata stored in the data lake. Use the column vector generator to traverse the data stored in each column of each table in the medical data element set to be screened, and use the column vector representation model to predict and obtain the column vector representation of each column of each table. Finally, through graph data association mapping, the collected metadata and the generated column vector representation are associated and mapped to the medical data element graph data model to obtain the medical data element graph data to be screened. Referring to Figure 5, the specific implementation is described as follows:

(1) Metadata collection tool

a) Database adaptation: Since data lakes in medical institutions usually contain different types of databases, metadata collection tools need to develop database adaptation modules for different types of databases to achieve adaptation.

b) Parsing configuration: Since the final association mapping target is the medical data element graph data model, collection is configured to gather only the table and column information, the lineage information, and the foreign key information of each column in the metadata; common metadata such as primary keys, constraints, indexes, permissions, and triggers are outside the scope of collection.

c) Metadata capture: perform metadata capture operations on each database in the data lake according to the parsing configuration.

d) Data association: According to the database adaptation situation, the field types of different types of databases are uniformly mapped to the graph database data types. For example, the varchar2 type of the Oracle database and the varchar type of the MySQL database are uniformly mapped to the string type of the graph database, and the same is true for other types of databases.
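The data association step in d) can be sketched as a lookup from native column types to unified graph-database types. Only the two string mappings (Oracle varchar2 and MySQL varchar) come from the text; the other entries, the function name `to_graph_type`, and the target type names are assumptions made for illustration.

```python
# Mapping from (database kind, native column type) to a graph-database type.
# Only the two "string" entries come from the text; the rest are assumed.
TYPE_MAP = {
    ("oracle", "varchar2"): "string",
    ("mysql", "varchar"): "string",
    ("oracle", "number"): "double",   # assumption
    ("mysql", "int"): "int",          # assumption
}


def to_graph_type(db_kind: str, native_type: str) -> str:
    """Return the unified graph type, ignoring case and length suffixes."""
    base = native_type.lower().split("(")[0].strip()
    try:
        return TYPE_MAP[(db_kind.lower(), base)]
    except KeyError:
        raise ValueError(f"no mapping for {db_kind}:{native_type}") from None
```

For example, `to_graph_type("Oracle", "VARCHAR2(20)")` and `to_graph_type("mysql", "varchar(255)")` both resolve to the same graph type, which is the uniform mapping the step describes.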

(2) Column vector generator

The column vector generator uses a single column in the data table as a data element unit, uses the column vector representation model to convert the data stored in each column, and calculates the vector representation of each column;

a) Training of the column vector representation model

The training data of the column vector representation model are the column data stored in a standard database in which the classification of medical data elements has been completed manually and whose data structure conforms to the standard data model; these columns are referred to as standard classification columns.

There is a one-to-one correspondence between the column vertices in the standard classification medical data element graph data and the corresponding standard classification columns.

The vector representation of a column vertex in the medical data element graph data is obtained by converting the data stored in the corresponding column of the medical data element set into text data and adding [CLS] and [SEP] at the head and tail of each column's text data to mark the beginning and end of the data.

Let the column vertex set in the standard classification medical data element graph data be C = {c_{k,j}}, where c_{k,j} denotes the data in row j of the standard classification column corresponding to the k-th column vertex, and c_{k,j} = {w_t}, t = 1, 2, ..., m, where m is the total number of characters in row j and w_t are the characters that make up c_{k,j}. The initial vector representation h(w_t) of the character w_t is computed by the text representation model h, which can be a deep bidirectional language representation model based on the Transformer (the BERT model). Randomly draw n rows of data {c_{k,j}}, j = 1, 2, ..., n, under the column vertex C_k of the standard classification medical data element graph data; the vector of the j-th row of data is expressed as

Using the self-attention mechanism, the correlation among the rows of data under the column vertex C_k in the standard classification medical data element graph data is computed to obtain the column vector representation H(C_k) of the column vertex C_k. The calculation formula is:

where v(C_k) is the vector representation of the column vertex C_k, d_k is the dimension of v(C_k), and softmax is the softmax function.

To obtain a more accurate column vertex vector representation, once enough standard classification columns have been accumulated as training data, the standard classification column data can be used for further transfer learning of the column vector representation model. Taking the column as the unit, randomly mask 15% of the characters in the column data and replace the masked characters with the [MASK] label. The column vector representation model is then trained to predict the masked characters, which updates the model so that it is better suited to the task of screening valid data elements.
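The masking step can be sketched as follows. This is a minimal illustration that only prepares the masked input; the function name `mask_characters` and the seeded random generator are assumptions, and training the model on the masked characters is not shown.

```python
import random
from typing import List, Optional


def mask_characters(text: str, ratio: float = 0.15,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Replace a random `ratio` of the characters with the [MASK] token."""
    rng = rng or random.Random()
    tokens = list(text)
    n_mask = round(len(tokens) * ratio)
    # Sample distinct positions so exactly n_mask characters are masked.
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = "[MASK]"
    return tokens


masked = mask_characters("hypertension stage 2", rng=random.Random(0))
```

The function returns a token list rather than a string so that each [MASK] stays a single unit when it is passed to a tokenizer.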

b) Prediction with the column vector representation model

The prediction data of the column vector representation model are the set of medical data elements to be screened, composed of every table and column of every database in the data lake; the set is traversed column by column. To keep the large volume of column data from degrading the performance of the column vector generator, random sampling can be used when computing the column vector representation (for example, drawing 1000 values at random from a single column, repeated 100 times). The column vector representation model computes the column vector representation H_s(C_k) for the s-th sampling of the column vertex C_k. The average of the column vector representations over all S samplings is used as the final column vector representation of C_k:

H(C_k) = \frac{1}{S} \sum_{s=1}^{S} H_s(C_k)

Store H(C_k) in the vector_embeddings attribute of the column vertex C_k in the medical data element graph data model.
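The sample-and-average procedure can be sketched as follows. The `encode` argument stands in for the column vector representation model; `toy_encode` is a hypothetical placeholder used only to make the example runnable and is unrelated to the BERT-based model described in the text.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def column_embedding(values: Sequence[str],
                     encode: Callable[[Sequence[str]], Vector],
                     sample_size: int = 1000,
                     rounds: int = 100,
                     seed: int = 0) -> Vector:
    """Average `encode` over `rounds` random samples of the column."""
    rng = random.Random(seed)
    k = min(sample_size, len(values))
    total: Vector = []
    for _ in range(rounds):
        vec = encode(rng.sample(list(values), k))
        total = vec if not total else [a + b for a, b in zip(total, vec)]
    return [x / rounds for x in total]


# Toy stand-in for the representation model (hypothetical):
# mean string length and fraction of purely numeric values.
def toy_encode(rows: Sequence[str]) -> Vector:
    return [sum(len(r) for r in rows) / len(rows),
            sum(r.isdigit() for r in rows) / len(rows)]


H = column_embedding(["120", "135", "98", "n/a"], toy_encode,
                     sample_size=4, rounds=5)
```

The resulting list is what would be written to the vector_embeddings attribute of the column vertex.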

(3) Graph data association mapping

The computed column vector representation of each column in the medical data element set to be screened and the metadata collection results are associated and mapped to the corresponding vertex and edge objects of the medical data element graph data model, and are stored as the medical data element graph data to be screened, which takes the medical data element graph data model as its data standard. The mapping relationship is shown in the following table.

Table 3 Graph data association mapping table

No.	Mapped object	Object type	Metadata information
1	Database	Vertex	Name (number) of the database in the medical institution
2	Table	Vertex	Name (number) of the data table in the database
3	Column	Vertex	Name (number) of the column in the data table
4	Database-Table	Edge	Membership of the data table in the database
5	Table-Column	Edge	Containment of the column in the data table
6	Column-Column	Edge	Foreign key between database columns; lineage between columns
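The rows of Table 3 can be sketched as a mapping from collected metadata to vertex and edge records. The hash-coded `vid` follows the earlier remark that vertex ids can be hash coded; the choice of SHA-1 truncated to 12 hex characters, the record layout, and the sample database, table, and column names are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple

ColumnRef = Tuple[str, str, str]  # (database, table, column)


def vid(*parts: str) -> str:
    """Hash-coded vertex id built from the fully qualified name."""
    return hashlib.sha1(".".join(parts).encode("utf-8")).hexdigest()[:12]


def map_to_graph(columns: List[ColumnRef],
                 foreign_keys: List[Tuple[ColumnRef, ColumnRef]]):
    """Return (vertices, edges) following the rows of Table 3."""
    vertices: Dict[str, Dict[str, str]] = {}
    edges: List[Tuple[str, str, str]] = []
    for db, table, col in columns:
        d, t, c = vid(db), vid(db, table), vid(db, table, col)
        if d not in vertices:
            vertices[d] = {"label": "Database", "name": db}
        if t not in vertices:
            vertices[t] = {"label": "Table", "name": table}
            edges.append((d, t, "parent_child"))
        vertices[c] = {"label": "Column", "name": col}
        edges.append((t, c, "parent_child"))
    for src, dst in foreign_keys:
        edges.append((vid(*src), vid(*dst), "foreign_key"))
    return vertices, edges


cols = [("his", "patient", "id"), ("his", "visit", "patient_id")]
fks = [(("his", "visit", "patient_id"), ("his", "patient", "id"))]
V, E = map_to_graph(cols, fks)
```

In this example the two columns produce one database vertex, two table vertices, two column vertices, four parent-child edges, and one foreign-key edge.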

2. Rapid and automatic screening of effective medical data elements

The data lake of a medical institution stores many types of information and, compared with the data coverage of the standard data model, usually contains considerable redundancy. Before the automatic classification task is performed, the data elements in the set of medical data elements to be screened can therefore be screened to reduce the complexity of the classification task. The present invention proposes a method for quickly and automatically screening valid medical data elements, which includes the following two steps: (1) calculate the importance, within the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; (2) construct a medical data element screening model, calculate from the importance of each column vertex the possibility that the corresponding column maps to the standard data model, and screen out the valid medical data elements to form a set of medical data elements to be classified.

2.1 Calculate the importance of column vertices in the medical data element graph data model based on the column vertex vector representation

There is a one-to-one correspondence between the column vertices stored in the medical data element graph data to be screened and the columns in the medical data element set to be screened. For a column vertex C_k stored in the medical data element graph data to be screened, randomly select p column vertices {C_t}, t = 1, 2, ..., p, from the set of column vertices other than C_k, and calculate the correlation between C_k and the selected column vertices as the importance score Im(C_k) of C_k in the medical data element graph data model. Im(C_k) is defined as:

where Importance_score is the importance function.

2.2 Training and prediction of medical data element screening model

Convert the standard classification medical data element set, constructed by manual classification and association mapping according to the standard data element classification system, into standard classification medical data element graph data. Let the set of column vertices stored in the standard classification medical data element graph data be S = {s_k}, and let the set of column vertices corresponding to the columns excluded by manual screening during construction of the standard classification medical data element set be S′ = {s′_k}.

The importance function is updated with the Adam algorithm, which in turn updates the medical data element screening model.

At prediction time, the medical data element screening model judges whether the column in the set of medical data elements to be screened that corresponds to the column vertex C_k is a valid data element by calculating the threshold L′. The formula for calculating the threshold L′ is:

If L′ ≥ 0.5, the column vertex C_k is a valid column vertex and the corresponding column is a valid data element.

Finally, the valid column vertices that pass screening are associated to form the medical data element graph data to be classified, and the corresponding columns form the medical data element set to be classified.
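The screening flow of steps 2.1 and 2.2 can be sketched as follows. The formulas for Im(C_k) and L′ are not reproduced in this text, so the sketch substitutes stand-ins that are purely assumptions: mean cosine similarity for the importance function and a logistic score with hypothetical parameters `w` and `b` for L′. Only the random selection of p other column vertices and the 0.5 cut-off come from the description.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def importance(k: str, vectors: Dict[str, Vector], p: int,
               rng: random.Random) -> float:
    """Mean similarity between column vertex k and p other random vertices."""
    others = [v for v in vectors if v != k]
    picked = rng.sample(others, min(p, len(others)))
    return sum(cosine(vectors[k], vectors[t]) for t in picked) / len(picked)


def is_valid(im: float, w: float = 1.0, b: float = 0.0) -> bool:
    """Logistic score with assumed parameters w, b; keep the vertex if >= 0.5."""
    return 1.0 / (1.0 + math.exp(-(w * im + b))) >= 0.5


vecs = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
im_a = importance("a", vecs, p=2, rng=random.Random(0))
```

In a trained screening model the parameters would be fitted with the Adam algorithm on the vertex sets S and S′; here they are fixed only to keep the example short.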

3. Determine the category of medical data elements based on the deep graph matching model

3.1 Determine, from the medical data element graph data to be classified, the seed vertex set of the standard classification medical data element graph data

There is a one-to-one correspondence between the column vertices stored in the medical data element graph data to be classified and the columns in the medical data element set to be classified. Let the set of all standard classifications in the standard data element classification system defined by the standard data model be E, let the set of column vertices in the standard classification medical data element graph data be D, and let the classification of D_i ∈ D in the standard data element classification system be E_i ∈ E; let the set of column vertices stored in the medical data element graph data to be classified be C. The medical data element classification process can then be abstracted as finding in D the column vertex D_i that best matches a column vertex C_k ∈ C, which determines the classification of the column corresponding to C_k as E_i. The data classification and association mapping process in building a medical big data center can be abstracted as finding the best-matching C_k for every classification E_i of the standard data element classification system.

The data format or content of some columns in a standard database that takes the standard data model as its data standard is relatively uniform, and the format or content of the columns of the standard classification medical data element set associated with them is also relatively uniform. If the vertices in the medical data element graph data to be classified that correspond to these columns (called seed vertices) are located first, the search space of the deep graph matching model is reduced and its efficiency improves. For a column vertex D_i ∈ D, randomly select r_0 data from the column corresponding to D_i

For the column vertex C_k ∈ C in the medical data element graph data to be classified, randomly select r_0 data from the column corresponding to C_k

Then the matching degree match_1(D_i, C_k) of D_i and C_k is:

That is:
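Seed vertex selection can be sketched as follows. The formula for match_1 is not reproduced in this text, so the sketch substitutes the Jaccard overlap of the two random samples as a stand-in; that choice, and the rule that the candidate with the highest matching degree becomes the seed vertex, are assumptions, as are the names `seed_vertex` and the sample column names.

```python
import random
from typing import Dict, Sequence


def match_1(std_values: Sequence[str], cand_values: Sequence[str],
            r0: int = 100, seed: int = 0) -> float:
    """Stand-in matching degree: Jaccard overlap of r0 sampled values."""
    rng = random.Random(seed)
    a = set(rng.sample(list(std_values), min(r0, len(std_values))))
    b = set(rng.sample(list(cand_values), min(r0, len(cand_values))))
    return len(a & b) / len(a | b) if a | b else 0.0


def seed_vertex(std_values: Sequence[str],
                candidates: Dict[str, Sequence[str]],
                r0: int = 100) -> str:
    """Return the candidate column that best matches the standard column."""
    return max(candidates,
               key=lambda k: match_1(std_values, candidates[k], r0))


best = seed_vertex(["M", "F"],
                   {"sex": ["F", "M", "M"], "age": ["34", "51"]})
```

Columns with a small, uniform value domain, such as a coded gender field, are the kind of column for which this first-pass matching is intended.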

3.2 Cut subgraphs from the medical data element graph data to be classified based on the seed vertex set

Let [symbol] denote the set of column vertices in the medical data element graph data to be classified that have a parent-child association with [symbol], and let [symbol] denote the subgraph obtained by cutting the medical data element graph data to be classified at [symbol]; [symbol] is:

Search for a suitable subgraph in [symbol] such that the column vertices in the subgraph found match the column vertices in N(D_i) one to one, so that

3.3 Use the deep graph matching model to classify the column vertices in the medical data element graph data to be classified

The medical data element classification process includes the following steps:

(1) Using the graph attention mechanism, calculate the vector representation V(D_i) of the column vertex D_i in the standard classification medical data element graph data and the vector representation [symbol] of the column vertex [symbol] in the medical data element graph data to be classified. Specifically:

According to the graph attention mechanism, the vector representation V(D_i) of D_i is calculated as:

where [symbol] denotes r_1 data randomly selected from the column corresponding to the column vertex D′, and w(D′, D_i) is the weight function of a column vertex D′ in N(D_i) with respect to the column vertex D_i, calculated as:

where [symbol] is a nonlinear activation function and W_1 is a matrix parameter obtained from training.

According to the graph attention mechanism, the vector representation [symbol] of [symbol] is calculated as:

where [symbol] is the weight function of a column vertex C′ in [symbol] with respect to the column vertex [symbol], calculated as:

where [symbol] is a nonlinear activation function and W_2 is a matrix parameter obtained from training.

(2) Calculate the matching degree between every D′ ∈ N(D_i) and [symbol], and from the matching degree obtain the classification of the column vertex C′, which is the classification result of the column corresponding to C′ in the medical data element set to be classified.

The matching degree match_2(D′, C′) between the column vertex D′ of the standard classification medical data element graph data and the column vertex C′ of the medical data element graph data to be classified is:

Take the column vertex [symbol] with the highest matching degree with C′, that is:

Then the classification of the column corresponding to the column vertex C′ in the medical data element graph data to be classified is the category corresponding to [symbol] in the standard data element classification system.
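The final assignment in step (2) can be sketched as follows. The formula for match_2 is not reproduced in this text, so cosine similarity between vertex vector representations is used as a stand-in; that choice, the function names, and the sample vertex and category names are assumptions. Only the rule of taking the standard column vertex with the highest matching degree and adopting its category comes from the description.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def match_2(v_std: Vector, v_cand: Vector) -> float:
    """Stand-in matching degree: cosine similarity (assumption)."""
    dot = sum(a * b for a, b in zip(v_std, v_cand))
    norm = (math.sqrt(sum(a * a for a in v_std))
            * math.sqrt(sum(b * b for b in v_cand)))
    return dot / norm if norm else 0.0


def classify(cand_vecs: Dict[str, Vector],
             std_vecs: Dict[str, Vector],
             std_category: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """For each candidate vertex C', return (best D', category of D')."""
    result = {}
    for c, vc in cand_vecs.items():
        best = max(std_vecs, key=lambda d: match_2(std_vecs[d], vc))
        result[c] = (best, std_category[best])
    return result


std = {"D_sex": [1.0, 0.1], "D_birth": [0.1, 1.0]}
cats = {"D_sex": "gender", "D_birth": "birth date"}
out = classify({"col_xb": [0.9, 0.2], "col_csrq": [0.0, 0.8]}, std, cats)
```

In the method described above, the candidate and standard vertices compared would be limited to the cut subgraph and N(D_i), which is what keeps the search space small.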

An embodiment of the present invention also provides an automatic classification system for medical data elements based on depth map matching. The system includes:

A standardized collection and mapping module for multi-source heterogeneous data elements: defines the medical data element graph data model based on the minimum metadata information; forms the multi-source heterogeneous data elements stored in the data lake of the medical institution into a set of medical data elements to be screened, maps the set automatically to the medical data element graph data model, and stores the mapping result as the medical data element graph data to be screened. For the implementation of this module, refer to step 1 above.

A valid medical data element screening module: calculates the importance, within the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; constructs a medical data element screening model, calculates from the importance of each column vertex the possibility that the corresponding column maps to the standard data model, and screens out the valid column vertices, whose corresponding columns are valid medical data elements. The valid column vertices are associated to form the medical data element graph data to be classified, and the columns corresponding to the valid column vertices form the set of medical data elements to be classified. For the implementation of this module, refer to step 2 above.

A medical data element classification module based on the deep graph matching model: determines, from the medical data element graph data to be classified, the seed vertex set of the standard classification medical data element graph data; cuts subgraphs from the medical data element graph data to be classified based on the seed vertex set; and uses the deep graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to the column vertices. For the implementation of this module, refer to step 3 above.

The key points of the medical data element automatic classification method and system based on depth map matching proposed by the present invention are as follows:

1) Based on the minimum metadata information of the data lake in the medical institution, a medical data element graph data model is defined, so that the deep graph matching model also works for local data swamps with extremely little metadata. This achieves automatic classification of data elements with the least metadata information while ensuring that the graph-structured data collected under the graph data model standard are suitable for training the deep graph matching model.

2) Calculate the vector representation of medical data elements based on the representation learning method, and quickly and automatically screen effective data elements that may be mapped to standard data models through the classification of vector representations.

3) Calculate the vector representation of column vertices based on the graph attention mechanism, and build a deep graph matching model to complete the automatic classification of medical data elements.

The above descriptions are only preferred implementations of the present invention. Although the present invention has been disclosed above through preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modifications, equivalent changes, and refinements made to the above embodiments according to the technical essence of the present invention, without departing from its technical solution, still fall within the protection scope of the technical solution of the present invention.

Claims

A method for automatic classification of medical data elements based on depth map matching, characterized in that it includes:

(1) Define a medical data element graph data model based on the minimum metadata information; form the multi-source heterogeneous data elements stored in the data lake of the medical institution into a set of medical data elements to be screened, map the set automatically to the medical data element graph data model, and store the mapping result as the medical data element graph data to be screened; the medical data element graph data model is modeled as a directed property graph composed of two kinds of graph elements: vertices and edges;

The vertex is composed of a label and an attribute group corresponding to the label; the label represents the type of the vertex, and the attribute group represents one or more attributes owned by the label; the ontology information of the vertex includes the vertex types and the attribute information corresponding to each type of vertex; the vertex types include database vertex, table vertex, and column vertex; the attribute information corresponding to the database vertex includes a database vertex index and database type information, the attribute information corresponding to the table vertex includes a table vertex index, and the attribute information corresponding to the column vertex includes a column vertex index, column data type information, and a column vector representation;

The edge is composed of an edge type and an edge attribute, and each edge is a directed edge; the ontology information of the edge includes the edge types and the attribute information corresponding to each type of edge; the edge types include a parent-child association whose start point is a database vertex and whose end point is a table vertex, a parent-child association whose start point is a table vertex and whose end point is a column vertex, and a foreign key whose start point and end point are both column vertices; the attribute information corresponding to each of the three edge types is an edge index;

(2) Calculate the importance of each column vertex stored in the medical data element graph data to be screened in the medical data element graph data model; construct a medical data element screening model, and calculate the column corresponding to each column vertex based on the importance of each column vertex The possibility of mapping to the standard data model, screening out the effective column vertices, the medical data element graph data to be classified is formed by the association of the effective column vertices, and the column set corresponding to the effective column vertices forms the medical data element set to be classified;

(3) Determine, from the medical data element graph data to be classified, the seed vertex set of the standard classification medical data element graph data; cut subgraphs from the medical data element graph data to be classified based on the seed vertex set; use the depth map matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to the column vertices.

The method according to claim 1, wherein the mapping of the multi-source heterogeneous data elements to the medical data element graph data model includes:

Collect heterogeneous medical data from multiple sources from the data lake to form a collection of medical data elements to be screened;

Use the metadata collection tool to capture the metadata stored in the data lake;

Use the column vector generator to traverse the data stored in each column of each table in the medical data element set to be screened, and use the column vector representation model to predict and obtain the column vector representation of each column of each table;

Through graph data association mapping, the collected metadata and the generated column vector representations are associated and mapped to the medical data element graph data model to obtain the medical data element graph data to be screened.

The method according to claim 2, wherein the column vector generator uses a single column in the data table as a data element unit, uses the column vector representation model to convert the data stored in each column, and calculates the vector representation of each column;

The training of the column vector representation model includes: the training data of the column vector representation model are the column data stored in a standard database in which the classification of medical data elements has been completed manually and whose data structure conforms to the standard data model, recorded as standard classification columns; there is a one-to-one correspondence between the column vertices in the standard classification medical data element graph data and the corresponding standard classification columns;

Let the column vertex set in the standard classification medical data element graph data be C = {c_{k,j}}, where c_{k,j} denotes the data in row j of the standard classification column corresponding to the k-th column vertex, and c_{k,j} = {w_t}, t = 1, 2, ..., m, where m is the total number of characters in row j and w_t are the characters that make up c_{k,j}; the initial vector representation h(w_t) of the character w_t is computed by the text representation model h; randomly draw n rows of data {c_{k,j}}, j = 1, 2, ..., n, under the column vertex C_k of the standard classification medical data element graph data; the vector of the j-th row of data is expressed as

Using the self-attention mechanism, the correlation among the rows of data under the column vertex C_k in the standard classification medical data element graph data is computed to obtain the column vector representation H(C_k) of the column vertex C_k. The calculation formula is:

where v(C_k) is the vector representation of the column vertex C_k, d_k is the dimension of v(C_k), and softmax is the softmax function;

The prediction of the column vector representation model includes: the prediction data of the column vector representation model is the set of medical data elements to be screened, composed of every table and column of every database in the data lake; this set is traversed with the column as the traversal unit; the column vector representation model is used to calculate a column vector representation for each random sampling of a column vertex; and the average of the column vector representations predicted over multiple random samplings is taken as the final column vector representation of the column vertex.
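The training and prediction steps above can be sketched as follows. This is a minimal illustration, not the claimed model: it assumes standard scaled dot-product self-attention in which the row vectors serve as queries, keys and values, and it assumes the attended rows are mean-pooled into one column vector. The function names `attention_column_vector` and `predict_column_vector` and the toy row vectors are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_column_vector(row_vectors):
    """Scaled dot-product self-attention over row vectors, then mean pooling.

    Assumption: queries, keys and values are the row vectors themselves,
    and the attended rows are averaged into a single column vector.
    """
    n = len(row_vectors)
    d_k = len(row_vectors[0])
    scale = math.sqrt(d_k)
    attended = []
    for query in row_vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in row_vectors]
        weights = softmax(scores)
        attended.append([sum(w * value[i] for w, value in zip(weights, row_vectors))
                         for i in range(d_k)])
    return [sum(row[i] for row in attended) / n for i in range(d_k)]


def predict_column_vector(row_vectors, n_rows, n_rounds, rng):
    """Average the column vectors obtained from several random samplings."""
    d_k = len(row_vectors[0])
    total = [0.0] * d_k
    for _ in range(n_rounds):
        sample = rng.sample(row_vectors, min(n_rows, len(row_vectors)))
        vector = attention_column_vector(sample)
        total = [t + v for t, v in zip(total, vector)]
    return [t / n_rounds for t in total]


# Hypothetical row vectors; in the method they would come from the text model h.
rows = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.8, 0.0]]
column_vector = predict_column_vector(rows, n_rows=3, n_rounds=5,
                                      rng=random.Random(0))
```

Averaging over several samplings reduces the dependence of the final representation on which rows happen to be drawn in any single round.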
The method according to claim 3, wherein the calculation of the importance of each column vertex stored in the medical data element graph data to be screened in the medical data element graph data model includes:

For a column vertex $C_k$ stored in the medical data element graph data to be screened, $p$ column vertices $\{C_t\}_{t=1,2,\ldots,p}$ are randomly drawn from the set of column vertices other than $C_k$, and the correlation between $C_k$ and the drawn column vertices is calculated as the importance score $Im(C_k)$ of $C_k$ in the medical data element graph data model; $Im(C_k)$ is defined as:

where Importance_score is an importance function.
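The sampling step of the importance calculation can be illustrated with a short sketch. The text does not specify the form of Importance_score, so the version below assumes it is the mean cosine similarity between the target column vector and the drawn column vectors; the function name `importance` and the column names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def importance(target, column_vectors, p, rng):
    """Assumed importance score: mean cosine similarity to p random other columns."""
    others = [name for name in column_vectors if name != target]
    drawn = rng.sample(others, min(p, len(others)))
    if not drawn:
        return 0.0
    sims = [cosine(column_vectors[target], column_vectors[name]) for name in drawn]
    return sum(sims) / len(sims)


# Hypothetical column vector representations keyed by column name.
vectors = {
    "patient_id": [1.0, 0.1],
    "visit_id": [0.9, 0.2],
    "free_text_note": [0.0, 1.0],
}
score = importance("patient_id", vectors, p=2, rng=random.Random(0))
```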
The method according to claim 1, wherein the training and prediction of the medical data element screening model are specifically:

The standard classification medical data element set, constructed according to the standard data element classification system through manual classification and association mapping, is converted into standard classification medical data element graph data; let the set of column vertices stored in the standard classification medical data element graph data be $S=\{s_k\}$, and let the set of column vertices corresponding to the columns excluded by manual screening during construction of the standard classification medical data element set be $S'=\{s'_k\}$;

During training, $q$ column vertices are randomly drawn from the set $S$ as the positive sample set $\{s_t\}_{t=1,2,\ldots,q}$, and $q$ column vertices are randomly drawn from the set $S'$ as the negative sample set $\{s'_t\}_{t=1,2,\ldots,q}$; let the importance score of the sample $(s_i, y_i)$ be $Im(s_i)$, where $s_i$ denotes the $i$-th column vertex and $y_i \in \{0, 1\}$ denotes the true category of the sample; the loss function Loss of the medical data element screening model is then calculated from the importance scores:

During prediction, the medical data element screening model judges whether the column in the set of medical data elements to be screened that corresponds to the column vertex $C_k$ is a valid data element by calculating the threshold $L'$, whose calculation formula is:

If $L' \geq 0.5$, the column vertex $C_k$ is a valid column vertex and the corresponding column is a valid data element;

The medical data element graph data to be classified is formed by associating the set of valid column vertices that pass screening, and the corresponding set of screened columns forms the set of medical data elements to be classified.
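A rough sketch of the screening step follows. The loss and threshold formulas are not available in this text, so the sketch substitutes a one-feature logistic model over the importance score trained with binary cross-entropy; this is an assumption chosen only because it yields a score compared against 0.5, as in the claim. All names (`train`, `is_valid`, the sample scores) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(samples, w, b):
    """Mean binary cross-entropy over (importance, label) pairs."""
    eps = 1e-12
    total = 0.0
    for im, y in samples:
        prob = sigmoid(w * im + b)
        total -= y * math.log(prob + eps) + (1 - y) * math.log(1.0 - prob + eps)
    return total / len(samples)


def train(samples, lr=0.5, epochs=2000):
    """Plain gradient descent on the assumed logistic screening model."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for im, y in samples:
            err = sigmoid(w * im + b) - y
            grad_w += err * im
            grad_b += err
        w -= lr * grad_w / len(samples)
        b -= lr * grad_b / len(samples)
    return w, b


def is_valid(im, w, b):
    """A column vertex is kept when the assumed validity score is at least 0.5."""
    return sigmoid(w * im + b) >= 0.5


# Hypothetical importance scores: positives drawn from S, negatives from S'.
positives = [(0.8, 1), (0.9, 1), (0.7, 1)]
negatives = [(0.1, 0), (0.2, 0), (0.3, 0)]
samples = positives + negatives
w, b = train(samples)
```

Drawing the same number q of positive and negative samples, as the claim describes, keeps the two classes balanced during training.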
The method according to claim 1, wherein said determining the seed vertex set of the standard classification medical data element graph data from the medical data element graph data to be classified comprises:

Let the set of all standard classifications in the standard data element classification system defined by the standard data model be $E$, let the set of column vertices in the standard classification medical data element graph data be $D$, and let the classification of $D_i \in D$ in the standard data element classification system be $E_i \in E$; let the set of column vertices stored in the medical data element graph data to be classified be $C$; the medical data element classification process is abstracted as finding, in $D$, the column vertex $D_i$ with the highest matching degree to a column vertex $C_k \in C$, thereby determining that the classification of the column corresponding to the column vertex $C_k$ is $E_i$;

For a column vertex $D_i \in D$, $r_0$ data items are randomly sampled from the column corresponding to $D_i$;
For a column vertex $C_k \in C$, $r_0$ data items are randomly sampled from the column corresponding to $C_k$;
The matching degree $\mathrm{match\_1}(D_i, C_k)$ of $D_i$ and $C_k$ is then:

where $v(x)$ denotes the vector representation of data $x$; the seed vertex corresponding to $D_i$ is then the column vertex in $C$ with the highest matching degree to $D_i$,
that is, the $C_k \in C$ that maximizes $\mathrm{match\_1}(D_i, C_k)$.
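Seed vertex selection can be sketched as an argmax over candidate column vertices. The exact form of match_1 is not available in this text; the sketch assumes it is the mean pairwise cosine similarity between the sampled data vectors of the two columns. The function names and sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_1(d_samples, c_samples):
    """Assumed matching degree: mean pairwise cosine similarity of sample vectors."""
    pairs = [(d, c) for d in d_samples for c in c_samples]
    return sum(cosine(d, c) for d, c in pairs) / len(pairs)


def seed_vertex(d_samples, candidates):
    """Return the candidate column vertex with the highest match_1 to D_i."""
    return max(candidates, key=lambda name: match_1(d_samples, candidates[name]))


# Hypothetical vectors v(x) for r_0 = 2 sampled data items per column.
standard_column = [[1.0, 0.0], [0.9, 0.1]]
to_classify = {
    "c1": [[0.0, 1.0], [0.1, 0.9]],
    "c2": [[1.0, 0.0], [1.0, 0.1]],
}
seed = seed_vertex(standard_column, to_classify)
```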
The method according to claim 6, wherein the subgraph cutting of the medical data element graph data to be classified based on the seed vertex set includes:

The set of column vertices in the medical data element graph data to be classified that have a parent-child relationship with the seed vertex, and the set of column vertices in the medical data element graph data to be classified that have a foreign key relationship with the seed vertex, are determined; the subgraph obtained by cutting based on the seed vertex is:

Let $N(D_i)$ denote the set of column vertices associated with the same parent vertex as $D_i$ in the standard classification medical data element graph data; the goal of the depth graph matching model is then to search the cut subgraph for a subgraph whose column vertices match the column vertices in $N(D_i)$ one to one, thereby classifying the medical data elements corresponding to the column vertices in the cut subgraph.
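The cutting step can be illustrated as follows. The cutting formula is not available in this text, so the sketch makes two assumptions: the parent-child set is read as the columns sharing the seed vertex's parent table, and the cut subgraph is the union of the seed vertex, that set, and the columns linked to the seed by a foreign key edge in either direction. The function name and column names are hypothetical.

```python
def cut_subgraph(seed, parent_table, foreign_keys):
    """Return the column vertices kept around a seed vertex.

    parent_table maps each column to its table (table-to-column edges);
    foreign_keys is a list of (start_column, end_column) directed edges.
    """
    siblings = {col for col, table in parent_table.items()
                if table == parent_table[seed] and col != seed}
    linked = {end for start, end in foreign_keys if start == seed}
    linked |= {start for start, end in foreign_keys if end == seed}
    return {seed} | siblings | linked


# Hypothetical graph data to be classified.
parent_table = {
    "visit.patient_id": "visit",
    "visit.date": "visit",
    "patient.id": "patient",
    "patient.name": "patient",
    "lab.value": "lab",
}
foreign_keys = [("visit.patient_id", "patient.id")]
subgraph = cut_subgraph("visit.patient_id", parent_table, foreign_keys)
```

Restricting the search to this neighbourhood keeps the later matching step from comparing every standard column vertex against every column in the data lake.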
The method according to claim 7, wherein the use of the depth map matching model to complete the classification of column vertices in the medical data element graph data to be classified includes:

According to the graph attention mechanism, the vector representation $V(D_i)$ of the column vertex $D_i$ in the standard classification medical data element graph data is calculated as:

where $r_1$ data items are randomly sampled from the column corresponding to the column vertex $D'$, and $w(D', D_i)$ denotes the weight function of a column vertex $D'$ in $N(D_i)$ with respect to the column vertex $D_i$;

According to the graph attention mechanism, the vector representation of a column vertex in the medical data element graph data to be classified is calculated as:

where $r_1$ data items are randomly sampled from the column corresponding to the column vertex $C'$, and the weight function denotes the weight of a column vertex $C'$ with respect to that column vertex;

The matching degree $\mathrm{match\_2}(D', C')$ between a column vertex $D' \in N(D_i)$ and a column vertex $C'$ of the cut subgraph is:

The column vertex in $N(D_i)$ with the highest matching degree to $C'$ is taken, that is, the $D'$ that maximizes $\mathrm{match\_2}(D', C')$;

The classification of the column corresponding to the column vertex $C'$ in the medical data element graph data to be classified is the category of that best-matching column vertex in the standard data element classification system.
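The matching step can be sketched end to end under stated assumptions. The weight function and the match_2 formula are not available in this text, so the sketch assumes dot-product attention weights normalized by softmax over each vertex's neighbourhood (including the vertex itself) and uses cosine similarity as the matching degree. The per-vertex input vectors stand in for the averaged vectors of the sampled data; all names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attend(target, neighbours):
    """Aggregate neighbour vectors, weighted by softmax of dot products with target."""
    scores = [sum(t * n for t, n in zip(target, vec)) for vec in neighbours]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, neighbours))
            for i in range(len(target))]


def classify(subgraph_vectors, standard_vectors, standard_labels):
    """Assign each subgraph column the label of its best-matching standard column."""
    std_repr = {name: attend(vec, list(standard_vectors.values()))
                for name, vec in standard_vectors.items()}
    sub_repr = {name: attend(vec, list(subgraph_vectors.values()))
                for name, vec in subgraph_vectors.items()}
    result = {}
    for name, vec in sub_repr.items():
        best = max(std_repr, key=lambda d: cosine(vec, std_repr[d]))
        result[name] = standard_labels[best]
    return result


# Hypothetical vectors for N(D_i) and for the cut subgraph.
standard_vectors = {"D_name": [4.0, 0.0], "D_age": [0.0, 4.0]}
standard_labels = {"D_name": "patient name", "D_age": "patient age"}
subgraph_vectors = {"col_a": [0.0, 3.9], "col_b": [4.1, 0.0]}
labels = classify(subgraph_vectors, standard_vectors, standard_labels)
```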
An automatic classification system for medical data elements based on depth map matching, characterized in that it includes:

Standardized acquisition and mapping module for multi-source heterogeneous data elements: defines the medical data element graph data model based on minimum metadata information; combines the multi-source heterogeneous data elements stored in the data lake of a medical institution into a set of medical data elements to be screened, automatically maps the set to the medical data element graph data model, and stores the mapping result as medical data element graph data to be screened; the medical data element graph data model is modeled as a directed property graph consisting of two types of graph elements, vertices and edges;

Each vertex is composed of a label and an attribute group corresponding to the label; the label represents the type of the vertex, and the attribute group represents one or more attributes owned by the label; the ontology information of the vertices includes the vertex types and the attribute information corresponding to each vertex type; the vertex types include database vertex, table vertex and column vertex; the attribute information corresponding to a database vertex includes the database vertex index and database type information, the attribute information corresponding to a table vertex includes the table vertex index, and the attribute information corresponding to a column vertex includes the column vertex index, column data type information and column vector representation;

Each edge is composed of an edge type and an edge attribute, and every edge is a directed edge; the ontology information of the edges includes the edge types and the attribute information corresponding to each edge type; the edge types include the parent-child association whose starting point is a database vertex and whose end point is a table vertex, the parent-child association whose starting point is a table vertex and whose end point is a column vertex, and the foreign key association whose starting point and end point are both column vertices; the attribute information corresponding to each of the three edge types is an edge index;

Effective medical data element screening module: calculates the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; builds a medical data element screening model and, based on the importance of each column vertex, calculates the possibility that the column corresponding to that column vertex is mapped to the standard data model; screens out the valid column vertices, whose corresponding columns are valid medical data elements; the set of valid column vertices forms the medical data element graph data to be classified, and the set of columns corresponding to the valid column vertices forms the set of medical data elements to be classified;

Medical data element classification module based on the depth graph matching model: determines the seed vertex set of the standard classification medical data element graph data from the medical data element graph data to be classified; performs subgraph cutting of the medical data element graph data to be classified based on the seed vertex set; and uses the depth graph matching model to complete the classification of the column vertices in the medical data element graph data to be classified, so as to obtain the classification of the medical data elements corresponding to the column vertices.
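The medical data element graph data model defined in the acquisition and mapping module can be sketched as a small directed property graph. The vertex types, their attributes and the three edge types follow the text; the class names, attribute keys (`index`, `database_type`, `data_type`, `vector`) and edge type identifiers are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Vertex types and the attribute names each type must carry.
VERTEX_ATTRIBUTES = {
    "database": {"index", "database_type"},
    "table": {"index"},
    "column": {"index", "data_type", "vector"},
}

# Edge types and the (start, end) vertex types each one requires.
EDGE_ENDPOINTS = {
    "database_table": ("database", "table"),
    "table_column": ("table", "column"),
    "foreign_key": ("column", "column"),
}


@dataclass
class Vertex:
    label: str
    attributes: dict

    def __post_init__(self):
        expected = VERTEX_ATTRIBUTES.get(self.label)
        if expected is None or set(self.attributes) != expected:
            raise ValueError(f"invalid vertex: {self.label}")


@dataclass
class Edge:
    index: int
    edge_type: str
    start: Vertex
    end: Vertex

    def __post_init__(self):
        required = EDGE_ENDPOINTS.get(self.edge_type)
        if required != (self.start.label, self.end.label):
            raise ValueError(f"invalid edge: {self.edge_type}")


# Hypothetical example: one database, one table, one column.
db = Vertex("database", {"index": 0, "database_type": "mysql"})
table = Vertex("table", {"index": 0})
column = Vertex("column", {"index": 0, "data_type": "varchar", "vector": [0.1, 0.2]})
edges = [Edge(0, "database_table", db, table), Edge(1, "table_column", table, column)]
```

Validating endpoints at construction time means that a foreign key edge can only ever connect two column vertices, matching the edge ontology in the claim.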