WO2023124191A1 - Depth map matching-based automatic classification method and system for medical data elements - Google Patents

Info

Publication number
WO2023124191A1
Authority
WO
WIPO (PCT)
Prior art keywords
column
data
vertex
medical data
data element
Prior art date
Application number
PCT/CN2022/116971
Other languages
French (fr)
Chinese (zh)
Inventor
李劲松
辛然
杨宗峰
周天舒
田雨
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to JP2023536557A (patent JP7432801B2)
Publication of WO2023124191A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G06F16/93 Document management systems

Definitions

  • the invention belongs to the field of regional medical big data centers and data production platforms, and in particular relates to an automatic classification method and system for medical data elements based on depth map matching.
  • the data discovery, classification and data association mapping tasks in the development process of the medical big data center can be abstracted as the screening and classification tasks of medical data elements and the association mapping tasks of classified medical data elements.
  • the platform development plan designers define the standard data element classification system and the corresponding data interface specifications based on the standard data model.
  • developers filter and determine the data elements that match the data interface specification through rule search and manual search. This process is called data discovery.
  • The data discovery process determines which data elements in the medical institution's data lake should be included. Collection: developers develop data interfaces based on the results of data discovery and complete the data collection work.
  • developers classify the multi-source and heterogeneous data elements in the data lake of medical institutions according to the standard data element classification system, integrate and map them to the standard data element classification system.
  • Medical data includes diagnosis and treatment data generated during patient diagnosis and treatment and observation data during the operation of medical institutions, with various sources and complex relationships.
  • historical data sleeps in the data lake of the medical institution without effective management, forming a local data swamp.
  • the construction of a medical big data center requires the integration of these historical data to complete the transformation from a data swamp to a data lake. Due to the frequent turnover of relevant personnel in the information department of medical institutions and information system providers, the loss of historical system documents occurs from time to time.
  • The present invention uses a deep graph matching algorithm based on a graph neural network to improve on data element classification methods that rely on manual processing. It minimizes the dependence on information system documentation and requires only a small amount of metadata information from the data lake of the medical institution.
  • With this metadata, the method screens valid data elements quickly based on the text semantics of the medical data, which automates data discovery in the data lake of the medical institution; classifies medical data elements quickly with the deep graph matching algorithm; and automatically classifies and maps the data elements in the data lake of the medical institution to the standard data element classification system.
  • This greatly improves the efficiency of data interface development during the development of a medical big data center.
  • the data element classification method provided by the present invention has good scalability, and can be applied to the processing of various data swamp to data lake transformation problems.
  • One aspect of the present invention discloses a method for automatic classification of medical data elements based on depth map matching, the method includes the following steps:
  • (1) Define a medical data element graph data model based on minimal metadata information; the multi-source heterogeneous data elements stored in the data lake of the medical institution form a set of medical data elements to be screened, which is automatically mapped to the medical data element graph data model, and the mapping result is stored as the medical data element graph data to be screened;
  • the medical data element graph data model is modeled using a directed attribute graph, and the graph is composed of two graph elements: vertices and edges;
  • the vertex is composed of a label and an attribute group corresponding to the label.
  • the label represents the type of the vertex, and the attribute group represents one or more attributes owned by the label;
  • the ontology information of a vertex includes the vertex type and the attribute information corresponding to each vertex type;
  • the vertex types are database vertex, table vertex, and column vertex;
  • the attribute information of a database vertex includes the database vertex index and the database type information;
  • the attribute information of a table vertex includes the table vertex index;
  • the attribute information of a column vertex includes the column vertex index, the column data type information, and the column vector representation;
  • the edge is composed of an edge type and an edge attribute, and each edge is a directed edge; the ontology information of an edge includes the edge type and the attribute information corresponding to each edge type. The edge types are: a parent-child association whose start point is a database vertex and whose end point is a table vertex; a parent-child association whose start point is a table vertex and whose end point is a column vertex; and a foreign key association whose start point and end point are both column vertices.
  • the attribute information of each of the three edge types is the edge index.
  • mapping of the multi-source heterogeneous data elements to the medical data element graph data model includes:
  • the collected metadata and the generated column vector representation are mapped to the medical data element graph data model to obtain the medical data element graph data to be screened.
  • the column vector generator uses a single column in the data table as a data element unit, uses the column vector representation model to convert the data stored in each column, and calculates the vector representation of each column;
  • the training of the column vector representation model includes: the training data of the model is the column data stored in the standard database, for which the medical data element classification has been completed manually and whose data structure conforms to the standard data model; these columns are recorded as standard classification columns. The column vertices in the standard classified medical data element graph data correspond one-to-one to the standard classification columns;
  • according to the self-attention mechanism, the correlation between the rows of data under column vertex C_k in the standard classified medical data element graph data is computed and used to obtain the column vector representation, where:
  • v(C_k) is the vector representation of column vertex C_k
  • d_k is the dimension of v(C_k)
  • softmax is the softmax function
  • the prediction of the column vector representation model includes: the prediction data of the model is the set of medical data elements to be screened, composed of every table and column of every database in the data lake; the set is traversed column by column; the column vector representation model computes a column vector representation for each random sample of a column vertex; and the average of the representations over the multiple random samples is taken as the final column vector representation of the column vertex.
  • the calculation of the importance of each column vertex stored in the medical data element graph data to be screened in the medical data element graph data model includes:
  • Importance_score is an importance function.
  • the medical data element screening model judges whether the column in the medical data element set to be screened corresponding to the column vertex C k is a valid data element by calculating the threshold L', and the calculation formula of the threshold L' is:
  • the medical data element graph data to be classified is formed by association of the filtered effective column vertex sets, and the corresponding filtered column sets form the medical data element set to be classified.
  • determining, from the medical data element graph data to be classified, the seed vertex set of the standard classified medical data element graph data includes:
  • cutting subgraphs out of the medical data element graph data to be classified based on the seed vertex set includes:
  • let N(D_i) denote the set of column vertices associated with the same parent vertex as D_i in the standard classified medical data element graph data
  • the goal of the deep graph matching model is to search the medical data element graph data to be classified for a subgraph whose column vertices match the column vertices in N(D_i) one to one, thereby classifying the medical data elements corresponding to the column vertices of that graph data.
  • the use of the depth map matching model to complete the classification of column vertices in the medical data element graph data to be classified includes:
  • the vector representation V(D_i) of column vertex D_i in the standard classified medical data element graph data is calculated as:
  • the classification of the column corresponding to column vertex C′ in the medical data element graph data to be classified is the category, in the standard data element classification system, of the standard column vertex that C′ is matched to.
  • Another aspect of the present invention discloses a medical data element automatic classification system based on depth map matching, the system includes:
  • Standardized collection and mapping module for multi-source heterogeneous data elements: defines the medical data element graph data model based on minimal metadata information; forms the set of medical data elements to be screened from the multi-source heterogeneous data elements stored in the data lake of the medical institution; automatically maps this set to the medical data element graph data model; and stores the mapping result as the medical data element graph data to be screened;
  • Valid medical data element screening module: calculates the importance, within the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; builds a medical data element screening model that, based on the importance of each column vertex, calculates the likelihood that the corresponding column maps to the standard data model; and selects the valid column vertices. The columns corresponding to the valid column vertices are valid medical data elements; the set of valid column vertices forms the medical data element graph data to be classified, and the corresponding set of columns forms the set of medical data elements to be classified;
  • Medical data element classification module based on the deep graph matching model: determines, from the medical data element graph data to be classified, the seed vertex set of the standard classified medical data element graph data; cuts subgraphs out of the medical data element graph data to be classified based on the seed vertex set; and uses the deep graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to the column vertices.
  • The present invention uses only the minimal metadata information stored in the data lake of the medical institution, and uses the medical data element graph data model to achieve standardized collection of the medical data elements in the medical institution and to make full use of the relationship information between the medical data elements to be screened and classified.
  • The method of the present invention reduces the dependence of the data discovery, classification, and association mapping process on the historical documents of the medical institution's information systems; missing or erroneous historical documents have little influence on the classification results of the medical data elements.
  • The method of the present invention greatly reduces manual intervention in the process of data discovery, classification, and association mapping, and classifies the medical data elements with an artificial intelligence algorithm. It provides a heuristic solution to the difficult problem of automatically classifying medical data elements under the requirements of real-time updating, dynamic aggregation, and in-depth use of the data in a medical big data center.
  • Fig. 1 is the overall flowchart of the method of the present invention
  • Fig. 2 is the flowchart of traditional medical data element classification method
  • FIG. 3 is a schematic diagram of the implementation process of the automatic classification method for medical data elements based on depth map matching provided by the present invention
  • Fig. 4 is an example of medical data element diagram data model
  • Fig. 5 is a schematic diagram of the mapping of multi-source heterogeneous data elements to the medical data element graph data model.
  • Metadata: data that describes other data, that is, data about data. The term does not always refer to a single item; it can be understood as a group of information used to describe data. When every item in such a group describes or reflects some aspect of a given piece of data, the group can be called metadata. Metadata can describe the elements or properties of data (name, size, data type, etc.), its structure (length, fields, data columns), or related information (where it is located, how it is associated, who owns it). Metadata is ubiquitous in everyday life: wherever there is a class of things, a set of metadata can be defined.
  • Data element: the basic unit of data.
  • the basic data elements of health information standardize and define the unique Chinese names and codes of all relevant information in the field of medicine and health, and the codes are expressed in letters, Chinese characters, and digital strings.
  • a data element enumerates and defines an information resource in a specific semantic environment.
  • A complete data element name consists of: object class term + feature class term + representation class term (+ qualifier class term).
  • Metadata cannot possibly cover all the information necessary to understand the data that a data element is intended to represent.
  • Information about data elements is an integral part of any (organizational) metadata.
  • Each element of metadata is a data element, and metadata attributes and description methods conforming to data element standards are used to describe metadata.
  • Storing and codifying metadata in a repository requires modeling, which requires obtaining metadata from a registry of data elements or from a repository.
  • Metadata is, in this sense, a data element expressed in a consistent and standard way.
  • Both metadata and data element dictionary formats are composed of attributes such as line number, Chinese name, English name, identifier (phrase), definition, constraint/condition, maximum number of occurrences, data type, and data value range. The difference is that there are other attributes such as context and synonym name in the data element dictionary format.
  • a data lake is a method of storing data in a natural format in a system or repository, which facilitates the configuration of data in various schema and structural forms, usually object blocks or files.
  • the main idea of a data lake is the unified storage of all data in an enterprise, from raw data (an exact copy of source system data) to target data for various tasks such as reporting, visualization, analysis, and machine learning.
  • the entire HDFS is generally called a data warehouse (in a broad sense), that is, the place where all data is stored, while in foreign countries it is generally called a data lake.
  • Data swamp: when a data lake is left unmanaged, a data swamp forms. It is easy to build a data lake, but difficult to make it useful.
  • the data lake just pours data into it all the time, and there are very few application scenarios, with no output or very little output, forming a one-way lake.
  • Most enterprises that use data lakes often fail to use the data because the quality of the data in the data lake is too poor when the data really needs to be used.
  • Graph neural networks: in the past few years, the rise and application of neural networks has advanced research in pattern recognition and data mining. Many machine learning tasks (such as object detection, machine translation, and speech recognition) that once relied heavily on manually extracted features have been transformed by end-to-end deep learning paradigms. Traditional deep learning methods extract features from Euclidean data with great success, but the data in many practical applications comes from non-Euclidean spaces, where the performance of these methods is still unsatisfactory. Each data sample (node) in a graph has edges that relate it to other data samples in the graph, and this information can be used to capture the interdependencies between instances.
  • A graph neural network is a neural network applied to graph-structured (non-Euclidean) data.
  • Graph matching is a classic problem in artificial intelligence with important applications in several fields, such as matching 2D/3D shapes in computer vision, matching protein networks in bioinformatics, and matching users across different social networks. Deep graph matching is an approach that uses graph neural networks to solve the graph matching problem.
  • the present invention provides a kind of automatic classification method of medical data element based on depth map matching, and this method comprises the following steps:
  • Standardized collection and mapping of multi-source heterogeneous data elements including:
  • Fig. 2 is a flowchart of traditional medical data element classification method. The implementation process of each part of the method of the present invention will be described in detail below with reference to FIG. 3 .
  • The data of medical institutions is aggregated to form a data lake.
  • The data in the data lake is multi-source and heterogeneous; it includes observation data from the diagnosis and treatment process and from the operation of medical institutions.
  • The observation databases differ in purpose and design.
  • the electronic medical records formed during the diagnosis and treatment process are designed to support clinical practice, while the operating data of medical institutions are constructed for in-hospital management and medical insurance reimbursement processes. Each is collected for a different purpose, resulting in data having a different logical organization and physical format.
  • the data model is a tool used to abstract the real world in database design. By establishing a standard and unified data model and defining data structure, data operation, and data constraints, it can effectively ensure the quality of collected data and the controllability of data representation standards, as shown in Fig.
  • The graph data model is a data model developed on the basis of a graph database.
  • Based on the minimal metadata information of the databases in the data lake, the present invention defines a medical data element graph data model, which provides a heuristic solution for the automatic classification of medical data elements during the establishment of a medical big data center.
  • The graph data model is modeled as a directed attribute graph, which consists of two kinds of graph elements: vertices (Vertex) and edges (Edge).
  • the vertex is composed of a label and an attribute group corresponding to the label.
  • the label represents the type of the vertex, and the attribute group represents one or more attributes owned by the label.
  • Vertex ontology information includes vertex types and attribute information corresponding to each type of vertex.
  • the ontology information of the vertex of the medical data element graph data model defined by the present invention is shown in the following table:
  • Table 1 The ontology information table of the vertices of the medical data element graph data model
  • vid is the unique index id of each vertex in the graph, which can be hash coded uniformly.
  • vector_embeddings is the column vector representation predicted by the column vector representation model.
  • an edge is composed of an edge type and an edge attribute, and each edge is a directed edge, and a directed edge indicates an association relationship between one vertex (start point src) and another vertex (end point dst).
  • Edge ontology information includes edge types and attribute information corresponding to each type of edge.
  • the ontology information of the edge of the medical data element graph data model defined by the present invention is shown in the following table:
  • Table 2 The ontology information table of the edge of the medical data element graph data model
  • Figure 4 is an example of a medical data element graph data model.
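The directed attribute graph described above can be sketched with a few Python dataclasses. This is only an illustration under stated assumptions: the contents of Tables 1 and 2 are not reproduced in this text, so the field names other than vid and vector_embeddings, the helper make_vid, the SHA-1 based hashing scheme, and the sample database, table, and column names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from hashlib import sha1
from typing import List, Optional


class VertexType(Enum):
    DATABASE = "database"
    TABLE = "table"
    COLUMN = "column"


class EdgeType(Enum):
    DATABASE_TABLE = "database_table"  # parent-child: database -> table
    TABLE_COLUMN = "table_column"      # parent-child: table -> column
    FOREIGN_KEY = "foreign_key"        # column -> column


def make_vid(qualified_name: str) -> str:
    """Hash a qualified name into a fixed-length id (assumed hashing scheme)."""
    return sha1(qualified_name.encode("utf-8")).hexdigest()[:16]


@dataclass
class Vertex:
    vid: str                                  # unique index of the vertex
    label: VertexType                         # vertex type
    database_type: Optional[str] = None       # database vertices only
    data_type: Optional[str] = None           # column vertices only
    vector_embeddings: List[float] = field(default_factory=list)  # column vertices only


@dataclass
class Edge:
    eid: str             # edge index
    edge_type: EdgeType
    src: str             # vid of the start vertex
    dst: str             # vid of the end vertex


# Hypothetical example: one database, one table, one column.
db = Vertex(make_vid("his_db"), VertexType.DATABASE, database_type="mysql")
table = Vertex(make_vid("his_db.patient"), VertexType.TABLE)
column = Vertex(make_vid("his_db.patient.birth_date"), VertexType.COLUMN, data_type="date")
edges = [
    Edge(make_vid("edge:db->table"), EdgeType.DATABASE_TABLE, db.vid, table.vid),
    Edge(make_vid("edge:table->column"), EdgeType.TABLE_COLUMN, table.vid, column.vid),
]
```

Hashing a qualified name is one simple way to obtain uniform ids across heterogeneous source databases; any stable, collision-resistant scheme would serve the same purpose.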
  • the data collection and association mapping process of the present invention collects heterogeneous medical data from multiple sources from the data lake to form a set of medical data elements to be screened.
  • Use the metadata collection tool to capture the metadata stored in the data lake.
  • Use the column vector generator to traverse the data stored in each column of each table in the medical data element set to be screened, and use the column vector representation model to predict and obtain the column vector representation of each column of each table.
  • Graph data association mapping: the collected metadata and the generated column vector representations are associated and mapped to the medical data element graph data model to obtain the medical data element graph data to be screened.
  • The collection is configured to capture only table and column information, lineage information, and the foreign key information of each column; other common metadata, such as primary keys, constraints, indexes, permissions, and triggers, is outside the scope of collection.
  • Metadata capture: perform metadata capture operations on each database in the data lake according to the parsing configuration.
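The collection scope described above amounts to a filter over metadata kinds. The following is a minimal sketch, assuming each captured metadata record is a dictionary with a "kind" key; the record layout, the kind names, and the function filter_metadata are illustrative and do not come from the patent.

```python
# Metadata kinds inside the collection scope (illustrative names).
COLLECTED_KINDS = {"table", "column", "lineage", "foreign_key"}


def filter_metadata(records):
    """Keep only the metadata records whose kind is within the collection scope."""
    return [record for record in records if record["kind"] in COLLECTED_KINDS]


# Hypothetical raw capture from one database.
raw = [
    {"kind": "table", "name": "patient"},
    {"kind": "column", "name": "patient.id"},
    {"kind": "index", "name": "idx_patient_id"},
    {"kind": "trigger", "name": "trg_audit"},
    {"kind": "foreign_key", "name": "visit.patient_id -> patient.id"},
]
kept = filter_metadata(raw)
```

Here the index and trigger records are dropped, which mirrors the statement that such metadata is not collected.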
  • the column vector generator uses a single column in the data table as a data element unit, uses the column vector representation model to convert the data stored in each column, and calculates the vector representation of each column;
  • The training data of the column vector representation model is the column data stored in the standard database, for which the medical data element classification has been completed manually and whose data structure conforms to the standard data model; these columns are referred to as standard classification columns.
  • The vector representation of a column vertex in the medical data element graph data is obtained by converting the data stored in the corresponding column of the medical data element set into text data, and adding [CLS] and [SEP] to the head and tail of each column's text data to mark the beginning and end of the data.
  • The initial vector representation h(w_t) of the character w_t is computed by the text representation model h.
  • The text representation model h can be a deep bidirectional language representation model based on the Transformer (the BERT model).
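The text conversion step can be sketched as follows. The patent states only that [CLS] and [SEP] mark the head and tail of each column's text; joining cells with a single space, skipping empty cells, and the function name column_to_text are assumptions made for this illustration.

```python
def column_to_text(values, sep=" "):
    """Serialize the cells of one column into a single '[CLS] ... [SEP]' string."""
    cells = [str(value) for value in values if value is not None]
    return sep.join(["[CLS]", *cells, "[SEP]"])


# Hypothetical column with one empty cell.
text = column_to_text(["A01", None, 3])
```

The resulting string would then be tokenized and passed to the text representation model h to obtain h(w_t) for each character.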
  • v(C_k) is the vector representation of column vertex C_k
  • d_k is the dimension of v(C_k)
  • softmax is the softmax function
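The symbols above (a softmax and a dimension d_k) are those of scaled dot-product self-attention. Since the patent's exact formula does not appear in this text, the sketch below uses the standard form softmax(H Hᵀ / sqrt(d_k)) H with identity projections, followed by mean pooling over rows; both the absence of learned projection matrices and the mean pooling are assumptions, not the patented computation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_pool(rows):
    """Pool n row vectors of dimension d_k into one d_k-dimensional vector.

    Each row attends to every row with weights softmax(q . k / sqrt(d_k));
    the attended rows are then averaged (assumed pooling step).
    """
    d_k = len(rows[0])
    scale = math.sqrt(d_k)
    attended = []
    for query in rows:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in rows]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, rows)) for j in range(d_k)]
        )
    n = len(attended)
    return [sum(vec[j] for vec in attended) / n for j in range(d_k)]


# Two identical rows: attention weights are uniform, so pooling returns the row.
pooled = self_attention_pool([[1.0, 2.0], [1.0, 2.0]])
```

Dividing by sqrt(d_k) keeps the dot products in a range where the softmax does not saturate as the dimension grows.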
  • the standard classification column data can be used for further transfer learning of the column vector representation model. Take the column as a unit, randomly cover 15% of the characters in the corresponding column data, and use the [MASK] label to replace the covered characters. Use the column vector representation model to predict the covered characters to further train and update the model, so that the obtained column vector representation model is more suitable for the task of screening valid data elements.
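The masking step of this transfer learning procedure can be sketched as below. The 15% rate and the [MASK] label come from the text above; the function name mask_column_text, the fixed random seed, and the rule of masking at least one character are assumptions for illustration.

```python
import random


def mask_column_text(chars, mask_rate=0.15, seed=0):
    """Replace a random share of characters with '[MASK]'.

    Returns the masked sequence and a dict mapping each masked position
    to the original character, which serves as the prediction target.
    """
    if not chars:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(chars) * mask_rate))
    positions = set(rng.sample(range(len(chars)), n_mask))
    masked = ["[MASK]" if i in positions else c for i, c in enumerate(chars)]
    labels = {i: chars[i] for i in positions}
    return masked, labels


# Hypothetical column text of 16 characters.
chars = list("2023-01-15 fever")
masked, labels = mask_column_text(chars)
```

During training, the model would be asked to recover labels from masked, and its parameters updated on the prediction error.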
  • The prediction data of the column vector representation model is the set of medical data elements to be screened, composed of every table and column of every database in the data lake; the set is traversed column by column.
  • Random sampling can be used (for example, randomly sampling 1000 values from a single column, repeated 100 times): the column vector representation model computes the column vector representation H_s(C_k) for the s-th sample of column vertex C_k.
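The sample-then-average procedure can be sketched as follows. The function toy_embed is a hypothetical stand-in for the column vector representation model (it merely measures average cell length and digit count), and the names column_vector, sample_size, and n_samples are assumptions; only the idea of averaging H_s(C_k) over repeated random samples comes from the text.

```python
import random


def toy_embed(values):
    """Hypothetical placeholder for the column vector representation model."""
    n = len(values)
    mean_length = sum(len(str(v)) for v in values) / n
    mean_digits = sum(sum(ch.isdigit() for ch in str(v)) for v in values) / n
    return [mean_length, mean_digits]


def column_vector(values, embed=toy_embed, sample_size=1000, n_samples=100, seed=0):
    """Average the representations H_s(C_k) of n_samples random samples of a column."""
    rng = random.Random(seed)
    k = min(sample_size, len(values))
    total = None
    for _ in range(n_samples):
        h_s = embed(rng.sample(values, k))
        total = h_s if total is None else [a + b for a, b in zip(total, h_s)]
    return [x / n_samples for x in total]


# Hypothetical column of 30 short codes, each two characters with one digit.
values = ["A1", "B2", "C3"] * 10
vector = column_vector(values, sample_size=5, n_samples=4)
```

Sampling bounds the cost per column regardless of table size, while averaging over several samples reduces the variance introduced by any single sample.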
  • The computed column vector representation of each column in the set of medical data elements to be screened, together with the metadata collection results, is mapped to the corresponding vertex and edge objects of the medical data element graph data model and stored as the medical data element graph data to be screened, whose data standard is the medical data element graph data model. The mapping is shown in the following table.

      No.  Mapped object          Metadata information
      1    Database vertex        Name (number) of the database in the medical institution
      2    Table vertex           Name (number) of the data table in the database
      3    Column vertex          Name (number) of the column in the data table
      4    Database-Table edge    Dependency between the database and the data table
      5    Table-Column edge      Inclusion relationship between the data table and its columns
      6    Column-Column edge     Foreign keys of database columns; lineage between columns
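The mapping in the table can be sketched as a small builder that turns collected metadata into vertex and edge records. The nested dictionary layout of metadata, the dotted id scheme, and the function name build_graph are assumptions for illustration; lineage relations, which the table also maps to Column-Column edges, would be added in the same way as the foreign keys shown here.

```python
def build_graph(metadata, column_vectors):
    """Map collected metadata and column vectors to vertex and edge records."""
    vertices, edges = {}, []
    for db_name, db in metadata["databases"].items():
        vertices[db_name] = {"label": "database", "database_type": db["type"]}
        for table_name, columns in db["tables"].items():
            table_id = f"{db_name}.{table_name}"
            vertices[table_id] = {"label": "table"}
            edges.append(("database_table", db_name, table_id))
            for column_name, data_type in columns.items():
                column_id = f"{table_id}.{column_name}"
                vertices[column_id] = {
                    "label": "column",
                    "data_type": data_type,
                    "vector_embeddings": column_vectors.get(column_id, []),
                }
                edges.append(("table_column", table_id, column_id))
    for src, dst in metadata.get("foreign_keys", []):
        edges.append(("foreign_key", src, dst))
    return vertices, edges


# Hypothetical metadata for one hospital information system database.
metadata = {
    "databases": {
        "his": {
            "type": "mysql",
            "tables": {
                "patient": {"id": "int", "name": "varchar"},
                "visit": {"patient_id": "int"},
            },
        }
    },
    "foreign_keys": [("his.visit.patient_id", "his.patient.id")],
}
vertices, edges = build_graph(metadata, {"his.patient.id": [0.1, 0.2]})
```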
  • the present invention proposes a method for quickly and automatically screening effective medical data elements, including the following two steps: (1) calculating the importance of each column vertex stored in the medical data element graph data to be screened in the medical data element graph data model. (2) Construct a medical data element screening model, calculate the possibility of mapping the column corresponding to each column vertex to the standard data model based on the importance of each column vertex, and filter out the effective medical data elements to form a set of medical data elements to be classified.
  • Importance_score is an importance function.
  • the importance function is updated through the Adam algorithm, and the medical data element screening model is updated.
  • the medical data element screening model judges whether the column in the set of medical data elements to be screened corresponding to the column vertex C k is a valid data element by calculating the threshold L'.
  • the formula for calculating the threshold L' is:
  • the filtered effective column vertex set is associated to form the medical data element graph data to be classified, and the corresponding filtered column set forms the medical data element set to be classified.
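The screening step reduces to keeping the column vertices whose importance passes a threshold. Because neither the importance function nor the formula for the threshold L' appears in this text, the sketch below takes both as inputs; the function name screen_columns and the use of a greater-than-or-equal comparison are assumptions.

```python
def screen_columns(importance, threshold):
    """Return the ids of column vertices whose importance reaches the threshold."""
    return sorted(cid for cid, score in importance.items() if score >= threshold)


# Hypothetical importance scores for three column vertices.
importance = {
    "his.patient.id": 0.92,
    "his.patient.tmp_flag": 0.08,
    "his.visit.diagnosis_code": 0.75,
}
valid_columns = screen_columns(importance, threshold=0.5)
```

The vertices in valid_columns, together with the edges between them, would then form the medical data element graph data to be classified.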
  • The medical data element classification process can be abstracted as finding, in D, the column vertex D_i with the highest matching degree with a column vertex C_k ∈ C, so as to determine the classification of the column corresponding to C_k as E_i. The data classification and association mapping process in the development of the medical big data center can in turn be abstracted as finding the C_k with the highest matching degree for every classification E_i of the standard data element classification system.
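The abstraction above is an argmax over matching degrees. In the sketch below, cosine similarity between column vectors stands in for the learned matching degree of the deep graph matching model; that substitution, along with the names cosine and classify and the sample categories, is an assumption made only to show the argmax structure.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as a stand-in matching degree."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(candidate_vector, standard_vectors, categories):
    """Assign C_k the category E_i of the best-matching standard vertex D_i."""
    best = max(standard_vectors, key=lambda d: cosine(candidate_vector, standard_vectors[d]))
    return categories[best]


# Hypothetical standard column vertices D_1, D_2 and their categories E_1, E_2.
standard_vectors = {"D1": [1.0, 0.0], "D2": [0.0, 1.0]}
categories = {"D1": "patient identifier", "D2": "diagnosis code"}
label = classify([0.9, 0.1], standard_vectors, categories)
```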
  • The data format or content of some columns in the standard database, whose data standard is the standard data model, is relatively uniform, and so is the format or content of the columns in the standard classified medical data element set that map to them. If the vertices corresponding to these columns are first located at their counterparts (called seed vertices) in the medical data element graph data to be classified, the search space of the deep graph matching model is reduced, which improves its efficiency.
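One way to locate a seed vertex for a column with a uniform format is to test candidate columns against a format pattern. The patent does not specify how seed vertices are located, so the regular-expression approach, the 0.9 match ratio, and the function name find_seed below are all assumptions.

```python
import re


def find_seed(pattern, candidate_columns, min_ratio=0.9):
    """Return the id of the candidate column whose values best match `pattern`.

    Returns None when no column reaches the minimum share of matching values.
    """
    regex = re.compile(pattern)
    best_id, best_ratio = None, 0.0
    for column_id, values in candidate_columns.items():
        if not values:
            continue
        ratio = sum(bool(regex.fullmatch(str(v))) for v in values) / len(values)
        if ratio > best_ratio:
            best_id, best_ratio = column_id, ratio
    return best_id if best_ratio >= min_ratio else None


# Hypothetical candidate columns from the graph data to be classified.
candidates = {
    "his.visit.visit_date": ["2020-01-01", "2021-12-31"],
    "his.visit.note": ["abc", "2020-01-01"],
}
seed = find_seed(r"\d{4}-\d{2}-\d{2}", candidates)
```

Each seed found this way fixes one correspondence in advance, so the matching model only has to search the neighborhood around it.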
  • Let N(D_i) denote the set of column vertices associated with the same parent vertex as D_i in the standard classified medical data element graph data.
  • The goal of the deep graph matching model is to search the medical data element graph data to be classified for a suitable subgraph whose column vertices match the column vertices in N(D_i) one to one, thereby classifying the medical data elements corresponding to the column vertices of that graph data.
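The neighbor set N(D_i) can be read directly off the table-to-column edges. A minimal sketch, assuming edges are given as (table id, column id) pairs; whether N(D_i) includes D_i itself is not stated in the text, and this sketch excludes it. The function name sibling_columns is hypothetical.

```python
def sibling_columns(column_id, table_column_edges):
    """Return N(column_id): the other column vertices under the same parent table."""
    parent = next(src for src, dst in table_column_edges if dst == column_id)
    return {dst for src, dst in table_column_edges if src == parent and dst != column_id}


# Hypothetical table -> column edges of a standard classified graph.
table_column_edges = [
    ("std.patient", "std.patient.id"),
    ("std.patient", "std.patient.name"),
    ("std.patient", "std.patient.birth_date"),
    ("std.visit", "std.visit.patient_id"),
]
neighbors = sibling_columns("std.patient.id", table_column_edges)
```

The same routine applied around a seed vertex in the graph data to be classified yields the candidate subgraph against which N(D_i) is matched.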
  • the medical data element classification process includes the following steps:
  • w(D′, D_i) represents the weight function of a column vertex D′ in N(D_i) with respect to the column vertex D_i; it is calculated as:
  • W_1 is a matrix parameter obtained from training.
  • W_2 is a matrix parameter obtained from training.
  • the matching degree match_2(D′, C′) between the column vertex D′ of the standard classified medical data element graph data and the column vertex C′ of the medical data element graph data to be classified is:
  • an embodiment of the present invention also provides a depth map matching-based automatic classification system for medical data elements. The system includes:
  • a standardized acquisition and mapping module for multi-source heterogeneous data elements, which defines the medical data element graph data model based on the minimum metadata information, forms the multi-source heterogeneous data elements stored in the data lake of the medical institution into a set of medical data elements to be screened, maps this set automatically to the medical data element graph data model, and stores the mapping result as medical data element graph data to be screened; for the implementation of this module, refer to step 1 above.
  • an effective medical data element screening module, which calculates the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; builds a medical data element screening model; calculates from the importance of each column vertex the possibility that its column maps to the standard data model; and screens out the valid column vertices, whose columns are the valid medical data elements. The medical data element graph data to be classified is formed by associating the set of valid column vertices, and the columns corresponding to the valid column vertices form the set of medical data elements to be classified; for the implementation of this module, refer to step 2 above.
  • a medical data element classification module based on the depth graph matching model, which determines, from the medical data element graph data to be classified, the set of seed vertices for the standard classified medical data element graph data; cuts subgraphs of the medical data element graph data to be classified based on the seed vertex set; and uses the depth graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to those column vertices; for the implementation of this module, refer to step 3 above.

Abstract

Disclosed are a depth map matching-based automatic classification method and system for medical data elements. The invention defines a medical data element graph data model based on minimum metadata information, so that the depth map matching model remains applicable to local data swamps that carry very little metadata. Automatic classification of data elements is thereby completed with the least metadata information, and the graph-structured data collected under the graph data model standard is suitable for training the depth map matching model. A vector representation of each medical data element is calculated by a representation learning method, and the effective data elements that may be mapped to a standard data model are screened quickly and automatically by classifying these vector representations. A vector representation of each column vertex is calculated with a graph attention mechanism, and the depth map matching model is constructed to complete the automatic classification of the medical data elements. The method and system have good scalability and can be used to handle various data swamp-to-data lake conversion problems.

Description

Method and system for automatic classification of medical data elements based on depth map matching
Technical field
The invention belongs to the field of regional medical big data centers and data production platforms, and in particular relates to a method and system for automatic classification of medical data elements based on depth map matching.
Background
With the construction and development of medical informatization, the combination of big data and medical services has driven continuous improvement in smart healthcare technology. Smart healthcare has now begun to take shape, and it has become an inevitable trend in the development of smart medical data governance for regional medical institutions to form medical consortia or medical communities and build unified medical big data centers. However, medical institutions run information platforms, software, and structurally complex systems of many different kinds, so data cannot be shared or exchanged between the platforms of different institutions; the data is fragmented and forms data islands. When a medical big data center is built across regional medical institutions, it is often found that the data inside an institution (especially long-standing historical data) is unmanaged, information system documentation is poorly maintained, field comments are lost, and document quality is low. Data lineage is then hard to trace quickly and effectively, and local data swamps form.
In the traditional development of a medical big data center, the information departments of the medical institutions and the responsible staff of the information system vendors must work with the developers of the medical big data center. Using data interfaces (including database views and data dictionaries) developed on the basis of a standard data model (such as OMOP CDM), they complete the data discovery, classification, and data association mapping tasks, and the manually classified and mapped data is stored in the standard database corresponding to the standard data model. The diversity of data sources and the density and unpredictability of data swamps commonly lead to long interface development cycles, complex coordination, and frequent rework. This consumes large amounts of manpower, material, and financial resources, hinders the rapid automated construction of regional medical big data centers, and creates many difficulties for the later in-depth use of medical data.
The data discovery, classification, and data association mapping tasks in the development of a medical big data center can be abstracted as the screening and classification of medical data elements followed by the association mapping of the classified medical data elements. First, the designers of the platform development plan define the standard data element classification system and the corresponding data interface specifications on the basis of the standard data model. Developers then filter and identify the data elements that match the data interface specifications through rule-based lookup and manual search. This process is called data discovery, and it determines which data elements in the medical institution's data lake should be collected during platform development. Developers build the data interfaces from the results of data discovery and use them to complete data collection. Finally, developers classify the multi-source heterogeneous data elements in the medical institution's data lake according to the standard data element classification system, then integrate them and map them onto that system.
The disadvantages of the prior art are mainly reflected in the following two aspects:
1) Medical institutions run many information systems from many different vendors, and data collection is complex and heavily manual, which hinders the construction of medical big data centers and the effective deployment of big data applications. A single Grade III Class A medical institution may run 100 to 300 information systems, which together form a huge data lake. Because the data lake holds a large volume of data with intricate relationships, data discovery during interface development depends on long-term cooperation from the institution's information department and the responsible staff of the information system vendors. The data interfaces are also chained to one another, so data discovery is expensive in labor and takes a long time, and once an intermediate link fails, troubleshooting is very complicated. This largely hinders the development of medical big data centers and the effective deployment of big data applications.
2) Information systems in medical institutions are replaced frequently, and common problems such as hard-to-maintain or largely missing documentation for legacy systems create local data swamps inside the institution's data lake, which further increases the difficulty of interface development. Medical data comprises the diagnosis and treatment data generated during patient care and the observation data generated during the institution's operation; its sources are diverse and its relationships complex. As information system versions change, historical data lies dormant in the data lake without effective management and forms local data swamps. Building a medical big data center requires integrating this historical data, that is, converting the data swamp into a data lake. Because the responsible staff in information departments and at system vendors change frequently, documentation for legacy systems is sometimes lost. When it is, interface developers can only complete data discovery by repeated trial and error, manually screening all candidate data in the data lake.
Because the information systems are numerous and their relationships complex, manual screening cannot make effective use of the global information in the data lake. It is slow and error-prone, and it greatly lengthens and complicates data discovery. When the association structure among the data in the data lake becomes too complex for manual handling, development of the corresponding data interface has to be abandoned. No mappable data can then be found for the corresponding category, and the data of that category is lost.
Summary of the invention
In building a medical big data center, problems such as the widespread presence of local data swamps in medical institutions make data interface development slow and maintenance difficult. Traditional solutions rely on manual processing and cannot complete data discovery, classification, and association mapping for massive data at scale. The multi-source heterogeneous data in a medical institution's data lake can be abstracted as a set of medical data elements to be screened, composed of data elements of unknown classification. Over the past few years, the rise and application of graph neural networks have successfully advanced the deep learning paradigm for graph-structured data.
The invention uses a depth map matching algorithm based on graph neural networks to improve on manual data element classification and to minimize dependence on information system documentation. Using only a very small amount of metadata from the medical institution's data lake, it screens effective data elements quickly on the basis of the text semantics of the medical data, which automates data discovery in the data lake. It then classifies medical data elements quickly with the depth map matching algorithm, which automates the classification and association mapping of data elements in the data lake onto the standard data element classification system. This greatly improves the efficiency of data interface development for medical big data centers. The data element classification method provided by the invention has good scalability and can be applied to various data swamp-to-data lake conversion problems.
The purpose of the present invention is achieved through the following technical solutions:
In one aspect, the invention discloses a method for automatic classification of medical data elements based on depth map matching. The method includes the following steps:
(1) Define a medical data element graph data model based on minimum metadata information; form the multi-source heterogeneous data elements stored in the data lake of the medical institution into a set of medical data elements to be screened, map this set automatically to the medical data element graph data model, and store the mapping result as medical data element graph data to be screened;
(2) Calculate the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; build a medical data element screening model, calculate from the importance of each column vertex the possibility that its column maps to the standard data model, and screen out the valid column vertices; the medical data element graph data to be classified is formed by associating the set of valid column vertices, and the columns corresponding to the valid column vertices form the set of medical data elements to be classified;
(3) Determine, from the medical data element graph data to be classified, the set of seed vertices for the standard classified medical data element graph data; cut subgraphs of the medical data element graph data to be classified based on the seed vertex set; use the depth graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to those column vertices.
Further, the medical data element graph data model is modeled as a directed property graph composed of two kinds of graph elements, vertices and edges.
A vertex consists of a label and the attribute group for that label: the label gives the vertex type, and the attribute group gives the one or more attributes that the label carries. The ontology information of vertices comprises the vertex types and the attribute information of each type. The vertex types are database vertex, table vertex, and column vertex. A database vertex carries a database vertex index and database type information; a table vertex carries a table vertex index; a column vertex carries a column vertex index, column data type information, and a column vector representation.
An edge consists of an edge type and edge attributes, and every edge is directed. The ontology information of edges comprises the edge types and the attribute information of each type. The edge types are a parent-child association from a database vertex to a table vertex, a parent-child association from a table vertex to a column vertex, and a foreign key whose start and end points are both column vertices. The attribute information for all three edge types is an edge index.
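The graph data model described above can be sketched as a small set of Python data classes. This is only an illustration under stated assumptions: the class names, field names, and edge type strings (`Vertex`, `Edge`, `ElementGraph`, `db_to_table`, and so on) are hypothetical and do not come from the patent; only the three vertex types, the three edge types, and their attributes follow the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Allowed (start label, end label) pairs for the three edge types in the text.
# The edge type names themselves are illustrative assumptions.
EDGE_ENDPOINTS = {
    "db_to_table": ("database", "table"),      # parent-child association
    "table_to_column": ("table", "column"),    # parent-child association
    "foreign_key": ("column", "column"),       # foreign key between columns
}

@dataclass
class Vertex:
    index: int                            # vertex index (all vertex types)
    label: str                            # "database", "table", or "column"
    db_type: Optional[str] = None         # database vertices only
    data_type: Optional[str] = None       # column vertices only
    vector: Optional[List[float]] = None  # column vector representation

@dataclass
class Edge:
    index: int   # edge index, the only edge attribute named in the text
    kind: str
    start: int
    end: int

@dataclass
class ElementGraph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vertex: Vertex) -> None:
        self.vertices[vertex.index] = vertex

    def add_edge(self, kind: str, start: int, end: int) -> Edge:
        # Reject edges whose endpoints do not fit the declared edge type.
        expected = EDGE_ENDPOINTS[kind]
        actual = (self.vertices[start].label, self.vertices[end].label)
        if actual != expected:
            raise ValueError(f"{kind} edge must run {expected}, got {actual}")
        edge = Edge(index=len(self.edges), kind=kind, start=start, end=end)
        self.edges.append(edge)
        return edge
```

Checking endpoint labels when an edge is added keeps the stored graph consistent with the ontology, which matters later because subgraph cutting relies on the parent-child and foreign key edges.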
Further, mapping the multi-source heterogeneous data elements to the medical data element graph data model includes:
collecting the multi-source heterogeneous medical data from the data lake to form the set of medical data elements to be screened;
capturing the metadata stored in the data lake with a metadata collection tool;
using a column vector generator to traverse the data stored in each column of each table in the set of medical data elements to be screened, and predicting a column vector representation for each column with the column vector representation model;
associating and mapping the collected metadata and the generated column vector representations to the medical data element graph data model through graph data association mapping, to obtain the medical data element graph data to be screened.
Further, the column vector generator takes a single column of a data table as one data element unit, uses the column vector representation model to transform the data stored in each column, and calculates the vector representation of each column.
Training of the column vector representation model is as follows. The training data is the column data stored in the standard database for which medical data element classification has been completed manually and whose data structure conforms to the standard data model; these columns are referred to as standard classification columns. The column vertices in the standard classified medical data element graph data correspond one to one with the standard classification columns.
Let the set of column vertices in the standard classified medical data element graph data be C = {c_{k,j}}, where c_{k,j} denotes the data in column k, row j of the standard classification columns corresponding to the column vertex set, and c_{k,j} = {w_t}, t = 1, 2, ..., m, where m is the total number of characters in row j and w_t are the characters that make up c_{k,j}. The initial vector representation h(w_t) of character w_t is calculated by the text representation model h. Under column vertex C_k of the standard classified medical data element graph data, n rows of data {c_{k,j}}, j = 1, 2, ..., n, are randomly sampled, and the vector representation of the j-th row is
Figure PCTCN2022116971-appb-000001
The correlation among the rows of data under column vertex C_k in the standard classified medical data element graph data is calculated with a self-attention mechanism to obtain the column vector representation H(C_k) of column vertex C_k. The calculation formula is:
Figure PCTCN2022116971-appb-000002
Figure PCTCN2022116971-appb-000003
where v(C_k) is the vector representation of column vertex C_k, d_k is the dimension of v(C_k), and softmax is the softmax function.
Prediction with the column vector representation model is as follows. The prediction data is the set of medical data elements to be screened, made up of every column of every table in every database in the data lake, and this set is traversed column by column. For each column vertex, the column vector representation model calculates a column vector representation for each random sample drawn from the column; the representations predicted over multiple random samples are averaged, and the average is taken as the final column vector representation of that column vertex.
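The prediction step, repeated random sampling followed by averaging, can be sketched as follows. This is a minimal illustration and not the patent's model: `toy_row_model` is a hypothetical stand-in for the trained column vector representation model, and the parameter names `n_rows`, `n_draws`, and `seed` are assumptions introduced here.

```python
import random
from typing import Callable, List, Sequence

def toy_row_model(rows: Sequence[str], dim: int = 4) -> List[float]:
    """Hypothetical stand-in for the trained column vector representation model.

    It buckets characters by code point and normalizes the counts; the real
    model in the text uses a text representation model and self-attention.
    """
    vec = [0.0] * dim
    for row in rows:
        for ch in row:
            vec[ord(ch) % dim] += 1.0
    total = sum(vec) or 1.0  # avoid dividing by zero for empty input
    return [x / total for x in vec]

def column_vector(
    values: Sequence[str],
    model: Callable[[Sequence[str]], List[float]] = toy_row_model,
    n_rows: int = 3,
    n_draws: int = 5,
    seed: int = 0,
) -> List[float]:
    """Average the model output over several random samples of one column."""
    rng = random.Random(seed)
    pool = list(values)
    draws = []
    for _ in range(n_draws):
        sample = rng.sample(pool, min(n_rows, len(pool)))  # n rows per draw
        draws.append(model(sample))
    dim = len(draws[0])
    # Element-wise mean over all draws gives the final column vector.
    return [sum(d[i] for d in draws) / n_draws for i in range(dim)]
```

Averaging over several draws reduces the dependence of the column vector on any single random sample, which is presumably why the text specifies multiple samplings per column.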
Further, calculating the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened includes:
For a column vertex C_k stored in the medical data element graph data to be screened, randomly select p column vertices {C_t}, t = 1, 2, ..., p, from the set of column vertices excluding C_k. The importance score Im(C_k) of C_k in the medical data element graph data model is calculated from the correlation between C_k and the selected column vertices. Im(C_k) is defined as:
Figure PCTCN2022116971-appb-000004
Figure PCTCN2022116971-appb-000005
where Importance_score is the importance function.
Further, the medical data element screening model is trained and used for prediction as follows:
The standard classified medical data element set, built by manual classification and association mapping according to the standard data element classification system, is converted into standard classified medical data element graph data. Let the set of column vertices stored in this graph data be S = {s_k}, and let the set of column vertices corresponding to the columns excluded by manual screening while building the standard classified medical data element set be S′ = {s′_k}.
During training, q column vertices are randomly drawn from S as the positive sample set {s_t}, t = 1, 2, ..., q, and q column vertices are randomly drawn from S′ as the negative sample set {s′_t}, t = 1, 2, ..., q. For a sample (s_i, y_i), where s_i is the i-th column vertex and y_i ∈ {0, 1} is its true class, let the importance score be Im(s_i). The loss function Loss of the medical data element screening model is then calculated from the importance scores:
Figure PCTCN2022116971-appb-000006
At prediction time, the medical data element screening model calculates a threshold value L′ to decide whether the column in the set of medical data elements to be screened that corresponds to column vertex C_k is a valid data element. L′ is calculated as:
Figure PCTCN2022116971-appb-000007
If L′ ≥ 0.5, column vertex C_k is a valid column vertex and the corresponding column is a valid data element.
The medical data element graph data to be classified is formed by associating the filtered set of valid column vertices, and the corresponding filtered set of columns forms the set of medical data elements to be classified.
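The sample construction and the final thresholding can be sketched as below. This covers only the parts stated in the text: q positives drawn from S, q negatives drawn from S′, and the L′ ≥ 0.5 decision. The formulas for Loss and L′ are not reproduced, so `select_valid` takes precomputed L′ values as input; both function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

def build_training_set(
    kept: Sequence[str], excluded: Sequence[str], q: int, seed: int = 0
) -> List[Tuple[str, int]]:
    """Draw q positives from S (kept) and q negatives from S' (excluded)."""
    rng = random.Random(seed)
    positives = rng.sample(list(kept), q)
    negatives = rng.sample(list(excluded), q)
    # Label 1 marks a column kept by manual screening, label 0 an excluded one.
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]

def select_valid(l_prime: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep the column vertices whose L' value reaches the threshold."""
    return [vertex for vertex, value in l_prime.items() if value >= threshold]
```

Drawing the same number of positives and negatives gives the screening model a balanced training batch, so the 0.5 cut-off is not biased toward either class by the sampling itself.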
Further, determining, from the medical data element graph data to be classified, the set of seed vertices for the standard classified medical data element graph data includes:
Let E be the set of all standard classifications in the standard data element classification system defined by the standard data model, and let D be the set of column vertices in the standard classified medical data element graph data, where the classification of D_i ∈ D in the standard data element classification system is E_i ∈ E. Let C be the set of column vertices stored in the medical data element graph data to be classified. The medical data element classification process is abstracted as finding, in D, the column vertex D_i that best matches a column vertex C_k ∈ C, so that the column corresponding to C_k is assigned the classification E_i.
For a column vertex D_i ∈ D, randomly sample r_0 data items from the column corresponding to D_i:
Figure PCTCN2022116971-appb-000008
For a column vertex C_k ∈ C, randomly sample r_0 data items from the column corresponding to C_k:
Figure PCTCN2022116971-appb-000009
The matching degree match_1(D_i, C_k) of D_i and C_k is then:
Figure PCTCN2022116971-appb-000010
where v(x) denotes the vector representation of data x. The seed vertex corresponding to D_i is the column vertex with the highest matching degree with it,
Figure PCTCN2022116971-appb-000011
that is:
Figure PCTCN2022116971-appb-000012
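Seed selection is an argmax over the candidate column vertices, which can be sketched as follows. The exact form of match_1 is given only as a formula image, so this sketch assumes the mean pairwise cosine similarity between the vectors v(x) of the sampled data items; that choice, and the function names, are assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_1(d_samples: List[Vector], c_samples: List[Vector]) -> float:
    """Assumed matching degree: mean pairwise cosine similarity of samples."""
    pairs = [cosine(d, c) for d in d_samples for c in c_samples]
    return sum(pairs) / len(pairs) if pairs else 0.0

def seed_vertices(
    standard: Dict[str, List[Vector]], candidates: Dict[str, List[Vector]]
) -> Dict[str, str]:
    """For each standard column vertex D_i, pick the best-matching C_k."""
    seeds = {}
    for d_name, d_samples in standard.items():
        seeds[d_name] = max(
            candidates, key=lambda c: match_1(d_samples, candidates[c])
        )
    return seeds
```

For instance, a standard column whose sampled vectors point along the first axis is paired with the candidate column whose samples point the same way, and that candidate becomes the seed from which the subgraph is later cut.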
Further, cutting subgraphs of the medical data element graph data to be classified based on the seed vertex set includes:
Let
Figure PCTCN2022116971-appb-000013
denote the set of column vertices in the medical data element graph data to be classified that have a parent-child relationship with
Figure PCTCN2022116971-appb-000014
and let
Figure PCTCN2022116971-appb-000015
denote the set of column vertices in the medical data element graph data to be classified that have a foreign key relationship with
Figure PCTCN2022116971-appb-000016
Then, based on the seed vertex
Figure PCTCN2022116971-appb-000017
the subgraph obtained by cutting,
Figure PCTCN2022116971-appb-000018
is:
Figure PCTCN2022116971-appb-000019
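Subgraph cutting can be sketched as collecting the seed vertex together with the two neighbor sets defined above. The cutting formula itself is an image, so this sketch rests on two assumptions: the "parent-child" set is read as the columns that share the seed's parent table, and foreign keys are followed in both directions. The function name and argument layout are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

def cut_subgraph(
    seed: str,
    parent_of: Dict[str, str],
    foreign_keys: Iterable[Tuple[str, str]],
) -> Set[str]:
    """Return the column vertices of the subgraph cut around one seed vertex.

    parent_of maps each column vertex to its parent table vertex;
    foreign_keys lists (start column, end column) foreign key edges.
    """
    table = parent_of[seed]
    # Columns under the same parent table as the seed (assumed reading).
    siblings = {c for c, t in parent_of.items() if t == table and c != seed}
    # Columns linked to the seed by a foreign key, in either direction.
    linked = set()
    for start, end in foreign_keys:
        if start == seed:
            linked.add(end)
        elif end == seed:
            linked.add(start)
    return {seed} | siblings | linked
```

Restricting the later matching step to this neighborhood is what reduces the search space: only columns structurally close to the seed are compared against the standard columns.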
Let N(D_i) denote the set of column vertices in the standard classified medical data element graph data that are associated with the same parent vertex as D_i. The goal of the depth graph matching model is then to search the subgraph
Figure PCTCN2022116971-appb-000020
for a subgraph whose column vertices match the column vertices in N(D_i) one to one, thereby classifying the medical data elements corresponding to the column vertices in
Figure PCTCN2022116971-appb-000021
Further, using the depth graph matching model to classify the column vertices in the medical data element graph data to be classified includes:
According to the graph attention mechanism, the vector representation V(D_i) of column vertex D_i in the standard classified medical data element graph data is calculated as:
Figure PCTCN2022116971-appb-000022
where
Figure PCTCN2022116971-appb-000023
denotes r_1 data items randomly sampled from the column corresponding to column vertex D′, and w(D′, D_i) is the weight function of a column vertex D′ in N(D_i) with respect to column vertex D_i;
According to the graph attention mechanism, for the column vertex
Figure PCTCN2022116971-appb-000024
of the medical data element graph data to be classified, the vector representation
Figure PCTCN2022116971-appb-000025
is calculated as:
Figure PCTCN2022116971-appb-000026
where
Figure PCTCN2022116971-appb-000027
denotes r_1 data items randomly sampled from the column corresponding to column vertex C′, and
Figure PCTCN2022116971-appb-000028
is the weight function of a column vertex C′ in
Figure PCTCN2022116971-appb-000029
with respect to the column vertex
Figure PCTCN2022116971-appb-000030
The matching degree match_2(D′, C′) between column vertex D′ ∈ N(D_i) and column vertex
Figure PCTCN2022116971-appb-000031
is:
Figure PCTCN2022116971-appb-000032
取与C′匹配度最高的列顶点
Figure PCTCN2022116971-appb-000033
即:
Take the column vertex with the highest matching degree with C'
Figure PCTCN2022116971-appb-000033
Right now:
Figure PCTCN2022116971-appb-000034
Figure PCTCN2022116971-appb-000034
The column corresponding to column vertex C′ in the medical data element graph data to be classified is assigned the category, in the standard data element classification system, that corresponds to (Figure PCTCN2022116971-appb-000035).
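The formulas referenced above are published as images and are not reproduced in this text, so the following is only an illustrative sketch under stated assumptions, not the disclosed computation. It assumes that the weight function w can be stood in for by softmax-normalized dot products, that a vertex representation is the weighted sum of its neighbors' vectors, and that match_2 can be stood in for by cosine similarity. The function names and the category labels in the usage note are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; an assumed stand-in for match_2."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def aggregate(center, neighbors):
    """Attention-style aggregation of neighbor vectors.

    The weights softmax(dot(center, neighbor)) are an assumed stand-in
    for the weight function w(D', D_i) named in the text.
    """
    weights = softmax([dot(center, n) for n in neighbors])
    dim = len(center)
    return [sum(w * n[k] for w, n in zip(weights, neighbors)) for k in range(dim)]


def best_match(c_vec, standard_vectors):
    """Return the key of the standard vertex with the highest matching degree."""
    return max(standard_vectors, key=lambda key: cosine(standard_vectors[key], c_vec))
```

For example, with two hypothetical standard column vertices `{"patient_id": [1.0, 0.0], "diagnosis_code": [0.0, 1.0]}`, a column to be classified whose vector is `[0.9, 0.1]` is matched to `patient_id`, and its column would take the category attached to that standard vertex.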
In another aspect, the present invention discloses a system for automatically classifying medical data elements based on deep graph matching, the system comprising:
A module for standardized collection and mapping of multi-source heterogeneous data elements: defines a medical data element graph data model based on minimal metadata information; assembles the multi-source heterogeneous data elements stored in the data lake of a medical institution into a set of medical data elements to be screened, maps the set automatically to the medical data element graph data model, and stores the mapping result as the medical data element graph data to be screened;
A valid medical data element screening module: computes the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; builds a medical data element screening model that, based on the importance of each column vertex, computes the likelihood that the column corresponding to that vertex maps to the standard data model, and selects the valid column vertices, whose corresponding columns are the valid medical data elements; the associated set of valid column vertices forms the medical data element graph data to be classified, and the set of columns corresponding to the valid column vertices forms the set of medical data elements to be classified;
A medical data element classification module based on the deep graph matching model: determines, from the medical data element graph data to be classified, the seed vertex set of the standard classified medical data element graph data; cuts subgraphs from the medical data element graph data to be classified based on the seed vertex set; and uses the deep graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to the column vertices.
The beneficial effects of the present invention are:
1) The present invention uses only the minimal metadata information stored in the data lake of a medical institution, and uses the medical data element graph data model both to collect the institution's medical data elements in a standardized way and to make full use of the relationship information among the medical data elements to be screened and classified.
2) The method of the present invention reduces the dependence of the data discovery, classification, and association mapping process on the historical documentation of the medical institution's information systems; missing or erroneous historical documentation has little effect on the classification results for medical data elements.
3) The method of the present invention greatly reduces manual intervention in the data discovery, classification, and association mapping process by classifying the medical data elements with an artificial intelligence algorithm, and provides a heuristic solution to the problem of automatically classifying medical data elements that arises when the data of a medical big data center must be updated in real time, aggregated dynamically, and used in depth.
Description of the Drawings
Fig. 1 is an overall flowchart of the method of the present invention;
Fig. 2 is a flowchart of a traditional medical data element classification method;
Fig. 3 is a schematic diagram of the implementation process of the automatic classification method for medical data elements based on deep graph matching provided by the present invention;
Fig. 4 is an example of the medical data element graph data model;
Fig. 5 is a schematic diagram of the mapping of multi-source heterogeneous data elements to the medical data element graph data model.
Detailed Description
To make the above objects, features, and advantages of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Many specific details are set forth in the following description to provide a full understanding of the present invention. The present invention can, however, be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; the present invention is therefore not limited by the specific embodiments disclosed below.
The terms used in the present invention are first explained below:
Metadata: data that describes other data. Metadata is data about data. It does not necessarily refer to a single item of data; it can be understood as a group of information or data used to describe data. If everything in that group describes or reflects some aspect of a given piece of data, the group can be called metadata. Metadata can describe the data's elements or attributes (name, size, data type, etc.), its structure (length, fields, data columns), or related information (where it is located, how it is linked, who owns it). Metadata is ubiquitous in everyday life: for any class of things, a set of metadata can be defined.
Data element: can be understood as the basic unit of data. The basic data element standard for health information specifies and defines a unique Chinese name and code for every relevant item of information in the medical and health field, with codes expressed as strings of letters, Chinese characters, and digits. A data element enumerates and defines an information resource in a specific semantic context. Complete data element name = object class term + property class term + representation class term + (qualifier term).
Differences and connections between data elements and metadata: metadata cannot cover all the information needed to understand the data that a data element represents. Information about data elements is an integral part of any (organization's) metadata. Each element of metadata is a data element, and metadata is described using metadata attributes and description methods that conform to data element standards. Storing metadata in a repository and organizing it requires modeling, and modeling requires obtaining metadata from a data element registry or repository. Metadata is data elements expressed in a consistent, standard way. Both the metadata format and the data element dictionary format consist of attributes such as line number, Chinese name, English name, identifier (short name), definition, constraint/condition, maximum number of occurrences, data type, and value domain. The difference is that the data element dictionary format additionally has attributes such as context and synonymous name.
Data lake: a data lake is a way of storing data in its natural format in a system or repository, which makes it possible to organize data in a variety of schemas and structures, usually as object blobs or files. The main idea of a data lake is to store all of an enterprise's data in one place, from raw data (exact copies of source system data) to the transformed data used for tasks such as reporting, visualization, analytics, and machine learning. In China, an entire HDFS deployment is commonly called a data warehouse (in the broad sense), that is, the place where all data is stored, whereas elsewhere it is generally called a data lake. When a data lake lacks governance, it becomes a data swamp. Building a data lake is easy, but getting value out of it is hard. In the end, data is only ever poured in, while application scenarios are few and there is little or no output, producing a one-way lake. Most enterprises that use a data lake find that, when the data is actually needed, it cannot be used because the data quality in the lake is too poor.
Graph neural network: over the past few years, the rise and successful application of neural networks has driven research in pattern recognition and data mining. Many machine learning tasks that once relied heavily on hand-crafted features (such as object detection, machine translation, and speech recognition) have been transformed by end-to-end deep learning paradigms. Although traditional deep learning methods have been very successful at extracting features from Euclidean data, the data in many practical applications is generated from non-Euclidean spaces, where the performance of traditional deep learning methods remains unsatisfactory. In a graph, each data sample (node) has edges linking it to other data samples, and this information can be used to capture the interdependencies between instances. A graph neural network is a neural network applied to graph-structured (non-Euclidean) data.
Deep graph matching: graph matching is a classic problem in artificial intelligence with important applications in several fields, such as matching 2D/3D shapes in computer vision, matching protein networks in bioinformatics, and matching users across different social networks. Deep graph matching refers to methods that solve the graph matching problem using graph neural networks.
As shown in Fig. 1, the present invention provides a method for automatically classifying medical data elements based on deep graph matching, comprising the following steps:
(1) Standardized collection and mapping of multi-source heterogeneous data elements, including:
defining a medical data element graph data model based on minimal metadata information;
assembling the multi-source heterogeneous data elements stored in the data lake of a medical institution into a set of medical data elements to be screened, mapping the set automatically to the medical data element graph data model, and storing the mapping result as the medical data element graph data to be screened;
(2) Computing the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; building a medical data element screening model that, based on the importance of each column vertex, computes the likelihood that the column corresponding to that vertex maps to the standard data model, and selecting the valid column vertices; the associated set of valid column vertices forms the medical data element graph data to be classified, and the set of columns corresponding to the valid column vertices forms the set of medical data elements to be classified;
(3) Determining, from the medical data element graph data to be classified, the seed vertex set of the standard classified medical data element graph data; cutting subgraphs from the medical data element graph data to be classified based on the seed vertex set; and using the deep graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to the column vertices.
Fig. 2 is a flowchart of a traditional medical data element classification method. The implementation of each part of the method of the present invention is described in detail below with reference to Fig. 3.
1. Standardized collection and mapping of multi-source heterogeneous data elements
1.1 Definition of the medical data element graph data model
The data of a medical institution is aggregated into a data lake. The data in the lake is multi-source and heterogeneous: it includes observational data recorded during diagnosis and treatment and during the operation of the institution, and the observational databases differ in purpose and design. Electronic medical records produced during diagnosis and treatment are intended to support clinical practice, whereas operational data is built for hospital management and medical insurance reimbursement. Each is collected for a different purpose, so the data differs in logical organization and physical format.
A data model is the tool used in database design to abstract the real world. Establishing a unified, standard data model that defines data structures, data operations, and data constraints helps ensure the quality of the collected data and keeps the data representation standardized and controllable. A graph data model is a data model developed on a graph database.
Because the databases in a data lake are of different types, the relationships among data tables and data columns are complex. Observational data in medical institutions spans long periods, and missing database documentation is common. The present invention therefore defines a medical data element graph data model based on the minimal metadata information of the databases in the data lake, with three aims: that the deep graph matching model remain effective for local data swamps with very little metadata; that data elements can be classified automatically using minimal metadata; and that the graph-structured data collected under the graph data model be suitable for training the deep graph matching model. This provides a heuristic solution for the automatic classification of medical data elements when building a medical big data center.
The graph data model is a directed property graph composed of two kinds of graph elements: vertices (Vertex) and edges (Edge). A vertex consists of a label and the attribute group for that label; the label represents the type of the vertex, and the attribute group represents the one or more attributes that the label has. The ontology information of vertices comprises the vertex types and the attribute information for each vertex type.
The ontology information of the vertices of the medical data element graph data model defined by the present invention is shown in the following table:
Table 1. Ontology information of the vertices of the medical data element graph data model
Figure PCTCN2022116971-appb-000036
Here vid is the unique index id of each vertex in the graph and can be generated uniformly by hashing. vector_embeddings is the column vector representation predicted by the column vector representation model.
In the graph data model, an edge consists of an edge type and edge attributes. Every edge is directed, and a directed edge expresses an association from one vertex (the source, src) to another vertex (the destination, dst). The ontology information of edges comprises the edge types and the attribute information for each edge type.
The ontology information of the edges of the medical data element graph data model defined by the present invention is shown in the following table:
Table 2. Ontology information of the edges of the medical data element graph data model
Source label | Destination label | Edge type | Attribute | Attribute description
Database | Table | Parent-child association | eid | Edge index
Table | Column | Parent-child association | eid | Edge index
Column | Column | Foreign key | eid | Edge index
Fig. 4 is an example of the medical data element graph data model.
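By way of illustration only, the directed property graph described above might be represented in code as follows. This is a minimal sketch under assumptions: Table 1 is published as an image, so only the vid and vector_embeddings attributes are taken from the text; the name attribute, the SHA-1 hashing scheme, and the identifiers make_id, Vertex, and Edge are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

# Allowed (source label, destination label) pairs and their edge type, per Table 2.
EDGE_TYPES = {
    ("Database", "Table"): "parent_child",
    ("Table", "Column"): "parent_child",
    ("Column", "Column"): "foreign_key",
}


def make_id(*parts: str) -> str:
    """Hash-based id. SHA-1 over the joined parts is an assumed choice;
    the text only says that ids can be produced by hashing."""
    return hashlib.sha1("/".join(parts).encode("utf-8")).hexdigest()


@dataclass
class Vertex:
    label: str  # "Database", "Table", or "Column"
    name: str   # hypothetical attribute; Table 1 is not reproduced in the text
    vid: str    # unique index id of the vertex
    vector_embeddings: Optional[List[float]] = None  # used by Column vertices


@dataclass
class Edge:
    src: Vertex
    dst: Vertex
    edge_type: str = field(init=False)
    eid: str = field(init=False)

    def __post_init__(self) -> None:
        key = (self.src.label, self.dst.label)
        if key not in EDGE_TYPES:
            raise ValueError(f"no edge type defined for {key}")
        self.edge_type = EDGE_TYPES[key]
        self.eid = make_id(self.src.vid, self.dst.vid, self.edge_type)
```

Deriving the edge type from the pair of vertex labels means an edge that Table 2 does not allow, such as Table to Database, is rejected when it is constructed.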
1.2 Mapping of multi-source heterogeneous data elements to the medical data element graph data model
In the data collection and association mapping process of the present invention, multi-source heterogeneous medical data is collected from the data lake to form the set of medical data elements to be screened. A metadata collection tool captures the metadata stored in the data lake. A column vector generator traverses the data stored in each column of each table in the set of medical data elements to be screened, and a column vector representation model predicts the column vector representation of each column. Finally, through graph data association mapping, the collected metadata and the generated column vector representations are mapped to the medical data element graph data model, yielding the medical data element graph data to be screened. With reference to Fig. 5, the implementation is as follows:
(1) Metadata collection tool
a) Database adaptation: because the data lake of a medical institution usually contains databases of different types, the metadata collection tool needs a database adaptation module for each database type.
b) Parsing configuration: because the final target of the association mapping is the medical data element graph data model, collection is configured to gather only the table and column information, the lineage information, and the foreign key information of each column from the metadata; other common metadata such as primary keys, constraints, indexes, permissions, and triggers is out of scope.
c) Metadata capture: according to the parsing configuration, a metadata capture operation is run against each database in the data lake.
d) Data association: according to the database adaptation, the field types of the different database types are mapped uniformly to graph database data types. For example, the varchar2 type in Oracle and the varchar type in MySQL are both mapped to the string type of the graph database; other database types are handled in the same way.
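The type unification in step d) can be sketched as a simple lookup. This is an illustrative assumption about how such a mapping might be organized; only the two entries named in the text are included, and the names TYPE_MAP and to_graph_type are hypothetical.

```python
# (database kind, native field type) -> graph database type.
# Only the two mappings stated in the text are listed here.
TYPE_MAP = {
    ("oracle", "varchar2"): "string",
    ("mysql", "varchar"): "string",
}


def to_graph_type(db_kind: str, native_type: str) -> str:
    """Map a native field type to a graph database type.

    A length suffix such as "(64)" is stripped and case is ignored.
    Unknown combinations raise KeyError rather than being guessed.
    """
    base = native_type.split("(", 1)[0].strip().lower()
    key = (db_kind.strip().lower(), base)
    if key not in TYPE_MAP:
        raise KeyError(f"no mapping configured for {key}")
    return TYPE_MAP[key]
```

Raising an error for an unconfigured type, instead of falling back to a default, keeps gaps in the adaptation module visible during collection.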
(2) Column vector generator
The column vector generator treats a single column of a data table as one data element unit, uses the column vector representation model to transform the data stored in each column, and computes the vector representation of each column.
a) Training of the column vector representation model
The training data for the column vector representation model is column data stored in a standard database whose medical data element classification has been completed manually and whose data structure conforms to the standard data model, referred to as standard classified columns.
The column vertices in the standard classified medical data element graph data correspond one to one with the standard classified columns.
The vector representation of a column vertex in the medical data element graph data is obtained by converting the data stored in the corresponding column of the medical data element set into text data and adding [CLS] and [SEP] to the head and tail of each column's text data to mark the beginning and end of the data.
Let the set of column vertices in the standard classified medical data element graph data be C = {c_{k,j}}, where c_{k,j} denotes the data in row j of the k-th standard classified column corresponding to the column vertex set, and c_{k,j} = {w_t}, t = 1, 2, ..., m, where m is the total number of characters in row j and w_t are the characters that make up c_{k,j}. The initial vector representation h(w_t) of character w_t is computed by a text representation model h, which can be a deep bidirectional language representation model based on the Transformer (the BERT model). Under column vertex C_k of the standard classified medical data element graph data, n rows of data {c_{k,j}}, j = 1, 2, ..., n are sampled at random, and the vector representation of the j-th row is
Figure PCTCN2022116971-appb-000037
The self-attention mechanism is used to compute the correlations among the rows of data under column vertex C_k in the standard classified medical data element graph data, giving the column vector representation H(C_k) of column vertex C_k:
Figure PCTCN2022116971-appb-000038
Figure PCTCN2022116971-appb-000039
where v(C_k) is the vector representation of column vertex C_k, d_k is the dimension of v(C_k), and softmax is the softmax function.
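The formulas for H(C_k) are published as images and are not reproduced here. The sketch below therefore assumes the standard scaled dot-product form of self-attention, softmax(V Vᵀ / √d_k) V, applied to the sampled row vectors and followed by mean pooling over rows; both the exact form and the pooling step are assumptions, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(rows):
    """Scaled dot-product self-attention over n row vectors of dimension d_k.

    Each output row is a weighted sum of all rows, with weights given by
    softmax(q . k / sqrt(d_k)) for query row q and every key row k.
    """
    d_k = len(rows[0])
    attended = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in rows]
        weights = softmax(scores)
        attended.append(
            [sum(w * r[i] for w, r in zip(weights, rows)) for i in range(d_k)]
        )
    return attended


def column_vector(rows):
    """Mean-pool the attended rows into one column vector (assumed pooling)."""
    attended = self_attention(rows)
    n = len(attended)
    return [sum(v[i] for v in attended) / n for i in range(len(rows[0]))]
```

In practice the row vectors would come from the text representation model h; here they are plain lists of floats so that the attention step can be read in isolation.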
To obtain more accurate column vertex vector representations, once enough standard classified columns have accumulated as training data, the column vector representation model can be further adapted by transfer learning on the standard classified column data. Column by column, 15% of the characters in the column data are masked at random and replaced with the [MASK] token. Training the column vector representation model to predict the masked characters further updates the model, so that the resulting model is better suited to the task of screening valid data elements.
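The masking step can be sketched as follows. This is a minimal illustration assuming character-level positions are chosen uniformly at random; the rounding rule, the minimum of one masked character, and the function name mask_characters are assumptions not stated in the text.

```python
import random


def mask_characters(chars, ratio=0.15, seed=None):
    """Replace a random share of characters with the [MASK] token.

    Returns the masked sequence and a dict {position: original character}
    that serves as the prediction target during training.
    """
    rng = random.Random(seed)
    # Assumed rule: round to the nearest count and mask at least one character.
    n_mask = max(1, round(len(chars) * ratio)) if chars else 0
    positions = set(rng.sample(range(len(chars)), n_mask))
    masked = ["[MASK]" if i in positions else c for i, c in enumerate(chars)]
    targets = {i: chars[i] for i in sorted(positions)}
    return masked, targets
```

For a 20-character value, three positions are masked, and the returned targets allow the original text to be restored exactly.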
b) Prediction with the column vector representation model
The prediction data for the column vector representation model is the set of medical data elements to be screened, made up of every column of every table in every database in the data lake; the set is traversed column by column. To prevent very large columns in the set from degrading the performance of the column vector generator, random sampling can be used when computing column vector representations (for example, randomly sampling 1000 data items from a single column, repeated 100 times). The column vector representation model computes the column vector representation H_s(C_k) for the s-th sample of column vertex C_k. The column vector representations of all S samples are averaged to give the final column vector representation of C_k:
Figure PCTCN2022116971-appb-000040
H(C_k) is stored in the vector_embeddings attribute of column vertex C_k in the medical data element graph data model.
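The repeated sampling and averaging described above can be sketched as follows. The callable embed is a hypothetical stand-in for the column vector representation model, and sampling without replacement within each draw is an assumption; the defaults mirror the example figures in the text (1000 items, 100 draws).

```python
import random


def averaged_column_vector(column, embed, sample_size=1000, num_samples=100, seed=None):
    """Average the column vector over num_samples random samples of the column.

    column: list of cell values.
    embed:  hypothetical callable mapping a list of cell values to a vector,
            standing in for the column vector representation model.
    """
    if not column:
        raise ValueError("column is empty")
    rng = random.Random(seed)
    k = min(sample_size, len(column))  # small columns are used in full
    total = None
    for _ in range(num_samples):
        vec = embed(rng.sample(column, k))
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
    return [x / num_samples for x in total]
```

The result would then be written to the vector_embeddings attribute of the corresponding column vertex.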
(3) Graph data association mapping
The computed column vector representation of each column in the set of medical data elements to be screened, together with the metadata collection results, is mapped to the corresponding vertex and edge objects of the medical data element graph data model and loaded into the medical data element graph data to be screened, which uses the medical data element graph data model as its data standard. The mapping is shown in the following table.
Table 3. Graph data association mapping
No. | Mapped object | Object type | Metadata information
1 | Database | Vertex | Name (number) of the database in the medical institution
2 | Table | Vertex | Name (number) of the data table in the database
3 | Column | Vertex | Name (number) of the column in the data table
4 | Database-Table | Edge | Membership relationship between database and data table
5 | Table-Column | Edge | Containment relationship between data table and its columns
6 | Column-Column | Edge | Foreign key of a database column; lineage relationship between columns
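The mapping in Table 3 can be sketched as a function that turns collected metadata into vertex and edge records. The input layout (a nested dictionary of databases, tables, and columns, plus a list of foreign key pairs) and the function name map_to_graph are assumptions chosen for illustration; the patent does not specify the format produced by the metadata collection tool.

```python
def map_to_graph(metadata, foreign_keys=()):
    """Map collected metadata to vertex and edge records following Table 3.

    metadata:     {database: {table: [column, ...]}}
    foreign_keys: iterable of ((db, table, column), (db, table, column)) pairs
    Returns (vertices, edges), where a vertex is (label, key) and an edge is
    (edge name, source key, destination key).
    """
    vertices, edges = [], []
    for db, tables in metadata.items():
        vertices.append(("Database", (db,)))
        for table, columns in tables.items():
            vertices.append(("Table", (db, table)))
            edges.append(("Database-Table", (db,), (db, table)))
            for col in columns:
                vertices.append(("Column", (db, table, col)))
                edges.append(("Table-Column", (db, table), (db, table, col)))
    known = {key for label, key in vertices if label == "Column"}
    for src, dst in foreign_keys:
        if src not in known or dst not in known:
            raise KeyError(f"foreign key references unknown column: {src} -> {dst}")
        edges.append(("Column-Column", src, dst))
    return vertices, edges
```

For a hypothetical database with two tables of two columns each and one foreign key, this yields seven vertices (one database, two tables, four columns) and seven edges (two Database-Table, four Table-Column, one Column-Column).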
二、快速、自动化筛选有效医疗数据元2. Rapid and automatic screening of effective medical data elements
医疗机构内数据湖存储的信息类型繁多,相比于标准数据模型的数据覆盖范围,通常存在大量信息冗余,为了快速、自动化筛选有效医疗数据元,在进行医疗数据元自动化分类任务之前,可以对待筛选医疗数据元集合中的数据元进行筛选,降低数据元分类任务的复杂度。本发明提出如下快速、自动化筛选有效医疗数据元的方法,包括以下两个步骤:(1)计算待筛选医疗数据元图数据中存储的各列顶点在医疗数据元图数据模型中的重要度。(2)构建医疗数据元筛选模型,基于各列顶点的重要度计算各列顶点对应的列映射到标准数据模型的可能性,筛选出其中的有效医疗数据元,组成待分类医疗数据元集合。There are many types of information stored in data lakes in medical institutions. Compared with the data coverage of standard data models, there is usually a lot of information redundancy. In order to quickly and automatically screen effective medical data elements, before performing the automatic classification task of medical data elements, you can The data elements in the medical data element collection to be screened are screened to reduce the complexity of the data element classification task. The present invention proposes a method for quickly and automatically screening effective medical data elements, including the following two steps: (1) calculating the importance of each column vertex stored in the medical data element graph data to be screened in the medical data element graph data model. (2) Construct a medical data element screening model, calculate the possibility of mapping the column corresponding to each column vertex to the standard data model based on the importance of each column vertex, and filter out the effective medical data elements to form a set of medical data elements to be classified.
2.1基于列顶点向量表示计算列顶点在医疗数据元图数据模型中的重要度2.1 Calculate the importance of column vertices in the medical data element graph data model based on the column vertex vector representation
The column vertices stored in the medical data element graph data to be screened correspond one-to-one with the columns in the set of medical data elements to be screened. For a column vertex C_k stored in the medical data element graph data to be screened, p column vertices {C_t}, t = 1, 2, ..., p, are randomly sampled from the set of column vertices excluding C_k. The importance score Im(C_k) of C_k in the medical data element graph data model is computed from the correlation between C_k and the sampled column vertices, and is defined as:
Figure PCTCN2022116971-appb-000041
Figure PCTCN2022116971-appb-000042
where Importance_score is the importance function.
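The sampling step above can be sketched in Python. The defining formulas are available here only as image placeholders, so this is a minimal illustration under stated assumptions rather than the patented formula: the correlation is assumed to be cosine similarity between column vector representations, and Importance_score is stood in for by an affine function with hypothetical parameters `weight` and `bias`.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def importance(k, vectors, p, weight=1.0, bias=0.0, rng=random):
    """Hypothetical Im(C_k).

    Samples up to p column vertices other than k, averages their cosine
    similarity to k (assumed correlation measure), then applies an assumed
    affine importance function weight * mean + bias.
    """
    others = [name for name in vectors if name != k]
    if not others:
        raise ValueError("need at least one other column vertex to sample from")
    sampled = rng.sample(others, min(p, len(others)))
    mean_corr = sum(cosine(vectors[k], vectors[t]) for t in sampled) / len(sampled)
    return weight * mean_corr + bias
```

For example, with `vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}` and `p = 2`, both other vertices are sampled and the mean similarity for `"a"` is 0.5. In a trained model the parameters of the importance function would be learned rather than fixed.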
2.2 Training and prediction of the medical data element screening model
The standard classified medical data element set, which is built by manual classification and association mapping according to the standard data element classification system, is converted into standard classified medical data element graph data. Let the set of column vertices stored in the standard classified medical data element graph data be S = {s_k}, and let the set of column vertices corresponding to the columns that were manually screened out while building the standard classified medical data element set be S′ = {s′_k}.
During training, q column vertices are randomly sampled from S as the positive sample set {s_t}, t = 1, 2, ..., q, and q column vertices are randomly sampled from S′ as the negative sample set {s′_t}, t = 1, 2, ..., q. Let Im(s_i) be the importance score of sample (s_i, y_i), where s_i denotes the i-th column vertex and y_i ∈ {0, 1} denotes the true class of the sample. The loss function Loss of the medical data element screening model is then computed from the importance scores:
Figure PCTCN2022116971-appb-000043
The importance function is updated with the Adam algorithm, which in turn updates the medical data element screening model.
At prediction time, the medical data element screening model computes the threshold value L′ to determine whether the column in the set of medical data elements to be screened that corresponds to column vertex C_k is a valid data element. L′ is calculated as:
Figure PCTCN2022116971-appb-000044
If L′ ≥ 0.5, column vertex C_k is a valid column vertex and its corresponding column is a valid data element.
Finally, the valid column vertices retained by screening, together with their associations, form the medical data element graph data to be classified, and the corresponding retained columns form the set of medical data elements to be classified.
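The training loss and the prediction rule can be illustrated as follows. Both formulas survive here only as image placeholders, so the sketch assumes a common choice for binary labels y_i ∈ {0, 1}: binary cross-entropy over a sigmoid of the importance score, with L′ taken to be that sigmoid value. The function names are hypothetical, and the Adam update itself is omitted.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_loss(scores, labels):
    """Mean binary cross-entropy over (Im(s_i), y_i) pairs (assumed form of Loss)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for score, y in zip(scores, labels):
        prob = sigmoid(score)
        total -= y * math.log(prob + eps) + (1 - y) * math.log(1.0 - prob + eps)
    return total / len(scores)


def is_valid_column(score, threshold=0.5):
    """Assumed prediction rule: L' = sigmoid(Im(C_k)); the column is valid if L' >= 0.5."""
    return sigmoid(score) >= threshold
```

Under these assumptions a positive sample with a score of 0 contributes a loss of ln 2, and any column whose importance score is non-negative passes the 0.5 threshold.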
3. Determining the category of medical data elements based on the deep graph matching model
3.1 Determining the seed vertex set of the standard classified medical data element graph data from the medical data element graph data to be classified
The column vertices stored in the medical data element graph data to be classified correspond one-to-one with the columns in the set of medical data elements to be classified. Let E be the set of all standard classes in the standard data element classification system defined by the standard data model, let D be the set of column vertices in the standard classified medical data element graph data, and let E_i ∈ E be the class of D_i ∈ D in the standard data element classification system. Let C be the set of column vertices stored in the medical data element graph data to be classified. The medical data element classification process can then be abstracted as finding, in D, the column vertex D_i with the highest matching degree to a column vertex C_k ∈ C, thereby determining that the column corresponding to C_k belongs to class E_i. The data classification and association mapping carried out when developing a medical big data center can likewise be abstracted as finding, for every class E_i of the standard data element classification system, the C_k with the highest matching degree.
In a standard database that uses the standard data model as its data standard, the data in some columns has a fairly uniform format or content, and the columns of the standard classified medical data element set that are mapped to them are correspondingly uniform. If the vertices for these columns are first located to their corresponding vertices in the medical data element graph data to be classified (called seed vertices), the search space of the deep graph matching model shrinks and its efficiency improves. For a column vertex D_i ∈ D, r_0 data items are randomly sampled from the column corresponding to D_i:
Figure PCTCN2022116971-appb-000045
For a column vertex C_k ∈ C in the medical data element graph data to be classified, r_0 data items are likewise randomly sampled from the column corresponding to C_k:
Figure PCTCN2022116971-appb-000046
The matching degree match_1(D_i, C_k) between D_i and C_k is then:
Figure PCTCN2022116971-appb-000047
where v(x) denotes the vector representation of data item x. The seed vertex corresponding to D_i is the column vertex with the highest matching degree to it,
Figure PCTCN2022116971-appb-000048
that is:
Figure PCTCN2022116971-appb-000049
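A short Python sketch of seed-vertex selection follows. The match_1 formula is only available as an image placeholder, so the sketch assumes it is the mean pairwise cosine similarity between the vector representations of the sampled values. The function `v` below is a toy stand-in for the value embedding (counts of digits, letters, and other characters), not the representation model described in the text, and all column names are hypothetical.

```python
import math
import random


def v(x):
    """Toy stand-in for v(x): counts of digits, letters, and other characters."""
    s = str(x)
    digits = sum(ch.isdigit() for ch in s)
    letters = sum(ch.isalpha() for ch in s)
    return [digits, letters, len(s) - digits - letters]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_1(d_values, c_values, r0, rng=random):
    """Assumed match_1: mean cosine similarity over r0 sampled values per column.

    Both columns are assumed to be non-empty lists.
    """
    d_sample = rng.sample(d_values, min(r0, len(d_values)))
    c_sample = rng.sample(c_values, min(r0, len(c_values)))
    pairs = [(d, c) for d in d_sample for c in c_sample]
    return sum(cosine(v(d), v(c)) for d, c in pairs) / len(pairs)


def seed_vertex(d_values, candidates, r0, rng=random):
    """Return the candidate column name with the highest match_1 to the standard column."""
    return max(candidates, key=lambda name: match_1(d_values, candidates[name], r0, rng))
```

With a standard date column such as `["1990-01-02", "1985-12-30"]`, a candidate column of dates scores higher than a candidate column of names under this toy embedding, so the date column would be chosen as the seed vertex.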
3.2 Cutting subgraphs from the medical data element graph data to be classified based on the seed vertex set
Let
Figure PCTCN2022116971-appb-000050
denote the set of column vertices in the medical data element graph data to be classified that have a parent-child relationship with
Figure PCTCN2022116971-appb-000051
and let
Figure PCTCN2022116971-appb-000052
denote the set of column vertices in the medical data element graph data to be classified that have a foreign-key relationship with
Figure PCTCN2022116971-appb-000053
Then, based on the seed vertex
Figure PCTCN2022116971-appb-000054
the subgraph obtained by cutting,
Figure PCTCN2022116971-appb-000055
is:
Figure PCTCN2022116971-appb-000056
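The cutting step can be sketched as a small set computation. The exact definition is an image placeholder, so this sketch assumes the subgraph's vertex set is the union of the seed vertex, the column vertices sharing a parent table vertex with it, and the column vertices linked to it by a foreign-key edge in either direction. The edge encoding `(source, target, kind)` and the kind labels `"parent"` and `"foreign_key"` are hypothetical.

```python
def cut_subgraph(seed, edges):
    """Return the column vertices of the subgraph cut around a seed vertex.

    edges is a list of (source, target, kind) tuples, where kind is
    "parent" (table vertex -> column vertex) or "foreign_key"
    (column vertex -> column vertex). Both labels are illustrative.
    """
    # Table vertices that are parents of the seed column.
    parents = {src for src, dst, kind in edges if kind == "parent" and dst == seed}
    # Columns that share one of those parents with the seed.
    siblings = {
        dst
        for src, dst, kind in edges
        if kind == "parent" and src in parents and dst != seed
    }
    # Columns joined to the seed by a foreign key, in either direction.
    fk_out = {dst for src, dst, kind in edges if kind == "foreign_key" and src == seed}
    fk_in = {src for src, dst, kind in edges if kind == "foreign_key" and dst == seed}
    return {seed} | siblings | fk_out | fk_in
```

For a graph where table `t1` owns columns `a` and `b`, table `t2` owns `c` and `d`, and `a` has a foreign key to `c`, cutting around `a` yields `{"a", "b", "c"}`; column `d` is left out because it is neither a sibling of `a` nor linked to it by a foreign key.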
Let N(D_i) denote the set of column vertices in the standard classified medical data element graph data that are associated with the same parent vertex as D_i. The goal of the deep graph matching model is then to search the subgraph
Figure PCTCN2022116971-appb-000057
for a suitable subgraph whose column vertices match the column vertices in N(D_i) one to one, thereby classifying the medical data elements that correspond to the column vertices in
Figure PCTCN2022116971-appb-000058
3.3 Classifying the column vertices in the medical data element graph data to be classified with the deep graph matching model
The medical data element classification process comprises the following steps:
(1) Using the graph attention mechanism, separately calculate the vector representation V(D_i) of column vertex D_i in the standard classified medical data element graph data and, for the column vertex
Figure PCTCN2022116971-appb-000059
of the medical data element graph data to be classified, its vector representation
Figure PCTCN2022116971-appb-000060
Specifically:
According to the graph attention mechanism, the vector representation V(D_i) of D_i is calculated as:
Figure PCTCN2022116971-appb-000061
where
Figure PCTCN2022116971-appb-000062
denotes r_1 data items randomly sampled from the column corresponding to column vertex D′, and w(D′, D_i) denotes the weight function of a column vertex D′ in N(D_i) with respect to column vertex D_i, calculated as:
Figure PCTCN2022116971-appb-000063
where
Figure PCTCN2022116971-appb-000064
is a nonlinear activation function and W_1 is a matrix parameter obtained by training.
According to the graph attention mechanism, for
Figure PCTCN2022116971-appb-000065
the vector representation
Figure PCTCN2022116971-appb-000066
is calculated as:
Figure PCTCN2022116971-appb-000067
where
Figure PCTCN2022116971-appb-000068
denotes r_1 data items randomly sampled from the column corresponding to column vertex C′, and
Figure PCTCN2022116971-appb-000069
denotes the weight function of a column vertex C′ in
Figure PCTCN2022116971-appb-000070
with respect to column vertex
Figure PCTCN2022116971-appb-000071
calculated as:
Figure PCTCN2022116971-appb-000072
where
Figure PCTCN2022116971-appb-000073
is a nonlinear activation function and W_2 is a matrix parameter obtained by training.
(2) Calculate the matching degree between every D′ ∈ N(D_i) and
Figure PCTCN2022116971-appb-000074
and, based on the matching degree, obtain the classification of column vertex C′, which gives the classification result for the column corresponding to C′ in the set of medical data elements to be classified.
The matching degree match_2(D′, C′) between column vertex D′ of the standard classified medical data element graph data and column vertex C′ of the medical data element graph data to be classified is:
Figure PCTCN2022116971-appb-000075
Take the column vertex with the highest matching degree to C′,
Figure PCTCN2022116971-appb-000076
that is:
Figure PCTCN2022116971-appb-000077
The column corresponding to column vertex C′ in the medical data element graph data to be classified is then assigned the category in the standard data element classification system that corresponds to
Figure PCTCN2022116971-appb-000078
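Steps (1) and (2) can be illustrated together in Python. The attention weights, the vertex representations, and match_2 are given only as image placeholders, so everything concrete here is an assumption: the weight of a neighbour is taken to be a softmax over tanh(n · W c), the vertex representation is the weighted sum of neighbour vectors, and match_2 is cosine similarity. The names `attention_vector` and `classify` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Assumed match_2: cosine similarity of two vertex representations."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def attention_vector(center, neighbours, W):
    """Attention-weighted representation of a column vertex.

    center: vector of the vertex itself; neighbours: vectors of the column
    vertices sharing its parent; W: trained square matrix (list of rows).
    Returns (representation, attention_weights).
    """
    Wc = [dot(row, center) for row in W]
    scores = [math.tanh(dot(n, Wc)) for n in neighbours]  # assumed activation
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # softmax, shifted for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    rep = [sum(w * n[j] for w, n in zip(weights, neighbours)) for j in range(len(center))]
    return rep, weights


def classify(candidate_vec, standard_vecs, categories):
    """Assign the category of the standard vertex with the highest match_2."""
    best = max(standard_vecs, key=lambda d: cosine(standard_vecs[d], candidate_vec))
    return categories[best]
```

A candidate representation such as `[0.9, 0.1]` compared against standard vertices `[1, 0]` and `[0, 1]` is assigned the category of the first, since its cosine similarity to that vertex is higher. In a trained model, `W` would play the role of W_1 or W_2.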
An embodiment of the present invention further provides an automatic classification system for medical data elements based on deep graph matching, the system comprising:
A module for standardized collection and mapping of multi-source heterogeneous data elements: defines a medical data element graph data model based on minimum metadata information; assembles the multi-source heterogeneous data elements stored in the data lake of a medical institution into a set of medical data elements to be screened, maps them automatically to the medical data element graph data model, and stores the mapping result as medical data element graph data to be screened. This module can be implemented with reference to step 1 above.
A valid medical data element screening module: calculates the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; constructs a medical data element screening model, calculates from the importance of each column vertex the likelihood that its corresponding column maps to the standard data model, and screens out the valid column vertices, whose corresponding columns are valid medical data elements. The valid column vertices and their associations form the medical data element graph data to be classified, and the columns corresponding to the valid column vertices form the set of medical data elements to be classified. This module can be implemented with reference to step 2 above.
A medical data element classification module based on the deep graph matching model: determines the seed vertex set of the standard classified medical data element graph data from the medical data element graph data to be classified; cuts subgraphs from the medical data element graph data to be classified based on the seed vertex set; and uses the deep graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to those column vertices. This module can be implemented with reference to step 3 above.
The key points of the method and system for automatic classification of medical data elements based on deep graph matching proposed by the present invention are as follows:
1) Based on the minimum metadata information of the data lake in a medical institution, a medical data element graph data model based on minimum metadata information is defined, so that the deep graph matching model remains effective for local data swamps with very little metadata. This achieves automatic classification of data elements with the least metadata information, while ensuring that the graph-structured data collected under the graph data model standard is suitable for training the deep graph matching model.
2) Vector representations of medical data elements are computed with a representation learning method, and classifying these vector representations allows the valid data elements that may map to the standard data model to be screened quickly and automatically.
3) Vector representations of column vertices are computed with the graph attention mechanism, and a deep graph matching model is built to classify medical data elements automatically.
The above describes only preferred embodiments of the present invention. Although the present invention has been disclosed above through preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims (9)

  1. A method for automatic classification of medical data elements based on deep graph matching, characterized in that it comprises:
    (1) defining a medical data element graph data model based on minimum metadata information; assembling the multi-source heterogeneous data elements stored in the data lake of a medical institution into a set of medical data elements to be screened, mapping them automatically to the medical data element graph data model, and storing the mapping result as medical data element graph data to be screened; the medical data element graph data model is modeled as a directed property graph composed of two kinds of graph elements, vertices and edges;
    each vertex consists of a label and the attribute group corresponding to the label, where the label represents the type of the vertex and the attribute group represents the one or more attributes owned by the label; the ontology information of the vertices comprises the vertex types and the attribute information corresponding to each vertex type; the vertex types comprise database vertices, table vertices, and column vertices; the attribute information of a database vertex comprises a database vertex index and database type information; the attribute information of a table vertex comprises a table vertex index; and the attribute information of a column vertex comprises a column vertex index, column data type information, and a column vector representation;
    each edge consists of an edge type and edge attributes, and every edge is a directed edge; the ontology information of the edges comprises the edge types and the attribute information corresponding to each edge type; the edge types comprise a parent-child association from a database vertex to a table vertex, a parent-child association from a table vertex to a column vertex, and a foreign key whose start and end points are both column vertices; the attribute information for all three edge types is an edge index;
    (2) calculating the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; constructing a medical data element screening model, calculating from the importance of each column vertex the likelihood that its corresponding column maps to the standard data model, and screening out the valid column vertices; the valid column vertices and their associations form the medical data element graph data to be classified, and the columns corresponding to the valid column vertices form the set of medical data elements to be classified;
    (3) determining the seed vertex set of the standard classified medical data element graph data from the medical data element graph data to be classified; cutting subgraphs from the medical data element graph data to be classified based on the seed vertex set; and using the deep graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to those column vertices.
  2. The method according to claim 1, characterized in that mapping the multi-source heterogeneous data elements to the medical data element graph data model comprises:
    collecting the multi-source heterogeneous medical data from the data lake to form the set of medical data elements to be screened;
    using a metadata collection tool to capture the metadata stored in the data lake;
    using a column vector generator to traverse the data stored in every column of every table in the set of medical data elements to be screened, and using a column vector representation model to predict the column vector representation of each column of each table;
    associating and mapping the collected metadata and the generated column vector representations to the medical data element graph data model through graph data association mapping, to obtain the medical data element graph data to be screened.
  3. The method according to claim 2, characterized in that the column vector generator takes a single column of a data table as one data element unit, uses the column vector representation model to transform the data stored in each column, and calculates the vector representation of each column;
    training of the column vector representation model comprises: the training data of the column vector representation model is column data that is stored in a standard database, has been manually classified as medical data elements, and has a data structure conforming to the standard data model, referred to as standard classified columns; the column vertices in the standard classified medical data element graph data correspond one-to-one with the standard classified columns;
    let the set of column vertices in the standard classified medical data element graph data be C = {c_{k,j}}, where c_{k,j} denotes the data in column k, row j of the standard classified columns corresponding to the column vertex set; c_{k,j} = {w_t}, t = 1, 2, ..., m, where m is the total number of characters in row j and w_t are the characters that make up c_{k,j}; the initial vector representation h(w_t) of character w_t is computed by a text representation model h; n rows of data {c_{k,j}}, j = 1, 2, ..., n, are randomly sampled under column vertex C_k of the standard classified medical data element graph data, and the vector representation of the j-th row of data is
    Figure PCTCN2022116971-appb-100001
    the correlation among the rows of data under column vertex C_k in the standard classified medical data element graph data is computed by a self-attention mechanism, giving the column vector representation H(C_k) of column vertex C_k, calculated as:
    Figure PCTCN2022116971-appb-100002
    Figure PCTCN2022116971-appb-100003
    where v(C_k) is the vector representation of column vertex C_k, d_k is the dimension of v(C_k), and softmax is the softmax function;
    prediction with the column vector representation model comprises: the prediction data of the column vector representation model is the set of medical data elements to be screened, composed of every column of every table in every database in the data lake, which is traversed column by column; the column vector representation model computes a column vector representation for each random sample drawn from a column vertex; and the column vector representations predicted over multiple random samples are averaged to give the final column vector representation of that column vertex.
  4. The method according to claim 3, characterized in that calculating the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened comprises:
    for a column vertex C_k stored in the medical data element graph data to be screened, randomly sampling p column vertices {C_t}, t = 1, 2, ..., p, from the set of column vertices excluding C_k, and computing, from the correlation between C_k and the sampled column vertices, the importance score Im(C_k) of C_k in the medical data element graph data model, where Im(C_k) is defined as:
    Figure PCTCN2022116971-appb-100004
    Figure PCTCN2022116971-appb-100005
    where Importance_score is the importance function.
  5. The method according to claim 1, characterized in that the training and prediction of the medical data element screening model are specifically:
    converting the standard classified medical data element set, built by manual classification and association mapping according to the standard data element classification system, into standard classified medical data element graph data; letting the set of column vertices stored in the standard classified medical data element graph data be S = {s_k}, and letting the set of column vertices corresponding to the columns that were manually screened out while building the standard classified medical data element set be S′ = {s′_k};
    during training, randomly sampling q column vertices from S as the positive sample set {s_t}, t = 1, 2, ..., q, and randomly sampling q column vertices from S′ as the negative sample set {s′_t}, t = 1, 2, ..., q; letting Im(s_i) be the importance score of sample (s_i, y_i), where s_i denotes the i-th column vertex and y_i ∈ {0, 1} denotes the true class of the sample, the loss function Loss of the medical data element screening model is computed from the importance scores:
    Figure PCTCN2022116971-appb-100006
    at prediction time, the medical data element screening model computes the threshold value L′ to determine whether the column in the set of medical data elements to be screened that corresponds to column vertex C_k is a valid data element, L′ being calculated as:
    Figure PCTCN2022116971-appb-100007
    if L′ ≥ 0.5, column vertex C_k is a valid column vertex and its corresponding column is a valid data element;
    the valid column vertices retained by screening, together with their associations, form the medical data element graph data to be classified, and the corresponding retained columns form the set of medical data elements to be classified.
  6. The method according to claim 1, characterized in that determining the seed vertex set of the standard classified medical data element graph data from the medical data element graph data to be classified comprises:
    letting E be the set of all standard classes in the standard data element classification system defined by the standard data model, D be the set of column vertices in the standard classified medical data element graph data, and E_i ∈ E be the class of D_i ∈ D in the standard data element classification system; letting C be the set of column vertices stored in the medical data element graph data to be classified; the medical data element classification process is abstracted as finding, in D, the column vertex D_i with the highest matching degree to a column vertex C_k ∈ C, thereby determining that the column corresponding to C_k belongs to class E_i;
    for a column vertex D_i ∈ D, randomly sampling r_0 data items from the column corresponding to D_i:
    Figure PCTCN2022116971-appb-100008
    for a column vertex C_k ∈ C, randomly sampling r_0 data items from the column corresponding to C_k:
    Figure PCTCN2022116971-appb-100009
    the matching degree match_1(D_i, C_k) between D_i and C_k is then:
    Figure PCTCN2022116971-appb-100010
    where v(x) denotes the vector representation of data item x, and the seed vertex corresponding to D_i is the column vertex with the highest matching degree to it,
    Figure PCTCN2022116971-appb-100011
    that is:
    Figure PCTCN2022116971-appb-100012
  7. 根据权利要求6所述的方法,其特征在于,所述基于种子顶点集合进行待分类医疗数据元图数据的子图切割,包括:The method according to claim 6, wherein the subgraph cutting of the medical data element graph data to be classified based on the seed vertex set includes:
    letting
    Figure PCTCN2022116971-appb-100013
    denote the set of column vertices in the medical data element graph data to be classified that have a parent-child relationship with
    Figure PCTCN2022116971-appb-100014
    and letting
    Figure PCTCN2022116971-appb-100015
    denote the set of column vertices in the medical data element graph data to be classified that have a foreign key relationship with
    Figure PCTCN2022116971-appb-100016
    then, based on the seed vertex
    Figure PCTCN2022116971-appb-100017
    the subgraph obtained by cutting,
    Figure PCTCN2022116971-appb-100018
    is:
    Figure PCTCN2022116971-appb-100019
    letting N(D_i) denote the set of column vertices in the standard classified medical data element graph data that are associated with the same parent vertex as D_i, the goal of the deep graph matching model is to search, within the subgraph
    Figure PCTCN2022116971-appb-100020
    for a subgraph whose column vertices match the column vertices in N(D_i) one to one, thereby classifying the medical data elements corresponding to the column vertices in the following subgraph:
    Figure PCTCN2022116971-appb-100021
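The subgraph cutting of claim 7 can be sketched as follows. The cutting formula itself is only available as an equation image, so this sketch assumes that the cut subgraph is the union of the seed vertex, the column vertices that share the seed's parent table, and the column vertices linked to the seed by a foreign key edge in either direction. The function name `cut_subgraph` and the dictionary-based graph encoding are hypothetical.

```python
def cut_subgraph(seed, parent_of, foreign_keys):
    """Cut a subgraph around a seed column vertex (assumed union rule).

    seed:         identifier of the seed column vertex
    parent_of:    dict mapping each column vertex to its parent table vertex
    foreign_keys: iterable of directed (source_column, target_column) edges
    """
    # Columns under the same table as the seed (includes the seed itself).
    siblings = {col for col, table in parent_of.items() if table == parent_of[seed]}
    # Columns joined to the seed by a foreign key edge, in either direction.
    fk_neighbours = {dst for src, dst in foreign_keys if src == seed}
    fk_neighbours |= {src for src, dst in foreign_keys if dst == seed}
    vertices = siblings | fk_neighbours | {seed}
    # Keep only the foreign key edges whose endpoints both survive the cut.
    edges = {(src, dst) for src, dst in foreign_keys if src in vertices and dst in vertices}
    return vertices, edges


# Made-up schema for illustration.
parent_of = {
    "visit.id": "visit",
    "visit.date": "visit",
    "visit.patient_id": "visit",
    "patient.id": "patient",
    "patient.name": "patient",
    "lab.visit_id": "lab",
}
foreign_keys = [("visit.patient_id", "patient.id"), ("lab.visit_id", "visit.id")]
vertices, edges = cut_subgraph("visit.patient_id", parent_of, foreign_keys)
print(sorted(vertices), sorted(edges))
```

Restricting the search to this neighbourhood keeps the later matching step from comparing every standard column against every column in the data lake.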
  8. The method according to claim 7, wherein using the deep graph matching model to classify the column vertices in the medical data element graph data to be classified comprises:
    according to the graph attention mechanism, the vector representation V(D_i) of a column vertex D_i in the standard classified medical data element graph data is calculated as:
    Figure PCTCN2022116971-appb-100022
    where
    Figure PCTCN2022116971-appb-100023
    denotes r_1 data items randomly sampled from the column corresponding to column vertex D′, and w(D′, D_i) denotes the weight function of a column vertex D′ in N(D_i) with respect to column vertex D_i;
    according to the graph attention mechanism, for a column vertex
    Figure PCTCN2022116971-appb-100024
    of the medical data element graph data to be classified, the vector representation
    Figure PCTCN2022116971-appb-100025
    is calculated as:
    Figure PCTCN2022116971-appb-100026
    where
    Figure PCTCN2022116971-appb-100027
    denotes r_1 data items randomly sampled from the column corresponding to column vertex C′, and
    Figure PCTCN2022116971-appb-100028
    denotes the weight function of a column vertex C′ in
    Figure PCTCN2022116971-appb-100029
    with respect to the column vertex
    Figure PCTCN2022116971-appb-100030
    the matching degree match_2(D′, C′) of a column vertex D′ ∈ N(D_i) and a column vertex
    Figure PCTCN2022116971-appb-100031
    is:
    Figure PCTCN2022116971-appb-100032
    the column vertex with the highest matching degree with C′ is taken,
    Figure PCTCN2022116971-appb-100033
    that is:
    Figure PCTCN2022116971-appb-100034
    the class of the column corresponding to column vertex C′ in the medical data element graph data to be classified is the category of
    Figure PCTCN2022116971-appb-100035
    in the standard data element classification system.
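The matching step of claim 8 can be sketched in Python under stated assumptions. The formulas for V(·), the weight function w and match_2 are only present as equation images, so the sketch assumes a softmax over cosine similarity as the attention weight, a weighted sum of neighbour column vectors as the vertex representation, and cosine similarity between representations as match_2. Column vectors are taken as given (for example, the mean of the vectors of r_1 sampled values), each vertex is assumed to belong to its own neighbourhood, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def attention_vector(target, neighbours, column_vecs):
    """Assumed graph-attention representation: softmax(cosine)-weighted sum of neighbours."""
    scores = [math.exp(cosine(column_vecs[n], column_vecs[target])) for n in neighbours]
    total = sum(scores)
    out = [0.0] * len(column_vecs[target])
    for n, s in zip(neighbours, scores):
        for i, x in enumerate(column_vecs[n]):
            out[i] += (s / total) * x
    return out


def classify(std_neighbourhood, cand_subgraph, std_vecs, cand_vecs, std_class):
    """Assign each candidate column the class of its best-matching standard column."""
    std_repr = {d: attention_vector(d, std_neighbourhood, std_vecs) for d in std_neighbourhood}
    result = {}
    for c in cand_subgraph:
        vc = attention_vector(c, cand_subgraph, cand_vecs)
        # Assumed match_2: cosine similarity between the two representations.
        best = max(std_repr, key=lambda d: cosine(std_repr[d], vc))
        result[c] = std_class[best]
    return result


# Made-up three-dimensional column vectors for illustration.
std_vecs = {"birth_date": [1.0, 0.0, 0.0], "gender": [0.0, 1.0, 0.0], "name": [0.0, 0.0, 1.0]}
cand_vecs = {"dob": [0.9, 0.1, 0.0], "sex": [0.1, 0.9, 0.0], "full_name": [0.0, 0.1, 0.9]}
std_class = {"birth_date": "BirthDate", "gender": "Gender", "name": "PersonName"}
print(classify(list(std_vecs), list(cand_vecs), std_vecs, cand_vecs, std_class))
```

This greedy per-column argmax does not enforce the one-to-one matching mentioned in claim 7; a bipartite assignment step would be one way to add that constraint.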
  9. An automatic classification system for medical data elements based on deep graph matching, comprising:
    a standardized acquisition and mapping module for multi-source heterogeneous data elements: define a medical data element graph data model based on minimum metadata information; form a set of medical data elements to be screened from the multi-source heterogeneous data elements stored in the data lake of a medical institution, automatically map the set to the medical data element graph data model, and store the mapping result as medical data element graph data to be screened; the medical data element graph data model is modeled as a directed property graph consisting of two kinds of graph elements, vertices and edges;
    each vertex consists of a label and an attribute group corresponding to the label, the label representing the type of the vertex and the attribute group representing one or more attributes owned by the label; the ontology information of the vertices comprises the vertex types and the attribute information corresponding to each vertex type; the vertex types comprise database vertices, table vertices and column vertices; the attribute information of a database vertex comprises a database vertex index and database type information, the attribute information of a table vertex comprises a table vertex index, and the attribute information of a column vertex comprises a column vertex index, column data type information and a column vector representation;
    each edge consists of an edge type and edge attributes, and every edge is a directed edge; the ontology information of the edges comprises the edge types and the attribute information corresponding to each edge type; the edge types comprise a parent-child association from a database vertex to a table vertex, a parent-child association from a table vertex to a column vertex, and a foreign key whose start point and end point are both column vertices; the attribute information of each of the three edge types is an edge index;
    an effective medical data element screening module: calculate the importance, in the medical data element graph data model, of each column vertex stored in the medical data element graph data to be screened; construct a medical data element screening model that calculates, from the importance of each column vertex, the likelihood that the corresponding column maps to the standard data model, and screen out the valid column vertices, whose corresponding columns are the valid medical data elements; the set of valid column vertices and their associations forms the medical data element graph data to be classified, and the set of columns corresponding to the valid column vertices forms the set of medical data elements to be classified;
    a medical data element classification module based on a deep graph matching model: determine, from the medical data element graph data to be classified, the set of seed vertices for the standard classified medical data element graph data; perform subgraph cutting of the medical data element graph data to be classified based on the seed vertex set; and use the deep graph matching model to classify the column vertices in the medical data element graph data to be classified, thereby obtaining the classification of the medical data elements corresponding to the column vertices.
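The directed property graph described in claim 9 (database, table and column vertices; two parent-child edge types and a foreign key edge type) can be sketched with Python dataclasses. The class and field names below are hypothetical, and the attribute values are made-up examples; only the vertex types, edge types and their allowed endpoints are taken from the claim.

```python
from dataclasses import dataclass, field
from enum import Enum


class VertexType(Enum):
    DATABASE = "database"
    TABLE = "table"
    COLUMN = "column"


class EdgeType(Enum):
    DATABASE_TABLE = "parent_child_database_table"
    TABLE_COLUMN = "parent_child_table_column"
    FOREIGN_KEY = "foreign_key"


# Allowed (start vertex type, end vertex type) for each edge type, as listed in the claim.
ALLOWED_ENDPOINTS = {
    EdgeType.DATABASE_TABLE: (VertexType.DATABASE, VertexType.TABLE),
    EdgeType.TABLE_COLUMN: (VertexType.TABLE, VertexType.COLUMN),
    EdgeType.FOREIGN_KEY: (VertexType.COLUMN, VertexType.COLUMN),
}


@dataclass
class Vertex:
    index: str                                      # vertex index attribute
    label: VertexType                               # label = vertex type
    attributes: dict = field(default_factory=dict)  # e.g. data type, vector representation


@dataclass
class Edge:
    index: str          # edge index, the only edge attribute named in the claim
    edge_type: EdgeType
    source: str         # index of the start vertex
    target: str         # index of the end vertex


@dataclass
class MetadataGraph:
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices[vertex.index] = vertex

    def add_edge(self, edge):
        endpoints = (self.vertices[edge.source].label, self.vertices[edge.target].label)
        if endpoints != ALLOWED_ENDPOINTS[edge.edge_type]:
            raise ValueError(f"{edge.edge_type.name} cannot connect {endpoints}")
        self.edges.append(edge)


graph = MetadataGraph()
graph.add_vertex(Vertex("db1", VertexType.DATABASE, {"database_type": "MySQL"}))
graph.add_vertex(Vertex("t1", VertexType.TABLE))
graph.add_vertex(Vertex("c1", VertexType.COLUMN, {"data_type": "varchar", "vector": [0.1, 0.2]}))
graph.add_edge(Edge("e1", EdgeType.DATABASE_TABLE, "db1", "t1"))
graph.add_edge(Edge("e2", EdgeType.TABLE_COLUMN, "t1", "c1"))
```

Validating endpoints when an edge is added keeps the stored graph consistent with the three edge types, which the screening and classification modules rely on.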
PCT/CN2022/116971 2021-12-30 2022-09-05 Depth map matching-based automatic classification method and system for medical data elements WO2023124191A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023536557A JP7432801B2 (en) 2021-12-30 2022-09-05 Medical data element automated classification method and system based on depth map matching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111649231.1A CN114003791B (en) 2021-12-30 2021-12-30 Depth map matching-based automatic classification method and system for medical data elements
CN202111649231.1 2021-12-30

Publications (1)

Publication Number Publication Date
WO2023124191A1 true WO2023124191A1 (en) 2023-07-06

Family

ID=79932292

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116971 WO2023124191A1 (en) 2021-12-30 2022-09-05 Depth map matching-based automatic classification method and system for medical data elements

Country Status (3)

Country Link
JP (1) JP7432801B2 (en)
CN (1) CN114003791B (en)
WO (1) WO2023124191A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312435A (en) * 2023-11-23 2023-12-29 首都信息发展股份有限公司 Data acquisition method and device and electronic equipment
CN117763129A (en) * 2024-02-22 2024-03-26 神州医疗科技股份有限公司 medical record retrieval system based on generated pre-training model

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003791B (en) * 2021-12-30 2022-04-08 之江实验室 Depth map matching-based automatic classification method and system for medical data elements
CN116166698B (en) * 2023-01-12 2023-09-01 之江实验室 Method and system for quickly constructing queues based on general medical terms
CN117349401B (en) * 2023-12-06 2024-03-15 之江实验室 Metadata storage method, device, medium and equipment for unstructured data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017152802A1 (en) * 2016-03-07 2017-09-14 陈宽 Intelligent system and method for converting textual medical report into structured data
CN109471945A (en) * 2018-11-12 2019-03-15 中山大学 Medical file classification method, device and storage medium based on deep learning
CN109948680A (en) * 2019-03-11 2019-06-28 合肥工业大学 The classification method and system of medical record data
CN110021439A (en) * 2019-03-07 2019-07-16 平安科技(深圳)有限公司 Medical data classification method, device and computer equipment based on machine learning
CN110349639A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter medical terms standardized system based on common therapy terminology bank
US20210089880A1 (en) * 2019-09-25 2021-03-25 International Business Machines Corporation Systems and methods for training a model using a few-shot classification process
CN113656604A (en) * 2021-10-19 2021-11-16 之江实验室 Medical term normalization system and method based on heterogeneous graph neural network
CN114003791A (en) * 2021-12-30 2022-02-01 之江实验室 Depth map matching-based automatic classification method and system for medical data elements

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8280886B2 (en) 2008-02-13 2012-10-02 Fujitsu Limited Determining candidate terms related to terms of a query
CN105354266A (en) * 2015-10-23 2016-02-24 北京航空航天大学 Rich graph model RichGraph based graph data management method
CN106250382A (en) * 2016-01-28 2016-12-21 新博卓畅技术(北京)有限公司 A kind of metadata management automotive engine system and implementation method
US11625620B2 (en) 2018-08-16 2023-04-11 Oracle International Corporation Techniques for building a knowledge graph in limited knowledge domains
US11921697B2 (en) * 2019-11-22 2024-03-05 Fraud.net, Inc. Methods and systems for detecting spurious data patterns
CN111523003A (en) * 2020-04-27 2020-08-11 北京图特摩斯科技有限公司 Data application method and platform with time sequence dynamic map as core
CN112185515A (en) * 2020-10-12 2021-01-05 安徽动感智能科技有限公司 Patient auxiliary system based on action recognition

Also Published As

Publication number Publication date
JP2024502730A (en) 2024-01-23
CN114003791B (en) 2022-04-08
CN114003791A (en) 2022-02-01
JP7432801B2 (en) 2024-02-16

Similar Documents

Publication Publication Date Title
WO2023124191A1 (en) Depth map matching-based automatic classification method and system for medical data elements
CN111428053B (en) Construction method of tax field-oriented knowledge graph
WO2021103492A1 (en) Risk prediction method and system for business operations
CN111428054B (en) Construction and storage method of knowledge graph in network space security field
CN111708773B (en) Multi-source scientific and creative resource data fusion method
CN111488465A (en) Knowledge graph construction method and related device
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN113434623B (en) Fusion method based on multi-source heterogeneous space planning data
CN111444348A (en) Method, system and medium for constructing and applying knowledge graph architecture
CN111708774B (en) Industry analytic system based on big data
CN111967761A (en) Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN110600121A (en) Knowledge graph-based primary etiology diagnosis method
CN112463981A (en) Enterprise internal operation management risk identification and extraction method and system based on deep learning
CN112508743B (en) Technology transfer office general information interaction method, terminal and medium
CN116383399A (en) Event public opinion risk prediction method and system
CN116821376B (en) Knowledge graph construction method and system in coal mine safety production field
CN113254517A (en) Service providing method based on internet big data
CN115934969A (en) Construction method of immovable cultural relic risk assessment knowledge graph
CN116245107A (en) Electric power audit text entity identification method, device, equipment and storage medium
CN115827885A (en) Operation and maintenance knowledge graph construction method and device and electronic equipment
Su et al. Design and application of intelligent management platform based on big data
Sawarkar et al. Automated metadata harmonization using entity resolution and contextual embedding
CN117151659B (en) Ecological restoration engineering full life cycle tracing method based on large language model
Wang et al. Construction of knowledge graph for internal control of financial enterprises
CN115292274B (en) Data warehouse topic model construction method and system

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase

Ref document number: 2023536557

Country of ref document: JP

121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22913470

Country of ref document: EP

Kind code of ref document: A1