CN115662638A - Data discovery method and system for heterogeneous clinical information - Google Patents

Data discovery method and system for heterogeneous clinical information

Info

Publication number
CN115662638A
Authority
CN
China
Prior art keywords
data
information
clinical
column
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210434115.6A
Other languages
Chinese (zh)
Inventor
田雨
李劲松
吴文昊
周天舒
王执晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202210434115.6A
Publication of CN115662638A
Legal status: Pending

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a data discovery method and system for heterogeneous clinical information. The method first extracts basic metadata information from the original databases of heterogeneous clinical information of a medical institution. It then constructs a semantic network based on Chinese clinical medical terms, standard clinical data tables and clinical data service scenarios, uses the semantic network for rule-based domain named entity recognition, and thereby acquires semantic information about the original data. Next, the original data are grouped according to the basic metadata information and the semantic information, and association calculation is performed within each group to obtain the association relations among the data. Finally, the basic metadata information and the semantic information are displayed as visual charts, and the association relations are displayed as a visual graph structure. The invention describes the original clinical data from several dimensions, acquiring multi-dimensional descriptive information, data statistics and semantic information, which helps improve retrieval efficiency and accuracy.

Description

Data discovery method and system for heterogeneous clinical information
Technical Field
The invention relates to the technical field of medical data, in particular to a data discovery method and a data discovery system for heterogeneous clinical information.
Background
Retrospective clinical research based on real-world clinical data (typified by the raw data of clinical information systems) has become a common and important tool in medical informatics. The accuracy, integrity and consistency of clinical data are valuable for such research. However, the clinical information systems used by different hospitals differ greatly and their data structures are highly heterogeneous, so retrieving target data from them is difficult, and data retrieval consumes a large share of the time spent on retrospective clinical research. At present, target clinical data are retrieved from a hospital's heterogeneous clinical information systems mainly by manually inspecting the original database files, which is inefficient and makes data quality hard to guarantee.
Data discovery refers to the process of collecting metadata from various heterogeneous databases and integrating it into a single source that can be conveniently evaluated in real time. Metadata describes the raw data of a clinical information system from different dimensions, so that a user can fully understand the raw data and locate the target clinical data accurately and completely. Data discovery can therefore help researchers quickly retrieve and acquire target clinical data and safeguard the quality of the clinical data used in retrospective clinical research.
Because retrieving target data from the raw data of heterogeneous clinical information systems is difficult, many studies have attempted to solve this problem. The technical schemes most similar to the one claimed in this patent are the following. (1) A paper published by Google [1] addresses the problem of obtaining target data from a large amount of heterogeneous raw data by collecting metadata from the raw data, including owner, timestamps, data structure and relationships between data such as similarity and provenance, and then publishing the metadata through a service, so that users can understand the raw data through the metadata and can annotate the raw data to contribute further descriptive information. (2) Castro Fernandez et al. [2] implemented a data discovery system that analyzes the relationships between data from heterogeneous raw data sources and constructs a knowledge graph to store those relationships; more associated data can be obtained from the knowledge graph, making it convenient for users to obtain data association information across heterogeneous data sources. (3) The prior art scheme "Method and device for managing metadata and readable storage medium" (CN 110795448A) describes a method for managing metadata in which the metadata comprises medical terms, a time dimension, a space dimension and a degree dimension describing the target medical data; the medical terms are extracted using keywords, and a metadata query request is then received and the corresponding metadata provided, improving the convenience of medical data management.
Prior art similar to (1) describes and manages a large number of heterogeneous data sources through metadata management. Such metadata management generally relies on user interaction to supply part of the metadata, such as semantic information and data association information. A relatively complete metadata base is therefore obtained only after a long period of accumulated use, and the metadata cannot be used directly once it has been extracted from the original data.
Prior art similar to (2) improves the efficiency of data association analysis so that data from different sources can be associated and analyzed together; a knowledge graph of data association information is constructed once the associated data has been obtained, and more associated target data can then be obtained from the knowledge graph. Such techniques generally focus on data association information, upstream and downstream data migration, and data lineage analysis. They lack a description of semantic information, which reduces the efficiency of retrieving target data.
Prior art similar to (3) lacks data association information, which compromises the completeness of the retrieved target data.
[1] HALEVY A, KORN F, NOY N F, et al. Goods: Organizing Google’s Datasets[C]; Proceedings of the 2016 International Conference on Management of Data, 2016: 795-806.
[2] CASTRO FERNANDEZ R, ABEDJAN Z, KOKO F, et al. Aurum: A Data Discovery System[C]; Proceedings of the 2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018: 1001-1012.
Disclosure of Invention
The invention aims to provide a data discovery method and system for heterogeneous clinical information aiming at the defects of the prior art.
The purpose of the invention is realized by the following technical scheme: on one hand, the invention provides a data discovery method for heterogeneous clinical information, which comprises the following specific steps:
(1) Extracting basic metadata information from an original database of heterogeneous clinical information of a medical institution;
(2) Constructing a semantic network based on Chinese clinical medical terms, a standard clinical data table OMOP CDM and a clinical data service scene; using a semantic network to identify a domain named entity based on rules, and acquiring semantic information of original data in an original database;
(3) Grouping original data in an original database according to the basic metadata information and the semantic information, and performing association calculation to obtain the association relations of the data in each group;
(4) Displaying the basic metadata information and the semantic information as visual charts, and displaying the association relations obtained in step (3) as a visual graph structure.
Further, the basic metadata information includes the database name, column data type, column data length, column data count, whether the column can be empty, the number of distinct values in the column, the number of empty values in the column, the column empty value ratio, the column last analysis time, whether the column contains digits, whether the column contains letters, whether the column contains Chinese characters, the column representative data length, the column representative data, and a primary key judgment.
Further, the basic metadata information is written into a metadata database, so that subsequent browsing and searching are facilitated.
Further, in the step (2), a Neo4j graph database is used for constructing the semantic network.
Further, in the step (2), the original data is filtered by setting specific semantic information.
Further, in the step (3), the association relations of the four types of data sets are defined, which are respectively a complete equivalence relation, a complete inclusion relation, an incomplete inclusion relation and a similar relation.
Further, in step (3), the association calculation within each group uses Jaccard Containment and Jaccard Similarity to measure the relationships between sets.
On the other hand, the invention provides a data discovery system for heterogeneous clinical information, which comprises a database connection management module, a basic metadata extraction module, a clinical semantic information extraction module, a data association information extraction module and a metadata visualization module;
the database connection management module is connected with an original database of heterogeneous clinical information of the medical institution through a database connection framework;
the basic metadata extraction module extracts basic metadata information from an original database of heterogeneous clinical information of a medical institution;
the clinical semantic information extraction module constructs a semantic network based on Chinese clinical medical terms, a standard clinical data table OMOP CDM and a clinical data service scene; using a semantic network to identify domain named entities based on rules, and acquiring semantic information of original data in an original database;
the data association information extraction module defines association relations of four types of data sets, namely a complete equivalence relation, a complete inclusion relation, an incomplete inclusion relation and a similar relation; grouping original data in an original database according to the basic metadata information and the semantic information, and performing association calculation to obtain association relations of all data in each group;
and the metadata visualization module is used for performing chart visualization display on the basic metadata information and the semantic information and performing graphic structure visualization display on the association relation obtained by the data association information extraction module.
Further, the database connection framework adopts Mybatis or Hibernate to provide a uniform access interface for various different types of original data databases.
The invention has the beneficial effects that:
(1) The invention defines a metadata structure oriented to a heterogeneous clinical information system, all metadata can be obtained through automatic data analysis, and a user can directly use the metadata to retrieve target data. Raw clinical data is described from a plurality of different dimensions, and descriptive information of the raw data is obtained. The invention is beneficial to improving the retrieval efficiency and accuracy through multi-dimensional descriptive information, data statistical information, semantic information and the like.
(2) The method constructs a semantic network based on Chinese standard clinical medical terms, the OMOP CDM standard clinical data tables and clinical information system service scenarios, performs domain named entity recognition on the original data of the heterogeneous clinical information systems, and then performs semantic matching through the semantic network on the recognized named entities to acquire semantic information about the original data, which improves the accuracy of target data retrieval. In retrieval tasks for different target data, setting the specific target data makes it possible to identify whether the original data match it, so that accurate target data can be obtained. For example, to obtain data for stroke research, stroke-related diagnostic data can be obtained by filtering through the semantic network.
(3) By defining four different data association relations, the invention describes the relationships between columns of the original clinical data more completely and finally expresses them as a graph structure, which improves the completeness of target data retrieval.
(4) The invention provides the functions of chart visualization, data browsing, searching, recommendation and data management of the metadata by developing a Web end application, and is beneficial to improving the target data retrieval efficiency.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of the system of the present invention;
FIG. 3 is a detailed process diagram of metadata extraction.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in fig. 1, in a first aspect, the data discovery method for heterogeneous clinical information provided by the present invention specifically includes the following steps:
(1) Extracting basic metadata information from an original database of heterogeneous clinical information of a medical institution. The basic metadata information comprises the database name, column data type, column data length, column data count, whether the column can be empty, the number of distinct values in the column, the number of empty values in the column, the column empty value ratio, the column last analysis time, whether the column contains digits, whether the column contains letters, whether the column contains Chinese characters, the column representative data length, the column representative data, and a primary key judgment. The basic metadata information is written into a metadata database to facilitate subsequent browsing and searching.
(2) Constructing, using a Neo4j graph database, a semantic network based on Chinese clinical medical terminology, the standard clinical data table OMOP CDM and clinical data service scenarios; using the semantic network to perform rule-based domain named entity recognition and acquire semantic information about the original data in the original database. The semantic information includes disease diagnoses, operations, medicines, examination items, medical instruments (medical consumables), addresses, traditional Chinese medicines, units, drug dosage forms, hospital departments, ethnic groups, doctor titles, examination samples, payment methods, genders, and the like. In addition, the original data may be filtered by setting specific semantic information; for example, setting fever as the topic filters fever-related diagnosis, examination and medicine data out of the original data.
(3) Defining four types of association relations between data sets, namely a complete equivalence relation, a complete inclusion relation, an incomplete inclusion relation and a similar relation; grouping the original data in the original database according to the basic metadata information and the semantic information, and filtering the original data by setting specific semantic information; performing association calculation to obtain the association relations of the data in each group. The association calculation within each group measures the relationships between sets using Jaccard Containment and Jaccard Similarity.
(4) Displaying the basic metadata information and the semantic information as visual charts, and displaying the association relations obtained in step (3) as a visual graph structure.
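As an illustration of the grouping in step (3), the following minimal Python sketch groups column profiles before any pairwise association calculation. The grouping key (data type plus semantic label), the field names, the column names and the excluded labels are all assumptions made for this example; the text states only that grouping uses the basic metadata information and the semantic information.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical labels for columns that carry no useful association signal.
EXCLUDED_LABELS = {"unit", "digit", "letter"}


def candidate_pairs(profiles):
    """Group columns by (data type, semantic label) and pair them within groups."""
    groups = defaultdict(list)
    for p in profiles:
        if p["semantic_label"] in EXCLUDED_LABELS:
            continue  # excluded columns never take part in association calculation
        groups[(p["data_type"], p["semantic_label"])].append(p["name"])
    pairs = []
    for names in groups.values():
        # Only columns inside the same group are compared with each other.
        pairs.extend(combinations(sorted(names), 2))
    return pairs


profiles = [
    {"name": "his.DIAG_NAME", "data_type": "TEXT", "semantic_label": "disease diagnosis"},
    {"name": "emr.DX", "data_type": "TEXT", "semantic_label": "disease diagnosis"},
    {"name": "emr.MAIN_DX", "data_type": "TEXT", "semantic_label": "disease diagnosis"},
    {"name": "his.DRUG_NAME", "data_type": "TEXT", "semantic_label": "medicine"},
    {"name": "lis.RESULT_UNIT", "data_type": "TEXT", "semantic_label": "unit"},
]
pairs = candidate_pairs(profiles)
all_pairs = list(combinations([p["name"] for p in profiles], 2))
```

In this toy input, grouping reduces the comparisons from the 10 pairs in `all_pairs` to the 3 pairs among the diagnosis columns, which is the efficiency gain the grouping step is meant to provide.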
In a second aspect, the invention provides a data discovery system for heterogeneous clinical information, which comprises a database connection management module, a basic metadata extraction module, a clinical semantic information extraction module, a data association information extraction module and a metadata visualization module. Together these modules form a framework in which the extracted metadata is finally visualized to support functions such as browsing, searching and recommending.
The database connection management module realizes the following functions:
The module manages the information needed to access the original databases of heterogeneous clinical information, the metadata database, the Neo4j database and the Redis database. It uses an existing database connection framework such as Mybatis or Hibernate to provide a uniform access interface to the various types of original databases, and implements functions such as establishing database connections, switching data sources, querying data and writing data.
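The four functions named above (establishing connections, switching data sources, querying and writing) can be sketched in a few lines. The patent relies on Mybatis or Hibernate, which are Java frameworks; the sketch below is only a Python stand-in that uses the standard-library `sqlite3` module, and the class name `DataSourceRegistry`, its methods and the data source names are hypothetical.

```python
import sqlite3


class DataSourceRegistry:
    """Keeps named connections and routes queries to the active one."""

    def __init__(self):
        self._connections = {}
        self._active = None

    def register(self, name, dsn):
        # Establish a connection and remember it under a logical name.
        self._connections[name] = sqlite3.connect(dsn)

    def switch(self, name):
        if name not in self._connections:
            raise KeyError(f"unknown data source: {name}")
        self._active = name

    def _conn(self):
        if self._active is None:
            raise RuntimeError("no active data source; call switch() first")
        return self._connections[self._active]

    def query(self, sql, params=()):
        return self._conn().execute(sql, params).fetchall()

    def write(self, sql, rows):
        conn = self._conn()
        with conn:  # commit on success, roll back on error
            conn.executemany(sql, rows)


registry = DataSourceRegistry()
registry.register("his_raw", ":memory:")   # stand-in for an original database
registry.register("metadata", ":memory:")  # stand-in for the metadata database

registry.switch("his_raw")
registry.query("CREATE TABLE patient (patient_id TEXT)")
registry.write("INSERT INTO patient VALUES (?)", [("P001",), ("P002",)])
raw_count = registry.query("SELECT COUNT(*) FROM patient")[0][0]

registry.switch("metadata")
metadata_tables = registry.query("SELECT name FROM sqlite_master WHERE type = 'table'")
```

Callers address every database through the same `query` and `write` methods, so the extraction modules do not need to know which backend is active.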
The basic metadata extraction module realizes the following functions:
After the database connection management module has connected to an original database, the module automatically scans the basic metadata information of the data in that database. The basic metadata information comprises the user name (database name), column name, column data type, column data length, column data count, whether the column can be empty, the number of distinct values in the column, the number of empty values in the column, the column empty value ratio, the column last analysis time, whether the column contains digits, whether the column contains letters, whether the column contains Chinese characters, the column representative data length, the column representative data, and a primary key judgment. This information is written into a metadata database to support the subsequent browsing and searching functions.
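A minimal sketch of such a column scan is shown below, using an in-memory SQLite table as a stand-in for an original database. The dictionary keys, the choice of the most frequent value as the representative data, and the primary key heuristic (no empty values and all values distinct) are assumptions for illustration, not the patent's stated rules.

```python
import re
import sqlite3
from collections import Counter
from datetime import datetime, timezone

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs block


def profile_column(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Collect basic metadata for one column (hypothetical field names)."""
    t, c = f'"{table}"', f'"{column}"'  # identifiers cannot be bound as parameters
    total, non_null, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT({c}), COUNT(DISTINCT {c}) FROM {t}"
    ).fetchone()
    # PRAGMA table_info rows: (cid, name, declared type, notnull, default, pk)
    info = {row[1]: row for row in conn.execute(f"PRAGMA table_info({t})")}
    values = [str(row[0]) for row in conn.execute(
        f"SELECT {c} FROM {t} WHERE {c} IS NOT NULL")]
    # Assumption: the most frequent non-null value represents the column.
    representative = Counter(values).most_common(1)[0][0] if values else None
    nulls = total - non_null
    return {
        "table": table,
        "column": column,
        "data_type": info[column][2],
        "nullable": info[column][3] == 0,
        "row_count": total,
        "distinct_count": distinct,
        "null_count": nulls,
        "null_ratio": nulls / total if total else 0.0,
        "max_length": max((len(v) for v in values), default=0),
        "has_digit": any(re.search(r"\d", v) for v in values),
        "has_letter": any(re.search(r"[A-Za-z]", v) for v in values),
        "has_chinese": any(CJK.search(v) for v in values),
        "representative": representative,
        "representative_length": len(representative) if representative else 0,
        # Heuristic primary key judgment: no empty values and every value unique.
        "primary_key_candidate": total > 0 and nulls == 0 and distinct == total,
        "analyzed_at": datetime.now(timezone.utc).isoformat(),
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (visit_id TEXT NOT NULL, diagnosis TEXT)")
conn.executemany(
    "INSERT INTO visit VALUES (?, ?)",
    [("V001", "发热"), ("V002", "发热"), ("V003", None), ("V004", "stroke")],
)
id_meta = profile_column(conn, "visit", "visit_id")
dx_meta = profile_column(conn, "visit", "diagnosis")
```

Each returned dictionary corresponds to one row that could be written to the metadata database; a production scan over large tables would sample values rather than read every row.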
The clinical semantic information extraction module realizes the following functions:
The module constructs a semantic network based on Chinese clinical medical terms, the OMOP CDM standard clinical data tables and clinical data service scenarios, implemented with a Neo4j graph database. It performs domain named entity recognition on the original data based on user-defined rules and then obtains the semantic information of the original data, which includes disease diagnoses, operations, medicines, examination items, medical instruments (medical consumables), addresses, traditional Chinese medicines, units, drug dosage forms, hospital departments, ethnic groups, doctor titles, examination samples, payment methods, genders, and the like. In addition, the original data may be filtered by setting specific semantic information; for example, setting fever as the topic filters fever-related diagnosis, examination and medicine data out of the original data.
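The rule-based labeling and the topic filter described above can be sketched with plain Python dictionaries standing in for the Neo4j semantic network. The term lists, the pattern rule, the 50% majority threshold and all function names are assumptions chosen for the example; the actual rules and network contents are not given in the text.

```python
import re
from collections import Counter

# Toy stand-in for the semantic network: term -> semantic category.
TERM_CATEGORY = {
    "发热": "disease diagnosis",
    "脑卒中": "disease diagnosis",
    "布洛芬": "medicine",
    "血常规": "examination item",
    "男": "gender",
    "女": "gender",
}
# Toy "related to" edges: topic -> terms linked to that topic.
TOPIC_TERMS = {"发热": {"发热", "布洛芬", "血常规"}}
# Pattern rules for entities better described by shape than by a term list.
PATTERN_RULES = [(re.compile(r"^(mg|ml|g|mmol/L)$", re.IGNORECASE), "unit")]


def label_value(value):
    """Return the semantic category of one cell value, or None."""
    for pattern, category in PATTERN_RULES:
        if pattern.match(value):
            return category
    for term, category in TERM_CATEGORY.items():
        if term in value:
            return category
    return None


def label_column(values, min_share=0.5):
    """Label a column with its dominant category if enough values match."""
    labels = Counter(filter(None, map(label_value, values)))
    if not labels:
        return None
    category, hits = labels.most_common(1)[0]
    return category if hits / len(values) >= min_share else None


def columns_for_topic(columns, topic):
    """Keep the columns that mention any term linked to the topic."""
    terms = TOPIC_TERMS.get(topic, set())
    return [name for name, values in columns.items()
            if any(term in v for v in values for term in terms)]


columns = {
    "DIAG_NAME": ["发热", "脑卒中", "发热待查"],
    "DRUG_NAME": ["布洛芬缓释胶囊", "布洛芬", "阿司匹林"],
    "SEX": ["男", "女", "女"],
    "DOSE_UNIT": ["mg", "ml"],
}
labels = {name: label_column(values) for name, values in columns.items()}
fever_columns = columns_for_topic(columns, "发热")
```

With this input, filtering on the fever topic keeps the diagnosis and medicine columns and drops the gender and unit columns, mirroring the fever example in the text.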
The data associated information extraction module realizes the following functions:
The module first defines four types of association relations between data sets: a complete equivalence relation, a complete inclusion relation, an incomplete inclusion relation and a similar relation. The original data are first grouped according to the basic metadata information and semantic information obtained earlier, and data such as units, digits and letters are excluded. Grouping greatly improves the efficiency of the association calculation and reduces interference from irrelevant data. Within each group, Jaccard Containment and Jaccard Similarity are used to measure the relationships between sets, with the formulas shown below; a Redis database and a locality-sensitive hashing algorithm are used to improve computational efficiency.
$$JC(A, B) = \frac{|A \cap B|}{|A|}$$
$$JS(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A and B represent the data of different columns, JC(A, B) represents the Jaccard Containment of data column A and data column B, and JS(A, B) represents the Jaccard Similarity of data column A and data column B.
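The two measures and their use in assigning one of the four association relations can be sketched as follows, taking containment as |A ∩ B| / |A| and similarity as |A ∩ B| / |A ∪ B| (the usual definitions, assumed here). The thresholds and the mapping from scores to relation types are assumptions for illustration; the text does not state the decision rules. The sketch computes exact set operations and does not include the Redis or locality-sensitive hashing optimizations.

```python
def jaccard_containment(a: set, b: set) -> float:
    """Share of A's distinct values that also occur in B."""
    return len(a & b) / len(a) if a else 0.0


def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection relative to the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(a: set, b: set, containment_threshold=0.8, similarity_threshold=0.5):
    """Map two column value sets to a relation type (assumed decision rules)."""
    if not a or not b:
        return None
    if a == b:
        return "complete equivalence"
    # Containment is asymmetric, so take the stronger direction.
    jc = max(jaccard_containment(a, b), jaccard_containment(b, a))
    if jc == 1.0:
        return "complete inclusion"
    if jc >= containment_threshold:
        return "incomplete inclusion"
    if jaccard_similarity(a, b) >= similarity_threshold:
        return "similar"
    return None


visit_patients = {"P1", "P2", "P3", "P4"}
all_patients = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"}

jc = jaccard_containment(visit_patients, all_patients)
js = jaccard_similarity(visit_patients, all_patients)
relation = classify(visit_patients, all_patients)
```

Here every visit patient appears in the patient master list, so `jc` is 1.0 while `js` is only 0.5; this is the kind of foreign key to primary key link that containment captures and plain similarity understates.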
The metadata visualization module realizes the following functions:
The module connects to the metadata database and provides chart visualization of the statistical information of the original data; browsing and searching of basic metadata, semantic information and association information; recommendation of similar data; graph-structure visualization of the data association information; and data management functions.
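One way to prepare association results for graph display and for recommending related columns is sketched below. The node and edge layout, the triple format `(column, column, relation)` and the function names are hypothetical; the patent does not specify the Web application's data format.

```python
import json
from collections import defaultdict


def build_graph(associations):
    """Turn (column_a, column_b, relation) triples into a node/edge structure."""
    nodes = sorted({col for a, b, _ in associations for col in (a, b)})
    return {
        "nodes": [{"id": n} for n in nodes],
        # Edge direction follows the order of the triple (an assumption).
        "edges": [{"source": a, "target": b, "relation": r} for a, b, r in associations],
    }


def recommend(associations, column):
    """List the columns directly associated with the given column."""
    neighbours = defaultdict(list)
    for a, b, relation in associations:
        neighbours[a].append((b, relation))
        neighbours[b].append((a, relation))
    return neighbours.get(column, [])


associations = [
    ("his.DIAG_NAME", "emr.DX", "complete inclusion"),
    ("emr.DX", "emr.MAIN_DX", "similar"),
]
graph = build_graph(associations)
payload = json.dumps(graph, ensure_ascii=False)  # could be handed to a front-end graph view
related = recommend(associations, "emr.DX")
```

A front end could render `payload` with any graph library, and `related` gives the one-hop neighbours that a recommendation panel might show next to a selected column.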
The data discovery method of the invention was applied to real clinical information data from two hospitals. The basic metadata show that the two clinical data structures are completely inconsistent. The evaluation shows that the accuracy of the semantic information extracted for the original clinical data can reach 90% and the coverage can reach 85%. In the data association analysis, association information is obtained for more than 70% of columns, the accuracy of primary key and foreign key associations can reach 80%, and all columns in the scanned data are covered. These results indicate that the proposed data discovery method is applicable to real heterogeneous clinical information data, extracts metadata from clinical information data with high accuracy and coverage, describes the original clinical data well, and is valuable for improving the efficiency, accuracy and completeness of target clinical data retrieval.
The above embodiments are intended to illustrate the invention rather than to limit it; any modification or variation made to the invention within its spirit and the scope of the claims falls within the scope of protection of the invention.

Claims (9)

1. A data discovery method for heterogeneous clinical information is characterized by comprising the following specific steps:
(1) Extracting basic metadata information from an original database of heterogeneous clinical information of a medical institution;
(2) Constructing a semantic network based on Chinese clinical medical terms, a standard clinical data table OMOP CDM and a clinical data service scene; using a semantic network to identify a domain named entity based on rules, and acquiring semantic information of original data in an original database;
(3) Grouping original data in an original database according to the basic metadata information and the semantic information, and performing association calculation to obtain association relations of all data in each group;
(4) Performing chart visual display on the basic metadata information and the semantic information, and performing graph structure visual display on the association relations obtained in step (3).
2. The data discovery method for heterogeneous clinical information according to claim 1, wherein the basic metadata information includes database name, column data type, column data length, column data number, whether a column can be empty, number of unique and different values of a column, number of empty values of a column, column empty value ratio, column last analysis time, whether a column contains numbers, whether a column contains letters, whether a column contains Chinese, column representative data length, column representative data, and primary key judgment.
3. The data discovery method for heterogeneous clinical information according to claim 1, wherein the basic metadata information is written into a metadata database for facilitating subsequent browsing and searching.
4. The method for discovering data oriented to heterogeneous clinical information according to claim 1, wherein in step (2), a Neo4j graph database is used to construct a semantic network.
5. The method for discovering data oriented to heterogeneous clinical information according to claim 1, wherein in the step (2), the raw data is filtered by setting specific semantic information.
6. The method for discovering data oriented to heterogeneous clinical information according to claim 1, wherein in the step (3), the association relationships of four types of data sets are defined, which are respectively a complete equivalence relationship, a complete containment relationship, an incomplete containment relationship and a similarity relationship.
7. The method for discovering data oriented to heterogeneous clinical information according to claim 1, wherein in step (3), the association calculation in each group measures the relationship between sets using Jaccard Containment and Jaccard Similarity.
8. A data discovery system for heterogeneous clinical information is characterized by comprising a database connection management module, a basic metadata extraction module, a clinical semantic information extraction module, a data association information extraction module and a metadata visualization module;
the database connection management module is connected with an original database of heterogeneous clinical information of the medical institution through a database connection framework;
the basic metadata extraction module extracts basic metadata information from an original database of heterogeneous clinical information of a medical institution;
the clinical semantic information extraction module constructs a semantic network based on Chinese clinical medical terms, a standard clinical data table OMOP CDM and a clinical data service scene; using a semantic network to identify a domain named entity based on rules, and acquiring semantic information of original data in an original database;
the data association information extraction module defines association relations of four types of data sets, namely a complete equivalence relation, a complete inclusion relation, an incomplete inclusion relation and a similar relation; grouping original data in an original database according to the basic metadata information and the semantic information, and performing association calculation to obtain association relations of all data in each group;
and the metadata visualization module is used for performing chart visualization display on the basic metadata information and the semantic information and performing graphic structure visualization display on the association relationship obtained by the data association information extraction module.
9. The heterogeneous clinical information-oriented data discovery system according to claim 8, wherein the database connection framework employs Mybatis or Hibernate to provide a uniform access interface for a plurality of different types of raw data databases.
CN202210434115.6A 2022-04-24 2022-04-24 Data discovery method and system for heterogeneous clinical information Pending CN115662638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210434115.6A CN115662638A (en) 2022-04-24 2022-04-24 Data discovery method and system for heterogeneous clinical information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210434115.6A CN115662638A (en) 2022-04-24 2022-04-24 Data discovery method and system for heterogeneous clinical information

Publications (1)

Publication Number Publication Date
CN115662638A true CN115662638A (en) 2023-01-31

Family

ID=85023720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210434115.6A Pending CN115662638A (en) 2022-04-24 2022-04-24 Data discovery method and system for heterogeneous clinical information

Country Status (1)

Country Link
CN (1) CN115662638A (en)

Similar Documents

Publication Publication Date Title
GB2293667A (en) Database management system
WO2009037615A1 (en) System and method for analyzing electronic data records
CN111243748A (en) Needle pushing health data standardization system
JP2000505222A (en) Automatic transmission of legacy system data
US20090083224A1 (en) Summarizing data removed from a query result set based on a data quality standard
US20160070751A1 (en) Database management system
CN115497631A (en) Clinical scientific research big data analysis system
Nazabal et al. Data engineering for data analytics: A classification of the issues, and case studies
Lu et al. How do author-selected keywords function semantically in scientific manuscripts?
JP6375029B2 (en) A metadata-based online analytical processing system that analyzes the importance of reports
Kruse et al. Data Anamnesis: Admitting Raw Data into an Organization.
EP1251435A2 (en) Knowledge database and method for constructing and merging knowledge database
TWI296380B (en) Method and apparatus for electronic document collection
Traina Jr et al. Using an image-extended relational database to support content-based image retrieval in a PACS
CN111460173A (en) Method for constructing disease ontology model of thyroid cancer
CN115662638A (en) Data discovery method and system for heterogeneous clinical information
US8190880B2 (en) Methods and systems for displaying standardized data
US9406037B1 (en) Interactive literature analysis and reporting
CN114706625A (en) Method, device and storage medium for constructing patient information global query plug-in
CN113094514A (en) Water affair data intelligent discovery method based on domain knowledge graph
Sattler et al. Supporting Information Fusion with Federated Database Technologies (Position Paper).
CN110853745A (en) Skin disease patient standardization system
CN111259633A (en) System for converting document into format and automatically establishing database
Pereira et al. Visualising time-evolving semantic biomedical data
Germinian et al. Utilizing PostGIS Extension to Process Spatial Data Stored in Neo4j Database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination