WO2020233256A1 - General medical termbase-based multi-center medical terminology standardization system - Google Patents

Publication number
WO2020233256A1
Authority
WO
WIPO (PCT)
Prior art keywords
term
medical
mapping
module
database
Prior art date
Application number
PCT/CN2020/083586
Other languages
French (fr)
Chinese (zh)
Inventor
李劲松
田雨
王执晓
周天舒
董凯奇
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to JP2021533326A (patent JP7093593B2)
Publication of WO2020233256A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data

Definitions

  • the invention belongs to the field of terminology standardization, and in particular relates to a multi-center medical terminology standardization system based on a general medical term database.
  • the medical data of multiple medical data centers (referred to as "multi-center") is increasingly used for data analysis and mining in support of clinical decision-making, medical management services, and scientific research, and this has become an inevitable trend. However, domestic standards for medical terminology are scarce, the standards system is incomplete, and there are many medical information system vendors.
  • the general medical term database is a standard library of medical concept terminology covering the entire medical process, built around international general medical term sets such as the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the International Classification of Diseases (ICD-10), the clinical drug standard nomenclature (RxNorm), and Logical Observation Identifiers Names and Codes (LOINC).
  • SNOMED-CT: Systematized Nomenclature of Medicine - Clinical Terms
  • ICD-10: International Classification of Diseases, 10th Revision
  • RxNorm: standard nomenclature for clinical drugs
  • LOINC: Logical Observation Identifiers Names and Codes
  • the more widely used approach is for information technology personnel and personnel with medical background knowledge, such as doctors, to determine the mapping relationship between the data in the medical system and the general medical terminology one entry at a time, and then to perform semi-automatic mapping by executing SQL scripts.
  • the current method has obvious disadvantages:
  • the present invention proposes a multi-center medical terminology standardization system based on a general medical term database, which solves the multi-center medical terminology standardization problem on the basis of multiple standardized medical term sets, simplifies the operation of medical terminology mapping, and enriches the whole process of medical terminology standardization.
  • a multi-center medical terminology standardization system based on a general medical term database, which includes a source database, a database connection management module, a pre-analysis module, a term mapping unit, an incremental update module, an exception handling module and a multi-center interaction module;
  • the source database is distributed in the front-end server of each medical data center, and stores the business data of each medical data center;
  • the database connection management module manages the information required to access the source database, and provides support for the terminology mapping tool to access and modify the source database;
  • the pre-analysis module automatically scans the source database, counts the occurrence frequency of each medical term in the original medical data, suggests rejecting terms whose occurrence frequency is below the set threshold, and sends terms whose frequency is greater than or equal to the set threshold to the term mapping unit for subsequent term mapping;
  • the term mapping unit includes an automatic mapping module, a fuzzy matching module and a custom term module;
  • the automatic mapping module supports automatic mapping of medical terminology: for terms already encoded with the standard codes of an international general medical term set, it realizes multi-directional mapping according to the existing mapping relationships between the standard codes of the international general medical term sets;
  • the fuzzy matching module: for medical terms that cannot be mapped directly through the mapping relationships between existing standard codes, it traverses and queries the general medical term database by fuzzy matching and offers the several standard medical terms with the highest similarity for selection as the target term of the term mapping;
  • the custom term module: for medical terms that can neither rely on the mapping relationships between standard codes in the existing medical term database nor be fuzzy-matched to a target term in the existing general medical term database, the user generates a custom term application, which is sent to the multi-center interaction module for review and feedback;
  • the multi-center interaction module: after receiving the custom term applications sent by the custom term module of each medical data center, it reviews the custom terms, adds the approved custom terms to the general medical term database as standard terms, and sends them to every medical data center so that the general medical term database stays consistent across all medical data centers;
  • the incremental update module: when a source database that has already undergone standardized mapping of medical terminology generates incremental data for business reasons, it retrieves the historical mapping records generated by the term mapping unit to complete the terminology standardization mapping of the incremental data;
  • the exception handling module records the execution process of each of the above modules and generates an error log when errors occur, so that the entire process of medical term mapping can be traced from the error log.
  • the system also includes a data cleaning module, which formulates cleaning rules, assigns a weight to each data element, and filters out data with severe missing values, covering the cleaning of dirty data at both the structure level and the instance level.
  • the database connection management module specifically: composes a JDBC module from classes and interfaces written in a programming language, providing a unified access interface for multiple types of databases and implementing the functions of establishing a connection with a database or other data source, sending SQL commands to the database, and processing the results returned by the database.
  • the pre-analysis module automatically scans the structural information of all data in the source database and the statistical information of specific fields to generate a statistical table, which includes two parts:
  • the automatic mapping module: for terms in the source database that carry international general medical terminology standard codes, after determining the standard to which the codes belong, the target term set to be mapped is selected; if a reference mapping relationship already exists between the codes of the term set to which the source terms belong and the codes of the target term set, mapping SQL statements can be generated automatically for these terms to complete the automatic mapping and the loading of the corresponding data.
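As a minimal sketch of the automatic mapping step just described, the following Python snippet applies an existing code-to-code reference mapping through parameterized SQL, using the standard-library sqlite3 module. The table name `diagnosis`, its columns, the helper `apply_reference_mapping`, and the two-entry ICD-10 to SNOMED-CT mapping are hypothetical placeholders for illustration; they are not taken from the patent or from any official cross-map.

```python
import sqlite3

# Hypothetical reference mapping between two standard term sets
# (source code -> target code); values are illustrative placeholders.
REFERENCE_MAP = {"I10": "59621000", "E11": "44054006"}


def apply_reference_mapping(conn, reference_map):
    """Fill target_code for rows whose source_code has a known reference mapping."""
    cur = conn.cursor()
    for source_code, target_code in reference_map.items():
        cur.execute(
            "UPDATE diagnosis SET target_code = ? "
            "WHERE source_code = ? AND target_code IS NULL",
            (target_code, source_code),
        )
    conn.commit()
    # Codes left unmapped would move on to fuzzy matching or the custom term flow.
    cur.execute(
        "SELECT DISTINCT source_code FROM diagnosis WHERE target_code IS NULL"
    )
    return sorted(row[0] for row in cur.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diagnosis (id INTEGER PRIMARY KEY, source_code TEXT, target_code TEXT)"
)
conn.executemany(
    "INSERT INTO diagnosis (source_code) VALUES (?)",
    [("I10",), ("E11",), ("I10",), ("LOCAL-001",)],
)
unmapped = apply_reference_mapping(conn, REFERENCE_MAP)
print(unmapped)
```

Codes without a reference mapping (here the made-up `LOCAL-001`) are returned rather than dropped, which mirrors the hand-off to the later mapping stages.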
  • the specific method of fuzzy matching is as follows:
  • term segmentation: segment all terms in the general medical term database into words, and count the frequency of each word as the basic word frequency; before matching, segment the source medical term M that needs fuzzy matching.
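The segmentation and basic word frequency step can be sketched as follows. This is an illustration under stated assumptions: whitespace splitting stands in for the patent's unspecified segmentation rules, the three-entry term base is made up, and the function names are hypothetical. The patent's similarity formula is not reproduced here; the sketch stops at the per-word probabilities.

```python
from collections import Counter


def segment(term):
    """Split a term into lowercase words (whitespace splitting is an assumption)."""
    return term.lower().split()


def build_base_frequency(term_base):
    """Count how often each segmented word occurs across the whole term base."""
    counts = Counter()
    for term in term_base:
        counts.update(segment(term))
    return counts


def word_probabilities(term, counts):
    """Return (word, probability) pairs for a term, relative to the base word frequency."""
    total = sum(counts.values())
    return [(word, counts[word] / total) for word in segment(term)]


# Made-up miniature term base for illustration only.
TERM_BASE = [
    "acute myocardial infarction",
    "acute renal failure",
    "chronic renal failure",
]
base = build_base_frequency(TERM_BASE)
probs = word_probabilities("acute heart failure", base)
print(probs)
```

A word that never occurs in the term base (here `heart`) gets probability 0, which a real similarity measure would need to handle explicitly.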
  • the custom term module defines constraints in advance to avoid conflicts between custom terms and known standard terms; when adding custom terms, each medical data center needs to maintain the custom standard terms already added, to prevent repeated addition while ensuring that multi-center medical data can be shared after terminology mapping is standardized. Before adding a custom term, an application for adding the custom term must be submitted to the multi-center interaction module.
  • the application content includes: the custom term to be added, a specific description of the custom term, and the code of the custom term; once the application has passed review by the multi-center interaction module and the relevant operators have confirmed that no custom code exists for a similar or duplicate medical term, a custom standard term code is generated, and the automatic mapping module can then be called to complete the term mapping and the loading of the covered data; if the review fails, either the existing custom term code is returned so that the medical data center can complete the subsequent mapping, or the reason for the failure of custom term generation is returned, an error document is generated and the user is prompted.
  • the multi-center interaction module is responsible for the coordination and unification of the general medical term database and term coding of each medical data center, and the personnel with the highest authority in the multi-center interaction module review and coordinate the use of custom standard terms.
  • the incremental update module is used in the subsequent medical term standardization process of a medical data center that has already performed medical term mapping; it updates the incremental data mainly on the basis of the earlier standardized mapping records generated by the term mapping unit. For medical terms that still cannot be standardized, the custom term module is executed again.
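A minimal sketch of the incremental update idea: new terms are first looked up in the historical mapping records, and only the remainder is passed on. The dictionary-based history, the `STD-…` codes, and the function name `map_incremental` are hypothetical and chosen only for illustration.

```python
def map_incremental(new_terms, history):
    """Reuse historical mapping records; return (mapped, still_unmapped)."""
    mapped, unmapped = {}, []
    for term in new_terms:
        if term in history:
            mapped[term] = history[term]
        else:
            # No earlier record: hand over to fuzzy matching or the custom term flow.
            unmapped.append(term)
    return mapped, unmapped


# Hypothetical mapping records produced by an earlier standardization run.
HISTORY = {"high blood pressure": "STD-0001", "type 2 diabetes": "STD-0002"}

mapped, unmapped = map_incremental(
    ["type 2 diabetes", "knee pain", "high blood pressure"], HISTORY
)
print(mapped)
print(unmapped)
```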
  • the exception handling module saves all logs produced during system operation and records whether each module is running normally; it classifies and saves error logs, including errors that occur during system operation, errors that occur when each module is called, and errors that occur when a module runs the mapping of a single term; it classifies and saves terms that were not successfully mapped, including the terms ignored in the automatic analysis module and the terms ignored in the custom term module, and generates a document of failed terms; by setting timestamps on the database, the exception handling module supports database backtracking, letting users roll the matched database back to the data of a specified date.
  • the beneficial effects of the present invention are: the present invention systematically solves the problem of medical terminology standardization across multiple medical data centers and maintains the consistency of medical terminology expression in each medical data center; it automatically scans and analyzes the source database of the medical data center and, on this basis, automatically maps medical terms that have standard codes; it takes full account of the inevitable complexity of medical term mapping and implements a spiral process from automatic mapping to fuzzy matching mapping to custom term mapping; the incremental update mechanism makes full use of past mapping records, greatly reduces the pressure of follow-up work, and greatly improves the degree of standardization of medical terminology mapping.
  • Figure 1 is the system flow chart
  • Figure 2 is a system data flow diagram
  • Figure 3 is a schematic diagram of JDBC implementation of database connection management
  • Figure 4 is a flowchart of standardized mapping of medical terms
  • Figure 5 is a schematic diagram of multi-center interaction.
  • the present invention provides a multi-center medical terminology standardization system based on a general medical term database.
  • the system includes a source database, a database connection management module, a pre-analysis module, a term mapping unit, an incremental update module, an exception handling module and a multi-center interaction module, and may also include a data cleaning module;
  • the source database is distributed in the front-end server of each medical data center, and stores business data of medical information systems such as HIS, LIS, PACS and EMR in each medical data center, including basic patient information, medical information, cost information, diagnosis information, medication information, surgery information, inspection information, examination information, text medical record information and nursing vital signs information;
  • database connection management module: manages (including loading, modifying and storing) the information needed to access the source database, and provides support for the terminology mapping tool to access and modify different types of source databases;
  • pre-analysis module: automatically scans the source database, counts the occurrence frequency of each medical term in the original medical data, suggests rejecting terms whose occurrence frequency is below the set threshold, and sends terms whose frequency is greater than or equal to the set threshold to the term mapping unit for subsequent term mapping;
  • term mapping unit: includes an automatic mapping module, a fuzzy matching module and a custom term module;
  • automatic mapping module: supports automatic mapping of medical terminology. For terms that use international general medical terminology standard codes, multi-directional mapping is realized according to the mapping relationships between the existing general medical terminology standard codes, so that only quality control of the mapping results is required;
  • fuzzy matching module: for medical terms that cannot be mapped directly through the mapping relationships between existing standard codes, the general medical term database is traversed and searched by fuzzy matching to offer the several most similar standard medical terms for selection as the target term of the term mapping;
  • custom term module: for medical terms that can neither rely on the mapping relationships between standard codes in the existing medical term database nor be fuzzy-matched to a target term in the existing general medical term database, the user generates a custom term application (which can be decided jointly by technicians and doctors), and the application is sent to the multi-center interaction module for review and feedback;
  • multi-center interaction module: after receiving the custom term applications sent by the custom term module of each medical data center, it reviews the custom terms, adds the approved custom terms to the general medical term database as standard terms, and sends them to every medical data center so that the general medical term database stays consistent across all medical data centers;
  • incremental update module: when a source database that has already undergone standardized mapping of medical terminology generates incremental data for business reasons, the historical mapping records generated by the term mapping unit are retrieved to complete the terminology standardization mapping of the incremental data;
  • exception handling module: records the execution process of each of the above modules and, in particular, generates error logs when errors occur, ensuring that the entire process of medical term mapping can be traced back from the error logs.
  • data cleaning module: formulates cleaning rules, assigns a weight to each data element, filters out data with severe missing values, and improves data quality.
  • the source database and the target database can be the same database system at the physical level.
  • the implementation mainly composes the JDBC module from existing classes and interfaces written in the Java programming language, which provides a unified access interface for various types of databases and has good cross-platform performance. It mainly implements establishing a connection with a database or other data source, sending SQL commands to the database, and processing the results returned by the database.
  • the schematic diagram is shown in Figure 3.
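The patent builds this module on JDBC in Java. Purely to illustrate the idea of one access point that stores connection settings, sends SQL, and returns results, here is a sketch in Python using the standard-library sqlite3 module as a stand-in. The class `ConnectionManager`, its methods, and the data center name `center_a` are hypothetical and are not part of JDBC or the patent.

```python
import sqlite3


class ConnectionManager:
    """Stores named connection settings and opens connections on demand."""

    def __init__(self):
        self._settings = {}  # data center name -> database location

    def register(self, name, database):
        """Load or modify the stored connection information for one source database."""
        self._settings[name] = database

    def query(self, name, sql, params=()):
        """Connect, send one SQL command, and return all result rows."""
        conn = sqlite3.connect(self._settings[name])
        try:
            return conn.execute(sql, params).fetchall()
        finally:
            conn.close()


manager = ConnectionManager()
manager.register("center_a", ":memory:")
rows = manager.query("center_a", "SELECT 1 + 1")
print(rows)
```

Supporting several database types behind the same interface would mean swapping the `sqlite3.connect` call for the matching driver, which is the role JDBC drivers play in the Java design.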
  • pre-analysis module: after the database connection management module has connected to the source database, this module automatically scans the structural information of all data in the source database and the statistical information of specific fields to generate statistical table A, which consists of two parts:
  • the summary statistics of all tables in the source database, including the field names in each table, the value type, the maximum length of all values, the total number of rows in the table, and the proportion of empty values, are as follows:
  • the subsequent term mapping can give priority to terms with higher occurrence frequency for processing.
  • the system advises whether terms with extremely low frequency need to take part in the subsequent term mapping. If no threshold is defined, all terms take part in the mapping by default. Users can also adjust the parameter according to the specific situation to set the minimum frequency threshold below which terms do not take part in subsequent term mapping. This can greatly simplify the subsequent term mapping process, reduce the workload and improve data quality.
  • M is the set minimum frequency for participation in the mapping. If P ≥ M, then A is an object of the mapping; if P < M, then A is a non-standard term with a very low frequency and does not participate in subsequent term mapping. M is a threshold set by the user according to the actual situation.
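The threshold rule can be sketched in a few lines of Python. The sketch reads P as the raw occurrence count of a term, which is an assumption (the patent may intend a relative frequency), and the function name `split_by_frequency` and the sample terms are made up.

```python
from collections import Counter


def split_by_frequency(terms, min_frequency=None):
    """Split distinct terms into (participating, rejected) by occurrence count."""
    counts = Counter(terms)
    if min_frequency is None:
        # No threshold defined: every term participates by default.
        return sorted(counts), []
    keep = sorted(t for t, p in counts.items() if p >= min_frequency)
    reject = sorted(t for t, p in counts.items() if p < min_frequency)
    return keep, reject


# Made-up raw term occurrences scanned from a source database column.
observed = ["hypertension"] * 5 + ["diabetes"] * 3 + ["hypertenson"]

keep, reject = split_by_frequency(observed, min_frequency=2)
print(keep)
print(reject)
```

Rejected terms are only suggestions for exclusion; a reviewer could still lower the threshold to bring them back into the mapping.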
  • the document information generated by the above modules can be exported in PDF, Excel, CSV and other formats.
  • fuzzy matching: by fuzzy-matching this part of the medical terms one by one against the standard terms in the general medical term database, recommended standard terms for mapping and their standard term set codes are given. Fuzzy matching generally recommends multiple standard terms as candidate matches, and professionals with medical knowledge are required to manually determine the single matching object. After the mapping relationship is determined, the automatic mapping module is called to complete the mapping of this part of the medical terms and the loading of the data they cover.
  • the specific method of fuzzy matching is as follows:
  • Medical terms are mostly composed of multiple vocabularies in an orderly combination. Here, medical terms are subdivided into multiple vocabularies again according to specific rules.
  • the present invention compares the probability difference between medical terms as the standard of similarity, and the specific operations are as follows:
  • n is the number of segmented words in each term
  • P1, P2, P3, P4...Pn are the probabilities of each segmented word in the basic word frequency:
  • the custom term module defines the necessary constraints in advance to avoid conflicts between custom terms and known standard terms. For example, custom terms are forced to use a restricted coding range.
  • the application content includes: the custom term to be added, a specific description of the custom term, and the custom term code (automatically generated by the system).
  • a custom standard term code will be generated, and then the automatic mapping module can be called to complete the term mapping and the loading of the covered data; if the review is not passed, the existing custom term code is returned for the medical data center to complete subsequent mapping, or the reason for the failure of custom term generation is returned, an error document is generated and the user is prompted.
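The review outcome described above can be sketched as a small central registry. Everything here is a hypothetical design for illustration: the class `CustomTermRegistry`, the reserved code range, the `CUSTOM-` prefix and the result dictionary are assumptions, and the sketch only detects exact duplicates, whereas the patent relies on human reviewers to spot similar terms.

```python
class CustomTermRegistry:
    """Central registry that reviews custom term applications (hypothetical design)."""

    def __init__(self, code_start=900000, code_end=999999):
        # Custom codes are confined to a reserved range to avoid clashing
        # with standard term codes; the range itself is an assumption.
        self._next_code = code_start
        self._code_end = code_end
        self._terms = {}  # normalized term -> (code, description)

    def review(self, term, description):
        key = term.strip().lower()
        if key in self._terms:
            # Rejected: return the existing code so the center can still map.
            code = self._terms[key][0]
            return {"approved": False, "code": code, "reason": "term already registered"}
        if self._next_code > self._code_end:
            return {"approved": False, "code": None, "reason": "custom code range exhausted"}
        code = f"CUSTOM-{self._next_code}"
        self._next_code += 1
        self._terms[key] = (code, description)
        return {"approved": True, "code": code, "reason": None}


registry = CustomTermRegistry()
first = registry.review("Qi deficiency", "traditional Chinese medicine syndrome")
second = registry.review("  qi deficiency ", "same term submitted by another center")
print(first)
print(second)
```

Because every center applies through the same registry, a second application for the same term yields the code already issued rather than a competing one.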
  • the operation diagram of the custom term module is shown in Figure 4.
  • the invention adopts the method of uniformly adding after submission for review, preventing each medical data center from generating differences in terminology expression when self-defining standard terminology.
  • the multi-center interaction module is responsible for the coordination and unification of the general medical terminology database and term coding in each medical data center.
  • the personnel with the highest authority in the multi-center interaction module review and coordinate the use of custom standard terms; the multi-center custom term interaction network is shown in Figure 5.
  • the exception handling module supports the database backtracking function by setting a time stamp on the database, and supports users to backtrack the matched database to the data on the specified date.
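One way to picture timestamp-based backtracking is to keep every mapping change as a timestamped record and rebuild the state as of a chosen date. The record layout, the `STD-…` codes, and the function name `mapping_as_of` below are hypothetical; the patent does not specify how the timestamps are stored.

```python
from datetime import datetime


def mapping_as_of(records, cutoff):
    """Replay timestamped mapping records up to cutoff; later records override earlier ones."""
    state = {}
    for timestamp, term, code in sorted(records, key=lambda r: r[0]):
        if timestamp <= cutoff:
            state[term] = code
    return state


# Made-up mapping history: (timestamp, source term, mapped code).
RECORDS = [
    (datetime(2020, 1, 5), "knee pain", "STD-0100"),
    (datetime(2020, 3, 1), "knee pain", "STD-0101"),
    (datetime(2020, 2, 1), "high blood pressure", "STD-0001"),
]

snapshot = mapping_as_of(RECORDS, datetime(2020, 2, 15))
print(snapshot)
```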
  • medical data cleaning is extremely necessary to improve the quality of medical data for subsequent data mining and analysis; here is a common data cleaning strategy, which mainly cleans the "dirty data" at the structure level and the instance level.
  • structure-level cleaning rules: unified data model (including data type) definition; unified integrity constraint definition; unified functional dependency requirement definition.
  • instance-level cleaning rules: analyze dirty data, formulate cleaning rules, evaluate and verify them, and record cleaning actions in a log for traceability.
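The patent mentions assigning a weight to each data element and filtering out severely incomplete data, but gives no formula. The sketch below uses a weighted completeness score as one possible reading; the score definition, the field names, the weights and the 0.6 cutoff are all assumptions made for illustration.

```python
def completeness(record, weights):
    """Weighted share of non-missing data elements in a record (0.0 to 1.0)."""
    total = sum(weights.values())
    present = sum(
        weight
        for field, weight in weights.items()
        if record.get(field) not in (None, "")
    )
    return present / total


def filter_records(records, weights, min_score):
    """Keep only records whose weighted completeness reaches min_score."""
    return [r for r in records if completeness(r, weights) >= min_score]


# Hypothetical data elements and weights; higher weight = more important field.
WEIGHTS = {"patient_id": 3, "diagnosis": 2, "note": 1}
RECORDS = [
    {"patient_id": "p1", "diagnosis": "I10", "note": ""},
    {"patient_id": "p2", "diagnosis": None, "note": None},
    {"patient_id": None, "diagnosis": "E11", "note": "follow-up"},
]

kept = filter_records(RECORDS, WEIGHTS, min_score=0.6)
print(kept)
```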
  • as the demands of current data mining and analysis on the quantity and quality of data continue to increase, the present invention aims to realize the sharing of data between multiple medical data centers (mainly hospitals) while fully ensuring the security of each medical data center's data.
  • the prerequisite for data sharing among multiple medical data centers is the standardization of medical data, which includes two parts: data structure standardization and medical terminology standardization. The above content is designed for the latter.
  • the technical points of the present invention are summarized as follows:
  • mapping of the medical data covered by the part of the medical terminology is automatically realized according to the mapping relationship between the existing medical term set encodings.
  • custom medical terms or unique domestic medical terms, such as Chinese medicine, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A general medical termbase-based multi-center medical terminology standardization system, comprising a source database, a database connection management module, a pre-analysis module, a terminology mapping unit, an incremental update module, an exception handling module and a multi-center interaction module. The system solves the problem of medical terminology standardization between multiple medical data centers, and maintains the consistency of medical terminology expressions among the medical data centers. A medical data center source database is automatically scanned and analyzed, and on this basis medical terms having standard codes are automatically mapped. The inevitable complexities of medical terminology mapping are fully taken into account, and a spiral process from automatic mapping to fuzzy match mapping to custom terminology mapping is implemented. An incremental update mechanism makes full use of previous mapping records, significantly reducing the pressure of subsequent work and significantly increasing the degree of standardization of medical terminology mapping.

Description

一种基于通用医疗术语库的多中心医疗术语标准化系统A multi-center medical terminology standardization system based on general medical terminology database 技术领域Technical field
本发明属于术语标准化领域,尤其涉及一种基于通用医疗术语库的多中心医疗术语标准化系统。The invention belongs to the field of terminology standardization, and in particular relates to a multi-center medical terminology standardization system based on a general medical term database.
背景技术Background technique
随着医疗信息化的高速发展,医疗数据的类型和规模快速增长,利用多家医疗数据中心(简称“多中心”)的医疗数据进行数据分析挖掘,进而为临床决策、医疗管理服务、科学研究提供支持成为必然趋势。但国内医疗术语相关标准匮乏,体系尚不完整,加之医疗信息系统厂商众多,故而医疗数据中心间甚至医疗数据中心内的术语名称与编码的异构现象十分严重,并伴有大量的半结构化和非结构化数据;而国际上比较成熟的相关术语集在国内应用有限,且因语言壁垒,导致国际现有的标准术语集之间的映射关系很难应用于国内医疗术语的标准化;以上原因使得医疗信息系统之间无法互操作,难以实现多医疗数据中心之间医疗数据的标准化与共用共享。With the rapid development of medical informatization, the type and scale of medical data are growing rapidly. The medical data of multiple medical data centers (referred to as "multi-center") is used for data analysis and mining, and then for clinical decision-making, medical management services, and scientific research. Providing support has become an inevitable trend. However, domestic medical terminology related standards are scarce, the system is not complete, and there are many medical information system vendors. Therefore, the heterogeneity of term names and codes between medical data centers and even medical data centers is very serious, and accompanied by a large number of semi-structured And unstructured data; while the internationally mature related term sets are limited in domestic application, and due to language barriers, the mapping relationship between existing international standard term sets is difficult to apply to the standardization of domestic medical terminology; the above reasons This makes it impossible to interoperate between medical information systems, and it is difficult to realize the standardization and sharing of medical data among multiple medical data centers.
通用医疗术语库是一套以《医学术语系统命名-临床术语》(SNOMED-CT)、《国际疾病分类与代码》(ICD-10)、临床药物标准命名法(RxNorm)、《观测指标标识符逻辑命名预编码系统》(LOINC)等国际通用医疗术语集为核心的覆盖医疗全过程的医疗概念术语标准库。将多中心医疗数据映射到统一的通用医疗术语库之后,就可以方便地进行大数据分析等操作。在利用多中心医疗数据进行数据分析之前,如何对不同医疗信息系统的医疗数据进行术语标准化和清洗成为了一大难题。The general medical term database is a set of "Medical Terminology Nomenclature-Clinical Terminology" (SNOMED-CT), "International Classification and Codes of Diseases" (ICD-10), Clinical Drug Standard Nomenclature (RxNorm), "Observation Index Identifier" Logical Naming Precoding System (LOINC) and other international general medical term sets as the core medical concept terminology standard library covering the entire medical process. After the multi-center medical data is mapped to a unified general medical term database, operations such as big data analysis can be conveniently performed. Before using multi-center medical data for data analysis, how to standardize and clean the medical data of different medical information systems has become a big problem.
The existing technical solutions [CN201510922676-Method and system for automatically constructing medical term mapping relationships based on word segmentation coding], [CN201710101827-A data standardization processing method and device for medical big data] and [CN201710152584-Method and device for determining medical synonyms] mostly approach the problem from the perspective of Chinese word segmentation: they segment medical terms using methods such as string matching, compute the similarity between medical terms, and establish a mapping between the target term and the medical term with the highest similarity. Such solutions only address the matching of Chinese medical terms rather than term standardization across entire medical information systems, and they cover only mappings between Chinese medical terms, without achieving standardization against foreign standard medical term sets.
Patent documents such as the existing solution [CN201610173625-A method and system for automatic standardization of medical data dictionaries] mainly build a cloud-based data dictionary standardization model at the logical level, which requires extracting the term sets of all medical data centers to the cloud for unified mapping. In practice, however, to ensure data security, medical data centers such as hospitals generally do not allow their term sets to be opened to a cloud outside the center's local area network, so this approach does not meet real-world medical term standardization needs.
At present, the more common approach is for information technology staff to work with doctors and other personnel with medical background knowledge to determine, item by item, the mapping between the data in the medical system and the general medical termbase, and then to perform a semi-automatic mapping, for example by executing SQL scripts, to obtain standardized medical terms. Another approach is to require medical staff to enter data in a standard format at the time of data entry. However, the current methods have obvious drawbacks:
1. The prior art focuses only on establishing mapping relationships between medical terms, rather than on standardizing medical terms across an entire medical information system.
2. Existing solutions all target one specific data model; they lack practicality and specificity, are limited to mappings between Chinese medical terms, and cannot establish mappings to international general medical termbases.
3. For medical data encoded with international general medical term sets, the mapping relationships that already exist between term sets are not fully exploited. Medical terms within a medical data center that do not use a standard term set, as well as terms custom-defined by the data center, are generally handled by fuzzy matching or simply discarded; no complete mapping process and mechanism has been established.
4. For term mappings that require the participation of personnel with medical background knowledge, no user-friendly interactive interface or standardized manual review and exception handling mechanism is provided.
5. Because the data are messy, previous medical term standardization processes have not combined term mapping with subsequent data cleaning. They cannot use the mapping relationships between terms to actually complete the standardization process, let alone guarantee the quality of the mapped data, which seriously affects subsequent data analysis results.
6. For mapped and cleaned medical data, no detailed quality assessment mechanism has been established to ensure the accuracy of term mapping and data cleaning.
7. The handling of updates to the relevant international general medical termbases, and the subsequent incremental update mechanism, are not considered.
Summary of the invention
To address the shortcomings of the prior art, the present invention proposes a multi-center medical terminology standardization system based on a general medical termbase. It solves the problem of multi-center medical term standardization based on multiple standardized medical term sets, simplifies the term mapping operation, and enriches the whole process of medical term standardization.
The object of the present invention is achieved through the following technical solution: a multi-center medical terminology standardization system based on a general medical termbase, comprising a source database, a database connection management module, a pre-analysis module, a term mapping unit, an incremental update module, an exception handling module and a multi-center interaction module;
The source databases are distributed on the front-end servers of the medical data centers and store the business data of each medical data center;
The database connection management module manages the information required to access the source database and supports the term mapping tool in accessing and modifying the source database;
The pre-analysis module automatically scans the source database, counts the frequency of occurrence of each medical term in the original medical data, suggests discarding terms whose frequency is below a set threshold, and sends terms whose frequency is greater than or equal to the threshold to the term mapping unit for subsequent term mapping;
The term mapping unit comprises an automatic mapping module, a fuzzy matching module and a custom term module;
The automatic mapping module supports automated mapping of medical terms: for terms that use the standard codes of an international general medical termbase, it performs multi-directional mapping according to the existing mapping relationships between the standard codes of the general medical termbases;
The fuzzy matching module: for medical terms that cannot be mapped directly using the mapping relationships between standard codes within the existing medical termbases, it traverses the general medical termbase by fuzzy matching and offers the several standard medical terms with the highest similarity, from which the target term of the mapping is selected;
The custom term module: for medical terms that can neither be mapped using the mapping relationships between standard codes in the existing medical termbases nor be fuzzy-matched to a target term in the existing general medical termbase, after the user generates a custom term application, the module sends it to the multi-center interaction module for review and feedback;
The multi-center interaction module: after receiving the custom term applications of the medical data centers sent by the custom term module, it reviews the custom terms, adds approved custom terms to the general medical termbase as standard terms, and sends them to every medical data center so that the general medical termbases of all centers remain consistent;
The incremental update module: for the standardization of incremental data generated for business reasons in a source database that has already undergone standardized term mapping, it retrieves the historical mapping records produced by the term mapping unit to complete the standardized mapping of the incremental data;
The exception handling module records the execution of each of the above modules and generates an error log when errors occur; the error log allows the entire medical term mapping process to be traced back.
Further, the system also includes a data cleaning module, which formulates cleaning rules, assigns a weight to each data element and filters out records with severe missing data; cleaning covers dirty data at both the schema level and the instance level.
Further, the database connection management module specifically comprises a JDBC module composed of classes and interfaces written in a programming language, which provides a unified access interface for multiple types of databases and implements establishing connections to databases or other data sources, sending SQL commands to the database, and processing the results returned by the database.
Further, after the database connection management module has connected to the source database, the pre-analysis module automatically scans the structural information of all data in the source database and the statistics of its specific fields, and generates a statistical table in two parts:
First, summary statistics for all tables in the source database, including the field names in each table, the data type, the maximum length among all values, the total number of rows in the table, and the proportion of null values;
Second, statistics on the details and frequency of occurrence of the specific terms within a given table, sorted by frequency in descending order so that subsequent term mapping can process higher-frequency terms first. The system suggests whether terms with very low frequency need to take part in subsequent term mapping; when nothing is defined, all terms take part by default. The user can also adjust this according to the specific situation, thereby setting the minimum frequency threshold below which terms do not take part in subsequent term mapping.
Further, the automatic mapping module: for terms in the source database that carry standard codes from an international general medical termbase, once the standard to which the codes belong has been determined, the target term set to be mapped to is selected. If a reference mapping already exists between the codes of the standard term set to which the source terms belong and the codes of the target term set, mapping SQL statements are generated automatically for these terms, completing the automatic mapping of the terms in the source database and the loading of the corresponding data.
Further, in the fuzzy matching module, the specific method of fuzzy matching is as follows:
(1) Term segmentation: segment all terms in the general medical termbase into words and count the frequency of each segmented word, giving the base word frequency; before matching, also segment the source medical term M that requires fuzzy matching.
(2) Fuzzy matching: the difference in probability between medical terms is used as the measure of similarity, as follows:
(2.1) Select from the general medical termbase all terms that contain a segmented word of M, segment them, and combine them into term set A;
(2.2) Compute the average weighted probability of term M and of every term in term set A using the following formula, where n is the number of segmented words obtained from each term and P1, P2, P3, P4…Pn are the probabilities of each segmented word in the base word frequency:
[Formula image PCTCN2020083586-appb-000001: the average weighted probability D of a term, expressed in terms of n and P1, P2, …, Pn]
(2.3) Take the difference between the average weighted probability of each standard term in term set A and that of the term M to be matched; the negative of the resulting value is used as the matching degree, so that a larger matching degree means higher similarity. The formula is as follows:
S(M,A) = -|D(M) - D(A)|
Further, the custom term module defines constraints in advance to prevent custom terms from conflicting with known standard terms. When custom terms are added, the medical data centers must keep the added custom standard terms consistent with one another, which prevents duplicate additions and ensures that multi-center medical data can be shared after standardization by term mapping. Before a custom term is added, an application must be submitted to the multi-center interaction module, containing the custom term to be added, a specific description of the custom term, and the code of the custom term. Once the relevant operators of the multi-center interaction module have approved the application and confirmed that no custom code exists for a similar or duplicate medical term, a custom standard term code is generated; the automatic mapping module can then be called to complete the term mapping and the loading of the data it covers. If the application is not approved, either the existing custom term code is returned so that the medical data center can complete the subsequent mapping, or the reason the custom term could not be generated is returned, an error document is generated, and the user is notified.
Further, the multi-center interaction module is responsible for coordinating and unifying the general medical termbase and its term codes across the medical data centers; the personnel with the highest authority over the multi-center interaction module review and coordinate the use of custom standard terms.
Further, the incremental update module is used for subsequent medical term standardization in medical data centers where term mapping has already been performed. It updates the incremental data mainly on the basis of the mapping records of earlier term standardization produced by the term mapping unit; for medical terms that still cannot be mapped, the custom term module is executed again.
Further, the exception handling module saves all logs produced while the system runs and records whether each module runs normally. It saves error logs by category, including errors that occur while the system is running, errors that occur when a module is called, and errors that occur for an individual term mapping while a module is running. It also saves unmapped terms by category, including terms ignored in the automatic analysis module and terms ignored in the custom module, and generates a document of failed terms. By setting timestamps on the database, the exception handling module supports database rollback, allowing the user to roll the matched database back to the data of a specified date.
The beneficial effects of the present invention are as follows. The invention systematically solves the problem of medical term standardization across multiple medical data centers while keeping the expression of medical terms consistent among them. It automates the scanning and analysis of the source databases of medical data centers and, on that basis, the mapping of medical terms that carry standard codes. It takes full account of the inherent complexity of medical term mapping by implementing a progressive process from automatic mapping, to fuzzy matching, to custom term mapping. The incremental update mechanism makes full use of earlier mapping records, greatly reducing the burden of follow-up work and improving the degree of standardization of medical term mapping.
Description of the drawings
Figure 1 is the system flowchart;
Figure 2 is the system data flow diagram;
Figure 3 is a schematic diagram of database connection management implemented with JDBC;
Figure 4 is the flowchart of standardized medical term mapping;
Figure 5 is a schematic diagram of multi-center interaction.
Detailed description
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Figure 1, the present invention provides a multi-center medical terminology standardization system based on a general medical termbase, comprising a source database, a database connection management module, a pre-analysis module, a term mapping unit, an incremental update module, an exception handling module and a multi-center interaction module, and optionally a data cleaning module;
The source databases are distributed on the front-end servers of the medical data centers and store the business data of each center's medical information systems such as HIS, LIS, PACS and EMR, including basic patient information, visit information, cost information, diagnosis information, medication information, surgery information, laboratory test information, examination information, text medical records, and nursing vital-sign information;
Database connection management module: manages (including loading, modifying and storing) the information required to access the source database, and supports the term mapping tool in accessing and modifying different types of source databases;
Pre-analysis module: automatically scans the source database, counts the frequency of occurrence of each medical term in the original medical data, suggests discarding terms whose frequency is below a set threshold, and sends terms whose frequency is greater than or equal to the threshold to the term mapping unit for subsequent term mapping;
The term mapping unit comprises an automatic mapping module, a fuzzy matching module and a custom term module;
Automatic mapping module: supports automated mapping of medical terms. For terms that use the standard codes of an international general medical termbase, it performs multi-directional mapping according to the existing mapping relationships between the standard codes of the general medical termbases, so only quality control of the mapping results is needed;
Fuzzy matching module: for medical terms that cannot be mapped directly using the mapping relationships between standard codes within the existing medical termbases, it traverses the general medical termbase by fuzzy matching and offers the several standard medical terms with the highest similarity, from which the target term of the mapping is selected;
Custom term module: for medical terms that can neither be mapped using the mapping relationships between standard codes in the existing medical termbases nor be fuzzy-matched to a target term in the existing general medical termbase, after the user generates a custom term application (which may be decided jointly by technical staff and doctors), the module sends it to the multi-center interaction module for review and feedback;
Multi-center interaction module: after receiving the custom term applications of the medical data centers sent by the custom term module, it reviews the custom terms, adds approved custom terms to the general medical termbase as standard terms, and sends them to every medical data center so that the general medical termbases of all centers remain consistent;
Incremental update module: for the standardization of incremental data generated for business reasons in a source database that has already undergone standardized term mapping, it retrieves the historical mapping records produced by the term mapping unit to complete the standardized mapping of the incremental data;
Exception handling module: records the execution of each of the above modules; in particular, an error log is generated when errors occur, so that the entire medical term mapping process can later be traced back from the error log.
Data cleaning module: formulates cleaning rules, assigns a weight to each data element, and filters out records with severe missing data to improve data quality.
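As a rough illustration of the data cleaning module described above, the following Python sketch scores each record by the weighted share of data elements that are present and drops records whose score falls below a cutoff. The field names, the weights, the cutoff and the helper names `completeness` and `clean` are hypothetical; the text does not specify the actual cleaning rules.

```python
from typing import Any, Dict, List

# Hypothetical weights per data element (higher = more important).
WEIGHTS: Dict[str, float] = {"patient_id": 3.0, "diagnosis_code": 2.0, "birth_date": 1.0}


def completeness(record: Dict[str, Any], weights: Dict[str, float]) -> float:
    """Weighted share of data elements that are present (not None and not empty)."""
    total = sum(weights.values())
    present = sum(w for field, w in weights.items()
                  if record.get(field) not in (None, ""))
    return present / total


def clean(records: List[Dict[str, Any]], weights: Dict[str, float],
          min_score: float) -> List[Dict[str, Any]]:
    """Keep only records whose weighted completeness reaches min_score."""
    return [r for r in records if completeness(r, weights) >= min_score]


records = [
    {"patient_id": 1, "diagnosis_code": "Z03.001", "birth_date": "1980-01-01"},
    {"patient_id": 2, "diagnosis_code": None, "birth_date": ""},
    {"patient_id": None, "diagnosis_code": None, "birth_date": "1975-05-05"},
]
kept = clean(records, WEIGHTS, min_score=0.5)
```

With these assumed weights, the third record keeps only its lowest-weighted element and is filtered out, while the second record still reaches the cutoff because its identifier is present.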
The specific implementation of each module is as follows:
1. Database connection management module
This module manages (including loading, modifying and storing) the information required to access the source database; the source database and the target database may physically be the same database system. It is implemented mainly as a JDBC module composed of existing classes and interfaces written in a programming language such as Java, which provides a unified access interface for multiple types of databases and is highly portable across platforms. It mainly implements establishing connections to databases or other data sources, sending SQL commands to the database, and processing the results returned by the database, as illustrated in Figure 3.
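The module described above is implemented with JDBC in Java. The sketch below is only an analogy in Python: it uses the standard DB-API with the built-in `sqlite3` driver as a stand-in to show the same three steps (connect, send SQL, process results). The class name `ConnectionManager` and the table layout are assumptions made for illustration.

```python
import sqlite3
from typing import Any, List, Sequence, Tuple


class ConnectionManager:
    """Hypothetical wrapper that stores access information and exposes one interface."""

    def __init__(self, database: str) -> None:
        self.database = database                  # access information for the source database
        self._conn = sqlite3.connect(database)    # step 1: establish the connection

    def execute(self, sql: str, params: Sequence[Any] = ()) -> List[Tuple[Any, ...]]:
        """Step 2: send an SQL command; step 3: return the rows the database sends back."""
        cur = self._conn.execute(sql, params)
        rows = cur.fetchall()
        self._conn.commit()
        return rows

    def close(self) -> None:
        self._conn.close()


manager = ConnectionManager(":memory:")
manager.execute("CREATE TABLE patient (patient_id INTEGER, name TEXT)")
manager.execute("INSERT INTO patient VALUES (?, ?)", (1, "Zhang"))
rows = manager.execute("SELECT patient_id, name FROM patient")
manager.close()
```

Swapping `sqlite3` for another DB-API driver would change only the `connect` call, which mirrors the role the unified access interface plays in the description.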
2. Pre-analysis module: after the database connection management module has connected to the source database, this module automatically scans the structural information of all data in the source database and the statistics of its specific fields to generate statistical table A, which has two parts:
First, summary statistics for all tables in the source database, including the field names in each table, the data type, the maximum length among all values, the total number of rows in the table, and the proportion of null values, as shown below:
| A | B | C | D | E | F |
| Table name | Column name | Data type | Maximum length | Rows | Null ratio |
| PATIENT | Patient ID | NUMBER | 8 | 3000 | 0 |
| PATIENT | Name | VARCHAR2 | 20 | 3000 | 0 |
| PATIENT | Date of birth | DATE | 10 | 3000 | 0 |
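The summary statistics in the table above could be gathered along the following lines. This is a minimal sketch that assumes an SQLite source database accessed through Python's `sqlite3`; the function name `profile_table` and the sample rows are hypothetical, and a production system would query the catalog of whatever database it is connected to.

```python
import sqlite3
from typing import List, Tuple


def profile_table(conn: sqlite3.Connection, table: str) -> List[Tuple[str, str, str, int, int, float]]:
    """Return (table, column, declared type, max length, rows, null ratio) per column."""
    # Table and column names come from the database's own schema, not from user input.
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk) for each column.
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    stats = []
    for _cid, column, col_type, *_rest in columns:
        max_len, nulls = conn.execute(
            f'SELECT MAX(LENGTH("{column}")), SUM("{column}" IS NULL) FROM "{table}"'
        ).fetchone()
        null_ratio = (nulls or 0) / total if total else 0.0
        stats.append((table, column, col_type, max_len or 0, total, null_ratio))
    return stats


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PATIENT (patient_id INTEGER, name VARCHAR(20), birth_date DATE)")
conn.executemany(
    "INSERT INTO PATIENT VALUES (?, ?, ?)",
    [(10000001, "Li", "1980-01-01"), (10000002, None, "1975-05-05")],
)
summary = profile_table(conn, "PATIENT")
```

Each tuple in `summary` corresponds to one row of the overview table: table name, column name, data type, maximum length, row count and null ratio.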
Second, statistics on the details and frequency of occurrence of the specific terms within a given table, sorted by frequency in descending order so that subsequent term mapping can process higher-frequency terms first. The system suggests whether terms with very low frequency need to take part in subsequent term mapping; when nothing is defined, all terms take part by default. The user can also adjust the parameter according to the specific situation, thereby setting the minimum frequency threshold below which terms do not take part in subsequent term mapping. This greatly simplifies the subsequent term mapping process, reduces the workload and improves data quality.
| A | B | C |
| Code | Gender | Frequency |
| Z03.001 | Male | 200 |
| Z03.002 | Female | 100 |
For example, suppose term A is a non-standard term that occurs N2 times and the total amount of data is N1; the frequency of A is then P = N2/N1. Let M be the minimum frequency set for taking part in mapping: if P ≥ M, A takes part in mapping; if P < M, A is a very low-frequency non-standard term and does not take part in subsequent term mapping. M is a threshold set by the user according to the actual situation.
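The threshold rule can be sketched as follows, taking a term's frequency to be its number of occurrences divided by the total number of values in the column. The function name `split_by_frequency`, the sample column and the threshold value are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def split_by_frequency(values: Iterable[str], min_freq: float) -> Tuple[List[str], List[str]]:
    """Return (terms to map, terms suggested for discard), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())                 # total amount of data in the column
    to_map: List[str] = []
    to_discard: List[str] = []
    for term, count in counts.most_common():     # descending order of occurrence
        p = count / total                        # frequency of this term
        (to_map if p >= min_freq else to_discard).append(term)
    return to_map, to_discard


column = ["male"] * 200 + ["female"] * 100 + ["unknwon"] * 1
to_map, to_discard = split_by_frequency(column, min_freq=0.01)
```

Here the misspelled value occurs once in 301 rows, so it falls below the assumed 1% threshold and is only suggested for discard; the user would still make the final decision.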
The documents generated by the above module can be exported in formats such as PDF, Excel and CSV.
3. Automatic mapping module
For terms in the source database that carry standard codes from an international general medical termbase, once the standard to which the codes belong has been determined, the target term set to be mapped to is selected. If a reference mapping already exists between the codes of the standard term set to which the source terms belong and the codes of the target term set, mapping SQL statements are generated automatically for these terms, completing the automatic mapping of the terms in the source database and the loading of the corresponding data.
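One way to picture the automatic generation of mapping SQL is shown below: given a known cross-map between two code systems, a parameterized UPDATE is produced for each source code and executed against the source table. The codes `SRC-...` and `TGT-...`, the table layout and the function name `build_mapping_statements` are made up for this sketch and do not correspond to any real term set.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical cross-map from source term-set codes to target term-set codes.
CROSS_MAP: Dict[str, str] = {"SRC-001": "TGT-A01", "SRC-002": "TGT-B07"}


def build_mapping_statements(table: str, column: str,
                             cross_map: Dict[str, str]) -> List[Tuple[str, Tuple[str, str]]]:
    """Build one parameterized UPDATE per source code that has a known target code."""
    sql = f'UPDATE "{table}" SET "{column}" = ? WHERE "{column}" = ?'
    return [(sql, (target, source)) for source, target in cross_map.items()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diagnosis (visit_id INTEGER, code TEXT)")
conn.executemany("INSERT INTO diagnosis VALUES (?, ?)",
                 [(1, "SRC-001"), (2, "SRC-002"), (3, "LOCAL-9")])
for sql, params in build_mapping_statements("diagnosis", "code", CROSS_MAP):
    conn.execute(sql, params)
mapped = conn.execute("SELECT code FROM diagnosis ORDER BY visit_id").fetchall()
```

The row with the local code has no entry in the cross-map, so it is left untouched; in the system described here, such terms would go on to fuzzy matching or the custom term process.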
4. Fuzzy matching module
The medical terms that cannot be mapped automatically are fuzzy-matched, one by one, against the standard terms in the general medical termbase, yielding recommended standard terms for mapping together with the codes of the standard term sets to which they belong. Fuzzy matching generally recommends several standard terms as candidates, and a professional with medical knowledge must manually select the single match. Once the mapping is determined, the automatic mapping module is called to complete the mapping of these terms and the loading of the data they cover. The specific method of fuzzy matching is as follows:
(1) Term segmentation
Most medical terms are ordered combinations of several words; here each medical term is further subdivided into words according to specific rules.
(1.1) Using this method, all terms in the general medical termbase are segmented, and the frequency of each segmented word is counted to give the base word frequency.
(1.2) The source medical term that requires fuzzy matching is also segmented before matching. For example, segmenting term M yields [word 1, word 2, … word n].
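The segmentation rules are not spelled out in the text, so the sketch below substitutes a common dictionary-based technique, forward maximum matching, purely to make steps (1.1) and (1.2) concrete. The vocabulary, the toy termbase and the function names `segment` and `base_word_frequency` are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def segment(term: str, vocabulary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest known word."""
    words: List[str] = []
    i = 0
    max_len = max(map(len, vocabulary))
    while i < len(term):
        for size in range(min(max_len, len(term) - i), 0, -1):
            piece = term[i:i + size]
            if size == 1 or piece in vocabulary:   # unknown single characters pass through
                words.append(piece)
                i += size
                break
    return words


def base_word_frequency(terms: List[str], vocabulary: Set[str]) -> Dict[str, float]:
    """Relative frequency of every segmented word across the whole termbase."""
    counts = Counter(w for t in terms for w in segment(t, vocabulary))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


VOCAB = {"阿胶", "益寿", "补血", "口服液", "颗粒", "钙"}      # hypothetical word list
LIBRARY = ["阿胶钙口服液", "阿胶颗粒", "阿胶补血口服液"]       # toy termbase
freq = base_word_frequency(LIBRARY, VOCAB)
words_m = segment("阿胶益寿口服液", VOCAB)
```

In this toy termbase eight words are produced in total, so a word's base frequency is simply its count divided by eight.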
(2) Fuzzy matching
The present invention uses the difference in probability between medical terms as the measure of similarity, as follows:
(2.1) Select from the general medical termbase all terms that contain a segmented word of M, segment them, and combine them into term set A{a,b,c,d,e,…};
(2.2) Compute the average weighted probability of term M and of every term in term set A using the following formula, where n is the number of segmented words obtained from each term and P1, P2, P3, P4…Pn are the probabilities of each segmented word in the base word frequency:
[Formula image PCTCN2020083586-appb-000002: the average weighted probability D of a term, expressed in terms of n and P1, P2, …, Pn]
(2.3) Take the difference between the average weighted probability of each standard term in term set A and that of the term M to be matched; the negative of the resulting value is used as the matching degree, so that a larger matching degree means higher similarity. The formula is as follows:
S(M,A) = -|D(M) - D(A)|
Take the term "Ejiao Yishou Oral Liquid" (阿胶益寿口服液) as an example:
a) Segment the terms of the general medical termbase and obtain the probability of each segmented word;
b) Segmenting the term "Ejiao Yishou Oral Liquid" gives "Ejiao \ Yishou \ Oral Liquid" (阿胶\益寿\口服液). Looking up the corresponding probabilities in the base word frequency gives p1 for "Ejiao", p2 for "Yishou" and p3 for "Oral Liquid", from which the average probability over its segmented words, D(M), is computed;
c) Query the general medical termbase for all terms containing "Ejiao", "Yishou" or "Oral Liquid" and segment them, giving the term set A{["Ejiao", "Calcium", "Oral Liquid"], ["Ejiao", "Granules"], ["Ejiao", "Buxue", "Oral Liquid"]…}, and compute D(a), D(b), D(c)…;
d) Compute the matching degrees and sort them:
| Term to be matched | General termbase term | Matching degree |
| Ejiao Yishou Oral Liquid | Ejiao Calcium Oral Liquid | S(M,a) |
| | Ejiao Buxue Oral Liquid | S(M,c) |
| | Ejiao Granules | S(M,b) |
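The ranking step of this example can be sketched in Python under two explicit assumptions: D is taken to be the plain arithmetic mean of the word probabilities (the exact weighting is given only in the formula image), and the matching degree is the negated absolute difference described in step (2.3). The probabilities in `BASE_FREQ` and the helper names are invented for illustration.

```python
from typing import Dict, List, Tuple


def mean_probability(words: List[str], base_freq: Dict[str, float]) -> float:
    """D(term): mean base-frequency probability of the term's words (0 if unseen)."""
    return sum(base_freq.get(w, 0.0) for w in words) / len(words)


def rank_candidates(source: List[str], candidates: Dict[str, List[str]],
                    base_freq: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score every candidate that shares a word with the source; best match first."""
    d_m = mean_probability(source, base_freq)
    shared = {name: words for name, words in candidates.items()
              if set(words) & set(source)}
    scores = [(name, -abs(d_m - mean_probability(words, base_freq)))
              for name, words in shared.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical base word probabilities and pre-segmented termbase entries.
BASE_FREQ = {"阿胶": 0.30, "口服液": 0.20, "颗粒": 0.10,
             "补血": 0.05, "钙": 0.03, "益寿": 0.02}
CANDIDATES = {
    "阿胶钙口服液": ["阿胶", "钙", "口服液"],
    "阿胶颗粒": ["阿胶", "颗粒"],
    "阿胶补血口服液": ["阿胶", "补血", "口服液"],
}
ranking = rank_candidates(["阿胶", "益寿", "口服液"], CANDIDATES, BASE_FREQ)
```

The resulting list is what would be shown to the medical professional, who then picks the single target term by hand.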
V. Custom term module
In complex cases, particularly in domestic medical data centers, where the data are heterogeneous and many medical terms relate to traditional Chinese medicine and traditional therapies, some terms cannot be matched to the international general medical term databases. The custom term module defines the necessary constraints in advance so that custom terms do not conflict with known standard terms; for example, custom terms are required to use a restricted code range.
When custom terms are added, the medical data centers must keep the custom standard terms already added consistent with one another, so that no term is added twice and multi-center medical data can still be shared after term mapping and standardization. Therefore, when terminology standardization mapping is performed on a medical data center's data, a report requesting the addition must be submitted to the multi-center interaction module before a custom term is added. The report contains the custom term to be added, a specific description of the term, and the code of the term (generated automatically by the system). Once the center's reviewers approve the report and confirm that no custom code exists for a similar or duplicate medical term, a custom standard term code is generated, and the automatic mapping module can then be called to complete the term mapping and load the data it covers. If the review fails, either the existing custom term code is returned so that the medical data center can complete the subsequent mapping, or the reason the custom term could not be generated is returned, and an error document is generated and shown to the user. The operation of the custom term module is illustrated in Figure 4.
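The review flow described above can be sketched as a small registry. This is only an illustration under assumptions: the bounds of the reserved code range are invented (the text says only that a restricted range is enforced), duplicate detection is reduced to a case-insensitive name comparison, whereas the text describes review by human operators, and the class, method, and field names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical reserved range for custom term codes (not specified in the source).
CUSTOM_CODE_MIN = 2_000_000_000
CUSTOM_CODE_MAX = 2_099_999_999


@dataclass
class CustomTermRegistry:
    """Central registry kept by the multi-center interaction module (sketch)."""
    terms: dict = field(default_factory=dict)  # normalized term -> (code, description)
    next_code: int = CUSTOM_CODE_MIN

    def submit(self, term, description, reviewer_approves=True):
        """Process one application and return the outcome as a dict."""
        key = term.strip().lower()
        if key in self.terms:
            # A matching custom term already exists: return its code for reuse.
            return {"status": "existing", "code": self.terms[key][0]}
        if not reviewer_approves:
            return {"status": "rejected", "reason": "declined by reviewer"}
        if self.next_code > CUSTOM_CODE_MAX:
            return {"status": "rejected", "reason": "custom code range exhausted"}
        code = self.next_code
        self.next_code += 1
        self.terms[key] = (code, description)
        return {"status": "created", "code": code}


registry = CustomTermRegistry()
first = registry.submit("Ejiao Yishou Oral Liquid", "TCM oral preparation")
second = registry.submit("  ejiao yishou oral liquid ", "same product, other center")
third = registry.submit("Unclear Term", "insufficient description", reviewer_approves=False)
```

Because the second application resolves to the code issued for the first, two centers that request the same term end up sharing one custom code, which is the consistency property the paragraph above asks for.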
VI. Multi-center interaction module
Data standardization and data sharing between the medical information systems of the medical data centers require all centers to use one unified general medical term database and one unified medical term set coding. The present invention has custom terms submitted for review and then added uniformly, which prevents the medical data centers from expressing terms differently when each defines its own standard terms. Submission, review, and authorization involve interaction among multiple medical data centers. The multi-center interaction module is responsible for coordinating and unifying the general medical term database and its term codes across the medical data centers, and the personnel with the highest authority in the module review and coordinate the use of custom standard terms. The multi-center custom term interaction network is shown in Figure 5.
VII. Incremental update module
This module serves the subsequent medical term standardization of medical data centers that have already performed medical term mapping. It updates incremental data mainly on the basis of the historical standardization mapping records produced by the term mapping unit; for medical terms that still cannot be mapped, the custom term module is executed again.
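A minimal sketch of this step, assuming the historical mapping records can be represented as a dictionary from source term to standard code. The function name, the data layout, and the codes shown are hypothetical.

```python
def apply_incremental(new_terms, history):
    """Split incremental terms into those covered by past mappings and the rest."""
    mapped, unresolved = {}, []
    for term in new_terms:
        if term in history:
            mapped[term] = history[term]   # reuse the recorded mapping
        else:
            unresolved.append(term)        # hand over to the custom term module
    return mapped, unresolved


# Hypothetical mapping records produced by an earlier standardization run.
history = {"hypertension": "STD-0001", "type 2 diabetes": "STD-0002"}
mapped, unresolved = apply_incremental(
    ["hypertension", "Ejiao Yishou Oral Liquid"], history
)
```

Terms left in `unresolved` correspond to the medical terms that "still cannot be mapped" and would go through the custom term application again.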
VIII. Exception handling module
This module saves all logs produced while the system runs and records whether each module ran normally. It stores error logs by category: errors that occur while the system is running, errors that occur when a module is called, and errors that occur when a running module maps an individual term. It also stores unmapped terms by category, including terms ignored in the automatic analysis module and terms ignored in the custom term module, and generates a failed-term document. By setting timestamps on the database, the exception handling module supports database rollback, allowing users to roll the matched database back to its state on a specified date.
IX. Data cleaning module
After the standardized mapping of medical terms is complete, cleaning the medical data is essential for raising data quality ahead of later data mining and analysis. Common data cleaning strategies are provided here, mainly targeting "dirty data" at the structure level and the instance level. Structure-level dirty data violates the data schema or integrity constraints, for example out-of-range values, broken attribute dependencies, broken uniqueness constraints, and broken referential integrity. Instance-level dirty data has values assigned to the wrong attribute or broken dependencies between attributes, for example missing values, duplicate records, contradictory records, and reference errors. Cleaning aims to satisfy data integrity, uniqueness, authority, legality, and consistency as far as possible, reduce data redundancy, and improve data quality.
1) Structure-level cleaning rules: a unified definition of the data schema (including data types); a unified definition of integrity constraints; a unified definition of functional dependency requirements.
2) Instance-level cleaning rules: analyze the dirty data, formulate cleaning rules, evaluate and verify them, and record every cleaning action in the log for traceability.
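The cleaning rules can be illustrated with a short sketch. It covers only three of the cases named above (missing values, out-of-range values, and duplicate records) and logs each action for traceability; the record layout, field names, and the age range are assumptions chosen for the example, and record values are assumed to be hashable.

```python
def clean_records(records, required, ranges):
    """Apply simple cleaning rules; return (clean_records, action_log)."""
    cleaned, log, seen = [], [], set()
    for index, record in enumerate(records):
        # Rule 1: required fields must be present and non-empty.
        missing = [f for f in required if record.get(f) in (None, "")]
        if missing:
            log.append((index, "dropped", f"missing {missing}"))
            continue
        # Rule 2: numeric fields must lie inside their allowed range.
        out_of_range = [
            f for f, (low, high) in ranges.items()
            if record.get(f) is not None and not low <= record[f] <= high
        ]
        if out_of_range:
            log.append((index, "dropped", f"out of range {out_of_range}"))
            continue
        # Rule 3: exact duplicates of an earlier record are removed.
        key = tuple(sorted(record.items()))
        if key in seen:
            log.append((index, "dropped", "duplicate record"))
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned, log


# Hypothetical records after term mapping.
records = [
    {"patient_id": "p1", "term_code": "A01", "age": 54},
    {"patient_id": "p1", "term_code": "A01", "age": 54},   # duplicate
    {"patient_id": "p2", "term_code": "", "age": 40},      # missing term code
    {"patient_id": "p3", "term_code": "B02", "age": 430},  # age out of range
]
cleaned, log = clean_records(
    records, required=["patient_id", "term_code"], ranges={"age": (0, 130)}
)
```

Dropping a record is the simplest possible action; a real deployment might instead repair or flag it, and the log entries are what would make either choice traceable.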
As data mining and analysis place ever higher demands on the quantity and quality of data, the present invention provides a collaborative model for sharing data among multiple medical data centers (mainly hospitals) while fully safeguarding the security of each center's data. The aim is that shared medical data will optimize medical processes, accelerate related scientific research, and ultimately improve the quality of medical services for patients. Data sharing among multiple medical data centers presupposes standardized medical data, which has two parts: standardization of the data structure and standardization of medical terminology. Everything described above is designed for the latter. The technical points of the present invention are summarized as follows:
1. Through the interaction of the modules, the databases in the medical information system are analyzed and scanned automatically, and statistics such as the occurrence frequency of each medical term in the database are returned, providing a factual basis for subsequent term mapping and performance optimization.
2. For data coded with an international general medical term set, the medical data covered by those terms is mapped automatically, giving priority to the existing mapping relationships between medical term set codings.
3. For data in a medical data center that is not coded with a standard term set, for custom medical terms, and for medical terms unique to China, such as traditional Chinese medicines, the relevant personnel can, guided by information such as how often the term appears in the center's data, perform reasonable fuzzy matching through a visual interface or directly add a custom standard term.
4. The interactions required among the multiple medical data centers are completed at regular intervals, ensuring that once each center has standardized its medical terms, the standards of the centers' general medical databases remain unified and data can be shared.
5. Data is cleaned according to the cleaning strategy to ensure data quality.
6. All errors and exceptions are recorded and written to the log, which facilitates troubleshooting and quality evaluation.
7. The established mappings between a medical data center's terms and the international standard medical term sets are fully exploited to map and standardize the terms of subsequent medical data centers semi-automatically or even automatically.
The above are only embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention without creative effort falls within the scope of protection of the present invention.

Claims (10)

  1. A multi-center medical terminology standardization system based on a general medical term database, characterized in that the system comprises a source database, a database connection management module, a pre-analysis module, a term mapping unit, an incremental update module, an exception handling module, and a multi-center interaction module;
    the source databases are distributed in the front-end servers of the medical data centers and store the business data of each medical data center;
    the database connection management module manages the information required to access the source database and supports the term mapping tool in accessing and modifying the source database;
    the pre-analysis module automatically scans the source database and counts the occurrence frequency of each medical term in the original medical data; for terms whose frequency is below a set threshold it suggests discarding them, and terms whose frequency is greater than or equal to the set threshold are sent to the term mapping unit for subsequent term mapping;
    the term mapping unit comprises an automatic mapping module, a fuzzy matching module, and a custom term module;
    the automatic mapping module supports automated mapping of medical terms; for terms that use the standard codes of international general medical term databases, it performs multi-directional mapping according to the existing mapping relationships between those standard codes;
    the fuzzy matching module: for medical terms that cannot be mapped directly from the mapping relationships between the standard codes within the existing medical term databases, it traverses the general medical term database by fuzzy matching and offers the several groups of standard medical terms with the highest similarity, from which the target term of the mapping is selected;
    the custom term module: for medical terms that can neither be mapped through the mapping relationships between the standard codes in the existing medical term databases nor be fuzzily matched to a target term in the existing general medical term database, after the user generates a custom term application, the application is sent to the multi-center interaction module for review and feedback;
    the multi-center interaction module: after receiving the custom term applications of the medical data centers sent by the custom term module, it reviews the custom terms, adds the approved custom terms to the general medical term database as standard terms, and sends them to every medical data center, keeping the general medical term databases of the medical data centers consistent;
    the incremental update module: for the medical term standardization of incremental data that a source database generates for business reasons after standardized term mapping has already been performed on it, it retrieves the historical mapping records produced by the term mapping unit to complete the standardized term mapping of the incremental data;
    the exception handling module records the execution of each of the above modules and generates an error log when an error occurs; the whole medical term mapping process can be traced back from the error log.
  2. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the system further comprises a data cleaning module configured to formulate cleaning rules, assign a weight to each data element, and filter out data with severe missing values, including cleaning dirty data at the structure level and the instance level.
  3. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the database connection management module specifically comprises a JDBC module composed of classes and interfaces written in a programming language, which provides a unified access interface for multiple types of databases and implements the functions of establishing connections to databases or other data sources, sending SQL commands to the database, and processing the results returned by the database.
  4. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that, after the database connection management module has connected to the source database, the pre-analysis module automatically scans the structural information of all data in the source database and the statistics of its specific fields and generates a statistical table comprising two parts:
    first, summary statistics for all tables in the source database, including the field names in each table, the value types, the maximum length among all values, the total number of rows in the table, and the proportion of null values;
    second, detailed information and occurrence frequencies for the specific terms in a given table, sorted by frequency from high to low, so that subsequent term mapping can process the more frequent terms first; the system advises whether terms with very low frequency need to take part in subsequent term mapping; when no threshold is defined, all terms take part in the mapping by default, and the user may also adjust this according to the specific situation, thereby determining the minimum occurrence frequency threshold below which terms do not take part in subsequent term mapping.
  5. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the automatic mapping module: for terms in the source database that carry standard codes of an international general medical term database, after the standard to which the code belongs is determined, the target term set to be mapped to is selected; if a referenceable mapping relationship already exists between the coding of the standard term set to which the source terms belong and the coding of the target term set, mapping SQL statements can be generated automatically for these terms, completing the automatic mapping of the terms in the source database and the corresponding data loading.
  6. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the specific fuzzy matching method in the fuzzy matching module is as follows:
    (1) Term segmentation: segment all the vocabulary in the general medical term database and count the frequency of each segment as the base word frequency; segment the source medical term M that requires fuzzy matching before matching.
    (2) Fuzzy matching: the difference in probability between medical terms is used as the measure of similarity; the specific operations are as follows:
    (2.1) select from the general medical term database all terms that contain any of the segments, segment them, and combine them into term set A;
    (2.2) compute the matching degree using the following formula, obtaining the average weighted probability of term M and of every term in term set A, where n is the number of segments obtained for each term and P1, P2, P3, P4…Pn are the probabilities of the segments in the base word frequency:
    Figure PCTCN2020083586-appb-100001
    (2.3) take the difference between the average weighted probability of each standard term in term set A and that of the term M to be matched, and take the negative of the resulting value as the matching degree; the greater the matching degree, the higher the similarity between the two; the formula is as follows:
    S(M,A)=|D(M)-D(A)|
  7. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the custom term module defines constraints in advance so that custom terms do not conflict with known standard terms; when custom terms are added, the medical data centers must keep the custom standard terms already added consistent with one another, so that no term is added twice and multi-center medical data can be shared after term mapping and standardization. Before a custom term is added, an application to add it must be submitted to the multi-center interaction module; the application contains the custom term to be added, a specific description of the custom term, and the code of the custom term; once the relevant operators of the multi-center interaction module approve the application and confirm that no custom code exists for a similar or duplicate medical term, a custom standard term code is generated, and the automatic mapping module can then be called to complete the term mapping and load the data it covers; if the review fails, either the existing custom term code is returned so that the medical data center can complete the subsequent mapping, or the reason the custom term could not be generated is returned, and an error document is generated and shown to the user.
  8. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the multi-center interaction module is responsible for coordinating and unifying the general medical term database and its term codes across the medical data centers, and the personnel with the highest authority in the multi-center interaction module review and coordinate the use of custom standard terms.
  9. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the incremental update module is used for the subsequent medical term standardization of a medical data center that has already performed medical term mapping; it updates incremental data mainly on the basis of the historical standardization mapping records produced by the term mapping unit, and for medical terms that still cannot be mapped, the custom term module is executed again.
  10. The multi-center medical terminology standardization system based on a general medical term database according to claim 1, characterized in that the exception handling module saves all logs produced while the system runs and records whether each module ran normally; it stores error logs by category, including errors that occur while the system is running, errors that occur when a module is called, and errors that occur when a running module maps an individual term; it stores unmapped terms by category, including terms ignored in the automatic analysis module and terms ignored in the custom term module, and generates a failed-term document; by setting timestamps on the database, the exception handling module supports database rollback, allowing users to roll the matched database back to its state on a specified date.
PCT/CN2020/083586 2019-07-12 2020-04-07 General medical termbase-based multi-center medical terminology standardization system WO2020233256A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021533326A JP7093593B2 (en) 2019-07-12 2020-04-07 Multi-center medical term standardization system based on general-purpose medical term library

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910629244.9 2019-07-12
CN201910629244.9A CN110349639B (en) 2019-07-12 2019-07-12 Multi-center medical term standardization system based on general medical term library

Publications (1)

Publication Number Publication Date
WO2020233256A1 true WO2020233256A1 (en) 2020-11-26

Family

ID=68176052

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083586 WO2020233256A1 (en) 2019-07-12 2020-04-07 General medical termbase-based multi-center medical terminology standardization system

Country Status (3)

Country Link
JP (1) JP7093593B2 (en)
CN (1) CN110349639B (en)
WO (1) WO2020233256A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395854A (en) * 2020-12-02 2021-02-23 中国标准化研究院 Standard element consistency inspection method
CN112988966A (en) * 2021-03-04 2021-06-18 中建海峡建设发展有限公司 Voice interaction construction log management system and implementation method
CN113342793A (en) * 2021-06-18 2021-09-03 立信(重庆)数据科技股份有限公司 Investigation data standardization method and system
CN113704555A (en) * 2021-07-16 2021-11-26 杭州医康慧联科技股份有限公司 Feature management method based on medical direction federal learning
CN113764086A (en) * 2021-08-17 2021-12-07 卫宁健康科技集团股份有限公司 Nursing information processing system and method based on JHNEBP model
CN113836126A (en) * 2021-09-22 2021-12-24 上海妙一生物科技有限公司 Data cleaning method, device, equipment and storage medium
CN115080751A (en) * 2022-08-16 2022-09-20 之江实验室 Medical standard term management system and method based on general model
CN115712839A (en) * 2022-11-14 2023-02-24 国网山东省电力公司日照供电公司 Automatic matching system and method for communication model of relay protection device
CN115952770A (en) * 2023-03-15 2023-04-11 广州汇通国信科技有限公司 Data standardization processing method and device, electronic equipment and storage medium
CN116386799A (en) * 2023-06-05 2023-07-04 数据空间研究院 Medical data acquisition and standard conversion method and system

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349639B (en) * 2019-07-12 2022-01-04 之江实验室 Multi-center medical term standardization system based on general medical term library
CN111126018B (en) * 2019-11-25 2023-08-08 泰康保险集团股份有限公司 Form generation method and device, storage medium and electronic equipment
CN110990591A (en) * 2019-12-26 2020-04-10 北京亚信数据有限公司 Method and system for auditing transcoding quality of medical data
CN111291225B (en) * 2020-05-08 2020-08-11 成都金盘电子科大多媒体技术有限公司 Method and system for quickly verifying medical health information data standard
CN112035451A (en) * 2020-08-25 2020-12-04 上海灵长软件科技有限公司 Data verification optimization processing method and device, electronic equipment and storage medium
CN112069774A (en) * 2020-09-03 2020-12-11 微医云(杭州)控股有限公司 Data mapping method and device, electronic terminal and storage medium
CN112347266A (en) * 2020-09-11 2021-02-09 湖南中医药大学 Special term standardization system for children rehabilitation
CN112052667B (en) * 2020-09-27 2024-05-03 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for realizing medical coding mapping
CN112365939B (en) * 2020-10-14 2023-04-07 山东大学 Data management method and system based on medical health big data
CN112633005A (en) * 2020-11-11 2021-04-09 上海数创医疗科技有限公司 Electrocardio term semantic matching method
CN112883157B (en) * 2021-02-07 2023-04-07 武汉大学 Method and device for standardizing multi-source heterogeneous medical data
CN112951355B (en) * 2021-02-25 2023-05-02 武汉大学 Quality inspection function method and device for warehousing massive medical data
CN112817945A (en) * 2021-03-03 2021-05-18 江苏汇鑫融智软件科技有限公司 Medical heterogeneous system data warehouse construction method based on ESB
CN113239115B (en) * 2021-05-19 2023-06-02 中国医学科学院医学生物学研究所 Quick and accurate synchronization method for vaccine adverse reaction batch data
CN113377897B (en) * 2021-05-27 2022-04-22 杭州莱迈医疗信息科技有限公司 Multi-language medical term standard standardization system and method based on deep confrontation learning
CN113656604B (en) * 2021-10-19 2022-02-22 之江实验室 Medical term normalization system and method based on heterogeneous graph neural network
CN114003791B (en) * 2021-12-30 2022-04-08 之江实验室 Depth map matching-based automatic classification method and system for medical data elements
CN114461714B (en) * 2022-01-13 2024-03-29 湖北国际物流机场有限公司 BIM code conversion system
CN114595668A (en) * 2022-01-28 2022-06-07 北京医鸣技术有限公司 Method, platform, medium and equipment for standardizing medical diagnosis terms
CN116110560A (en) * 2023-04-13 2023-05-12 杭州璞睿生命科技有限公司 Method, device, equipment and medium for docking clinical diagnosis and treatment data to EDC system
CN116167354B (en) * 2023-04-19 2023-07-07 北京亚信数据有限公司 Medical term feature extraction model training and standardization method and device
CN116737697B (en) * 2023-08-10 2023-10-20 云筑信息科技(成都)有限公司 Method and device for managing main data of materials in construction industry and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452503A (en) * 2008-11-28 2009-06-10 上海生物信息技术研究中心 Isomerization clinical medical information shared system and method
US20160342746A1 (en) * 2015-05-21 2016-11-24 Naveen Sarabu Cloud-Based Medical-Terminology Manager and Translator
CN107978341A (en) * 2017-12-22 2018-05-01 南京昂特医信数据技术有限公司 Isomeric data adaptation method and its system under a kind of medicine semantic frame based on linguistic context
CN109408820A (en) * 2018-10-17 2019-03-01 长沙瀚云信息科技有限公司 A kind of medical terminology mapped system and method, equipment and storage medium
CN109446340A (en) * 2018-10-17 2019-03-08 长沙瀚云信息科技有限公司 A kind of Medicine standard term ontology management system and method, equipment and storage medium
CN110349639A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter medical terms standardized system based on common therapy terminology bank

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7580831B2 (en) 2002-03-05 2009-08-25 Siemens Medical Solutions Health Services Corporation Dynamic dictionary and term repository system
KR100538577B1 (en) 2003-07-14 2005-12-22 이지케어텍(주) Method For Standardization Of Computerization Of Medical Information
JP4955197B2 (en) 2004-09-07 2012-06-20 株式会社日本医療データセンター Receipt file generation system
JP4661415B2 (en) 2005-07-13 2011-03-30 株式会社日立製作所 Expression fluctuation processing system
US7610192B1 (en) * 2006-03-22 2009-10-27 Patrick William Jamieson Process and system for high precision coding of free text documents against a standard lexicon
US10204703B2 (en) 2014-11-10 2019-02-12 Accenture Global Services Limited Medical coding management system using an intelligent coding, reporting, and analytics-focused tool
JP2016200978A (en) 2015-04-10 2016-12-01 株式会社日立製作所 Training data generation device
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
KR101878217B1 (en) 2016-11-07 2018-07-13 경희대학교 산학협력단 Method, apparatus and computer program for medical data
WO2019016054A1 (en) 2017-07-18 2019-01-24 Koninklijke Philips N.V. Mapping of coded medical vocabularies
CN109033080B (en) * 2018-07-12 2023-03-24 上海金仕达卫宁软件科技有限公司 Medical term standardization method and system based on probability transfer matrix

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452503A (en) * 2008-11-28 2009-06-10 上海生物信息技术研究中心 Isomerization clinical medical information shared system and method
US20160342746A1 (en) * 2015-05-21 2016-11-24 Naveen Sarabu Cloud-Based Medical-Terminology Manager and Translator
CN107978341A (en) * 2017-12-22 2018-05-01 南京昂特医信数据技术有限公司 Context-based heterogeneous data adaptation method and system under a medical semantic framework
CN109408820A (en) * 2018-10-17 2019-03-01 长沙瀚云信息科技有限公司 Medical terminology mapping system and method, device, and storage medium
CN109446340A (en) * 2018-10-17 2019-03-08 长沙瀚云信息科技有限公司 Medical standard term ontology management system and method, device, and storage medium
CN110349639A (en) * 2019-07-12 2019-10-18 之江实验室 Multi-center medical terminology standardization system based on a general medical termbase

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, YANG ET AL.: "A Thinking on the Establishment of Medical Terms Database and Standardization of Terminology", JIANGSU HEALTHCARE ADMINISTRATION, vol. 27, no. 4, 31 August 2016 (2016-08-31), DOI: 20200615112156Y *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395854B (en) * 2020-12-02 2022-11-22 中国标准化研究院 Standard element consistency inspection method
CN112395854A (en) * 2020-12-02 2021-02-23 中国标准化研究院 Standard element consistency inspection method
CN112988966A (en) * 2021-03-04 2021-06-18 中建海峡建设发展有限公司 Voice interaction construction log management system and implementation method
CN113342793A (en) * 2021-06-18 2021-09-03 立信(重庆)数据科技股份有限公司 Research data standardization method and system
CN113342793B (en) * 2021-06-18 2023-04-07 立信(重庆)数据科技股份有限公司 Research data standardization method and system
CN113704555A (en) * 2021-07-16 2021-11-26 杭州医康慧联科技股份有限公司 Feature management method based on federated learning in the medical field
CN113704555B (en) * 2021-07-16 2023-11-07 杭州医康慧联科技股份有限公司 Feature management method based on federated learning in the medical field
CN113764086A (en) * 2021-08-17 2021-12-07 卫宁健康科技集团股份有限公司 Nursing information processing system and method based on JHNEBP model
CN113836126A (en) * 2021-09-22 2021-12-24 上海妙一生物科技有限公司 Data cleaning method, device, equipment and storage medium
CN113836126B (en) * 2021-09-22 2024-01-30 上海妙一生物科技有限公司 Data cleaning method, device, equipment and storage medium
CN115080751B (en) * 2022-08-16 2022-11-11 之江实验室 Medical standard term management system and method based on general model
CN115080751A (en) * 2022-08-16 2022-09-20 之江实验室 Medical standard term management system and method based on general model
CN115712839A (en) * 2022-11-14 2023-02-24 国网山东省电力公司日照供电公司 Automatic matching system and method for relay protection device communication model
CN115712839B (en) * 2022-11-14 2023-10-24 国网山东省电力公司日照供电公司 Automatic matching system and method for relay protection device communication model
CN115952770A (en) * 2023-03-15 2023-04-11 广州汇通国信科技有限公司 Data standardization processing method and device, electronic equipment and storage medium
CN116386799A (en) * 2023-06-05 2023-07-04 数据空间研究院 Medical data acquisition and standard conversion method and system
CN116386799B (en) * 2023-06-05 2023-08-18 数据空间研究院 Medical data acquisition and standard conversion method and system

Also Published As

Publication number Publication date
CN110349639A (en) 2019-10-18
CN110349639B (en) 2022-01-04
JP7093593B2 (en) 2022-06-30
JP2022508350A (en) 2022-01-19

Similar Documents

Publication Publication Date Title
WO2020233256A1 (en) General medical termbase-based multi-center medical terminology standardization system
CN110415831B (en) Medical big data cloud service analysis platform
US20130046529A1 (en) Method and System for Classification of Clinical Information
CN112151170A (en) Method for calculating a score of a medical advice for use as a medical decision support
CN112801488B (en) Real-time control optimization method and system for clinical test quality
CN111667894A (en) RETE algorithm rule engine-based hospital electronic medical record quality monitoring and management system
KR101926632B1 (en) A method of case learning and comments generation for clinical pathology examination using rule optimization
CN112420176A (en) Hierarchical diagnosis guide system based on structured information base
CN111984640A (en) Portrait construction method based on multi-element heterogeneous data
Azeroual et al. Without data quality, there is no data migration
Hart et al. Meeting health care research needs in a Kimball integrated data warehouse
CN107194143A (en) Medicine information data processing method and system
CN116013448A (en) System for automatically generating statistical analysis plan and report of clinical test project
CN116010439A (en) Visual Chinese SQL system and query construction method
US20150356130A1 (en) Database management system
CN113270201A (en) Medical information data verification method and system and computer readable storage medium
Kirsten et al. Metadata management for data integration in medical sciences
Klann et al. Modeling the information-value decay of medical problems for problem list maintenance
Röhrig In search for methods to support electronic patient recruitment in a multi-ICU clinical trial
Arias The benefits of graph databases for the computation of clinical quality measures
Vardaki et al. A statistical metadata model for clinical trials’ data management
CN109144990A (en) Metadata-driven power communication big data quality control method
CN117038002B (en) Method and device for generating observation variable in drug evaluation research
CN116825311B (en) DRG/DIP-based hospital management and control operation method and system
CN117995419A (en) Data quality control method, system, terminal and storage medium for medical data

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 20810756; Country of ref document: EP; Kind code of ref document: A1

ENP Entry into the national phase
Ref document number: 2021533326; Country of ref document: JP; Kind code of ref document: A

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 20810756; Country of ref document: EP; Kind code of ref document: A1