CN113221528A

CN113221528A - Automatic generation and execution method of clinical data quality evaluation rule based on openEHR model

Info

Publication number: CN113221528A
Application number: CN202110507026.5A
Authority: CN
Inventors: 吕旭东; 段会龙; 田琪; 韩喆僖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2021-08-06
Anticipated expiration: 2041-05-10
Also published as: CN113221528B

Abstract

The invention discloses a method for automatically generating and executing clinical data quality evaluation rules based on an openEHR model, comprising the following steps: (1) establishing a mapping relation between the data quality constraint knowledge of each node in an openEHR model and data quality evaluation rules; (2) acquiring the openEHR model in use and parsing it into a JSON object in a fixed format; (3) processing each node in all openEHR models according to the mapping relation and the JSON object to generate data quality evaluation rules with a fixed structure; (4) processing the data quality evaluation rules to generate a rule execution configuration file; (5) deploying the relevant information of the database to be evaluated in a Spark rule execution engine; (6) executing, by the Spark rule execution engine, the data quality evaluation rules by calling functions according to the rule execution configuration file and the relevant information of the database to be evaluated, to obtain a data quality evaluation result.

Description

Automatic generation and execution method of clinical data quality evaluation rule based on openEHR model

Technical Field

The invention belongs to the technical field of electronic medical record data quality evaluation, and particularly relates to an automatic generation and execution method of clinical data quality evaluation rules based on an openEHR model.

Background

Health and medical big data is an important basic strategic resource of the country. The electronic medical record is one of its core databases, and the stored data is of great value for medical care, scientific research, public health and other purposes. However, the value of the data depends on the data being of high quality and fit for research use, whereas electronic medical record data in China commonly suffers from quality problems such as missing, erroneous, invalid, incomplete and inconsistent values, which directly affect how well the data can be applied. Data quality assessment helps data users find data quality problems so that appropriate measures can be taken to improve data quality, and it is an indispensable step in integrating and using electronic medical record data.

For structured electronic medical record data, the rule-based data quality evaluation method is highly general, easy to implement and the most widely used. In this method, defining the rules to be applied to the data is the starting point of data quality evaluation and directly affects the evaluation result; it is also a time-consuming step of the evaluation process that requires manual participation. Most current evaluation methods involve creating data quality queries and writing data quality evaluation rules in Structured Query Language (SQL). However, an electronic medical record contains hundreds of data items, and a single data quality evaluation task usually needs to run hundreds of rules, so the workload is heavy and the work is time-consuming and labor-intensive.

Existing research uses parameterization to define rules through structures such as variables, functions and parameters, which simplifies rule definition. However, the parameters of each rule still need to be defined in practice, so the heavy, time-consuming and labor-intensive workload of rule definition has not been effectively resolved.

The medical information model is a standard information model, expresses clinical concepts in a standardized and reusable mode, provides a clinical data standard structure, can bind standard medical terms, and meets the requirement of consistency of clinical information expression and storage modes.

The openEHR model is a representative layered medical information model and is divided into a reference model and a prototype (archetype) model. The reference model is a general basic model that defines the semantics and structure of information and is processed at the syntactic level. The prototype model is composed of prototypes and templates: a prototype represents a concept or a set of data elements of an information domain and is defined by constraining the data structures in the reference model, while templates further assemble and constrain prototypes to meet the requirements of specific scenarios. Building and using information models is the basis for establishing a standardized electronic medical record system. Because the information model contains requirements related to data quality, it can be used as a knowledge source for automatically generating data quality evaluation rules.

Disclosure of Invention

In view of the above, an embodiment of the present invention provides an automatic generation and execution method for clinical data quality assessment rules based on an openEHR model, including the following steps:

(1) establishing a mapping relation between data quality constraint knowledge of each node in an openEHR model and a data quality evaluation rule;

(2) acquiring the openEHR model in use and parsing it into a JSON object in a fixed format, together with the node information contained in the JSON object;

(3) processing each node in all openEHR models according to the mapping relation, the JSON object and the node information contained in the JSON object to generate a data quality evaluation rule with a fixed structure;

(4) processing the data quality evaluation rule to generate a rule execution configuration file;

(5) deploying relevant information of a database to be evaluated in a Spark rule execution engine;

(6) executing, by the Spark rule execution engine, the data quality evaluation rules by calling functions according to the rule execution configuration file and the relevant information of the database to be evaluated, to obtain a data quality evaluation result.

In one embodiment, in the step (1), when the mapping relationship between the data quality constraint knowledge and the data quality evaluation rule is constructed, the data types in the openEHR model are used as a classification basis, and the construction is performed by combining the attributes and keywords of each type.

In one embodiment, the step (1) specifically includes:

(1-1) analyzing the data acquisition specifications issued during the construction of a regional health information platform to obtain the data quality evaluation requirements and data quality evaluation rules of the electronic medical record;

(1-2) analyzing the structure of the openEHR model according to data quality evaluation requirements, and extracting related data quality constraint knowledge in the openEHR model, wherein the openEHR model comprises an openEHR reference model and an openEHR prototype model;

(1-3) establishing a mapping relation between the data quality evaluation rule of each node and the data quality constraint knowledge in the openEHR model, using the data types defined in the openEHR reference model as the classification basis.

In one embodiment, in step (2), each piece of node information in the obtained JSON object includes a node name, a node path, the database table name and column name corresponding to the node, a data type and data constraint information, and the structure of the data constraint information differs between nodes of different data types.

In one embodiment, in step (3), the generated data quality evaluation rule includes: a rule identifier, the rule content and the database information corresponding to the node used by the rule, wherein the rule content is defined using a GDL (Guideline Definition Language) structure.

In one embodiment, in step (3), the rule content included in the generated data quality evaluation rule is bound to a node of an openEHR model, which facilitates reuse of the data quality evaluation rule.

In one embodiment, in step (3), the rule identifier included in the generated data quality evaluation rule comprises the openEHR template id, the node path, the local path and the keyword. The identifier marks an automatically generated data quality evaluation rule as having been generated from a particular item of constraint information of a particular node of a particular template, which makes the rule easy to update and maintain.

In one embodiment, in step (4), the generated rule execution configuration file is used to define the rule execution flow and includes: a database table name, a column name, the name of the called method, the parameters required by the method, and the logical relations between rules, wherein the logical relations comprise AND, OR and separation, and the parameters required by the method include maximum values, minimum values, data formats and the like.

In one embodiment, in step (6), the Spark rule execution engine executes the data quality evaluation rule by calling a function, and includes:

defining a corresponding function according to the data quality evaluation requirement obtained by analysis to realize the function of a data quality evaluation rule;

taking the rule execution configuration file as input, analyzing parameters in the rule execution configuration file, and processing the created DataFrame of the database table to be evaluated;

for rules connected by AND logic, the input of the latter rule is the data conforming to the former rule; for rules connected by OR logic, the data conforming to the rules on both sides of the OR are merged as the overall execution result of the rule; and encapsulating the processed data quality evaluation result into an object and returning the object to the user.

In one embodiment, the data quality assessment result obtained in step (6) includes a total data amount, a failed data ID, and a failed data value.

The automatic generation and execution method of clinical data quality evaluation rules based on the openEHR model provided by the embodiment can significantly reduce the workload of manually defining evaluation rules, can automatically compile the data quality evaluation results, and provides a mechanism for updating and maintaining rules that helps a user manage them without knowing the structure of the underlying database. Compared with the prior art, the invention has the following beneficial technical effects:

1) The time and labor cost of manually defining rules is reduced; the data quality evaluation rules are generated automatically from the data constraint knowledge in the openEHR model templates, which significantly reduces the workload of manual rule definition.

2) Reuse of the rules is facilitated; a rule is bound directly to a template node of the openEHR model rather than to the structure of the underlying database, and template nodes are composed of general-purpose prototype nodes. Where the same prototype node is used but the underlying database structures differ, the rule content does not need to change, and the rule can be quickly reused by re-acquiring the mapping relation between the node and the database.

3) Management of the rules is convenient; because rules are identified by automatically generated rule identifiers, the knowledge source of an automatically generated rule can be located quickly. After the user modifies a template, the rule can be quickly updated through the rule identifier, which facilitates rule management.

4) The method is general; for any database based on the openEHR standard, the method can be used to generate rules automatically and evaluate the quality of the data.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of a method for automatic generation and execution of clinical data quality assessment rules based on an openEHR model in one embodiment;

FIG. 2 is a diagram illustrating the parsing format of an openEHR model DV_COUNT type node in one embodiment;

FIG. 3 is a flow diagram of processing openEHR template nodes to automatically generate rules, in one embodiment;

FIG. 4 is a flow diagram that illustrates the rule base update function in one embodiment;

FIG. 5 is a diagram of a rule execution configuration file in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

FIG. 1 is a flow diagram of a method for automatically generating and executing clinical data quality assessment rules based on an openEHR model in one embodiment. As shown in FIG. 1, the automatic generation and execution method of the clinical data quality evaluation rule provided by the embodiment includes the following steps:

Step 1, establishing a mapping relation between data quality constraint knowledge of each node in an openEHR model and a data quality evaluation rule.

The establishment of the mapping relation in step 1 is the basis for implementing the method. In one embodiment, the establishment process of the mapping relation in step 1 includes:

First, the data acquisition specifications issued during the construction of a regional health information platform are analyzed to obtain the data quality evaluation requirements and data quality evaluation rules of the electronic medical records.

In the embodiment, the data quality assessment requirements of the electronic medical records are analyzed according to the data acquisition specifications issued during the construction of the regional health information platform, the data quality assessment conditions are summarized, and the requirements are classified according to a clinical data quality framework to obtain the data quality assessment requirements for each type of data.

Then, the structure of the openEHR model is analyzed according to the data quality evaluation requirements, and the related data quality constraint knowledge in the openEHR model is extracted, wherein the openEHR model comprises an openEHR reference model and an openEHR prototype model.

In an embodiment, after performing a structural analysis on the openEHR model, an openEHR reference model and an openEHR prototype model may be determined, where the openEHR prototype model includes a prototype model and a template model, and constraint definitions are added to the openEHR prototype model and the openEHR reference model through the openEHR template.

Knowledge analysis is carried out on the openEHR prototype model according to the data quality evaluation requirements, and the related data quality constraint knowledge in the openEHR prototype model is extracted. An openEHR prototype is expressed using the Archetype Definition Language (ADL), in which the cADL (constraint form of ADL) and dADL (data definition form of ADL) syntaxes describe data constraints. The two syntaxes are analyzed against the summarized data quality assessment requirements, and the data quality constraint knowledge that corresponds to those requirements is extracted.

Knowledge analysis is likewise carried out on the openEHR reference model according to the data quality evaluation requirements, and the related data quality constraint knowledge in the openEHR reference model is extracted. In the reference model, the information model of general concepts and the information model of data types are the parts most relevant to the data quality assessment requirements; their structures and attributes are analyzed, and the data quality constraint knowledge that corresponds to the data quality assessment requirements is extracted.

The data quality constraint knowledge obtained from the openEHR prototype model and the openEHR reference model is collectively referred to as the data quality constraint knowledge obtained from the openEHR model. Once this knowledge has been obtained, a mapping relation between data quality evaluation rules and the data quality constraint knowledge in the openEHR model is established, using the data types defined in the openEHR reference model as the classification basis. In general, one data type contains multiple items of data quality constraint knowledge and corresponds to multiple data quality evaluation requirements.

The data quality constraint knowledge and the data quality assessment requirements contained in the openEHR model keep growing as informatization develops. The mapping relation established by the method can be extended continuously to meet this need while remaining stable over a period of time, so the method is extensible.

Step 2, acquiring the openEHR model in use and parsing it into a JSON object in a fixed format, together with the node information contained in the JSON object.

In an embodiment, the template needs to be parsed in order to extract the data quality constraint knowledge in the openEHR template. The data quality constraint knowledge of each node of the openEHR template is extracted using the mapping relation determined in step 1, yielding the JSON object and the node information contained in the JSON object.

In one embodiment, the extraction result of a DV_COUNT type node is shown in FIG. 2. It is represented as a JSON structure and mainly includes the node path, the database structure information corresponding to the node, the node data type, the node name and the node value range, which correspond to the keywords "elementary path", "cdrInfo", "type", "ontology" and "range", respectively. Different data types contain different data quality constraint knowledge and therefore have different structures, but the first four keys are common to all of them.
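To make the parsed structure concrete, the following Python sketch shows what one DV_COUNT node might look like. The exact key spellings (`elementPath`, `table`, `column`), the archetype path and all values are assumptions chosen for illustration; the actual format is the one defined in FIG. 2.

```python
import json

# Hypothetical parsed form of one DV_COUNT node. Key spellings and values are
# illustrative assumptions, not the format shown in FIG. 2.
dv_count_node = {
    "elementPath": "/content[openEHR-EHR-OBSERVATION.example.v1]/data/items[at0004]",
    "cdrInfo": [{"table": "observation_example", "column": "pulse_count"}],
    "type": "DV_COUNT",
    "ontology": "Pulse count",
    "range": {"min": 0, "max": 300},
}

# The four keys that the text describes as common to every data type.
COMMON_KEYS = ("elementPath", "cdrInfo", "type", "ontology")


def has_common_keys(node: dict) -> bool:
    """Check that a parsed node carries the four keys shared by all data types."""
    return all(key in node for key in COMMON_KEYS)


print(json.dumps(dv_count_node, indent=2))
print(has_common_keys(dv_count_node))
```

Only the `range` entry is specific to DV_COUNT in this sketch; a node of another data type would carry a different constraint block next to the same four common keys.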

Step 3, processing each node in all openEHR models according to the mapping relation, the JSON object and the node information contained in the JSON object to generate a data quality evaluation rule with a fixed structure.

In the step 3, a JSON structure extracted from each node of the openEHR template model is mainly used for processing, and a data quality evaluation rule of a fixed structure is generated. As shown in fig. 3, the method specifically includes:

(a) acquiring a JSON object of a current node, judging whether the node is used in a database or not through a cdrInfo (clinical data center information) structure, namely judging whether the cdrInfo is empty or not, processing nodes of which the structure is not empty, and skipping nodes of which the structure is empty;

(b) cdrInfo is an array structure, so it is traversed in order to process every database field corresponding to the node. Every type of node contains non-null constraint Complex (null) knowledge, association constraint Element existln knowledge, and data type constraint Element type knowledge, so these three constraints are handled first. For example, if a node is required to be non-null, a corresponding rule is generated;

(c) judging the type of the node: if the node is of DV_IDENTIFIER (identifier) type, an Element is unique rule is generated for the uniqueness constraint of the node; if the node is of DV_CODED_TEXT (coded text) type, an Element CODED by rule is generated for the coding requirement of the node; if the node is of DV_COUNT or DV_INTERVAL<DV_COUNT> type, Compare (DataValue) and Element precision rules are generated for the data range and data precision requirements of the node; if the node is of DV_DATETIME, DV_DATE or DV_TIME type, an Element format rule is generated for the data format requirement;

(d) if the node is of DV_QUANTITY type, it contains both numerical range and unit information, which are distinguished through localPath (local path), and a rule is then generated for the numerical range required for each unit; the DV_INTERVAL<DV_QUANTITY> type includes an upper node and a lower node, each of DV_QUANTITY type, and the upper node and the lower node are each processed according to the above process.
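Steps (a) to (d) can be sketched in Python as follows. This is a simplified illustration and not the patented implementation: the rule keywords (`type`, `not_null`, `exists_in`, `unique`, `coded_by`, `range`, `precision`, `format`) and the node keys `required`, `references` and `units` are hypothetical names, and the upper/lower handling of DV_INTERVAL<DV_QUANTITY> is omitted for brevity.

```python
from typing import Dict, List

# Type-specific rule keywords for step (c). All keyword names are hypothetical.
TYPE_RULES: Dict[str, List[str]] = {
    "DV_IDENTIFIER": ["unique"],
    "DV_CODED_TEXT": ["coded_by"],
    "DV_COUNT": ["range", "precision"],
    "DV_INTERVAL<DV_COUNT>": ["range", "precision"],
    "DV_DATETIME": ["format"],
    "DV_DATE": ["format"],
    "DV_TIME": ["format"],
}


def generate_rules(node: dict) -> List[dict]:
    """Return one rule stub per (database field, constraint keyword) pair."""
    rules = []
    # Step (a): a node whose cdrInfo is empty is not used in the database.
    for field in node.get("cdrInfo") or []:
        # Step (b): constraints shared by every node type come first.
        keywords = ["type"]
        if node.get("required"):
            keywords.append("not_null")
        if field.get("references"):
            keywords.append("exists_in")
        # Step (c): constraints that depend on the data type.
        keywords += TYPE_RULES.get(node["type"], [])
        # Step (d): a DV_QUANTITY node gets one range rule per unit.
        if node["type"] == "DV_QUANTITY":
            keywords += [f"range[{unit}]" for unit in node.get("units", {})]
        for keyword in keywords:
            rules.append(
                {"keyword": keyword, "table": field["table"], "column": field["column"]}
            )
    return rules


node = {
    "type": "DV_COUNT",
    "required": True,
    "cdrInfo": [{"table": "observation_example", "column": "pulse_count"}],
}
for rule in generate_rules(node):
    print(rule)
```

Keeping the type-to-keyword table separate from the traversal means that extending the mapping relation to a new data type only requires a new table entry.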

And repeating the process until all the nodes of all the openEHR templates are processed, and inserting all the generated data quality evaluation rules into the rule base. The generated data quality evaluation rule comprises: the rule identifier, the rule content and the database information corresponding to the nodes used by the rule, wherein the rule content is defined by using a GDL structure.

In order to identify the knowledge source of each rule, a rule identifier is generated first whenever a data quality evaluation rule is generated, and rules are updated through the rule identifier in order to manage the rule base. In one embodiment, as shown in FIG. 4, the specific process includes:

generating a rule identifier according to the naming rule "openEHR template id + node path + local path + keyword";

if the rule identifier already exists in the rule base, determining whether the rule needs to be deleted or modified; for example, if a database field was originally required to be non-null but may be null after the template version is updated, the rule corresponding to that constraint is deleted;

if the rule identifier does not exist in the rule base, processing the node information to generate a new rule and adding the new rule to the rule base.
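The update flow above can be illustrated with a small Python sketch. The `|` separator, the dictionary used as the rule base and the returned action labels are assumptions made for this example; the text only fixes the four parts of the identifier.

```python
from typing import Dict, Optional


def make_rule_id(template_id: str, node_path: str, local_path: str, keyword: str) -> str:
    """Join the four parts named in the text; the '|' separator is an assumption."""
    return "|".join((template_id, node_path, local_path, keyword))


def sync_rule(rule_base: Dict[str, dict], rule_id: str, new_rule: Optional[dict]) -> str:
    """Apply one update step to the rule base and report the action taken.

    new_rule is the rule derived from the current template version, or None
    when the template no longer carries the constraint.
    """
    if rule_id in rule_base:
        if new_rule is None:
            del rule_base[rule_id]  # constraint was dropped from the template
            return "deleted"
        if rule_base[rule_id] != new_rule:
            rule_base[rule_id] = new_rule  # constraint changed
            return "modified"
        return "unchanged"
    if new_rule is not None:
        rule_base[rule_id] = new_rule  # constraint is new
        return "added"
    return "skipped"


rule_base: Dict[str, dict] = {}
rule_id = make_rule_id("example_template.v1", "/items[at0004]", "/value", "not_null")
print(sync_rule(rule_base, rule_id, {"column": "pulse_count"}))
print(sync_rule(rule_base, rule_id, None))
```

Because the identifier encodes the template, node and constraint a rule came from, re-running the generator after a template change touches only the rules whose source constraint actually changed.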

Step 4, processing the data quality evaluation rule to generate a rule execution configuration file.

The rule execution configuration file defines the rule execution process and is the input of the rule execution engine. It mainly comprises the database table name, the column name, the name of the called method, the parameters required by the method (such as a maximum value, a minimum value and a data format), and the logical relations between rules, wherein the logical relations comprise AND, OR and separation. The format is shown in FIG. 5. In one embodiment, step 4 specifically includes:

the rules to be executed are divided into four types, which are processed separately: rules that involve an association between two database tables, simple rules that involve only one database table and one function, complex rules connected only by AND logic, and complex rules connected by OR logic;

only complex rules connected by AND logic may contain both association constraint rules and simple rules; if an association constraint exists, it is processed first;

complex rules connected by OR logic do not, by default, contain association constraint rules;

if a complex rule connected by OR logic contains AND logic, the AND-logic processing method is called to process it block by block.
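The four-way classification and the "association first" ordering might be sketched as follows in Python. The configuration entry layout (`table`, `logic`, `constraints`, `method`, `params`) and the method names are hypothetical; the real file format is the one shown in FIG. 5.

```python
from typing import List

# Hypothetical configuration entry for one rule; not the format of FIG. 5.
config_entry = {
    "table": "observation_example",
    "logic": "and",
    "constraints": [
        {"column": "pulse_count", "method": "in_range", "params": {"min": 0, "max": 300}},
        {"column": "patient_id", "method": "exists_in",
         "params": {"table": "patient", "column": "id"}},
    ],
}


def classify(entry: dict) -> str:
    """Assign a rule to one of the four types described in the text."""
    constraints = entry["constraints"]
    if len(constraints) == 1:
        only = constraints[0]
        return "association" if only["method"] == "exists_in" else "simple"
    return "complex_or" if entry.get("logic") == "or" else "complex_and"


def order_constraints(entry: dict) -> List[dict]:
    """For AND-connected rules, move association constraints to the front."""
    if classify(entry) != "complex_and":
        return list(entry["constraints"])
    # sorted() is stable, so the remaining constraints keep their order.
    return sorted(entry["constraints"], key=lambda c: c["method"] != "exists_in")


print(classify(config_entry))
print([c["method"] for c in order_constraints(config_entry)])
```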

Step 5, deploying the address, user name and password information of the database to be evaluated in the Spark rule execution engine.

In one embodiment, step 5 uses the jdbc method defined in the Spark DataFrameReader class to create a DataFrame from any JDBC-compatible database as the object of subsequent data processing, without modifying the original data of the database to be evaluated. JDBC-compatible databases include MySQL, PostgreSQL, H2, Oracle, SQL Server, SAP HANA and DB2.

Step 6, according to the rule execution configuration file and the relevant information of the database to be evaluated, the Spark rule execution engine executes the data quality evaluation rules by calling functions to obtain a data quality evaluation result.
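A minimal PySpark sketch of this step is given below, assuming a PostgreSQL source. The connection URL, credentials, table name and application name are placeholders, and the helper names `jdbc_properties` and `load_table` are hypothetical. The matching JDBC driver JAR is assumed to be on the Spark classpath.

```python
from typing import Dict


def jdbc_properties(user: str, password: str, driver: str) -> Dict[str, str]:
    """Collect the connection properties passed to DataFrameReader.jdbc."""
    return {"user": user, "password": password, "driver": driver}


def load_table(url: str, table: str, properties: Dict[str, str]):
    """Read one table into a Spark DataFrame; the source database is only read."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dq-evaluation").getOrCreate()
    return spark.read.jdbc(url=url, table=table, properties=properties)


# Placeholder credentials and driver class for a PostgreSQL source.
props = jdbc_properties("dq_reader", "secret", "org.postgresql.Driver")
# Example call, commented out because it needs a reachable database:
# df = load_table("jdbc:postgresql://localhost:5432/cdr", "observation_example", props)
```

Since the DataFrame is a read-side copy, every later filter or set operation runs inside Spark and leaves the evaluated database unchanged.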

In one embodiment, step 6 specifically includes:

creating a SparkSession as an entry point of Spark;

processing the logic relation and rule parameters of the rule execution configuration file;

creating a DataFrame for a target database table as an object of subsequent data processing;

calling corresponding functions to process data;

if a rule contains multiple constraints, the constraints are executed in the order defined in the rule execution configuration file; the data that satisfies the previous constraint, namely a DataFrame, is used as the input of the next constraint, so that the amount of data being processed shrinks as execution proceeds and the data satisfying all constraints of the current rule is finally obtained.

The difference between the original data set of the current rule and the data set that finally conforms to the rule is taken to obtain the data set that does not conform to the rule.

For each rule, the total amount of data processed by the rule, the amount of data that does not conform to the rule, and the IDs and values of the non-conforming data are counted.
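The execution logic can be illustrated without Spark. In the sketch below, plain Python lists of dictionaries stand in for DataFrames, and the check functions, column names and sample rows are invented for the example. With Spark, the same steps would typically be expressed as DataFrame filters, a union and a set difference.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[Row], bool]


def run_and(rows: List[Row], checks: List[Check]) -> List[Row]:
    """AND logic: each check only sees the rows that passed the previous one."""
    passing = rows
    for check in checks:
        passing = [row for row in passing if check(row)]
    return passing


def run_or(rows: List[Row], checks: List[Check]) -> List[Row]:
    """OR logic: merge the rows that pass any of the checks."""
    return [row for row in rows if any(check(row) for check in checks)]


def summarize(rows: List[Row], passing: List[Row], id_col: str, value_col: str) -> dict:
    """Take the difference set and collect the statistics named in the text."""
    passing_ids = {row[id_col] for row in passing}
    failed = [row for row in rows if row[id_col] not in passing_ids]
    return {
        "total": len(rows),
        "failed_count": len(failed),
        "failed_ids": [row[id_col] for row in failed],
        "failed_values": [row[value_col] for row in failed],
    }


rows = [
    {"id": 1, "pulse": 72},
    {"id": 2, "pulse": None},
    {"id": 3, "pulse": 450},
]


def not_null(row: Row) -> bool:
    return row["pulse"] is not None


def in_range(row: Row) -> bool:
    return 0 <= row["pulse"] <= 300


result = summarize(rows, run_and(rows, [not_null, in_range]), "id", "pulse")
print(result)
```

Chaining the checks means `in_range` never sees the null value in row 2, which mirrors how passing only the conforming data to the next constraint keeps later constraints from failing on data already rejected.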

The automatic generation and execution method of clinical data quality evaluation rules based on the openEHR model provided by the embodiment can significantly reduce the workload of manually defining evaluation rules, can automatically compile the data quality evaluation results, and provides a mechanism for updating and maintaining rules that helps a user manage them without knowing the structure of the underlying database.

The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

1. An automatic generation and execution method of clinical data quality evaluation rules based on an openEHR model is characterized by comprising the following steps:

2. The method for automatically generating and executing clinical data quality assessment rules based on openEHR model according to claim 1, wherein in the step (1), when the mapping relationship between the data quality constraint knowledge and the data quality assessment rules is constructed, the data types in the openEHR model are used as classification bases, and the construction is carried out by combining the attributes and keywords of each type.

3. The automatic generation and execution method of clinical data quality assessment rules based on openEHR model according to claim 1, wherein step (1) specifically comprises:

4. The method for automatically generating and executing clinical data quality assessment rules based on openEHR model according to claim 1, wherein in step (2), each node information in the obtained JSON object includes node name, node path, database table name corresponding to the node, column name, data type, and data constraint information, and the data constraint information structure of the nodes of different data types is different.

5. The method for automatically generating and executing clinical data quality assessment rules based on openEHR model according to claim 1, wherein in step (3), the generated data quality assessment rules include: the rule identifier, the rule content and the database information corresponding to the nodes used by the rule, wherein the rule content is defined by using a GDL structure.

6. The method for automatically generating and executing clinical data quality assessment rules based on openEHR model according to claim 5, wherein in step (3), the rule contents contained in the generated data quality assessment rules are bound with the nodes of the openEHR model, so as to facilitate reuse of the data quality assessment rules.

7. The method for automatically generating and executing clinical data quality assessment rules based on openEHR model according to claim 5, wherein in step (3), the rule identifier included in the generated data quality assessment rules comprises: the openEHR template id, the node path, the local path and the keyword are used for identifying the automatically generated data quality evaluation rule, namely the data quality evaluation rule is generated according to certain constraint information of a certain node of a certain template, so that the rule is convenient to update and maintain.

8. The automatic generation and execution method of clinical data quality assessment rules based on openEHR model according to claim 1, wherein in step (4), the generated rule execution configuration file is used to define a rule execution flow, and comprises: the name of a database table, the name of a column, the name of a called method, parameters required by the method and the logic relationship among the rules, wherein the logic relationship comprises AND, OR and separation.

9. The method for automatically generating and executing clinical data quality assessment rules based on openEHR model according to claim 3, wherein in step (6), the Spark rule execution engine executes the data quality assessment rules by calling a function, including:

10. The method for automatically generating and executing clinical data quality assessment rules based on openEHR models according to claim 1 or 9, wherein the data quality assessment results obtained in step (6) comprise total data amount, unqualified data ID and unqualified data value.