CN112286926B

CN112286926B - Method for combing data quality rules based on affair handling data supply and demand maps

Info

Publication number: CN112286926B
Application number: CN202011575584.7A
Authority: CN
Inventors: 周万
Original assignee: Jiangsu Shudui Technology Co ltd
Current assignee: Jiangsu Shudui Technology Co ltd
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2021-03-30
Anticipated expiration: 2040-12-28
Also published as: CN112286926A

Abstract

The invention discloses a method for combing data quality rules based on an affair handling data supply and demand map, comprising the following steps: constructing a government affairs data map ontology model; constructing a data supply and demand relationship map; setting the set of data elements to be combed, and calculating the sequence dependency graph between the data elements in the set and the affair handling matters in the supply and demand relationship map; obtaining the sequence dependency graph of the data elements from the sequence dependency graph of the affair handling matters; and generating data quality rules from the sequence dependency graph of the data elements. The invention forms the data element list and the data quality rules automatically or semi-automatically, thereby reducing the workload of manual combing and the likelihood of omitted data.

Description

Method for combing data quality rules based on affair handling data supply and demand maps

Technical Field

The invention relates to a method for generating data quality rules, in particular to a method for combing data quality rules based on an affair handling data supply and demand map.

Background

The functional architecture of most current quality management systems for structured data can be described as follows: the data quality control system aims to find, locate and resolve data quality problems in a timely manner, to keep data quality stable and reliable, and is responsible for monitoring and managing data quality across the full process.

The data quality rules comprise:

Collection rules: collection rules are the algorithms and rules by which the data quality management subsystem extracts the required data quality information.

Monitoring rules: monitoring rules are the verification rules by which the data quality management subsystem checks quality indicators on the collected quality data.

Alarm rules: an alarm rule specifies how alarm information is sent when execution of a monitoring rule detects an exception outside the range that the rule allows; it comprises two parts, the alarm mode rule and the alarm subscription rule.

Defining monitoring rules for data quality usually requires a large investment of manpower for combing, and requires the technicians who take part to have rich business knowledge and a good command of professional knowledge in the government field, so the labor cost is very high.

At present, the big data center of each province or city collects data from the business departments, integrates and processes it, and forms the data that other departments need to use. The data gathered from the departments must be monitored for quality. However, establishing the data quality rules on which that monitoring is based is costly: the volume of data collected from each department is huge and the data tables and fields are numerous, so a great deal of labor is required.

Typically, the combing is divided into several steps:

The first step: comb the related government affairs services for a given subdivided theme, generally by reviewing the data list under the theme and refining it into specific data tables, and analyze the business meaning and source department of the data. For example, the theme of 'death by nature' requires combing the related services of the public security, civil administration, court and other departments to obtain the data tables and list for that theme.

The second step: comb the data type, data format, value range and so on of each data table field according to its business meaning, forming a data element list as the basis of the technical quality requirements of the data.

A metadata directory table for the death theme database is then generated. Technical quality rules can now be generated from the formats and value ranges.
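
As a minimal sketch of how technical quality rules might follow from a format and a value range, the Python fragment below checks one value against one row of a metadata directory table. The class `DataElementSpec`, the function `check_value`, and the two sample rows are illustrative assumptions and do not come from the patent.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class DataElementSpec:
    """One row of a hypothetical metadata directory table."""
    name: str
    pattern: str                              # regular expression describing the format
    allowed: Optional[Sequence[str]] = None   # value range, when it can be enumerated


def check_value(spec: DataElementSpec, value: str) -> List[str]:
    """Return the technical rule violations found for one value."""
    problems = []
    if re.fullmatch(spec.pattern, value) is None:
        problems.append(f"{spec.name}: format mismatch")
    if spec.allowed is not None and value not in spec.allowed:
        problems.append(f"{spec.name}: value out of range")
    return problems


# Hypothetical rows; real formats and code lists would come from the combed list.
death_date = DataElementSpec("death date", r"\d{4}-\d{2}-\d{2}")
gender_code = DataElementSpec("gender code", r"\d", allowed=("1", "2"))
```

Calling `check_value(death_date, "28/12/2020")` reports a format mismatch, while a value such as `"2020-12-28"` passes.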

The third step: comb the business constraint relationships among the data elements according to their business meaning, forming the basis of the business quality requirements.

At this time, the rule information is divided into three types:

1. A constraint relationship between data elements, whose result must satisfy a certain expression.

2. A statistical analysis of the values corresponding to one data element, whose result must satisfy a certain expression.

3. An operation on the values corresponding to several data elements followed by a statistical analysis, whose result must satisfy a certain expression.

In addition, each data element must comply with its own value rule.
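
The three rule types can be illustrated with a short Python sketch. The field names, the sample records and the 120-year bound are assumptions chosen for illustration only.

```python
from datetime import date
from statistics import mean

# Hypothetical sample records for a death theme table.
records = [
    {"birth_date": date(1950, 1, 1), "death_date": date(2020, 5, 1)},
    {"birth_date": date(1960, 3, 2), "death_date": date(2019, 7, 9)},
]


def rule_birth_before_death(rec) -> bool:
    """Type 1: a constraint between two data elements of one record."""
    return rec["birth_date"] < rec["death_date"]


def rule_no_future_deaths(recs, today: date) -> bool:
    """Type 2: a statistic (here the maximum) over one data element."""
    return max(r["death_date"] for r in recs) <= today


def rule_plausible_mean_age(recs, upper_years: float = 120.0) -> bool:
    """Type 3: combine several data elements, then apply a statistic."""
    ages = [(r["death_date"] - r["birth_date"]).days / 365.25 for r in recs]
    return mean(ages) <= upper_years
```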

For the data elements in the above description, the naming, value range definition, format type, etc. follow the following rules:

The specific contents of the naming rules for data elements are given in GB/T 19488.1-2004, E-government data element, Part 1: Design and management practice. Examples are as follows:

a) Uniqueness

Rule 1: in a given context, the name of a data element should be unique; the name consists of several components, namely an object word, a characteristic word, a representation word and qualifiers.

Example: in the data element "citizen place-of-birth city and county code", "citizen" is the object word, "place-of-birth city and county" is the characteristic word, and "code" is the representation word.

b) Grammar rules

1) Rule 2: the components of a data element name appear in the order object word, characteristic word, representation word;

2) Rule 3: a qualifier is placed in front of the component it qualifies, and may semantically qualify the object word, the characteristic word or the representation word;

3) Rule 4: when the representation word repeats or partially repeats the characteristic word, the redundant word may be omitted.

Example: in the data element "guardian name", "name" serves as both the characteristic word and the representation word; because the representation word "name" overlaps semantically with the characteristic word "name", the redundant word "name" is omitted.

c) Semantic rules

1) Rule 5: a data element name has one and only one object word, which represents a thing or concept in a given context and is the dominant part of the data element;

2) Rule 6: a data element name has one and only one characteristic word, which is a notable, distinguishing characteristic of the object;

3) Rule 7: a data element name has one and only one representation word, which describes the form of the set of valid values of the data element.

Example:

In the two data elements "citizen place-of-birth city and county code" and "guardian name", the components "citizen" and "guardian" are object words, the components "place-of-birth city and county" and "name" are characteristic words, and the representation words are "code" and "name" respectively.

4) Rule 8: a data element name may contain qualifiers, which qualify the object word, characteristic word or representation word and indicate the uniqueness of the object in a particular context.
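
The naming rules can be pictured as a small data structure. The sketch below is an assumption-laden illustration, not part of GB/T 19488.1-2004: the class `DataElementName` and its `render` method are hypothetical, qualifiers are simply placed at the front of the name, and "partially repeats" in the omission rule is approximated by a substring test.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataElementName:
    object_word: str            # exactly one object word
    characteristic_word: str    # exactly one characteristic word
    representation_word: str    # exactly one representation word
    qualifiers: Tuple[str, ...] = ()  # optional qualifiers

    def render(self) -> str:
        # Order: qualifiers, object word, characteristic word, representation word.
        parts = [*self.qualifiers, self.object_word, self.characteristic_word]
        # Omit the representation word when it repeats the characteristic word
        # (substring test used here as a simplification).
        if self.representation_word not in self.characteristic_word:
            parts.append(self.representation_word)
        return " ".join(parts)


guardian = DataElementName("guardian", "name", "name")
birthplace = DataElementName("citizen", "place-of-birth city and county", "code")
```

Here `guardian.render()` yields `"guardian name"`, with the redundant representation word dropped, while `birthplace.render()` keeps its representation word "code".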

When government affairs data elements are combed, the data elements whose representation word is "date" are subject to business constraints in time, and these constraints can serve as quality rules for checking data quality. However, combing these constraints requires a large investment of manpower, and the quality rules are combed mechanically by hand, for example birth date < death date < cremation date < funeral date < cancellation date < current date, or the chronological order date of birth < time of declaration of death < time of application for revocation < current date.
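
A chronological chain such as the one above can be checked mechanically once it is known. The following sketch assumes a record is a dictionary keyed by data element name; the function `chain_violations` and the handling of missing dates (they are skipped) are illustrative choices, not taken from the patent.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# Order taken from the example chain in the text.
ORDER = ["birth date", "death date", "cremation date",
         "funeral date", "cancellation date"]


def chain_violations(record: Dict[str, Optional[date]],
                     today: date) -> List[Tuple[str, str]]:
    """Return adjacent pairs (earlier, later) that break the strict ordering."""
    present = [(n, record[n]) for n in ORDER if record.get(n) is not None]
    present.append(("current date", today))
    return [(a, b)
            for (a, da), (b, db) in zip(present, present[1:])
            if not da < db]


sample = {
    "birth date": date(1950, 1, 1),
    "death date": date(2020, 5, 1),
    "cremation date": date(2020, 4, 30),  # earlier than the death date
}
```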

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a method for automatically or semi-automatically combing to form a data element list and data quality rules, so that the workload of manual combing is reduced, and the possibility of data omission is reduced.

The purpose of the invention is realized by the following technical scheme.

A method for combing data quality rules based on an affair handling data supply and demand map comprises the following steps:

1) extracting, from the policy documents corresponding to government affairs service matters, the words in the policies and regulations that relate to the ontology model, organizing the words as a relationship network comprising a plurality of connecting lines, and constructing a government affairs data map ontology model, wherein the words related to the ontology model are the element information of the government affairs service, including handling materials, handling organs, parties and rule labels, and the connecting lines of the relationship network are the reference relationships among the elements of the government affairs service;

2) importing the government affairs data map ontology model, and constructing a data supply and demand relationship map according to the handling materials and output materials of the government affairs service matters;

3) setting the set of data elements to be combed, and calculating the sequence dependency graph between the data elements in the set and the affair handling matters in the supply and demand relationship map;

4) obtaining the sequence dependency graph of the data elements from the sequence dependency graph of the affair handling matters;

5) generating data quality rules from the sequence dependency graph of the data elements.
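
One way to picture the supply and demand relationship map of step 2) is as a directed graph in which a matter that issues a material points to every matter that requires it. The sketch below rests on that assumption; the class `Matter`, the function `build_supply_demand_edges` and the sample matters and materials are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Matter:
    name: str
    inputs: Set[str] = field(default_factory=set)    # materials required
    outputs: Set[str] = field(default_factory=set)   # materials issued


def build_supply_demand_edges(matters: List[Matter]) -> Dict[str, Set[str]]:
    """Map each matter to the downstream matters that consume its outputs."""
    producers = defaultdict(set)
    for m in matters:
        for material in m.outputs:
            producers[material].add(m.name)
    downstream: Dict[str, Set[str]] = {m.name: set() for m in matters}
    for m in matters:
        for material in m.inputs:
            for upstream in producers.get(material, ()):
                if upstream != m.name:
                    downstream[upstream].add(m.name)
    return downstream


# Hypothetical matters and materials.
matters = [
    Matter("birth registration", outputs={"birth certificate"}),
    Matter("household registration", inputs={"birth certificate"},
           outputs={"household register"}),
    Matter("id card application", inputs={"household register"}),
]
edges = build_supply_demand_edges(matters)
```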

Further, step 3) specifically calculates, for every data element whose representation word is "date", the affair handling matter that generates the corresponding data, and includes the steps of:

denoting the set of all data elements whose representation word is "date" as A, and the set of all affair handling matters in the supply and demand relationship map as X;

traversing all the data elements Ai in the set A, removing the words "date" and "time" from the name of Ai to obtain a vocabulary Bi, the set of all Bi being B, and recording the correspondence between Ai and Bi;

traversing all the matters Xi in the set X, splicing the name, description, output material names and output material descriptions of Xi into a text string Yi, the set of all text strings being Y, and recording the correspondence between Xi and Yi;

traversing all Bi in B, and calculating the correlation Rj between each Bi and every Yj in the set Y;

taking the maximum of all Rj, obtaining the Yk corresponding to that maximum, and obtaining the Xk corresponding to Yk from the correspondence between X and Y;

after the traversal is finished, the matter Xk corresponding to each Bi is obtained, and the matter Xk that generates the data corresponding to each data element Ai is obtained from the correspondence between Bi and Ai.
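
The traversal above can be sketched in a few lines of Python. This is an illustration under stated assumptions: names are in English, each Yi is already a spliced text string, and the correlation is a simple term-frequency stand-in rather than the TF-IDF or BM25 scoring named in the patent. The function names and the sample matters are hypothetical.

```python
import re
from typing import Dict, List


def strip_representation_word(name: str) -> str:
    """Ai -> Bi: drop the words 'date' and 'time' from a data element name."""
    return re.sub(r"\s*\b(date|time)\b\s*", " ", name).strip()


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def relevance(query: str, doc: str) -> float:
    """Stand-in correlation Rj: share of document tokens that match the query."""
    q = set(tokens(query))
    d = tokens(doc)
    return sum(1 for t in d if t in q) / len(d) if d else 0.0


def map_elements_to_matters(elements: List[str],
                            matter_texts: Dict[str, str]) -> Dict[str, str]:
    """For each Ai, return the matter Xk whose text Yk scores highest."""
    result = {}
    for a in elements:
        b = strip_representation_word(a)
        result[a] = max(matter_texts, key=lambda x: relevance(b, matter_texts[x]))
    return result


# Hypothetical matters: key is Xi, value is the spliced text Yi.
matter_texts = {
    "birth registration": "birth registration register a newborn birth certificate",
    "cremation handling": "cremation handling cremate remains cremation certificate",
    "marriage registration": "marriage registration register a marriage marriage certificate",
}
mapping = map_elements_to_matters(
    ["birth date", "cremation date", "marriage date"], matter_texts)
```

Swapping `relevance` for a TF-IDF or BM25 scorer leaves the rest of the traversal unchanged.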

Further, the step 4) specifically comprises:

a) traversing the set A, looking up the corresponding Xk for each Ai in the data element index library, and marking it with a to-be-sorted mark S;

b) starting from the root Y0, setting Y0 as the current node C;

c) traversing all downstream nodes Yi of the node C;

d) when the node Yi does not carry the S mark, traversing the child node Sj of the node Yi, and when the Sj is not the child node of Yi, setting the Sj as the child node of Yi;

e) deleting the parent-child relationship among the node Sj, all the father nodes and all the child nodes, and deleting the node Sj;

f) when the node Yi carries the S mark, setting Yi as the current node C and returning to step c); after the redundant nodes have been removed, a relationship graph is formed;

g) repeatedly executing steps b) and c) on the relationship graph from which the redundant nodes have been removed, and generating the inequality Yi > C, thereby obtaining the sequence dependency expressions among the nodes.
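
The following Python sketch shows one possible reading of steps a) to g), and it is only an interpretation: nodes without the S mark are treated as redundant and skipped, so that each marked node is linked directly to the nearest marked nodes downstream of it, and every remaining edge yields an inequality "downstream > upstream". The function names and the sample graph are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def reduce_graph(downstream: Dict[str, List[str]],
                 marked: Set[str], root: str) -> Dict[str, Set[str]]:
    """Keep only marked nodes; connect each to its nearest marked descendants."""
    reduced: Dict[str, Set[str]] = {}
    stack = [root]
    while stack:
        current = stack.pop()
        if current in reduced:
            continue
        kids: Set[str] = set()
        frontier = list(downstream.get(current, ()))
        seen: Set[str] = set()
        while frontier:
            node = frontier.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in marked:
                kids.add(node)                             # becomes a direct child
            else:
                frontier.extend(downstream.get(node, ()))  # skip the redundant node
        reduced[current] = kids
        stack.extend(kids)
    return reduced


def inequalities(reduced: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Each pair (later, earlier) stands for the inequality later > earlier."""
    return sorted((child, parent)
                  for parent, kids in reduced.items() for child in kids)


# Hypothetical graph: M1 and M2 are matters without any date data element.
graph = {"A0": ["M1"], "M1": ["A1", "A2"], "A1": ["M2"], "M2": ["A3"], "A2": ["A3"]}
reduced = reduce_graph(graph, marked={"A0", "A1", "A2", "A3"}, root="A0")
pairs = inequalities(reduced)
```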

Further, the calculation method for calculating the correlation Rj of each Bi with all Yj in the set Y includes TF-IDF algorithm or BM25 algorithm.
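
For reference, a compact sketch of BM25 scoring is given below. It uses the common Okapi BM25 form with a non-negative inverse document frequency and the conventional defaults k1 = 1.5 and b = 0.75; these parameter values, the function name `bm25_scores` and the pre-tokenized inputs are assumptions, since the patent does not specify them. The document list is assumed to be non-empty.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document (a Yj) against a tokenized query (a Bi)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["death", "registration"],
        ["marriage", "registration"],
        ["cremation", "certificate"]]
scores = bm25_scores(["death"], docs)
```

The index of the largest score identifies Yk, and through the recorded correspondence, Xk.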

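Compared with the prior art, the invention has the following advantage: most government affairs data about natural persons or legal persons is collected while government departments perform their duties, and performing those duties mostly means handling government affairs service matters for natural persons or legal persons. The sequence dependencies among those matters can therefore be used to form the data element list and the data quality rules automatically or semi-automatically, which reduces the workload of manual combing and the likelihood of omitted data.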

Drawings

FIG. 1 is a diagram of the government affairs data map ontology model according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of the materials on which the household registration transfer matter depends and the corresponding upstream matters.

FIG. 3 is a flow chart of the present invention.

FIG. 4 is the sequence dependency graph of the data supply and demand map defined by the obtained handling materials.

FIG. 5 is the sequence dependency graph of data elements A1-A10.

Detailed Description

The invention is described in detail below with reference to the drawings and specific examples.

Examples

Taking 'adolescent registration' as an example, the ontology is extracted as follows:

As a "government affair", this affair is called a "transaction matter" in the specific handling process.

The person who applies for the transaction is the "party".

The transaction is executed by the corresponding government agency, called the "office".

The handling process requires the submission of the relevant "handling materials", such as certificates, agreements, proofs, documents and applications.

These materials are in turn issued by the corresponding authorities, called "material issuing agencies".

After the matter has been handled, subsequent government matters can be handled; these are called "recommended matters".

There are two ways of handling, entrusted handling and in-person handling; entrusted handling requires a power of attorney.

The power of attorney must be notarized by an authority, called the "entrustment notarization organ".

Finally, the ontology model is extracted and imported into a graph database to form an ontology graph, as shown in FIG. 1.
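
Before it is loaded into a graph database, an ontology of this kind can be written down as a list of typed triples. The sketch below is illustrative only: the relation names (`applies_for`, `handles`, and so on) and the paraphrased node names are assumptions, and no particular graph database API is implied.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical relation names; node names paraphrase the elements described above.
ONTOLOGY: List[Triple] = [
    Triple("party", "applies_for", "transaction matter"),
    Triple("office", "handles", "transaction matter"),
    Triple("transaction matter", "requires", "handling material"),
    Triple("material issuing agency", "issues", "handling material"),
    Triple("transaction matter", "recommends", "recommended matter"),
    Triple("entrusted handling", "requires", "power of attorney"),
    Triple("notarization organ", "notarizes", "power of attorney"),
]


def outgoing(node: str) -> List[Triple]:
    """Return every relation that starts at the given ontology node."""
    return [t for t in ONTOLOGY if t.subject == node]
```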

2) Importing the government affairs data map ontology model, and constructing a data supply and demand relationship map according to the handling materials and output materials of the government affairs service matters: a government affairs data map is constructed according to the ontology model. Each government department, in accordance with its defined duties and the policy documents, combs the government matters it is responsible for, such as marriage registration, accommodative registration, ID card handling, divorce registration and household registration transfer. FIG. 2 shows the materials on which the household registration transfer matter depends and the corresponding upstream matters.

As shown in FIG. 3, step 3) specifically calculates, for every data element whose representation word is "date", the affair handling matter that generates the corresponding data, and includes the steps of:

traversing all the data elements Ai in the set A, removing the words "date" and "time" from the name of Ai to obtain a vocabulary Bi, the set of all Bi being B, and recording the correspondence between Ai and Bi; for example, the data element list is: birth date, death date, cremation date, funeral date, cancellation date, marriage date and divorce date.

After dropping the representation word, we get: birth, death, cremation, funeral, cancellation, marriage and divorce.

traversing all Bi in B, and calculating the correlation Rj between each Bi and every Yj in the set Y; the correlation may be calculated with the TF-IDF algorithm or the BM25 algorithm.

Assume that the sequence dependency graph of the data supply and demand map defined by the handling materials, obtained in step 2), is as shown in FIG. 4, where node A0 is birth and node A10 is death deregistration.

The step 4) is specifically as follows:

a) traversing the set A, searching the corresponding Xk from the data element index library for each Ai, and marking it with a to-be-sorted mark S; for example, as shown in FIG. 5 for nodes 2, 5, 6, 8, and 9. Nodes A0 and A10 are marked with an S, as shown by nodes 0 and 10 in FIG. 5.

b) starting from the root Y0, setting Y0 as the current node C;

c) traversing all downstream nodes Yi of the node C;

f) when the node Yi carries the S mark, setting Yi as the current node C and returning to step c); after the redundant nodes have been removed, a relationship graph is formed, giving the sequence dependency graph of the data elements A1-A10 shown in FIG. 5;

g) repeatedly executing steps b) and c) on the relationship graph from which the redundant nodes have been removed, and generating inequalities of the form Yi > C, such as A4 > A1, A2 > A1, A3 > A1, A4 > A2, A5 > A2 and A5 > A3, thereby obtaining the sequence dependency expressions among the nodes.
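
Once the inequalities are available, turning them into executable quality rules is mechanical. The sketch below assumes each node has been mapped to a date column; the mapping `columns`, the column names and the two helper functions are hypothetical, and rows with a missing date are skipped rather than reported.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (later node, earlier node), meaning later > earlier


def to_rules(pairs: List[Pair], columns: Dict[str, str]) -> List[str]:
    """Render each inequality as a readable rule expression."""
    return [f"{columns[later]} > {columns[earlier]}" for later, earlier in pairs]


def violated(pairs: List[Pair], columns: Dict[str, str],
             row: Dict[str, Optional[date]]) -> List[str]:
    """Return the rules that a single row breaks."""
    broken = []
    for later, earlier in pairs:
        l, e = row.get(columns[later]), row.get(columns[earlier])
        if l is not None and e is not None and not l > e:
            broken.append(f"{columns[later]} > {columns[earlier]}")
    return broken


# Hypothetical node-to-column mapping for three of the nodes.
columns = {"A1": "household_registration_date",
           "A2": "marriage_date",
           "A4": "divorce_date"}
pairs = [("A4", "A1"), ("A2", "A1"), ("A4", "A2")]
rules = to_rules(pairs, columns)
```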

Claims

1. A method for combing data quality rules based on an affair handling data supply and demand map, characterized by comprising the following steps:

5) generating a data quality rule according to the sequence dependency relationship diagram of the data elements;

the step 3) is specifically to calculate and generate the transaction items of the corresponding data for all the data elements with the expression of 'date', and the steps comprise:

traversing all the items Xi in the set X, splicing the names, the descriptions, the output material names and the output material descriptions of the Xi to form a text string Yi, wherein the set of all the text strings is Y, and recording the corresponding relation between the Xi and the Yi;

calculating the maximum value of all Rj, wherein the maximum value of Rj corresponds to Yk, and obtaining Xk corresponding to Yk according to the corresponding relation between X and Y;

2. The method for combing data quality rules based on an affair handling data supply and demand map according to claim 1, wherein step 4) is specifically as follows:

b) starting from the root Y0, setting Y0 as the current node C;

c) traversing all downstream nodes Yi of the node C;

3. The method for combing data quality rules based on an affair handling data supply and demand map according to claim 1 or 2, wherein the correlation Rj between each Bi and every Yj in the set Y is calculated with the TF-IDF algorithm or the BM25 algorithm.