CN112286926B - Method for combing data quality rules based on affair handling data supply and demand maps - Google Patents


Info

Publication number
CN112286926B
Authority
CN
China
Prior art keywords
data
node
demand
supply
traversing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011575584.7A
Other languages
Chinese (zh)
Other versions
CN112286926A (en)
Inventor
周万
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Shudui Technology Co ltd
Original Assignee
Jiangsu Shudui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Shudui Technology Co ltd
Priority to CN202011575584.7A
Publication of CN112286926A
Application granted
Publication of CN112286926B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention discloses a method for combing data quality rules based on an affair handling data supply and demand map, which comprises the following steps: constructing a government affair data map ontology model; constructing a data supply and demand relationship map; setting a set of data elements to be combed, and calculating the sequence dependency relationship graph between the data elements in the set and the affair handling matters in the supply and demand relationship map; obtaining the sequence dependency relationship graph of the data elements from the sequence dependency relationship graph of the affair handling matters; and generating data quality rules from the sequence dependency relationship graph of the data elements. The invention forms the data element list and the data quality rules automatically or semi-automatically, thereby reducing the workload of manual combing and the possibility of omitting data.

Description

Method for combing data quality rules based on affair handling data supply and demand maps
Technical Field
The invention relates to a method for generating data quality rules, in particular to a method for combing data quality rules based on an affair handling data supply and demand map.
Background
The functional architecture of most current quality management systems for structured data is described as follows: the data quality control system aims to find, locate and solve data quality problems in time, to ensure that data quality is stable and reliable, and is responsible for monitoring and managing data quality across the full process.
The data quality rules comprise:
Collection rules: algorithms and rules by which the data quality management subsystem extracts the required data quality information.
Monitoring rules: verification rules by which the data quality management subsystem checks quality indicators on the collected quality data.
Alarm rules: the method of sending alarm information when a monitoring rule is executed and an exception outside the rule's allowable range occurs; they comprise two parts, alarm mode rules and alarm subscription rules.
Defining monitoring rules for data quality usually requires a large amount of manual combing, and requires the technicians involved to have rich business knowledge and to be familiar with the government domain, so the labor cost is huge.
At present, the big data center of each province or city collects data from each business department, integrates and processes it, and forms the data that other departments need to use. The quality of the data gathered from the departments must be monitored. However, because the amount of data collected from each department is huge and the data tables and fields are numerous, combing the data quality rules on which data quality monitoring is based requires a huge labor cost.
Typically, manual combing is divided into several steps.
The first step: comb the government services related to a given subdivided theme, generally by checking the data list under the theme and refining it into specific data tables, and analyze the business meaning and the source department of the data. For example, for the theme "death by nature", the related services of public security, civil affairs, courts and other departments need to be combed to obtain the data tables and list for that theme.
The second step: comb the data type, data format, value range and so on of each data table field according to its business meaning, forming a data element list as the basis for the technical quality requirements of the data.
A metadata directory table for the death theme database is generated. Quality rules of the technical class can then be generated from the formats and value ranges.
The third step: comb the business constraint relations among the data elements according to their business meaning, forming the basis for the quality requirements of the business class.
At this point the rule information falls into three types:
1. A constraint relation between data elements, whose result must satisfy a certain expression.
2. A statistical analysis of the values of a data element, whose result must satisfy a certain expression.
3. An operation on the values of several data elements followed by statistical analysis, whose result must satisfy a certain expression.
In addition, each data element must itself comply with a value rule.
The naming, value range definition, format type and so on of the data elements mentioned above follow the rules below.
The naming rules of data elements are specified in GB/T 19488.1-2004, E-government data element, Part 1: Design and management specification. Examples are as follows:
a) Uniqueness
Rule 1: within a given context, the name of a data element should be unique; the name is composed of several parts, such as an object word, a characteristic word, a representation word and a qualifier.
Example: in the data element "citizen place of birth city and county code", "citizen" is the object word, "place of birth city and county" is the characteristic word, and "code" is the representation word.
b) Grammar rules
1) Rule 2: the parts of a data element name appear in the order object word, characteristic word, representation word;
2) Rule 3: a qualifier is placed before the part it qualifies, and may semantically qualify the object word, the characteristic word or the representation word;
3) Rule 4: when the representation word repeats or partially repeats the characteristic word, the redundant word may be omitted.
Example: in the data element "guardian name", the representation word is "name"; because it coincides semantically with the characteristic word "name", the redundant word "name" is omitted.
c) Semantic rules
1) Rule 5: a data element name has one and only one object word, which represents a thing or concept in a given context and is the dominant part of the data element;
2) Rule 6: a data element name has one and only one characteristic word, which expresses a significant, distinguishing characteristic of the object;
3) Rule 7: a data element name has one and only one representation word, which describes the form of the set of valid values of the data element.
Example: in the two data elements "citizen place of birth city and county code" and "guardian name", "citizen" and "guardian" are the object words, "place of birth city and county" and "name" are the characteristic words, and "code" and "name" are the representation words.
4) Rule 8: a data element name may contain qualifiers, which qualify the object word, the characteristic word or the representation word and indicate the uniqueness of the object in a particular context.
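The naming rules above can be pictured as a small record type. The following Python sketch is only an illustration and is not taken from the standard: the class name `DataElementName`, its field names and the `render` method are hypothetical, and placing the qualifier in front of the whole name is a simplification of rule 3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElementName:
    """Hypothetical decomposition of a data element name (rules 5 to 8)."""
    object_word: str                 # rule 5: exactly one object word
    characteristic_word: str         # rule 6: exactly one characteristic word
    representation_word: str         # rule 7: exactly one representation word
    qualifier: Optional[str] = None  # rule 8: optional qualifier

    def render(self) -> str:
        # Rule 2: object word, characteristic word, representation word, in order.
        parts = [self.object_word, self.characteristic_word]
        # Rule 4: omit the representation word when it repeats the characteristic word.
        if self.representation_word != self.characteristic_word:
            parts.append(self.representation_word)
        # Rule 3 (simplified): the qualifier goes in front of the name.
        if self.qualifier:
            parts.insert(0, self.qualifier)
        return " ".join(parts)


guardian = DataElementName("guardian", "name", "name")
birthplace = DataElementName("citizen", "place of birth city and county", "code")
```

With these two instances, `guardian.render()` drops the repeated word as rule 4 allows, while `birthplace.render()` keeps all three parts.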
When government affair data elements are combed, data elements whose representation word is "date" are subject to business constraints in time, and these constraints can be used as quality rules to check data quality. However, combing these constraint relations requires a large amount of manual, mechanical work; for example, chronological orders such as birth date < death date < cremation date < funeral date < cancellation date < current date, or birth date < time of declaration of death < time of application for revocation < current date, are combed by hand.
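A chronological chain of this kind is straightforward to apply as a quality rule once it has been written down. The sketch below is a minimal illustration in Python, not part of the patent: the function name `violations`, the record layout and all of the dates are made up, and skipping missing fields is an assumption of the sketch.

```python
from datetime import date

# Ordered date fields taken from the example chain in the text.
CHAIN = ["birth date", "death date", "cremation date",
         "funeral date", "cancellation date"]


def violations(record, today):
    """Return the adjacent pairs that break the chain a < b < ... < today.

    Fields that are missing or None are skipped rather than reported;
    that choice is an assumption of this sketch.
    """
    present = [(name, record[name]) for name in CHAIN
               if record.get(name) is not None]
    present.append(("current date", today))
    return [(a, b)
            for (a, da), (b, db) in zip(present, present[1:])
            if not da < db]


record = {
    "birth date": date(1940, 5, 1),
    "death date": date(2020, 3, 10),
    "cremation date": date(2020, 3, 8),   # earlier than the death date
    "cancellation date": date(2020, 4, 1),
}
bad = violations(record, today=date(2021, 1, 1))
```

Here `bad` contains the single pair `("death date", "cremation date")`, because the cremation date precedes the death date in this invented record.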
Disclosure of Invention
To address the problems in the prior art, the invention provides a method that forms the data element list and the data quality rules automatically or semi-automatically, reducing the workload of manual combing and the possibility of omitting data.
The purpose of the invention is realized by the following technical scheme.
A method for combing data quality rules based on an affair handling data supply and demand map comprises the following steps:
1) extracting, from the policy documents corresponding to government service matters, the words in the policies and regulations that relate to an ontology model, organizing the words as a relation network comprising a plurality of connecting lines, and constructing a government affair data map ontology model, wherein the words related to the ontology model are element information of the government service and comprise handling materials, handling organs, parties and rule labels, and the connecting lines of the relation network are the reference relations among the elements of the government service;
2) importing the government affair data map ontology model, and constructing a data supply and demand relationship map from the materials required for handling the government service matters and the materials produced by handling them;
3) setting a set of data elements to be combed, and calculating the sequence dependency relationship graph between the data elements in the set and the affair handling matters in the supply and demand relationship map;
4) obtaining the sequence dependency relationship graph of the data elements from the sequence dependency relationship graph of the affair handling matters;
5) generating data quality rules from the sequence dependency relationship graph of the data elements.
Further, step 3) specifically calculates, for every data element whose representation word is "date", the affair handling matter that generates the corresponding data, and comprises the steps of:
denoting the set of all data elements whose representation word is "date" as A, and the set of all affair handling matters in the supply and demand relationship map as X;
traversing all data elements Ai in the set A, and removing the words "date" and "time" from the name of Ai to obtain a word Bi, wherein the set of all Bi is B, and recording the correspondence between Ai and Bi;
traversing all matters Xi in the set X, and concatenating the name, description, output material names and output material descriptions of Xi to form a text string Yi, wherein the set of all text strings is Y, and recording the correspondence between Xi and Yi;
traversing all Bi in B, and calculating the correlation Rj between each Bi and every Yj in the set Y;
finding the maximum of all Rj, taking the Yk that corresponds to the maximum, and obtaining the Xk corresponding to Yk from the correspondence between X and Y;
after the traversal is finished, the matter Xk corresponding to each Bi has been obtained, and the matter Xk that generates the data of each data element Ai is obtained from the correspondence between Bi and Ai.
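The matching procedure of step 3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the smoothed IDF formula, the whitespace tokenization (Chinese text would need a word segmenter) and the three sample matters with their text strings are all hypothetical.

```python
import math
from collections import Counter


def strip_representation_word(name):
    """Ai -> Bi: drop the words 'date' and 'time' from a data element name."""
    return " ".join(w for w in name.lower().split() if w not in {"date", "time"})


def tfidf_scores(query, docs):
    """Score every text string in docs against the query with a plain TF-IDF sum."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            df = sum(1 for t in tokenized if term in t)
            if df and tokens:
                idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF (an assumption)
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


def match_items(elements, items):
    """Map each data element Ai to the matter Xk whose text Yk scores highest."""
    names = list(items)                    # X
    texts = [items[x] for x in names]      # Y, in the same order as X
    result = {}
    for a in elements:                     # A
        b = strip_representation_word(a)   # B
        r = tfidf_scores(b, texts)         # R
        k = max(range(len(r)), key=r.__getitem__)
        result[a] = names[k]
    return result


# Hypothetical matters and their concatenated text strings.
items = {
    "birth registration": "birth registration register a newborn birth certificate",
    "death registration": "death registration record a death death certificate",
    "marriage registration": "marriage registration register a marriage marriage certificate",
}
matched = match_items(["birth date", "death date", "marriage date"], items)
```

In this toy input each data element is matched to the registration matter that mentions its stripped name; with real data, ties and zero scores would need an explicit policy, which the text does not specify.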
Further, step 4) specifically comprises:
a) traversing the set A, looking up the corresponding Xk in the data element index library for each Ai, and marking it with a to-be-sorted mark S;
b) starting from the root Y0, setting Y0 as the current node C;
c) traversing all downstream nodes Yi of the node C;
d) when the node Yi does not carry the S mark, traversing the child nodes Sj of the node Yi, and when Sj is not a child node of Yi, setting Sj as a child node of Yi;
e) deleting the parent-child relationships between the node Sj and all of its parent nodes and child nodes, and deleting the node Sj;
f) when the node Yi carries the S mark, setting Yi as the current node C and returning to step c), thereby forming the relationship graph with redundant nodes removed;
g) repeating steps b) and c) on the relationship graph with redundant nodes removed, and generating the inequality Yi > C, thereby obtaining the sequence dependency expressions among the nodes.
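The wording of steps d) and e) is ambiguous, so the following Python sketch adopts one plausible reading and should be treated as an assumption: every node without the S mark is contracted, meaning its parents are connected directly to its children and the node is deleted, and each remaining edge then yields one inequality "child > parent". The function names and the small example graph are hypothetical.

```python
def contract_unmarked(edges, marked):
    """Remove every unmarked node, reconnecting its parents to its children.

    edges is an iterable of (parent, child) pairs of an acyclic graph;
    the result maps each remaining node to the set of its children.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
        children.setdefault(child, set())
    for node in [n for n in children if n not in marked]:
        kids = children.pop(node)
        for parent_kids in children.values():
            if node in parent_kids:
                parent_kids.discard(node)
                parent_kids |= kids   # the parent inherits the removed node's children
    return children


def inequalities(children):
    """One 'child > parent' expression per remaining edge."""
    return sorted(f"{c} > {p}" for p, kids in children.items() for c in kids)


# Hypothetical graph: U is a matter that produces no date element (no S mark).
edges = [("A1", "U"), ("U", "A2"), ("A1", "A3"), ("A2", "A4"), ("A3", "A4")]
reduced = contract_unmarked(edges, marked={"A1", "A2", "A3", "A4"})
rules = inequalities(reduced)
```

After contraction, A2 hangs directly under A1, so `rules` lists A2 > A1, A3 > A1, A4 > A2 and A4 > A3. The sketch does not remove inequalities that are implied transitively by others.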
Further, the methods for calculating the correlation Rj between each Bi and every Yj in the set Y include the TF-IDF algorithm and the BM25 algorithm.
Compared with the prior art, the invention has the following advantage: most government affair data relating to natural persons or legal persons is collected while government departments perform their duties, and performing those duties mostly consists of handling government service matters for natural persons or legal persons.
Drawings
FIG. 1 is the ontology model diagram of the government affair data map according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the materials on which the household registration transfer matter depends and the corresponding upstream matters.
FIG. 3 is a flow chart of the present invention.
FIG. 4 is the sequence dependency graph of the data supply and demand map defined by the acquired handling materials.
FIG. 5 is the sequence dependency graph of data elements A1-A10.
Detailed Description
The invention is described in detail below with reference to the drawings and specific examples.
Examples
A method for combing data quality rules based on an affair handling data supply and demand map comprises the following steps:
1) extracting, from the policy documents corresponding to government service matters, the words in the policies and regulations that relate to an ontology model, organizing the words as a relation network comprising a plurality of connecting lines, and constructing a government affair data map ontology model, wherein the words related to the ontology model are element information of the government service and comprise handling materials, handling organs, parties and rule labels, and the connecting lines of the relation network are the reference relations among the elements of the government service;
taking "adolescent registration" as an example, the extraction proceeds as follows:
As a government affair, it is called a "handling matter" in the specific handling process.
The person applying to handle the matter is the "party".
The matter is carried out by the corresponding government agency, called the "handling organ".
Handling the matter requires submitting the relevant "handling materials", such as certificates, agreements, documents and applications.
These materials are in turn issued by corresponding agencies, called "material issuing agencies".
After a matter has been handled, subsequent government matters can be handled; these are called "recommended matters".
There are two ways of handling, entrusted handling and handling in person; entrusted handling requires a letter of entrustment.
The letter of entrustment must be notarized by an authoritative department, called the "entrustment notarization organ".
Finally, the ontology model is extracted and imported into a graph database to form an ontology graph, as shown in FIG. 1.
2) importing the government affair data map ontology model, and constructing a data supply and demand relationship map from the materials required for handling the government service matters and the materials produced by handling them; the government affair data map is constructed according to the ontology model. Each government department, according to its defined duties and the policy documents, combs its government matters, such as marriage registration, accommodative registration, ID card handling, divorce registration and household registration transfer. FIG. 2 shows the materials on which the household registration transfer matter depends and the corresponding upstream matters.
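One way to picture the ontology model before it is loaded into a graph database is as a set of allowed (type, relation, type) triples. The Python sketch below is an assumption-laden illustration: FIG. 1 is not reproduced here, so the relation labels (`applied_for_by`, `handled_by`, `requires`, `issued_by`, `recommends`), the function `is_valid` and the sample instances are all hypothetical; only the node type names come from the text.

```python
# Node types are named in the text; the relation labels are hypothetical.
ONTOLOGY = {
    ("handling matter", "applied_for_by", "party"),
    ("handling matter", "handled_by", "handling organ"),
    ("handling matter", "requires", "handling material"),
    ("handling material", "issued_by", "material issuing agency"),
    ("handling matter", "recommends", "handling matter"),
}


def is_valid(triple, node_types):
    """Check an instance triple (subject, relation, object) against the ontology."""
    subject, relation, obj = triple
    return (node_types.get(subject), relation, node_types.get(obj)) in ONTOLOGY


# Hypothetical instances and their types.
node_types = {
    "household registration transfer": "handling matter",
    "household register": "handling material",
    "public security bureau": "handling organ",
}
ok = is_valid(("household registration transfer", "requires", "household register"),
              node_types)
wrong = is_valid(("household register", "handled_by", "public security bureau"),
                 node_types)
```

The first triple is accepted because a handling matter may require a handling material; the second is rejected because the sketch defines no `handled_by` relation starting from a material.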
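A supply and demand relationship can be derived mechanically once each matter lists its materials. The sketch below is a hypothetical Python illustration and assumes that each matter records the materials it requires and the materials it produces; the keys `requires` and `produces`, the function name and the three sample matters are not taken from the patent.

```python
def build_supply_demand_edges(matters):
    """Link matter U to matter D whenever U produces a material that D requires.

    matters maps a matter name to {"requires": [...], "produces": [...]}.
    Returns sorted (upstream matter, material, downstream matter) triples.
    """
    producers = {}
    for name, info in matters.items():
        for material in info["produces"]:
            producers.setdefault(material, []).append(name)
    edges = set()
    for name, info in matters.items():
        for material in info["requires"]:
            for upstream in producers.get(material, []):
                if upstream != name:
                    edges.add((upstream, material, name))
    return sorted(edges)


# Hypothetical matters and materials.
matters = {
    "birth registration": {"requires": [], "produces": ["birth certificate"]},
    "household registration": {"requires": ["birth certificate"],
                               "produces": ["household register"]},
    "id card application": {"requires": ["household register"],
                            "produces": ["id card"]},
}
edges = build_supply_demand_edges(matters)
```

Each resulting triple says that the upstream matter supplies a material the downstream matter demands, which is also the ordering information that the later steps rely on.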
3) Setting a data element set needing to be combed, and calculating a sequence dependency relationship diagram of data elements in the set and affairs handling matters in a supply and demand relationship map;
4) obtaining a sequence dependency relationship diagram of the data elements according to the sequence dependency relationship diagram of the transaction matters;
5) generating data quality rules from the sequence dependency relationship graph of the data elements.
As shown in FIG. 3, step 3) specifically calculates, for every data element whose representation word is "date", the affair handling matter that generates the corresponding data, and comprises the steps of:
denoting the set of all data elements whose representation word is "date" as A, and the set of all affair handling matters in the supply and demand relationship map as X;
traversing all data elements Ai in the set A, and removing the words "date" and "time" from the name of Ai to obtain a word Bi, wherein the set of all Bi is B, and recording the correspondence between Ai and Bi; for example, given the data element list: birth date, death date, cremation date, funeral date, cancellation date, marriage date and divorce date.
After dropping the representation word, we get: birth, death, cremation, funeral, cancellation, marriage and divorce.
Traversing all matters Xi in the set X, and concatenating the name, description, output material names and output material descriptions of Xi to form a text string Yi, wherein the set of all text strings is Y, and recording the correspondence between Xi and Yi;
traversing all Bi in B, and calculating the correlation Rj between each Bi and every Yj in the set Y; the methods for calculating the correlation include the TF-IDF algorithm and the BM25 algorithm.
Finding the maximum of all Rj, taking the Yk that corresponds to the maximum, and obtaining the Xk corresponding to Yk from the correspondence between X and Y;
after the traversal is finished, the matter Xk corresponding to each Bi has been obtained, and the matter Xk that generates the data of each data element Ai is obtained from the correspondence between Bi and Ai.
As shown in FIG. 4, assume that the sequence dependency relationship of the data supply and demand map defined by the handling materials, as obtained in step 2), is as illustrated, wherein node A0 is birth and node A10 is death cancellation.
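For the BM25 option, the correlation between a stripped word Bi and each text string Yj can be computed with the usual Okapi BM25 formula. The Python sketch below is illustrative only: the parameter values k1 = 1.5 and b = 0.75 are common defaults rather than values from the patent, the IDF variant is the non-negative one, tokenization is by whitespace, and the sample text strings are invented.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every non-empty document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def best_match(query, docs):
    """Index k of the document with the highest score (the Yk of the text)."""
    r = bm25_scores(query, docs)
    return max(range(len(r)), key=r.__getitem__)


# Hypothetical text strings Y built from matter names and output materials.
Y = [
    "cremation registration cremation certificate issued by funeral home",
    "household registration cancellation cancel household registration after death",
    "marriage registration marriage certificate",
]
k_cremation = best_match("cremation", Y)
k_cancellation = best_match("cancellation", Y)
```

In this toy corpus "cremation" selects the first string and "cancellation" the second; compared with a plain TF-IDF sum, BM25 saturates repeated terms and normalizes for string length.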
Step 4) is specifically as follows:
a) traversing the set A, looking up the corresponding Xk in the data element index library for each Ai, and marking it with a to-be-sorted mark S; for example, nodes 2, 5, 6, 8 and 9 as shown in FIG. 5. Nodes A0 and A10 are marked with S, shown as nodes 0 and 10 in FIG. 5.
b) Starting from the root Y0, setting Y0 as the current node C;
c) traversing all downstream nodes Yi of the node C;
d) when the node Yi does not carry the S mark, traversing the child nodes Sj of the node Yi, and when Sj is not a child node of Yi, setting Sj as a child node of Yi;
e) deleting the parent-child relationships between the node Sj and all of its parent nodes and child nodes, and deleting the node Sj;
f) when the node Yi carries the S mark, setting Yi as the current node C and returning to step c), thereby forming the relationship graph with redundant nodes removed and obtaining the sequence dependency graph of data elements A1-A10, as shown in FIG. 5;
g) repeating steps b) and c) on the relationship graph with redundant nodes removed, and generating inequalities of the form Yi > C, such as A4 > A1, A2 > A1, A3 > A1, A4 > A2, A5 > A2 and A5 > A3, thereby obtaining the sequence dependency expressions among the nodes.
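Once the inequalities are available, each one can be applied to a record as a data quality rule. The Python sketch below uses the six example inequalities from the text; the function name `failed_rules`, the decision to skip pairs with a missing value, and all of the dates are assumptions made for illustration.

```python
from datetime import date

# (later, earlier) pairs for A4 > A1, A2 > A1, A3 > A1, A4 > A2, A5 > A2, A5 > A3.
PAIRS = [("A4", "A1"), ("A2", "A1"), ("A3", "A1"),
         ("A4", "A2"), ("A5", "A2"), ("A5", "A3")]


def failed_rules(pairs, values):
    """Return the 'later > earlier' rules that a record violates.

    Pairs with a missing value are skipped, which is an assumption of this sketch.
    """
    failed = []
    for later, earlier in pairs:
        if later in values and earlier in values and not values[later] > values[earlier]:
            failed.append(f"{later} > {earlier}")
    return failed


# Made-up dates for one record; A5 is missing.
values = {
    "A1": date(2020, 1, 1),
    "A2": date(2020, 2, 1),
    "A3": date(2019, 12, 1),   # earlier than A1, so A3 > A1 fails
    "A4": date(2020, 3, 1),
}
failed = failed_rules(PAIRS, values)
```

For this record only the rule A3 > A1 is reported; the two rules that involve A5 are skipped because the record has no value for it.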

Claims (3)

1. A method for combing data quality rules based on an affair handling data supply and demand map, characterized by comprising the following steps:
1) extracting, from the policy documents corresponding to government service matters, the words in the policies and regulations that relate to an ontology model, organizing the words as a relation network comprising a plurality of connecting lines, and constructing a government affair data map ontology model, wherein the words related to the ontology model are element information of the government service and comprise handling materials, handling organs, parties and rule labels, and the connecting lines of the relation network are the reference relations among the elements of the government service;
2) importing the government affair data map ontology model, and constructing a data supply and demand relationship map from the materials required for handling the government service matters and the materials produced by handling them;
3) setting a set of data elements to be combed, and calculating the sequence dependency relationship graph between the data elements in the set and the affair handling matters in the supply and demand relationship map;
4) obtaining the sequence dependency relationship graph of the data elements from the sequence dependency relationship graph of the affair handling matters;
5) generating data quality rules from the sequence dependency relationship graph of the data elements;
wherein step 3) specifically calculates, for every data element whose representation word is "date", the affair handling matter that generates the corresponding data, and comprises the steps of:
denoting the set of all data elements whose representation word is "date" as A, and the set of all affair handling matters in the supply and demand relationship map as X;
traversing all data elements Ai in the set A, and removing the words "date" and "time" from the name of Ai to obtain a word Bi, wherein the set of all Bi is B, and recording the correspondence between Ai and Bi;
traversing all matters Xi in the set X, and concatenating the name, description, output material names and output material descriptions of Xi to form a text string Yi, wherein the set of all text strings is Y, and recording the correspondence between Xi and Yi;
traversing all Bi in B, and calculating the correlation Rj between each Bi and every Yj in the set Y;
finding the maximum of all Rj, wherein the maximum of Rj corresponds to Yk, and obtaining the Xk corresponding to Yk from the correspondence between X and Y;
after the traversal is finished, the matter Xk corresponding to each Bi has been obtained, and the matter Xk that generates the data of each data element Ai is obtained from the correspondence between Bi and Ai.
2. The method for combing data quality rules based on an affair handling data supply and demand map according to claim 1, wherein step 4) is specifically as follows:
a) traversing the set A, looking up the corresponding Xk in the data element index library for each Ai, and marking it with a to-be-sorted mark S;
b) starting from the root Y0, setting Y0 as the current node C;
c) traversing all downstream nodes Yi of the node C;
d) when the node Yi does not carry the S mark, traversing the child nodes Sj of the node Yi, and when Sj is not a child node of Yi, setting Sj as a child node of Yi;
e) deleting the parent-child relationships between the node Sj and all of its parent nodes and child nodes, and deleting the node Sj;
f) when the node Yi carries the S mark, setting Yi as the current node C and returning to step c), thereby forming the relationship graph with redundant nodes removed;
g) repeating steps b) and c) on the relationship graph with redundant nodes removed, and generating the inequality Yi > C, thereby obtaining the sequence dependency expressions among the nodes.
3. The method for combing data quality rules based on an affair handling data supply and demand map according to claim 1 or 2, wherein the methods for calculating the correlation Rj between each Bi and every Yj in the set Y include the TF-IDF algorithm and the BM25 algorithm.
CN202011575584.7A 2020-12-28 2020-12-28 Method for combing data quality rules based on affair handling data supply and demand maps Active CN112286926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011575584.7A CN112286926B (en) 2020-12-28 2020-12-28 Method for combing data quality rules based on affair handling data supply and demand maps

Publications (2)

Publication Number Publication Date
CN112286926A (en) 2021-01-29
CN112286926B (en) 2021-03-30

Family

ID=74426401

Country Status (1)

Country Link
CN (1) CN112286926B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679221A (en) * 2017-10-19 2018-02-09 武汉大学 Towards the time-space data acquisition and Services Composition scheme generation method of mitigation task
CN109214969A (en) * 2017-06-30 2019-01-15 勤智数码科技股份有限公司 A kind of data combing system and method
CN111192012A (en) * 2019-12-27 2020-05-22 腾讯云计算(北京)有限责任公司 Item processing method, item processing device, server and storage medium
CN111694963A (en) * 2020-05-11 2020-09-22 电子科技大学 Key government affair flow identification method and device based on item association network

Also Published As

Publication number Publication date
CN112286926A (en) 2021-01-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant